Project Problem and Hypothesis
Overall, I am interested in using comprehensive molecular and clinical data to make predictions about patient treatment and prognosis for diseases such as cancer. Advances in technology have improved not only our ability to process large amounts of data but also, in biomedicine, the speed at which data can be gathered and the types of data that can be collected at low cost. High-throughput sequencing technologies allow researchers to extract information on the expression of entire genomes. For example, gene expression profiling by DNA microarray technology has enabled the classification of cancers into distinct subgroups, providing insight into areas such as cancer cell origin and tumor environment status[1-2]. For breast cancer alone, at least five molecular subgroups have been identified (basal-like, normal-like, HER2-positive, luminal A, and luminal B). Understanding the subgroups of a disease based on its molecular signature is important in part because it suggests that therapies could be developed to specifically target each subtype.

Breast cancer is the second most common cancer in women, after skin cancer. An estimated 246,600 new cases of breast cancer will be diagnosed in the US in 2016, accounting for 14.6% of all new cancer cases and 6.8% of all cancer deaths.

I would like to perform a proof-of-concept study to determine which genomic, histological, and clinical factors predict breast cancer type and prognosis. I will use logistic regression and KMeans clustering on molecular and clinical data from ~1000 breast cancer patients to make the following predictions:
- Breast cancer subtype, tumor site of origin, and age based on mRNA genetic signature/pattern
- Survival based on mRNA signature and clinical data
I hypothesize that mRNA expression patterns differ between breast cancer histological subtypes.
I hypothesize that mRNA expression patterns differ between breast cancers based on the tumor site of origin.
I hypothesize that mRNA expression differs based on age at tumor onset.
Last, I hypothesize that survival can be predicted from a combination of clinical and mRNA expression features.
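To make the modeling plan concrete, here is a minimal sketch of the two techniques named above (logistic regression and KMeans) applied to a samples-by-genes matrix. Everything in it is an assumption for illustration: the data are synthetic random numbers rather than TCGA measurements, the two "subtypes" are made up, and scikit-learn is simply one convenient library choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a samples-by-genes expression matrix (NOT TCGA data):
# two made-up "subtypes" that differ only in the first 5 of 50 genes.
rng = np.random.RandomState(0)
n_per_group, n_genes = 100, 50
group_a = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
group_b = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
group_b[:, :5] += 3.0
X = np.vstack([group_a, group_b])
y = np.array([0] * n_per_group + [1] * n_per_group)

# Supervised: hold out a stratified test set, then fit logistic regression.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)  # scaling learned from training data only
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)
test_accuracy = clf.score(scaler.transform(X_test), y_test)

# Unsupervised: KMeans sees the expression values but never the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(StandardScaler().fit_transform(X))

# Cluster ids are arbitrary, so score agreement under the better of the
# two possible cluster-to-subtype assignments.
raw_agreement = np.mean(cluster_labels == y)
cluster_agreement = max(raw_agreement, 1.0 - raw_agreement)

print('held-out accuracy: %.2f' % test_accuracy)
print('cluster/label agreement: %.2f' % cluster_agreement)
```

On real data the held-out score matters more than the training fit, and with thousands of genes and only ~1000 patients some form of regularization or feature selection would probably be needed.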
- Eisen, M.B. and Patrick O. Brown. DNA arrays for analysis of gene expression. Methods in Enzymology http://www.sciencedirect.com/science/article/pii/S0076687999030141
- Perou, C.M. et al. Molecular portraits of human breast tumours. Nature http://www.nature.com/nature/journal/v406/n6797/full/406747a0.html
- Winslow, S. et al. Prognostic stromal gene signatures in breast cancer. Breast Cancer Research https://breast-cancer-research.biomedcentral.com/articles/10.1186/s13058-015-0530-2
- Sorlie, T. Repeated observation of breast tumor subtypes in independent gene expression data sets. PNAS https://www.ncbi.nlm.nih.gov/pmc/articles/PMC166244/
Datasets
The Cancer Genome Atlas (TCGA) is a National Institutes of Health and National Human Genome Research Institute project that uses high-throughput genomics technologies to characterize the genetic landscape of human cancers and understand their genetic bases, with the ultimate goal of improving diagnosis, treatment, and prevention. Over two petabytes of data spanning 33 tumor types and 10 rare cancers from over 11,000 eligible volunteers have been gathered. Techniques used to generate molecular profiles from participant biospecimens include gene expression profiling, copy number variation profiling, SNP genotyping, genome-wide DNA methylation profiling, microRNA profiling, and exon sequencing of at least 1,200 genes.
This dataset consists of several tables:
import datalab.bigquery as bq
d = bq.Dataset('isb-cgc:tcga_201607_beta')
for t in d.tables():
    print('%10d rows %s' % (t.metadata.rows, t.name.table_id))
I will be focusing on the following tables:
| Table Name | Info |
|---|---|
| Annotations | annotations and related information |
| Biospecimen_data | (sample-centric table) contains one row of information for each sample |
| Clinical_data | (patient-centric table) contains one row of information for each TCGA patient |
| mRNA_UNC_HiSeq_RSEM | contains gene expression data from samples assayed on the Illumina HiSeq platform and processed through the UNC “RNASeqV2” RSEM pipeline. Each row in this table contains the RSEM expression estimate for a single gene in a single aliquot. |
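Because the mRNA table stores one row per gene per aliquot, it would need to be reshaped into a samples-by-genes matrix before modeling. A minimal sketch with pandas follows. The column names `AliquotBarcode`, `HGNC_gene_symbol`, and `normalized_count` are assumptions that should be checked against the table schema, and the counts are made-up toy values.

```python
import pandas as pd

# Toy long-format rows: one row per (aliquot, gene) pair.
# Column names are assumed, and the values are invented for illustration.
long_df = pd.DataFrame({
    'AliquotBarcode':   ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'],
    'HGNC_gene_symbol': ['ESR1', 'ERBB2', 'ESR1', 'ERBB2', 'ESR1', 'ERBB2'],
    'normalized_count': [1200.0, 80.0, 15.0, 950.0, 700.0, 60.0],
})

# Reshape to wide format: one row per aliquot, one column per gene.
# aggfunc='mean' averages any duplicated (aliquot, gene) rows instead of failing.
expr = long_df.pivot_table(index='AliquotBarcode',
                           columns='HGNC_gene_symbol',
                           values='normalized_count',
                           aggfunc='mean')
print(expr)
```

For the full table, it may be cheaper to filter to the BRCA study in the BigQuery query itself and pivot only the result.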
Here's a peek into each of the four tables:
Annotations:
bq.Table('isb-cgc:tcga_201607_beta.Annotations').schema
bq.Table('isb-cgc:tcga_201607_beta.Annotations').sample(count=2)
Biospecimen Data:
bq.Table('isb-cgc:tcga_201607_beta.Biospecimen_data').schema
bq.Table('isb-cgc:tcga_201607_beta.Biospecimen_data').sample(count=2)
Clinical Data:
bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').schema
bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').sample(count=2)
mRNA Expression data:
bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').schema
bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').sample(count=2)
print(bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').schema)
Domain knowledge
While my PhD is in cancer biology, I studied melanoma (a type of skin cancer) rather than breast cancer. I am familiar with the resources needed to fill gaps in my knowledge (e.g. PubMed). I still need to research how to normalize mRNA expression data.
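As a starting point for the normalization question, one common convention for RNA-seq count estimates is a log2(x + 1) transform followed by per-gene standardization. The sketch below shows that convention only. It is not necessarily the right choice for this dataset, and the helper name `log_zscore`, the gene names, and the values are all hypothetical.

```python
import numpy as np
import pandas as pd

def log_zscore(expr):
    """log2(x + 1) transform, then z-score each gene (column) across samples."""
    logged = np.log2(expr + 1.0)
    centered = logged - logged.mean(axis=0)
    std = logged.std(axis=0, ddof=0)
    # A gene with zero variance carries no signal: map it to 0 rather than
    # dividing by zero. Note that fillna also zeroes values that were missing.
    std = std.replace(0.0, np.nan)
    return (centered / std).fillna(0.0)

# Made-up samples-by-genes matrix; GENE_B is constant across samples.
expr = pd.DataFrame(
    {'GENE_A': [0.0, 3.0, 15.0, 63.0], 'GENE_B': [7.0, 7.0, 7.0, 7.0]},
    index=['S1', 'S2', 'S3', 'S4'])
normalized = log_zscore(expr)
print(normalized)
```

If the data are later split into training and test sets, the per-gene mean and standard deviation would need to come from the training samples only, to avoid leaking information into the evaluation.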
Resources:
- Tutorial: Machine Learning For Cancer Classification - Part 1 - Preparing The Data Sets https://www.biostars.org/p/85124/
- Tutorial: Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier https://www.biostars.org/p/86981/
- Tutorial: Machine Learning For Cancer Classification - Part 3 - Predicting With A Random Forest Classifier https://www.biostars.org/p/87110/
- Tutorial: Machine Learning For Cancer Classification - Part 4 - Plotting A Kaplan-Meier Curve For Survival Analysis https://www.biostars.org/p/87580/
- Seq-ing improved gene expression estimates from microarrays using machine learning https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0712-z
Project Concerns
- Given time, what's a reasonable, attainable goal?
- How should I further narrow the scope of the project without losing the project's appeal (to me)?
- Ensuring results are scientifically sound
- Pinpointing the correct way to normalize the mRNA expression data
- Finding the best references to which I can compare my results
Outcomes
I expect the outcomes to mirror previous research, which I would consider a marker of success. Any deviations from what has been published should at least make biological sense. If the project is a bust, which I believe would most likely be due to issues with the mRNA expression data, then I will see what can be uncovered solely from the clinical data. My target audience is other scientists and people interested in cancer research.
Early data exploration:
bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').sample(count=2)
ds = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').to_dataframe()
ds.Study.unique()
import pandas as pd
pd.options.display.max_columns = 100
ds[(ds.Study == 'BRCA')].histological_type.unique()
print(ds[(ds.Study == 'BRCA')].pathologic_stage.unique())
ds[(ds.Study == 'BRCA')].count()
new_ds = ds[['ParticipantBarcode', 'Study', 'Project', 'ParticipantUUID', 'TSSCode', 'age_at_initial_pathologic_diagnosis', 'anatomic_neoplasm_subdivision', 'batch_number',
'bcr','vital_status', 'days_to_birth', 'days_to_last_known_alive', 'days_to_initial_pathologic_diagnosis', 'gender', 'histological_type',
'history_of_neoadjuvant_treatment', 'year_of_initial_pathologic_diagnosis', 'pathologic_M', 'pathologic_N', 'pathologic_T', 'pathologic_stage',
'person_neoplasm_cancer_status', 'race', 'tumor_tissue_site', 'other_dx'
]]
new_ds.tumor_tissue_site.value_counts()
new_ds.age_at_initial_pathologic_diagnosis.describe()
new_ds.age_at_initial_pathologic_diagnosis.plot.hist()
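Looking ahead to the survival hypothesis, a Kaplan-Meier estimate is a standard first look at time-to-event data. The sketch below is a plain-Python version run on made-up numbers. The function name `kaplan_meier` is hypothetical, and how follow-up time and death events should be derived from columns such as `days_to_last_known_alive` and `vital_status` is an assumption I still need to verify.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time a death occurs.

    durations: follow-up time per patient (e.g. in days)
    events:    True if death was observed at that time, False if censored
    """
    records = sorted(zip(durations, events))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(records) and records[i][0] == t:
            if records[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / float(at_risk)
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= removed
    return curve

# Made-up follow-up times (days) and event flags, for illustration only.
days = [100, 200, 200, 300, 400, 500]
died = [True, False, True, True, False, True]
curve = kaplan_meier(days, died)
for t, s in curve:
    print('day %d: S(t) = %.3f' % (t, s))
```

Rows with missing follow-up times would have to be dropped before calling this, and a dedicated survival analysis library would be the better choice once confidence intervals or group comparisons are needed.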