Project Problem and Hypothesis


Overall, I am interested in using comprehensive molecular and clinical data to make predictions about patient treatment and prognosis for diseases such as cancer. Advances in technology have increased not only the ability to process large amounts of data but also, in biomedicine, the speed at which data can be gathered and the types of data that can be collected at low cost. High-throughput sequencing technologies allow researchers to extract information on expression across entire genomes. For example, gene expression profiling by DNA microarray technology has enabled the classification of different cancers into distinct subgroups, providing insight into areas including cancer cell origin and tumor environment status[1-2]. For breast cancer alone, at least five molecular subgroups have been identified (basal-like, normal-like, HER2-positive, luminal A, and luminal B). Understanding the subgroups of a disease based on molecular signature is important in part because it suggests that therapeutic treatments could be developed to target each subtype specifically.

Breast cancer is the second most common cancer in women after skin cancer. It is estimated that in 2016 there will be 246,600 new cases of breast cancer in the US, accounting for 14.6% of all new cancer cases and 6.8% of cancer deaths. I would like to perform a proof-of-concept study to determine which genomic, histological, and clinical factors predict breast cancer type and prognosis. I will use logistic regression and KMeans clustering of molecular and clinical data on ~1000 breast cancer patients to make the following predictions:

  • Breast cancer subtype, tumor site of origin, and age based on mRNA genetic signature/pattern
  • Survival based on mRNA signature and clinical data
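
As a rough illustration of the clustering step, the sketch below runs a plain k-means on a tiny made-up expression matrix (one tuple per sample, one value per gene). The sample values, the choice of k=2, and the helper name `kmeans` are illustrative assumptions; the actual analysis would more likely rely on a library implementation.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the samples
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical log-scale expression of two genes in six samples.
expression = [(8.1, 1.2), (7.9, 1.0), (8.3, 1.4),
              (2.0, 6.8), (2.2, 7.1), (1.8, 6.9)]
labels, centroids = kmeans(expression, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping of samples is meaningful, so any comparison with known subtypes has to be made on group membership.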

I hypothesize that mRNA expression patterns differ between breast cancer histological subtypes.
I hypothesize that mRNA expression patterns differ between breast cancers according to the tumor site of origin.
I hypothesize that mRNA expression differs with age at tumor onset.
Last, I hypothesize that survival can be predicted from a combination of clinical and mRNA expression features.
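
One simple way to probe the differential-expression hypotheses is a per-gene two-sample comparison between groups. The sketch below computes Welch's t-statistic for a single gene; the expression values and group names are made up for illustration, and a real analysis across thousands of genes would also need multiple-testing correction.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical log2 expression of one gene in two histological subtypes.
ductal = [5.1, 4.8, 5.4, 5.0]
lobular = [3.2, 3.6, 3.1, 3.4]
t_stat = welch_t(ductal, lobular)
print(round(t_stat, 2))
```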

  1. Eisen, M.B. and Brown, P.O. DNA arrays for analysis of gene expression. Methods in Enzymology. http://www.sciencedirect.com/science/article/pii/S0076687999030141
  2. Perou, C.M. et al. Molecular portraits of human breast tumours. Nature. http://www.nature.com/nature/journal/v406/n6797/full/406747a0.html
  3. Winslow, S. et al. Prognostic stromal gene signatures in breast cancer. Breast Cancer Research. https://breast-cancer-research.biomedcentral.com/articles/10.1186/s13058-015-0530-2
  4. Sorlie, T. et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. PNAS. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC166244/

Datasets


The Cancer Genome Atlas (TCGA) is a National Institutes of Health project, led by the National Cancer Institute and the National Human Genome Research Institute, that uses high-throughput genomics technologies to characterize the genetic landscape of human cancers. Its aim is to understand the genetic bases of these cancers, with the ultimate goal of improving diagnosis, treatment, and prevention. Over two petabytes of data spanning 33 tumor types, including 10 rare cancers, have been gathered from over 11,000 eligible volunteers. The techniques used to gather molecular profiles from participant biospecimens include gene expression profiling, copy number variation profiling, SNP genotyping, genome-wide DNA methylation profiling, microRNA profiling, and exon sequencing of at least 1,200 genes.

This dataset consists of several tables:

In [25]:
import datalab.bigquery as bq

# List every table in the dataset together with its row count.
d = bq.Dataset('isb-cgc:tcga_201607_beta')
for t in d.tables():
    print '%10d rows   %s' % (t.metadata.rows, t.name.table_id)
      6322 rows   Annotations
     23797 rows   Biospecimen_data
     11160 rows   Clinical_data
   2646095 rows   Copy_Number_segments
3944304319 rows   DNA_Methylation_betas
 382335670 rows   DNA_Methylation_chr1
 197519895 rows   DNA_Methylation_chr10
 235823572 rows   DNA_Methylation_chr11
 198050739 rows   DNA_Methylation_chr12
  97301675 rows   DNA_Methylation_chr13
 123239379 rows   DNA_Methylation_chr14
 124566185 rows   DNA_Methylation_chr15
 179772812 rows   DNA_Methylation_chr16
 234003341 rows   DNA_Methylation_chr17
  50216619 rows   DNA_Methylation_chr18
 211386795 rows   DNA_Methylation_chr19
 279668485 rows   DNA_Methylation_chr2
  86858120 rows   DNA_Methylation_chr20
  35410447 rows   DNA_Methylation_chr21
  70676468 rows   DNA_Methylation_chr22
 201119616 rows   DNA_Methylation_chr3
 159148744 rows   DNA_Methylation_chr4
 195864180 rows   DNA_Methylation_chr5
 290275524 rows   DNA_Methylation_chr6
 240010275 rows   DNA_Methylation_chr7
 164810092 rows   DNA_Methylation_chr8
  81260723 rows   DNA_Methylation_chr9
  98082681 rows   DNA_Methylation_chrX
   2330426 rows   DNA_Methylation_chrY
   1867233 rows   Protein_RPPA_data
   5356089 rows   Somatic_Mutation_calls
   5738048 rows   mRNA_BCGSC_GA_RPKM
  38299138 rows   mRNA_BCGSC_HiSeq_RPKM
  44037186 rows   mRNA_BCGSC_RPKM
  16794358 rows   mRNA_UNC_GA_RSEM
 211284521 rows   mRNA_UNC_HiSeq_RSEM
 228078879 rows   mRNA_UNC_RSEM
  11997545 rows   miRNA_BCGSC_GA_isoform
   4503046 rows   miRNA_BCGSC_GA_mirna
  90237323 rows   miRNA_BCGSC_HiSeq_isoform
  28207741 rows   miRNA_BCGSC_HiSeq_mirna
 102234868 rows   miRNA_BCGSC_isoform
  32710787 rows   miRNA_BCGSC_mirna
  26763022 rows   miRNA_Expression

I will be focusing on the following tables:

Table Name            Info
Annotations           annotations and related information
Biospecimen_data      (sample-centric table) contains one row of information for each sample
Clinical_data         (patient-centric table) contains one row of information for each TCGA patient
mRNA_UNC_HiSeq_RSEM   contains gene expression data from samples assayed on the Illumina HiSeq platform and processed through the UNC “RNASeqV2” RSEM pipeline. Each row in this table contains the RSEM expression estimate for a single gene in a single aliquot.

Here's a peek into each of the four tables:

Annotations:

In [26]:
bq.Table('isb-cgc:tcga_201607_beta.Annotations').schema
Out[26]:
In [27]:
bq.Table('isb-cgc:tcga_201607_beta.Annotations').sample(count=2)
Out[27]:
Field                     Row 1                                          Row 2
annotationId              711                                            713
annotationCategoryId      1                                              1
annotationCategoryName    Tumor tissue origin incorrect                  Tumor tissue origin incorrect
annotationClassification  Redaction                                      Redaction
annotationNoteText        [intgen.org]: Case was of non-ovarian origin   [intgen.org]: Case was of non-ovarian origin
Study                     GBM                                            OV
itemTypeName              Patient                                        Patient
itemBarcode               TCGA-01-0629                                   TCGA-13-1479
AliquotBarcode            (empty)                                        (empty)
ParticipantBarcode        TCGA-01-0629                                   TCGA-13-1479
SampleBarcode             (empty)                                        (empty)
dateAdded                 2010-09-02T00:00:00-04:00                      2010-09-02T00:00:00-04:00
dateCreated               2010-09-02T00:00:00-04:00                      2010-09-02T00:00:00-04:00
dateEdited                (empty)                                        (empty)

(rows: 2, time: 0.3s, cached, job: job_x-npL55LbsPWRiB94uaX0YelIe4)

Biospecimen Data:

In [28]:
bq.Table('isb-cgc:tcga_201607_beta.Biospecimen_data').schema
Out[28]:
In [29]:
bq.Table('isb-cgc:tcga_201607_beta.Biospecimen_data').sample(count=2)
Out[29]:
Field                       Row 1                                  Row 2
ParticipantBarcode          TCGA-LP-A4AV                           TCGA-LP-A4AX
SampleBarcode               TCGA-LP-A4AV-10A                       TCGA-LP-A4AX-10A
SampleTypeLetterCode        NB                                     NB
SampleType                  Blood Derived Normal                   Blood Derived Normal
Study                       CESC                                   CESC
Project                     TCGA                                   TCGA
SampleTypeCode              10                                     10
batch_number                256                                    256
bcr                         Nationwide Children's Hospital         Nationwide Children's Hospital
days_to_collection          413.0                                  263.0
days_to_sample_procurement  364.0                                  2.0
is_ffpe                     NO                                     NO
num_portions                1                                      1
num_slides                  0                                      0
SampleUUID                  EC08F8EE-DA6A-4CFA-AEE9-07754224FC58   AA734CF9-0C3C-4CE2-9C1E-3E3F9866C6D1

Empty in both rows: the avg_, max_, and min_ variants of percent_lymphocyte_infiltration, percent_monocyte_infiltration, percent_necrosis, percent_neutrophil_infiltration, percent_normal_cells, percent_stromal_cells, percent_tumor_cells, and percent_tumor_nuclei.

(rows: 2, time: 0.2s, cached, job: job_whovtix7IItkwqG1vSnrnb8i_6E)

Clinical Data:

In [30]:
bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').schema
Out[30]:
In [31]:
bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').sample(count=2)
Out[31]:
Field                                 Row 1                                  Row 2
ParticipantBarcode                    TCGA-AB-2959                           TCGA-AB-2957
Study                                 LAML                                   LAML
Project                               TCGA                                   TCGA
ParticipantUUID                       31ac4afd-8195-479a-97da-92edaa89a85e   c887d68c-da8e-4f15-a524-3b6674b145ed
TSSCode                               AB                                     AB
age_at_initial_pathologic_diagnosis   71                                     31
batch_number                          25                                     25
bcr                                   Washington University                  Washington University
vital_status                          Dead                                   Alive
days_to_birth                         -26114                                 -11626
days_to_death                         489                                    (empty)
days_to_last_known_alive              489                                    609
days_to_last_followup                 (empty)                                609
days_to_initial_pathologic_diagnosis  0                                      0
ethnicity                             NOT HISPANIC OR LATINO                 NOT HISPANIC OR LATINO
gender                                MALE                                   MALE
history_of_neoadjuvant_treatment      No                                     Yes
year_of_initial_pathologic_diagnosis  2007                                   2007
race                                  WHITE                                  WHITE
tumor_tissue_site                     Bone Marrow                            Bone Marrow

Empty in both rows: anatomic_neoplasm_subdivision, clinical_M, clinical_N, clinical_T, clinical_stage, colorectal_cancer, country, days_to_submitted_specimen_dx, gleason_score_combined, histological_type, history_of_colon_polyps, hpv_calls, hpv_status, icd_10, icd_o_3_histology, icd_o_3_site, lymphatic_invasion, lymphnodes_examined, lymphovascular_invasion_present, menopause_status, mononucleotide_and_dinucleotide_marker_panel_analysis_status, neoplasm_histologic_grade, new_tumor_event_after_initial_treatment, number_of_lymphnodes_examined, number_of_lymphnodes_positive_by_he, number_pack_years_smoked, pathologic_M, pathologic_N, pathologic_T, pathologic_stage, person_neoplasm_cancer_status, pregnancies, primary_neoplasm_melanoma_dx, primary_therapy_outcome_success, psa_value, residual_tumor, tobacco_smoking_history, tumor_type, weight, height, BMI, age_began_smoking_in_years, h_pylori_infection, other_dx, other_malignancy_anatomic_site, other_malignancy_histological_type, other_malignancy_malignancy_type, stopped_smoking_year, venous_invasion, year_of_tobacco_smoking_onset.

(rows: 2, time: 0.2s, cached, job: job_4VTDt58ZS8d2xs3wuraVobAfY-4)

mRNA Expression data:

In [32]:
bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').schema
Out[32]:
In [33]:
bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').sample(count=2)
Out[33]:
Field                 Row 1                          Row 2
ParticipantBarcode    TCGA-04-1348                   TCGA-04-1348
SampleBarcode         TCGA-04-1348-01A               TCGA-04-1348-01A
AliquotBarcode        TCGA-04-1348-01A-01R-1565-13   TCGA-04-1348-01A-01R-1565-13
Study                 OV                             OV
SampleTypeLetterCode  TP                             TP
Platform              IlluminaHiSeq                  IlluminaHiSeq
original_gene_symbol  AADACL2                        ABCA5
HGNC_gene_symbol      AADACL2                        ABCA5
gene_id               344752                         23461
normalized_count      0.0                            62.6513

(rows: 2, time: 0.2s, cached, job: job_rT8Jr7cKx_p6yJ8020HO1D5QYnA)
In [34]:
print bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').schema
[{u'type': u'STRING', u'name': u'ParticipantBarcode'}, {u'type': u'STRING', u'name': u'SampleBarcode'}, {u'type': u'STRING', u'name': u'AliquotBarcode'}, {u'type': u'STRING', u'name': u'Study'}, {u'type': u'STRING', u'name': u'SampleTypeLetterCode'}, {u'type': u'STRING', u'name': u'Platform'}, {u'type': u'STRING', u'name': u'original_gene_symbol'}, {u'type': u'STRING', u'name': u'HGNC_gene_symbol'}, {u'type': u'INTEGER', u'name': u'gene_id'}, {u'type': u'FLOAT', u'name': u'normalized_count'}]
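
Given the schema above, the BRCA subset of the expression table could be pulled with a query along these lines. This is a sketch only: it builds a legacy-SQL string using column names from the schema, the gene list is a hypothetical example, and how the string is executed (for instance through the `bq` module imported earlier) is left as an assumption.

```python
# Table and column names come from the schema printed above;
# the gene symbols are illustrative placeholders, not a curated panel.
TABLE = '[isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM]'
genes = ['ESR1', 'ERBB2', 'KRT5']

# Quote each symbol so it can be used inside an IN (...) clause.
gene_list = ', '.join("'%s'" % g for g in genes)

query = (
    'SELECT ParticipantBarcode, SampleBarcode, HGNC_gene_symbol, normalized_count '
    'FROM %s '
    "WHERE Study = 'BRCA' AND HGNC_gene_symbol IN (%s)"
) % (TABLE, gene_list)
print(query)
```

Restricting the query to one study and a short gene list keeps the result far smaller than the 211 million rows in the full table.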

Domain knowledge


While my PhD is in cancer biology, I studied melanoma (a type of skin cancer). I'm familiar with the proper resources to fill in gaps in knowledge (PubMed). I need to conduct additional research on how to normalize mRNA expression data.
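
A common starting point for RSEM-style counts, offered here as an assumption rather than the settled method for this project, is a log2(x + 1) transform followed by per-gene standardization across samples. The helper names and the count values below are made up for illustration.

```python
import math
import statistics


def log2_plus_one(counts):
    """Compress the dynamic range; the +1 keeps zero counts defined."""
    return [math.log2(c + 1.0) for c in counts]


def z_scores(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return [0.0 for _ in values]  # a constant gene carries no signal
    return [(v - mean) / stdev for v in values]


# Hypothetical normalized_count values for one gene across five samples.
raw_counts = [0.0, 62.6513, 310.2, 15.0, 1200.5]
logged = log2_plus_one(raw_counts)
standardized = z_scores(logged)
print([round(z, 3) for z in standardized])
```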

Resources:

Project Concerns


  • Given time, what's a reasonable, attainable goal?
  • How should I further narrow the scope of the project without losing the project's appeal (to me)?
  • Ensuring results are scientifically sound
    • trying to pinpoint the correct way to normalize the mRNA expression data
  • Finding the best references to which I can compare my results

Outcomes


I expect the outcomes to mirror previous research, and I would consider that a marker of success. Any deviations from published findings should at least make biological sense. If the project is a bust, most likely because of issues with the mRNA expression data, then I will see what can be uncovered from the clinical data alone. My target audience is other scientists and people interested in cancer research.
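
If the analysis falls back on clinical data alone, survival can still be summarized from columns in the Clinical_data table (`vital_status`, `days_to_death`, `days_to_last_known_alive`). The sketch below is a minimal Kaplan-Meier estimator on made-up records; treating `days_to_last_known_alive` as the censoring time for living patients is an assumption about how these fields should be used.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed death time.

    durations: follow-up time per patient, in days
    events: True if the patient died at that time, False if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1.0 - deaths / float(at_risk)
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical patients: (vital_status, days_to_death, days_to_last_known_alive)
patients = [
    ('Dead', 489, 489),
    ('Alive', None, 609),
    ('Dead', 1200, 1200),
    ('Alive', None, 300),
]
durations = [death if status == 'Dead' else alive
             for status, death, alive in patients]
events = [status == 'Dead' for status, _, _ in patients]
curve = kaplan_meier(durations, events)
print(curve)
```

Censored patients never lower the curve directly; they only shrink the risk set used at later death times.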


Early data exploration:

In [36]:
ds = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').to_dataframe()
In [37]:
ds.Study.unique()
Out[37]:
array(['LAML', 'HNSC', 'SKCM', 'COAD', 'READ', 'THYM', 'SARC', 'UCS',
       'PRAD', 'THCA', 'DLBC', 'ESCA', 'STAD', 'LIHC', 'CHOL', 'PAAD',
       'LUAD', 'LUSC', 'PCPG', 'MESO', 'OV', 'BRCA', 'CESC', 'UCEC',
       'TGCT', 'KIRP', 'KIRC', 'KICH', 'BLCA', 'LGG', 'GBM', 'ACC', 'UVM'], dtype=object)
In [38]:
import pandas as pd
pd.options.display.max_columns = 100
In [39]:
ds[(ds.Study == 'BRCA')].histological_type.unique()
Out[39]:
array(['Infiltrating Ductal Carcinoma', 'Other, specify',
       'Mucinous Carcinoma', 'Infiltrating Lobular Carcinoma',
       'Mixed Histology (please specify)', 'Metaplastic Carcinoma',
       'Infiltrating Carcinoma NOS', 'Medullary Carcinoma'], dtype=object)
In [40]:
print ds[(ds.Study == 'BRCA')].pathologic_stage.unique()
['Stage IIA' 'Stage IIB' 'Stage I' 'Stage IIIA' 'Stage IA' 'Stage X'
 'Stage IV' 'Stage IIIB' 'Stage IIIC' 'Stage IB' None 'Stage III'
 'Stage II']
In [41]:
ds[(ds.Study == 'BRCA')].count()
Out[41]:
ParticipantBarcode                      1097
Study                                   1097
Project                                 1097
ParticipantUUID                         1097
TSSCode                                 1097
age_at_initial_pathologic_diagnosis     1096
anatomic_neoplasm_subdivision           1097
batch_number                            1097
bcr                                     1097
clinical_M                                 0
clinical_N                                 0
clinical_T                                 0
clinical_stage                             0
colorectal_cancer                          0
country                                  696
vital_status                            1097
days_to_birth                           1081
days_to_death                            149
days_to_last_known_alive                1096
days_to_last_followup                    947
days_to_initial_pathologic_diagnosis    1096
days_to_submitted_specimen_dx              0
ethnicity                                923
gender                                  1097
gleason_score_combined                     0
histological_type                       1097
history_of_colon_polyps                    0
history_of_neoadjuvant_treatment        1097
hpv_calls                                  0
hpv_status                                 0
                                        ... 
number_of_lymphnodes_examined            971
number_of_lymphnodes_positive_by_he      929
number_pack_years_smoked                   0
year_of_initial_pathologic_diagnosis    1094
pathologic_M                            1097
pathologic_N                            1097
pathologic_T                            1097
pathologic_stage                        1086
person_neoplasm_cancer_status           1061
pregnancies                                0
primary_neoplasm_melanoma_dx               0
primary_therapy_outcome_success            0
psa_value                                  0
race                                    1002
residual_tumor                             0
tobacco_smoking_history                    0
tumor_tissue_site                       1097
tumor_type                                 0
weight                                     0
height                                     0
BMI                                        0
age_began_smoking_in_years                 0
h_pylori_infection                         0
other_dx                                1087
other_malignancy_anatomic_site            76
other_malignancy_histological_type        64
other_malignancy_malignancy_type          77
stopped_smoking_year                       0
venous_invasion                            0
year_of_tobacco_smoking_onset              0
dtype: int64
In [42]:
# Keep a subset of columns that are mostly populated for BRCA patients.
# Rows are not filtered here, so new_ds still covers every study.
new_ds = ds[[
    'ParticipantBarcode', 'Study', 'Project', 'ParticipantUUID', 'TSSCode',
    'age_at_initial_pathologic_diagnosis', 'anatomic_neoplasm_subdivision',
    'batch_number', 'bcr', 'vital_status', 'days_to_birth',
    'days_to_last_known_alive', 'days_to_initial_pathologic_diagnosis',
    'gender', 'histological_type', 'history_of_neoadjuvant_treatment',
    'year_of_initial_pathologic_diagnosis', 'pathologic_M', 'pathologic_N',
    'pathologic_T', 'pathologic_stage', 'person_neoplasm_cancer_status',
    'race', 'tumor_tissue_site', 'other_dx',
]]
In [43]:
new_ds.tumor_tissue_site.value_counts()
Out[43]:
Breast                                                     1098
Lung                                                       1026
Kidney                                                      941
Brain                                                       599
Ovary                                                       582
Head and Neck                                               565
Endometrial                                                 547
Central nervous system                                      515
Thyroid                                                     509
Prostate                                                    500
Colon                                                       460
Stomach                                                     445
Bladder                                                     412
Liver                                                       378
Cervical                                                    307
Bone Marrow                                                 199
Extremities                                                 196
Pancreas                                                    185
Esophagus                                                   185
Trunk                                                       172
Rectum                                                      167
Adrenal gland                                               147
Testes                                                      134
Thymus                                                       97
Adrenal                                                      94
Pleura                                                       87
Choroid                                                      80
Retroperitoneum/Upper abdominal - Retroperitoneum            74
Uterus                                                       57
Bile duct                                                    45
                                                           ... 
Bone                                                          4
Lower Extremity - Foot/ankle                                  4
Superficial Trunk - Buttock                                   4
Retroperitoneum/Upper abdominal - Colon                       4
Omentum                                                       3
Small intestine                                               3
Retroperitoneum/Upper abdominal - Small Intestines            3
Chest - Other (please specify                                 2
Retroperitoneum/Upper abdominal - Other (please specify       2
Peritoneum ovary                                              2
Superficial Trunk - Abdominal wall                            2
Head and Neck - Other (please specify                         2
Chest - Lung/pleura                                           2
Lower Extremity - Groin                                       2
Retroperitoneum/Upper abdominal - Gastric                     2
Primary Tumor                                                 2
Head and Neck - Head                                          2
Lower abdominal/Pelvic - Spermatic Cord                       2
Lower abdominal/Pelvic - Other (please specify                2
Ascites/Peritoneum                                            2
Retroperitoneum/Upper abdominal - Pancreas                    1
Parotid Gland                                                 1
Soft Tissue (muscle  ligaments  subcutaneous)                 1
Head and Neck - Neck                                          1
Gynecological - Ovary                                         1
Superficial Trunk - Flank                                     1
Lower abdominal/Pelvic - Bladder                              1
Chest - Breast                                                1
Other Extranodal Site                                         1
Chest - Mediastinum                                           1
Name: tumor_tissue_site, dtype: int64
In [44]:
new_ds.age_at_initial_pathologic_diagnosis.describe()
Out[44]:
count    11109.000000
mean        59.099019
std         14.415440
min         10.000000
25%         50.000000
50%         60.000000
75%         70.000000
max         90.000000
Name: age_at_initial_pathologic_diagnosis, dtype: float64
In [45]:
new_ds.age_at_initial_pathologic_diagnosis.plot.hist()
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f49c325be10>