Project Problem and Hypothesis

Overall, I am interested in using comprehensive molecular and clinical data to make predictions about patient treatment and prognosis for diseases such as cancer. Advances in technology have increased not only our ability to process large amounts of data but also, in biomedicine, the speed at which data can be gathered and the types of data that can be collected at low cost. High-throughput profiling technologies allow researchers to measure gene expression across entire genomes. For example, gene expression profiling by DNA microarray technology has enabled the classification of cancers into distinct subgroups, providing insight into areas such as cancer cell origin and tumor environment status[1-2]. For breast cancer alone, at least five molecular subgroups have been identified (basal-like, normal-like, HER2-positive, luminal A, and luminal B). Understanding the subgroups of a disease based on its molecular signature is important in part because it suggests that therapies could be developed to target each subtype specifically.

Breast cancer is the second most common cancer in women after skin cancer. It is estimated that in 2016 there will be 246,600 new cases of breast cancer in the US, accounting for 14.6% of all new cancer cases and 6.8% of all cancer deaths.

I would like to perform a proof-of-concept study to determine which genomic, histological, and clinical factors predict breast cancer type and prognosis. I will apply logistic regression and KMeans clustering to molecular and clinical data from ~1000 breast cancer patients to make the following predictions:

Breast cancer subtype, tumor site of origin, and age based on mRNA genetic signature/pattern
Survival based on mRNA signature and clinical data

I hypothesize that mRNA expression patterns differ among breast cancer histological subtypes.
I hypothesize that mRNA expression patterns differ among breast cancers based on the tumor site of origin.
I hypothesize that mRNA expression differs based on age of tumor onset.
Last, I hypothesize that survival can be predicted from a combination of clinical and mRNA expression features.
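
As a rough illustration of the two planned techniques, the sketch below runs logistic regression and KMeans on a small synthetic expression matrix. The data, the two made-up subtypes, and every variable name are assumptions for illustration only; nothing here uses TCGA values or represents the final pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in for a samples x genes matrix: 100 samples, 20 genes.
# Two hypothetical subtypes differ in the mean of the first 5 genes.
n_per_group, n_genes = 50, 20
group_a = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
group_b = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
group_b[:, :5] += 4.0
X = np.vstack([group_a, group_b])
y = np.array([0] * n_per_group + [1] * n_per_group)

# Supervised: predict the (known) subtype label from expression.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_train)  # fit scaling on training data only
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
accuracy = clf.score(scaler.transform(X_test), y_test)

# Unsupervised: check whether clusters recover the same two groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
ari = adjusted_rand_score(y, clusters)

print('held-out accuracy: %.2f, adjusted Rand index: %.2f' % (accuracy, ari))
```

On real expression data the groups will be far less cleanly separated than in this toy setup, so the same two metrics would serve mainly as a baseline to compare against published subtype assignments.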

1. Eisen, M.B. and Brown, P.O. DNA arrays for analysis of gene expression. Methods in Enzymology http://www.sciencedirect.com/science/article/pii/S0076687999030141
2. Perou, C.M. et al. Molecular portraits of human breast tumours. Nature http://www.nature.com/nature/journal/v406/n6797/full/406747a0.html
3. Winslow, S. et al. Prognostic stromal gene signatures in breast cancer. Breast Cancer Research https://breast-cancer-research.biomedcentral.com/articles/10.1186/s13058-015-0530-2
4. Sorlie, T. et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. PNAS https://www.ncbi.nlm.nih.gov/pmc/articles/PMC166244/

Datasets

The Cancer Genome Atlas (TCGA) is a National Institutes of Health and National Human Genome Research Institute project that uses high-throughput genomics technologies to characterize the genetic landscape of human cancers. Its aim is to understand the genetic bases of these cancers, with the ultimate goal of improving diagnosis, treatment, and prevention. Over two petabytes of data spanning 33 tumor types and 10 rare cancers from over 11,000 eligible volunteers have been gathered. Techniques used to build molecular profiles from participant biospecimens include gene expression profiling, copy number variation profiling, SNP genotyping, genome-wide DNA methylation profiling, microRNA profiling, and exon sequencing of at least 1,200 genes.

This dataset consists of several tables:

In [24]:

import datalab.bigquery as bq

In [25]:

import datalab.bigquery as bq

d = bq.Dataset('isb-cgc:tcga_201607_beta')
# List every table in the dataset with its row count.
for t in d.tables():
  print('%10d rows   %s' % (t.metadata.rows, t.name.table_id))

      6322 rows   Annotations
     23797 rows   Biospecimen_data
     11160 rows   Clinical_data
   2646095 rows   Copy_Number_segments
3944304319 rows   DNA_Methylation_betas
 382335670 rows   DNA_Methylation_chr1
 197519895 rows   DNA_Methylation_chr10
 235823572 rows   DNA_Methylation_chr11
 198050739 rows   DNA_Methylation_chr12
  97301675 rows   DNA_Methylation_chr13
 123239379 rows   DNA_Methylation_chr14
 124566185 rows   DNA_Methylation_chr15
 179772812 rows   DNA_Methylation_chr16
 234003341 rows   DNA_Methylation_chr17
  50216619 rows   DNA_Methylation_chr18
 211386795 rows   DNA_Methylation_chr19
 279668485 rows   DNA_Methylation_chr2
  86858120 rows   DNA_Methylation_chr20
  35410447 rows   DNA_Methylation_chr21
  70676468 rows   DNA_Methylation_chr22
 201119616 rows   DNA_Methylation_chr3
 159148744 rows   DNA_Methylation_chr4
 195864180 rows   DNA_Methylation_chr5
 290275524 rows   DNA_Methylation_chr6
 240010275 rows   DNA_Methylation_chr7
 164810092 rows   DNA_Methylation_chr8
  81260723 rows   DNA_Methylation_chr9
  98082681 rows   DNA_Methylation_chrX
   2330426 rows   DNA_Methylation_chrY
   1867233 rows   Protein_RPPA_data
   5356089 rows   Somatic_Mutation_calls
   5738048 rows   mRNA_BCGSC_GA_RPKM
  38299138 rows   mRNA_BCGSC_HiSeq_RPKM
  44037186 rows   mRNA_BCGSC_RPKM
  16794358 rows   mRNA_UNC_GA_RSEM
 211284521 rows   mRNA_UNC_HiSeq_RSEM
 228078879 rows   mRNA_UNC_RSEM
  11997545 rows   miRNA_BCGSC_GA_isoform
   4503046 rows   miRNA_BCGSC_GA_mirna
  90237323 rows   miRNA_BCGSC_HiSeq_isoform
  28207741 rows   miRNA_BCGSC_HiSeq_mirna
 102234868 rows   miRNA_BCGSC_isoform
  32710787 rows   miRNA_BCGSC_mirna
  26763022 rows   miRNA_Expression

I will be focusing on the following tables:

Table Name	Info
Annotations	annotations and related information
Biospecimen_data	(sample-centric table) contains one row of information for each sample
Clinical_data	(patient-centric table) contains one row of information for each TCGA patient
mRNA_UNC_HiSeq_RSEM	contains gene expression data from samples assayed on the Illumina HiSeq platform and processed through the UNC “RNASeqV2” RSEM pipeline. Each row in this table contains the RSEM expression estimate for a single gene in a single aliquot.
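
Because the expression table stores one row per gene per aliquot, modeling will likely need it reshaped into a samples-by-genes matrix. The sketch below shows one way to do that with pandas on a few made-up rows that reuse the real column names (SampleBarcode, HGNC_gene_symbol, normalized_count). Averaging repeated measurements for the same sample and gene is an assumption, not a settled choice.

```python
import pandas as pd

# Toy rows in the same long layout as mRNA_UNC_HiSeq_RSEM (values are made up).
long_df = pd.DataFrame({
    'SampleBarcode': ['S1', 'S1', 'S2', 'S2'],
    'HGNC_gene_symbol': ['GENE_A', 'GENE_B', 'GENE_A', 'GENE_B'],
    'normalized_count': [0.0, 62.5, 10.0, 5.0],
})

# One row per sample, one column per gene. If a sample had several aliquots,
# their values for a gene would be averaged (an assumption for this sketch).
expr = long_df.pivot_table(index='SampleBarcode',
                           columns='HGNC_gene_symbol',
                           values='normalized_count',
                           aggfunc='mean')
print(expr)
```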

Here's a peek into each of the four tables:

Annotations:

In [26]:

bq.Table('isb-cgc:tcga_201607_beta.Annotations').schema

Out[26]:

In [27]:

bq.Table('isb-cgc:tcga_201607_beta.Annotations').sample(count=2)

Out[27]:

annotationId	annotationCategoryId	annotationCategoryName	annotationClassification	annotationNoteText	Study	itemTypeName	itemBarcode	AliquotBarcode	ParticipantBarcode	SampleBarcode	dateAdded	dateCreated	dateEdited
711	1	Tumor tissue origin incorrect	Redaction	[intgen.org]: Case was of non-ovarian origin	GBM	Patient	TCGA-01-0629		TCGA-01-0629		2010-09-02T00:00:00-04:00	2010-09-02T00:00:00-04:00
713	1	Tumor tissue origin incorrect	Redaction	[intgen.org]: Case was of non-ovarian origin	OV	Patient	TCGA-13-1479		TCGA-13-1479		2010-09-02T00:00:00-04:00	2010-09-02T00:00:00-04:00

(rows: 2, time: 0.3s, cached, job: job_x-npL55LbsPWRiB94uaX0YelIe4)

Biospecimen Data:

In [28]:

bq.Table('isb-cgc:tcga_201607_beta.Biospecimen_data').schema

Out[28]:

In [29]:

bq.Table('isb-cgc:tcga_201607_beta.Biospecimen_data').sample(count=2)

Out[29]:

ParticipantBarcode	SampleBarcode	SampleTypeLetterCode	SampleType	Study	Project	SampleTypeCode	avg_percent_lymphocyte_infiltration	avg_percent_monocyte_infiltration	avg_percent_necrosis	avg_percent_neutrophil_infiltration	avg_percent_normal_cells	avg_percent_stromal_cells	avg_percent_tumor_cells	avg_percent_tumor_nuclei	batch_number	bcr	days_to_collection	days_to_sample_procurement	is_ffpe	max_percent_lymphocyte_infiltration	max_percent_monocyte_infiltration	max_percent_necrosis	max_percent_neutrophil_infiltration	max_percent_normal_cells	max_percent_stromal_cells	max_percent_tumor_cells	max_percent_tumor_nuclei	min_percent_lymphocyte_infiltration	min_percent_monocyte_infiltration	min_percent_necrosis	min_percent_neutrophil_infiltration	min_percent_normal_cells	min_percent_stromal_cells	min_percent_tumor_cells	min_percent_tumor_nuclei	num_portions	num_slides	SampleUUID
TCGA-LP-A4AV	TCGA-LP-A4AV-10A	NB	Blood Derived Normal	CESC	TCGA	10									256	Nationwide Children's Hospital	413.0	364.0	NO																	1	0	EC08F8EE-DA6A-4CFA-AEE9-07754224FC58
TCGA-LP-A4AX	TCGA-LP-A4AX-10A	NB	Blood Derived Normal	CESC	TCGA	10									256	Nationwide Children's Hospital	263.0	2.0	NO																	1	0	AA734CF9-0C3C-4CE2-9C1E-3E3F9866C6D1

(rows: 2, time: 0.2s, cached, job: job_whovtix7IItkwqG1vSnrnb8i_6E)

Clinical Data:

In [30]:

bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').schema

Out[30]:

In [31]:

bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').sample(count=2)

Out[31]:

ParticipantBarcode	Study	Project	ParticipantUUID	TSSCode	age_at_initial_pathologic_diagnosis	anatomic_neoplasm_subdivision	batch_number	bcr	clinical_M	clinical_N	clinical_T	clinical_stage	colorectal_cancer	country	vital_status	days_to_birth	days_to_death	days_to_last_known_alive	days_to_last_followup	days_to_initial_pathologic_diagnosis	days_to_submitted_specimen_dx	ethnicity	gender	gleason_score_combined	histological_type	history_of_colon_polyps	history_of_neoadjuvant_treatment	hpv_calls	hpv_status	icd_10	icd_o_3_histology	icd_o_3_site	lymphatic_invasion	lymphnodes_examined	lymphovascular_invasion_present	menopause_status	mononucleotide_and_dinucleotide_marker_panel_analysis_status	neoplasm_histologic_grade	new_tumor_event_after_initial_treatment	number_of_lymphnodes_examined	number_of_lymphnodes_positive_by_he	number_pack_years_smoked	year_of_initial_pathologic_diagnosis	pathologic_M	pathologic_N	pathologic_T	pathologic_stage	person_neoplasm_cancer_status	pregnancies	primary_neoplasm_melanoma_dx	primary_therapy_outcome_success	psa_value	race	residual_tumor	tobacco_smoking_history	tumor_tissue_site	tumor_type	weight	height	BMI	age_began_smoking_in_years	h_pylori_infection	other_dx	other_malignancy_anatomic_site	other_malignancy_histological_type	other_malignancy_malignancy_type	stopped_smoking_year	venous_invasion	year_of_tobacco_smoking_onset
TCGA-AB-2959	LAML	TCGA	31ac4afd-8195-479a-97da-92edaa89a85e	AB	71		25	Washington University							Dead	-26114	489	489		0		NOT HISPANIC OR LATINO	MALE				No																2007										WHITE			Bone Marrow
TCGA-AB-2957	LAML	TCGA	c887d68c-da8e-4f15-a524-3b6674b145ed	AB	31		25	Washington University							Alive	-11626		609	609	0		NOT HISPANIC OR LATINO	MALE				Yes																2007										WHITE			Bone Marrow

(rows: 2, time: 0.2s, cached, job: job_4VTDt58ZS8d2xs3wuraVobAfY-4)

mRNA Expression data:

In [32]:

bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').schema

Out[32]:

In [33]:

bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').sample(count=2)

Out[33]:

ParticipantBarcode	SampleBarcode	AliquotBarcode	Study	SampleTypeLetterCode	Platform	original_gene_symbol	HGNC_gene_symbol	gene_id	normalized_count
TCGA-04-1348	TCGA-04-1348-01A	TCGA-04-1348-01A-01R-1565-13	OV	TP	IlluminaHiSeq	AADACL2	AADACL2	344752	0.0
TCGA-04-1348	TCGA-04-1348-01A	TCGA-04-1348-01A-01R-1565-13	OV	TP	IlluminaHiSeq	ABCA5	ABCA5	23461	62.6513

(rows: 2, time: 0.2s, cached, job: job_rT8Jr7cKx_p6yJ8020HO1D5QYnA)
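
The sample rows above suggest that the participant and sample barcodes are prefixes of the aliquot barcode, which is useful when joining the expression table to the clinical and biospecimen tables. The helper below is a hypothetical sketch based only on that observed pattern; `split_barcode` is not part of any TCGA or datalab API.

```python
def split_barcode(aliquot_barcode):
    """Split an aliquot barcode into (participant, sample, sample type code).

    Assumes the hyphen-separated layout seen in the sample rows, e.g.
    'TCGA-04-1348-01A-01R-1565-13'.
    """
    parts = aliquot_barcode.split('-')
    participant = '-'.join(parts[:3])   # e.g. 'TCGA-04-1348'
    sample = '-'.join(parts[:4])        # e.g. 'TCGA-04-1348-01A'
    sample_type_code = parts[3][:2]     # e.g. '01'; '10' appears for blood-derived normal
    return participant, sample, sample_type_code


print(split_barcode('TCGA-04-1348-01A-01R-1565-13'))
```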

In [34]:

print(bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM').schema)

[{u'type': u'STRING', u'name': u'ParticipantBarcode'}, {u'type': u'STRING', u'name': u'SampleBarcode'}, {u'type': u'STRING', u'name': u'AliquotBarcode'}, {u'type': u'STRING', u'name': u'Study'}, {u'type': u'STRING', u'name': u'SampleTypeLetterCode'}, {u'type': u'STRING', u'name': u'Platform'}, {u'type': u'STRING', u'name': u'original_gene_symbol'}, {u'type': u'STRING', u'name': u'HGNC_gene_symbol'}, {u'type': u'INTEGER', u'name': u'gene_id'}, {u'type': u'FLOAT', u'name': u'normalized_count'}]

Domain knowledge

My PhD is in cancer biology, but my research focused on melanoma (a type of skin cancer) rather than breast cancer. I know where to look to fill gaps in my knowledge (for example, PubMed). I still need to research how to normalize mRNA expression data.
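
One common convention for RSEM-style counts is a log2(x + 1) transform, which compresses the long right tail and keeps zero counts defined. The sketch below applies it to two values taken from the sample rows shown earlier. Whether this is the right normalization for this project is still an open question, and `log2_transform` and its pseudocount are assumptions for illustration.

```python
import math


def log2_transform(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for each value in counts."""
    return [math.log(c + pseudocount, 2) for c in counts]


# normalized_count values from the two sample rows above
print(log2_transform([0.0, 62.6513]))
```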

Resources:

Tutorial: Machine Learning For Cancer Classification - Part 1 - Preparing The Data Sets https://www.biostars.org/p/85124/
Tutorial: Machine Learning For Cancer Classification - Part 2 - Building A Random Forest Classifier https://www.biostars.org/p/86981/
Tutorial: Machine Learning For Cancer Classification - Part 3 - Predicting With A Random Forest Classifier https://www.biostars.org/p/87110/
Tutorial: Machine Learning For Cancer Classification - Part 4 - Plotting A Kaplan-Meier Curve For Survival Analysis https://www.biostars.org/p/87580/
Seq-ing improved gene expression estimates from microarrays using machine learning https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0712-z

Project Concerns

Given time, what's a reasonable, attainable goal?
How should I further narrow the scope of the project without losing the project's appeal (to me)?
Ensuring results are scientifically sound
- trying to pinpoint the correct way to normalize the mRNA expression data
Finding the best references to which I can compare my results

Outcomes

I expect the outcomes to mirror previous research, and I would consider that a marker of success. Any deviations from published findings should at least make biological sense. If the project is a bust, which I believe would most likely be due to issues with the mRNA expression data, then I will see what can be uncovered from the clinical data alone. My target audience is other scientists and people interested in cancer research.

Early data exploration:

In [35]:

bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').sample(count=2)

Out[35]:

ParticipantBarcode	Study	Project	ParticipantUUID	TSSCode	age_at_initial_pathologic_diagnosis	anatomic_neoplasm_subdivision	batch_number	bcr	clinical_M	clinical_N	clinical_T	clinical_stage	colorectal_cancer	country	vital_status	days_to_birth	days_to_death	days_to_last_known_alive	days_to_last_followup	days_to_initial_pathologic_diagnosis	days_to_submitted_specimen_dx	ethnicity	gender	gleason_score_combined	histological_type	history_of_colon_polyps	history_of_neoadjuvant_treatment	hpv_calls	hpv_status	icd_10	icd_o_3_histology	icd_o_3_site	lymphatic_invasion	lymphnodes_examined	lymphovascular_invasion_present	menopause_status	mononucleotide_and_dinucleotide_marker_panel_analysis_status	neoplasm_histologic_grade	new_tumor_event_after_initial_treatment	number_of_lymphnodes_examined	number_of_lymphnodes_positive_by_he	number_pack_years_smoked	year_of_initial_pathologic_diagnosis	pathologic_M	pathologic_N	pathologic_T	pathologic_stage	person_neoplasm_cancer_status	pregnancies	primary_neoplasm_melanoma_dx	primary_therapy_outcome_success	psa_value	race	residual_tumor	tobacco_smoking_history	tumor_tissue_site	tumor_type	weight	height	BMI	age_began_smoking_in_years	h_pylori_infection	other_dx	other_malignancy_anatomic_site	other_malignancy_histological_type	other_malignancy_malignancy_type	stopped_smoking_year	venous_invasion	year_of_tobacco_smoking_onset
TCGA-AB-2959	LAML	TCGA	31ac4afd-8195-479a-97da-92edaa89a85e	AB	71		25	Washington University							Dead	-26114	489	489		0		NOT HISPANIC OR LATINO	MALE				No																2007										WHITE			Bone Marrow
TCGA-AB-2957	LAML	TCGA	c887d68c-da8e-4f15-a524-3b6674b145ed	AB	31		25	Washington University							Alive	-11626		609	609	0		NOT HISPANIC OR LATINO	MALE				Yes																2007										WHITE			Bone Marrow

(rows: 2, time: 1.1s, cached, job: job_85ES9EiNbkthmYbYU0XJ-j3gGRc)

In [36]:

ds = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data').to_dataframe()

In [37]:

ds.Study.unique()

Out[37]:

array(['LAML', 'HNSC', 'SKCM', 'COAD', 'READ', 'THYM', 'SARC', 'UCS',
       'PRAD', 'THCA', 'DLBC', 'ESCA', 'STAD', 'LIHC', 'CHOL', 'PAAD',
       'LUAD', 'LUSC', 'PCPG', 'MESO', 'OV', 'BRCA', 'CESC', 'UCEC',
       'TGCT', 'KIRP', 'KIRC', 'KICH', 'BLCA', 'LGG', 'GBM', 'ACC', 'UVM'], dtype=object)

In [38]:

import pandas as pd
pd.options.display.max_columns = 100

In [39]:

ds[(ds.Study == 'BRCA')].histological_type.unique()

Out[39]:

array(['Infiltrating Ductal Carcinoma', 'Other, specify',
       'Mucinous Carcinoma', 'Infiltrating Lobular Carcinoma',
       'Mixed Histology (please specify)', 'Metaplastic Carcinoma',
       'Infiltrating Carcinoma NOS', 'Medullary Carcinoma'], dtype=object)

In [40]:

print(ds[(ds.Study == 'BRCA')].pathologic_stage.unique())

['Stage IIA' 'Stage IIB' 'Stage I' 'Stage IIIA' 'Stage IA' 'Stage X'
 'Stage IV' 'Stage IIIB' 'Stage IIIC' 'Stage IB' None 'Stage III'
 'Stage II']
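
The stage labels above mix main stages with substages, so a classifier may see very few patients in some categories. A possible simplification is to collapse substages into their main stage. The helper below is a hypothetical sketch; treating 'Stage X' and missing values as unknown is an assumption.

```python
import re


def collapse_stage(stage):
    """Map e.g. 'Stage IIA' -> 'II'; return None for missing or 'Stage X'."""
    if stage is None:
        return None
    # Longer numerals are listed first so 'III' is not matched as 'I' or 'II'.
    match = re.match(r'^Stage (IV|III|II|I)[ABC]?$', stage)
    return match.group(1) if match else None


stages = ['Stage IIA', 'Stage IIB', 'Stage I', 'Stage IIIA', 'Stage IA',
          'Stage X', 'Stage IV', 'Stage IIIC', None, 'Stage III']
print([collapse_stage(s) for s in stages])
```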

In [41]:

ds[(ds.Study == 'BRCA')].count()

Out[41]:

ParticipantBarcode                      1097
Study                                   1097
Project                                 1097
ParticipantUUID                         1097
TSSCode                                 1097
age_at_initial_pathologic_diagnosis     1096
anatomic_neoplasm_subdivision           1097
batch_number                            1097
bcr                                     1097
clinical_M                                 0
clinical_N                                 0
clinical_T                                 0
clinical_stage                             0
colorectal_cancer                          0
country                                  696
vital_status                            1097
days_to_birth                           1081
days_to_death                            149
days_to_last_known_alive                1096
days_to_last_followup                    947
days_to_initial_pathologic_diagnosis    1096
days_to_submitted_specimen_dx              0
ethnicity                                923
gender                                  1097
gleason_score_combined                     0
histological_type                       1097
history_of_colon_polyps                    0
history_of_neoadjuvant_treatment        1097
hpv_calls                                  0
hpv_status                                 0
                                        ... 
number_of_lymphnodes_examined            971
number_of_lymphnodes_positive_by_he      929
number_pack_years_smoked                   0
year_of_initial_pathologic_diagnosis    1094
pathologic_M                            1097
pathologic_N                            1097
pathologic_T                            1097
pathologic_stage                        1086
person_neoplasm_cancer_status           1061
pregnancies                                0
primary_neoplasm_melanoma_dx               0
primary_therapy_outcome_success            0
psa_value                                  0
race                                    1002
residual_tumor                             0
tobacco_smoking_history                    0
tumor_tissue_site                       1097
tumor_type                                 0
weight                                     0
height                                     0
BMI                                        0
age_began_smoking_in_years                 0
h_pylori_infection                         0
other_dx                                1087
other_malignancy_anatomic_site            76
other_malignancy_histological_type        64
other_malignancy_malignancy_type          77
stopped_smoking_year                       0
venous_invasion                            0
year_of_tobacco_smoking_onset              0
dtype: int64
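
The counts above show that days_to_death is populated for only 149 BRCA patients, while vital_status and days_to_last_known_alive are nearly complete. This means most patients are censored, so survival analysis needs an estimator that accounts for censoring. The sketch below is a minimal Kaplan-Meier estimator on made-up records. Using days_to_last_known_alive as the follow-up time for every patient is an assumption, and `kaplan_meier` is a hypothetical helper rather than a library function.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct death time.

    durations: follow-up time per patient; events: True if the patient died.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)  # deaths plus censored
        if deaths:
            survival *= 1.0 - float(deaths) / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up (vital_status, days_to_last_known_alive) pairs for illustration.
records = [('Dead', 100), ('Alive', 200), ('Dead', 300), ('Alive', 400)]
durations = [days for _, days in records]
events = [status == 'Dead' for status, _ in records]
curve = kaplan_meier(durations, events)
print(curve)
```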

In [42]:

new_ds = ds[['ParticipantBarcode', 'Study', 'Project', 'ParticipantUUID', 'TSSCode', 'age_at_initial_pathologic_diagnosis', 'anatomic_neoplasm_subdivision', 'batch_number',
            'bcr','vital_status', 'days_to_birth', 'days_to_last_known_alive', 'days_to_initial_pathologic_diagnosis', 'gender', 'histological_type', 
             'history_of_neoadjuvant_treatment', 'year_of_initial_pathologic_diagnosis', 'pathologic_M', 'pathologic_N', 'pathologic_T', 'pathologic_stage',
             'person_neoplasm_cancer_status', 'race', 'tumor_tissue_site', 'other_dx'
            ]]
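
The column list above was picked by hand from the counts. A hedged alternative is to let pandas drop every column that is entirely empty, shown here on a made-up frame that reuses three of the real column names.

```python
import pandas as pd

# Toy frame: psa_value is empty for every row, as it is for BRCA patients.
toy = pd.DataFrame({
    'ParticipantBarcode': ['P1', 'P2', 'P3'],
    'pathologic_stage': ['Stage I', None, 'Stage IIA'],
    'psa_value': [None, None, None],
})

# Keep only columns that have at least one non-missing value.
non_empty = toy.dropna(axis=1, how='all')
print(list(non_empty.columns))
```

Applied to the real data, this would look like `ds[ds.Study == 'BRCA'].dropna(axis=1, how='all')`, which also restricts the frame to breast cancer patients before dropping columns.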

In [43]:

new_ds.tumor_tissue_site.value_counts()

Out[43]:

Breast                                                     1098
Lung                                                       1026
Kidney                                                      941
Brain                                                       599
Ovary                                                       582
Head and Neck                                               565
Endometrial                                                 547
Central nervous system                                      515
Thyroid                                                     509
Prostate                                                    500
Colon                                                       460
Stomach                                                     445
Bladder                                                     412
Liver                                                       378
Cervical                                                    307
Bone Marrow                                                 199
Extremities                                                 196
Pancreas                                                    185
Esophagus                                                   185
Trunk                                                       172
Rectum                                                      167
Adrenal gland                                               147
Testes                                                      134
Thymus                                                       97
Adrenal                                                      94
Pleura                                                       87
Choroid                                                      80
Retroperitoneum/Upper abdominal - Retroperitoneum            74
Uterus                                                       57
Bile duct                                                    45
                                                           ... 
Bone                                                          4
Lower Extremity - Foot/ankle                                  4
Superficial Trunk - Buttock                                   4
Retroperitoneum/Upper abdominal - Colon                       4
Omentum                                                       3
Small intestine                                               3
Retroperitoneum/Upper abdominal - Small Intestines            3
Chest - Other (please specify                                 2
Retroperitoneum/Upper abdominal - Other (please specify       2
Peritoneum ovary                                              2
Superficial Trunk - Abdominal wall                            2
Head and Neck - Other (please specify                         2
Chest - Lung/pleura                                           2
Lower Extremity - Groin                                       2
Retroperitoneum/Upper abdominal - Gastric                     2
Primary Tumor                                                 2
Head and Neck - Head                                          2
Lower abdominal/Pelvic - Spermatic Cord                       2
Lower abdominal/Pelvic - Other (please specify                2
Ascites/Peritoneum                                            2
Retroperitoneum/Upper abdominal - Pancreas                    1
Parotid Gland                                                 1
Soft Tissue (muscle  ligaments  subcutaneous)                 1
Head and Neck - Neck                                          1
Gynecological - Ovary                                         1
Superficial Trunk - Flank                                     1
Lower abdominal/Pelvic - Bladder                              1
Chest - Breast                                                1
Other Extranodal Site                                         1
Chest - Mediastinum                                           1
Name: tumor_tissue_site, dtype: int64

In [44]:

new_ds.age_at_initial_pathologic_diagnosis.describe()

Out[44]:

count    11109.000000
mean        59.099019
std         14.415440
min         10.000000
25%         50.000000
50%         60.000000
75%         70.000000
max         90.000000
Name: age_at_initial_pathologic_diagnosis, dtype: float64

In [45]:

new_ds.age_at_initial_pathologic_diagnosis.plot.hist()

Out[45]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f49c325be10>
