Computational Biology and Functional Genomics Laboratory
Driving Discovery Through Data Integration and Analysis
John Quackenbush
Molecular Diagnostics World 2010
28 October 2010
Disease Progression and Personalized Care
[Figure: timeline from Birth to Death. Labels: Genetic Risk; Environment + Lifestyle; Natural History of Disease; Early Detection; Biomarkers; Patient Stratification; Disease Staging; Treatment Options; Treatment; Clinical Care; Outcomes; Quality Of Life.]
Turning the vision into a reality
- Assure access to samples and rational consent
- Develop a technology platform
- Make information integration a central mission
- Conduct research as a vital component
- Present data and information to the local community
- Enable research beyond your own
- Engage corporate partners
- Communicate the mission to the community
Assure Access to Samples
Access, Research, Security
- Patients want to be part of the process of curing disease
- Informed consent needs to be structured to allow patients to be partners in the research process
- HIPAA requires both informed consent and that we assure patient confidentiality
- But “identifiability” is a moving target in a genomic age
- With the <$1000 genome, in the age of Facebook, what this means remains unclear
- The new genomics is a disruptive technology
Develop a Technology Platform
2006: State of the Art Sequencing
PRODUCTION
- Rooms of equipment
- Subcloning > picking > prepping
- 35 FTEs
- 3-4 weeks
SEQUENCING
- 74x capillary sequencers
- 10 FTEs
- 15-40 runs per day
- 1-2 Mb per instrument per day
- 120 Mb total capacity per day
Sequencing the genome took ~15 years and $3B
2008: Enabling a New Era in Genome Analysis
PRODUCTION
- 1x Cluster Station
- 1 FTE
- 1 day
SEQUENCING
- 1x Genome Analyzer
- Same FTE as above
- 1 run per 5 days
- 15 Gb per instrument per run
- >3 Gb per day (1x genome coverage)
We can now re-sequence the genome in ~1 week
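The jump between the 2006 and 2008 figures can be checked with a little arithmetic. The sketch below uses only the numbers quoted on the two slides, plus one assumption that is not stated there: a genome size of roughly 3 Gb, as implied by the "1x genome coverage" note.

```python
# Back-of-the-envelope throughput comparison using the slide figures.
# Assumption (not on the slides): one human genome is roughly 3 Gb.
GENOME_GB = 3.0

# 2006: 74 capillary sequencers with 120 Mb total capacity per day.
capillary_gb_per_day = 120 / 1000  # Mb -> Gb

# 2008: one Genome Analyzer producing 15 Gb per run, one run per 5 days.
analyzer_gb_per_day = 15 / 5


def days_for_1x(gb_per_day: float, genome_gb: float = GENOME_GB) -> float:
    """Days of raw output needed to reach 1x coverage of the genome."""
    return genome_gb / gb_per_day


print(f"2006 facility:   {days_for_1x(capillary_gb_per_day):.0f} days per 1x genome")
print(f"2008 instrument: {days_for_1x(analyzer_gb_per_day):.0f} day(s) per 1x genome")
print(f"Per-day output ratio: {analyzer_gb_per_day / capillary_gb_per_day:.0f}x")
```

This counts raw bases only. A usable re-sequenced genome needs many-fold coverage, which is consistent with the slide's "~1 week" rather than one day.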
The Challenge
New technologies inspired by the Human Genome Project are transforming biomedical research from a laboratory science to an information science
We need new approaches to making sense of the data we generate.
The winners in the race to understand disease are going to be those best able to collect, manage, analyze, and interpret the data.
Make information integration a central mission
[Figure: lab overview, http://compbio.dfci.harvard.edu. Levels: Gene; RNA; Protein; Network; Patient. Cycle: Predict Network; Candidate Gene(s); Perturb Network (RNAi); Assay Response (µA). Tools: TM4 Microarray Software; Resourcerer; MeSHer; ClusterMed; Other Databases; Other tools.]
Other Things: Mesoscopic Expression Correlated Signatures State Space Gene Models Tiling Arrays to Genes
Dealing with an Information Overload
Beating Information Overload
[Figure: Central Warehouse. Data types: Clinical Data; Genomics; Cytogenomics; Metabolomics; Chemical Biology; Clinical Trials; Transcriptomics; Epigenomics; Proteomics; etc. Public resources: PubMed; The Genome; Published Datasets; The HapMap; Disease Databases (OMIM); Drug Bank. Goals: Improved Diagnostics; Individualized Therapies; More Effective Agents.]
Dana-Farber Research DB Conceptual Architecture
[Figure: architecture diagram. Labels: PubMed; GenBank; Rules Engine; BAM; Dashboard; Web Center Portal; Portals; Business Intelligence; Partners; OMICS; IDX; Rx; Lab; Clinical Trial; Idm & Security; HTB; ODS; genomics; Facts; Custom Facts; BPEL; Web Service Directory; De-identification; Terminology; EMPI; Security; Mapping; Auditing; Severity Score; Clinical Pathways; RFID. Legend: Build or Buy; Oracle; Existing.]
An Example: Signature Analysis
[Figure: Warehouse fed by ArrayExpress, GEO, and random websites. Credits: Fenglong Liu; Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard.]
GeneChip Oncology Database
Fenglong Liu
An Example: Signature Analysis
[Figure: Warehouse fed by PubMed, ArrayExpress, GEO, random websites, and in-house studies, leading to Analysis. Credits: Kerm Picard; Fenglong Liu; Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard.]
GeneSigDB – release 2
http://compbio.dfci.harvard.edu/genesigdb
GeneSigDB – comparing cancers
Cancer is a Cell-Cycle Disease
Aedin Culhane, Daniel Gusenleitner
Breast Cancer has unique signatures
Aedin Culhane, Daniel Gusenleitner
A sample research question
How many Multiple Myeloma patients, with bone marrow or blood samples in the bank, and who have a chromosome 13 deletion, responded (complete, partial, or minor remission) to therapy and how many did not respond?
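Once clinical, sample-bank, and cytogenetic data sit in one warehouse, a question like this reduces to a single query. The sketch below is only an illustration: the schema, table names, column names, and the five toy patients are all hypothetical and are not taken from the Dana-Farber warehouse.

```python
import sqlite3

# Hypothetical, minimal schema; every table and column name is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, diagnosis TEXT,
                      del13 INTEGER, response TEXT);
CREATE TABLE sample  (patient_id INTEGER, tissue TEXT);
""")

# Made-up records, for illustration only.
conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)", [
    (1, "Multiple Myeloma", 1, "complete"),
    (2, "Multiple Myeloma", 1, "none"),
    (3, "Multiple Myeloma", 0, "partial"),   # no chromosome 13 deletion
    (4, "Multiple Myeloma", 1, "minor"),     # no banked marrow or blood
    (5, "Breast Cancer",    0, "complete"),  # different diagnosis
])
conn.executemany("INSERT INTO sample VALUES (?, ?)", [
    (1, "bone marrow"), (1, "blood"), (2, "blood"),
    (3, "bone marrow"), (4, "skin"), (5, "blood"),
])

QUERY = """
SELECT CASE WHEN p.response IN ('complete', 'partial', 'minor')
            THEN 'responded' ELSE 'did not respond' END AS outcome,
       COUNT(DISTINCT p.id) AS n_patients   -- a patient may have several samples
FROM patient AS p
JOIN sample AS s ON s.patient_id = p.id
WHERE p.diagnosis = 'Multiple Myeloma'
  AND p.del13 = 1
  AND s.tissue IN ('bone marrow', 'blood')
GROUP BY outcome
"""
counts = dict(conn.execute(QUERY).fetchall())
print(counts)
```

`COUNT(DISTINCT ...)` matters here because the join produces one row per banked sample, not one row per patient.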
A Path Forward
- We are working to develop a two-way strategy for the future: Clinic → Lab and Lab → Clinic
- Consider OncotypeDx
- This approach represents the intellectual framework for future success, and the bridges between the various laboratories and programs
Conduct research as a vital component
Bayesian Networks
Amira Djebbari, Raktim Sinha, Dan Schlauch
When we say “Networks” we mean…
- Genes are represented as “nodes”
- Interactions are represented by “edges”
- Edges can be directed to show “causal” interactions
- Edges are not necessarily direct interactions
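The node-and-edge vocabulary above maps directly onto a small directed-graph structure. The class name and the example edges below are hypothetical and chosen only to illustrate the representation.

```python
from collections import defaultdict


class GeneNetwork:
    """Directed graph: genes are nodes, interactions are edges."""

    def __init__(self):
        self.nodes = set()
        self.children = defaultdict(set)  # source gene -> set of target genes

    def add_edge(self, source, target):
        """Record a directed ("causal") edge source -> target."""
        self.nodes.update((source, target))
        self.children[source].add(target)

    def edges(self):
        """All edges as sorted (source, target) pairs."""
        return sorted((s, t) for s, targets in self.children.items() for t in targets)


# Hypothetical edges; an edge may stand for an indirect interaction.
net = GeneNetwork()
net.add_edge("Gene1", "Gene2")
net.add_edge("Gene1", "Gene3")
net.add_edge("Gene2", "Gene4")

print(sorted(net.nodes))
print(net.edges())
```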
Bayesian network - example
Edges represent dependencies
[Figure: network with nodes Gene1, Gene2, Gene3, Gene4]
Conditional probability table at node “Gene2”:
Gene1               -1     0     1
P(Gene2=1|Gene1)    0.1    0.2   0.7
Learning Bayesian networks:
- Structure
- Conditional probability tables
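The conditional probability table at node Gene2 can be used directly for simple inference. The sketch below takes the three probabilities from the slide. The prior over Gene1's three states is an assumption added for illustration (uniform) and is not given in the talk.

```python
# CPT at node Gene2, from the slide: P(Gene2 = 1 | Gene1 = state).
p_gene2_is_1 = {-1: 0.1, 0: 0.2, 1: 0.7}

# Assumption (not from the slide): Gene1 is equally likely to be in each state.
p_gene1 = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

# Marginalise over the parent:
#   P(Gene2=1) = sum over g of P(Gene2=1 | Gene1=g) * P(Gene1=g)
p_marginal = sum(p_gene2_is_1[g] * p_gene1[g] for g in p_gene1)

# Bayes' rule, reasoning from child back to parent:
#   P(Gene1=1 | Gene2=1) = P(Gene2=1 | Gene1=1) * P(Gene1=1) / P(Gene2=1)
p_posterior = p_gene2_is_1[1] * p_gene1[1] / p_marginal

print(f"P(Gene2=1)           = {p_marginal:.3f}")
print(f"P(Gene1=1 | Gene2=1) = {p_posterior:.3f}")
```

Learning the network means estimating both which edges exist (structure) and tables like `p_gene2_is_1` for every node.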
Bayesian networks - priors
- No free lunch theorem (Wolpert & Macready, 1996): the performance of a general-purpose optimization algorithm iterated on a cost function is independent of the algorithm when averaged over all cost functions.
- This suggests that, when considering a specific application, one can introduce a potentially useful bias using domain knowledge.
A low-cost lunch?
- One can “help” the search along by providing a seed structure representing what we believe is the most likely network
- The network search process will then use gene expression data to look for perturbations on the structure that are supported by the data
- There are many possible sources of prior structures, including the biomedical literature and large-scale interaction studies (PPI)
Bayesian networks using microarray data and literature
- Test set: Golub et al. ALL/AML dataset
- Learn BN with three prior structures: the literature network, Protein-Protein Interaction data (PPI), and literature+PPI
- Perform 200 bootstrap network estimations and find links that are “high confidence”
- Compare without prior (microarray data only) vs. with prior structure from the literature to look for known interactions
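The bootstrap step can be sketched as follows: resample the arrays with replacement, re-estimate the network each time, and keep the edges that recur in most replicates. The learner below is a deliberately simple stand-in (a correlation threshold), not Bayesian network structure learning. The function names, the 0.8 and 0.9 cutoffs, and the toy data are all assumptions.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def learn_edges(data, cutoff=0.8):
    """Stand-in learner (assumption): link genes with |correlation| >= cutoff.
    A real analysis would run Bayesian network structure learning here."""
    return {
        (a, b)
        for a, b in combinations(sorted(data), 2)
        if abs(pearson(data[a], data[b])) >= cutoff
    }


def bootstrap_edge_confidence(data, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each edge is recovered."""
    rng = random.Random(seed)
    n_samples = len(next(iter(data.values())))
    counts = {}
    for _ in range(n_boot):
        # Resample arrays (columns) with replacement, same indices for all genes.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        resampled = {gene: [values[i] for i in idx] for gene, values in data.items()}
        for edge in learn_edges(resampled):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}


# Toy expression data (made up): g2 tracks g1, g3 does not.
toy = {
    "g1": [float(i) for i in range(20)],
    "g2": [2.0 * i + (i % 3) for i in range(20)],
    "g3": [1.0 if i % 2 else -1.0 for i in range(20)],
}
confidence = bootstrap_edge_confidence(toy)
high_confidence = {edge for edge, c in confidence.items() if c >= 0.9}
print(high_confidence)
```

Swapping `learn_edges` for a structure-learning routine, with or without a prior, leaves the bootstrap logic unchanged, which is what makes the four conditions on the slide comparable.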
Amira Djebbari
BN: No Priors
Amira Djebbari
BN: PPI Data
Amira Djebbari
BN: Literature Priors
Amira Djebbari
BN: Literature + PPI
Cell Cycle Gene Subnetwork
Improving the Seeds
- Co-occurrence does not provide directionality for interactions, but a BN is a DAG and our assignment is ad hoc
- The literature contains information about how the genes (and their products) interact
- The challenge is extracting that information from the literature: there is too much to read
- Text mining doesn’t work well for the biomedical literature
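To see why some direction assignment is needed at all, consider one simple ad hoc scheme: fix an ordering of the genes and point every co-occurrence edge from the earlier gene to the later one. This is an illustration only, not necessarily the scheme used in the work described. The gene names and function names are hypothetical.

```python
def orient_by_order(undirected_edges, order):
    """Direct each edge from the earlier to the later gene in `order`.
    Every edge points "forward" in one fixed ordering, so no cycle can form."""
    rank = {gene: i for i, gene in enumerate(order)}
    return sorted(
        (a, b) if rank[a] < rank[b] else (b, a)
        for a, b in undirected_edges
    )


def is_acyclic(nodes, directed_edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed
    in topological order."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for source, target in directed_edges:
        children[source].append(target)
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return removed == len(nodes)


# Hypothetical co-occurrence pairs forming a triangle.
genes = ["GeneA", "GeneB", "GeneC"]
cooccurrence = [("GeneB", "GeneA"), ("GeneC", "GeneB"), ("GeneA", "GeneC")]

seed_edges = orient_by_order(cooccurrence, order=genes)
print(seed_edges, is_acyclic(genes, seed_edges))
```

The result is a valid DAG, but the directions carry no biological meaning, which is the motivation for extracting directional statements from the literature itself.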
Improving the Seeds (2)
Solution: Use a hybrid approach!
- Use text-mining tools to find sentences that contain names of two or more genes
- Use the Amazon Mechanical Turk to extract [subject]-[predicate]-[object] triples
- Define relationships between genes based on the “consensus” interaction
- Combine these results with pathway databases to build seed networks
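The “consensus” step can be sketched as a majority vote over the triples that different workers return for the same gene pair. The voting rule, the 50% threshold, the function name, and the example answers below are assumptions for illustration. The slide does not specify how consensus was defined.

```python
from collections import Counter, defaultdict


def consensus_relations(annotations, min_agreement=0.5):
    """Majority vote per ordered gene pair.

    `annotations` is a list of (subject, predicate, object) triples, one per
    worker answer. A pair's top predicate is kept only if its share of the
    votes is strictly greater than `min_agreement`.
    """
    votes = defaultdict(Counter)
    for subject, predicate, obj in annotations:
        votes[(subject, obj)][predicate] += 1

    consensus = []
    for (subject, obj), counter in sorted(votes.items()):
        predicate, n = counter.most_common(1)[0]
        if n / sum(counter.values()) > min_agreement:
            consensus.append((subject, predicate, obj))
    return consensus


# Made-up worker answers for two sentences.
answers = [
    ("GeneA", "activates", "GeneB"),
    ("GeneA", "activates", "GeneB"),
    ("GeneA", "inhibits", "GeneB"),
    ("GeneC", "inhibits", "GeneA"),
    ("GeneC", "binds", "GeneA"),
]
print(consensus_relations(answers))
```

Pairs on which workers split evenly, like GeneC and GeneA here, yield no consensus relation and contribute no directed seed edge.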
“PredictiveNetworks” seeds from the literature
Present data and information to the local community
LGRC Research Portal
PAGE DETAILS
- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes
Actions:
- Go to data download for selected cohort
- Go to assay detail for selected cohort
- Go to cohort manager
LGRC Research Portal
PAGE DETAILS
Search:
- Facets
- Search within results
- Keyword prompts
- Search history
Table:
- Paged results
- Sortable columns
Actions:
- Go to Gene detail page
- Add genes to ‘gene set’
[Screenshot labels: Annotation Summary; Gene Expression Summary; RNASeq]
PAGE DETAILS
Annotation summary & summary view for each assay/data type, in accordion-style sections:
- GEXP: expression profile across major Dx categories
- RNASeq: exon structure of the gene
- SNPs: table of SNPs in region of gene, highlighting association with major Dx group
- Methylation: methylation profile in region around gene
- Genomic alterations: table of CNVs & alterations observed, w/ freq, in region around gene
Actions:
- Click through to assay detail page
- Add gene to set
LGRC Research Portal
Analysis Tools
[Screenshot: job form. Cohort 1: Set 1; Cohort 2: Set 2; Job name: My job 1; View analysis parameters; Start Analysis; Job Status: Running]
PAGE DETAILS
- Very minimal parameters and options: here just 2 cohorts of interest, maybe a p-value cutoff
- Generates comprehensive report
- Edit-in-place results: don’t set parameters, edit the results
- Analysis goes into queue; email notification when finished
[Screenshot labels: Supervised Analysis; Meta analysis; Analysis of Differential Expression: My Job 1]
PAGE DETAILS
- Very minimal parameters and options
- Generates comprehensive report
- Edit-in-place results: don’t set parameters, edit the results
- Accordion-style result sections
- Generate PDF report of analysis
- Analysis goes into queue; email notification when finished
Unsupervised analysis
Engage corporate partners
We need to find the best tools
- We received a $1M Oracle Commitment grant to create our integrated clinical/research data warehouse
- We’ve partnered with IDBS to create data portals
- We are working with Illumina on a variety of projects
- We are forging relationships with Thomson-Reuters to link genomic profiling data to drug, trial, and patent information
- We are building partnerships with Roche, Genomatix, NEB, and others interested in entering the personal genomics space
Enable research beyond your own
John Quackenbush, Director
Mick Correll, Associate Director
The Mission
The mission of the CCCB is to provide broad-based support for the analysis and interpretation of ‘omic data and in doing so to further basic, clinical and translational research. The CCCB will also conduct research that opens new ways of understanding cancer.
CCCB Service Offering
IT Infrastructure
- Application hosting
- Data management
- Custom software development
- Comprehensive collaboration portals
CCCB Service Offering
Next-Gen Sequencing
- Competitive per-lane pricing
- Integrated informatics
- Major focus for development in 2010
CCCB Service Offering
Analytical Consulting
- Bioinformatics / statistical data analysis
- Experimental design
- Value-add for IT/Sequencing services
CCCB Collaborative Consulting Model
Consulting
1. Initial meeting to understand project scope and objectives
2. Development of an analysis plan and time/cost estimate
3. During project execution, data and results are exchanged through a secure, password-protected collaboration portal
4. Available as ad-hoc service, or larger scale support agreements
Communicate the mission to the community.
The LGRC
Genomics is here to stay
Acknowledgments
The Gene Index Team: Corina Antonescu, Valentin Antonescu, Fenglong Liu, Geo Pertea, Razvan Sultana, John Quackenbush
Array Software Hit Team: Katie Franklin, Eleanor Howe, Sarita Nair, Jerry Papenhausen, John Quackenbush, Dan Schlauch, Raktim Sinha, Joseph White
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom
http://cccb.dfci.harvard.edu
Microarray Expression Team: Stefan Bentink, Thomas Chittenden, Aedin Culhane, Kristina Holton, Jane Pak, Renee Rubio
(Former) Stellar Students: Martin Aryee, Kaveh Maghsoudi, Jess Mar
Systems Support: Stas Alekseev, Sys Admin
Assistant: Patricia Papastamos
http://compbio.dfci.harvard.edu