Computational Biology and Functional Genomics Laboratory

Driving Discovery Through Data Integration and Analysis
John Quackenbush
Molecular Diagnostics World 2010
28 October 2010

Disease Progression and Personalized Care

[Figure: natural history of disease from birth to death. Labels: Genetic Risk, Environment + Lifestyle, Early Detection, Biomarkers, Patient Stratification, Disease Staging, Treatment Options, Treatment, Outcomes, Clinical Care, Quality of Life.]

Turning the vision into a reality

- Assure access to samples and rational consent
- Develop a technology platform
- Make information integration a central mission
- Conduct research as a vital component
- Present data and information to the local community
- Enable research beyond your own
- Engage corporate partners
- Communicate the mission to the community

Assure Access to Samples

Access, Research, Security

- Patients want to be part of the process of curing disease
- Informed consent needs to be structured to allow patients to be partners in the research process
- HIPAA requires both informed consent and that we assure patient confidentiality
- But “identifiability” is a moving target in a genomic age
- With the <$1000 genome, in the age of Facebook, what this means remains unclear
- The new genomics is a disruptive technology

Develop a Technology Platform

2006: State of the Art Sequencing

PRODUCTION
- Rooms of equipment
- Subcloning > picking > prepping
- 35 FTEs
- 3-4 weeks
SEQUENCING
- 74x Capillary Sequencers
- 10 FTEs
- 15-40 runs per day
- 1-2 Mb per instrument per day
- 120 Mb total capacity per day
Sequencing the genome took ~15 years and $3B

2008: Enabling a New Era in Genome Analysis

PRODUCTION
- 1x Cluster Station
- 1 FTE
- 1 day
SEQUENCING
- 1x Genome Analyzer
- Same FTE as above
- 1 run per 5 days
- 15 Gb per instrument per run
- >3 Gb per day (1x genome coverage)
We can now re-sequence the genome in ~1 week
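The throughput figures quoted for 2006 and 2008 can be compared with a little arithmetic. The sketch below is only an illustration: the variable names are invented here, and a 3 Gb genome is an assumption taken from the slide's "1x genome coverage" remark.

```python
# Back-of-envelope comparison using only the numbers quoted on the two slides.
GENOME_GB = 3.0  # ASSUMPTION: genome size in Gb, read off ">3 Gb per day (1x genome coverage)"

# 2006: 74 capillary sequencers, 120 Mb total capacity per day
capillary_gb_per_day = 120 / 1000  # Mb -> Gb

# 2008: one Genome Analyzer, 15 Gb per run, 1 run per 5 days
analyzer_gb_per_day = 15 / 5
coverage_per_run = 15 / GENOME_GB  # fold coverage delivered by one 5-day run

days_per_1x_2006 = GENOME_GB / capillary_gb_per_day
days_per_1x_2008 = GENOME_GB / analyzer_gb_per_day
speedup = analyzer_gb_per_day / capillary_gb_per_day

print(f"2006: {capillary_gb_per_day:.2f} Gb/day, {days_per_1x_2006:.0f} days per 1x genome")
print(f"2008: {analyzer_gb_per_day:.2f} Gb/day, {days_per_1x_2008:.0f} day(s) per 1x genome")
print(f"One 2008 instrument delivers about {speedup:.0f}x the daily output of the 2006 facility")
```

Under these assumptions a single 5-day run yields roughly 5x coverage, which is in line with the slide's "~1 week" re-sequencing estimate.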

The Challenge

New technologies inspired by the Human Genome Project are transforming biomedical research from a laboratory science to an information science

- We need new approaches to making sense of the data we generate
- The winners in the race to understand disease are going to be those best able to collect, manage, analyze, and interpret the data

Make information integration a central mission

[Figure: integrated data and tools spanning the Gene, RNA, Protein, Network, and Patient levels. Resources shown: Gene Index Databases, TM4 Microarray Software, Resourcerer, MeSHer, ClusterMed, Bayesian Nets, other databases, other tools, and a Central Warehouse. Network cycle: Predict Network, Candidate Gene(s), Perturb Network (RNAi), Assay Response (A).]

http://compbio.dfci.harvard.edu

Other Things: Mesoscopic Expression, Correlated Signatures, State Space Gene Models, Tiling Arrays to Genes

Dealing with an Information Overload

Beating Information Overload

[Figure: data sources feeding a Central Warehouse. Internal data: Clinical Data, Genomics, Cytogenomics, Metabolomics, Chemical Biology, Clinical Trials, Transcriptomics, Epigenomics, Proteomics, etc. Public resources: PubMed, The Genome, Published Datasets, The HapMap, Disease Databases (OMIM), Drug Bank. Outcomes: Improved Diagnostics, Individualized Therapies, More Effective Agents.]

Dana-Farber Research DB Conceptual Architecture

[Figure: conceptual architecture diagram. Labels: PubMed, GenBank, Rules Engine, BAM Dashboard, Web Center Portal, Portals, Business Intelligence, Partners, OMICS, IDX, Rx, Lab, Clinical Trial, Idm & Security, HTB, ODS, genomics, Facts, Custom Facts, BPEL, Web Service Directory, De-identification, Terminology, EMPI, Security, Mapping, Auditing, Severity Score, Clinical Pathways, RFID. Legend: Build or Buy, Oracle, Existing.]

An Example: Signature Analysis

[Figure labels: Warehouse; ArrayExpress, GEO, Random Websites (Fenglong Liu).]
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard

GeneChip Oncology Database

Fenglong Liu

An Example: Signature Analysis

[Figure labels: PubMed (Kerm Picard); ArrayExpress, GEO, Random Websites (Fenglong Liu); In-House Studies; Warehouse; Analysis.]
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard

GeneSigDB – release 2

http://compbio.dfci.harvard.edu/genesigdb

GeneSigDB – comparing cancers

Cancer is a Cell-Cycle Disease

Aedin Culhane, Daniel Gusenleitner

Breast Cancer has unique signatures

Aedin Culhane, Daniel Gusenleitner

A sample research question

How many Multiple Myeloma patients, with bone marrow or blood samples in the bank, and who have a chromosome 13 deletion, responded (complete, partial, or minor remission) to therapy and how many did not respond?
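A question like this maps naturally onto a query against an integrated warehouse. The following is a minimal sketch using an in-memory SQLite database; the schema, table names, column names, and toy rows are all hypothetical and do not reflect the actual Dana-Farber warehouse.

```python
import sqlite3

# HYPOTHETICAL schema and toy rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, diagnosis TEXT, del13 INTEGER, response TEXT);
CREATE TABLE sample (patient_id INTEGER REFERENCES patient(id), tissue TEXT);
""")
conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)", [
    (1, "Multiple Myeloma", 1, "complete"),
    (2, "Multiple Myeloma", 1, "none"),
    (3, "Multiple Myeloma", 0, "partial"),  # no chromosome 13 deletion
    (4, "Multiple Myeloma", 1, "minor"),
    (5, "Lymphoma", 1, "complete"),         # different diagnosis
    (6, "Multiple Myeloma", 1, "partial"),  # no banked sample
])
conn.executemany("INSERT INTO sample VALUES (?, ?)", [
    (1, "bone marrow"), (1, "blood"), (2, "blood"),
    (3, "bone marrow"), (4, "bone marrow"), (5, "blood"),
])

QUERY = """
SELECT CASE WHEN p.response IN ('complete', 'partial', 'minor')
            THEN 'responded' ELSE 'did not respond' END AS outcome,
       COUNT(*) AS n
FROM patient AS p
WHERE p.diagnosis = 'Multiple Myeloma'
  AND p.del13 = 1
  AND EXISTS (SELECT 1 FROM sample AS s
              WHERE s.patient_id = p.id
                AND s.tissue IN ('bone marrow', 'blood'))
GROUP BY outcome
"""
counts = dict(conn.execute(QUERY).fetchall())
print(counts)
```

Using `EXISTS` rather than a join keeps a patient with several banked samples from being counted more than once.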

A Path Forward

- We are working to develop a two-way strategy for the future: Clinic → Lab and Lab → Clinic
- Consider OncotypeDx
- This approach represents the intellectual framework for future success – and the bridges between the various laboratories and programs

Conduct research as a vital component

Bayesian Networks

Amira Djebbari, Raktim Sinha, Dan Schlauch

When we say “Networks” we mean…

- Genes are represented as “nodes”
- Interactions are represented by “edges”
- Edges can be directed to show “causal” interactions
- Edges are not necessarily direct interactions
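One way to picture this definition is as a small data structure. The sketch below is an assumption-laden illustration, not code from the talk: the `Edge` class, the `neighbors` helper, and the placeholder gene names are all invented here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """An interaction between two genes (nodes); hypothetical illustration only."""
    source: str
    target: str
    directed: bool = False  # True when the edge is read as a "causal" source -> target link

def neighbors(edges, gene):
    """Genes reachable from `gene` in one step, honoring edge direction."""
    out = set()
    for e in edges:
        if e.source == gene:
            out.add(e.target)
        elif e.target == gene and not e.directed:
            out.add(e.source)  # undirected edges can be followed both ways
    return out

# Placeholder network: an edge records a dependency, not necessarily a physical interaction.
edges = [
    Edge("Gene1", "Gene2", directed=True),
    Edge("Gene2", "Gene3", directed=True),
    Edge("Gene2", "Gene4"),  # undirected association
]

print(neighbors(edges, "Gene2"))
print(neighbors(edges, "Gene4"))
```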

Bayesian network - example

Conditional probability table at node “Gene2”

Gene1                 -1     0     1
P(Gene2=1 | Gene1)    0.1    0.2   0.7

Edges represent dependencies

[Figure: example network with nodes Gene1, Gene2, Gene3, Gene4.]

Learning Bayesian networks:
- Structure
- Conditional probability tables
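To make the conditional probability table concrete, here is a minimal sketch of how it might be used. The three conditional probabilities (0.1, 0.2, 0.7 for Gene1 = -1, 0, 1) are read off the slide; the uniform prior over Gene1 is an assumption added here because the slide does not give one.

```python
# P(Gene2 = 1 | Gene1), with Gene1 discretized to -1, 0, 1 (values from the slide).
p_gene2_on_given_gene1 = {-1: 0.1, 0: 0.2, 1: 0.7}

# ASSUMPTION: uniform prior over Gene1 states; not stated in the talk.
p_gene1 = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

# Marginal: P(Gene2=1) = sum over g of P(Gene2=1 | Gene1=g) * P(Gene1=g)
p_gene2_on = sum(p_gene2_on_given_gene1[g] * p_gene1[g] for g in p_gene1)

# Bayes' rule: P(Gene1=g | Gene2=1)
posterior = {
    g: p_gene2_on_given_gene1[g] * p_gene1[g] / p_gene2_on
    for g in p_gene1
}

print(f"P(Gene2=1) = {p_gene2_on:.3f}")
for g, p in posterior.items():
    print(f"P(Gene1={g:+d} | Gene2=1) = {p:.3f}")
```

Observing Gene2 = 1 shifts belief toward Gene1 = 1, which is the kind of dependency an edge in the network encodes.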

Bayesian networks - priors

No free lunch theorem (Wolpert & Macready, 1996):
- The performance of a general-purpose optimization algorithm iterated on a cost function is independent of the algorithm when averaged over all cost functions
- This suggests that, when considering a specific application, one can introduce a potentially useful bias using domain knowledge

A low-cost lunch?

- One can “help” the search along by providing a seed structure representing what we believe is the most likely network
- The network search process will then use gene expression data to look for perturbations on the structure that are supported by the data
- There are many possible sources of prior structures, including the biomedical literature and large-scale interaction studies (PPI)

Bayesian networks using microarray data and literature

- Test Set: Golub et al. ALL/AML dataset
- Learn BN with literature network as prior structure, Protein-Protein Interaction data (PPI), and literature+PPI
- Perform 200 bootstrap network estimations and find links that are “high confidence”
- Compare without prior (microarray data only) vs. with prior structure from the literature to look for known interactions
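The bootstrap step can be sketched as follows. This is not the Bayesian network learner used in the study: `estimate_edges` is a deliberately simple stand-in (an absolute-correlation threshold) so the resampling logic is runnable, and the 0.8 cutoff for “high confidence”, the toy data, and all function names are assumptions made here.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def estimate_edges(data, threshold=0.6):
    """STAND-IN for structure learning: link genes whose |correlation| exceeds threshold."""
    genes = sorted(data)
    return {
        (a, b)
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
        if abs(pearson(data[a], data[b])) > threshold
    }

def bootstrap_edge_confidence(data, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is recovered."""
    rng = random.Random(seed)
    n_samples = len(next(iter(data.values())))
    counts = {}
    for _ in range(n_boot):
        # Resample the arrays (columns) with replacement, keeping genes aligned.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        resampled = {g: [v[i] for i in idx] for g, v in data.items()}
        for edge in estimate_edges(resampled):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

# Toy expression data: Gene2 tracks Gene1, Gene3 is independent noise.
toy_rng = random.Random(1)
g1 = [toy_rng.gauss(0, 1) for _ in range(40)]
data = {
    "Gene1": g1,
    "Gene2": [x + toy_rng.gauss(0, 0.3) for x in g1],
    "Gene3": [toy_rng.gauss(0, 1) for _ in range(40)],
}

confidence = bootstrap_edge_confidence(data, n_boot=200)
high_confidence = {e for e, c in confidence.items() if c >= 0.8}  # ASSUMED cutoff
print(high_confidence)
```

Swapping `estimate_edges` for a real structure-learning routine, optionally started from a seed network, would leave the bootstrap loop unchanged.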

Amira Djebbari

BN: No Priors

Amira Djebbari

BN: PPI Data

Amira Djebbari

BN: Literature Priors

Amira Djebbari

BN: Literature + PPI

Cell Cycle Gene Subnetwork

Improving the Seeds

- Co-occurrence does not provide directionality for interactions, but a BN is a DAG and our assignment is ad hoc
- The literature contains information about how the genes (and their products) interact
- The challenge is extracting that information from the literature: there is too much to read
- Text mining doesn’t work well for the biomedical literature

Improving the Seeds (2)

Solution: Use a hybrid approach!

- Use text-mining tools to find sentences that contain names of two or more genes
- Use the Amazon Mechanical Turk to extract [subject]-[predicate]-[object] triples
- Define relationships between genes based on the “consensus” interaction
- Combine these results with pathway databases to build seed networks
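The consensus step in this hybrid approach could look something like the sketch below. The talk does not say how consensus was defined, so the majority-vote rule, the 0.6 agreement threshold, the function name, and the placeholder annotations are all assumptions.

```python
from collections import Counter, defaultdict

# HYPOTHETICAL worker answers: (sentence_id, subject, predicate, object).
annotations = [
    ("s1", "GeneA", "activates", "GeneB"),
    ("s1", "GeneA", "activates", "GeneB"),
    ("s1", "GeneB", "activates", "GeneA"),
    ("s2", "GeneC", "inhibits", "GeneA"),
    ("s2", "GeneC", "binds", "GeneA"),
    ("s2", "GeneC", "inhibits", "GeneA"),
    ("s3", "GeneD", "activates", "GeneE"),
    ("s3", "GeneD", "inhibits", "GeneE"),
]

def consensus_triples(annotations, min_agreement=0.6):
    """Per sentence, keep the most common triple if enough workers agree on it."""
    by_sentence = defaultdict(list)
    for sentence_id, subj, pred, obj in annotations:
        by_sentence[sentence_id].append((subj, pred, obj))
    consensus = {}
    for sentence_id, triples in by_sentence.items():
        triple, votes = Counter(triples).most_common(1)[0]
        if votes / len(triples) >= min_agreement:
            consensus[sentence_id] = triple
    return consensus

consensus = consensus_triples(annotations)

# Directed seed edges (subject -> object) for a prior network structure.
seed_edges = {(subj, obj) for subj, _pred, obj in consensus.values()}
print(consensus)
print(seed_edges)
```

Because each triple has a subject and an object, the resulting seed edges carry the directionality that plain co-occurrence lacks.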

“PredictiveNetworks” seeds from the literature

Present data and information to the local community

LGRC Research Portal

PAGE DETAILS
- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes
Actions:
- Go to data download for selected cohort
- Go to assay detail for selected cohort
- Go to cohort manager

LGRC Research Portal

PAGE DETAILS
Search:
- Facets
- Search within results
- Keyword prompts
- Search history
Table:
- Paged results
- Sortable columns
Actions:
- Go to Gene detail page
- Add genes to ‘gene set’

Annotation Summary Gene Expression Summary RNASeq

PAGE DETAILS
Annotation summary & summary view for each assay/data type, in accordion style sections:
- GEXP – expression profile across major Dx categories
- RNASeq – Exon structure of the gene
- SNPs – Table of SNPs in region of gene, highlighting association with major Dx group
- Methylation – Methylation profile in region around gene
- Genomic alterations – table of CNVs & alterations observed w/ freq in region around gene
Actions:
- Click through to assay detail page
- Add gene to set

LGRC Research Portal

Analysis Tools

[Screenshot: analysis job form. Cohort 1: Set 1; Cohort 2: Set 2; Job name: My job 1; View analysis parameters; Start Analysis; Job Status: Running.]

PAGE DETAILS
- Very minimal parameters and options…here just 2 cohorts of interest, maybe p-value cutoff
- Generates comprehensive report
- Edit in place results – Don’t set parameters, edit the results
- Analysis goes into queue, email notification when finished

Supervised Analysis Meta analysis Analysis of Differential Expression: My Job 1

PAGE DETAILS
- Very minimal parameters and options
- Generates comprehensive report
- Edit in place results – Don’t set parameters, edit the results
- Accordion style result sections
- Generate PDF report of analysis
- Analysis goes into queue, email notification when finished

Unsupervised analysis

Engage corporate partners

We need to find the best tools

- We received a $1M Oracle Commitment grant to create our integrated clinical/research data warehouse
- We’ve partnered with IDBS to create data portals
- We are working with Illumina on a variety of projects
- We are forging relationships with Thomson-Reuters to link genomic profiling data to drug, trial, and patent information
- We are building partnerships with Roche, Genomatix, NEB, and others interested in entering the personal genomics space

Enable research beyond your own

John Quackenbush, Director
Mick Correll, Associate Director

The Mission

The mission of the Center for Cancer Computational Biology (CCCB) is to provide broad-based support for the analysis and interpretation of ‘omic data and, in doing so, to further basic, clinical, and translational research. CCCB will also conduct research that opens new ways of understanding cancer.

CCCB Service Offering: IT Infrastructure
- Application hosting
- Data management
- Custom software development
- Comprehensive collaboration portals

CCCB Service Offering: Next-Gen Sequencing
- Competitive per-lane pricing
- Integrated informatics
- Major focus for development in 2010

CCCB Service Offering: Analytical Consulting
- Bioinformatics / statistical data analysis
- Experimental design
- Value-add for IT/Sequencing services

CCCB Collaborative Consulting Model

1. Initial meeting to understand project scope and objectives
2. Development of an analysis plan and time/cost estimate
3. During project execution, data and results are exchanged through a secure, password-protected collaboration portal
4. Available as ad-hoc service, or larger scale support agreements

Communicate the mission to the community.

The LGRC

Genomics is here to stay

Acknowledgments

The Gene Index Team: Corina Antonescu, Valentin Antonescu, Fenglong Liu, Geo Pertea, Razvan Sultana, John Quackenbush
Array Software Hit Team: Katie Franklin, Eleanor Howe, Sarita Nair, Jerry Papenhausen, John Quackenbush, Dan Schlauch, Raktim Sinha, Joseph White
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom
Center for Cancer Computational Biology: Mick Correll, Howie Goodell, Kristina Holton, Jerry Papenhausen, Patricia Papastamos, John Quackenbush

http://cccb.dfci.harvard.edu

Microarray Expression Team: Stefan Bentink, Thomas Chittenden, Aedin Culhane, Kristina Holton, Jane Pak, Renee Rubio
(Former) Stellar Students: Martin Aryee, Kaveh Maghsoudi, Jess Mar
Systems Support: Stas Alekseev, Sys Admin
Assistant: Patricia Papastamos
http://compbio.dfci.harvard.edu