Transcript HPCS_2013

The Data Tsunami in Biomedical Research
Guillaume Bourque
McGill University and Genome Quebec Innovation Center,
Dept. of Human Genetics, McGill University
June 5th, 2013
Next-generation sequencing (NGS)
Stein, Genome Biol. 2010
Falling cost of sequencing
DeWitt, Nat. Biotechnol. 2012
Sequencing human genomes
• 2001: The Human Genome, ~ 3 Billion $
• 2011: 1000 Genomes Project, ~ 10 000 $
• 2013 (?): Your Genome, 100 - 1000 $
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
Sequencing Revolution
• Sanger sequencing: 100s of reactions… 10000s of base pairs…
• Next-Generation sequencing: Millions of reactions! Billions of base pairs!
http://www.brusselsgenetics.be
Metzker, Nat. Rev. Genet. 2010
High-throughput Sequencing
• 2009: 36bp X 20M X 8 lanes = 6 Gbases
• 2013: 2 X 150bp X 250M X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
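The per-run yields above can be checked with a short calculation. This is a minimal sketch using only the slide's figures plus the ~3 billion bp human genome size quoted later in the talk; the function and variable names are illustrative. Note that the "200 Human Genomes" figure corresponds to dividing the yield by the genome size, i.e. roughly 1X coverage each.

```python
# Back-of-the-envelope yield per sequencing run, using the figures on the slide.
HUMAN_GENOME_BP = 3e9  # ~3 billion base pairs, as quoted later in the talk


def run_yield_bases(read_length_bp, reads_per_lane, lanes, reads_per_fragment=1):
    """Total bases produced by one run (paired-end runs use reads_per_fragment=2)."""
    return read_length_bp * reads_per_fragment * reads_per_lane * lanes


yield_2009 = run_yield_bases(36, 20e6, 8)
yield_2013 = run_yield_bases(150, 250e6, 8, reads_per_fragment=2)

print(f"2009: {yield_2009 / 1e9:.2f} Gbases")
print(f"2013: {yield_2013 / 1e9:.0f} Gbases")
print(f"2013 genome equivalents at 1X: {yield_2013 / HUMAN_GENOME_BP:.0f}")
print(f"2013 genomes at 30X: {yield_2013 / (30 * HUMAN_GENOME_BP):.1f}")
```

The 2009 product is 5.76 Gbases, which the slide rounds to 6 Gbases.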
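NGS Technology Comparison
• PacBio
– Method: Single-molecule in real-time
– Read length: 3 kb average
– Error type: indel
– Error rate %: 13 (single-pass)
– Reads per run: 35000–75000
– Time per run: 30 minutes to 2 hours
– Cost per 1 million bases (in US$): $2
– Advantages: Longest read length. Fast.
– Disadvantages: Low yield at high accuracy. Equipment can be very expensive.
• Ion Torrent
– Method: Ion semiconductor
– Read length: 200 bp
– Error type: indel
– Error rate %: ~1
– Reads per run: up to 4M
– Time per run: 2 hours
– Cost per 1 million bases (in US$): $1
– Advantages: Less expensive equipment. Fast.
– Disadvantages: Homopolymer errors.
• 454
– Method: Pyrosequencing
– Read length: 700 bp
– Error type: indel
– Error rate %: ~0.1
– Reads per run: 1M
– Time per run: 24 hours
– Cost per 1 million bases (in US$): $10
– Advantages: Long read size. Fast.
– Disadvantages: Runs are expensive. Homopolymer errors.
• Illumina
– Method: Sequencing by synthesis
– Read length: 50 to 250 bp
– Error type: substitution
– Error rate %: ~0.1
– Reads per run: up to 3.2G
– Time per run: 1 to 10 days
– Cost per 1 million bases (in US$): $0.05 to $0.15
– Advantages: High sequence yield, cost, accuracy
– Disadvantages: Equipment can be very expensive.
• SOLiD
– Method: Sequencing by ligation
– Read length: 50+35 or 50+50 bp
– Error type: A-T bias
– Error rate %: ~0.1
– Reads per run: 1.2 to 1.4G
– Time per run: 1 to 2 weeks
– Cost per 1 million bases (in US$): $0.13
– Advantages: Low cost per base.
– Disadvantages: Slower than other methods, read length, longevity of the platform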
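To put the cost-per-million-bases row of the comparison in perspective, the sketch below scales it up to one human genome. It assumes a ~3 billion bp genome and 30X coverage (both figures appear elsewhere in the talk) and ignores everything except the per-base cost, so the numbers are rough orders of magnitude, not quotes. All names are illustrative.

```python
# Rough cost of one human genome implied by the "cost per 1 million bases" row.
# Ranges are (low, high) in US$, copied from the comparison.
COST_PER_MBASE_USD = {
    "PacBio": (2.0, 2.0),
    "Ion Torrent": (1.0, 1.0),
    "454": (10.0, 10.0),
    "Illumina": (0.05, 0.15),
    "SOLiD": (0.13, 0.13),
}

GENOME_MBASES = 3000  # ~3 billion bp, expressed in millions of bases
COVERAGE = 30         # assumed sequencing depth


def genome_cost_usd(platform, coverage=COVERAGE):
    """Return the (low, high) cost in US$ to sequence one genome at the given depth."""
    low, high = COST_PER_MBASE_USD[platform]
    mbases = GENOME_MBASES * coverage
    return low * mbases, high * mbases


for platform in COST_PER_MBASE_USD:
    low, high = genome_cost_usd(platform)
    print(f"{platform}: ${low:,.0f} to ${high:,.0f}")
```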
Genome Canada
• > $915M investment and > $900M in co-funding
• 100s of large-scale genomics projects
• 5 Innovation centers
Applications (I)
• De novo sequencing
– From the human genome… To all model organisms… To all relevant
organisms (e.g. extreme genomes)… To “all” organisms?
Human Genome
• 3 Billion DNA base pairs (bp)
• Two human genomes are
~99.9% identical
• There are ~3M bp
differences between you and
me
• Some of these differences
explain variation in:
– Disease susceptibility
– Differences in drug metabolism
– …
www.dnacenter.com
Applications (II)
• Genome re-sequencing
– Genetic disorders
– Cancer genome sequencing
– Map genomic structural variations across individuals
– Genealogy and migration
– Agricultural crops
– …
The Cancer Genome Atlas
1000 Genomes Project
Exome sequencing for Mendelian disease
“… about one-half to one-third
(~3,000) of all known or
suspected Mendelian disorders
(for example, cystic fibrosis and
sickle cell anaemia) have been
discovered. However, there is a
substantial gap in our knowledge
about the genes that cause many
rare Mendelian phenotypes.”
“Accordingly, we can realistically look towards
a future in which the genetic basis of all
Mendelian traits is known, …”
Exome sequencing
Cancer genome sequencing
Can obtain a full catalogue of mutations
Michael Stromberg, bioinformatics.ca
Mutations in paediatric glioblastoma
Jabado, Pfister and Majewski
Sequenced the exomes of 48 paediatric GBM samples, found:
• Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
• Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours
Applications (III)
• Quantitative biology of complex systems
– New high-throughput technologies in functional genomics: ChIP-Seq,
RNA-Seq, ChIA-PET, RIP-Seq, …
– From single-gene measurements, to thousands of probes on arrays, to
profiles covering all 3B bases of the genome
– Important systems: Stem cells, Cancer, Infectious diseases…
Big Data
2013
• Image files: 70 TBytes
• Intensity files: 2 X 10 TBytes
• Reads + qualities: 1 TBytes
Big Data
2013
• Intensity files: 2 X 10 TBytes
• Reads + qualities: 1 TBytes
From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, 2013 2:15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): over the next
year we should expect 2x more reads in 2x less
time (and 2x longer)
Does that cause a problem?
Alex
240 TBytes
12 TBytes
25 TB of raw data / month
300 TB of raw data / year
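The monthly and yearly volumes above, and the effect of the change announced in the Illumina note (2x more reads, 2x longer reads, in half the time), can be sketched as follows. The projection is an assumption, not a figure from the talk: it supposes that raw data volume scales linearly with bases produced per unit time and that every instrument gets the upgrade. Function names are illustrative.

```python
RAW_TB_PER_MONTH = 25  # raw data rate quoted on the slide


def yearly_volume_tb(tb_per_month, months=12):
    """Raw data accumulated over a year at a constant monthly rate."""
    return tb_per_month * months


def projected_rate_tb_per_month(tb_per_month, reads_factor=2, length_factor=2,
                                time_factor=0.5):
    """Monthly rate after a throughput change.

    Assumption: data volume is proportional to bases per unit time, so the
    rate grows with reads and read length and shrinks with run time.
    """
    return tb_per_month * reads_factor * length_factor / time_factor


current_year = yearly_volume_tb(RAW_TB_PER_MONTH)
projected_month = projected_rate_tb_per_month(RAW_TB_PER_MONTH)

print(f"Current: {current_year} TB / year")
print(f"Projected: {projected_month:.0f} TB / month, "
      f"{yearly_volume_tb(projected_month):.0f} TB / year")
```

Under these assumptions the three factors of two compound to an eightfold increase in data rate.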
Large NGS project
Cancer project with whole genome data:
• 500 tumors: 500 X 3 lanes = 500 X 250GB = 125 TB raw
vs
• 500 matched-normal: 500 X 3 lanes = 500 X 250GB = 125 TB raw
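The storage estimate for this project follows directly from the per-sample figure. A minimal sketch, assuming decimal units (1 TB = 1000 GB), which is what makes 500 X 250GB come out at 125 TB; the comparison with the 25 TB / month rate quoted earlier in the talk is added here for scale only.

```python
GB_PER_TB = 1000  # decimal units, consistent with 500 X 250GB = 125 TB


def cohort_raw_tb(samples, gb_per_sample):
    """Raw data volume in TB for a cohort with a fixed per-sample footprint."""
    return samples * gb_per_sample / GB_PER_TB


tumors_tb = cohort_raw_tb(500, 250)
normals_tb = cohort_raw_tb(500, 250)
project_tb = tumors_tb + normals_tb

# For scale: months of output at the center-wide rate of 25 TB of raw data / month.
months_of_output = project_tb / 25

print(f"Tumors: {tumors_tb:.0f} TB, matched normals: {normals_tb:.0f} TB")
print(f"Project total: {project_tb:.0f} TB ({months_of_output:.0f} months at 25 TB / month)")
```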
DNA bases sequenced at the
Innovation Center
72 trillion DNA bases!
Or 800 genomes at 30X
12 HiSeqs
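The equivalence between 72 trillion bases and 800 genomes at 30X can be verified in a few lines, assuming the ~3 billion bp genome size used elsewhere in the talk. The function name is illustrative.

```python
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion base pairs


def bases_for_genomes(genomes, coverage):
    """Total bases needed to sequence a number of human genomes at a given depth."""
    return genomes * coverage * HUMAN_GENOME_BP


total_bases = bases_for_genomes(800, 30)
print(f"{total_bases / 1e12:.0f} trillion bases")
```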
adventure.nationalgeographic.com
Biomedical research is built on data integration
Your data
Biomedical research is built on data integration
100X
Your data
Challenges
• NGS instruments generate TBs of data
• NGS instruments are getting faster, cheaper and will
increasingly be found in small research labs and
hospitals
• Data sharing and integration are critical in biomedical
research
• Sequencing data represents sensitive private data
and is identifiable
Nanuq software
Has tracked data and meta-data for more than:
• 2.6 million sample aliquots,
• 20,500 reagents,
• 17,000 plates,
• 140,000 tubes,
• Multiple platforms, technologies and
workflows (sequencing, genotyping, microarray, etc.)
• 3,900 external users
Standardized analysis pipelines
• Methylation: Analysis report
• RNA-Seq: Analysis report
• ChIP-Seq: Analysis report
• …
Data center at the Innovation Center
> 1200 cores
> 2 PB disk
> 5 PB tape
Need more!
UdeS Mammouth – 39168 cores
McGill Guillimin – 16000 cores
Data processing issues
• We have many different projects all needing space
and processing.
• We want to use the Compute Canada clusters for
scalability but also to facilitate data distribution (we
have >800 users).
• This brings uniformity problems:
– Different setups (hardware and software)
– Different configurations
– Etc.
Our strategy
• We wrote analysis pipelines to be easily configurable
across clusters.
• Same code, one ini file to customize (we already have
templates for 3 cluster sites)
• We install Linux modules readable by all on all these
clusters so we know exactly what is available
everywhere
• We also deploy common genomes across sites.
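The "same code, one ini file" approach above might look something like the sketch below. This is an illustration only, not the pipelines' actual code: the site names, section and key names, module name, genome path and the `run_alignment` command are all hypothetical. Only `module load` (from Environment Modules) and Python's `configparser` are real.

```python
import configparser

# Hypothetical per-site ini templates. Every name and path here is invented
# for illustration; real templates would carry the site's own values.
SITE_TEMPLATES = {
    "site_a": """
[cluster]
submit_command = qsub
queue = default

[modules]
aligner = example/aligner/1.0

[genomes]
root = /shared/genomes
""",
    "site_b": """
[cluster]
submit_command = sbatch
queue = main

[modules]
aligner = example/aligner/1.0

[genomes]
root = /project/common/genomes
""",
}


def load_site(site):
    """Parse the ini template for one cluster site."""
    parser = configparser.ConfigParser()
    parser.read_string(SITE_TEMPLATES[site])
    return parser


def render_job(config, sample, genome="human"):
    """Build the same job script everywhere; only the ini values differ."""
    lines = [
        f"module load {config['modules']['aligner']}",
        # run_alignment is a placeholder for whatever tool the pipeline calls.
        f"run_alignment --reference {config['genomes']['root']}/{genome} --sample {sample}",
    ]
    return config["cluster"]["submit_command"], "\n".join(lines)


for site in SITE_TEMPLATES:
    submit, script = render_job(load_site(site), "sample1")
    print(f"[{site}] submit with {submit}:\n{script}\n")
```

Keeping the module names identical across sites and varying only the scheduler and paths is what lets the pipeline code stay unchanged from one cluster to the next.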
Usage on Compute Canada
Canadian Epigenetics,
Environment and
Health Research
Consortium (CEEHRC)
$1.5M
(2012-2017)
PORTal for the Analysis of Genetics and
Genomics Experiments (PORTAGGE)
Conclusions
• NGS offers a variety of technologies and numerous exciting
applications
• Many areas of NGS data analyses are still under active
development (e.g. RNA-Seq)
• A major challenge is to ensure sufficient compute and storage
capacity so that more advanced analyses are not limited
• Need to work together to avoid duplication of efforts in
installing tools but also to develop efficient ways to use HPC in
biomedical research
Acknowledgements
IT team
Terrance Mcquilkin
Marc-André Labonté
Genevieve Dancausse
Andras Frankel
Alexandru Guja
EDCC team
David Morais (UdeS)
Carol Gauthier (UdeS)
Bryan Caron (McGill)
Alain Veilleux (UdeS)
ME Rousseau (McGill)
Analysis team
Louis Letourneau
Mathieu Bourgey
Maxime Caron
Gary Lévesque
Robert Eveleigh
Francois Lefebvre
Johanna Sandoval
Pascale Marquis
Development team
Nathalie Émond
David Bujold
Francois Cantin
Catherine Côté
Burak Demirtas
Daniel Guertin
Louis Dumond Joseph
Francois Korbuly
Marc Michaud
Thuong Ngo
Questions?