Eulerian Graphs and DNA Sequence Assembly


Computational Biology in the 21st Century
Michael Waterman
University of Southern California
Tsinghua University
Outline
I. DNA Sequencing History
a. Shotgun Sequence Assembly
b. New Generation Sequencing
II. Personalized Medicine
III. An Inspiring Example
I. DNA SEQUENCING HISTORY
• Sanger receives the 1958 Nobel Prize for sequencing insulin, a protein
• Sanger and Gilbert receive the 1980 Nobel Prize for DNA sequencing methods
...history repeats itself: sequencing insulin
Fred Sanger: the 1958 (!) Nobel Prize for sequencing insulin by Edman degradation
Average read length = 5 aa!
• Clone and amplify the target DNA
to obtain a large number of identical molecules
• Attach primer to one end of single stranded DNA
• Polymerase extends from labeled primer
• Four separate reactions, for each letter A,T,G,C
• Extension proceeds as normal until a chain terminating
dideoxynucleotide is incorporated
Gearing Up
• Automated sequencing machines
• Labels became fluorescent dyes attached to the products, one color for each terminating base
• Reactions could be run in a single chamber and loaded in one lane of the gel
• Detection of bases became an image-processing problem
• Caltech origins: Lee Hood, Lloyd Smith, the Hunkapiller brothers
I.a. Shotgun Sequence Assembly
When rapid DNA sequencing technology emerged from Sanger's laboratory at the MRC in Cambridge in 1976, Sanger also recruited Roger Staden, who created the first assembly program.
Fragment Assembly
[Figure: overlapping reads sampled along the genome]
In shotgun sequencing, whole genomes are sequenced by
making clones, breaking them into small pieces, and trying
to put the pieces together again based on overlaps.
Note that the fragments are randomly sampled, and thus no
positional information is available.
Whole Genome Assembly: problem description
• The goal is to reconstruct an unknown source sequence
(the genome) on {A, C, G, T} given many random short
segments from the sequence, the shotgun reads.
• A read is a sequence of nucleotides of length 30-800, taken from a random place in the genome.
• Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur at a frequency of 0.5-2%; indels occur roughly 10 times less frequently.
• Strand orientation is unknown.
Assembly is challenging
Even worse…
• We must deal with significant errors in the sequence reads.
• The orientation of each read is unknown.
• Genomes have many repeats (approximate copies of the same
sequence), which are very hard to identify and reconstruct.
• Gaps due to low coverage
• The size of the problem is very large. Celera’s Human Genome
sequencing project contained roughly 26.4 million reads, each about
550 bases long.
Repeats
• Repeats: a major problem for fragment assembly
• >50% of the human genome is repeats:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1,000 bp and longer)
Overlap-Layout-Consensus
Assembler programs: ARACHNE, PHRAP, CAP, TIGR, CELERA
Common Approach:
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors, e.g. ..ACGATTACAATAGGTT..
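To make the three steps concrete, here is a toy greedy overlap-layout-consensus sketch in Python. It assumes error-free reads from a single strand, so the consensus step reduces to concatenation; real assemblers such as ARACHNE or CELERA are far more elaborate, and all function names here are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a matching a prefix of read b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads)
                 if n not in (best_i, best_j)] + [merged]
    return reads

# Toy example (error-free reads from one strand):
print(greedy_assemble(["ACGATTAC", "TTACAATA", "ATAGGTT"]))
# -> ['ACGATTACAATAGGTT']
```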
Human Genome example
• The Celera project
• 26.5 x 10^6 reads of length 550
• Pairs of reads = 7 x 10^14
• One pair takes 3 x 10^5 units
• Total resources = 2 x 10^20 units
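Reconstructing the arithmetic behind this estimate: the per-pair cost of roughly 3 x 10^5 units corresponds to filling a dynamic-programming table for two 550-base reads (550^2 ≈ 3 x 10^5 cells).

```python
n_reads = 26.5e6          # Celera reads
pairs = n_reads ** 2      # ordered pairs of reads
cost_per_pair = 550 ** 2  # ~3e5 alignment cells for two 550-base reads
print(f"pairs ~ {pairs:.1e}, total ~ {pairs * cost_per_pair:.1e}")
# pairs ~ 7.0e+14, total ~ 2.1e+20
```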
SUMMARY
• Computer assembly of genomes is
challenging
• The computer programs used in the HGP
were sophisticated upgrades of Staden’s
approach
• Much work and ingenuity went into these
computational projects
I.b. New Generation
Sequencing
• Novel technologies, highly parallel
• The era of $1000 genomes is coming!
New Generation Sequencing
Current NGS technologies

Platform    | 454 GS FLX Ti | Illumina GA IIx | Illumina HiSeq | SOLiD 4 | SOLiD 4hq | Helicos
Reads       | 1M            | 250M            | 1G             | >1.4G   | ~3G       | 1G
Read length | 400           | 2 x 100         | 2 x 100        | 2 x 50  | 2 x 75    | 35
Run time    | 10 hr         | 9.5 day         | 12 day         | 12 day  | 14 day    | ??
Yield (bp)  | 400M          | 50G             | 200G           | 100G    | 300G      | 35G
Error rate  | <1%           | ~1%             | ~1%            | 0.06%   | 0.01%     | low with 20X
Overlap-Layout-Consensus with short reads
• Finding the consensus is NP-hard
• The number of short reads is enormous, and each overlap is very short => tremendous runtime
Eulerian path approach
Sequence reconstruction is reduced to the search for Eulerian, rather than Hamiltonian, paths in graphs. This avoids the computational difficulties and yields necessary and sufficient conditions for sequence reconstruction from the k-tuple spectrum. The approach was begun by Idury and Waterman (1995).
Edge in graph G -- k-tuple in spectrum S
Euler Approach to Assembly
De Bruijn graph
Vertices: (k-1)-tuples in each read
Edges: k-tuples in each read
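A minimal sketch of the construction and traversal, assuming error-free reads from one strand and a spectrum whose graph has a unique Eulerian path; the traversal is Hierholzer's algorithm, and all names are illustrative:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-tuples, edges are the distinct k-tuples (the spectrum)."""
    spectrum = {read[i:i + k] for read in reads
                for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(spectrum):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_deg = {v: len(ws) for v, ws in graph.items()}
    in_deg = defaultdict(int)
    for ws in graph.values():
        for w in ws:
            in_deg[w] += 1
    # Start where out-degree exceeds in-degree (the path's source), else anywhere.
    start = next((v for v in graph if out_deg[v] - in_deg[v] == 1),
                 next(iter(graph)))
    adj = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj.get(v):
            stack.append(adj[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(v[-1] for v in path[1:])

# Toy spectrum with k = 4:
print(assemble(["ACGATTAC", "TTACAATA", "ATAGGTT"], 4))
# -> ACGATTACAATAGGTT
```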
Euler approach: advantages
• Linear time
• No pairwise alignment
• No layout
• No consensus or multiple alignment
• One-letter indels are no harder to handle than mismatches
Euler approach: difficulties
• Erroneous edges
• Tangled graph
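Erroneous edges largely come from sequencing errors, which tend to create k-tuples seen only once or twice; a standard remedy in the Euler line of work is to keep only k-tuples observed several times before building the graph. A minimal sketch, with the multiplicity cutoff as an assumed tuning parameter:

```python
from collections import Counter

def solid_kmers(reads, k, min_count=2):
    """Keep only k-tuples seen at least min_count times; with adequate
    coverage, most error-induced k-tuples fall below the cutoff."""
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}
```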
II. Personalized Medicine &
Statistics
Wing Hung Wong
Personalized Medicine
• Identify predisposition to disease
• Provide specific diagnosis
• Design effective individualized treatment
All based on the genetic data, biomarker
status, and electronic health record of the
specific individual (health consumer)
Genetic predisposition
• The risk of Crohn's disease is 20 times higher than the population risk if one has an affected sibling. Furthermore, the disease concordance rate is 50% in identical twins versus only 10% in non-identical twins.
• Similar evidence suggests large genetic predispositions for many other diseases, for example diabetes and autism
Predisposition may be traced to
specific genes
• The lifetime risk of breast cancer is about 12%
for women in the general population
• This risk is elevated to 60% for women who
have inherited a harmful mutation in BRCA1
or BRCA2
• These mutations account for >10 percent of the 180,000 breast cancers diagnosed in the United States per year.
To enable personalized
assessment of risk, we need
• knowledge of the susceptibility gene (or
genes)
• the ability to examine the status of these
genetic loci
These are being driven by advances in
genome technology, computing and
statistics
Genome technology: microarrays
Current microarray technology allows the analysis of close to 2 million genetic markers (1 marker per 1,500 bp on average).
Finding susceptibility genes
• n1 cases of patients with the disease,
n2 normal controls
• For each subject, measure status of m
polymorphic markers with genotyping arrays
• Perform statistical analysis of these data to map
the locations of susceptibility genes
An example of such a GWAS study
• The Wellcome Trust Case Control Consortium studied 7 diseases (Nature 2007)
n1 = 2000 for each disease group
n2 = 3000 for a common control sample
m = 500,000 binary markers genotyped for each individual using SNP arrays
Threshold for significance: p-value < 5 x 10^-7
Before 2006, it was not clear whether GWAS would result in replicable associations for common diseases.
By mid-2009, 1,607 loci had been associated with diseases through this approach.
[Figure: genotype matrix, one row of markers per subject, with disease status (yes/no) recorded for each subject]
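A single-marker test of the kind run m = 500,000 times in such a study can be sketched as a chi-square test on the case/control-by-genotype contingency table. The counts below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: cases (n1 = 2000), controls (n2 = 3000).
# Columns: genotype counts (AA, Aa, aa) at one marker; counts are made up.
table = np.array([[620, 960, 420],
                  [1080, 1440, 480]])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
# Report the marker only if p < 5e-7, to survive ~500,000 tests.
```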
Statistical issues
• What test statistics should be used to search for candidate susceptibility loci?
• How can we guarantee that the huge number of statistical tests performed will not result in many false predictions?
• For single-locus tests, these issues have been resolved by new statistical methods in the past few years, e.g., false discovery rate theory and permutation analysis (see the sketch below)
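For intuition, here is a compact sketch of the Benjamini-Hochberg step-up procedure for false discovery rate control (permutation analysis would instead recompute the tests on label-shuffled data to calibrate the null); a sketch, not a production pipeline:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of discoveries, controlling the false discovery rate at q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    # Largest k with p_(k) <= q*k/m; reject the k smallest p-values.
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# e.g. benjamini_hochberg(all_marker_pvalues, q=0.05)
```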
Analysis of multiple markers
Haplotypes (unobserved)                              genotypes

GAACAACCACTGCCATGTCGCCTGGTCA
GAACATCCACTGCCACGTCGCCTGCTCA                         2, 2, 2
----------------------------------------------------
GAACATCCACTGCCATGTCGCCTGGTCA
GAACATCCACTGCCACGTCGCCTGCTCA                         3, 2, 2
----------------------------------------------------
GAACATCCACTGCCACGTCGCCTGCTCA
GAACATCCACTGCCACGTCGCCTGCTCA                         3, 1, 1
----------------------------------------------------
GAACATCCACTGCCATGTCGCCTGGTCA
GAACAACCACTGCCATGTCGCCTGGTCA                         2, 3, 3

Each individual carries two haplotypes; the genotype codes refer to the three SNPs in these sequences (1 and 3 = the two homozygotes, 2 = heterozygous).
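The genotype codes can be read off a haplotype pair mechanically. A small illustrative sketch; the 0-based SNP positions (5, 15, 24) and the 1-coded alleles are taken from the example sequences above:

```python
def genotype(hap1, hap2, snp_positions, allele1):
    """Code each SNP: 1 = homozygous for allele1, 2 = heterozygous,
    3 = homozygous for the other allele."""
    codes = []
    for pos, a in zip(snp_positions, allele1):
        n = (hap1[pos] == a) + (hap2[pos] == a)
        codes.append({2: 1, 1: 2, 0: 3}[n])
    return codes

# First individual from the figure:
hap1 = "GAACAACCACTGCCATGTCGCCTGGTCA"
hap2 = "GAACATCCACTGCCACGTCGCCTGCTCA"
print(genotype(hap1, hap2, [5, 15, 24], ["A", "C", "C"]))  # -> [2, 2, 2]
```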
Why is this difficult?
• Haplotype data would be much more informative but is not yet available with current technology
• The next best option is to analyze the genotypes jointly (rather than singly, as in most studies)
• For m markers there are 3^m joint genotypes
• The statistical challenge is to detect differences in how the cases and controls are distributed among these 3^m joint genotypes
• We have developed methods that can handle m = 20, i.e. up to 3.5 billion joint genotypes (see the sketch below)
Mapping the computation to FPGAs (the BEE compute module) will allow us to handle larger m.
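Since 3^20 ≈ 3.5 x 10^9 cells vastly exceeds the number of subjects, the occupied part of the joint-genotype table is necessarily sparse, and the counting step can be done with a dictionary keyed by the m-marker genotype vector; the FPGA mapping parallelizes essentially this scan. A minimal sketch (function names are illustrative):

```python
from collections import Counter

def joint_genotype_counts(genotypes, labels):
    """genotypes: length-m tuples with entries in {1, 2, 3}; labels: "case"
    or "control" per subject. Only occupied cells of the 3**m table (at most
    n1 + n2 of them) are ever stored."""
    counts = {"case": Counter(), "control": Counter()}
    for g, lab in zip(genotypes, labels):
        counts[lab][tuple(g)] += 1
    return counts

print(3 ** 20)  # 3486784401: ~3.5 billion joint genotypes for m = 20
```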
However, the current approach is fundamentally inefficient
• For each disease, one needs to recruit a large number of affected cases or families
• Need support for participating physicians and nurses; this often costs much more than the genotyping itself
• Does not offer direct benefit to the subjects
• In summary, a large effort to answer just a single question, and often still not enough samples
How to do it better?
• Implementation of interoperable Electronic
Health Record (EHR). In the US, this has just
been started under the Obama administration.
• On a population level, obtain personal genetic
data and provide linkage to EHR.
• Incorporate statistical learning into the routine analysis of these data: finding susceptibility factors and making better diagnoses and treatment decisions based on the EHR and personal genetic data
Advantages
• Can use very large sample sizes (100,000s). Conclusions from statistical analysis will be much more accurate.
• The same genetic data can be used to study many different diseases
• Can study not only genetic predisposition but also how genotype may affect response to drugs. The latter allows customization of treatment to the patient's genetic profile.
What genetic data to collect?
• Genotyping arrays are ready and cheap
• Complete genome sequence is now feasible
but expensive (2nd generation sequencers)
• Haplotype and haploid-genome sequences are most desirable but not yet feasible. These must await 3rd generation sequencers
Genome technology: DNA sequencer
Limitations of traditional sequencers:
• Electrophoresis is slow
• Number of capillaries limited to about 100
2nd gen. sequencers avoid electrophoresis
and exploit massive parallelism
(George Church 2003, Steve Quake, 2003)
From a single machine:
• The sample-preparation-to-imaging cycle takes about 4 days
• 50-100 million short sequences can be read, yielding several billion bp of information
• Just a few weeks are needed to sequence a human genome
Grows faster than Moore's law!
[Picture from Illumina]
Personal genome data
• For $1000, get sequence for all coding
regions of the genome.
• For $10,000, get sequence of whole genome
• By Moore's law, we can expect these costs to drop by a factor of 10 in the next 2-3 years. At that point it becomes feasible to collect personal genome data at the population level, say in cohorts of hundreds of thousands.
What about haploid genomes?
• 3rd gen. sequencers, e.g. those based on ZMWs or nanopores, promise to generate very long reads (approaching 100,000 nt). Such reads can span multiple SNPs and thus allow haplotype reconstruction
• Other novel protocols, such as long-range paired-end reads, can also be combined with current 2nd gen. sequencers to get haplotype information
• Thus, methods/algorithms to analyze such data will soon be needed
The statistical challenges
• Find disease subtypes (discussed later)
• Find stratification based on response of specific drug or
treatment
• Search for patterns in personal genetic data that correlate with
such stratifications
• From (trillion)^n variations, find the pattern(s) that confer high risk in a small subset of the affected (this is why we need large sample sizes). We need algorithms capable of handling this “double” exponential explosion
• Equally important, how to assess the statistical significance of
the discoveries
Better diagnosis & prognosis
• EHRs provide “deep phenotype” information
• Together with suitable molecular-level information obtained through genomic and proteomic analyses, this can help to better classify diseases into subtypes that are more relevant for personalized medicine
• For example, next generation sequencing provides an extremely detailed survey of the isoforms expressed in a particular sample
Final remarks
• The future of medicine is to “personalize”
• Informatics and statistical analysis will be
key to this development
• There are no serious technological barriers
• Major obstacles remain in terms of building the infrastructure and ensuring privacy without sacrificing useful information
Genetics and Environment: integration is key to future medicine
[Figure: the genome, the environment, and complex biological networks produce read-outs (blood proteins, miRNA, mRNA, cells) that feed predictive, personalized diagnostics of disease & health]
Complete Genome Sequences of
Family of Four
• Parents healthy; the kids both have two genetic diseases
• The power of a family sequence permits:
– Noise reduction: very low error rate, <1/100,000
– Discovery of ~230,000 new single-base variants (SNPs)
– First full recombination maps of a family
– Determination of inter-generational mutation rates
– Identification of disease-gene candidates for rare genetic diseases: two in this case
Roach et al., Science, April 30, 2010
Whole Genome Sequencing of a Family
Unaffected parents; children with craniofacial and limb malformations (Miller syndrome) and lung disease
Lynn Jorde and Michael Bamshad: family DNA
Sequencing a nuclear family yields high-resolution crossover sites and haplotypes
Inheritance “patterns” in kids (0.5Mb)
[Figure: chromosome paintings for Dad and Mom over a 0.5 Mb window, colored by inheritance state in the kids: haploidentical paternal, identical, haploidentical maternal, nonidentical]
Genomes of kids
[Figure: inheritance-state map of the kids' genomes across chromosomes 1-22 and X, marked with centromeres, heterochromatin, error regions, CNVs, paternal and maternal recombination sites, and candidate genes ZNF721, DNAH5 (the ciliary dyskinesia gene), KIAA0556, and DHODH (Miller's gene); states: identical, haploidentical paternal, haploidentical maternal, nonidentical]
Inter-generational base-change mutations
• “Genetic errors”: genetically impossible SNPs in kids will most likely be sequencing errors, but some (perhaps ~1/1000) will be new mutations. We can find these!
• New mutations, germline nucleotide substitutions, have never been directly measured before
• The low error rate & the family structure make this approach work
• We trapped ~50,000 sites by hybridization and resequenced them
• The indirect, phylogenetic estimate is between 8 x 10^-9 and 2.1 x 10^-8 per base per generation
• The inter-generational mutation rate of the autosomes is ~1.1 x 10^-8 per base per generation (~30% transversions)
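As a sanity check on this rate, the expected number of new point mutations per child follows directly; a rough back-of-the-envelope calculation, assuming ~3 x 10^9 bases per haploid genome:

```python
mu = 1.1e-8                # substitutions per base per generation (autosomes)
diploid_bases = 2 * 3e9    # both inherited genome copies, ~3e9 bases each
print(mu * diploid_bases)  # ~66 expected new point mutations per child
```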
THANKS FOR
LISTENING!