PowerPoint template - Georgia Institute of Technology

Download Report

Transcript PowerPoint template - Georgia Institute of Technology

Computational Genomics:
Genome assembly
Andrey Kislyuk
23 January 2012
Key resources
 SEQanswers (http://seqanswers.com/)
 BioStar (http://biostar.stackexchange.com/)
17 July 2015 · Computational Genomics
Why do we need to assemble genomes?
 DNA sequencing methods can’t sequence more than about
1000 nucleotides at a time
 Sanger method (1975)
- chain termination with labeled ddNTPs
 Maxam-Gilbert method (1976)
- cleaving agents
- Both require primers for DNA Polymerase, but we can’t
make the primers until we know the sequence!
- Both limited to ~1000 nt on the gel/capillary
- Both require cloning and/or PCR amplification
 Large-scale sequencing: primer walking
- Slow, costly, and error-prone: not practical beyond ~10Kbp
17 July 2015 · Computational Genomics
Shotgun sequencing
 Whole genome shotgun sequencing (1995)
- Hydroshearing: prepare several libraries of random fragments approx. 2, 5, 10, 50…
Kbp long
- Cloning: use bacterial plasmids to grow DNA – problems arise if DNA contains a gene
harmful to the host bacteria
- Picking, amplification
- Sanger sequencing, capillary
electrophoresis, read out fluorescent
dyes with a laser – 4 different colors
 Result: lots of ~1000 nt Sanger reads
- Assemble them with pairwise
sequence alignment
- Multiple coverage corrects errors
 Seems straightforward now, but many
did not believe it could be done!
17 July 2015 · Computational Genomics
How much shotgun sequencing?
 So, can we really sequence the whole genome with this? (No, we can’t.)
 Lander and Waterman (1988):
- Assuming random distribution of reads and ignoring repeat resolution issues,
Define:
 G = genome length
 N = number of reads sequenced
 L = length of a single read
 T = minimum overlap to align the reads
 Then overall coverage is C = LN/G
 Coverage for any given base obeys
the Poisson distribution:
 The number of gaps (bases with 0
coverage) is:
together
C C x e C
PC x  y  
C x!
G  Ne
17 July 2015 · Computational Genomics
LN  T 
  1 
G L
How much shotgun sequencing?
17 July 2015 · Computational Genomics
How much shotgun sequencing?
17 July 2015 · Computational Genomics
Pioneers of sequence assembly
J. Craig Venter’s group at TIGR (later JCVI)
- Created TIGR Assembler, Celera Assembler, and associated tools
Jim Kent (UC Santa Cruz)
- Created GigAssembler
- Allowed the Human Genome Project to compete with Celera
Philip Green’s group at the University of Washington
- Created Phred, Phrap and Consed tools
Sequencing centers: JCVI, Sanger Institute, Whitehead/MIT,
DOE JGI, Baylor HGSC, WUSTL
17 July 2015 · Computational Genomics
Why do we need to assemble genomes?
2nd Generation sequencing methods
- Cheaper and more processive (sequence more data), but shorter
read length
- 454 Pyrosequencing: 200-600 nt average read length
- Illumina: 36-120 nt read length (up to 200 in development)
- ABI SOLiD: 50 nt read length
- Same idea: randomly hydrosheared library
Random reads from across the genome form a big puzzle
17 July 2015 · Computational Genomics
Next Generation Sequencing Technologies
Sequencing by synthesis
Method
Read length Paired-end Coverage
reads
Sanger
1000 nt
Yes,
2x500nt
454
200 (gen 1)
500 (gen 2)
Yes, 2x40nt 20-40X
Solexa/
Illumina
50 nt
100X
SOLiD
35 nt
Yes, 2x25nt 100X
 2nd generation
- 454 Pyrosequencing
- Solexa/Illumina
- SOLiD
 3rd generation
- Single-molecule sequencing
5-10X
- Nanopore sequencing
17 July 2015 · Computational Genomics
454 Pyrosequencing
A
+ PCR Reagents
+ Emulsion Oil
B
Mix DNA library
& capture beads
(limited dilution)
“Break micro-reactors”
Isolate DNA containing beads
Create
“Water-in-oil”
emulsion
Perform emulsion PCR
17 July 2015 · Computational Genomics
454 Pyrosequencing
Load enzyme
beads
Load beads into
PicoTiter™Plate
PicoTiter™Plate
Diameter = 44 μm
44 μm
Depth = 55 μm
Well size = 75 pl
Well density = 480 wells mm-2
1.6 million wells per slide
17 July 2015 · Computational Genomics
454 Pyrosequencing
Sequencing by
synthesis
Photons generated
are captured by
CCD camera
Reagent flow
Margulies et al., 2005
Raw sequencer output
 Sanger: Trace (usually .ab1/.scf file format)
 454: Flowgram (.sff file format)
Flow Order
4-mer
3-mer
T
A
C
G
KEY (TCAG)
Measures the
presence or absence
of each nucleotide at
any given position
2-mer
1-mer
17 July 2015 · Computational Genomics
Assembly algorithms
Paradigms
- Overlap-Layout-Consensus
- De Bruijn graphs
17 July 2015 · Computational Genomics
Differences between an overlap graph and a de Bruijn graph
Schatz M C et al. Genome Res. 2010;20:1165-1173
©2010 by Cold Spring Harbor Laboratory Press
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and
contigs into supercontigs
Consensus: derive the DNA
sequence and correct read errors
..ACGATTACAATAGGTT..
Credit: Serafim Batzoglou
Overlap: A pairwise alignment problem
 Find the best match between the suffix of one read and the prefix
of another
 Due to sequencing errors, need to use dynamic programming to
find the optimal overlap alignment
 Apply a filtration method to filter out pairs of fragments that do
not share a significantly long common substring
Credit: Serafim Batzoglou
Overlapping Reads
•
Sort all k-mers in reads
(k ~ 24)
•
Find pairs of reads sharing a k-mer
•
Extend to full alignment – throw away if not
>95% similar
TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
Credit: Serafim Batzoglou
Overlapping Reads and Repeats
 A k-mer that appears N times initiates N2 comparisons
 For an Alu that appears 106 times  1012 comparisons – too much
 Solution:
Discard all k-mers that appear more than
t  Coverage, (t ~ 10)
Credit: Serafim Batzoglou
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Credit: Serafim Batzoglou, Masahiro Kasahara
Finding Overlapping Reads (cont’d)
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
C:
C:
T:
C:
C:
20
35
30
35
40
C:
C:
C:
C:
C:
20
35
0
35
40
A:
A:
A:
A:
15
25
40
25
A:
A:
A:
A:
A:
15
25
0
40
25
• Score alignments
• Accept alignments with good scores
Credit: Serafim Batzoglou
Layout
 Repeats are a major challenge
 Do two aligned fragments really overlap, or are they from two
copies of a repeat?
Credit: Serafim Batzoglou
The k-mer uniqueness ratio
Schatz M C et al. Genome Res. 2010;20:1165-1173
©2010 by Cold Spring Harbor Laboratory Press
Merge Reads into Contigs
repeat region
Merge reads up to potential repeat boundaries
Credit: Serafim Batzoglou
Merge Reads into Contigs (cont’d)
repeat region
 Ignore non-maximal reads
 Merge only maximal reads into contigs
Credit: Serafim Batzoglou
Merge Reads into Contigs (cont’d)
repeat boundary???
sequencing
error
b
a
 Ignore “hanging” reads when detecting repeat boundaries
Credit: Serafim Batzoglou
Merge Reads into Contigs (cont’d)
?????
Unambiguous
• Insert non-maximal reads whenever unambiguous
Credit: Serafim Batzoglou
Link Contigs into Supercontigs (aka scaffolds)
Normal density
Too dense:
Overcollapsed?
(Myers et al. 2000)
Inconsistent links:
Overcollapsed?
Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs
Connect contigs incrementally, if  2 links
Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d)
Fill gaps in supercontigs with paths of
overcollapsed contigs
Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d)
Contig A
d ( A, B )
Contig B
Define G = ( V, E )
V := contigs
E := ( A, B ) such that d( A, B ) < C
Reason to do so: Efficiency; full shortest paths cannot be computed
Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d)
Contig A
Contig B
Define T: contigs linked to either A or B
Fill gap between A and B if there is a path in
G passing only from contigs in T
Credit: Serafim Batzoglou
Consensus
 A consensus sequence is derived from a profile of the assembled
fragments
 A sufficient number of reads is required to ensure a statistically
significant consensus
 Reading errors are corrected
Credit: Serafim Batzoglou
Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted
voting
Credit: Serafim Batzoglou
Mate pairs and paired-end reads
 Mate pairs: Circularize and trim size-selected fragments during
library preparation. Inserts can be approx. 1, 5, 10, 20 Kbp long.
 Paired-end reads: Sequence a short amplified fragment from both
ends. Fragment length is more precise but limited to about 300
bp.
17 July 2015 · Computational Genomics
Mate pairs/Paired-end reads
17 July 2015 · Computational Genomics
Paired end reads (aka mate pairs)
17 July 2015 · Computational Genomics
Credit: 454 Life Sciences
Base Calling and Trimming
 Base Calling: the process of translating the raw sequencer output
into
1. The most likely nucleotide sequence
2. Confidence scores for each position
.FASTQ or .SFF
 Trimming: the process of removing adapter, key, vector, and/or
low quality sequence from a read
17 July 2015 · Computational Genomics
Reference-based assembly
 Reference-based assembly
- Replaces overlap detection with alignment against a similar genome
- Also called mapping, mapped assembly
17 July 2015 · Computational Genomics
Credit: M. Schatz
Reference-based assembly
 Use a related genome to ease the layout task
- Much faster computationally
- Arranges reads with more confidence, so a better
assembly is possible
- Allows other types of analysis: somatic mutations,
organismal SNPs, structural variation, RNA-Seq, …
17 July 2015 · Computational Genomics
Assembly quality control
 QC/QA
 Metrics: Size, number of contigs, N50
 Diagnostic procedures
17 July 2015 · Computational Genomics
Genome size as predicted from the assembly
17 July 2015 · Computational Genomics
Read length, paired-end reads, coverage
- Read length and paired-end reads matter.
- Long reads can span repeat regions
- Paired-end reads can reach into repeat regions and bridge gaps
- Combination of the two maximizes shotgun sequencing
performance
- Coverage also matters.
- High coverage allows very high confidence in base calling
- Can do repeat resolution based on coverage fitting
- More likely that a read will span an ambiguous region
17 July 2015 · Computational Genomics
Scaffolding
 If paired end reads are available, scaffolding is already done.
 If not (our case)…
- Sequenced relatives may exist (our case)
- Use reference-based assembly to predict scaffolding
- No ready-made tools available for this
- Can be inaccurate
Assemblers can get confused by repeats or overlaps that
are too short
- May be able to join by hand
- Manual gap fill
- Automated gap fill (no tools exist yet)
17 July 2015 · Computational Genomics
Finishing
 Finishing is the process of completing the chromosome sequence.
- Close all gaps (usually by PCR, but large gaps in big genomes can be sent back
to make BACs for resequencing)
- Re-sequence areas with less than 2x, 3x, 5x coverage (depending on quality
standard) – same procedure as gaps
- Check and manually assemble unresolved repeat regions
- Check for mis-assembly by analyzing the overlap graph
 This is the most expensive and time-consuming part of sequencing.
- Lots of small projects omit finishing and work with draft genomes
17 July 2015 · Computational Genomics
Assemblers we used and our results
Strain
Reads
Avg.
coverage
Bases
Avg. read
length
A_PH_M13220
197,000
47.6M
20×
240
B_OR_M10699
420,000
82M
35×
195
C_NYC_M15141
379,000
94.3M
40×
245
W135_BF_M9261
605,000
64M
27×
105
Strain
Newbler
Newbler + AMOS + manual gap fill
Contigs
Nt, max length
Contigs
Nt, max length
% gapfilled
A_PH_M13220
332
2.06M
68K
57 (41)
2.3M
398K
1.8%
B_OR_M10699
169
2.10M
114K
40
2.18M
435K
1.1%
C_NYC_M15141
176
2.05M
170K
50 (34)
2.28M
759K
2.0%
W135_BF_M9261
205
2.0M
152K
27 (79)
2.21M
866K
1.6%
17 July 2015 · Computational Genomics
Homopolymer errors
 Specific to 454 pyrosequencing
 Sequencing errors usually result in frameshifts!
Flow Order
4-mer
3-mer
T
A
C
G
KEY (TCAG)
Measures the
presence or absence
of each nucleotide at
any given position
2-mer
1-mer
17 July 2015 · Computational Genomics
Visualization tools: Hawkeye
17 July 2015 · Computational Genomics
Visualization tools: Mauve
17 July 2015 · Computational Genomics
More topics
 Currently relevant assemblers
- Newbler (demo)
- Velvet
- ALLPATHS-LG
- ABYSS
- SHRiMP
- Celera/WGS
- PHRAP
 Other visualization tools (Consed, MAQ, Prospector 2, ABySSExplorer…)
 Microread assembly (Solexa and SOLiD)
 de Bruijn graph assembly paradigm
17 July 2015 · Computational Genomics