Slides 3: NGS short
CS 6293 Advanced Topics:
Current Bioinformatics
Genome Assembly: a brief
introduction
Slides Adapted from Mihai Pop, Art Delcher, and Steven Salzberg
Homework #2
• #1: questions will be posted online before Monday class
• #2: Form groups of 3
– Each group reads two papers on a topic:
Short reads alignment or assembly
– Present the papers and do some comparison
– ~8 minutes presentation
• You can choose to go to some really cool details
• Or give the main idea of the paper
– Other teams (and me) will judge you
– Send me names in your group and optionally papers you want to
present
– List of papers:
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
Genome sequencing
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3 x 10^9 nucleotides
~500 nucleotides
A big puzzle
~60 million pieces
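The piece count follows from simple arithmetic. The sketch below assumes 10x oversampling of the genome (a value chosen here because it reproduces the ~60 million figure; the slide itself does not state it), and the function name is illustrative only.

```python
def reads_needed(genome_size, read_length, coverage):
    """Number of reads whose total length equals coverage * genome size."""
    return round(coverage * genome_size / read_length)

genome_size = 3 * 10**9   # nucleotides, from the slide
read_length = 500         # nucleotides per read, from the slide
coverage = 10             # ASSUMPTION: 10x oversampling

print(reads_needed(genome_size, read_length, coverage))
```

At 1x coverage the same formula gives 6 million pieces, so the size of the puzzle grows linearly with the oversampling factor.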
Computational Fragment Assembly
Introduced ~1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome
Shotgun DNA Sequencing
(Technology)
DNA target sample
1. SHEAR
2. SIZE SELECT (e.g., 10Kbp ± 8% std. dev.)
3. LIGATE & CLONE (into Vector)
4. SEQUENCE (from Primer): End Reads (Mates), 550bp
Whole Genome Shotgun
Sequencing
– Collect 10x sequence in a 1-to-1 ratio of two types of read pairs
(Short: 2Kbp, Long: 10Kbp): ~ 35 million reads for Human.
– Collect another 20X in clone coverage of 50Kbp end sequence pairs
(BAC 5’ and BAC 3’ ends): ~ 1.2 million pairs for Human.
– Early simulations showed that if repeats were considered black
boxes, one could still cover 99.7% of the genome unambiguously.
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
Sequencing Factory
Celera’s Sequencing Factory
(circa 2001)
300 ABI 3700 DNA Sequencers
50 Production Staff
20,000 sq. ft. of wet lab
20,000 sq. ft. of sequencing space
800 tons of A/C (160,000 cfm)
$1 million / year for electrical service
$10 million / month for reagents
Human Data (April 2000)
Collected 27.27 Million reads = 5.11X coverage
21.04 Million are paired (77%) = 10.52 Million pairs
Insert size   Pairs     True *   Std. dev.
2Kbp          5.045M    98.6%    <6%
10Kbp         4.401M    98.6%    <8%
50Kbp         1.071M    90.0%    <15%
* validated against finished Chrom. 21 sequence
The clones cover the genome 38.7X
Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
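Clone coverage counts the bases spanned by each clone insert, not just the bases actually read. A rough sketch of that calculation follows; it assumes nominal insert sizes and a 3 x 10^9 nucleotide genome, so it only approximates the figure quoted on the slide. The function name and tuple layout are illustrative.

```python
def clone_coverage(libraries, genome_size):
    """libraries: iterable of (insert_size_bp, number_of_pairs).

    Each pair comes from one clone, so a library spans roughly
    insert_size * pairs bases of the genome.
    """
    spanned = sum(insert * pairs for insert, pairs in libraries)
    return spanned / genome_size

# Pair counts from the slide; insert sizes are the nominal library sizes.
libraries = [(2_000, 5.045e6), (10_000, 4.401e6), (50_000, 1.071e6)]

# ASSUMPTION: genome size of 3e9 nucleotides, as on the earlier slide.
print(f"{clone_coverage(libraries, 3e9):.1f}X clone coverage")
```

Most of the clone coverage comes from the 50Kbp library even though it has the fewest pairs, which is why long inserts matter so much for scaffolding.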
Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and
orientation are not known.
[Figure: a contig, i.e. reads under a consensus (15-30Kbp); order and orientation of the neighboring contigs marked "?"]
Pairs, especially groups of corroborating ones, link the contigs
into scaffolds where the size of gaps is well characterized.
[Figure: a scaffold; contigs linked by a 2-pair, so the mean & std. dev. of the gap is known]
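One way to see why corroborating pairs characterize a gap well: each pair gives an independent estimate of the gap, and combining them shrinks the uncertainty. The sketch below uses inverse-variance weighting, which is an assumption made here for illustration and not necessarily what any particular scaffolder does; all names are hypothetical.

```python
import math

def estimate_gap(pairs):
    """pairs: list of (insert_mean, insert_sd, a, b).

    a = bases from the start of the read in the left contig to that contig's end
    b = bases from the right contig's start to the far end of the mate
    Each pair implies gap = insert_mean - a - b, with std. dev. insert_sd.
    """
    weights = [1.0 / sd**2 for _, sd, _, _ in pairs]
    gaps = [mean - a - b for mean, _, a, b in pairs]
    total = sum(weights)
    gap = sum(w * g for w, g in zip(weights, gaps)) / total
    return gap, math.sqrt(1.0 / total)

# Two pairs from a 10Kbp library with 8% std. dev. (values from the slides).
pairs = [(10_000, 800, 3_000, 6_000), (10_000, 800, 4_000, 5_200)]
gap, sd = estimate_gap(pairs)
print(f"gap ~ {gap:.0f} bp, std. dev. ~ {sd:.0f} bp")
```

With n pairs from the same library the std. dev. of the gap estimate falls as 1/sqrt(n) under this model.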
Anatomy of a WGS Assembly
STS
Chromosome
STS-mapped Scaffolds
Contig
Read pair (mates)
Gap (mean & std. dev. known)
Consensus
Reads (of several haplotypes)
SNPs
External “Reads”
Assembly gaps
Physical gaps
Sequencing gaps
sequencing gap - we know the order and orientation of the contigs and have at
least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA
spanning the gap
Assembly paradigms
• Overlap-layout-consensus
– greedy (TIGR Assembler, phrap, CAP3...)
– graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short
read sequencing)
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment
overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges
can be done
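The greedy loop above can be sketched in a few lines. This is a toy version under stated assumptions, not how TIGR Assembler or phrap are implemented: overlaps must be exact, the score is simply the overlap length, reverse complements are ignored, and the function names are made up for illustration.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(dict.fromkeys(reads))       # drop exact duplicate reads
    while True:
        best = (0, None, None)               # (overlap length, i, j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                           # no more merges can be done
            return reads
        merged = reads[i] + reads[j][n:]     # merge the two fragments
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)

print(greedy_assemble(["ATGGCG", "GGCGTG", "CGTGCA"]))
```

Because each merge is chosen locally, a repeat can make the best-scoring overlap join two reads that are not adjacent in the genome.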
(A) Overlap between two reads—note that agreement within overlapping region need not be
perfect; (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C)
Assembly produced by the greedy approach.
Pop M Brief Bioinform 2009;10:354-366
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph over reads 1-9; layout of the reads; consensus ACCTGA called from aligned reads ACCTGA, AGCTGA, ACCAGA]
Paths through graphs and
assembly
• Hamiltonian circuit: visit each node (city) exactly
once, returning to the start
• Hamiltonian path: visit each node (city) exactly
once
[Figure: reads A-I as nodes of a graph; a path visiting each node once corresponds to the genome]
Overlap between two
sequences
         overlap (19 bases)    overhang (6 bases)
        GGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGC
overhang
% identity = 18/19 = 94.7%
overlap - region of similarity between sequences
overhang - un-aligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size.
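The numbers on this slide can be reproduced with a small sketch. It assumes the overlap length is already known and allows only substitutions (no gaps); the thresholds in `passes_screen` are hypothetical values chosen for illustration, not those of any particular assembler.

```python
def suffix_prefix_overlap(a, b, length):
    """Compare the last `length` bases of a with the first `length` bases of b."""
    left, right = a[-length:], b[:length]
    matches = sum(x == y for x, y in zip(left, right))
    return {
        "length": length,
        "identity": matches / length,
        "overhang_a": len(a) - length,   # un-aligned start of a
        "overhang_b": len(b) - length,   # un-aligned end of b
    }

def passes_screen(stats, min_len=15, min_identity=0.90, max_overhang=10):
    # ASSUMPTION: all three thresholds are made-up example values.
    return (stats["length"] >= min_len
            and stats["identity"] >= min_identity
            and max(stats["overhang_a"], stats["overhang_b"]) <= max_overhang)

lower = "CAGTACTTGGATGCGCTGACACGTAGC"   # second sequence on the slide
upper = "GGATGCGCGGACACGTAGCCAGGAC"     # first sequence on the slide
stats = suffix_prefix_overlap(lower, upper, 19)
print(f"{stats['identity']:.1%}", stats["overhang_a"], stats["overhang_b"])
print(passes_screen(stats))
```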
All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n x coverage (e.g. 8)
pairs are possible
– Build a table of k-mers contained in sequences
(single pass through the genome)
– Generate the pairs from k-mer table (single pass
through k-mer table)
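A minimal sketch of the two passes described above: first index every k-mer by the reads that contain it, then emit only the read pairs that share at least one k-mer. It ignores reverse complements and does not filter frequent k-mers, both of which a real overlapper would need to handle; the function name is illustrative.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    # Pass 1: table of k-mer -> ids of the reads containing it.
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(read_id)
    # Pass 2: every pair of reads sharing a k-mer is a candidate overlap.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ATGGCG", "GGCGTG", "CGTGCA", "TTTTTT"]
print(sorted(candidate_pairs(reads, k=4)))
```

Only the candidate pairs are then passed to the expensive alignment step, so reads with no shared k-mer are never compared.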
[Figure: reads A-I grouped by a shared k-mer]
BWT-based overlap detection
• Efficient construction of an assembly string graph using
the FM-index, Jared T. Simpson and Richard Durbin,
Bioinformatics, 26 (12): i367-i373 (2010)
• Read it yourself for more details
[Figure: BWT for multiple sequences, showing a read ending in ACT$ alongside reads beginning with ACT]
OVERLAP GRAPH
Edge Types:
• Regular Dovetail
• Prefix Dovetail
• Suffix Dovetail
[Figure: relative placement of reads A and B for each edge type]
E.g.: edges are annotated with deltas of overlaps
The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps:
[Figure: reads A, B, C; the overlap between A and C can be inferred from the overlaps A-B and B-C]
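Step 1 can be sketched as a transitive reduction of a directed overlap graph. This toy version looks only at graph topology; it assumes that any edge A->C with a two-step detour A->B->C is inferrable, whereas a real unitigger would also check that the overlap lengths are consistent. The function name is illustrative.

```python
def remove_transitive_edges(edges):
    """edges: set of (u, v) directed overlaps. Drop u->w when u->v and v->w exist."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    reduced = set(edges)
    for u, v in edges:
        for w in successors.get(v, ()):
            if w in successors[u]:        # u->w is implied by u->v->w
                reduced.discard((u, w))
    return reduced

overlaps = {("A", "B"), ("B", "C"), ("A", "C")}
print(sorted(remove_transitive_edges(overlaps)))
```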
The Unitig Reduction
2. Collapse “Unique Connector” Overlaps:
[Figure: reads A and B joined by a unique connector overlap and collapsed into one unit; figure labels 412, 352, 45]
Celera Assembly Pipeline
• Trim & Screen
• Overlapper
– Find all overlaps of at least 40bp allowing 6% mismatch.
– An overlap between A and B implies it is either TRUE or REPEAT-INDUCED
• Unitiger
– Compute all overlap-consistent sub-assemblies:
Unitigs (Uniquely Assembled Contigs)
• Scaffolder
– Scaffold U-unitigs with confirmed pairs (mated reads)
• Repeat Rez I, II
– Fill repeat gaps with doubly anchored positive unitigs (Unitig>0)
Handling repeats
1. Repeat detection
– pre-assembly: find fragments that belong to
repeats
• statistically (most existing assemblers)
• repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of
repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and
potential mis-assemblies.
• Reputer, RepeatMasker
• "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
– find DNA fragments belonging to the repeat
– determine correct tiling across the repeat
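The "unhappy" mate-pair check from the post-assembly step can be sketched as a simple classifier. The 3-standard-deviation cutoff is an assumption chosen for illustration, and the function and argument names are hypothetical.

```python
def classify_mate_pair(separation, correctly_oriented, mean, sd, n_sd=3.0):
    """Compare a pair's placement in the assembly with its library statistics.

    separation: observed distance between the outer ends of the two mates
    correctly_oriented: True if the mates face each other as expected
    n_sd: ASSUMPTION, number of std. devs. tolerated before flagging
    """
    if not correctly_oriented:
        return "mis-oriented"
    if separation < mean - n_sd * sd:
        return "too close"
    if separation > mean + n_sd * sd:
        return "too far"
    return "happy"

# 10Kbp library with 8% std. dev., as in the earlier slides.
for sep, ok in [(10_000, True), (5_000, True), (14_000, True), (10_000, False)]:
    print(sep, ok, classify_mate_pair(sep, ok, mean=10_000, sd=800))
```

A cluster of "too close" pairs over one region is a typical hint of a collapsed repeat, while isolated unhappy pairs are more likely chimeric clones.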
Statistical repeat detection
Significant deviations from average coverage flagged as
repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical
value
Problem 1: assumption of uniform distribution of fragments leads to false positives
- non-random libraries
- poor clonability regions
Problem 2: repeats with low copy number are missed, leading
to false negatives
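The arrival-rate comparison can be made concrete with a Poisson model. If reads start uniformly at random, a unique contig of length L is expected to contain L*N/G reads (N reads, genome size G), and a two-copy repeat collapsed into the same contig twice as many. The sketch below computes the log-odds of "unique" versus "two-copy"; it is written in the spirit of the arrival-rate statistic, not as the exact formula of any assembler, and it inherits the uniformity assumption criticized in Problem 1.

```python
import math

def unique_vs_two_copy_log_odds(contig_len, reads_in_contig, total_reads, genome_size):
    """ln[ Poisson(k; lam) / Poisson(k; 2*lam) ] = lam - k*ln(2).

    Positive values favor a unique contig, negative values a collapsed repeat.
    """
    lam = contig_len * total_reads / genome_size   # expected reads if unique
    return lam - reads_in_contig * math.log(2)

# ASSUMPTION: illustrative numbers, 30 million reads over a 3e9 nt genome.
for k in (100, 200):
    score = unique_vs_two_copy_log_odds(10_000, k, 30_000_000, 3e9)
    print(k, round(score, 1))
```

The score grows with contig length, which matches Problem 2: short or low-copy repeats produce scores too close to zero to call confidently.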
Mis-assembled repeats
[Figure: three kinds of mis-assembled repeat: collapsed tandem, excision, rearrangement. Unique segments a-f and repeat copies I-IV are shown in their true and mis-assembled arrangements.]
Eulerian path-based assembly
• Break each read into k-mers (typically k >= 19)
• Construct a de Bruijn graph using the k-mers
from all reads
– Each k-mer is a node
– v1 has a directed edge to v2 if the last k-1 chars of v1
equal the first k-1 chars of v2, i.e. v1 is obtained by removing
the last char from v2 and adding a new char at the beginning. E.g.
v1 = acgtctgact
v2 = cgtctgactg
• Find an Eulerian path in the graph
– visits each edge exactly once
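The steps above can be sketched on a toy example. Two simplifications are assumed here: an edge is added only between k-mers that are adjacent in some read (a common variant of the definition), and each such edge is kept once, which is only valid when the genome has no repeated (k+1)-mer. The path is found with Hierholzer's algorithm; all function names are illustrative.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Nodes are k-mers; an edge joins consecutive k-mers of a read."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return edges

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)            # out-degree minus in-degree
    for u, v in sorted(edges):
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    nodes = sorted(balance)
    start = next((n for n in nodes if balance[n] == 1), nodes[0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit node
    return path[::-1]

def spell(path):
    """First k-mer plus the last char of every following k-mer."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
path = eulerian_path(de_bruijn_edges(reads, k=3))
print(spell(path))
```

Note that the reads are never compared with each other: the overlaps are implicit in the shared k-mers.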
[Figure panels: 1. Sequencing; 2. Constructing a de Bruijn graph; 3. Simplification; 4. Error removal]
Eulerian path-based assembly
• No need to compute pairwise overlaps – important for
NGS data
• Eulerian paths are much easier to find than Hamiltonian
paths
– Catch: multiple Eulerian paths may exist
– Loss of information
• Repeats appear as cycles in the graph
– Less likely to cause mis-assembly
• More suitable for short-read assembly
– Newbler
– VELVET
– EDENA
– ABySS
– See Flicek & Birney, Nat Methods, 2009
References
• Sense from sequence reads: methods for alignment and
assembly, Paul Flicek & Ewan Birney, Nature Methods 6,
S6 - S12 (2009)
• Genome assembly reborn: recent computational
challenges, Mihai Pop, Briefings in Bioinformatics, 10(4):
354-366 (2009)