Transcript Slide 1

DNA Sequencing
Some Terminology
insert: a fragment that was incorporated in a circular genome, and can be copied (cloned)
vector: the circular genome (host) that incorporated the fragment
BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb
read: a 500-900 long word that comes out of a sequencing machine
coverage: the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
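A minimal sketch of the coverage definition, assuming the usual back-of-envelope estimate (number of reads × read length / target length). The function names and the example numbers are hypothetical and not from the slides.

```python
import math


def coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of reads covering a position: N * L / G."""
    return num_reads * read_length / target_length


def reads_needed(target_coverage: float, read_length: int, target_length: int) -> int:
    """Number of reads required to reach a given average coverage."""
    return math.ceil(target_coverage * target_length / read_length)


# Hypothetical example: a 100 kb BAC sequenced with 500-letter reads.
bac_coverage = coverage(num_reads=2000, read_length=500, target_length=100_000)
```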
The Walking Method
1. Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
2. Sequence some “seed” clones
3. “Walk” from seeds using clone-ends to pick library clones that extend left & right
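Step 3 can be sketched as a greedy extension. This is an illustration only: it assumes each clone is a hypothetical (start, end) interval whose left clone-end has already been located in the sequenced region, and it walks to the right only (walking left is symmetric). The name `walk_right` is made up for this sketch.

```python
def walk_right(seed, library, target_end):
    """Greedy walk to the right from a sequenced seed clone.

    seed and library entries are (start, end) intervals in hypothetical
    genome coordinates. Returns the chosen clones and the number of steps.
    """
    covered_start, covered_end = seed
    path = [seed]
    steps = 0
    while covered_end < target_end:
        # Clones whose left clone-end lies in the sequenced region
        # and that extend past the current right edge.
        candidates = [
            clone for clone in library
            if covered_start <= clone[0] <= covered_end and clone[1] > covered_end
        ]
        if not candidates:
            break  # no clone extends the walk; a gap remains
        best = max(candidates, key=lambda clone: clone[1])
        path.append(best)
        covered_end = best[1]
        steps += 1
    return path, steps


library = [(50, 150), (90, 190), (180, 280), (120, 220), (260, 360)]
path, steps = walk_right(seed=(0, 100), library=library, target_end=360)
```

Picking the clone that reaches furthest keeps the number of sequential steps small, at the cost of some overlap that is sequenced twice.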
Walking: An Example
Walking off a Single Seed
Cycle time to process one clone: 1-2 months
1. Grow clone
2. Prepare & Shear DNA
3. Prepare shotgun library & perform shotgun
4. Assemble in a computer
5. Close remaining gaps
A mammalian genome would need 15,000 walking steps!
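The 15,000 figure is consistent with a simple division, assuming a ~3 Gbp genome and ~200 kb of new sequence per step (both numbers appear elsewhere in these slides, but pairing them this way is an assumption). The sketch also shows why a purely sequential walk is impractical at 1-2 months per step.

```python
GENOME_BP = 3_000_000_000   # assumed mammalian genome size (~3 Gbp)
BAC_BP = 200_000            # assumed sequence gained per walking step (~200 kb)

walking_steps = GENOME_BP // BAC_BP

# Cycle time per clone is 1-2 months, so a strictly sequential walk would take:
years_low = walking_steps * 1 / 12
years_high = walking_steps * 2 / 12
```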
Walking off several seeds in parallel
Efficient
• Few sequential steps
Inefficient
• Additional redundant sequencing
In general, can sequence a genome in ~5 walking steps,
with <20% redundant sequencing
Using Two Libraries
Most inefficiency comes from closing a small gap with a much
larger clone
Solution: Use a second library of small clones
Whole Genome Shotgun Sequencing
genome: cut many times at random
inserts: plasmids (2 – 10 Kbp), cosmids (40 Kbp)
forward-reverse paired reads: ~500 bp from each end of an insert, a known distance apart
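A toy simulation of paired-end shotgun reads, as a sketch only. It assumes a single fixed insert length, error-free reads, and uniformly random cut positions; the function names and parameters are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_reads(genome, insert_len, read_len, n, seed=0):
    """Sample n inserts at random and read both ends of each one.

    Returns (insert_start, forward_read, reverse_read) tuples. The reverse
    read comes from the opposite strand at the far end of the insert.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        start = rng.randrange(len(genome) - insert_len + 1)
        insert = genome[start:start + insert_len]
        forward = insert[:read_len]
        reverse = reverse_complement(insert[-read_len:])
        pairs.append((start, forward, reverse))
    return pairs


toy_genome = "ACGTTGCA" * 25
toy_pairs = paired_reads(toy_genome, insert_len=40, read_len=10, n=5)
```

Because the insert length is known, the two reads of a pair constrain each other's position even when the sequence between them is never read.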
Advantages & Disadvantages of
different sequencing strategies
Physical Mapping
• ADV. Easy assembly
• DIS. Build physical map
Whole Genome Shotgun (WGS)
• ADV. No mapping
• DIS. Difficult to assemble and resolve repeats
Walking: combines some advantages of both
Other possible method:
• Shotgun sequencing of 10x BACs without any mapping
– ADV. Can re-sequence hard regions
– DIS. Too many shotgun libraries
Fragment Assembly
(in whole-genome shotgun sequencing)
Fragment Assembly
Given N reads…
Where N ~ 6 million…
We need to use a linear-time algorithm
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
..ACGATTACAATAGGTT..
Some Terminology
read: a 500-900 long word that comes out of sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig
1. Find Overlapping Reads
Reads, each broken into its k-mers (here k = 9):
aaactgcagtacggatct: aaactgcag, aactgcagt, …, tacggatct
gggcccaaactgcagtac: gggcccaaa, ggcccaaac, …, ctgcagtac
gtacggatctactacaca: gtacggatc, tacggatct, …, tactacaca
Each k-mer is recorded as (word, read, orient., pos.):
read 1: aaactgcag, aactgcagt, actgcagta, …, gtacggatc, tacggatct
read 2: gggcccaaa, ggcccaaac, gcccaaact, …, actgcagta, ctgcagtac
read 3: gtacggatc, tacggatct, acggatcta, …, ctactacac, tactacaca
Sorted list of k-mers (identical neighbors point to overlapping reads):
aaactgcag, aactgcagt, acggatcta, actgcagta, actgcagta, cccaaactg, cggatctac, ctactacac, ctgcagtac, ctgcagtac, gcccaaact, ggcccaaac, gggcccaaa, gtacggatc, gtacggatc, tacggatct, tacggatct, tactacaca
1. Find Overlapping Reads
• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment – throw away if not >97% similar
TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
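The three bullets above can be sketched as follows. This is a simplified illustration, not any particular assembler's method: it assumes forward-strand reads only (no reverse complements) and replaces the full alignment with an ungapped comparison of the overlapping region. All function names are hypothetical.

```python
from itertools import combinations, groupby


def shared_kmer_pairs(reads, k):
    """Sort all (word, read, pos) tuples and pair up reads sharing a word.

    Returns a set of (a, b, shift): read b starts at offset `shift`
    in the coordinates of read a (negative if b starts before a).
    """
    index = sorted(
        (read[i:i + k], rid, i)
        for rid, read in enumerate(reads)
        for i in range(len(read) - k + 1)
    )
    pairs = set()
    for _, group in groupby(index, key=lambda entry: entry[0]):
        for (_, a, pos_a), (_, b, pos_b) in combinations(list(group), 2):
            if a != b:
                pairs.add((a, b, pos_a - pos_b))
    return pairs


def overlap_identity(a, b, shift):
    """Fraction of matching letters in the ungapped overlap of a and b."""
    start = max(0, shift)
    end = min(len(a), shift + len(b))
    if end <= start:
        return 0.0
    matches = sum(a[i] == b[i - shift] for i in range(start, end))
    return matches / (end - start)


def find_overlaps(reads, k, min_identity=0.97):
    """Keep candidate pairs whose overlap is more than min_identity similar."""
    return [
        (a, b, shift)
        for a, b, shift in sorted(shared_kmer_pairs(reads, k))
        if overlap_identity(reads[a], reads[b], shift) > min_identity
    ]


reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
overlaps = find_overlaps(reads, k=9)
```

Sorting brings identical k-mers next to each other, so candidate pairs are found without comparing every read against every other read.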
1. Find Overlapping Reads
One caveat: repeats
A k-mer that appears N times initiates N² comparisons
ALU: 1,000,000 times
Solution:
Discard all k-mers that appear more than c × Coverage times (c ~ 10)
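A sketch of the repeat filter, assuming k-mers are simply counted across all reads and dropped when their count exceeds c times the coverage. The exact number of unordered pairs among N occurrences is N(N-1)/2, which grows like N². Function names are hypothetical.

```python
from collections import Counter


def pairwise_comparisons(occurrences):
    """Unordered read pairs triggered by a k-mer seen this many times."""
    return occurrences * (occurrences - 1) // 2


def non_repeat_kmers(reads, k, coverage, c=10):
    """Return the k-mers that appear at most c * coverage times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    limit = c * coverage
    return {word for word, n in counts.items() if n <= limit}


alu_pairs = pairwise_comparisons(1_000_000)
kept = non_repeat_kmers(["aaaaaa", "aaacgt"], k=3, coverage=1, c=2)
```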
1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
1. Find Overlapping Reads (cont’d)
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Two columns of the alignment, as base:quality for each of the five reads, before → after correction:
C:20 C:35 T:30 C:35 C:40 → C:20 C:35 C:0 C:35 C:40
A:15 A:25 (gap) A:40 A:25 → A:15 A:25 A:0 A:40 A:25
• Score alignments
• Accept alignments with good scores
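The quality-score columns shown for the correction step suggest a rule like the one sketched here: find the base with the largest total quality in a column, then overwrite every disagreeing base (or gap) with it at quality 0. This is an assumed reading of the slide; a real assembler would also require strong enough evidence before overwriting. The name `correct_column` and the "-" gap symbol are hypothetical.

```python
from collections import defaultdict


def correct_column(column):
    """Correct one alignment column of (base, quality) pairs.

    Gaps are written as ("-", 0). Bases that disagree with the
    quality-weighted majority are replaced by it with quality 0.
    """
    totals = defaultdict(int)
    for base, quality in column:
        if base != "-":
            totals[base] += quality
    if not totals:
        return list(column)  # column holds only gaps; nothing to correct
    best = max(totals, key=totals.get)
    return [
        (base, quality) if base == best else (best, 0)
        for base, quality in column
    ]


mismatch = correct_column([("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)])
gap = correct_column([("A", 15), ("A", 25), ("-", 0), ("A", 40), ("A", 25)])
```

Setting the corrected quality to 0 keeps the fix from counting as independent evidence in later steps.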
2. Merge Reads into Contigs
repeat region
Unique Contig
Overcollapsed Contig
Merge reads up to potential repeat boundaries
2. Merge Reads into Contigs
• Overlap graph:
– Nodes: reads r1…..rn
– Edges: overlaps (ri, rj, shift, orientation, score)
Remove transitively inferrable overlaps
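Removing transitively inferrable overlaps can be sketched on a stripped-down graph. The assumption here is that an edge (i, k) is inferrable whenever some read j has edges (i, j) and (j, k); the shift, orientation, and score fields are ignored, whereas a real assembler would check that the shifts are consistent. The function name is hypothetical.

```python
def remove_transitive(edges):
    """Drop every edge (i, k) that is implied by edges (i, j) and (j, k).

    edges: set of (i, j) pairs meaning read i overlaps read j to its right.
    """
    successors = {}
    for i, j in edges:
        successors.setdefault(i, set()).add(j)
    reduced = set()
    for i, k in edges:
        inferrable = any(
            k in successors.get(j, ()) for j in successors[i] if j != k
        )
        if not inferrable:
            reduced.add((i, k))
    return reduced


overlap_edges = {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4)}
reduced_edges = remove_transitive(overlap_edges)
```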
Overlap graph after forming contigs
Repeats, errors, and contig lengths
• Repeats shorter than read length are OK
• Repeats with more base pair diffs than sequencing error rate are OK
• To make the genome appear less repetitive, try to:
– Increase read length
– Decrease sequencing error rate
Role of error correction:
Discards ~90% of single-letter sequencing errors
decreases error rate
→ decreases effective repeat content
→ increases contig length
2. Merge Reads into Contigs
repeat region
• Ignore non-maximal reads
• Merge only maximal reads into contigs
2. Merge Reads into Contigs
[Figure: reads a and b; repeat boundary??? or sequencing error]
• Ignore “hanging” reads, when detecting repeat boundaries
2. Merge Reads into Contigs
?????
Unambiguous
• Insert non-maximal reads whenever unambiguous
3. Link Contigs into Supercontigs
Normal density
Too dense → Overcollapsed
Inconsistent links → Overcollapsed?
3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
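A sketch of the linking rule: count mate pairs whose two reads fall in different contigs, and join two contigs only when two or more such links support it. It assumes the inputs are already unique contigs and ignores link orientation and distance, which a real assembler would check. The names and the union-find bookkeeping are illustrative choices.

```python
from collections import Counter


def link_contigs(mate_pairs, read_to_contig, min_links=2):
    """Group contigs into supercontigs using mate-pair links.

    mate_pairs: iterable of (read, read) pairs.
    read_to_contig: dict mapping each placed read to its contig id.
    Returns a sorted list of sorted contig-id lists.
    """
    links = Counter()
    for r1, r2 in mate_pairs:
        c1, c2 = read_to_contig.get(r1), read_to_contig.get(r2)
        if c1 is not None and c2 is not None and c1 != c2:
            links[tuple(sorted((c1, c2)))] += 1

    parent = {contig: contig for contig in set(read_to_contig.values())}

    def find(contig):
        # Follow parent pointers to the group representative.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for (c1, c2), count in links.items():
        if count >= min_links:
            parent[find(c1)] = find(c2)

    groups = {}
    for contig in parent:
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(group) for group in groups.values())


placement = {"r1": "A", "r2": "B", "r3": "A", "r4": "B", "r5": "B", "r6": "C"}
mates = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
supercontigs = link_contigs(mates, placement)
```

Requiring two links guards against a single chimeric or misplaced mate pair joining unrelated contigs.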
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
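Both rules can be sketched on alignment columns of (base, quality) pairs. This assumes gaps are written as "-" and carry a quality of their own, which the slides do not specify; a column whose winner is a gap contributes no letter. Function names are hypothetical.

```python
from collections import defaultdict

GAP = "-"


def weighted_vote(column):
    """Base with the largest total quality in the column."""
    totals = defaultdict(float)
    for base, quality in column:
        totals[base] += quality
    return max(totals, key=totals.get)


def max_quality(column):
    """Alternative rule: the single highest-quality letter in the column."""
    return max(column, key=lambda entry: entry[1])[0]


def consensus(columns, pick=weighted_vote):
    """Apply the chosen rule to every column and drop winning gaps."""
    return "".join(base for base in map(pick, columns) if base != GAP)


columns = [
    [("A", 30), ("A", 20), ("A", 40)],
    [("C", 10), ("G", 15), ("C", 10)],
    [("-", 20), ("T", 5), ("-", 20)],
]
by_vote = consensus(columns)
by_max = consensus(columns, pick=max_quality)
```

The second column shows where the two rules differ: two low-quality C calls outvote one G under weighted voting, while the maximum-quality rule takes the G.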
Some Assemblers
• PHRAP
• Early assembler, widely used, good model of read errors
• Overlap O(n²)->layout (no mate pairs)->consensus
• Celera
• First assembler to handle large genomes (fly, human, mouse)
• overlap->layout->consensus
• Arachne
• Public assembler (mouse, several fungi)
• overlap->layout->consensus
• Phusion
• overlap->clustering->PHRAP->assemblage->consensus
• Euler
• indexing->Euler graph->layout by picking paths->consensus
Quality of assemblies
Celera’s assemblies of human and mouse
Quality of assemblies—mouse
Quality of assemblies—rat
History of WGA
• 1982: λ-virus, 48,502 bp
• 1995: h-influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present
– human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
1997
“Let’s sequence the human genome with the shotgun strategy” (Gene Myers)
“That is impossible, and a bad idea anyway” (Phil Green)
Next few lectures
More on alignments
Large-scale global alignment
– Comparing entire genomes
Suffix trees, sparse dynamic programming
MumMer, Avid, LAGAN, Shuffle-LAGAN
Multiple alignment
– Comparing proteins, many genomes
Scoring, Multidimensional-DP, Center-Star, Progressive alignment
CLUSTALW, TCOFFEE, MLAGAN
Gene recognition
Gene recognition on a single genome
GENSCAN – An HMM for gene recognition
Cross-species comparison-based gene recognition
TWINSCAN – An HMM
SLAM – A pair-HMM