Transcript Slide 1
Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host) that incorporated the fragment BAC read Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb DNA Sequencing a 500-900 long word that comes out of a sequencing machine coverage the average number of reads (or inserts) that cover a position in the target DNA piece shotgun the process of obtaining many reads sequencing from random locations in DNA, to detect overlaps and assemble The Walking Method 1. Build a very redundant library of BACs with sequenced cloneends (cheap to build) 2. Sequence some “seed” clones 3. “Walk” from seeds using clone-ends to pick library clones that extend left & right Walking: An Example Walking off a Single Seed Cycle time to process one clone: 1-2 months 1. 2. 3. 4. 5. Grow clone Prepare & Shear DNA Prepare shotgun library & perform shotgun Assemble in a computer Close remaining gaps A mammalian genome would need 15,000 walking steps ! Walking off several seeds in parallel Efficient Inefficient • Few sequential steps • Additional redundant sequencing In general, can sequence a genome in ~5 walking steps, with <20% redundant sequencing Using Two Libraries Most inefficiency comes from closing a small gap with a much larger clone Solution: Use a second library of small clones Whole Genome Shotgun Sequencing genome cut many times at random plasmids (2 – 10 Kbp) forward-reverse paired reads known dist cosmids (40 Kbp) ~500 bp ~500 bp Advantages & Disadvantages of different sequencing strategies Physical Mapping ADV. Easy assembly DIS. Build physical map Whole Genome Shotgun (WGS) ADV. No mapping DIS. Difficult to assemble and resolve repeats Walking: combines some advantages of both Other possible method: • Shotgun sequencing of 10x BACs without any mapping ADV. Can re-sequence hard regions DIS. Too many shotgun libraries Fragment Assembly (in whole-genome shotgun sequencing) Fragment Assembly Given N reads… Where N ~ 6 million… We need to use a linear-time algorithm Steps to Assemble a Genome Some Terminology read 500-900 longreads word that comes 1. Findaoverlapping out of sequencer mate pair a pair of reads from two ends the same insert fragment 2. Mergeofsome “good” pairs of reads into longer contigs contig a contiguous sequence formed by several overlapping reads with no gaps 3. Link contigs to form supercontigs supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence sequence derived from the 4. Derive consensus sequene multiple alignment of reads in a contig ..ACGATTACAATAGGTT.. 1. Find Overlapping Reads aaactgcagtacggatct aaactgcag aactgcagt … tacggatct gggcccaaactgcagtac gggcccaaa ggcccaaac … ctgcagtac gtacggatctactacaca tgacggatc gacggatct … tactacaca (word, read, orient., pos.) (word, read, orient., pos.) aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca 1. Find Overlapping Reads • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer • Extend to full alignment – throw away if not >97% similar TACA TAGATTACACAGATTAC T GA || ||||||||||||||||| | || TAGT TAGATTACACAGATTAC TAGA 1. Find Overlapping Reads One caveat: repeats A k-mer that appears N times, initiates N2 comparisons ALU: 1,000,000 times Solution: Discard all k-mers that appear more than c Coverage, (c ~ 10) 1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA 1. Find Overlapping Reads (cont’d) • Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA C: C: T: C: C: 20 35 30 35 40 C: C: C: C: C: 20 35 0 35 40 A: A: A: A: 15 25 40 25 A: A: A: A: A: 15 25 0 40 25 • Score alignments • Accept alignments with good scores 2. Merge Reads into Contigs repeat region Unique Contig Overcollapsed Contig Merge reads up to potential repeat boundaries 2. Merge Reads into Contigs • Overlap graph: Nodes: reads r1…..rn Edges: overlaps (ri, rj, shift, orientation, score) Remove transitively inferrable overlaps Overlap graph after forming contigs Repeats, errors, and contig lengths • Repeats shorter than read length are OK • Repeats with more base pair diffs than sequencing error rate are OK • To make the genome appear less repetitive, try to: Increase read length Decrease sequencing error rate Role of error correction: Discards ~90% of single-letter sequencing errors decreases error rate decreases effective repeat content increases contig length 2. Merge Reads into Contigs repeat region • Ignore non-maximal reads • Merge only maximal reads into contigs 2. Merge Reads into Contigs repeat boundary??? sequencing error b a • Ignore “hanging” reads, when detecting repeat boundaries 2. Merge Reads into Contigs ????? Unambiguous • Insert non-maximal reads whenever unambiguous 3. Link Contigs into Supercontigs Normal density Too dense Overcollapsed Inconsistent links Overcollapsed? 3. Link Contigs into Supercontigs Find all links between unique contigs Connect contigs incrementally, if 2 links 3. Link Contigs into Supercontigs Fill gaps in supercontigs with paths of repeat contigs 4. Derive Consensus Sequence TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive multiple alignment from pairwise read alignments Derive each consensus base by weighted voting (Alternative: take maximum-quality letter) Some Assemblers • PHRAP • Early assembler, widely used, good model of read errors • Overlap O(n2)->layout (no mate pairs)->consensus • Celera • First assembler to handle large genomes (fly, human, mouse) • overlap->layout->consensus • Arachne • Public assembler (mouse, several fungi) • overlap->layout->consensus • Phusion • overlap->clustering->PHRAP->assemblage->consensus • Euler • indexing->Euler graph->layout by picking paths->consensus Quality of assemblies Celera’s assemblies of human and mouse Quality of assemblies—mouse Quality of assemblies—mouse Quality of assemblies—rat History of WGA 1997 • 1982: -virus, 48,502 bp • 1995: h-influenzae, 1 Let’s sequence the human Mbp genome with the shotgun strategy • 2000: fly, 100 Mbp • 2001 – present human (3Gbp), mouse (2.5Gbp), Thatrat is*, chicken, dog, chimpanzee, several fungal genomes impossible, and a bad idea anyway Gene Myers Phil Green Next few lectures More on alignments Large-scale global alignment – Comparing entire genomes Suffix trees, sparse dynamic programming MumMer, Avid, LAGAN, Shuffle-LAGAN Multiple alignment – Comparing proteins, many genomes Scoring, Multidimensional-DP, Center-Star, Progressive alignment CLUSTALW, TCOFFEE, MLAGAN Gene recognition Gene recognition on a single genome GENSCAN – A HMM for gene recognition Cross-species comparison-based gene recognition TWINSCAN – A HMM SLAM – A pair-HMM