de novo - MCB3895-004 Fall 2014: Computational Methods in


MCB3895-004 Lecture #9
Sept 23/14
Illumina library preparation, de novo genome
assembly
Illumina sequencing
• https://www.youtube.com/watch?v=womKfikWlxM
http://openwetware.org/images/7/76/BMC_IlluminaFlowcell.png
Illumina sequencing - summary
1. Template consists of DNA fragments
amplified by bridge clustering
2. "Sequencing by synthesis" used to generate
DNA sequences
3. DNA sequence read as unique fluorescent
signatures following base incorporation
Illumina sequencing - summary
4. Adapters at each end of the template
molecule bind complementary oligos on the
flowcell and facilitate bridge amplification
5. "Dual indexing" allows multiple samples to be
sequenced on the same flowcell, each
having a unique set of indices
6. Paired-end sequencing extends the regular
sequencing protocol to read each template
molecule in both directions
Paired-end sequencing
• Objective: allows repetitive regions to be
sequenced more precisely
http://technology.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html
Paired-end sequencing
• Be careful to distinguish terms!
• Do not confuse adapters with the read or
template fragment
http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
Paired-end sequencing
• "Insert" is even more confusing
• Refers to entire fragment, including both the
reads and the unsequenced "inner mate"
region between them
• Term stems from long-dead plasmid
sequencing approaches
http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
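The relationship between these terms can be written as simple arithmetic. The sketch below assumes the definition given on this slide (insert = read 1 + inner mate region + read 2); the function name and the example lengths are made up for illustration.

```python
def inner_mate_distance(insert_len, read1_len, read2_len):
    """Unsequenced gap between the two reads of a pair.

    Assumes "insert" means the whole template fragment (both reads plus
    the inner mate region), as defined on the slide. A negative result
    means the two reads overlap each other.
    """
    return insert_len - read1_len - read2_len


# Hypothetical library: 500 bp inserts sequenced with 2 x 150 bp reads.
print(inner_mate_distance(500, 150, 150))  # 200

# Hypothetical short insert: the reads overlap by 50 bp.
print(inner_mate_distance(250, 150, 150))  # -50
```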
Paired-end sequencing
• It is possible to have paired end reads that
overlap each other
• Can assemble to create long, highly accurate
contiguous reads
http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
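A minimal sketch of how overlapping pairs could be merged, assuming error-free reads and exact matching only. The fragment sequence, function names, and the `min_overlap` default are illustrative assumptions; real merging tools also use quality scores and tolerate mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=5):
    """Merge read1 with the reverse complement of read2.

    Returns the merged sequence, or None if no exact overlap of at
    least min_overlap bases is found. Longest overlaps are tried first.
    """
    mate = reverse_complement(read2)
    for size in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-size:] == mate[:size]:
            return read1 + mate[size:]
    return None


# Made-up 22 bp fragment sequenced with 2 x 14 bp reads (6 bp overlap).
fragment = "ACGTTGCATTAGCCGATAGGCT"
read1 = fragment[:14]
read2 = reverse_complement(fragment[-14:])  # read 2 comes off the other strand

print(merge_pair(read1, read2) == fragment)  # True
```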
Paired-end sequencing
• If the template fragment is too short, it is
possible to read past the end of the fragment
• Results in adapter region being included in
read
• Needs to be removed computationally.
http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
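Computational removal can be sketched as follows. This is an illustration only: it matches the adapter exactly, whereas dedicated trimming tools tolerate sequencing errors. The `ADAPTER` value is an assumption (a commonly cited Illumina adapter prefix); check the sequence for the kit actually used.

```python
# Assumed adapter prefix; replace with the sequence for your library kit.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ADAPTER, min_match=3):
    """Remove adapter read-through from the 3' end of a read.

    Exact matching only. First looks for the full adapter anywhere in
    the read, then for a partial adapter (at least min_match bases)
    hanging off the 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    for size in range(min(len(adapter), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adapter[:size]):
            return read[:-size]
    return read


# Made-up reads: full adapter, partial adapter, and no adapter.
print(trim_adapter("ACGTACGTAC" + ADAPTER + "TTT"))  # ACGTACGTAC
print(trim_adapter("ACGTACGTACAGATC"))               # ACGTACGTAC
print(trim_adapter("ACGTACGT"))                      # ACGTACGT
```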
Library preparation
• How exactly are template fragments
generated?
• Lots of methods, I only present two: TruSeq
and Nextera
• Most common Illumina methods (specific kits
available from Illumina)
• Think about: where might biases arise?
TruSeq library preparation
• Step #1: Fragment DNA
• Typically via shearing
• Produces uniformly sized fragments
http://res.illumina.com/documents/products%5Cdatasheets%5Cdatasheet_truseq_dna_pcr_free_sample_prep.pdf
TruSeq library preparation
• Step #2: Create blunt ends using a polymerase
to remove 3' overhangs and fill in 5' overhangs
• Use bead purification to remove smallest
fragments, blunt ending reagents
http://res.illumina.com/documents/products%5Cdatasheets%5Cdatasheet_truseq_dna_pcr_free_sample_prep.pdf
TruSeq library preparation
• Step #3: Adenylate 3' ends to prevent self-ligation while adding adapters
http://res.illumina.com/documents/products%5Cdatasheets%5Cdatasheet_truseq_dna_pcr_free_sample_prep.pdf
TruSeq library preparation
• Step #4: Ligate adapters containing sequencing
primer, indices, flowcell capture site
http://res.illumina.com/documents/products%5Cdatasheets%5Cdatasheet_truseq_dna_pcr_free_sample_prep.pdf
Nextera library preparation
• Nextera uses engineered transposases to
fragment genomic DNA and add sequencing
adaptors at the same time
• Low DNA input requirement
• "Transposome" = transposase + adapter DNA for
attachment
http://support.illumina.com/content/dam/illumina-support/documents/myillumina/2a3297c5-8a34-4fc5-a148-3e16666fd65e/nextera_dna_sample_prep_guide_15027987_b.pdf
Nextera library preparation
• Step #1: Use "tagmentation" to simultaneously
fragment template DNA and add sequencing
adapters
• 300bp insert size reflects minimum needed by
transposases to cut and add adapters
http://support.illumina.com/content/dam/illumina-support/documents/myillumina/2a3297c5-8a34-4fc5-a148-3e16666fd65e/nextera_dna_sample_prep_guide_15027987_b.pdf
Nextera library preparation
• Step #2: Purify fragments from transposome
(part of Nextera kit)
• Result: fragment contains both 5' and 3'
sequencing adapters
http://support.illumina.com/content/dam/illumina-support/documents/myillumina/2a3297c5-8a34-4fc5-a148-3e16666fd65e/nextera_dna_sample_prep_guide_15027987_b.pdf
Nextera library preparation
• Step #3: Use PCR to add indices and flowcell
capture sites to the fragment
• Non-template fragments excluded during bead
clean-up following this step
http://support.illumina.com/content/dam/illumina-support/documents/myillumina/2a3297c5-8a34-4fc5-a148-3e16666fd65e/nextera_dna_sample_prep_guide_15027987_b.pdf
Nextera library preparation
• Final result:
• Template fragment
• Sequencing adapters
• Dual indices
• Flowcell capture sites
• (same structure as TruSeq)
http://support.illumina.com/content/dam/illumina-support/documents/myillumina/2a3297c5-8a34-4fc5-a148-3e16666fd65e/nextera_dna_sample_prep_guide_15027987_b.pdf
Library prep is not error-free
http://res.illumina.com/documents/products/technotes/technote_truseq_comparison.pdf
Library prep is
not error-free
• Regions with lower
coverage are GC-rich
• No method is perfect
• Also note: Nextera
uses low cycle PCR,
has potential for bias
http://res.illumina.com/documents/products/technotes/technote_truseq_comparison.pdf
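To look for this kind of bias in your own data, one simple starting point is to compute GC content in windows along a reference or contig and compare it against coverage. The sketch below only does the GC part; the function names, window size, and toy sequence are assumptions for illustration.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=10, step=10):
    """GC fraction for each window, as (start position, fraction) tuples."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: an AT-only window followed by a GC-only window.
print(windowed_gc("AAAAAAAAAAGGGGGCCCCC"))  # [(0, 0.0), (10, 1.0)]
```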
Mate pairs
• Paired-end sequencing physically binds each
fragment to the flowcell and sequences it from
each end
• Size limitations: large fragments are too floppy
to sequence well
• Mate pairs: same philosophy of sequencing
both ends of inserts of known size, but
allowing much larger insert sizes
Nextera mate pair library
preparation
• Step #1: Use Nextera tagmentation to fragment
template and add adapters
• Adapters are biotinylated for later steps
http://res.illumina.com/documents/products/datasheets/datasheet_nextera_mate_pair.pdf
Nextera mate pair library
preparation
• Step #2: Fragment is circularized using a
"biotin junction adapter"
http://res.illumina.com/documents/products/datasheets/datasheet_nextera_mate_pair.pdf
Nextera mate pair library
preparation
• Step #3: Circular molecules fragmented, biotin
tags used to enrich fragments having junction
• Recall: junction contains original fragment ends
http://res.illumina.com/documents/products/datasheets/datasheet_nextera_mate_pair.pdf
Nextera mate pair library
preparation
• Step #4: Use TruSeq protocol to end repair, A-tail,
and ligate flowcell capture sequences and
barcodes
• Final product has all the normal parts of an
Illumina template library but also junction
region mid-fragment
http://res.illumina.com/documents/products/datasheets/datasheet_nextera_mate_pair.pdf
Questions?
Digging deeper into the guts
de novo genome assembly
• Important to know to be able to tune assembly
software appropriately!
• Two paradigms:
1. Overlap/layout/consensus
2. De Bruijn graphs
• Both find overlaps between sequences, create
a network representation, and find the best
path through that network to represent the final
assembly
Overlap/layout/consensus
genome assembly
• Step #1: Compare all reads to each other to
find those that overlap
• Let's do it together! Reads (5'->3'):
TGGCA
CAATT
ATTTGAC
GCATTGCAA
TGCAAT
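The all-vs-all comparison for these five reads can be sketched in a few lines. This assumes error-free reads, same-strand overlaps only, and a minimum overlap of 3 bases; the function name and the `min_len` threshold are illustrative choices, not part of the lecture.

```python
from itertools import permutations

READS = ["TGGCA", "CAATT", "ATTTGAC", "GCATTGCAA", "TGCAAT"]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest exact overlap is shorter than min_len.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


# Compare every ordered pair of reads (all-vs-all).
overlaps = {}
for a, b in permutations(READS, 2):
    size = overlap(a, b)
    if size:
        overlaps[(a, b)] = size

for (a, b), size in overlaps.items():
    print(f"{a} -> {b}: {size} bases")
```

With these settings the comparison finds five overlaps, the longest being the 5-base overlap between GCATTGCAA and TGCAAT.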
Overlap/layout/consensus
genome assembly
• Step #2: Create overlap graph arranging reads
according to their overlaps
• Step #3: Find unique path through the graph
• Step #4: Assemble overlapping reads by
aligning the reads and deriving consensus
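Steps #2 to #4 can be sketched for the five example reads as below. This is a toy greedy walk, not how any particular assembler works: it assumes error-free reads, exactly one read with no incoming overlap (used as the start), and a 3-base minimum overlap. With perfect reads, merging along the path stands in for the consensus step.

```python
READS = ["TGGCA", "CAATT", "ATTTGAC", "GCATTGCAA", "TGCAAT"]


def overlap(a, b, min_len=3):
    """Longest suffix of a matching a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def build_overlap_graph(reads, min_len=3):
    """Step #2: reads are nodes, overlaps are weighted directed edges."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                size = overlap(a, b, min_len)
                if size:
                    graph[a][b] = size
    return graph


def greedy_layout(graph):
    """Steps #3 and #4: walk the graph, always taking the largest overlap."""
    targets = {b for edges in graph.values() for b in edges}
    current = [read for read in graph if read not in targets][0]
    path, contig = [current], current
    while True:
        options = {b: s for b, s in graph[current].items() if b not in path}
        if not options:
            break
        nxt = max(options, key=options.get)
        contig += nxt[options[nxt]:]  # append only the non-overlapping tail
        path.append(nxt)
        current = nxt
    return path, contig


path, contig = greedy_layout(build_overlap_graph(READS))
print(" -> ".join(path))
print(contig)  # TGGCATTGCAATTTGAC
```

Note that the GCATTGCAA to CAATT overlap is never used: it is implied by the two longer overlaps that pass through TGCAAT.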
Overlap/layout/consensus
genome assembly
• Requires all-vs-all comparison of reads
• becomes computationally intensive as the number
of reads increases
• Developed and applied for Sanger and 454
sequencing
• Not dead yet! Has reemerged for PacBio and other
long-read techniques
But consider errors
• Our network was for perfectly accurate reads
• What happens when you have both the correct
TGGCA read and a TGCCA read containing a
substitution sequencing error?
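One way to explore this question is to compare the overlaps found for the correct and the erroneous read. The sketch assumes exact matching with a 3-base minimum overlap; the helper names are illustrative.

```python
OTHER_READS = ["CAATT", "ATTTGAC", "GCATTGCAA", "TGCAAT"]


def overlap(a, b, min_len=3):
    """Longest suffix of a matching a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def outgoing(read, others, min_len=3):
    """All reads that the given read overlaps into, with overlap lengths."""
    found = {}
    for other in others:
        size = overlap(read, other, min_len)
        if size:
            found[other] = size
    return found


print(outgoing("TGGCA", OTHER_READS))  # {'GCATTGCAA': 3}
print(outgoing("TGCCA", OTHER_READS))  # {}
```

Under these assumptions a single substitution is enough to disconnect the read from the rest of the graph, which is why overlap-based assemblers have to allow inexact overlaps.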
De Bruijn graph assembly
• Instead of comparing all reads with each other,
split reads up into kmers
• i.e., subsets of each read of a given length
• Much more computationally efficient than all-vs-all comparison in overlap/layout/consensus
De Bruijn graph assembly
• Step #1: Tally kmers
• Let's find all kmers where k=4 for our set of
reads from before
TGGCA
CAATT
ATTTGAC
GCATTGCAA
TGCAAT
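The tally for k=4 can be checked with a short script. This is a minimal sketch using Python's standard `collections.Counter`; the function name `kmers` is an illustrative choice.

```python
from collections import Counter

READS = ["TGGCA", "CAATT", "ATTTGAC", "GCATTGCAA", "TGCAAT"]
K = 4


def kmers(read, k):
    """All substrings of length k, in the order they occur in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Tally every kmer across all reads.
counts = Counter(kmer for read in READS for kmer in kmers(read, K))

for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

For these reads there are 14 distinct 4-mers; TGCA, GCAA, and CAAT each occur twice and the rest once.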
De Bruijn graph assembly
• Step #2: Create graph of kmer overlap, where
kmers are nodes and overlaps between them
are edges
• More complex than overlap graph
• Step #3: Find unique path through the graph
• Can leverage kmers adjacent to each other in reads
to reduce complexity
• Step #4: Synthesize path into a consensus
sequence
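A toy version of steps #2 to #4 for the example reads, following the slide's convention that kmers are nodes and (k-1)-base overlaps are edges. Everything here is a sketch under stated assumptions: error-free reads, brute-force path search (feasible only for toy graphs), and a simple "every read must appear in the result" check standing in for the use of read adjacency. Many assemblers instead treat kmers as edges between (k-1)-mers, which is not shown.

```python
READS = ["TGGCA", "CAATT", "ATTTGAC", "GCATTGCAA", "TGCAAT"]
K = 4


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Step #2: nodes are kmers; edge a -> b when a's last k-1 bases start b.
nodes = sorted({kmer for read in READS for kmer in kmers(read, K)})
graph = {a: [b for b in nodes if a != b and a[1:] == b[:-1]] for a in nodes}
branching = {a: succ for a, succ in graph.items() if len(succ) > 1}


def spell(path):
    """Step #4: first kmer plus the last base of each following kmer."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


def all_full_paths(graph):
    """Step #3 by brute force: every path visiting each kmer exactly once."""
    found = []

    def extend(path, seen):
        if len(path) == len(graph):
            found.append(list(path))
            return
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                path.append(nxt)
                seen.add(nxt)
                extend(path, seen)
                path.pop()
                seen.remove(nxt)

    for start in graph:
        extend([start], {start})
    return found


candidates = [spell(path) for path in all_full_paths(graph)]
# Keep only sequences that agree with the kmer order inside every read.
consistent = [seq for seq in candidates if all(r in seq for r in READS)]

print(sorted(branching))  # kmers with more than one possible successor
print(len(candidates), "full paths;", consistent)
```

The overlap edges alone leave more than one full path (for example TGGCAATTTGCATTGAC), but only TGGCATTGCAATTTGAC keeps the kmers of each read adjacent.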
De Bruijn graph assembly
• Doesn’t need all-vs-all comparison so is much
faster
• Can handle large numbers of reads, e.g., as
generated by Illumina technology
• Graph is much more complicated, RAM
intensive
• More sensitive to errors
De Bruijn graph assembly
• Consider errors: make the graph even more
complicated with bubbles, dead ends
• Consider repeats: parts of the graph with no
unique path through them
• Graph broken on each side, forming contigs
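The effect of a single sequencing error on the kmer graph can be illustrated by adding the erroneous TGCCA read from earlier to the example reads. The sketch uses the same kmers-as-nodes convention with k=4; the function names are illustrative.

```python
READS = ["TGGCA", "CAATT", "ATTTGAC", "GCATTGCAA", "TGCAAT"]
ERROR_READ = "TGCCA"  # TGGCA with one substitution error
K = 4


def build_graph(reads, k=K):
    """Nodes are kmers; edge a -> b when a's last k-1 bases start b."""
    nodes = sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})
    return {a: [b for b in nodes if a != b and a[1:] == b[:-1]] for a in nodes}


def dead_ends(graph):
    """Kmers with no outgoing edge."""
    return sorted(node for node, succ in graph.items() if not succ)


clean = build_graph(READS)
noisy = build_graph(READS + [ERROR_READ])

print(dead_ends(clean))  # ['TGAC']
print(dead_ends(noisy))  # ['GCCA', 'TGAC']
print(noisy["TTGC"])     # ['TGCA', 'TGCC']
```

The error adds two kmers (TGCC and GCCA) that hang off TTGC as a short dead-end branch, and it gives TTGC a second successor where it previously had one.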
Next class
• Quality control of Illumina data
• Adapter trimming
• Error correction
• Next week: de novo genome assembly