High-Throughput Sequencing

Download Report

Transcript High-Throughput Sequencing

High-Throughput Sequencing
Richard Mott
with contributions from Gil Mcvean,
Gerton Lunter, Zam Iqbal, Xiangchao
Gan, Eric Belfield
Sequencing Technologies
•
•
•
•
•
Capillary (eg ABI 3700)
Roche/454 FLX
Illumina GAII
ABI Solid
Others….
Capillary Sequencing
– based on electrophoresis
– used to sequence human and mouse genomes
– read lengths currently around 600bp (but used to
be 200-400 bp)
– relatively slow – 384 sequences per run in x hours
– expensive ???
Capillary Sequencing Trace
http://wheat.pw.usda.gov/genome/sequence/
ACGT represented by continuous traces.
Base-calling requires the identification of well-defined peaks
PHRED Quality Scores
• PHRED is an accurate base-caller used for capillary
traces (Ewing et al Genome Research 1998)
• Each called base is given a quality score Q
• Quality based on simple metrics (such as peak spacing)
callibrated against a database of hand-edited data
• Q = 10 * log10(estimated probability call is wrong)
10 prob = 0.1
20 prob = 0.01
30 prob = 0.001
[Q30 often used as a threshold for useful sequence data]
Capillary sequence assembly and
editing
CONSED screen shots
http://www.jgi.doe.gov/education/how/how_12.html
Illumina Sequencing machines
GA-II
HiSeq
The Illumina Flow-cell
•
•
•
•
•
•
•
Each flow-cell has 8 lanes (16 on HiSeq)
A different sample can be run in each lane
It it possible to multiplex up to 12 samples in
a lane
Each lane comprises 2*60=120
square tiles
Each tile is imaged and analysed separately
Sometimes a control phiX lane is run (in a
control, the genome sequenced is identical
to the reference and its GC content is not too
far from 50%)
Illumina GA-II “traces”
Discontinuous – a set of 4 intensities at each base position
Cross talk:
base-calling errors
Whiteford et al Bioinformatics 2009
Base-calling errors
Typical base-calling error rate ~ 1%,
Error rate increases towards end of read
Usually read2 has more errors than read1
Assessing Sequence Quality
Example summary for a lane of 51bp paired-end data
# reads
% good quality reads
(passed chastity filter)
% mapped to
reference
using ELAND
(the read-mapper
supplied by Illumina)
% optical
duplicates
% of differences from
reference (upper bound
on error rate) in mapped reads
Illumina Throughput (April 2013)
Machine
HiSeq2000
HiSeq2000
Read Length
2 x 50bp
2 x 100 bp
Run Time
1 week
12 days
HiSeq2500
MiSeq
2 x 150bp
2 x 150 bp
>30hrs
24-27hrs
Output/lane
15-18Gb
150-200 Gb
Output/flow cell
120-144Gb Gb
150-200 Gb
cost/lane
>4.5Gb
40-60Gb
£0.85k
£2.5-3k
Note: 1 human genome = 3Gb.
20-30x coverage of one human genome = 2=3 lanes of HiSeq
£1.5-2k
Illumina HiSeq Throughput (Feb 2011)
Read Length
Run Time
Output
1 x 35bp
1.5 days
26-35 Gb
2 x 50bp
4 days
75-100 Gb
1 x 100 bp
8 days
150-200 Gb
Note: 1 human genome = 3Gb.
Illumina GA-II throughput per flow-cell
Note: these are correct as of February 2010.
Output is constantly improving due to changes in chemistry and software.
Consumables costs are indicative only
- they don’t include labour, depreciation, overheads or bioinformatics
- true costs are roughly double
Pooling and Multiplexing
Read
Primer
Primer
Barcode 6bp
Up to 96 distinct barcodes can be added to one end of a read
useful for low-coverage sequencing of many samples in a simple lane
Up to a further 96 barcodes can be added to other end of a read = 96*96 = 9216 samples
Useful for bacterial sequencing
Pooling Costs
• Library Preps
– £80-150 per sample, depending on type of
sequencing
– £<50 per sample for 96-plex genomic DNA
• Pooling costs are dominated by library prep,
not HiSeq lane costs
• eg 96-plex of gDNA on on HiSeq lane = £4k
Data Formats
• Sequencing produces vast amounts of data
• Rate of data growth exceeds Moore’s law
The FastQ format
(standard text representation of short reads)
• A FASTQ text file normally uses four lines per sequence.
– Line 1 begins with a '@' character and is followed by a sequence
identifier and an optional description (like a FASTA title line).
– Line 2 is the raw sequence letters.
– Line 3 begins with a '+' character and is optionally followed by the
same sequence identifier (and any description) again.
– Line 4 encodes the quality values for the sequence in Line 2, and must
contain the same number of symbols as letters in the sequence. The
letters encode Phred Quality Scores from 0 to 93 using ASCII 33 to 126
– Example
•
•
•
•
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Binary FastQ
• Computer-readable compressed form of
FASTQ
• About 1/3 size of FASTQ
• Enables much faster reading and writing
• Standard utility programs will interconvert (eg.
maq)
• Becoming obsolete……
SAM and SAMTOOLS
http://samtools.sourceforge.net/
• SAM (Sequence Alignment/Map) format is a generic format for
storing large nucleotide sequence alignments.
• SAM aims to be a format that:
– Is flexible enough to store all the alignment information generated by
various alignment programs;
– Is simple enough to be easily generated by alignment programs or
converted from existing alignment formats;
– Is compact in file size;
– Allows most of operations on the alignment to work on a stream
without loading the whole alignment into memory;
– Allows the file to be indexed by genomic position to efficiently retrieve
all reads aligning to a locus.
– SAM Tools provide various utilities for manipulating alignments in the
SAM format, including sorting, merging, indexing and generating
alignments in a per-position format.
BAM files
• SAM, BAM are equivalent formats for describing alignments
of reads to a reference genome
• SAM: text
• BAM: compressed binary, indexed, so it is possible to access
reads mapping to a segment without decompressing the
entire file
• BAM is used by lookseq, IGV and other software
• Current Standard Binary Format
• Contains:
– Meta Information (read groups, algorithm details)
– Sequence and Quality Scores
– Alignment information
• one alignment per read
Inside a BAM file
@HD
VN:1.0
GO:none
SO:coordinate
@SQ
SN:chr10 LN:135534747
@SQ
SN:chr11 LN:135006516
...
@SQ
SN:chrX
LN:155270560
@SQ
SN:chrY
LN:59373566
@RG
ID:WTCHG_7618
PL:ILLUMINA
PU:101001_GAII06_00018_FC_5
LB:070/10_MPX
SM:1772/10 CN:WTCHG
@PG
ID:stampy VN:1.0.5_(r710)
CL:--processpart=1/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --comment=@Lane_5_comments.txt --keepreforder --solexa
@PG
ID:stampy.1 VN:1.0.5_(r710)
CL:--processpart=2/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --comment=@Lane_5_comments.txt --keepreforder --solexa
@PG
ID:stampy.2 VN:1.0.5_(r710)
CL:--processpart=3/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --comment=@Lane_5_comments.txt --keepreforder --solexa
@PG
ID:stampy.3 VN:1.0.5_(r710)
CL:--processpart=4/4 --readgroup=ID:WTCHG_7618,SM:1772/10,PL:ILLUMINA,PU:101001_GAII06_00018_FC_5,LB:070/10_MPX,CN:WTCHG --comment=@Lane_5_comments.txt --keepreforder --solexa
@CO
TM:Tue, 26 Oct 2010 09:21:06 BST WD:/data1/GA-DATA/101001_GAII06_00018_FC/Data/Intensities/BaseCalls/Demultiplexed-101009/004/GERALD_09-10-2010_johnb.2
HN:comp04.well.ox.ac.uk
UN:johnb
@CO
IX:GCCAAT SN:085 B-cell
ID:070/10_MPX
GE:Human37
SR:gDNA Indexed
CT:false
PR:P100116 SM:1771/10
@CO
TM:Tue, 26 Oct 2010 09:21:06 BST WD:/data1/GA-DATA/101001_GAII06_00018_FC/Data/Intensities/BaseCalls/Demultiplexed-101009/004/GERALD_09-10-2010_johnb.2
HN:comp04.well.ox.ac.uk
UN:johnb
@CO
IX:GCCAAT SN:085 B-cell
ID:070/10_MPX
GE:Human37
SR:gDNA Indexed
CT:false
PR:P100116 SM:1771/10
@CO
TM:Tue, 26 Oct 2010 09:21:06 BST WD:/data1/GA-DATA/101001_GAII06_00018_FC/Data/Intensities/BaseCalls/Demultiplexed-101009/004/GERALD_09-10-2010_johnb.2
HN:comp04.well.ox.ac.uk
UN:johnb
@CO
IX:GCCAAT SN:085 B-cell
ID:070/10_MPX
GE:Human37
SR:gDNA Indexed
CT:false
PR:P100116 SM:1771/10
@CO
TM:Tue, 26 Oct 2010 09:21:06 BST WD:/data1/GA-DATA/101001_GAII06_00018_FC/Data/Intensities/BaseCalls/Demultiplexed-101009/004/GERALD_09-10-2010_johnb.2
HN:comp04.well.ox.ac.uk
UN:johnb
@CO
IX:GCCAAT SN:085 B-cell
ID:070/10_MPX
GE:Human37
SR:gDNA Indexed
CT:false
PR:P100116 SM:1771/10
WTCHG_7618:5:40:5848:3669#GCCAAT
145
chr10
69795
3
11I40M
chr16
24580964 0
TCAGAAAAAAGAAAATGTGGTATATATACACAATGGAGTACTATTCAGCCC
GFIIIIIIIIHIIIIHIIIIIIIIFIIHII
WTCHG_7618:5:77:5375:15942#GCCAAT
99
chr10
82805
0
51M
=
83055
301
GCAGGGAGAATGGAACCAAGTTGGAAAACACTCTGCAGGATATTATCCAGG
GBBHHEBG<GGGGGGE
WTCHG_7618:5:77:5375:15942#GCCAAT
147
chr10
83055
0
51M
=
82805
-301
AGCTGATCTCTCAGCAGAAACCGTACAAGCCAGAAGAGAGTGGGGGCCAAC
#################DB
WTCHG_7618:5:49:18524:13016#GCCAAT
163
chr10
83516
0
51M
=
83734
269
CCCATCTCACGTGCAGAGACACACATAGACTCAAAATAAAAGGATGGAGGA
EHHIIIHIIIIHIIIIIFIDIIIEGE
WTCHG_7618:5:2:1789:11020#GCCAAT
161
chr10
83598
0
2M2D49M chrM
2220
0
GTGGGTTGCAATCCTAGTCTCTGATAAAACAGACTTTAAACCAATAAAGAT
GGGGG>DDBGGGGGGI
WTCHG_7618:5:5:8834:6028#GCCAAT
163
chr10
83702
0
51M
=
83876
225
AGAAGAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAG
EIIIIHHHGGIDIIIHEGIGIH
WTCHG_7618:5:49:18524:13016#GCCAAT
83
chr10
83734
0
51M
=
83516
-269
CCAATACAGGAGCACCCAGATTCATAAAGCAAGTCCTGAGTGACCTACAAT
BHHHHGHHHHHHHHFH
WTCHG_7618:5:5:8834:6028#GCCAAT
83
chr10
83876
0
51M
=
83702
-225
TACCCAGGAATTGAACTCAGCTCTGCACCAAGCAGACCTAATAGACATCTA
DEHIIIHIIIIDIGIHFHHGIH
samtools view -h WTCHG_7618.bam
SAMtools
•
A package for manipulating sequence data
–
–
–
–
–
–
–
–
import: SAM-to-BAM conversion
view: BAM-to-SAM conversion and subalignment retrieval
sort: sorting alignment
merge: merging multiple sorted alignments
index: indexing sorted alignment
faidx: FASTA indexing and subsequence retrieval
tview: text alignment viewer
pileup: generating position-based output and consensus/indel calling
• Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N.,
Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data
Processing Subgroup (2009) The Sequence alignment/map (SAM)
format and SAMtools. Bioinformatics, 25, 2078-9
Pileup Alignments
seq1
seq1
seq1
seq1
seq1
seq1
seq1
seq1
272
273
274
275
276
277
278
279
T
T
T
A
G
T
G
C
24
23
23
23
22
22
23
23
,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
,.$....,,.,.,...,,,.,...
7<7;<;<<<<<<<<<=<;<;<<6
,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
....,,.,.,...,,,.,....^k.
%38*<<;<7<<7<=<<<;<<<<<
A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
Applications
Genome Resequencing
• Align reads to reference genome
– assumed to be very similar, most reads will align
• Identify sequence differences
– SNPs, indels, rearrangements
– Focus may be on
• producing a catalogue of variants (1000 genomes)
• producing a small number of very accurate genomes
(mouse, Arabidopsis)
• Generate new genome sequences
Mapping Accuracy in simulated human data
Effects of Indels
Read Mapping:
read length matters
E.Coli
S. cerevisiae
A thaliana
H sapiens
2008 18: 810-820 Genome Res.
5.4 Mb
12.5 Mb
120 Mb
2.8 Gb
Read Mapping(1): Hashing
• Each nucleotide can be represented as a 2-bit binary number A=00, C=01,
G=10, T=11
• A string of K nucleotides can be represented as a string of 2K bits eg
AAGTC = 0000101101
• Each binary string can be interpreted as a unique integer
• All DNA strings of length K can be mapped to the integers 0,1,…..4K-1
–
–
–
–
–
–
k=10
65,535
k=11
262,143
k=12 1,048,575
k=13 4,194,303
k=14 16,777,215
k=15 67,108,864 (effective limit for 32-bit 4-byte words)
• Can use this relationship to index DNA for fast mapping
• Need not use contiguous nucleotides – spaced seeds, templates
• Trade-off between unique indexing/high memory use
MAQ
Package for read mapping, SNP calling and management of
read data and alignments
Genome Research 2008
Easy to use - unix command-line based
Although no longer state of the art and comparatively slow ,
generally produces good results
MAQ Read Mapping
– Indexes all reads in memory and then scans through genome
– Uses the first 28bp of each read for seed mapping
– Guarantees to find seed hits with no more than two mismatches, and
it also finds 57% of hits with three mismatches
– Uses a combination of 6 hash tables that index different parts of each
read to do this
– Defines a PHRED-like read mapping quality
• Qs = −10log10 Pr{read is wrongly mapped}.
• Based on summing the base-call PHRED scores at mismatched positions
– Reads that map equally well to multiple loci are randomly assigned
one location (and have Q=0)
– Uses mate pair information to look for pairs of reads correctly oriented
within a set distance
• Defines mapping quality for a pair of consistent reads as the sum of their
individual mapping qualities
Read-Mapping (2)
Bowtie, BWA, Stampy
all use the Burrows-Wheeler transform
Burrows Wheeler transform
• Represents a sequence in a form such that
– The original sequence can be recovered
– Is more compressible (human genome fits into
RAM)
– similar substrings tend to occur together (fast to
find words)
Bowtie
http://bowtie-bio.sourceforge.net/index.shtml
http://genomebiology.com/2009/10/3/R25
• Uses the BWA algorithm
• Indexes the genome, not the reads
• Not quite guaranteed to find all matching positions with <= 2
mismatches in first 28 bases (Maq’s criterion)
• Very fast (15-40 times faster than Maq)
• Low memory usage (1.3 Gb for human genome)
• Paper focuses on speed and # of mapped reads, not accuracy.
“[…] Bowtie’s sensitivity […] is equal to SOAP’s and somewhat
less than Maq’s”
Stampy (Gerton Lunter, WTCHG)
•
“Statistical Mapper in Python” (+ core in C)
•
Uses BWA and hashing
•
<= 3 mismatches in first 34 bp  match guaranteed
More mismatches: gradual loss of sensitivity
•
Algorithm scans full read, rather than just beginning
(and no length limit)
•
Handles indels well:
Reads are aligned to reference at all candidate positions
•
Faster than Maq, slower than Bowtie
•
2.7 Gb memory (shared between instances)
Performance – sensitivity
Mapping sensitivity
100
90
80
Stampy SE
70
60
Maq SE
50
BWA SE
40
Stampy PE
30
Maq PE
20
BWA PE
10
-30
-29
-28
-27
-26
-25
-24
-23
-22
-21
-20
-19
-18
-17
-16
-15
-14
-13
-12
-11
-10
-9
-8
-7
-6
-5
-4
-3
-2
-1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
0
Indel size
100
90
Top panel:
Sensitivity for reads with indels
80
70
60
stampy 72 SE
50
Right-hand panel:
Sensitivity as function of divergence
bwa 72 SE
40
eland 72 SE
30
maq 72 SE
20
10
div0
div0.01
div0.02
div0.03
div0.04
div0.05
div0.06
div0.07
div0.08
div0.09
div0.10
div0.11
div0.12
div0.13
div0.14
div0.15
(Genome: human)
0
Viewing Read Alignments: lookseq
http://www.sanger.ac.uk/resources/software/lookseq/
Viewing Read Alignments: IGV
http://www.broadinstitute.org/igv/
Variant calling
• A hard problem, several SNP callers exist eg MAQ/SAMTOOLS, Platypus
(WTCHG) GATK (Broad)
• Issue is to distinguish between sequencing errors and sequence variants
• If variant has been seen before in other samples then problem is easier
– genotyping vs variant discovery
• VCF Variant call format is now standard file format
• MAQ
–
–
–
–
Assumes genome is diploid by default
Error model initially assumes that sequence positions are independent,
attempts to compute probability of sequence variant
Has to use number of heuristics to deal with misalignments
SNP caller now part of SAMTOOLS varFilter.pl
acts as a filter on a large number of statistics tabulated about each sequence
position
Problems with Variant Calling
• Variant Calling is difficult because
– a diploid genome will have two haplotypes present, which can differ
significantly, eg due to polymorphic indels
• should be easier with haploid or inbred genomes
• but even harder when looking at low-coverage pools of individuals (eg 1000 genomes)
– Coverage can vary depending on GC content
• problem is sporadic
– Optical duplicates may give the impression there is more support for a variant
• often all reads with the same start and end points are thinned to a single representative,
but this can cause problems if the coverage is very high
– read misalignments can produce false positives
• repetitive reads can be mapped to the wrong place
• indels near the ends of reads can cause local read misalignments, where mismatches
(SNPs) are favoured over indels
– very divergent sequence is hard to align
• may fail to give any mapping signal and will look like a deletion
• problem addressed by local indel realignment (GATK)
GC content can affect read coverage
(Arabidopsis data from Plant Sciences, thanks to Eric Belfield and Nick Harberd)
coverage
GC content
global
•
Possible causes:
• Sanger identified that melting the gel slice by heating to 50 °C in chaotropic buffer decreased the
representation of A+T-rich sequences. Nature methods | VOL.5 NO.12 | DECEMBER 2008
• PCR bias during library amplification. Nature methods | VOL.6 NO.4 | APRIL 2009 | 291
Deletions can cause SNP artefacts, by inducing misalignments at
ends of reads
Arabidopsis Data from Eric Belfield and Nick Harberd, Plant Sciences, Oxford
de-Novo genome assembly
• No close reference genome available
• Harder than resequencing
– Only about 80-90% of genome is assembleable due to
repeats
– contiguation
– scaffolding
• Different Algorithms
• More data required:
– greater depth of coverage
– range of paired-end insert sizes
Assembling Genomes from Scratch
de-novo assembly
• Software include:
– VELVET
– ABySS
– ALL_PAIRS
– SOAPdenovo
– CORTEX
Computational limitations
• Traditional approach to take reads as
fundamental objects, and build algorithms/data
structures to encode their overlaps
– essentially quadratic in the number of reads
• Next-generation sequencing machines generate
too many reads!
– simply holding the base-calls requires tens of
terabytes for large projects
– analysis produces lots of large intermediate files
• Whatever we do, it has to scale slower than
coverage
The de Bruijn Graph
a representation of all possible paths joining reads together
Pevsner, PNAS 2001
Choose a word length k (5 in this example, but larger in applications)
AACTACTTACGCG
AACTA
The de Bruijn Graph
AACTAACTACGCG
AACTA ACTAA
The de Bruijn Graph
AACTAACTACGCG
AACTA ACTAA
CTAAC
The de Bruijn Graph
AACTAACTACGCG
AACTA ACTAA
CTAAC
TAACT
The de Bruijn Graph
AACTAACTACGCG
AACTA ACTAA
CTACT
TAACT
Same sequence, different k=3
ACTACTACTGCAGACTACT
TAC
CTA
TGC
CTG
GCA
ACT
CAG
GAC
AGA
Same sequence, different k=17
ACTACTACTGCAGACTACT
ACTACTACTGCAGACTA
CTACTACTGCAGACTAC
TACTACTGCAGACTACT
Recovering unambiguous contigs
bulge– two different paths;
in a diploid genome both
might be correct
Outline of de-Novo Assembly with the deBruijn Graph
2010 20: 265-272 Genome Res.
2010 20: 265-272 Genome Res.
Comparison of RAM requirements for whole human
genome
3500
3000
RAM required (Gb)
3000
2500
2000
1500
1000
500
160
336
0
Cortex
ABYSS
Velvet
Examples
Resequencing Inbred Lines
• Mouse
– 15 inbred strains, at Sanger (PI David Adams)
– 2.8Mbp
– sequenced to 20x
• Arabidopsis thaliana (plant model)
– 19 inbred accessions, here
– 120Mb
– sequenced to 20-30x
Iterative Reassembly (IMQ/DENOM)
Xiangchao Gan
• Basic idea
– Should only be one haplotype present (but not always
true)
– Align reads to reference (Stampy)
– Identify high-confidence SNPs and indels (SAMTOOLS)
– Modify reference accordingly
– Realign reads to modified reference
– Iterate until convergence
– 5 iterations usually sufficient
– Combine with denovo assembled contigs to improve
assembly
Iterative reassembly of inbred strains
Iterative Alignment + deNovo
Assembly
Low Coverage sequencing
• “Cheap” alternative to SNP genotyping chips
• Sequence populations at <1x coverage
• Impute compute genotypes from population
data + haplotype data (1000 genomes…)
CONVERGE study of Major Depression
• 12000 Chinese Women
– 6000 cases with Major Depression
– 6000 matched controls
– Sequenced at ~1x coverage
Imputation from 1x coverage
Comparison with 16 samples genotyped on Illumina Omnichip
orange is after imputation with 570 asian Thousand Genomes Project
haplotypes
green is before imputation (just using genotype likelihoods)
Outbred Mice
• 2000 commercial outbred mice
• descended from standard laboratory inbred
strains
• phenotyped for ~300 traits
• sequenced at ~ 0.1x coverage
Haplotype reconstruction as probabalistic
mosaics
QTLs
Other Applications
• RNA-Seq
– gene expression
• Chip-Seq
–
–
–
–
DNA-protein binding sites
Histone marks
Nucleosome positioning
DNAse hypersensitive sites
• Methylation
– bisulphite sequencing
• Mutation detection
– from mutagenesis experiments
– from human trios
• Multiplex Pooling
– random genotyping from low coverage read data