Transcript Indel

From calling bases to calling variants: Experiences with Illumina data Gerton Lunter Wellcome Trust Centre for Human Genetics

This talk  Refresher: Illumina sequencing  QC   What can go wrong Useful QC statistics  Read mapping   Comparison of popular read mappers Stampy  Indel and SNP calling  (Some results: 1000 Genomes indel calls)

7x Illumina GA-II 2x Roche 454 1x Illumina HiSeq 2000 700 600 500 400 300 200 100 0

WTCHG production (Gb)

1. Refresher: Illumina sequencing

Illumina sequencing

Illumina sequencing  8 lanes… x 120 tiles x108 bp x 2 reads… = about 48 Gb raw bp

2. QC

Quality issues  Bases are identified by their fluorescent tag  Overlapping emission spectra  Single base per cycle: reversible terminator chemistry   Not perfect: fraction lags, fraction runs ahead: “Phasing” Limits read length  Optimizing yield: cluster density    Higher densities mean more errors Above an optimum, yield decreases Partly signal processing issue: software improvements  Low amounts of initial DNA  Linker-linker hybrids; duplicated reads

Overlapping fluorescence spectra  C/A and G/T overlap  (Most common mutations are transitions, A-G and C-T) Rougemont et al, 2008

Refresher: Phred scores  Phred score = 10 log 10 ( probability of error )  10: 10% error probability  20: 1% error probability  30: 0.1% error probability (one in 1,000)  3 = 50%, 7 = 20%  13 = 5%, 17 = 2%  23 = 0.5%, 27 = 0.2%  33 = 0.05%, 37 = 0.02%

Phasing August 2009 June 2010

Cluster density & other improvements

August 2009: June 2010:

Library complexity, duplicate reads     Some sequences are read several times:   Low amount of initial material, many PCR copies Optical duplicates; secondary cluster seeding Problem for variant calling  Any PCR error will be seen twice: evidence for variant Rate of duplicates is rarely >5%   Criterion: both ends of a PE read map to matching location Can occur by chance, but low probability, except for very high coverage Post processing: duplicate removal  Standard processing step (e.g. Samtools, Picard)  Useful statistic:   Duplicate fraction is approximately additive across lanes (same library) 2x duplication fraction ≈ fraction of the library that was sequenced

Library complexity, duplicate reads Fraction α of all molecules is sequenced Number of times a PCR copy is sequenced: Poisson( α )

n =

Poisson

0

e α Duplicates: 0

1

α e α 0

2

α 2 e α /2 1

3

α 3 e α /6 2

… … Expected fraction of duplicates: e α -1+ α As a fraction of all reads sequenced: (e α -1+ α )/ α = ½ α + …

Sequencing QC

QC statistics

QC statistics - coverage

QC statistics – quality scores

GATK recalibration tool

3. Read mapping

Read mapping  First processing step after sequencing:   Read mapping (most times) Assembly (no reference sequence; specialized analyses)  Quality of mapping determines downstream results     Accessible genome Biases (ref vs. variant) Sensitivity (divergent reference; snps, indels, SV) Specificity (calibration of mapping quality)

Read mapper comparison  Read mappers:  Maq  BWA  Eland  Novoalign  Stampy  Criteria:  Sensitivity (overall; divergent reference; variants)  Specificity (mapping quality calibration)  Speed

Sensitivity

Sensitivity - indels

Sensitivity – Divergent reference

Specificity – ROC curves ROC - indels

Performance on real data

Efficiency

Stampy – first part of algorithm 15 bp subsequence Remove rev-comp symmetry 29 bit word read 4 bytes x 2 29 entry (2 Gb) hash table candidate positions open addressing, cache-friendly

Second part: Fast candidate alignment Single-instruction multiple-data (SIMD), parallel execution Affine gap penalties. Linear-time and constant memory algorithm: DP table in registers.

Maximum indel size 15 bp.

Third part: Modeling mapping failures  Pseudo-bayesian posterior  (using candidates, rather than all mapping positions) Failure to find the correct candidate  (2 or more mismatches in every 15bp subsequence) Sequence not in reference (is sequence match better than expected best random match?)

4. SNP and indel calling

SNP calling  General idea:  Works quite well! Some caveats:  Include mapping quality: P(read|g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map)   Mapping errors are dependent: don’t include mapQ<10 Base errors are not uniform (A/C/G/T): assume worst case (all identical)  Assumes no anomalies (seg dups; alignments; indel/SV; …)  Hard problem: be conservative  Expected SNP rate (human): 10 -3 /nt. FPR of 10 -5 required for 1% FDR  Filtering is required to achieve good FDR – or all data features must be adequately modeled

Indel calling  General idea:  Differences with SNP calling:  Pseudo-Bayes: cannot consider all possible variants/genotypes Generate large set of candidates Filter using goodness-of-fit test  Illumina reads do not have an explicit indel error model

Indel error model

Indel error rate (per nt)

1,00E-02 1,00E-03 1,00E-04 1,00E-05 1 2 3 4 5 6 7 8 9 10 11 Homopolymer run length 12 13 14 15

Wrap up  GA-II produces large amounts of good data  Artefacts do occur, keep a look at QC statistics  Choice of mapper influences yield and quality  Variant calling:    Bayesian approaches work well Some assumptions (independence) not met, hard to model Filtering remains necessary