An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data
Jin Zhang, Jiayin Wang and Yufeng Wu
Department of Computer Science and Engineering, University of Connecticut
RECOMB-seq 2012

•Structural Variation (SV)
[Figure: reference vs. alternative sequence, showing a deletion and an insertion]

•SV calling using HTS sequencing data
Method; Pair or Single; Coverage; Exact breakpoints
Assembly; ; Higher;
Read depth; ; Higher; No
Read pair; Pair only; ; No
Split read; ; ;

•Exact breakpoint
[Figure: two reference/alternative deletion diagrams; labels: mean insert size + 3σ, exact breakpoints]
Mills et al. (Nature, 2011): "…, which facilitated analysing their origin and functional impact."
Lam et al. (Nature Biotechnology 2010): classification and annotation

•Problem
Finding SVs with exact breakpoints using low-coverage paired-end reads

•Split-read mapping (e.g. deletion)
Read mapping tools either:
•Do not map the read
•Or soft-clip it
[Figure: reference, alternative, focal region, maximum event size, deletion]
• Because of sequence and repeats, a longer maximum event size (e.g. 1 Mbp) may cause false positives
• Different ways of splitting a read may cause even more false positives
• A shorter maximum event size may reduce false positives but may also fail to find some larger deletions

Method; Algorithm; Max Deletion Size; Cutoff; Insert Size; Focal Region
Pindel: Ye et al. (Bioinformatics 2009); Pattern growth; Yes; Yes
SVseq1: Zhang et al. (Bioinformatics 2011); BWT; Yes
SVseq2 (this work, RECOMB-seq 2012); Dynamic programming; Yes; Yes; Yes; Yes

•SVseq2: a pattern for deletion calling: finding the focal region with the help of a spanning pair
li: library mean
σ: standard deviation
l: read length
[Figure: spanning pair across a deletion; labels: li + 3σ, known breakpoint, unknown breakpoint, li + 3σ - 2l; note: "They are the same breakpoint on Alternative"]
E.g.
[Figure: reference and alternative with a deletion (not known) and two windows of length li + 3σ - 2l]
(a) Within length li + 3σ - 2l from […], find […]
(b) Find […] by using […], because they are a pair
(c) Find […] by mapping the soft-clipped portion within length li + 3σ - 2l of […]
li + 3σ - 2l = 400 + 3*50 - 200 = 350 bp
Note: the maximum event size can be 1 Mbp
Using the focal region:
(1) Search in a much smaller space
(2) Reduce the number of ways to split a read
(3) Able to find large deletions

•SVseq2: another pattern in deletion calling
[Figure: anchor, alternative, reference, deletion; window of length li + 3σ - 2l]
The pair itself is also a spanning pair

•Dynamic alignment algorithm (semi-global)
Similarity: 1 for matches and -1 for mismatches.
Penalty: 3 for gaps inside the sequence, 0 outside.
GTTCTAAGCCAGTGGTTCTACCAACTTGAGTATGCATCAGAATCACTTGGA
----------AGTGGTTCT-CCAACTTGAGAATGCATCA------------

•SVseq2: Type III pattern for insertion calling
[Figure: insertion on the alternative relative to the reference; labels: Read 1, Read 2, Region 1, Overlap, Portion 1, Portion 2, Portion 3, Portion 4]
Mapping score: penalties are the same as in the deletion case
Calling: score / length of overlap < threshold
SVseq2 currently does not reconstruct inserted sequences
SVseq2 still uses a cutoff

Results
•Simulation on deletions
Simulated on chromosome 15 (100,338,915 bp)
Introduced deletions with exact breakpoints from the 1000 Genomes Project release: union.2010_06.deletions.genotypes.vcf.gz (132 deletions)
45 individuals
Simulated reads with wgsim (https://github.com/lh3/wgsim), error rate 0.02
Paired-end reads of length 100, outer distance 500
Mapped by BWA
Cutoff: SVseq2 cutoff 3; SVseq1 cutoff 3; Pindel 0.2.4d cutoff 3

•Real data
20101123 Illumina datasets of 18 individuals on chromosome 20 (9 CEU, 9 YRI)
Mapped by BWA on NCBI37
Benchmark: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/
Contains SVs called using BreakDancerMax1.1, CNVnator, GenomeStrip, EMBL/Delly and Pindel (with data of 1094 individuals)
SVseq2 cutoff 3 (no cutoff for type I) and 4; SVseq1 and Pindel 0.2.4d cutoff 3
•Individual data
•Pooled data
** F: findings; SE: supported by exact breakpoint; SO: supported by overlap

•Running time
NA19312, one thread

•Acknowledgement
Supported by NSF grant IIS-0953563
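The focal-region arithmetic and the semi-global scoring from the deletion-calling slides can be sketched in Python. This is an illustration only, not SVseq2's code: the names `focal_window` and `semi_global_score` are made up here, and the sketch assumes the read must be aligned end to end while unaligned reference bases on either side count as the free "outside" gaps.

```python
def focal_window(li, sigma, l):
    """Focal-region length li + 3*sigma - 2*l from the slides (all in bp).

    li: library mean, sigma: standard deviation, l: read length.
    """
    return li + 3 * sigma - 2 * l


def semi_global_score(read, ref, match=1, mismatch=-1, gap=3):
    """Best score for aligning all of `read` somewhere inside `ref`.

    Scoring follows the slide: +1 match, -1 mismatch, penalty 3 for each
    gap position inside the alignment. Assumption of this sketch: reference
    bases before and after the aligned read are the free "outside" gaps.
    """
    n = len(ref)
    # Row 0 is all zeros: skipping a reference prefix costs nothing.
    prev = [0] * (n + 1)
    for i in range(1, len(read) + 1):
        # Column 0: read bases placed before the reference starts cost a gap each.
        cur = [prev[0] - gap] + [0] * n
        for j in range(1, n + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + step,  # align read[i-1] with ref[j-1]
                prev[j] - gap,       # read base opposite a gap in the reference
                cur[j - 1] - gap,    # reference base opposite a gap in the read
            )
        prev = cur
    # Taking the best cell of the last row makes the reference suffix free.
    return max(prev)


# Sequences copied from the alignment shown on the slide.
REF = "GTTCTAAGCCAGTGGTTCTACCAACTTGAGTATGCATCAGAATCACTTGGA"
READ = "AGTGGTTCTCCAACTTGAGAATGCATCA"

if __name__ == "__main__":
    print(focal_window(400, 50, 100))
    print(semi_global_score(READ, REF))
```

With the slide's numbers, `focal_window(400, 50, 100)` reproduces the worked value of 350. The alignment drawn on the slide has 27 matches, one mismatch and one internal gap, which under this scoring is 27 - 1 - 3 = 23.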