An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data

Jin Zhang, Jiayin Wang and Yufeng Wu
Department of Computer Science and Engineering
University of Connecticut
RECOMB-seq 2012
•Structural Variation (SV)
[Figure: Reference vs. Alternative sequences, showing a deletion and an insertion]
•SV calling using HTS sequencing data
Method | Pair or Single | Coverage | Exact breakpoints
Assembly | | Higher |
Read depth | | Higher | No
Read pair | Pair only | | No
Split read | | |
•Exact breakpoint
[Figure: a deletion shown on the Reference and Alternative sequences; labels: mean insert size + 3σ, exact breakpoints]
Mills et al. (Nature, 2011): “…, which facilitated analysing their origin and functional impact.”
Lam et al. (Nature Biotechnology, 2010): classification and annotation
•Problem
Finding SVs with Exact breakpoints using Low-coverage Paired-end reads
•Split-read mapping (e.g. Deletion)
Read-mapping tools either:
•do not map the read, or
•soft-clip it
[Figure: a split read across a deletion on the Reference and Alternative sequences; labels: focal region, maximum event size]
•Because of sequence and repeats, a longer maximum event size (e.g. 1 Mbp) may cause false positives
•Different ways of splitting a read may cause even more false positives
•A shorter maximum event size may reduce false positives, but may also fail to find some larger deletions
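The trade-off in these bullets can be illustrated with a minimal sketch. It assumes exact string matching of the soft-clipped part of a read downstream of the mapped part; the function name `candidate_split_positions`, the parameter names and the toy reference are hypothetical and are not taken from Pindel or SVseq.

```python
def candidate_split_positions(reference, clipped, anchor_end, max_event_size):
    """Return every start position where `clipped` matches exactly
    within `max_event_size` bases downstream of `anchor_end`.

    Assumption for illustration: exact matching only. Real split-read
    mappers tolerate mismatches, which adds further candidates.
    """
    window_end = min(len(reference), anchor_end + max_event_size + len(clipped))
    hits = []
    pos = reference.find(clipped, anchor_end)
    while pos != -1 and pos + len(clipped) <= window_end:
        hits.append(pos)
        pos = reference.find(clipped, pos + 1)
    return hits


# Toy reference with the same 7-mer repeated at positions 20 and 57.
toy_reference = "AAAAACCCCC" + "T" * 10 + "GATTACA" + "C" * 30 + "GATTACA"
clipped_part = "GATTACA"

# The mapped part of the read is assumed to end at position 10.
small = candidate_split_positions(toy_reference, clipped_part, 10, 15)
large = candidate_split_positions(toy_reference, clipped_part, 10, 100)
print(small)  # [20]
print(large)  # [20, 57]
```

With the small window only one split is possible; with the large window the repeat adds a second, ambiguous candidate.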
Method | Algorithm | Max deletion size | Cutoff | Insert size | Focal region
Pindel: Ye et al. (Bioinformatics 2009) | Pattern growth | Yes | Yes | |
SVseq1: Zhang et al. (Bioinformatics 2011) | BWT | | Yes | |
SVseq2 (this work, RECOMB-seq 2012) | Dynamic programming | Yes | Yes | Yes | Yes
•SVseq2: a pattern for deletion calling:
Finding the focal region with the help of a spanning pair
li: library mean insert size
σ: standard deviation of the insert size
l: read length
[Figure: a spanning pair around a deletion (not known) on the Reference and Alternative sequences. The known breakpoint and the unknown breakpoint on the Reference are the same breakpoint on the Alternative. Each search window has length li + 3σ − 2l.]
(a) Within length li + 3σ − 2l from [figure symbol], find [figure symbol]
(b) Find [figure symbol] by using [figure symbol], because they are a pair
(c) Find [figure symbol] by mapping the soft-clipped portion within length li + 3σ − 2l of [figure symbol]
E.g. li + 3σ − 2l = 400 + 3 × 50 − 200 = 350 bp
Note: the maximum event size can be 1 Mbp
Using the focal region:
(1) Search a much smaller space
(2) Reduce the number of ways to split a read
(3) Still able to find large deletions
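The focal-region length can be checked with a short sketch. The formula li + 3σ − 2l and the numbers 400, 50 and 100 come from the slide; the function names, the `n_sd` parameter and the half-open interval convention are assumptions made for illustration.

```python
def focal_region_length(mean_insert, sd, read_len, n_sd=3):
    """Focal-region length li + 3*sigma - 2*l from the slide."""
    return mean_insert + n_sd * sd - 2 * read_len


def focal_region(anchor_pos, length, downstream=True):
    """Half-open interval of `length` bases on one side of `anchor_pos`.

    Hypothetical helper: which side to search depends on read
    orientation, which the slide conveys only in the lost figure.
    """
    if downstream:
        return (anchor_pos, anchor_pos + length)
    return (max(0, anchor_pos - length), anchor_pos)


length = focal_region_length(mean_insert=400, sd=50, read_len=100)
print(length)                      # 350
print(focal_region(1000, length))  # (1000, 1350)
```

The search window is 350 bp regardless of the deletion size, which is why a large maximum event size does not enlarge the search space.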
•SVseq2: another pattern in deletion calling
[Figure: an anchored read pair across a deletion on the Reference and Alternative sequences; search window of length li + 3σ − 2l]
The pair itself is also a spanning pair
•Dynamic alignment algorithm (semi-global)
Similarity: 1 for matches and −1 for mismatches.
Penalty: 3 for gaps inside the sequence, 0 outside.
GTTCTAAGCCAGTGGTTCTACCAACTTGAGTATGCATCAGAATCACTTGGA
----------AGTGGTTCT-CCAACTTGAGAATGCATCA------------
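The scoring scheme above can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the SVseq2 implementation: it assumes the whole read must be aligned, that unaligned reference bases before and after the read are free ("0 outside"), and that gaps are scored linearly. The function name `semiglobal_score` is hypothetical.

```python
def semiglobal_score(read, ref, match=1, mismatch=-1, gap=3):
    """Best score for aligning all of `read` somewhere inside `ref`.

    Gaps inside the alignment cost `gap` each; reference bases left
    unaligned before or after the read cost nothing.
    """
    n, m = len(read), len(ref)
    prev = [0] * (m + 1)  # row 0: skipping a reference prefix is free
    for i in range(1, n + 1):
        cur = [-gap * i] + [0] * m  # read bases placed before the reference start
        for j in range(1, m + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # match or mismatch
                         prev[j] - gap,      # read base against a gap
                         cur[j - 1] - gap)   # reference base against a gap
        prev = cur
    return max(prev)  # skipping a reference suffix is free


ref = "GTTCTAAGCCAGTGGTTCTACCAACTTGAGTATGCATCAGAATCACTTGGA"
read = "AGTGGTTCTCCAACTTGAGAATGCATCA"
# The alignment on the slide has 27 matches, 1 mismatch and 1 inner gap.
print(semiglobal_score(read, ref))  # 23
```

The score 23 equals 27 × 1 − 1 − 3, which matches the alignment shown on the slide.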
•SVseq2: Type III pattern for Insertion calling:
[Figure: Read 1 and Read 2 around an insertion on the Alternative and Reference sequences; labels: Region 1, Overlap, Portions 1 to 4]
Mapping score: same penalties as in the deletion case
Calling: score / length of overlap < threshold
SVseq2 currently does not reconstruct inserted sequences
A cutoff is still used
Results
•Simulation on deletions
Simulation on chromosome 15 (100,338,915 bp)
Introduce deletions with exact breakpoints from the 1000 Genomes Project release:
union.2010_06.deletions.genotypes.vcf.gz (132 deletions)
45 individuals
Simulate reads with wgsim (https://github.com/lh3/wgsim) (error rate 0.02)
Paired-end reads of length 100, outer distance 500; mapped by BWA
Cutoffs: SVseq2, 3; SVseq1, 3; Pindel 0.2.4d, 3
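A read-simulation call with these settings might be assembled as in the sketch below. The wgsim options `-e`, `-d`, `-N`, `-1` and `-2` are the tool's error-rate, outer-distance, pair-count and read-length options; the 4x coverage, the file names and both function names are assumptions, since the slide states neither the coverage nor the exact command.

```python
def read_pairs_for_coverage(genome_len, coverage, read_len):
    """Number of read pairs needed for roughly `coverage`-fold depth."""
    return int(coverage * genome_len) // (2 * read_len)


def wgsim_command(ref_fa, out1, out2, n_pairs,
                  read_len=100, outer_dist=500, error_rate=0.02):
    """Build a wgsim argument list (suitable for subprocess.run)."""
    return ["wgsim",
            "-e", str(error_rate),   # base error rate
            "-d", str(outer_dist),   # outer distance between the two ends
            "-N", str(n_pairs),      # number of read pairs
            "-1", str(read_len),     # length of the first read
            "-2", str(read_len),     # length of the second read
            ref_fa, out1, out2]


# Hypothetical example: 4x coverage of chromosome 15 (100,338,915 bp).
n_pairs = read_pairs_for_coverage(100338915, 4, 100)
cmd = wgsim_command("chr15.fa", "reads_1.fq", "reads_2.fq", n_pairs)
print(n_pairs)  # 2006778
print(" ".join(cmd))
```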
•Real data
20101123 Illumina datasets of 18 individuals on chromosome 20 (9 CEU, 9 YRI)
Mapped by BWA to NCBI37
Benchmark: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/
Contains SVs called using BreakDancerMax1.1, CNVnator, GenomeStrip,
EMBL/Delly and Pindel (with data from 1094 individuals)
Cutoffs: SVseq2, 3 (no cutoff for type I) and 4; SVseq1 and Pindel 0.2.4d, 3.
•Individual data
•Pooled data
** F: findings; SE: supported by exact breakpoint; SO: supported by overlap
•Running time
NA19312, One Thread
•Acknowledgement
Supported by NSF grant IIS-0953563