Sequence Variation Informatics BI420 – Introduction to Bioinformatics Gabor T. Marth


BI420 – Introduction to Bioinformatics
Sequence Variation
Informatics
Gabor T. Marth
Department of Biology, Boston College
Sequence variations
• The Human Genome Project produced a reference genome
sequence; about 99.9% of it is shared by every human being
• sequence variations make our
genetic makeup unique
SNP
• Single-nucleotide polymorphisms
(SNPs) are most abundant, but other
types of variations exist and are important
Why do we care about variations?
phenotypic differences
inherited diseases
demographic history
Where do variations come from?
• sequence variations are the result of mutation events
• mutations are propagated down
through generations
[Figure: genealogy descending from the most recent common ancestor (MRCA); descendant sequences carry either TAAAAAT or TAACAAT]
• variation patterns permit
reconstruction of phylogeny
SNP discovery
• comparative analysis of multiple
sequences from the same region of the
genome (redundant sequence coverage)
• diverse sequence
resources can be used
EST
WGS
BAC
Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference
sequence as a template to organize
other sequence fragments from
arbitrary sources
2. Use sequence quality information
(base quality values) to distinguish
true mismatches from sequencing
errors
sequencing error
true polymorphism
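A minimal sketch of how a base quality value translates into an error probability, assuming the quality values are Phred-scaled (Q = -10 log10 of the error probability). The slides do not state the scale, and the function names here are illustrative only.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def prob_base_correct(q: float) -> float:
    """Probability that a base call with quality q is correct."""
    return 1.0 - phred_to_error_prob(q)


# A mismatch at a low-quality base is far more likely to be a sequencing error
# than a mismatch at a high-quality base.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Under this assumption, a mismatch supported by Q40 bases is about a thousand times less likely to be a sequencing error than one supported by Q10 bases.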
Computational SNP mining – PolyBayes
sequence clustering simplifies to database
search with genome reference
multiple alignment by anchoring fragments
to genome reference
paralog filtering by counting mismatches
weighted by quality values
SNP detection by differentiating true
polymorphism from sequencing error using
quality values
SNP discovery with PolyBayes
[Figure: pipeline anchored on the genome reference sequence]
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
Sequence clustering
• Clustering simplifies to search against sequence database to
recruit relevant sequences
• Clusters = groups of overlapping sequence fragments matching
the genome reference
[Figure: fragments placed on the genome reference and grouped into cluster 1, cluster 2, cluster 3]
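The clustering idea can be sketched as interval merging. This assumes each recruited fragment comes with start and end coordinates on the genome reference (as a database search would report). The `Hit` tuple layout and the function name are hypothetical, not part of PolyBayes.

```python
from typing import List, Tuple

# (fragment name, start, end) on the genome reference; end is exclusive.
Hit = Tuple[str, int, int]


def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap into clusters."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = 0
    for name, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(name)
            current_end = max(current_end, end)
        else:
            # No overlap: close the running cluster and start a new one.
            if current:
                clusters.append(current)
            current = [name]
            current_end = end
    if current:
        clusters.append(current)
    return clusters


hits = [("f1", 0, 100), ("f2", 50, 150), ("f3", 300, 400),
        ("f4", 350, 420), ("f5", 1000, 1100)]
print(cluster_hits(hits))
```

Because every fragment is placed against the same reference, no fragment-versus-fragment comparison is needed to form the clusters.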
(Anchored) multiple alignment
• The genomic reference sequence serves as an anchor
• fragments pair-wise aligned to genomic sequence
• insertions are propagated – “sequence padding”
• Advantages
• efficient -- only involves pair-wise comparisons
• accurate -- correctly aligns alternatively spliced ESTs
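A rough sketch of "sequence padding", under simplifying assumptions: each fragment arrives as a gapped pair-wise alignment that spans the whole reference (uncovered ends filled with `-`), and `*` is used as the pad character. The input layout and function names are hypothetical, and placing inserted bases at the left of a padded column is an arbitrary choice.

```python
from typing import List, Tuple


def parse_pairwise(ref_gapped: str, frag_gapped: str) -> Tuple[List[str], List[str]]:
    """Split one pair-wise alignment into insertions and reference-aligned bases.

    inserts[i] holds fragment bases inserted before reference base i
    (inserts[L] holds bases inserted after the last reference base).
    bases[i] is the fragment character aligned to reference base i.
    """
    inserts, bases = [""], []
    for r, f in zip(ref_gapped, frag_gapped):
        if r == "-":            # insertion relative to the reference
            inserts[-1] += f
        else:                   # match, mismatch, or deletion ('-' in fragment)
            bases.append(f)
            inserts.append("")
    return inserts, bases


def anchored_msa(reference: str, pairwise: List[Tuple[str, str]], pad: str = "*") -> List[str]:
    """Merge pair-wise alignments into one multiple alignment (reference row first)."""
    for ref_gapped, frag_gapped in pairwise:
        if len(ref_gapped) != len(frag_gapped) or ref_gapped.replace("-", "") != reference:
            raise ValueError("each alignment must span the whole reference")
    parsed = [parse_pairwise(r, f) for r, f in pairwise]
    n = len(reference)
    # Widest insertion seen at each slot decides how much padding every row gets.
    width = [max((len(ins[i]) for ins, _ in parsed), default=0) for i in range(n + 1)]

    def build(inserts: List[str], bases: List[str]) -> str:
        out = []
        for i in range(n):
            out.append(inserts[i].ljust(width[i], pad))
            out.append(bases[i])
        out.append(inserts[n].ljust(width[n], pad))
        return "".join(out)

    rows = [build([""] * (n + 1), list(reference))]
    rows += [build(ins, b) for ins, b in parsed]
    return rows


rows = anchored_msa("ACGTACGT", [("ACGT-ACGT", "ACGTTACGT"),
                                 ("ACGTACGT", "ACG-ACGT"),
                                 ("ACGT--ACGT", "ACGTTTACGT")])
print("\n".join(rows))
```

Only fragment-to-reference alignments are used, so the cost grows linearly with the number of fragments rather than quadratically.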
Paralog filtering -- idea
• The “paralog problem”
• unrecognized paralogs give rise to spurious SNP predictions
• SNPs in duplicated regions may be useless for genotyping
• Challenge
• to differentiate between sequencing errors and paralogous
difference
[Figure: sequencing errors versus paralogous difference]
Paralog filtering -- probabilities
• Pair-wise comparison between EST and genomic sequence
• Model of expected discrepancies
• Native: sequencing error + polymorphisms
• Paralog: sequencing error + paralogous sequence difference
[Figure: Paralog discrimination. Probability (0 to 1) versus number of discrepancies d (0 to 15); curves: P(d|Model_NAT), P(d|Model_PAR), P(Model_NAT|d)]
• Bayesian discrimination algorithm
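A minimal sketch of the discrimination step, assuming the number of discrepancies d follows a Poisson distribution under each model. The slides do not specify the distribution, and every rate and the prior below are hypothetical values chosen only for illustration.

```python
import math


def poisson_pmf(d: int, lam: float) -> float:
    """Probability of observing d events when lam are expected."""
    return math.exp(-lam) * lam ** d / math.factorial(d)


def posterior_native(d: int, length: int, err_rate: float, poly_rate: float,
                     paralog_rate: float, prior_native: float = 0.5) -> float:
    """P(Model_NAT | d) by Bayes' rule.

    Native model:  discrepancies come from sequencing error + polymorphism.
    Paralog model: discrepancies come from sequencing error + paralogous difference.
    """
    lam_nat = length * (err_rate + poly_rate)
    lam_par = length * (err_rate + paralog_rate)
    weight_nat = poisson_pmf(d, lam_nat) * prior_native
    weight_par = poisson_pmf(d, lam_par) * (1.0 - prior_native)
    return weight_nat / (weight_nat + weight_par)


# Hypothetical numbers: 400 bp alignment, 0.5% error, 0.1% polymorphism,
# 2% paralogous divergence.
for d in (0, 2, 4, 6, 8, 10):
    p = posterior_native(d, 400, 0.005, 0.001, 0.02)
    print(f"d={d:2d}  P(NAT|d)={p:.4f}")
```

The posterior falls as discrepancies accumulate, which is the shape a cutoff on P(NAT) exploits.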
Paralog filtering -- paralogs
Paralog filtering -- selectivity
[Figure: Distribution of P(NAT) probability values. Number of sequences (0 to 1200) versus P(NAT) in bins from 0.05 to 0.95; 375 paralogous ESTs and 1,579 native ESTs, separated by a probability cutoff]
SNP detection
• Goal: to discern true variation from sequencing error
sequencing error
polymorphism
Bayesian-statistical SNP detection
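[Figure: aligned base calls (A, C, G, T) with their base qualities are evaluated against every permutation of true bases, from monomorphic permutations to polymorphic ones]

Bayesian posterior probability of a permutation of true bases S_1, ..., S_N given the reads R_1, ..., R_N:

$$
P(S_1,\ldots,S_N \mid R_1,\ldots,R_N) =
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1,\ldots,S_N)}
{\displaystyle\sum_{S_{i_1} \in \{A,C,G,T\}} \cdots \sum_{S_{i_N} \in \{A,C,G,T\}} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1},\ldots,S_{i_N})}
$$

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } (S_1,\ldots,S_N)} P(S_1,\ldots,S_N \mid R_1,\ldots,R_N)
$$

Annotations:
• P(S_i | R_i): base call + base quality
• P_Prior(S_i): base composition
• P_Prior(S_1, ..., S_N): expected polymorphism rate
• N: depth of coverage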
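A brute-force sketch of this posterior, enumerating all 4^N permutations of true bases, so it is only practical for shallow alignments. It assumes Phred-scaled qualities, uniform base composition (0.25 per base), and a deliberately crude joint prior that spreads the polymorphism rate evenly over all variable permutations. None of these choices is taken from PolyBayes itself, and the function names are illustrative.

```python
import itertools
from typing import Dict, Sequence

BASES = "ACGT"


def base_likelihoods(call: str, qual: float) -> Dict[str, float]:
    """P(S | R) for one read: the called base gets 1 - e, the others share e."""
    err = 10 ** (-qual / 10)          # assumes Phred-scaled quality
    return {b: (1 - err) if b == call else err / 3 for b in BASES}


def snp_probability(calls: Sequence[str], quals: Sequence[float],
                    poly_rate: float = 1 / 300) -> float:
    """Sum of posterior probabilities over all variable (polymorphic) permutations."""
    n = len(calls)
    per_read = [base_likelihoods(c, q) for c, q in zip(calls, quals)]
    n_variable = 4 ** n - 4           # permutations containing more than one base
    total = 0.0
    variable = 0.0
    for perm in itertools.product(BASES, repeat=n):
        monomorphic = len(set(perm)) == 1
        # Simplified joint prior (an assumption, not the PolyBayes prior).
        joint_prior = (1 - poly_rate) / 4 if monomorphic else poly_rate / n_variable
        weight = joint_prior
        for likelihood, base in zip(per_read, perm):
            weight *= likelihood[base] / 0.25   # divide by base-composition prior
        total += weight
        if not monomorphic:
            variable += weight
    return variable / total


print(snp_probability("AAAAA", [40] * 5))            # no discrepancy
print(snp_probability("AAACC", [40] * 5))            # two high-quality C calls
print(snp_probability("AAAAC", [40, 40, 40, 40, 10]))  # one low-quality C call
```

With these assumptions, two high-quality discordant calls push the score close to 1, while a single low-quality discordant call is explained almost entirely as sequencing error.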
The SNP score
polymorphism
specific variation
SNP priors
• Polymorphism rate in population -- e.g. 1 / 300 bp
• Distribution of SNPs according to minor allele frequency
[Figure: relative occurrence [%] (0 to 40) versus minor allele frequency [%] (10 to 50)]
• Distribution of SNPs according to specific variation
[Figure: relative occurrence [%] (0 to 70) by variation type (AC, AG, AT, CG)]
• Sample size (alignment depth)
[Figure: Prob(k alleles of N = 20) versus k alleles (0 to 20), for p = 0.02, p = 0.1, p = 0.5]
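The sample-size effect can be sketched with a binomial model, assuming each of the N aligned sequences independently carries the minor allele with probability p. This is an assumption about how the plotted curves were produced, and the function names are illustrative.

```python
from math import comb


def prob_k_alleles(k: int, n: int, p: float) -> float:
    """Probability that exactly k of n sampled sequences carry an allele of frequency p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_allele_sampled(n: int, p: float) -> float:
    """Probability that the allele appears at least once among n sequences."""
    return 1 - (1 - p) ** n


for p in (0.02, 0.1, 0.5):
    print(f"p={p}: P(k=0 of 20)={prob_k_alleles(0, 20, p):.3f}, "
          f"P(seen at least once)={prob_allele_sampled(20, p):.3f}")
```

At p = 0.02 and depth 20 the allele is absent from the sample about two times out of three, so no detection method can find it in those alignments.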
Selectivity of detection
[Figure: Distribution of P(SNP) values. Number of sites (axis 0 to 120) versus P(SNP) (0.1 to 1); an off-scale count of 76,844 is annotated; SNP probability threshold marked]
Validation by pooled sequencing
[Figure: SNP confirmation rate. Confirmation rate [%] (0 to 80) versus P(SNP) in bins 0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00; series: African, Asian, Hispanic, Caucasian, CHM 1]
Validation by re-sequencing
[Figure: confirmation rate [%] (0 to 100) versus SNP score [%] in bins 51-60, 61-70, 71-80, 81-90, 91-100]
Rare alleles are hard to detect
[Figure, panel 1: Detection of a single allele. Quality value (0 to 50) versus alignment depth (2 to 10), for Threshold = 0.5 and Threshold = 0.9]
[Figure, panel 2: Quality value vs. allele frequency (alignment depth = 20). Quality value (0 to 50) versus allele frequency [%] (5 to 50), for Threshold = 0.5 and Threshold = 0.9]
• frequent alleles are easier to detect
• high-quality alleles are easier to detect
The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
• First statistically rigorous SNP discovery
tool
• Correctly analyzes alternative cDNA
splice forms
• Available for use (~70 licenses)
Marth et al., Nature Genetics, 1999
INDEL discovery
There is no “base quality” value
for “deleted” nucleotide(s)
No reliable prior expectation for
INDEL rates of various classes
Sequencing chemistry context-dependent
INDEL discovery
[Figure: Deletion Flank | Deletion | Deletion Flank]
[Figure: Insertion Flank | Insertion | Insertion Flank]
Q(deletion) = average of Q(deletion flank)
Q(insertion flank) >= 35
Q(deletion flank) >= 35
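A small sketch of these rules. It assumes "flank" means the base calls immediately on either side of the event and that the threshold of 35 applies to every flanking base. The slide does not spell out either point, and the function names are hypothetical.

```python
from typing import Sequence


def deletion_quality(left_flank: Sequence[int], right_flank: Sequence[int]) -> float:
    """Q(deletion): average quality of the bases flanking the deleted position."""
    flank = list(left_flank) + list(right_flank)
    return sum(flank) / len(flank)


def passes_flank_filter(flank_quals: Sequence[int], min_q: int = 35) -> bool:
    """Keep an INDEL candidate only if every flanking base reaches min_q."""
    return all(q >= min_q for q in flank_quals)


left, right = [40, 38], [36, 42]
print(deletion_quality(left, right))
print(passes_flank_filter(left + right))
```

Borrowing quality from the flanks works around the fact that a deleted nucleotide has no base call, and therefore no quality value, of its own.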
INDEL discovery
• Majority 1-4 bp insertion length (1 bp – 68%, 2 bp – 13%)
• 123,035 candidate INDELs (~ 25% of substitutions)
[Figure: fraction observed [%] (0 to 80) versus insertion length [bp] (1 to 9)]
• Validation rate steeply increases with insertion length: 14.3% < 60.8% < 61.7%
SNP discovery in diploid traces
usually, PCR products are sequenced from
multiple individuals
sequence is guaranteed to originate from a
single location: no alignment problem
sequence is the product of two chromosomes, hence can be heterozygous;
base quality values are not applicable to heterozygous sequence
SNP discovery in diploid traces
Heterozygous trace peak
Homozygous trace peak
SNP mining: genome BAC overlaps
overlap detection
inter- & intra-chromosomal duplications
known human repeats
fragmentary nature of draft data
SNP analysis
candidate SNP predictions
BAC overlap mining results
~ 30,000 clones
>CloneX
ACGTTGCAACGT
GTCAATGCTGCA
>CloneY
ACGTTGCAACGT
GTCAATGCTGCA
25,901 clones
(7,122 finished, 18,779 draft
with base quality values)
21,020 clone overlaps
(124,356 fragment overlaps)
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
507,152 high-quality
candidate SNPs
(validation rate 83-96%)
Marth et al., Nature Genetics 2001
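A minimal sketch of the comparison step in overlap mining, applied to the two overlap sequences shown on the slide. It assumes the overlapping segments are already aligned without gaps and that each base has a quality value. The quality values and the cutoff of 30 are hypothetical.

```python
from typing import List, Sequence, Tuple


def candidate_snps(seq_a: str, seq_b: str, qual_a: Sequence[int],
                   qual_b: Sequence[int], min_q: int = 30) -> List[Tuple[int, str, str]]:
    """Report (position, base_a, base_b) where two aligned overlaps disagree
    and both base calls reach the quality cutoff."""
    if not (len(seq_a) == len(seq_b) == len(qual_a) == len(qual_b)):
        raise ValueError("sequences and quality arrays must have equal length")
    hits = []
    for pos, (a, b, qa, qb) in enumerate(zip(seq_a, seq_b, qual_a, qual_b)):
        if a != b and qa >= min_q and qb >= min_q:
            hits.append((pos, a, b))
    return hits


clone_x = "ACCTAGGAGACTGAACTTACTG"
clone_y = "ACCTAGGAGACCGAACTTACTG"
quals = [40] * len(clone_x)          # hypothetical uniform qualities
print(candidate_snps(clone_x, clone_y, quals, quals))  # [(11, 'T', 'C')]
```

Positions are 0-based, so the single T/C difference between the two sequences is reported at index 11.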
SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps
Weber et al., AJHG 2002
2. The SNP Consortium (TSC): polymorphism discovery in
random, shotgun reads from whole-genome libraries
Sachidanandam et al., Nature 2001
The current variation resource
• The current public resource (dbSNP)
contains over 2 million SNPs as a dense
genome map of polymorphic markers
1. How are these SNPs structured within
the genome?
2. What can we learn about the
processes that shape human variability?
New sequencers for SNP discovery