Sequence Variation Informatics BI420 – Introduction to Bioinformatics Gabor T. Marth
BI420 – Introduction to Bioinformatics
Sequence Variation Informatics
Gabor T. Marth
Department of Biology, Boston College
[email protected]

Sequence variations
• The Human Genome Project produced a reference genome sequence that is 99.9% common to each human being
• sequence variations make our genetic makeup unique
• Single-nucleotide polymorphisms (SNPs) are most abundant, but other types of variations exist and are important
[Figure: SNP]

Why do we care about variations?
• phenotypic differences
• inherited diseases
• demographic history

Where do variations come from?
• sequence variations are the result of mutation events
• mutations are propagated down through generations
• variation patterns permit reconstruction of phylogeny
[Figure: genealogy descending from the MRCA; sampled sequences carry either TAAAAAT or TAACAAT]

SNP discovery
• comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)
• diverse sequence resources can be used: EST, WGS, BAC

Steps of SNP discovery
• Sequence clustering
• Cluster refinement
• Multiple alignment
• SNP detection

Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources
2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors
[Figure: sequencing error vs. true polymorphism]

Computational SNP mining – PolyBayes
• sequence clustering simplifies to database search with genome reference
• multiple alignment by anchoring fragments to genome reference
• paralog filtering by counting mismatches weighted by quality values
• SNP detection by differentiating true polymorphism from sequencing error using quality values

SNP discovery with PolyBayes
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
[Figure: fragments organized along the genome reference sequence]

Sequence clustering
• Clustering simplifies to search against sequence database to recruit relevant sequences
• Clusters = groups of overlapping sequence fragments matching the genome reference
[Figure: genome reference with fragments grouped into cluster 1, cluster 2, cluster 3]

(Anchored) multiple alignment
• The genomic reference sequence serves as an anchor
• fragments pair-wise aligned to genomic sequence
• insertions are propagated – “sequence padding”
• Advantages
  • efficient -- only involves pair-wise comparisons
  • accurate -- correctly aligns alternatively spliced ESTs

Paralog filtering -- idea
• The “paralog problem”
  • unrecognized paralogs give rise to spurious SNP predictions
  • SNPs in duplicated regions may be useless for genotyping
• Challenge
  • to differentiate between sequencing errors and paralogous difference
[Figure: sequencing errors vs. paralogous difference]

Paralog filtering -- probabilities
• Pair-wise comparison between EST and genomic sequence
• Model of expected discrepancies
  • Native: sequencing error + polymorphisms
  • Paralog: sequencing error + paralogous sequence difference
• Bayesian discrimination algorithm
[Figure: “Paralog discrimination”; x-axis: Discrepancies (d), 0 to 15; y-axis: Probability, 0 to 1; curves: P(d|Model_NAT), P(d|Model_PAR), P(Model_NAT|d)]

Paralog filtering -- paralogs
[Figure]

Paralog filtering -- selectivity
[Figure: “Distribution of P(NAT) probability values”; x-axis: P(NAT); y-axis: Number of sequences; 375 paralogous ESTs and 1,579 native ESTs, separated by a probability cutoff]

SNP detection
• Goal: to discern true variation from sequencing error
[Figure: sequencing error vs. polymorphism]

Bayesian-statistical SNP detection
[Figure: alignment column of base calls (A, C, G, T) illustrating polymorphic and monomorphic permutations]

\[
P(\mathrm{SNP}) = \sum_{\substack{\text{all variable} \\ (S_1,\ldots,S_N)}}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1,\ldots,S_N)}
{\displaystyle \sum_{S_{i_1} \in \{A,C,G,T\}} \cdots \sum_{S_{i_N} \in \{A,C,G,T\}} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1},\ldots,S_{i_N})}
\]

Annotations on the slide: Bayesian posterior probability; base call + base quality; expected polymorphism rate; base composition; depth of coverage.

The SNP score
[Figure: labels: polymorphism, specific variation]

SNP priors
• Polymorphism rate in population -- e.g. 1 / 300 bp
• Distribution of SNPs according to minor allele frequency
[Figure: x-axis: minor allele frequency [%]; y-axis: relative occurrence [%]]
• Distribution of SNPs according to specific variation
[Figure: x-axis: Variation type (AC, AG, AT, CG); y-axis: Relative occurrence]
• Sample size (alignment depth)
[Figure: Prob(k alleles of N = 20) vs. k alleles, for p = 0.02, p = 0.1, p = 0.5]

Selectivity of detection
[Figure: “Distribution of P(SNP) values”; x-axis: P(SNP); y-axis: Number of sites; label: 76,844; SNP probability threshold marked]

Validation by pooled sequencing
[Figure: SNP confirmation rate by P(SNP) bin (0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00); legend: African, Asian, Hispanic, Caucasian, CHM 1]

Validation by re-sequencing
[Figure: Confirmation rate [%] vs. SNP score [%], bins 51-60, 61-70, 71-80, 81-90, 91-100]

Rare alleles are hard to detect
• frequent alleles are easier to detect
• high-quality alleles are easier to detect
[Figure: “Detection of a single allele”; Quality value vs. Alignment depth (2 to 10); curves for Threshold = 0.5 and Threshold = 0.9]
[Figure: Quality value vs. allele frequency [%] (alignment depth = 20); curves for Threshold = 0.5 and Threshold = 0.9]

The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
• First statistically rigorous SNP discovery tool
• Correctly analyzes alternative cDNA splice forms
• Available for use (~70 licenses)
Marth et al., Nature Genetics, 1999

INDEL discovery
• There is no “base quality” value for “deleted” nucleotide(s)
• No reliable prior expectation for INDEL rates of various classes
• Sequencing chemistry context-dependent

INDEL discovery
[Figure: a deletion between two deletion flanks; an insertion between two insertion flanks]
• Q(deletion) = average of Q(deletion flank)
• Q(insertion flank) >= 35
• Q(deletion flank) >= 35

INDEL discovery
• 123,035 candidate INDELs (~ 25% of substitutions)
• Majority 1-4 bp insertion length (1 bp: 68%, 2 bp: 13%)
• Validation rate steeply increases with insertion length: 14.3% < 60.8% < 61.7%
[Figure: Fraction observed [%] vs. Insertion length [bp]]

SNP discovery in diploid traces
• usually, PCR products are sequenced from multiple individuals
• sequence is guaranteed to originate from a single location: no alignment problem
• sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence

SNP discovery in diploid traces
[Figure: heterozygous trace peak vs. homozygous trace peak]

SNP mining: genome BAC overlaps
• overlap detection
• inter- & intra-chromosomal duplications
• known human repeats
• fragmentary nature of draft data
• SNP analysis
• candidate SNP predictions

BAC overlap mining results
• ~ 30,000 clones
• 25,901 clones (7,122 finished, 18,779 draft with base quality values)
• 21,020 clone overlaps (124,356 fragment overlaps)
• 507,152 high-quality candidate SNPs (validation rate 83-96%)
>CloneX
ACGTTGCAACGT GTCAATGCTGCA
>CloneY
ACGTTGCAACGT GTCAATGCTGCA
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
Marth et al., Nature Genetics 2001

SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002)
2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)

The current variation resource
• The current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers
1. How are these SNPs structured within the genome?
2. What can we learn about the processes that shape human variability?

New sequencers for SNP discovery
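
Worked sketch: Bayesian SNP score
The Bayesian-statistical SNP detection slide earlier in this deck sums a posterior over all polymorphic base permutations of an alignment column. The Python sketch below is one way to make that sum concrete for a small column. It is not the PolyBayes implementation. The function names are hypothetical, the base composition prior is assumed uniform (0.25 per base), a miscalled base is assumed equally likely to be any of the other three, and the joint prior is a deliberate simplification: the 1 / 300 polymorphism rate from the SNP priors slide is spread evenly over all polymorphic permutations, ignoring the allele frequency and depth effects those slides describe.

```python
from itertools import product

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)


def base_posterior(true_base, called_base, q):
    """P(S | R): probability of the true base given one base call and its quality.

    Assumption: an erroneous call is equally likely to hide any other base.
    """
    e = phred_to_error(q)
    return 1 - e if true_base == called_base else e / 3


def snp_probability(calls, quals, poly_rate=1 / 300):
    """Posterior probability that an alignment column is polymorphic.

    calls: string of called bases, one per aligned read
    quals: Phred quality value for each call
    Enumerates all 4**N permutations, so it is only practical for shallow columns.
    """
    n = len(calls)
    if n < 2 or n != len(quals):
        raise ValueError("need at least two calls, with one quality value each")

    base_prior = 0.25              # assumed uniform base composition
    n_polymorphic = 4 ** n - 4     # every permutation except AAAA.., CCCC.., etc.

    total = 0.0
    variable = 0.0
    for perm in product(BASES, repeat=n):
        monomorphic = len(set(perm)) == 1
        # Simplified joint prior (assumption): even split within each class.
        if monomorphic:
            weight = (1 - poly_rate) / 4
        else:
            weight = poly_rate / n_polymorphic
        for true_base, called, q in zip(perm, calls, quals):
            weight *= base_posterior(true_base, called, q) / base_prior
        total += weight
        if not monomorphic:
            variable += weight
    return variable / total


if __name__ == "__main__":
    print(snp_probability("AAAC", [40, 40, 40, 40]))
    print(snp_probability("AAAC", [40, 40, 40, 10]))
```

Under these assumptions, the same single mismatch scores far higher when its base quality is 40 than when it is 10, which is the point of weighting discrepancies by quality values rather than simply counting them.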