Cryptic Variation in the Human Mutation Rate
Alan Hodgkinson, Adam Eyre-Walker, Manolis Ladoukakis
Variation in the mutation rate:
• Between different chromosomes
• Between regions on chromosomes
• Neighbouring nucleotides
Simple context effects:
Hwang and Green (2004) PNAS 101: 13994-14001
Cryptic Variation:
• Remote context: AG TCGGTT ACCGTGACGTT G AACGTGT
• Degenerate context: AGTCGGTTACCGTG YSRGY G AACGTGT
• No context / Complex context
Our approach to the problem
• Search for SNPs in human sequences that also have a SNP in the orthologous position in chimp.
[Diagram: Human and Chimp sequences]
Do we see more coincident SNPs than expected by chance?
The method
• Extract all human SNPs from dbSNP and construct a BLAST database on a chromosome by chromosome basis.
• Extract all chimp SNPs from dbSNP with 50bp either side of SNP.
• BLAST chimp SNPs against human database.
• Extract results above a certain level of homology where there is a SNP on both sequences and reduce to 40bp either side of central position.
• Repeat the analysis both including and excluding CpG effects.
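The filtering and trimming steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Hit` record, the ungapped alignment, and the `MIN_IDENTITY` threshold are hypothetical (the slide says only "a certain level of homology"), and the real pipeline works from BLAST output rather than pre-paired strings.

```python
from dataclasses import dataclass
from typing import Optional

FLANK_IN = 50        # flank extracted around each chimp SNP (from the slide)
FLANK_OUT = 40       # flank kept after filtering (from the slide)
MIN_IDENTITY = 0.95  # hypothetical threshold, not taken from the slides


@dataclass
class Hit:
    chimp: str        # 101 bp chimp sequence, SNP at index FLANK_IN
    human: str        # human sequence aligned to it (assumed ungapped)
    human_snps: list  # indices of human SNPs in alignment coordinates


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def trim_hit(hit: Hit) -> Optional[dict]:
    """Filter one hit and cut it down to the central 81 bp window."""
    if len(hit.chimp) != len(hit.human) or len(hit.chimp) != 2 * FLANK_IN + 1:
        return None
    if identity(hit.chimp, hit.human) < MIN_IDENTITY:
        return None
    lo, hi = FLANK_IN - FLANK_OUT, FLANK_IN + FLANK_OUT + 1
    # Re-express human SNP positions relative to the trimmed window.
    kept = [p - lo for p in hit.human_snps if lo <= p < hi]
    if not kept:
        return None  # no human SNP left inside the window
    return {
        "chimp": hit.chimp[lo:hi],
        "human": hit.human[lo:hi],
        "human_snps": kept,
        "coincident": FLANK_OUT in kept,  # human SNP sits on the chimp SNP site
    }
```

A hit whose human SNP lands on index 40 of the trimmed window is counted as coincident; human SNPs elsewhere in the window contribute to the non-coincident background.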
Results
• ~1.5 million chimp SNPs.
• ~310,000 81bp alignments containing a human and chimp SNP.
• Observe the number of coincident SNPs.
• Calculate the expected number, taking into account the effects of neighbouring nucleotides.
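One way to picture the expected number is sketched below. It assumes, purely for illustration, that each alignment carries exactly one human and one chimp SNP, that the two fall independently, and that each site's chance of being hit is proportional to a relative mutability looked up from its triplet context. The `mutability` table and both function names are hypothetical; the slides do not give the actual calculation.

```python
def coincidence_probability(seq: str, mutability: dict, default: float = 1.0) -> float:
    """P(two independently placed SNPs hit the same site) in one alignment."""
    # Relative rate of every site that has a neighbour on both sides.
    rates = [mutability.get(seq[i - 1:i + 2], default)
             for i in range(1, len(seq) - 1)]
    total = sum(rates)
    # Site i is hit with probability rates[i] / total in each species.
    return sum((r / total) ** 2 for r in rates)


def expected_coincident(alignments, mutability) -> float:
    """Expected number of coincident SNPs summed over all alignments."""
    return sum(coincidence_probability(s, mutability) for s in alignments)
```

With uniform rates the probability reduces to one over the number of usable sites; any variation between contexts raises it, which is why neighbouring-nucleotide effects have to be included before an excess can be claimed.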
Results
          Obs     Exp    Ratio
All       11571   6592   1.76 (1.72,1.79)
No-CpG    5028    2533   1.98 (1.93,2.04)
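The slide does not say how the intervals in parentheses were obtained. As a rough check, and only as an assumption, treating the observed count as Poisson with the expected count held fixed gives a normal-approximation interval that matches the quoted figures to two decimals. The function name is illustrative.

```python
import math


def ratio_with_interval(obs: int, exp: float, z: float = 1.96):
    """Obs/exp ratio with a normal-approximation interval on a Poisson count."""
    ratio = obs / exp
    half_width = z * math.sqrt(obs) / exp  # sd of a Poisson count is sqrt(mean)
    return ratio, ratio - half_width, ratio + half_width


for label, obs, exp in [("All", 11571, 6592), ("No-CpG", 5028, 2533)]:
    r, lo, hi = ratio_with_interval(obs, exp)
    print(f"{label}: {r:.2f} ({lo:.2f}, {hi:.2f})")
```

A bootstrap over alignments would be an equally reasonable way to get such intervals; the agreement here does not show which method was used.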
Results
       C/T    G/A    C/A    G/T    C/G    A/T
C/T    1.91   -      1.23   1.15   1.09   0.94
G/A    -      1.83   1.08   1.38   1.14   1.06
C/A    1.04   1.24   4.81   -      1.24   1.79
G/T    1.19   1.02   -      4.95   1.40   0.99
C/G    1.21   1.14   1.28   1.27   2.79   -
A/T    0.96   1.40   1.39   0.77   -      15.43
Alternative Explanations
• Bias in the method
• Selection
• Ancestral Polymorphism
• Paralogous SNPs
Methodological Bias
• Simulated data with the same density of human and chimp SNPs as dbSNP, under different divergence and mutation patterns.
• Method worked well under realistic conditions.
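A much simpler null than the simulations described above can be sketched as follows: SNPs are dropped uniformly at random on a shared stretch of sites in each species, with no divergence and no context effects, and the observed overlap is compared with its expectation. The site and SNP counts are arbitrary illustrative values, not the dbSNP densities.

```python
import random


def simulate_null(n_sites: int, n_human: int, n_chimp: int, seed: int = 0):
    """Place SNPs uniformly at random in both species and count overlaps."""
    rng = random.Random(seed)
    human = set(rng.sample(range(n_sites), n_human))
    chimp = set(rng.sample(range(n_sites), n_chimp))
    observed = len(human & chimp)
    # Each chimp SNP lands on a human SNP site with probability n_human / n_sites.
    expected = n_human * n_chimp / n_sites
    return observed, expected, observed / expected


obs, exp, ratio = simulate_null(n_sites=1_000_000, n_human=30_000, n_chimp=30_000)
print(obs, exp, round(ratio, 3))
```

Under this null the ratio should scatter around 1, which is the behaviour an unbiased method is expected to recover.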
Methodological Bias
All sites (H&G):
Div      0               1               2
Obs      839             2419            681
Exp      812             2316            685
Ratio    1.033           1.040           0.995
95% CI   (0.963,1.103)   (1.003,1.086)   (0.920,1.069)

Non CpG sites (H&G):
Div      0               1               2
Obs      401             1182            374
Exp      428             1228            400
Ratio    0.936           0.963           0.935
95% CI   (0.844,1.028)   (0.908,1.018)   (0.840,1.030)
Selection
• Areas of low SNP density result in clustering:
[Diagram: Human and Chimp sequences]
Apparent excess of coincident SNPs
Selection
• No clustering:
[Figure: counts (vertical axis, 0 to 14000) plotted against Position (-40 to 40)]
Ancestral Polymorphism
• SNP inherited from common ancestor of chimp and human:
[Diagram: Common Ancestor: T T T A; Chimp: T T T A; Human: T T A A]
Increase in coincident SNPs
Ancestral Polymorphism
• Expect observed/expected ratio to be the same for all transitions:
       C/T    G/A    C/A    G/T    C/G    A/T
C/T    1.91   -      1.23   1.15   1.09   0.94
G/A    -      1.83   1.08   1.38   1.14   1.06
C/A    1.04   1.24   4.81   -      1.24   1.79
G/T    1.19   1.02   -      4.95   1.40   0.99
C/G    1.21   1.14   1.28   1.27   2.79   -
A/T    0.96   1.40   1.39   0.77   -      15.43
Ancestral Polymorphism
• Repeated initial analysis with macaque data.
• Humans and macaques split ~23-24 million years ago, so we expect there to be no shared polymorphisms.
          Obs   Exp   Ratio
All       77    47    1.64 (1.27,2.00)
No-CpG    34    23    1.51 (1.001,2.02)
Paralogous SNPs
• Excess of coincident SNPs may be a consequence of artifactual SNPs called as a result of substitutions in paralogous regions.
• Musumeci et al. (2010): 8.32% of human variation in dbSNP may be due to paralogy.
AGCTGCACGT Y CGGCATCCAA   SNP
AGCTGCACGT T CGGCATCCAA   Chromosome 1
AGCTGCACGT A CGGCATCCAA   Chromosome 7
Artifactual SNP
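The mechanism in the diagram can be illustrated with a toy comparison: if reads from two paralogous copies are collapsed onto one location, every fixed difference between the copies looks like a polymorphism. The function name is hypothetical, and the two sequences are the Chromosome 1 and Chromosome 7 copies shown on the slide.

```python
def apparent_snps(copy_a: str, copy_b: str):
    """Positions where two paralogous copies differ, reported as (index, a, b)."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(copy_a, copy_b)) if a != b]


chr1_copy = "AGCTGCACGT" + "T" + "CGGCATCCAA"
chr7_copy = "AGCTGCACGT" + "A" + "CGGCATCCAA"
print(apparent_snps(chr1_copy, chr7_copy))  # [(10, 'T', 'A')]
```

Nothing here is polymorphic within either copy; the apparent T/A SNP exists only because the two locations were treated as one.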
Paralogous SNPs
AGCTGCACGT ( T / A ) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT ( T / A ) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT A CGGCATCCAA
3.6% of coincident SNPs are possibly a consequence of paralogous sequences
Alternative Explanations
• Bias in the method
• Selection
• Ancestral Polymorphism
• Paralogous SNPs
Cryptic variation in the mutation rate
Context Analysis
• 4517 sequences containing non-CpG coincident SNPs flanked by 200bp.
• Tabulate triplet frequencies at each position in surrounding sequences.
• Test whether the proportions of triplets we observe at each position are significantly different from the proportions in the sequences as a whole.
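The tabulation and test described above might look like the sketch below. It is an assumption about the form of the test: a chi-square goodness-of-fit statistic per position against the pooled triplet proportions. The function names are illustrative, and only the statistic is computed; turning it into a p-value needs a chi-square distribution with (number of triplet types - 1) degrees of freedom.

```python
from collections import Counter


def triplet_counts_by_position(seqs):
    """One Counter per position: triplets starting there, pooled over sequences."""
    length = min(len(s) for s in seqs)
    return [Counter(s[i:i + 3] for s in seqs) for i in range(length - 2)]


def chi_square_vs_overall(per_position):
    """Chi-square statistic of each position against the pooled proportions."""
    overall = Counter()
    for counts in per_position:
        overall.update(counts)
    total = sum(overall.values())
    stats = []
    for counts in per_position:
        n = sum(counts.values())
        stat = 0.0
        for triplet, k in overall.items():
            expected = n * k / total  # count expected from pooled proportions
            stat += (counts.get(triplet, 0) - expected) ** 2 / expected
        stats.append(stat)
    return stats
```

A position with a strong sequence context would stand out with a large statistic; a flat profile across all positions corresponds to the "no obvious context" outcome.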
Context Analysis
• Coincident SNP in central position:
No obvious context surrounding coincident SNPs
Genomic Distribution
• Tallied the number of coincident SNPs per Mb:
- 3.91 coincident SNPs per Mb.
- 1.68 non-CpG coincident SNPs per Mb.
• If randomly distributed, expect a Poisson distribution with $\mu = \sigma^2 = 3.91$.
• Observed $\sigma^2 = 13.27$ (p<0.001), so sampling variance explains approximately 30% of the total variance.
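The 30% figure follows from the Poisson property that the variance equals the mean: the sampling variance expected by chance is the mean count, so its share of the total is mean divided by observed variance. A minimal sketch, with an illustrative helper for raw per-Mb counts (the helper name is an assumption):

```python
from statistics import mean, variance


def sampling_share(counts):
    """Share of the variance in per-window counts expected from Poisson sampling."""
    m = mean(counts)
    v = variance(counts)
    return m / v  # Poisson sampling variance equals the mean


# Summary values quoted on the slide.
mean_per_mb, var_per_mb = 3.91, 13.27
print(round(mean_per_mb / var_per_mb, 3))  # 0.295
```

The remaining 70% or so is variance between windows beyond what random placement would produce, which is what over-dispersion means here.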
Genomic Distribution
Feature                   r        r^2      p
SNP density               0.256    0.0655   <0.001**
Distance to Telomere     -0.022    0.0004   0.226
Distance to Centromere    0.011    0.0001   0.565
Recombination Rate        0.107    0.0114   <0.001**
Nucleosome Association    0.004    0.0000   0.832
Gene Density             -0.022    0.0004   0.230
GC content               -0.006    0.0000   0.741
Genomic Distribution
• SNP densities must drive coincident SNP densities to a certain extent as approximately half of coincident SNPs are created by chance alone.
• Recombination rate positively correlated with SNP density (r = 0.242, p<0.001).
• Partial correlation controlling for SNP density: r = 0.048, p=0.011**.
• SNP density explains 6.5% of the variance in coincident SNP density; recombination rate explains 0.2%.
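The partial correlation can be reproduced from the three pairwise correlations quoted on the slides using the standard first-order formula. The mapping of variables is an assumption: x is coincident SNP density, y is recombination rate, and z is SNP density, with r = 0.107 for x and y, r = 0.256 for x and z, and r = 0.242 for y and z.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r = partial_correlation(r_xy=0.107, r_xz=0.256, r_yz=0.242)
print(f"partial r = {r:.3f}, variance explained = {r * r:.1%}")
```

Squaring the partial correlation gives the share of variance attributed to recombination rate once SNP density is held fixed.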
Genomic Distribution
Feature                   r        r^2      p
Coincident SNP Density    0.256    0.0655   <0.001**
Distance to Telomere     -0.171    0.0292   <0.001**
Distance to Centromere   -0.047    0.0022   0.012**
Recombination Rate        0.234    0.0548   <0.001**
Nucleosome Association    0.187    0.0350   <0.001**
Gene Density              0.064    0.0041   0.001**
GC content                0.184    0.0339   <0.001**
Quantification
• Use Log-normal distribution of relative mutation rates due to cryptic variation.
• Model the number of coincident SNPs under the effects of cryptic variation.
• Incorporate effects of divergence.
What level of variation in the log-normal distribution explains our results?
Log-normal model
Fastest 5% of sites mutate ~16.4 times faster than slowest 5% of sites.
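To get a feel for what such a fold difference implies, the sketch below relates the spread of a log-normal distribution to the ratio between its 95th and 5th percentiles. This is one possible reading of "fastest 5%" versus "slowest 5%", taken here as an assumption; the slide may instead refer to mean rates within each tail, which would give a different number. Both function names are illustrative.

```python
import math
from statistics import NormalDist


def quantile_ratio(sigma: float, tail: float = 0.05) -> float:
    """Ratio of the (1 - tail) quantile to the tail quantile of a log-normal."""
    z = NormalDist().inv_cdf(1 - tail)
    # Quantiles are exp(mu + z*sigma) and exp(mu - z*sigma); mu cancels.
    return math.exp(2 * z * sigma)


def sigma_for_ratio(ratio: float, tail: float = 0.05) -> float:
    """Log-scale standard deviation that yields a given quantile ratio."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.log(ratio) / (2 * z)


sigma = sigma_for_ratio(16.4)  # about 0.85 under this reading
print(round(sigma, 2), round(quantile_ratio(sigma), 1))
```

Because the ratio grows exponentially with sigma, modest changes in the fitted spread translate into large changes in the fold difference between fast and slow sites.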
Summary
• Cryptic variation in the mutation rate.
• No obvious context surrounding coincident SNPs.
• Variation is truly cryptic.
• Genomic distribution of coincident SNPs is over-dispersed.
• Variation in mutation rate is substantial.
Acknowledgments
• BBSRC
• People: Manolis Ladoukakis, Adam Eyre-Walker