Cryptic Variation in the Human Mutation Rate

Alan Hodgkinson, Adam Eyre-Walker, Manolis Ladoukakis

Variation in the mutation rate:

• Between different chromosomes

• Between regions on chromosomes

• Neighbouring nucleotides

Simple context effects:

Hwang and Green (2004) PNAS 101: 13994-14001

Cryptic Variation:

• Remote context: AG TCGGTT ACCGTGACGTT G AACGTGT

• Degenerate context: AGTCGGTTACCGTG YSRGY G AACGTGT

• No context / Complex context

Our approach to the problem

• Search for SNPs in human sequences that also have a SNP in the orthologous position in chimp.

[Figure: human and chimp sequences]

Do we see more coincident SNPs than expected by chance?

The method

• Extract all human SNPs from dbSNP and construct a BLAST database on a chromosome-by-chromosome basis.

• Extract all chimp SNPs from dbSNP with 50bp either side of the SNP.

• BLAST chimp SNPs against the human database.

• Extract results above a certain level of homology where there is a SNP on both sequences, and reduce to 40bp either side of the central position.

• Repeat the analysis both including and excluding CpG effects.
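The window handling in the steps above can be sketched as follows. This is only an illustration, not the authors' pipeline: the function names and the representation of a BLAST hit as a list of 0-based human SNP offsets within the chimp window are assumptions made for the example.

```python
def trim_window(seq, flank_in=50, flank_out=40):
    """Cut a window with `flank_in` bases either side of its central SNP
    down to `flank_out` bases either side (101 bp -> 81 bp by default)."""
    if len(seq) != 2 * flank_in + 1:
        raise ValueError("window length does not match flank_in")
    drop = flank_in - flank_out
    return seq[drop:len(seq) - drop]


def is_coincident(human_snp_offsets, flank_in=50):
    """True if a human SNP sits on the central site, i.e. the chimp SNP.

    `human_snp_offsets` are 0-based positions in the untrimmed window
    (a hypothetical way of storing a BLAST hit)."""
    return flank_in in human_snp_offsets


def human_snps_in_trimmed(human_snp_offsets, flank_in=50, flank_out=40):
    """Re-index human SNP offsets to the trimmed window, dropping any
    that fall outside it."""
    drop = flank_in - flank_out
    width = 2 * flank_out + 1
    return [p - drop for p in human_snp_offsets if 0 <= p - drop < width]
```

With the default flanks, the chimp SNP moves from offset 50 in the 101bp window to offset 40 in the 81bp window.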

Results

• ~1.5 million chimp SNPs.

• ~310,000 81bp alignments containing a human and chimp SNP.

• Observe the number of coincident SNPs.

• Calculate the expected number, taking into account the effects of neighbouring nucleotides.
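One way to picture an expected count that accounts for neighbouring nucleotides is sketched below. It assumes, for illustration only, that each window holds a single human SNP and that the chance of it landing on a site is proportional to the relative mutability of the triplet centred there. The mutability table and the function names are hypothetical; the slides do not give the actual calculation.

```python
def prob_central_hit(seq, mutability, default=1.0):
    """Probability that a single SNP placed in `seq` falls on the central
    site, with each site weighted by the relative mutability of the
    triplet centred on it. End sites and unlisted triplets get `default`."""
    weights = []
    for i in range(len(seq)):
        triplet = seq[i - 1:i + 2] if 0 < i < len(seq) - 1 else None
        weights.append(mutability.get(triplet, default))
    return weights[len(seq) // 2] / sum(weights)


def expected_coincident(windows, mutability):
    """Expected number of coincident SNPs summed over all windows."""
    return sum(prob_central_hit(w, mutability) for w in windows)
```

With an empty mutability table every site has equal weight, so the probability for an 81bp window reduces to 1/81.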

Results

          Obs     Exp    Ratio
All       11571   6592   1.76 (1.72, 1.79)
No-CpG    5028    2533   1.98 (1.93, 2.04)
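The ratios above are simply Obs/Exp. The slides do not say how the bracketed intervals were obtained; the sketch below assumes the observed count is Poisson distributed and uses a normal approximation, which is an assumption of this example rather than the authors' stated method.

```python
import math


def ratio_with_ci(obs, exp, z=1.96):
    """Obs/Exp ratio with an approximate interval, treating `obs` as a
    Poisson count (standard error sqrt(obs)) and `exp` as fixed."""
    ratio = obs / exp
    half_width = z * math.sqrt(obs) / exp
    return ratio, ratio - half_width, ratio + half_width


for label, obs, exp in [("All", 11571, 6592), ("No-CpG", 5028, 2533)]:
    r, lo, hi = ratio_with_ci(obs, exp)
    print(f"{label}: {r:.2f} ({lo:.2f}, {hi:.2f})")
```

Under this assumption the computed intervals agree with the tabulated ones to two decimal places.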

Results

Observed/expected ratio by pair of SNP types:

       C/T    G/A    C/A    G/T    C/G    A/T
C/T    1.91   -      1.04   1.19   1.21   0.96
G/A    -      1.83   1.24   1.02   1.14   1.40
C/A    1.23   1.08   4.81   -      1.28   1.39
G/T    1.15   1.38   -      4.95   1.27   0.77
C/G    1.09   1.14   1.24   1.40   2.79   -
A/T    0.94   1.06   1.79   0.99   -      15.43

Alternative Explanations

• Bias in the method

• Selection

• Ancestral Polymorphism

• Paralogous SNPs

Methodological Bias

• Simulated data with the same density of human and chimp SNPs as dbSNP, under different divergence and mutation patterns.

• Method worked well under realistic conditions.

Methodological Bias

All sites (H&G):

Div       0                1                2
Obs       839              2419             681
Exp       812              2316             685
Ratio     1.033            1.040            0.995
95% CI    (0.963, 1.103)   (1.003, 1.086)   (0.920, 1.069)

Non CpG sites (H&G):

Div       0                1                2
Obs       401              1182             374
Exp       428              1228             400
Ratio     0.936            0.963            0.935
95% CI    (0.844, 1.028)   (0.908, 1.018)   (0.840, 1.030)

Selection

• Areas of low SNP density result in clustering:

[Figure: human and chimp sequences]

Apparent excess of coincident SNPs

Selection

• No clustering:

[Figure: plot against Position (x-axis -40 to 40, y-axis 0 to 14000)]

Ancestral Polymorphism

• SNP inherited from the common ancestor of chimp and human:

Common Ancestor    T T T A
Chimp              T T T A
Human              T T A A

Increase in coincident SNPs

Ancestral Polymorphism

• Expect observed/expected ratio to be the same for all transitions:

       C/T    G/A    C/A    G/T    C/G    A/T
C/T    1.91   -      1.04   1.19   1.21   0.96
G/A    -      1.83   1.24   1.02   1.14   1.40
C/A    1.23   1.08   4.81   -      1.28   1.39
G/T    1.15   1.38   -      4.95   1.27   0.77
C/G    1.09   1.14   1.24   1.40   2.79   -
A/T    0.94   1.06   1.79   0.99   -      15.43

Ancestral Polymorphism

• Repeated initial analysis with macaque data.

• Humans and macaques split ~23-24 million years ago, so we expect no shared polymorphisms.

          Obs   Exp   Ratio
All       77    47    1.64 (1.27, 2.00)
No-CpG    34    23    1.51 (1.001, 2.02)

Paralogous SNPs

• The excess of coincident SNPs could be a consequence of artifactual SNPs, called as a result of substitutions in paralogous regions.

• Musumeci et al. (2010): 8.32% of human variation in dbSNP may be due to paralogy.

AGCTGCACGT Y CGGCATCCAA    SNP
AGCTGCACGT T CGGCATCCAA    Chromosome 1
AGCTGCACGT A CGGCATCCAA    Chromosome 7

Artifactual SNP

Paralogous SNPs

AGCTGCACGT ( T / A ) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT ( T / A ) CGGCATCCAA
AGCTGCACGT T CGGCATCCAA
AGCTGCACGT A CGGCATCCAA

• 3.6% of coincident SNPs are possibly a consequence of paralogous sequences.

Alternative Explanations

• Bias in the method

• Selection

• Ancestral Polymorphism

• Paralogous SNPs

Cryptic variation in the mutation rate

Context Analysis

• 4517 sequences containing non-CpG coincident SNPs flanked by 200bp.

• Tabulate triplet frequencies at each position in surrounding sequences.

• Test whether the proportions of triplets observed at each position differ significantly from the proportions in the sequences as a whole.
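The tabulation and comparison described above could look like the sketch below. The slides do not name the test used; a Pearson chi-square statistic is assumed here, and the function names are illustrative.

```python
from collections import Counter


def triplet_counts_at(seqs, pos):
    """Count the triplets starting at `pos` across all sequences."""
    return Counter(s[pos:pos + 3] for s in seqs if len(s) >= pos + 3)


def overall_triplet_props(seqs):
    """Proportion of each triplet over every position of every sequence."""
    total = Counter()
    for s in seqs:
        total.update(s[i:i + 3] for i in range(len(s) - 2))
    n = sum(total.values())
    return {t: c / n for t, c in total.items()}


def chi_square_stat(observed, props):
    """Pearson chi-square statistic of the observed counts against the
    expected counts implied by the overall proportions."""
    n = sum(observed.values())
    return sum((observed.get(t, 0) - n * p) ** 2 / (n * p)
               for t, p in props.items() if p > 0)
```

The statistic for each position would then be compared with a chi-square distribution, with a correction for the number of positions tested.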

Context Analysis

• Coincident SNP in central position.

No obvious context surrounding coincident SNPs

Genomic Distribution

• Tallied the number of coincident SNPs per Mb:

  - 3.91 coincident SNPs per Mb.

  - 1.68 non-CpG coincident SNPs per Mb.

• If randomly distributed, expect a Poisson distribution with μ = σ² = 3.91.

• Observed σ² = 13.27 (p<0.001), so sampling variance explains approximately 30% of the total variance.

Genomic Distribution

Feature                   r        r²       p
SNP density               0.256    0.0655   <0.001**
Distance to Telomere      -0.022   0.0004   0.226
Distance to Centromere    0.011    0.0001   0.565
Recombination Rate        0.107    0.0114   <0.001**
Nucleosome Association    0.004    0.0000   0.832
Gene Density              -0.022   0.0004   0.230
GC content                -0.006   0.0000   0.741

Genomic Distribution

• SNP densities must drive coincident SNP densities to a certain extent as approximately half of coincident SNPs are created by chance alone.

• Recombination rate positively correlated with SNP density (r = 0.242, p<0.001).

• Partial correlation controlling for SNP density: r = 0.048, p=0.011**.

• SNP density explains 6.5% of the variance in coincident SNP density; recombination rate explains 0.2%.
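The partial correlation can be checked with the standard first-order formula, using three pairwise correlations reported in these slides: coincident SNPs with recombination rate (0.107), coincident SNPs with SNP density (0.256), and recombination rate with SNP density (0.242). Whether the authors computed it this way is an assumption; the sketch only shows that the figures are mutually consistent.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# x = coincident SNP density, y = recombination rate, z = SNP density
r = partial_corr(r_xy=0.107, r_xz=0.256, r_yz=0.242)
print(f"r = {r:.3f}, variance explained = {r * r:.1%}")
```

This gives r = 0.048 and 0.2% of variance explained, matching the values on the slide.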

Genomic Distribution

Feature                   r        r²       p
Coincident SNP Density    0.256    0.0655   <0.001**
Distance to Telomere      -0.171   0.0292   <0.001**
Distance to Centromere    -0.047   0.0022   0.012**
Recombination Rate        0.234    0.0548   <0.001**
Nucleosome Association    0.187    0.0350   <0.001**
Gene Density              0.064    0.0041   0.001**
GC content                0.184    0.0339   <0.001**

Quantification

• Use Log-normal distribution of relative mutation rates due to cryptic variation.

• Model the number of coincident SNPs under the effects of cryptic variation.

• Incorporate effects of divergence.

What level of variation in the log-normal distribution explains our results?

Log-normal model

Fastest 5% of sites mutate ~16.4 times faster than slowest 5% of sites.
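The slide does not define "fastest 5%" precisely. If it is read as the ratio of the 95th to the 5th percentile of the log-normal distribution (an assumption made here, not something the slides state), that ratio depends only on the standard deviation σ of the log rates, as the sketch below shows. The function names are illustrative.

```python
import math
from statistics import NormalDist


def quantile_ratio(sigma, tail=0.05):
    """Ratio of the (1 - tail) quantile to the tail quantile of a
    log-normal distribution whose log has standard deviation `sigma`."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.exp(2 * z * sigma)


def sigma_for_ratio(ratio, tail=0.05):
    """Inverse of quantile_ratio: the sigma that yields a given ratio."""
    z = NormalDist().inv_cdf(1 - tail)
    return math.log(ratio) / (2 * z)
```

Under this reading only, a 16.4-fold ratio corresponds to σ of roughly 0.85. A different definition, such as comparing the mean rates within each 5% tail, would give a different value.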

Summary

• Cryptic variation in the mutation rate.

• No obvious context surrounding coincident SNPs.

• Variation is truly cryptic.

• Genomic distribution of coincident SNPs is over-dispersed.

• Variation in mutation rate is substantial.

Acknowledgments

• BBSRC

• People: Manolis Ladoukakis, Adam Eyre-Walker