Single Nucleotide Polymorphism Linkage Disequilibrium And Haplotypes
Xiaole Shirley Liu
Outline
• SNP distribution and characteristics
– Allele frequency, LD, population stratification
• Haplotype inference
• HapMap project and 1000 Genomes
STAT115
Polymorphism
• Polymorphism: sites/genes with “common” variation, i.e. the less common allele has frequency >= 1%; otherwise the variant is called rare and the site is not considered polymorphic
• Single Nucleotide Polymorphism
– Arises from a DNA replication mistake in an individual germ-line cell, which is then transmitted
– ~90% of human genetic variation
• Copy number variations
– May or may not be genetic
Why Should We Care
• Disease gene discovery
– Association studies, e.g. certain SNPs are associated with susceptibility to diabetes
– Chromosome aberrations: duplication / deletion might cause cancer
• Personalized medicine
– A drug may be effective only if you carry a particular allele
SNP Distribution
• Most common type of variation, ~1 SNP / 100-300 bp
– Balance between the rate at which mutations are introduced and the rate at which polymorphisms are lost
– Most mutations are lost within a few generations
• 2/3 are C/T differences
• In non-coding regions, often fewer SNPs in more conserved regions
• In coding regions, often more synonymous than non-synonymous SNPs
SNP Characteristics: Allele Frequency Distribution
• Most alleles are rare (minor allele frequency < 10%)
SNP Characteristics: Linkage Disequilibrium
• Hardy-Weinberg equilibrium
– In a population with genotypes AA, aa, and Aa, if p = freq(A) and q = freq(a), the frequencies of AA, aa, and Aa will be $p^2$, $q^2$, and $2pq$ respectively at equilibrium
– Similarly with two loci, each with two alleles (A/a and B/b): at equilibrium, freq(AB) = freq(A) × freq(B)
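A small numeric illustration of the equilibrium frequencies may help here. This is only a sketch: the function names and the example allele frequencies are made up for illustration, and the coefficient D = freq(AB) - freq(A) freq(B) is the usual LD measure, which the slides themselves do not define.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequency p = freq(A)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def linkage_equilibrium(p_A, p_B):
    """Expected haplotype frequencies when alleles at the two loci are independent."""
    return {
        "AB": p_A * p_B,
        "Ab": p_A * (1 - p_B),
        "aB": (1 - p_A) * p_B,
        "ab": (1 - p_A) * (1 - p_B),
    }


def ld_coefficient(hap_freq, p_A, p_B):
    """D = freq(AB) - freq(A) * freq(B); D is 0 at linkage equilibrium."""
    return hap_freq["AB"] - p_A * p_B


# Hypothetical allele frequencies, chosen only for illustration.
geno = hardy_weinberg(0.7)
hap = linkage_equilibrium(0.7, 0.4)
d = ld_coefficient(hap, 0.7, 0.4)
```

With observed (rather than expected) haplotype frequencies passed to `ld_coefficient`, a value of D away from 0 would indicate disequilibrium.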
SNP Characteristics: Linkage Disequilibrium
[Figure: Equilibrium vs. Disequilibrium]
• LD: if alleles at two loci occur together more often than can be accounted for by chance, this indicates that the two loci are physically close on the DNA
– In mammals, LD is often lost at ~100 kb
– In fly, LD often decays within a few hundred bases
SNP Characteristics: Linkage Disequilibrium
• Statistical significance of LD
– Chi-square test (or Fisher’s exact test)
– Expected counts: $e_{ij} = n_{i\cdot}\, n_{\cdot j} / n_T$
– Test statistic: $\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$

         B1      B2      Total
A1       n11     n12     n1.
A2       n21     n22     n2.
Total    n.1     n.2     nT
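The test statistic above can be computed directly from the 2x2 table of haplotype counts. The following is a minimal sketch; the function name and the example counts are hypothetical, and it assumes every row and column total is non-zero.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a table of haplotype counts n_ij."""
    row_totals = [sum(row) for row in table]        # n_i.
    col_totals = [sum(col) for col in zip(*table)]  # n_.j
    n_total = sum(row_totals)                       # n_T
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n_total  # expected count
            chi2 += (n_ij - e_ij) ** 2 / e_ij
    return chi2


# Hypothetical counts: rows are A1, A2; columns are B1, B2.
counts = [[40, 10], [10, 40]]
chi2 = chi_square_2x2(counts)
```

For these made-up counts every expected count is 25, so the statistic is 36, far above 3.84, the 5% critical value for one degree of freedom.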
SNP Characteristics: Linkage Disequilibrium
• Haplotype block: a cluster of linked SNPs
• Haplotype boundary: blocks show strong LD within a block and little or no LD between blocks; the boundaries reflect recombination hotspots
• Haplotype size distribution
SNP Characteristics: Linkage Disequilibrium
• [C/T] [A/G] T X C [A/C] [T/A]
– Possible haplotypes: $2^4$ = 16
– In reality, a few common haplotypes explain 90% of the variation, so many SNPs are redundant
• Tagging SNPs:
– SNPs that capture most of the variation in haplotypes
– Remove redundancy
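To make the redundancy idea concrete, here is a brute-force sketch that finds the smallest set of SNP positions able to tell a list of common haplotypes apart. It is one simple way to pick tagging SNPs, not necessarily the method used in practice; the function name and the four example haplotypes (written over the four variable sites only) are hypothetical.

```python
from itertools import combinations


def smallest_tag_set(haplotypes):
    """Return the smallest tuple of SNP positions that distinguishes all haplotypes."""
    n_snps = len(haplotypes[0])
    for size in range(1, n_snps + 1):
        for positions in combinations(range(n_snps), size):
            # Alleles of each haplotype at the candidate tag positions.
            signatures = {tuple(h[p] for p in positions) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return positions
    return None  # only reached if two haplotypes are identical


# Hypothetical common haplotypes; sites 1, 2 and 3 always change together.
common = ["CAAT", "CGCA", "TAAT", "TGCA"]
tags = smallest_tag_set(common)
```

Here genotyping positions 0 and 1 is enough, because positions 2 and 3 carry the same information as position 1.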
SNP Genotyping
• One SNP at a time or genome-wide (SNP array)
40 Probes Used Per SNP
• Allele call
– AA, BB, AB
• Signal
– Theoretically 1A+1B, 2A, 2B
– But could have 1A+3B (amplified!)
Haplotype
• Haplotype: cluster of SNPs in LD
– A block with 10 SNPs has $2^{10}$ possible haplotypes
– Only 5-6 haplotypes are observed (> 90% of cases)
– Tagging SNPs: subset of SNPs that identifies a haplotype
• Association (with disease) studies using haplotypes are more accurate than those using a single SNP locus
• Haplotype inference: Aa BB Cc
Haplotype Inference
• Genotyping only tells us that an individual is e.g. Aa BB Cc, but it doesn’t tell whether the haplotypes are ABC + aBc or ABc + aBC
• Haplotypes can often be inferred if the parental genotypes are known
– Similar to blood typing, e.g. F: A, M: AB, C: B implies the genotypes F: AO, M: AB, C: BO
• Otherwise, look at the population genotypes and infer common haplotypes
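The ambiguity in the first bullet can be enumerated mechanically. In this sketch a genotype is written as one two-character string per locus, which is an illustrative encoding rather than a standard format, and the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: one two-character string per locus, e.g. ["Aa", "BB", "Cc"].
    """
    pairs = set()
    # For each locus, choose which of its two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[c] for g, c in zip(genotype, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))  # (h1, h2) and (h2, h1) are the same pair
    return sorted(pairs)


pairs = compatible_haplotype_pairs(["Aa", "BB", "Cc"])
# pairs == [("ABC", "aBc"), ("ABc", "aBC")]
```

A genotype with k heterozygous loci has 2^(k-1) compatible pairs for k >= 1, which is why genotypes with at most one heterozygous locus are unambiguous.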
Haplotype Inference
Clark’s Algorithm
1. Construct haplotypes from unambiguous individuals
2. Remove samples that can be explained as combinations of haplotypes discovered already
3. Propose the haplotype that would explain the most remaining samples
4. Iterate 2 & 3 until finished
• Disadvantages:
– Depends on the number of ambiguous subjects
– Cannot get started when n is small, since there may be no unambiguous individuals to start from
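The four steps of Clark’s algorithm can be sketched as follows. This is a simplified reading of the slide, not a reference implementation: genotypes use the same illustrative encoding of one two-character string per locus, candidate haplotypes in step 3 are the complements of already-known haplotypes, ties are broken alphabetically, and all names are hypothetical.

```python
def is_compatible(hap, genotype):
    """True if hap could be one of the two haplotypes of this genotype."""
    return all(a in g for a, g in zip(hap, genotype))


def complement(hap, genotype):
    """The haplotype that pairs with hap to give genotype."""
    return "".join(g[1] if a == g[0] else g[0] for a, g in zip(hap, genotype))


def explained(genotype, known):
    """Return a pair of known haplotypes that explains genotype, or None."""
    for h in sorted(known):
        if is_compatible(h, genotype) and complement(h, genotype) in known:
            return (h, complement(h, genotype))
    return None


def clark(genotypes):
    known = set()
    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for g in genotypes:
        if sum(s[0] != s[1] for s in g) <= 1:
            h = "".join(s[0] for s in g)
            known.update({h, complement(h, g)})
    phased = {}
    while True:
        # Step 2: resolve every sample explained by the haplotypes found so far.
        for i, g in enumerate(genotypes):
            if i not in phased:
                pair = explained(g, known)
                if pair:
                    phased[i] = pair
        unresolved = [g for i, g in enumerate(genotypes) if i not in phased]
        # Step 3: candidates are complements of known haplotypes in unresolved samples.
        candidates = {complement(h, g) for g in unresolved for h in known
                      if is_compatible(h, g)}
        if not unresolved or not candidates:
            return known, phased
        # Add the candidate that explains the most remaining samples (step 4: repeat).
        best = max(sorted(candidates),
                   key=lambda c: sum(explained(g, known | {c}) is not None
                                     for g in unresolved))
        known.add(best)


# Hypothetical sample: one homozygote and two ambiguous individuals.
sample = [["AA", "BB", "CC"], ["Aa", "BB", "Cc"], ["Aa", "Bb", "cc"]]
known, phased = clark(sample)
```

If every individual is ambiguous, for example `clark([["Aa", "Bb"]])`, step 1 finds nothing and the function returns an empty result, which illustrates the "cannot get started" disadvantage.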
EM and Gibbs Sampling in Motif Finding
• Problem
– Observe: sequence S
– Unknown: motif θ and site location A (alignment), but given one, can infer the other
• EM and Gibbs Sampler
– Initialize random motif θ
– Iterate:
   • Given θ and sequence S, update site location A
   • Given A and S, update θ
– EM updates by weighted average
– Gibbs sampling updates by sampling
Statistical Model for Haplotype
Haplotype      Frequency
T T A C C      θ1
T T A C G      θ2
T T A G C      θ3
T T A G G      θ4
T T C C C      θ5
T T C C G      θ6
T T C G C      θ7
T T C G G      θ8

[Figure: haplotype pool containing copies of haplotypes 1-8]
• Each individual’s two haplotypes are treated as random draws from a pool of haplotypes with certain frequencies, such that the pair is consistent with the individual’s observed genotype
Haplotype Inference
EM and Gibbs Sampler
• Observe genotype Y; estimate the haplotype pair Z for each individual and the haplotype frequencies θ
• Initialize haplotype frequencies θ
• Iteration:
– Estimate Z given Y, θ
– Estimate θ given Y, Z
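The iteration above can be sketched for the EM case (the Gibbs sampler would sample a pair for each individual instead of averaging over pairs). This is a minimal illustration under the random-draw model, using an illustrative genotype encoding of one two-character string per locus, a uniform initialization, and a fixed number of iterations; all names and the example data are hypothetical.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=50):
    options = [haplotype_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in options for pair in pairs for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}  # initialize: uniform frequencies
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for pairs in options:
            # E-step: P(Z = (h1, h2) | Y, theta) is proportional to
            # theta[h1] * theta[h2], doubled for two different haplotypes.
            weights = [theta[h1] * theta[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: theta = expected haplotype count / (2 * number of individuals).
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return theta


# Hypothetical data: two AB/AB homozygotes and one double heterozygote.
data = [["AA", "BB"], ["AA", "BB"], ["Aa", "Bb"]]
theta = em_haplotype_frequencies(data)
```

In this toy data set the double heterozygote could be AB + ab or Ab + aB; because AB is already common, EM puts nearly all the weight on AB + ab.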
Haplotype Inference
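Partition-Ligation
• When the number of SNPs is large, the number of possible haplotypes is too large, so divide and conquer
– Infer sub-haplotypes within short segments, then treat each inferred sub-haplotype as one allele when joining segments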
HapMap of Human Genome
• HapMap: catalog of common genetic variants in humans
– What are these variants
– Where do they occur in our DNA
– How are they distributed within populations and between populations around the world
• Goals:
– Define haplotype “blocks” across the genome
– Identify a reference set of SNPs that “tag” each haplotype
– Enable unbiased, genome-wide association studies
1000 Genomes Project
• Characterization of human genome sequence variation
• Foundation for investigating the relationship between genotype and phenotype
Summary
• SNP and CNV
• SNP distribution and characteristics
– Allele frequency (minor allele > 1%)
– LD: linkage ~ physical proximity
– Population stratification
• SNP genotyping: SNP arrays, sequencing
• Haplotype inference
– Clark’s: resolve unambiguous individuals first, then propose new haplotypes to maximize explanation
– EM & Gibbs: iteratively infer haplotype frequencies and individuals’ haplotypes
Acknowledgement
• Stefano Monti
• Jun Liu & Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse
• Cheng Li & Yuhyun Park