Single Nucleotide Polymorphism, Linkage Disequilibrium, and Haplotypes

Xiaole Shirley Liu

Outline

• Definition and motivation
• SNP distribution and characteristics
  – Allele frequency, LD, population stratification
• SNP and genotyping
• Haplotype inference
  – Clark's algorithm
  – EM and Gibbs sampling
• HapMap project and 1000 Genomes

STAT115

Polymorphism

• Polymorphism: sites/genes with “common” variation, where the less common allele has frequency >= 1%; otherwise the variant is called rare and the site is not considered polymorphic
• Single Nucleotide Polymorphism (SNP)
  – Arises from a DNA-replication mistake in an individual germ-line cell, and is then transmitted
  – ~90% of human genetic variation
• Copy number variations (CNV)
  – May or may not be genetic

Why Should We Care

• Disease gene discovery
  – Association studies, e.g. certain SNPs are associated with susceptibility to diabetes
  – Chromosome aberrations: duplication / deletion might cause cancer
• Personalized medicine
  – A drug may only be effective if you carry one particular allele

SNP Distribution

• Most common type of variation: ~1 SNP per 100-300 bp
  – Balance between the rate at which mutations are introduced and the rate at which polymorphisms are lost
  – Most mutations are lost within a few generations
• 2/3 are C<->T differences
• In non-coding regions, there are often fewer SNPs in more conserved regions
• In coding regions, there are often more synonymous than non-synonymous SNPs

SNP Characteristics: Allele Frequency Distribution

• Most alleles are rare (minor allele frequency < 10%)

SNP Characteristics: Linkage Disequilibrium

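• Hardy-Weinberg equilibrium
  – In a population with genotypes AA, aa, and Aa, if p = freq(A) and q = freq(a), the frequencies of AA, aa, and Aa will be $p^2$, $q^2$, and $2pq$ respectively at equilibrium.
  – Similarly with two loci, each with two alleles (A/a and B/b): at equilibrium, each haplotype frequency is the product of its allele frequencies, e.g. freq(AB) = freq(A) × freq(B).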
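The equilibrium frequencies above can be computed directly. The following is a minimal sketch, not part of the original slides; the function names `hardy_weinberg` and `linkage_equilibrium` are hypothetical, and the two-locus case assumes the loci are independent.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies at one locus, given p = freq(A)."""
    q = 1.0 - p  # q = freq(a)
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def linkage_equilibrium(p_a, p_b):
    """Expected haplotype frequencies at two loci, assuming independence.

    p_a = freq(A) at the first locus, p_b = freq(B) at the second locus.
    """
    return {
        "AB": p_a * p_b,
        "Ab": p_a * (1 - p_b),
        "aB": (1 - p_a) * p_b,
        "ab": (1 - p_a) * (1 - p_b),
    }


if __name__ == "__main__":
    print(hardy_weinberg(0.7))
    print(linkage_equilibrium(0.7, 0.4))
```

Observed haplotype frequencies that differ from these products are what the later slides call disequilibrium.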

SNP Characteristics: Linkage Disequilibrium

[Figure panels: Equilibrium, Disequilibrium]

• LD: if alleles at two loci occur together more often than can be accounted for by chance, this indicates that the two loci are physically close on the DNA
  – In mammals, LD is often lost at ~100 kb
  – In fly, LD often decays within a few hundred bases

SNP Characteristics: Linkage Disequilibrium

• Statistical significance of LD
  – Chi-square test (or Fisher’s exact test)
  – Expected count: $e_{ij} = n_{i\cdot} \, n_{\cdot j} / n_T$
  – Test statistic: $\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$

          B1      B2      Total
  A1      n_11    n_12    n_1.
  A2      n_21    n_22    n_2.
  Total   n_.1    n_.2    n_T
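As an illustration of the chi-square test on the 2x2 haplotype count table, here is a minimal sketch in plain Python. The function names and the example counts are hypothetical. The p-value uses the identity that, for 1 degree of freedom, P(χ² > x) = erfc(sqrt(x / 2)); this sketch does not cover Fisher's exact test.

```python
import math


def chi_square_ld(n11, n12, n21, n22):
    """Chi-square statistic for a 2x2 table of haplotype counts.

    Rows are alleles A1/A2, columns are alleles B1/B2.
    """
    table = [[n11, n12], [n21, n22]]
    row_totals = [sum(row) for row in table]                   # n_1., n_2.
    col_totals = [table[0][j] + table[1][j] for j in range(2)]  # n_.1, n_.2
    n_total = sum(row_totals)                                   # n_T
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n_total  # e_ij
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    stat = chi_square_ld(40, 10, 10, 40)  # made-up counts for illustration
    print(stat, p_value_1df(stat))
```

A table whose counts equal the expected counts (for example 25 in every cell) gives a statistic of zero, which corresponds to the equilibrium case.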

SNP Characteristics: Linkage Disequilibrium

• Haplotype block: a cluster of linked SNPs
• Haplotype boundary: the sequence shows strong LD within blocks and no LD between blocks; the boundaries reflect recombination hotspots
• Haplotype size distribution

SNP Characteristics: Linkage Disequilibrium

• [C/T] [A/G] T X C [A/C] [T/A]
  – Possible haplotypes: 2^4 = 16
  – In reality, a few common haplotypes explain 90% of the variation
• Tagging SNPs:
  – SNPs that capture most of the variation in haplotypes
  – Removes redundancy
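To make the counting and the idea of redundancy concrete, the sketch below enumerates the 2^4 possible haplotypes over the four variable sites and then searches by brute force for a smallest set of sites that still distinguishes a few observed haplotypes. The three "common" haplotypes and the function name `tagging_snps` are made up for illustration. Exhaustive search is one simple way to pick tags, assumed here for clarity; it is not presented as the method used in practice.

```python
from itertools import combinations, product

# The four variable sites from the slide: [C/T] [A/G] ... [A/C] [T/A]
variable_sites = ["CT", "AG", "AC", "TA"]
all_haplotypes = ["".join(h) for h in product(*variable_sites)]  # 2^4 combinations


def tagging_snps(haplotypes):
    """Smallest set of site indices that tells all distinct haplotypes apart."""
    distinct = set(haplotypes)
    n_sites = len(haplotypes[0])
    for k in range(n_sites + 1):
        for subset in combinations(range(n_sites), k):
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset


if __name__ == "__main__":
    common = ["CAAT", "CGAT", "TGCA"]  # hypothetical common haplotypes
    print(len(all_haplotypes))
    print(tagging_snps(common))
```

In the hypothetical example, sites 0, 2, and 3 always change together, so typing one of them plus site 1 is enough; the other two are redundant.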

SNP Genotyping

• One SNP at a time or genome-wide (SNP array)

40 Probes Used Per SNP

• Allele call
  – AA, BB, AB
• Signal
  – Theoretically 1A+1B, 2A, 2B
  – But could have 1A+3B (amplified)

Haplotype

• Haplotype: cluster of SNPs in LD
  – A block with 10 SNPs has 2^10 possible haplotypes
  – Only 5-6 haplotypes are observed (> 90% of cases)
  – Tagging SNPs: subset of SNPs that identifies a haplotype
• Association (with disease) studies using haplotypes are more accurate than those using a single SNP locus
• Haplotype inference: e.g. genotype Aa BB Cc

Haplotype Inference

• Genotyping only tells us that an individual is e.g. Aa BB Cc, but it doesn't tell whether the haplotypes are ABC + aBc or ABc + aBC
• Haplotypes can often be inferred if the parental genotypes are known
  – Similar to blood typing, e.g. F: A, M: AB, C: B → F: AO, M: AB, C: BO
• Otherwise, look at the population genotypes and infer common haplotypes
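The ambiguity described above can be listed explicitly. This is a minimal sketch that enumerates every unordered haplotype pair consistent with an unphased genotype; the function name `phasings` and the two-letter-per-site encoding are assumptions made for illustration.

```python
from itertools import product


def phasings(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype is a list of two-character strings, one per site,
    e.g. ["Aa", "BB", "Cc"].
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # choice[i] picks which allele of site i goes on the first haplotype
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in phasings(["Aa", "BB", "Cc"]):
        print(" + ".join(pair))
```

With k heterozygous sites there are 2^(k-1) distinct pairs, which is why genotypes with zero or one heterozygous site are unambiguous.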

Haplotype Inference

Clark's Algorithm
1. Construct haplotypes from unambiguous individuals
2. Remove samples that can be explained as combinations of haplotypes discovered already
3. Propose the haplotype that would explain the most remaining samples
4. Iterate steps 2 and 3 until finished

Disadvantages:
• Depends on the number of ambiguous subjects
• Cannot get started when n is small
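The four steps can be sketched as follows. This is an illustrative reading of the slide, not a reference implementation. It assumes genotypes are coded per site as 0, 1, or 2 copies of one allele, treats an individual as unambiguous when at most one site is heterozygous, and breaks ties between candidate haplotypes by sorted order. All function names are hypothetical.

```python
def unambiguous_pair(genotype):
    """Return the two haplotypes if at most one site is heterozygous, else None."""
    if sum(1 for g in genotype if g == 1) > 1:
        return None
    h1 = tuple(1 if g >= 1 else 0 for g in genotype)
    h2 = tuple(1 if g == 2 else 0 for g in genotype)
    return h1, h2


def complement(genotype, hap):
    """Haplotype that pairs with hap to give genotype, or None if impossible."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(x in (0, 1) for x in other) else None


def explain(genotype, known):
    """Return a pair of known haplotypes that adds up to genotype, or None."""
    for hap in sorted(known):
        other = complement(genotype, hap)
        if other is not None and other in known:
            return hap, other
    return None


def clark(genotypes):
    known, resolved = set(), {}
    # Step 1: haplotypes from unambiguous individuals
    for i, g in enumerate(genotypes):
        pair = unambiguous_pair(g)
        if pair is not None:
            known.update(pair)
            resolved[i] = pair
    if not known:
        raise ValueError("no unambiguous individuals: cannot get started")
    while True:
        # Step 2: resolve samples explained by haplotypes found so far
        for i, g in enumerate(genotypes):
            if i not in resolved:
                pair = explain(g, known)
                if pair is not None:
                    resolved[i] = pair
        remaining = [i for i in range(len(genotypes)) if i not in resolved]
        if not remaining:
            break
        # Step 3: propose the new haplotype that explains the most remaining samples
        candidates = set()
        for i in remaining:
            for hap in known:
                other = complement(genotypes[i], hap)
                if other is not None:
                    candidates.add(other)
        if not candidates:
            break  # some samples cannot be explained at all

        def score(candidate):
            trial = known | {candidate}
            return sum(1 for i in remaining if explain(genotypes[i], trial))

        known.add(max(sorted(candidates), key=score))
        # Step 4: loop back to step 2
    return known, resolved


if __name__ == "__main__":
    haplotypes, assignment = clark([(0, 0, 0), (1, 1, 0), (1, 1, 0), (1, 2, 1)])
    print(sorted(haplotypes))
    print(assignment)
```

The `ValueError` branch corresponds to the second disadvantage on the slide: with no unambiguous individual there is no seed haplotype to build on.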

EM and Gibbs Sampling in Motif Finding

• Problem
  – Observe: sequence S
  – Unknown: motif θ and site location A (alignment), but given one, can infer the other
• EM and Gibbs sampler
  – Initialize random motif θ
  – Iterate:
    • Given θ and sequence S, update site location A
    • Given A and S, update θ
  – EM updates by weighted average
  – Gibbs sampling updates by sampling

Statistical Model for Haplotype

  Haplotype       Frequency
  T T A C C --    θ_1
  T T A C G --    θ_2
  T T A G C --    θ_3
  T T A G G --    θ_4
  T T C C C --    θ_5
  T T C C G --    θ_6
  T T C G C --    θ_7
  T T C G G --    θ_8

[Figure: Haplotype Pool]

• Each individual's two haplotypes are treated as random draws from a pool of haplotypes with certain frequencies that can satisfy the genotyping

Haplotype Inference

EM and Gibbs Sampler
• Observe genotype Y; estimate haplotype pair Z for each individual and haplotype frequencies θ
• Initialize haplotype frequencies
• Iteration:
  – Estimate Z given Y, θ
  – Estimate θ given Y, Z
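The iteration above can be sketched for the EM case as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: genotypes are coded per site as 0, 1, or 2 copies of one allele, haplotype frequencies start uniform, and a fixed number of iterations replaces a convergence check. The function names are hypothetical. A Gibbs sampler would instead draw one compatible pair per individual in proportion to the same weights.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs (Z) that add up to the genotype (Y)."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het_sites, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=50):
    pairs_per_individual = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pairs_per_individual
                         for pair in pairs for h in pair})
    # Initialize haplotype frequencies (uniform start is an assumption)
    theta = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        expected = defaultdict(float)
        for pairs in pairs_per_individual:
            # E-step: weight each compatible pair by theta[a] * theta[b].
            # The factor 2 for a != b is constant within an individual and cancels.
            weights = [theta[a] * theta[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M-step: frequencies are expected counts over 2N chromosomes
        n_chromosomes = 2 * len(genotypes)
        theta = {h: expected[h] / n_chromosomes for h in haplotypes}
    return theta


if __name__ == "__main__":
    theta = em_haplotype_frequencies([(0, 0), (0, 0), (2, 2), (1, 1)])
    for hap, freq in theta.items():
        print(hap, round(freq, 4))
```

In the made-up example, the double heterozygote could be (0,0)+(1,1) or (0,1)+(1,0); because the homozygous individuals already support (0,0) and (1,1), the estimate shifts almost all weight to that pairing.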

Haplotype Inference

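Partition-Ligation
• When the number of SNPs is large, the number of possible haplotypes is too large to handle directly, so divide and conquer: partition the SNPs into smaller segments, infer sub-haplotypes within each, then ligate adjacent segments
  – Treat each inferred sub-haplotype as one allele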

HapMap of the Human Genome

• HapMap: catalog of common genetic variants in humans
  – What are these variants
  – Where do they occur in our DNA
  – How are they distributed within populations and between populations around the world
• Goals:
  – Define haplotype “blocks” across the genome
  – Identify a reference set of SNPs that “tag” each haplotype
  – Enable unbiased, genome-wide association studies

1000 Genomes Project

• Characterization of human genome sequence variation
• Foundation for investigating the relationship between genotype and phenotype

Summary

• SNP and CNV
• SNP distribution and characteristics
  – Allele frequency (minor allele >= 1%)
  – LD: linkage ~ physical proximity
  – Population stratification
• SNP genotyping: SNP arrays, sequencing
• Haplotype inference
  – Clark's: resolve unambiguous individuals first, then propose new haplotypes to maximize explanation
  – EM & Gibbs: iteratively infer haplotype frequencies and individuals' haplotypes

Acknowledgement

• Stefano Monti
• Jun Liu & Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse
• Cheng Li & Yuhyun Park
