Single Nucleotide Polymorphism Linkage Disequilibrium And Haplotypes

Xiaole Shirley Liu

Outline

• Definition and motivation
• SNP distribution and characteristics
  – Allele frequency, LD, population stratification
• SNP and genotyping
• Haplotype inference:
  – Clark's algorithm
  – EM and Gibbs sampling
• HapMap project and 1000 Genomes

STAT115

Polymorphism

• Polymorphism: sites/genes with "common" variation, i.e. minor allele frequency >= 1%; otherwise the variant is called rare and the site is not considered polymorphic
• Single Nucleotide Polymorphism
  – Arises from a DNA replication mistake in an individual germ-line cell, which is then transmitted
  – ~90% of human genetic variation
• Copy number variations
  – May or may not be genetic

Why Should We Care

• Disease gene discovery
  – Association studies, e.g. certain SNPs are associated with susceptibility to diabetes
  – Chromosome aberrations: duplications / deletions might cause cancer
• Personalized medicine
  – A drug may only be effective if you carry a particular allele

SNP Distribution

• Most common type of variation, about 1 SNP per 100-300 bp
  – Balance between the rate at which mutations are introduced and the rate at which polymorphisms are lost
  – Most mutations are lost within a few generations
• 2/3 are C/T differences
• In non-coding regions, there are often fewer SNPs in more conserved regions
• In coding regions, there are often more synonymous than non-synonymous SNPs

SNP Characteristics: Allele Frequency Distribution

• Most alleles are rare (minor allele frequency < 10%)

SNP Characteristics: Linkage Disequilibrium

• Hardy-Weinberg equilibrium
  – In a population with genotypes AA, aa, and Aa, if p = freq(A) and q = freq(a), the frequencies of AA, aa, and Aa will be $p^2$, $q^2$, and $2pq$ respectively at equilibrium.
  – Similarly with two loci, each with two alleles (A/a and B/b)
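The equilibrium frequencies above can be checked numerically. The sketch below assumes random mating and, for the two-locus case, that alleles at the two loci are inherited independently, so each haplotype frequency is the product of its allele frequencies. The function names and the example frequencies (0.7 and 0.4) are illustrative and do not come from the slides.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequency p = freq(A)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def equilibrium_haplotypes(p_a, p_b):
    """Expected two-locus haplotype frequencies when the loci are independent.

    p_a = freq(A) at the first locus, p_b = freq(B) at the second locus.
    """
    return {
        "AB": p_a * p_b,
        "Ab": p_a * (1 - p_b),
        "aB": (1 - p_a) * p_b,
        "ab": (1 - p_a) * (1 - p_b),
    }


# Illustrative allele frequencies (assumed values, not from the slides)
geno = hardy_weinberg(0.7)
haps = equilibrium_haplotypes(0.7, 0.4)
print(geno)
print(haps)
```

Both dictionaries sum to 1, which is a quick sanity check on any set of expected frequencies.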

SNP Characteristics: Linkage Disequilibrium

[Figure panels: Equilibrium, Disequilibrium]

• LD: if alleles at two loci occur together more often than can be accounted for by chance, this suggests the two loci are physically close on the DNA
  – In mammals, LD is often lost at ~100 kb
  – In fly, LD often decays within a few hundred bases
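"More often than can be accounted for by chance" can be made quantitative. The slides do not name a measure, so the sketch below uses two standard ones as an assumption: the coefficient D = freq(AB) - freq(A) freq(B), and the squared correlation r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The function name `ld_stats` is hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic loci.

    p_ab: observed frequency of the AB haplotype
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # excess of AB over the independence expectation
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Complete LD: A always travels with B
d_full, r2_full = ld_stats(0.5, 0.5, 0.5)
# Equilibrium: AB occurs exactly as often as chance predicts
d_none, r2_none = ld_stats(0.25, 0.5, 0.5)
print(d_full, r2_full)
print(d_none, r2_none)
```

D = 0 corresponds to the equilibrium panel; a nonzero D corresponds to disequilibrium.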

SNP Characteristics: Linkage Disequilibrium

• Statistical significance of LD
  – Chi-square test (or Fisher's exact test)
  – $e_{ij} = n_{i\cdot} \, n_{\cdot j} / n_T$
  – $\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$

          B1        B2        Total
A1        n_11      n_12      n_1.
A2        n_21      n_22      n_2.
Total     n_.1      n_.2      n_T
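As a worked instance of the chi-square test on the 2x2 haplotype count table, the sketch below computes the expected counts e_ij from the row and column totals and sums the squared deviations. It uses only the standard library. The counts in `counts` are made-up illustrative numbers, and `ld_chi_square` is a hypothetical helper name.

```python
def ld_chi_square(table):
    """Pearson chi-square statistic for a table of haplotype counts n_ij.

    table[i][j] is the count of haplotypes carrying allele A(i+1) and B(j+1).
    """
    row_tot = [sum(row) for row in table]        # n_i.
    col_tot = [sum(col) for col in zip(*table)]  # n_.j
    n_total = sum(row_tot)                       # n_T
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n_total  # expected count
            chi2 += (n_ij - e_ij) ** 2 / e_ij
    return chi2


# Illustrative counts (assumed, not from the slides)
counts = [[10, 20],
          [30, 40]]
chi2 = ld_chi_square(counts)
print(round(chi2, 4))  # 0.7937
```

A 2x2 table has one degree of freedom, so the statistic is compared against the chi-square distribution with 1 df (3.84 is the 5% critical value). With small counts, Fisher's exact test is the usual alternative, as the slide notes.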

SNP Characteristics: Linkage Disequilibrium

• Haplotype block: a cluster of linked SNPs
• Haplotype boundary: LD is strong within blocks and absent between blocks; the boundaries reflect recombination hotspots
• Haplotype size distribution

SNP Characteristics: Linkage Disequilibrium

• [C/T] [A/G] T X C [A/C] [T/A]
  – Possible haplotypes: $2^4 = 16$
  – In reality, a few common haplotypes explain 90% of the variation
• Tagging SNPs:
  – SNPs that capture most of the variation in haplotypes
  – Removes redundancy among SNPs
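One simple way to see how tagging removes redundancy is to search for the smallest set of SNP positions that still tells the common haplotypes apart. The sketch below does this by brute force, which is only practical for small blocks; it is an illustration, not the selection method used by any particular project. The three haplotypes are hypothetical, written over the four variable sites [C/T] [A/G] [A/C] [T/A].

```python
from itertools import combinations


def tag_snps(haplotypes):
    """Smallest set of site indices whose alleles distinguish every haplotype."""
    n_sites = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for k in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), k):
            # Project each haplotype onto the candidate tag sites
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == n_distinct:
                return sites
    return tuple(range(n_sites))


# Hypothetical common haplotypes over the four variable sites
common = ["CAAT", "TGCA", "TACT"]
tags = tag_snps(common)
print(tags)  # (0, 1)
```

Here two of the four SNPs are enough to identify each haplotype, so the other two are redundant for genotyping purposes.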

SNP Genotyping

• One SNP at a time or genome-wide (SNP array)

40 Probes Used Per SNP

• Allele call
  – AA, BB, AB
• Signal
  – Theoretically 1A+1B, 2A, 2B
  – But could have 1A+3B (amplified!)

Haplotype

• Haplotype: cluster of SNPs in LD
  – A block with 10 SNPs has $2^{10}$ possible haplotypes
  – Only 5-6 haplotypes are observed in practice (> 90% of cases)
  – Tagging SNPs: a subset of SNPs sufficient to identify a haplotype
• Association (with disease) studies using haplotypes are more accurate than those using a single SNP locus
• Haplotype inference: e.g. genotype Aa BB Cc

Haplotype Inference

• Genotyping only tells us that an individual is, e.g., Aa BB Cc, but it doesn't tell whether the haplotypes are ABC + aBc or ABc + aBC
• Haplotypes can often be inferred if the parental genotypes are known
  – Similar to blood typing, e.g. F: A, M: AB, C: B → F: AO, M: AB, C: BO
• Otherwise, look at the population genotypes and infer common haplotypes
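The ambiguity in the Aa BB Cc example can be enumerated directly: every heterozygous site can be phased two ways, and homozygous sites contribute the same allele to both haplotypes. The sketch below lists the distinct unordered haplotype pairs for a genotype; the two-letter-per-site input format and the function name are assumptions made for this illustration.

```python
from itertools import product


def haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of two-letter strings, one per site, e.g. ["Aa", "BB", "Cc"].
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for i, swap in zip(het, choice):
            if swap:  # flip the phase at this heterozygous site
                h1[i], h2[i] = h2[i], h1[i]
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


pairs = haplotype_pairs(["Aa", "BB", "Cc"])
print(pairs)  # [('ABC', 'aBc'), ('ABc', 'aBC')]
```

With k heterozygous sites there are 2^(k-1) distinct pairs (for k >= 1), which is why phasing quickly becomes ambiguous as the number of SNPs grows.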

Haplotype Inference

Clark's Algorithm
1. Construct haplotypes from unambiguous individuals
2. Remove samples that can be explained as combinations of haplotypes already discovered
3. Propose the haplotype that would explain the most remaining samples
4. Iterate steps 2 and 3 until finished
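The four steps can be sketched as follows. This is a minimal illustration of the procedure as worded on the slide, not a reference implementation. It assumes each genotype is a tuple of per-site counts of the alternate allele (0, 1, or 2) and each haplotype is a tuple of 0/1; the function names are hypothetical.

```python
def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if impossible."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(a in (0, 1) for a in other) else None


def explained(genotype, known):
    """A pair of known haplotypes that adds up to `genotype`, or None."""
    for h in sorted(known):
        c = complement(genotype, h)
        if c is not None and c in known:
            return (h, c)
    return None


def clark(genotypes):
    known = set()
    # Step 1: individuals with at most one heterozygous site are unambiguous
    for g in genotypes:
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            known.update((h1, complement(g, h1)))
    resolved, remaining = {}, list(genotypes)
    while True:
        # Step 2: remove samples explained by haplotypes found so far
        still = []
        for g in remaining:
            pair = explained(g, known)
            if pair is not None:
                resolved[g] = pair
            else:
                still.append(g)
        remaining = still
        if not remaining:
            break
        # Step 3: propose the new haplotype that explains the most remaining samples
        candidates = {complement(g, h) for g in remaining for h in known} - {None}
        if not candidates:
            break  # no known haplotype fits any remaining sample
        best = max(
            sorted(candidates),
            key=lambda c: sum(explained(g, known | {c}) is not None for g in remaining),
        )
        known.add(best)
    return resolved, remaining


genotypes = [(0, 2, 0), (1, 2, 1)]
resolved, unresolved = clark(genotypes)
print(resolved)
print(unresolved)
```

In the example, the first individual is homozygous everywhere and yields haplotype (0, 1, 0); the second is then explained as (0, 1, 0) plus the newly proposed (1, 1, 1). If no individual is unambiguous, the procedure cannot start and every sample is returned as unresolved.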

EM and Gibbs Sampling in Motif Finding

• Problem
  – Observe: sequence
  – Unknown: motif θ and site location A (alignment), but given one, can infer the other
• EM and Gibbs Sampler
  – Initialize random motif θ
  – Iterate:
    • Given θ and the sequence, update site location A
    • Given A and the sequence, update θ
  – EM updates by weighted average
  – Gibbs sampling updates by sampling

Statistical Model for Haplotype

[Figure: haplotype pool. Each haplotype below has its own frequency in the pool.]

Haplotype
TTACC
TTACG
TTAGC
TTAGG
TTCCC
TTCCG
TTCGC
TTCGG

• Each individual's two haplotypes are treated as random draws from a pool of haplotypes with certain frequencies, subject to the pair being consistent with the individual's genotype

Haplotype Inference

EM and Gibbs Sampler
• Observe genotype Y; estimate the haplotype pair Z for each individual and the haplotype frequencies
• Initialize haplotype frequencies
• Iteration:
  – Estimate Z given Y and the haplotype frequencies
  – Estimate the haplotype frequencies given Y and Z
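The iteration can be sketched for the EM case as below. This is a minimal illustration under stated assumptions: genotypes are tuples of alternate-allele counts (0, 1, 2) per site, haplotype pairs are drawn under Hardy-Weinberg proportions, frequencies start uniform, and the number of iterations is fixed instead of testing convergence. A Gibbs sampler would replace the weighted average in the E-step with sampling one pair per individual. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs whose allele counts add up to `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=50):
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform initialization
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: weight of each compatible pair given current frequencies
            w = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in ps]
            total = sum(w)
            for (a, b), wi in zip(ps, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: frequencies from expected haplotype counts (2 per individual)
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


# Illustrative data: two 00/00 and two 11/11 homozygotes, one double heterozygote
genotypes = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
freq = em_haplotype_freqs(genotypes)
for hap, f in freq.items():
    print(hap, round(f, 3))
```

In this example the double heterozygote could be 00 + 11 or 01 + 10; because 00 and 11 are already common in the population, EM assigns nearly all the weight to 00 + 11 and the frequencies of 01 and 10 shrink toward zero.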

Haplotype Inference

Partition-Ligation
• When the number of SNPs is large, the number of possible haplotypes is too large to handle at once, so divide and conquer
  – Treat each inferred sub-haplotype as one allele

HapMap of Human Genome

• HapMap: catalog of common genetic variants in humans
  – What are these variants
  – Where do they occur in our DNA
  – How are they distributed within populations and between populations around the world
• Goals:
  – Define haplotype "blocks" across the genome
  – Identify a reference set of SNPs to "tag" each haplotype
  – Enable unbiased, genome-wide association studies

1000 Genomes Project

• Characterization of human genome sequence variation
• Foundation for investigating the relationship between genotype and phenotype

Summary

• SNP and CNV
• SNP distribution and characteristics
  – Allele frequency (minor allele frequency >= 1%)
  – LD: linkage ~ physical proximity
  – Population stratification
• SNP genotyping: SNP arrays, sequencing
• Haplotype inference
  – Clark's: resolve unambiguous individuals first, then propose new haplotypes to maximize explanation
  – EM & Gibbs: iteratively infer haplotype frequencies and individuals' haplotypes

Acknowledgement

• Stefano Monti
• Jun Liu & Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse
• Cheng Li & Yuhyun Park
