HapMap module


The International HapMap Project:
a Rich Resource of Genetic Information
Julia Krushkal
Department of Preventive Medicine
The University of Tennessee Health Science Center
jkrushka{at}utmem.edu
HapMap Population Samples
Project launched in 2002 to provide a public resource for
accelerating medical genetic research
270 Individuals from 4 Geographically Diverse Populations
YRI: 90 Yorubans from Ibadan, Nigeria
30 parent-offspring trios
CEU: 90 individuals of northern and western European descent
living in Utah, USA, from the Centre d’Etude du Polymorphisme
Humain (CEPH) collection
30 parent-offspring trios
CHB: 45 unrelated Han Chinese from Beijing, China
JPT: 45 unrelated Japanese from Tokyo, Japan
http://www.hapmap.org/
HapMap
http://www.genome.gov/page.cfm?pageID=10001688 NHGRI
The International HapMap Project
“…Determine the common patterns of DNA sequence variation
in the human genome, by characterizing sequence variants,
their frequencies, and correlations between them, in DNA
samples from populations with ancestry from parts of Africa,
Asia and Europe.”
Nature (2003)
• Population-specific sequence variation
• Allele frequencies
• Linkage disequilibrium patterns
• Haplotype information
• Tag SNPs
• Structural genome variation
• Better understanding of human population dynamics and of
the history of human populations
• Cell lines available from Coriell Inst. for Medical Research
• A rich resource for biomedical genetic analysis
International HapMap Project Papers
• The Int. HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851-861. 2007.
• The Int. HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437:1299-1320. 2005.
• The Int. HapMap Consortium. The International HapMap Project. Nature 426:789-796. 2003.
• The Int. HapMap Consortium. Integrating Ethics and Science in the International HapMap Project. Nature Reviews Genet 5:467-475. 2004.
• Thorisson et al. The International HapMap Project Web site. Genome Res 15:1591-1593. 2005.
HapMap-related papers
• Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449:913-918. 2007.
• Clark et al. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15:1496-1502. 2005.
• Clayton et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nature Genet 37(11):1243-1246. 2005.
• de Bakker et al. Efficiency and power in genetic association studies. Nature Genet 37(11):1217-1223. 2005.
• Goldstein, Cavalleri. Genomics: Understanding human diversity. Nature 437:1241-1242. 2005.
• Hinds et al. Whole genome patterns of common DNA variation in three human populations. Science 307:1072-1079. 2005.
• Myers et al. A fine-scale map of recombination rates and hotspots across the human genome. Science 310:321-324. 2005.
• Nielsen et al. Genomic scans for selective sweeps using SNP data. Genome Res 15:1566-1575. 2005.
• Smith et al. Sequence features in regions of weak and strong linkage disequilibrium. Genome Res 15:1519-1534. 2005.
• Weir et al. Measures of human population structure show heterogeneity among genomic regions. Genome Res 15:1468-1476. 2005.
Nature (2003)
Human Chromosomes
• Contain DNA
• 22 pairs of autosomes +
sex-chromosomes (X and Y) + mitochondrial
genome
• Contain functional units (genes) and other DNA
Human genome sequence is available
as a reference, as a result of the
Human Genome Project
A significant amount of inter-individual
variation exists
Some Basic Definitions
Locus - A site in the genome
The DNA in the human genome is not a static entity.
There are differences between different copies:
Allele – a genetic variant, i.e., a form (state) of a locus
Mutation - a genetic change
An individual carries two copies of each locus on autosomes
Alleles are passed from parents to offspring
(1 from each parent)
Genotype - A set of alleles an individual is carrying at a
given locus
Chromosomes are sets of continuously linked genetic loci
Example:
Integrated map
of chromosome 5
from the
International
HapMap Project,
http://www.hapmap.org
Genetic Variation
• Some DNA loci vary among individuals
• Linked genetic loci are inherited non-independently
• Loci may change with time (mutation, selection, genetic drift)
• Some DNA changes lead to quantitative changes in RNA expression
and to quantitative or qualitative changes in protein production
• Some genetic changes, even small, may lead to disease
• A large amount of natural variation occurs in healthy individuals, i.e.,
many changes are neutral
• Loci genetically linked to the disease-causing locus can be used as
genetic markers to search for the disease locus
There are many types of DNA variation, e.g.
Sequence variation (SNP1, SNP2): AAAC/TGGCTA
Microsatellite repeats: …AATG AATG AATG AATG…
[Figure: DNA → Transcription → RNA → Translation → Protein → Phenotype; ENVIRONMENT]
Polymorphic Site
A locus with common DNA variation
≥ 2 alleles in a population
Shows difference in DNA sequence among individuals
In most definitions:
the most common allele has frequency < 99%,
or minor allele frequency (MAF) ≥ 1%,
or MAF ≥ 2%,
or at least two alleles have frequencies ≥ 1%.
A locus whose rare allele occurs in <1% of the population is usually
not considered a polymorphic site.
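The frequency-based definitions above can be checked with a small calculation. The following is a minimal sketch, assuming genotypes are given as two-letter strings; the function names and the sample data are hypothetical and not part of the original slides.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as 'AC'."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes):
    """Frequency of the second most common allele (0.0 if monomorphic)."""
    freqs = sorted(allele_frequencies(genotypes).values(), reverse=True)
    return freqs[1] if len(freqs) > 1 else 0.0


def is_polymorphic(genotypes, threshold=0.01):
    """Apply the 'MAF >= threshold' definition of a polymorphic site."""
    return minor_allele_frequency(genotypes) >= threshold


# Hypothetical genotypes of ten individuals at one A/C SNP (illustration only).
sample = ["AA", "AC", "AA", "CC", "AC", "AA", "AA", "AC", "AA", "AA"]
maf = minor_allele_frequency(sample)
```

Changing `threshold` to 0.02 or 0.05 gives the stricter definitions listed on the slide.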
SNP=Single Nucleotide Polymorphism
A SNP locus on the distal end of the long arm of human
chromosome 5 (data from Ensembl)
SNP locus rs6870660
http://www.ensembl.org
CAAATTCCATG[A or C]AGAAGGAAATACAT
A and C are alleles at SNP locus rs6870660
A SNP locus on the distal end of the long arm of chromosome 5
SNP locus rs6870660
http://www.hapmap.org
Regulatory Interactions: The ENCODE Project
2003: Pilot project launched (1% of the genome)
2007: Pilot project completed; production phase launched on the
entire genome
High-throughput experimental and computational approaches to
studies of DNA regulatory sites, regulatory interactions, and DNA
modification
[Figure labels: Pilot Scale Effort, Production Scale Effort,
Technology Development Effort, Data Coordination Center]
Genome SNP Variation
Size of human genome is ≈ 3.2 × 10^9 bp
Any two individuals are about 99.9% identical
9-10 million SNPs may have MAF ≥ 5%
≈ 30,000 genes
HapMap SNP Density Coverage
• Phase I (published in 2005)
1,007,329 SNPs that passed quality control
1 SNP / 3000 bp
11,500 nsSNP
10 ENCODE regions, 500 kb each
17,944 SNPs
1 SNP / 279 bp
• Phase II (published in 2007)
>3,806,000 SNPs
1 SNP / 875 bp
25-30% of all SNPs with MAF ≥ 5%
[Figure caption: The cumulative number of non-redundant SNPs (each
mapped to a single location in the genome) is shown as a solid line,
as well as the number of SNPs validated by genotyping (dotted line)
and double-hit status (dashed line). Years are divided into quarters
(Q1–Q4).]
http://www.hapmap.org/
SNP Differences among Individuals Far
Exceed Differences among Populations
Phase 1:
Autosomes: Across the 1 million SNPs genotyped,
only 11 have fixed differences between CEU and
YRI, 21 between CEU and CHB/JPT, and 5 between
YRI and CHB/JPT.
X chromosome: 123 SNPs were completely
differentiated between YRI and CHB/JPT, but only 2
between CEU and YRI and 1 between CEU and
CHB/JPT.
Haplotypes
A haplotype is a set of alleles at multiple loci
located on the same copy of the chromosome
Genotype calls obtained from sequencing or DNA chip
genotyping do not provide the information about which of
the two chromosomal copies a particular allele belongs to.
E.g., genotypes for individual X:
SNP#     Genotype (alleles)   Genotype (bases)
SNP A    A1 A2                A T
SNP B    B1 B2                T C
SNP C    C1 C2                G C

Haplotypes
Haplotype 1: A1 B2 C2 = A C C
Haplotype 2: A2 B1 C1 = T T G
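The phase ambiguity described above can be made concrete by listing every haplotype pair that is consistent with the unphased genotypes of individual X. The following is a minimal sketch; the function name and the input layout (one allele pair per SNP) are assumptions made for illustration.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """List the unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one 2-tuple of alleles per SNP, e.g. [("A", "T"), ("T", "C")].
    """
    pairs = set()
    # For each SNP, choose which of its two alleles sits on the first copy;
    # the other allele then sits on the second copy.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Genotypes of individual X from the slide: SNP A = A/T, SNP B = T/C, SNP C = G/C.
individual_x = [("A", "T"), ("T", "C"), ("G", "C")]
pairs = possible_haplotype_pairs(individual_x)
```

With three heterozygous SNPs there are four candidate pairs, and the pair shown on the slide (A C C with T T G) is only one of them. Genotype data alone cannot choose among them, which is why phase has to be inferred from trios or statistically.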
Recombination
“Random” event
Occurs during meiosis
The larger the distance between loci, or the more generations that
pass, the more likely it is that recombination will occur between them
[Figure: parental chromosomes A1 B1 and A2 B2 undergoing
recombination (crossing-over)]
Nonrecombinant haplotypes: A1 B1, A2 B2
Recombinant haplotypes: A1 B2, A2 B1
Two ancestral
chromosomes being
scrambled through
recombination over many
generations to yield
different descendant
chromosomes. If an A
allele on the ancestral
chromosome increases the
risk of a disease, the two
individuals in the current
generation who inherit that
part of the ancestral
chromosome will be at
increased risk.
Source: the International HapMap
Project
Linkage Disequilibrium
Associations among alleles at different loci
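[Figure: Locus A with alleles A1 and A2; Locus B with alleles B1 and B2]
D = Linkage disequilibrium coefficient (coefficient of association):
$D = p_{A_1B_1} - p_{A_1}p_{B_1}$
Normalized disequilibrium coefficient:
$D' = D / |D|_{\max}$, with $-1 \le D' \le 1$
$|D|_{\max} = \min(p_{A_1}p_{B_2},\ p_{A_2}p_{B_1})$ for $D > 0$, and $\min(p_{A_1}p_{B_1},\ p_{A_2}p_{B_2})$ for $D < 0$
Correlation coefficient:
$r = D / \sqrt{p_{A_1}p_{A_2}p_{B_1}p_{B_2}}$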
In case of no association, D=0 (linkage equilibrium)
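As a worked illustration of these coefficients, the sketch below computes D, D' and r² from the frequency of the A1B1 haplotype and the two allele frequencies. The function name and the example frequencies are hypothetical. The bound used when D is negative, min(pA1·pB1, pA2·pB2), is the standard companion to the slide's formula for |D|max.

```python
def ld_measures(p_a1b1, p_a1, p_b1):
    """Return (D, D', r^2) for two biallelic loci A and B."""
    p_a2, p_b2 = 1.0 - p_a1, 1.0 - p_b1
    d = p_a1b1 - p_a1 * p_b1
    # |D|max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a1 * p_b2, p_a2 * p_b1)
    else:
        d_max = min(p_a1 * p_b1, p_a2 * p_b2)
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a1 * p_a2 * p_b1 * p_b2
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical frequencies: A1 and B1 are each at 50%, and the A1B1 haplotype
# is at 40% instead of the 25% expected under linkage equilibrium.
d, d_prime, r2 = ld_measures(p_a1b1=0.4, p_a1=0.5, p_b1=0.5)
```

For these inputs D is about 0.15, D' about 0.6 and r² about 0.36. The example shows that D' and r² can differ substantially for the same pair of loci.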
Practical implications in fine gene mapping:
Search for locus B using association of marker loci with disease
The value of D decreases geometrically with
each generation
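$D(t) = (1 - \theta)\,D(t-1)$
$D(t) = (1 - \theta)^{t}\,D(0)$
where $\theta$ is the recombination fraction between the two loci
[Figure: loci A/a and B/b separated by recombination fraction $\theta$]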
Unless the two loci are closely linked, the value of D
should rapidly decrease to 0.
The occurrence of association between two loci implies
that they are closely linked.
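The geometric decay can be explored numerically. The following is a minimal sketch that writes the recombination fraction between the two loci as `theta`; the symbol, the function names and the numbers are assumptions chosen for illustration.

```python
import math


def ld_after_generations(d0, theta, t):
    """D(t) = (1 - theta)**t * D(0)."""
    return (1.0 - theta) ** t * d0


def generations_until(d0, theta, target):
    """Smallest whole number of generations after which |D| <= target.

    Assumes 0 < theta < 1 and target > 0.
    """
    if abs(d0) <= target:
        return 0
    return math.ceil(math.log(target / abs(d0)) / math.log(1.0 - theta))


# Unlinked loci (theta = 0.5): D halves every generation.
unlinked = generations_until(d0=0.25, theta=0.5, target=0.01)
# Tightly linked loci (theta = 0.001): the same decay takes far longer.
linked = generations_until(d0=0.25, theta=0.001, target=0.01)
```

Under these assumptions the unlinked pair falls below the target within a handful of generations, while the tightly linked pair needs thousands. This is the reasoning behind using association to locate closely linked loci.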
Haplotype Maps Generated by The
International HapMap Project
3 steps in the construction of the HapMap
(a) SNPs are identified in DNA samples from multiple individuals.
(b) Adjacent SNPs that are inherited together are compiled into
haplotypes.
(c) "Tag" SNPs are identified within haplotypes that uniquely
describe those haplotypes.
Source: The International HapMap Project
Haplotype Maps of the Human Genome
Helmuth 2001, Science 293:583-585
Find correlations among
groups of SNPs
Haplotypes were inferred for the HapMap
project from trio data and from unrelated
individuals using PHASE (Stephens et al. 2001;
Stephens and Donnelly 2003)
Haplotype Maps of the Human Genome
Genome regions decomposed into discrete haplotype blocks, which
capture similarity in haplotype organization
Patil et al. 2001, Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science 294(5547):1719-23
Haplotype Block Partition Results for Three Populations
1,586,383 SNPs genotyped in 71 Americans of
European, African, and Asian ancestry

Population          Blocks    Average size, kb*   Required SNPs†
African-American    235,663   8.8                 570,886
European-American   109,913   20.7                275,960
Han Chinese         89,994    25.2                220,809

* Average distance spanned by segregating sites in each block.
† Minimum number of SNPs required to distinguish common
haplotype patterns with frequencies of 5% or higher.
Hinds et al. 2005 Science
Hinds et al 2005
Extended LD bin and haplotype block structure around the CFTR gene. LD bins, where
each bin has at least one SNP with r² > 0.8 with every other SNP, are depicted as light
horizontal bars, with the positions of constituent SNPs indicated by vertical tick marks as
well as the extreme ends of the bars. Isolated SNPs are indicated by plain tick marks.
Haplotype blocks, within which at least 80% of observed haplotypes could be grouped
into common patterns with frequencies of at least 5%, are depicted as dark horizontal
bars. Unlike haplotype blocks that are by design sequential and nonoverlapping, SNPs
in one LD bin can be interdigitated with SNPs in multiple other overlapping bins.
Population differences in local bin structure
Differences in allele and haplotype frequencies
“Although analysis panels are characterized both by different
haplotype frequencies and, to some extent, different combinations of
alleles, both common and rare haplotypes are often shared across
populations” (The Int. HapMap Project, Nature, 2005)
Tag SNP (htSNP) selection
Pairwise LD-based and haploblock-based tagging methods
Partition haplotypes into blocks
Can use haplotype-based (haploblocks) or genotype-based
(LD-blocks) partitioning
Select representative htSNPs from each block
Latest DNA microarrays aim to capture SNPs with r² ≥ 0.8
“Tags are the subset of variants genotyped in a disease
study. SNPs that are not typed in the study but whose
effect can be studied through LD with a tag are termed
proxies. A tag with perfect correlation (r² = 1) to an untyped
putative causal allele is termed a perfect proxy.”
De Bakker et al., 2005
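Pairwise LD-based tagging can be sketched as a greedy covering procedure: repeatedly pick the SNP that captures the most still-uncaptured SNPs at r² ≥ 0.8. The code below is a generic illustration, not necessarily the algorithm used by HapMap or by de Bakker et al. The function names, the 0/1 haplotype encoding and the toy data are assumptions.

```python
def r_squared(haplotypes, i, j):
    """r^2 between SNPs i and j from phased haplotypes written as 0/1 strings."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (p_ij - p_i * p_j) ** 2 / denom


def greedy_tags(haplotypes, threshold=0.8):
    """Greedily choose tag SNPs so that every SNP is a tag or has a proxy tag."""
    n_snps = len(haplotypes[0])
    # captures[i] = SNPs (including i itself) in LD >= threshold with SNP i
    captures = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Pick the uncaptured SNP covering the most uncaptured SNPs
        # (ties go to the lowest index).
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Toy phased haplotypes over 4 SNPs: SNPs 0, 1 and 2 are perfectly correlated
# (SNP 2 inversely), while SNP 3 is independent of the others.
haps = ["1100", "1101", "0010", "0011"]
tags = greedy_tags(haps)
```

In this toy data, SNP 0 serves as a perfect proxy for SNPs 1 and 2, so only SNPs 0 and 3 need to be genotyped.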
Tag SNP, Haplotypes, and LD
The Int. HapMap Consortium, Nature, 2005
Use of Haplotypes in Association Analysis
•Testing one marker at a time for associations is very
time-consuming
•Problem of multiple testing
•Testing individual SNPs, we are not utilizing information
from other markers
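One common way to handle the multiple testing problem when markers are tested one at a time is a Bonferroni correction, which divides the significance level by the number of tests. The slides do not name a specific method, so the sketch below is only an illustration; the marker names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


def significant_markers(p_values, alpha=0.05):
    """Return the markers whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical single-marker association p-values (illustration only).
p_values = {
    "marker_1": 0.04,
    "marker_2": 0.0001,
    "marker_3": 0.2,
    "marker_4": 0.012,
    "marker_5": 0.009,
}
hits = significant_markers(p_values)

# With one million markers the per-test threshold becomes very strict.
genome_wide = bonferroni_threshold(0.05, 1_000_000)
```

Because SNPs in LD are correlated, this correction is conservative. That is part of the motivation for combining markers into haplotypes rather than testing each one separately.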
Benefits of Using Haplotypes
•Haplotypes allow us to use information from multiple loci
simultaneously
•LD information between loci is captured
Benefits of Haplotype Analysis
•Construct a single highly informative mega-locus
from a number of less informative but closely linked
loci
•Identify genotyping or data entry errors.
Likelihood ratio tests indicate which typings are
more likely to be an error
•Find boundaries of conserved haplotypes
associated with a trait.
•Employs recombinations from the entire history of a
population
Amount of Captured Sequence
Variation in HapMap Phase II
For common variants (MAF ≥ 0.05), the mean maximum r² of any
SNP to a typed one is 0.90 in YRI, 0.96 in CEU and 0.95 in
CHB/JPT.
1.09 million SNPs capture all common Phase II SNPs with r² ≥
0.8 in YRI.
Very common SNPs with MAF ≥ 0.25 are captured extremely well
(mean maximum r² of 0.93 in YRI to 0.97 in CEU).
Rarer SNPs with MAF < 0.05 are less well covered (mean maximum
r² of 0.74 in CHB/JPT to 0.76 in YRI).
Recombination Hot Spots
Structural Genome Variation
HapMap samples are also used as a resource for CNV analysis
• Large number of copy number variants (CNVs) and other
genome rearrangements found among individuals
• Some variation is assumed normal; other variation may cause
disease
• Genome databases, e.g. the Database of Genomic Variants
at the TCAG of the Hospital for Sick Children in Toronto, the
Copy Number Variation Project Map at the Sanger Institute
• Segmental duplications are recombination hotspots, causing global
genome rearrangements
HapMap Genome Browser
Perlegen Genotype Browser
UCSC Genome Browser
http://genome.ucsc.edu/
DNA Chips and Resequencing:
High-throughput Analysis of Sequence Variation
An easy way to access genome-wide variation
Both Affymetrix and Illumina DNA chips contain representative SNP and
CNV probes
Affymetrix GeneChip 6.0:
1.8 million markers for genetic variation, including 906,000 SNPs and
946,000 copy number probes.
Illumina 1M Bead Chip and 1M-duo Bead Chip:
~950,000 genome-spanning tag SNPs;
~100,000 additional non-HapMap SNPs,
>565,000 SNPs in and near coding regions such as nsSNPs, promoter
regions, 3’ and 5’ UTRs; dense coverage in ADME and MHC regions.
~260,000 markers located in novel and reported copy number
polymorphic regions.
Sequenom mass arrays (based on MALDI-TOF)
Genome-Wide Association
Select representative htSNPs from low diversity
haplotype blocks
Adjustment for multiple comparisons
LD values highly variable: smoothing function needed
Haplotypes in a sliding window
OR screen for top SNPs
likely functional SNPs
SNPs in genes involved in pathways of interest
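The "haplotypes in a sliding window" idea can be sketched as follows: move a fixed-width window across the SNPs and tabulate the sub-haplotypes seen in each window, which can then be tested for association window by window. The function name, the window width and the toy haplotypes are assumptions made for illustration.

```python
from collections import Counter


def window_haplotypes(haplotypes, width, step=1):
    """Yield (start index, Counter of sub-haplotypes) for each window of SNPs."""
    n_snps = len(haplotypes[0])
    for start in range(0, n_snps - width + 1, step):
        yield start, Counter(h[start:start + width] for h in haplotypes)


# Toy phased haplotypes over 4 SNPs (illustration only).
haps = ["ACGT", "ACGA", "TCGA", "TGCA"]
windows = list(window_haplotypes(haps, width=2))
```

Each window yields its own haplotype counts, so the number of tests grows with the number of windows and still requires adjustment for multiple comparisons.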
Use of Phase-Resolved Data
in Association Analysis
•Find association with haplotypes, similar to
analyses of individual SNP alleles; need to
consider multiple testing
•Test for tendency of cases to ‘cluster’ around
groups of ‘similar’ haplotypes
•Extend log-linear approach to take haplotype
structure into account
Modifications also used for ambiguous phase
http://www.genome.gov/26525384
As of 04/14/2008, GWAS of
150 traits posted
Special Thanks to
• Ken Manly, whose presentation ideas
for the HapMap module 2006 inspired
and helped organize this
presentation