Transcript Lecture

Imputation for GWAS
6 December 2012
• Imputation describes the process of predicting
genotypes that have not been directly typed
in a sample of individuals:
• missing genotypes at typed variants;
• genotypes at un-typed variants that are present in
an external high-density “reference panel” of
phased haplotypes.
• In silico genotypes can be tested for
association within a standard generalised linear
regression framework.
How does imputation work?
What is the purpose of imputation?
• Increased power. The reference panel is more
likely to contain the causal variant (or a better
tag) than a GWAS array.
• Fine-mapping. Imputation provides a high-resolution
overview of an association signal across a locus.
• Meta-analysis. Imputation allows GWAS
typed with different arrays to be combined up
to variants in the reference panel.
Increased power and improved fine-mapping resolution
IMPUTEv2 and minimac
• Pre-phasing. Estimate haplotypes at variants
typed in the study sample (scaffold).
• Haploid imputation. Study sample
haplotypes are considered an unknown path
through haplotypes from the reference panel.
• Hidden Markov model (HMM).
• Switch probability between reference
haplotypes depends on recombination rate.
• Allelic mismatch between reference and
observed haplotypes can be incorporated
by allowing for low rate of mutation.
• Less computationally demanding than diploid
imputation that attempts to jointly phase
and impute simultaneously (IMPUTEv1 and
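The haploid HMM step described above can be sketched on a toy example. This is a minimal illustration in the spirit of that model, not the IMPUTEv2 or minimac implementation: the function name `impute_haplotype`, the constant `switch` probability (a stand-in for a recombination-dependent rate) and the `mismatch` rate are all assumptions made for illustration.

```python
def impute_haplotype(ref_haps, observed, switch=0.05, mismatch=0.01):
    """Posterior probability of allele 1 at each site of one study haplotype.

    ref_haps: list of reference haplotypes, each a list of 0/1 alleles.
    observed: study haplotype, 0/1 at typed sites and None at un-typed sites.
    switch:   assumed probability of switching reference haplotype between sites.
    mismatch: assumed probability that the observed allele differs from the copied one.
    """
    K, M = len(ref_haps), len(observed)

    def emission(k, m):
        # Un-typed sites carry no information about the hidden state.
        if observed[m] is None:
            return 1.0
        return 1.0 - mismatch if ref_haps[k][m] == observed[m] else mismatch

    # Forward pass: probability of the data up to site m, copying haplotype k at m.
    fwd = [[0.0] * K for _ in range(M)]
    for k in range(K):
        fwd[0][k] = emission(k, 0) / K
    for m in range(1, M):
        total = sum(fwd[m - 1])
        for k in range(K):
            stay = (1.0 - switch) * fwd[m - 1][k]
            jump = switch * total / K
            fwd[m][k] = (stay + jump) * emission(k, m)

    # Backward pass: probability of the data after site m, given haplotype k at m.
    bwd = [[1.0] * K for _ in range(M)]
    for m in range(M - 2, -1, -1):
        weighted = [bwd[m + 1][k] * emission(k, m + 1) for k in range(K)]
        total = sum(weighted)
        for k in range(K):
            bwd[m][k] = (1.0 - switch) * weighted[k] + switch * total / K

    # Posterior over copied haplotypes, summed over those carrying allele 1.
    probs = []
    for m in range(M):
        post = [fwd[m][k] * bwd[m][k] for k in range(K)]
        norm = sum(post)
        probs.append(sum(p for p, h in zip(post, ref_haps) if h[m] == 1) / norm)
    return probs


# Toy reference panel of two haplotypes; site 1 is un-typed in the study sample.
reference = [[0, 0, 0, 0], [1, 1, 1, 1]]
study = [1, None, 1, 1]
allele_probs = impute_haplotype(reference, study)
```

In this toy case the typed sites all match the second reference haplotype, so the un-typed site is imputed as allele 1 with high probability. Real software works on many more haplotypes and sites and rescales the forward and backward values to avoid numerical underflow.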
Reference panels
• Large-scale genotyping and re-sequencing
reference panels made available through
HapMap Consortium and 1000 Genomes Project.
• HapMap2. 60 CEU, 60 YRI and 90 CHB/JPT individuals
typed for ~3M variants.
• HapMap3. 1011 individuals from multiple ethnic
groups typed for ~1.6M variants.
• 1000 Genomes. Most recent release includes 1094
individuals from multiple ethnic groups typed for
~30M variants (including indels).
Choice of reference panel
• Imputation software is designed for use with 1000
Genomes reference panels, but remains
computationally demanding.
• Making use of the “all ancestries” reference panel
(rather than ethnic-specific reference panel)
improves imputation accuracy for rare variants.
• Formatted reference panels for IMPUTEv2 and
minimac can be downloaded from the software
Factors affecting imputation accuracy
• Scaffold. Number of individuals and GWAS
array used for genotyping (coverage of
• Reference panel. Number of individuals and
density of typing. Similarity of ancestry with
study sample.
• Minor allele frequency.
• Pre-phasing or diploid imputation (minimal).
Imputation accuracy
Imputation quality control
• Pre-imputation. Essential that GWAS scaffold excludes
poor quality variants. Common to exclude MAF<1%
• Post imputation. Imputation quality assessed by
“information measures” in range 0-1.
• A variant with information measure α in a scaffold of N individuals
has power equivalent to αN perfectly genotyped individuals.
• Typical to filter SNPs by α (exclude <0.8, <0.4).
• IMPUTEv2 “info score” and minimac r̂².
• In loci identified through imputation, important to
check quality of typed SNPs in the scaffold in the region
by visual inspection of cluster plots.
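The post-imputation measures above can be illustrated with a small sketch. The lecture does not give the formulas used by IMPUTEv2 or minimac, so the variance-ratio measure below (observed variance of imputed allele counts divided by the variance expected under Hardy-Weinberg equilibrium) is an assumption in the spirit of an r̂²-type measure, and the function names are hypothetical.

```python
def variance_ratio_quality(dosages):
    """Observed dosage variance over the variance expected under HWE.

    dosages: imputed expected allele counts (0 to 2), one per individual.
    Assumed form of an r2-type quality measure; not the exact software formula.
    """
    n = len(dosages)
    mean = sum(dosages) / n
    freq = mean / 2.0  # estimated allele frequency
    if freq <= 0.0 or freq >= 1.0:
        return 0.0  # monomorphic: no information
    observed_var = sum((d - mean) ** 2 for d in dosages) / n
    expected_var = 2.0 * freq * (1.0 - freq)
    return observed_var / expected_var


def effective_sample_size(alpha, n):
    """Number of perfectly genotyped individuals with equivalent power (alpha * N)."""
    return alpha * n


# Perfectly called genotypes in HWE proportions give a ratio of 1.
certain = [0.0] * 25 + [1.0] * 50 + [2.0] * 25
# Completely uninformative imputation (everyone gets the mean) gives 0.
uninformative = [1.0] * 100

quality_certain = variance_ratio_quality(certain)
quality_uninformative = variance_ratio_quality(uninformative)
n_effective = effective_sample_size(0.4, 16179)
```

With an information measure of 0.4 in 16,179 samples, the effective sample size is roughly 6,472 individuals, which is why the filtering threshold matters for power.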
Analysis of imputed genotypes
• For each individual, imputation provides
probability distribution of possible genotypes at
each un-typed variant from the reference panel.
• Using best guess genotype, or filtering on
probability of best guess genotype can increase
false positives and reduce power.
• Convert probabilities to “expected allele count”,
i.e. p1+2p2.
• Fully take account of the uncertainty in the
imputation in a “missing data likelihood”.
• Software: SNPTEST2 (for IMPUTEv2) and
Mach2Dat (for minimac).
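The difference between a best guess genotype and the expected allele count p1+2p2 can be shown with a short sketch. The function names and the example probabilities are illustrative assumptions, not output from SNPTEST2 or Mach2Dat.

```python
def expected_allele_count(p0, p1, p2):
    """Expected allele count (dosage) from genotype probabilities: p1 + 2*p2."""
    return p1 + 2.0 * p2


def best_guess_genotype(p0, p1, p2):
    """Most probable genotype (0, 1 or 2 copies of the allele)."""
    probs = (p0, p1, p2)
    return probs.index(max(probs))


# Illustrative genotype probabilities for one individual at one un-typed variant.
probs = (0.1, 0.6, 0.3)
dosage = expected_allele_count(*probs)  # 0.6 + 2 * 0.3
guess = best_guess_genotype(*probs)     # heterozygote is the most probable call
```

Here the best guess is a heterozygote (1 copy), while the expected allele count is about 1.2. The dosage keeps part of the uncertainty that the hard call throws away, and it can be used directly as the predictor in a regression model.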
Rare variants and complex disease
• Rare variants are likely to have arisen from
founder effects in the last few generations.
• Rare variants are expected to have larger
effects on complex traits than common variants.
• Statistical methods focus on the accumulation
of minor alleles at rare variants (mutational
load) within the same functional unit.
• Test of association of phenotype with proportion of rare
variants at which individuals carry minor alleles.
Example (from figure): an individual carrying minor alleles at
3 of 10 rare variants has pi = 3/10.
• Model disease phenotype via regression on pi and any
other covariates in GLM framework.
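The proportion pi for directly typed genotypes can be computed as below. This is a minimal sketch of the quantity described above, not the GRANVIL code; the function name and the example genotypes are assumptions for illustration.

```python
def carrier_proportion(genotypes):
    """Proportion of rare variants at which an individual carries a minor allele.

    genotypes: minor allele counts (0, 1 or 2) at the rare variants in one
    functional unit (for example, a gene) for a single individual.
    """
    carried = sum(1 for g in genotypes if g > 0)
    return carried / len(genotypes)


# Illustrative individual: minor alleles at 3 of 10 rare variants in a gene.
genotypes = [0, 1, 0, 0, 2, 0, 0, 1, 0, 0]
p_i = carrier_proportion(genotypes)
```

Note that a rare homozygote counts once, the same as a heterozygote, because pi records whether a minor allele is carried at each variant. The resulting pi is then used as the predictor in the regression model.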
Reedik Magi
Assaying rare genetic variation
• Gold-standard approach to assaying rare genetic
variation is through re-sequencing, which is
expensive on the scale of the whole genome.
• GWAS genotyping arrays are inexpensive, but are
not designed to capture rare genetic variation.
• Increasing availability of large-scale reference
panels of whole-genome re-sequencing data:
1000 Genomes Project and the UK10K Project.
• Impute into GWAS scaffolds up to these reference
panels to recover genotypes at rare variants at no
additional cost, other than computing.
GRANVIL: imputed variants
• Test of association of phenotype with proportion of rare
variants at which individuals carry minor alleles.
Example (from figure): summing the imputed carrier probabilities
across 10 rare variants gives pi = 3.0/10.
• Replace direct genotypes with posterior probability of
heterozygous or rare homozygous call from imputation.
• Model disease phenotype via regression on pi and any
other covariates in GLM framework.
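For imputed data, the same proportion can be sketched by replacing each carrier indicator with the posterior probability of a heterozygous or rare homozygous call. The function name and the probabilities below are illustrative assumptions and are not taken from the slide figure or from GRANVIL itself.

```python
def imputed_carrier_proportion(genotype_probs):
    """Expected proportion of rare variants at which a minor allele is carried.

    genotype_probs: one (p_het, p_rare_hom) pair per rare variant, giving the
    imputation posterior probabilities of the heterozygous and rare
    homozygous genotypes for a single individual.
    """
    expected_carried = sum(p_het + p_hom for p_het, p_hom in genotype_probs)
    return expected_carried / len(genotype_probs)


# Illustrative posterior probabilities at four rare variants in one gene.
probs = [(0.5, 0.0), (0.3, 0.2), (0.0, 0.0), (0.9, 0.1)]
p_i = imputed_carrier_proportion(probs)
```

With certain genotype calls (probabilities of 0 or 1) this reduces to the proportion used for directly typed variants.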
Application to WTCCC
• GWAS of seven complex human diseases from the UK
(2000 cases each and 3000 shared controls from 1958
British Birth Cohort and National Blood Service):
• bipolar disorder (BD), coronary artery disease (CAD),
Crohn’s disease (CD), hypertension (HT), rheumatoid
arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D).
• Individuals genotyped using the Affymetrix GeneChip
500K Mapping Array Set.
• After quality control, 16,179 samples and 391,060
autosomal SNPs (MAF>1%) carried forward for analysis.
Fine-scale UK population structure
• Fine-scale population structure may have
greater impact on rare variants than on
common SNPs because of recent founder effects.
• Utilised EIGENSTRAT to construct principal
components to represent axes of genetic
variation across the UK: 27,770 high-quality
LD-pruned (r²<0.2) common autosomal SNPs
Fine-scale UK population structure
• SNPs mapped to NCBI build 37 of the human genome.
• Samples imputed up to 1000 Genomes Phase 1
cosmopolitan reference panel (June 2011 interim release).
• 8.23M imputed autosomal rare variants
(MAF<1%) polymorphic in WTCCC.
• 5.38M (65.3%) were “well-imputed” (i.e. Info
score > 0.4) and carried forward for analysis.
• Mean info score was 0.618, and 17.3% had info
score > 0.8.
Rare variant analysis
• Test for association of each disease with
accumulation of rare variants (MAF<1%) within
genes using GRANVIL.
• Gene boundaries defined from UCSC human
genome database (build 37).
• Analyses adjusted for three principal components
to account for fine-scale UK population structure.
• Genome-wide significance threshold p<1.7×10⁻⁶:
Bonferroni adjustment for 30,000 genes.
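The threshold above follows from dividing a family-wise error rate by the number of genes tested. The sketch below assumes a family-wise error rate of 0.05, which the slide does not state explicitly; the function name is hypothetical.

```python
def bonferroni_threshold(n_tests, family_wise_alpha=0.05):
    """Per-test significance threshold under a Bonferroni adjustment."""
    return family_wise_alpha / n_tests


# 30,000 genes tested, assuming an overall error rate of 0.05.
threshold = bonferroni_threshold(30000)
formatted = f"{threshold:.1e}"
```

Dividing 0.05 by 30,000 gives approximately 1.67×10⁻⁶, which rounds to the 1.7×10⁻⁶ quoted on the slide.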
No evidence of residual population structure
Rare variant association with T1D
• Genome-wide significant evidence of association of T1D
with rare variants in multiple genes from the MHC.
• Strongest signal of association observed for HLA-DRA
• Gene contains 23 well imputed rare variants with mean
MAF of 0.32%.
• Accumulations of minor alleles across these variants were
associated with decreased risk of disease: odds ratio 0.556
(0.476-0.650) per minor allele.
T1D association across the MHC
• Ten genes achieve genome-wide significant evidence of
rare variant association with T1D.
T1D association across the MHC
• After additional adjustment for additive effect of lead
GWAS common variant from the MHC (rs9268645).
T1D association across the MHC
• GRANVIL assumes the same direction of effect on
the trait of all rare variants within the functional unit.
• Methods allowing for different directions of
effect of rare variants are well established for
re-sequencing data, and are being generalised to
allow for imputation.
• The most powerful rare variant test will depend
on the underlying genetic architecture of the trait.