R Packages for Genome-Wide Association Studies

Qunyuan Zhang
Division of Statistical Genomics
Statistical Genetics Forum
March 10, 2008
What is R?
R is a free software environment for statistical computing and graphics.
Runs on a wide variety of UNIX platforms, Windows and MacOS (interactive or batch mode)
Free and open source, can be downloaded from cran.r-project.org
Wide range of packages (base & contributed), novel methods available
Concise grammar & good structure (function, data object, methods and class)
Help from manuals and email group
Slow, time and memory consuming (can be overcome by parallel computation and/or
integration with C)
Popular, used by 70~80% of statisticians
R Task Views
http://cran.r-project.org/web/views/
Statistical Genetics Packages in R
http://cran.r-project.org/web/views/Genetics.html
Population Genetics : genetics (basic), Geneland (spatial structures of genetic data),
rmetasim (population genetics simulations), hapsim (simulation), popgen (clustering SNP
genotype data and SNP simulation), hierfstat (hierarchical F-statistics of genetic data), hwde
(modeling genotypic disequilibria), Biodem (biodemographical analysis), kinship (pedigree
analysis), adegenet (population structure), ape & apTreeshape (Phylogenetic and evolution
analyses), ouch (Ornstein-Uhlenbeck models), PHYLOGR (simulation and GLS model),
stepwise (recombination breakpoints)
Linkage and Association : gap (both population and family data, sample size calculations,
probability of familial disease aggregation, kinship calculation, linkage and association
analyses, haplotype frequencies), tdthap (TDT for haplotypes), powerpkg (power analyses for
the affected sib pair and the TDT design), hapassoc (likelihood inference of trait associations
with haplotypes in GLMs), haplo.ccs (haplotype and covariate relative risks in case-control
data by weighted logistic regression), haplo.stats (haplotype analysis for unrelated subjects),
tdthap (haplotype transmission/disequilibrium tests), ldDesign (experiment design for
association and LD studies), LDheatmap (heatmap of pairwise LD), mapLD (LD and
haplotype blocks), pbatR (R version of PBAT), GenABEL & SNPassoc for GWAS
QTL mapping for the data from experimental crosses: bqtl (inbred crosses and recombinant
inbred lines), qtl (genome-wide scans), qtlDesign (designing QTL experiments & power
computations), qtlbim (Bayesian Interval QTL Mapping)
Sequence & Array Data Processing : seqinr, Bioconductor packages
GenABEL
Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: an R package for
genome-wide association analysis. Bioinformatics. 2007, 23(10):1294-6.
GenABEL: genome-wide SNP association analysis
A package for genome-wide association analysis between quantitative or binary
traits and single-nucleotide polymorphisms (SNPs).
Version: 1.3-5
Depends: R (≥ 2.4.0), methods, genetics, haplo.stats, qvalue, MASS
Date: 2008-02-17
Author: Yurii Aulchenko, with contributions from Maksim Struchalin, Stephan
Ripke and Toby Johnson
Maintainer: Yurii Aulchenko <i.aoultchenko at erasmusmc.nl>
License: GPL (≥ 2)
In views: Genetics
CRAN checks: GenABEL results
GenABEL: Data Objects
gwaa.data-class
phdata: phenotypic data (data frame)
gtdata: genotypic data (snp.data-class)
snp.data()
nbytes: number of bytes used to store data on a SNP
nids: number of people
male: male code
idnames: ID names
nsnps: number of SNPs
snpnames: list of SNP names
chromosome: list of chromosomes corresponding to SNPs
coding: list of nucleotide coding for SNP names
strand: strands of the SNPs
map: list of SNP positions
gtps: genotypes (snp.mx-class)
2-bit storage: 0 = 00, 1 = 01, 2 = 10, 3 = 11 (saves 75%)
load.gwaa.data(phenofile = "pheno.dat", genofile = "geno.raw")
convert.snp.text() from text file (GenABEL default format)
convert.snp.ped() from Linkage, Merlin, Mach, and similar files
convert.snp.mach() from Mach format
convert.snp.tped() from PLINK TPED format
convert.snp.illumina() from Illumina/Affymetrix-like format
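The 2-bit storage idea can be illustrated in base R. This is only a sketch, not GenABEL's internal code: the helper names pack_genotypes and unpack_genotypes are hypothetical, and the bit order within a byte is an assumption. It packs four genotype codes (0 to 3) into one byte, which is where a 75% saving relative to one byte per genotype would come from.

```r
# Hypothetical helpers, not part of GenABEL: pack genotype codes 0..3
# (2 bits each) four to a byte, and unpack them again.
pack_genotypes <- function(g) {
  stopifnot(all(g %in% 0:3))
  pad <- (4 - length(g) %% 4) %% 4           # pad to a multiple of 4
  m <- matrix(c(g, rep(0, pad)), nrow = 4)   # one column per output byte
  # Assumed bit order: first genotype in the two highest bits.
  bytes <- m[1, ] * 64 + m[2, ] * 16 + m[3, ] * 4 + m[4, ]
  as.raw(bytes)
}

unpack_genotypes <- function(bytes, n) {
  b <- as.integer(bytes)
  g <- rbind(b %/% 64, (b %/% 16) %% 4, (b %/% 4) %% 4, b %% 4)
  as.vector(g)[seq_len(n)]                   # drop the padding
}

g <- c(0, 1, 2, 3, 3, 2, 1, 0, 2)
packed <- pack_genotypes(g)
restored <- unpack_genotypes(packed, length(g))
length(packed)                               # bytes used for 9 genotypes
```

Nine genotypes fit in three bytes here, and unpacking returns the original codes.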
GenABEL: Data Manipulation
snp.subset(): subset data by snp names or by QC criteria
add.phdata(): merge extra phenotypic data to the gwaa.data-class.
ztransform(): standard normalization of phenotypes
rntransform(): rank-normalization of phenotypes
npsubtreated(): non-parametric adjustment of phenotypes for medicated subjects
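The two phenotype transformations can be sketched in base R. The functions below are illustrative stand-ins with hypothetical names (z_transform, rank_normal), not the GenABEL implementations; in particular the 0.5 offset in the rank-based transform is one common convention and is an assumption here.

```r
# Standard normalization: subtract the mean, divide by the standard deviation.
z_transform <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Rank-based inverse normal transform; missing values stay missing.
rank_normal <- function(x) {
  r <- rank(x, na.last = "keep")
  n <- sum(!is.na(x))
  qnorm((r - 0.5) / n)
}

set.seed(1)
y <- rexp(200)          # skewed simulated phenotype
z <- z_transform(y)
rn <- rank_normal(y)
c(mean(z), sd(z))       # approximately 0 and 1
```

The rank-based version keeps the ordering of the subjects but forces the values onto normal quantiles, which is why it is often preferred for strongly skewed traits.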
GenABEL: QC & Summarization
summary.snp.data(): summary of snp data (number of observed
genotypes, call rate, allelic frequency, genotypic distribution, P-value of
HWE test)
check.trait(): summary of phenotypic data and outlier check based on
a specified p/FDR cut-off
check.marker(): SNP selection based on call rate, allele frequency and
deviation from HWE
HWE.show(): showing HWE tables, Chi2 and exact HWE P-values
perid.summary(): call rate and heterozygosity per person
ibs(): matrix of average IBS for a group of people & a given set of SNPs
hom(): average homozygosity (inbreeding) for a set of people, across
multiple markers
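Two of the QC quantities above, call rate and the chi-square test for Hardy-Weinberg equilibrium, are simple enough to sketch in base R. The helper names hwe_chisq and call_rate are hypothetical, the genotype coding (number of B alleles, NA for a missing call) is an assumption, and the sketch does not cover the exact HWE test.

```r
# Genotypes coded as the number of B alleles (0, 1, 2); NA = missing call.
hwe_chisq <- function(g) {
  g <- g[!is.na(g)]
  n <- length(g)
  obs <- c(sum(g == 0), sum(g == 1), sum(g == 2))
  p <- (2 * obs[3] + obs[2]) / (2 * n)            # frequency of allele B
  expd <- n * c((1 - p)^2, 2 * p * (1 - p), p^2)  # expected counts under HWE
  stat <- sum((obs - expd)^2 / expd)
  c(chisq = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

call_rate <- function(g) mean(!is.na(g))

# 49 AA, 42 AB, 9 BB and two missing calls: exactly in HWE for p(B) = 0.3.
g <- c(rep(0, 49), rep(1, 42), rep(2, 9), NA, NA)
cr <- call_rate(g)
hwe <- hwe_chisq(g)
```

A SNP with a low call rate or a very small HWE P-value would be a candidate for exclusion before the association scan.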
GenABEL: SNP Association Scans
scan.glm(): SNP association test using GLM in R library
scan.glm("y~x1+x2+…+CRSNP", family = gaussian(), data, snpsubset, idsubset)
scan.glm("y~x1+x2+…+CRSNP", family = binomial(), data, snpsubset, idsubset)
scan.glm.2D(): 2-snp interaction scan
Fast Scan (call C language)
ccfast(): case-control association analysis by computing chi-square test from 2x2 (allelic)
or 2x3 (genotypic) tables
emp.ccfast(): genome-wide significance (permutation) for ccfast() scan
qtscore(): association test (GLM) for a trait (quantitative or categorical)
emp.qtscore(): genome-wide significance (permutation) for qtscore() scan
mmscore(): score test for association between a trait and genetic polymorphism, in
samples of related individuals (needs stratification variable, scores are computed within
strata and then added up)
egscore(): association test, adjusted for possible stratification by principal components of
genomic kinship matrix (SNP correlation matrix)
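The 2x2 (allelic) and 2x3 (genotypic) tables behind a case-control chi-square test can be built by hand for a single SNP. The sketch below uses simulated data and base R's chisq.test(); it only illustrates the kind of test described for ccfast() and is not that function's code. All variable names and the simulated allele frequencies are assumptions.

```r
set.seed(42)
n <- 400
status <- rep(c(0, 1), each = n / 2)           # 0 = control, 1 = case
# Simulated genotypes (number of B alleles); cases carry B more often.
g <- rbinom(n, size = 2, prob = ifelse(status == 1, 0.40, 0.25))

# 2x2 allelic table: rows = status, columns = counts of allele A and B.
allelic <- rbind(
  control = c(A = sum(2 - g[status == 0]), B = sum(g[status == 0])),
  case    = c(A = sum(2 - g[status == 1]), B = sum(g[status == 1]))
)
allelic_test <- chisq.test(allelic, correct = FALSE)      # 1 d.f.

# 2x3 genotypic table: rows = status, columns = genotypes AA, AB, BB.
genotypic <- table(status, factor(g, levels = 0:2))
genotypic_test <- chisq.test(genotypic, correct = FALSE)  # 2 d.f.

c(allelic = allelic_test$p.value, genotypic = genotypic_test$p.value)
```

A genome-wide scan repeats this for every SNP, which is the reason the fast versions are written in C.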
GenABEL: Haplotype Association Scans
scan.haplo(): haplotype association test using GLM in R library
scan.haplo.2D(): 2-haplotype interaction scan
(haplo.stats package required)
Sliding window strategy
Posterior prob. of Haplotypes via EM algorithm
GLM-based score test for haplotype-trait association (Schaid DJ, Rowland CM,
Tines DE, Jacobson RM, Poland GA. 2002. Score tests for association of traits with haplotypes when
linkage phase is ambiguous. Am J Hum Genet 70: 425-434.)
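The EM step for haplotypes can be shown on the smallest case, two biallelic SNPs, where only double heterozygotes have ambiguous phase. The function em_two_snp below is a hypothetical teaching sketch in base R, not the haplo.stats algorithm; it assumes genotypes coded 0/1/2 with no missing values and simulates its own data.

```r
# Estimate frequencies of haplotypes 00, 01, 10, 11 from unphased genotypes.
em_two_snp <- function(g1, g2, tol = 1e-10, max_iter = 1000) {
  n <- length(g1)
  dh <- g1 == 1 & g2 == 1                    # phase-ambiguous subjects
  n_dh <- sum(dh)
  # Alleles on the two chromosomes; pairing is unambiguous when not dh.
  a1 <- cbind(g1 >= 1, g1 == 2)[!dh, , drop = FALSE]
  a2 <- cbind(g2 >= 1, g2 == 2)[!dh, , drop = FALSE]
  base <- tabulate(as.vector(1 + 2 * a1 + a2), nbins = 4)
  p <- rep(0.25, 4)
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that a double heterozygote is 00/11.
    w <- if (n_dh > 0) p[1] * p[4] / (p[1] * p[4] + p[2] * p[3]) else 0
    # M-step: expected haplotype counts divided by number of chromosomes.
    p_new <- (base + n_dh * c(w, 1 - w, 1 - w, w)) / (2 * n)
    done <- max(abs(p_new - p)) < tol
    p <- p_new
    if (done) break
  }
  setNames(p, c("00", "01", "10", "11"))
}

set.seed(2)
n <- 2000
truth <- c(0.4, 0.1, 0.2, 0.3)               # simulated haplotype frequencies
h1 <- sample(1:4, n, replace = TRUE, prob = truth)
h2 <- sample(1:4, n, replace = TRUE, prob = truth)
g1 <- (h1 - 1) %/% 2 + (h2 - 1) %/% 2        # unphased genotype at SNP 1
g2 <- (h1 - 1) %% 2 + (h2 - 1) %% 2          # unphased genotype at SNP 2
est <- em_two_snp(g1, g2)
round(est, 3)
```

With more SNPs in a sliding window the same idea applies, but the number of compatible haplotype pairs per subject grows quickly.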
GenABEL: GWAS results
from scan.glm, scan.haplo, ccfast, qtscore, emp.ccfast, emp.qtscore
scan.gwaa-class
Names:
snpnames: list of names of SNPs tested
P1df: p-values of 1-d.f. (additive or allelic) test for association
P2df: p-values of 2-d.f. (genotypic) test for association
Pc1df: p-values from the 1-d.f. test for association between SNP and trait; the
statistic is corrected for possible inflation
effB: effect of the B allele in allelic test
effAB: effect of the AB genotype in genotypic test
effBB: effect of the BB genotype in genotypic test
Map: list of map positions of the SNPs
Chromosome: list of chromosomes the SNPs belong to
Idnames: list of subjects used in analysis
Lambda: inflation factor estimate, as computed using lower portion (say, 90%) of
the distribution, and standard error of the estimate
Formula: formula/function used to compute p-values
Family: family of the link function / nature of the test
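The relation between an inflation factor and corrected p-values can be sketched in base R. Note that this example swaps in the simpler median-based estimator rather than the lower-portion estimate described for Lambda, and it runs on simulated chi-square statistics; the variable names are assumptions.

```r
set.seed(7)
# Simulated 1-d.f. chi-square statistics, inflated by a factor of 1.2.
chi2 <- 1.2 * rchisq(10000, df = 1)

# Median-based inflation factor: observed median over the expected median.
lambda <- median(chi2) / qchisq(0.5, df = 1)

# Uncorrected p-values, and p-values after dividing the statistic by lambda.
p_raw <- pchisq(chi2, df = 1, lower.tail = FALSE)
p_corrected <- pchisq(chi2 / lambda, df = 1, lower.tail = FALSE)
lambda
```

Dividing every statistic by the estimated factor brings the bulk of the distribution back to its null shape, so corrected p-values are never smaller than the raw ones when the factor exceeds 1.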
GenABEL: Table & Graphic Functions
descriptives.marker(): table of marker info.
descriptives.trait(): table of trait info.
descriptives.scan(): table of scan results
plot.scan.gwaa(): plot of scan results
plot.check.marker(): plot of marker data (QC etc.)
GenABEL: Computer Efficiency
2000 subjects x 500K chip
Memory: ~3.2 G
Loading time: ~4 Min.
SNP summary: ~1 Min.
Call ccfast: ~0.5 Min.
Call qtscore: ~2 Min.
Total: < 10 Min.
Permutation test (N=10,000): 73~120 hrs, 3~5 days
Intel Xeon 2.8GHz processor, SuSE Linux 9.2, R 2.4.1
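The cost of the permutation test follows from its structure: every permutation repeats the whole scan. A small base R sketch of a max-statistic permutation on simulated data is given here; it is not the emp.qtscore() code, and the statistic used (n times the squared correlation, as a stand-in for a 1-d.f. score test), the sample sizes and the function name scan_stat are all assumptions.

```r
set.seed(5)
n <- 200                                       # subjects
m <- 50                                        # SNPs
G <- matrix(rbinom(n * m, 2, 0.3), nrow = n)   # subjects x SNPs genotypes
y <- rnorm(n)                                  # quantitative trait under the null

# One 1-d.f. statistic per SNP: n times the squared trait-SNP correlation.
scan_stat <- function(y, G) nrow(G) * as.vector(cor(y, G))^2

obs <- scan_stat(y, G)
n_perm <- 200
# Null distribution of the largest statistic across the whole scan.
max_null <- replicate(n_perm, max(scan_stat(sample(y), G)))

# Genome-wide empirical p-value for each SNP.
p_emp <- sapply(obs, function(s) (1 + sum(max_null >= s)) / (n_perm + 1))
min(p_emp)
```

Run time grows in proportion to the number of permutations times the cost of one scan, which is consistent with a 10,000-permutation run taking days when a single scan takes minutes.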
SNPassoc
An R package to perform whole genome association studies, Juan R. González et al. Bioinformatics, 2007,
23(5):654-655
SNPassoc: SNPs-based whole genome association studies
This package carries out the most common analyses when performing whole genome
association studies. These analyses include descriptive statistics and exploratory
analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of
association based on generalized linear models (either for quantitative or binary
traits), and analysis of multiple SNPs (haplotype and epistasis analysis).
Permutation test and related tests (sum statistic and truncated product) are also
implemented.
Version: 1.4-9
Depends: R (≥ 2.4.0), haplo.stats, survival, mvtnorm
Date: 2007-Oct-16
Author: Juan R González, Lluís Armengol, Elisabet Guinó, Xavier Solé, and Víctor Moreno
Maintainer: Juan R González <jrgonzalez at imim.es>
License: GPL version 2 or newer
URL: http://www.r-project.org and http://davinci.crg.es/estivill_lab/snpassoc
In views: Genetics
CRAN checks: SNPassoc results
SNPassoc: Data & Summary
setupSNP(data=snp-pheno.table, info=map.table,
colSNPs=, sep = "/", ...)
summary()
allele frequencies
percentage of missing values
HWE test
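What the summary reports for one SNP can be reproduced by hand from genotypes written with the "/" separator. The base R sketch below uses a made-up genotype vector and does not call SNPassoc; the variable names are assumptions.

```r
# Genotypes for one SNP as "allele/allele" strings; NA = missing.
geno <- c("A/G", "A/A", "G/G", "A/G", NA, "A/A")

# Split observed genotypes into single alleles and tabulate them.
alleles <- unlist(strsplit(geno[!is.na(geno)], "/", fixed = TRUE))
allele_freq <- prop.table(table(alleles))

# Percentage of subjects without a genotype call.
pct_missing <- 100 * mean(is.na(geno))

allele_freq
pct_missing
```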
SNPassoc: Association Tests
WGassociation(y~x1+x2, data=, model = (codominant, dominant, recessive,
overdominant, log-additive or all), quantitative = , level = 0.95)
scanWGassociation(): only p values
association(): only for selected snps, can do stratified, GxE interaction analyses
Results
Summary: a summary table by genes/chromosomes
WGstats: detailed output (case-control numbers, percentages, odds ratios / mean
differences, 95% confidence intervals, P-value for the likelihood ratio test of association,
and AIC, etc.)
Pvalues: a table of p-values for each genetic model for each SNP
Plot: p values in the -log scale for plot.WGassociation()
Labels: returns the names of the SNPs analyzed
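The five genetic models differ only in how the genotype enters the regression. The base R sketch below codes one simulated SNP under each model and computes a likelihood ratio p-value with glm(); it illustrates the model definitions and is not the WGassociation() implementation. The simulated effect size and all variable names are assumptions.

```r
set.seed(3)
n <- 500
g <- rbinom(n, 2, 0.3)                        # number of B alleles
y <- rbinom(n, 1, plogis(-0.5 + 0.4 * g))     # simulated case-control status

codings <- data.frame(
  dominant     = as.integer(g >= 1),   # AB or BB versus AA
  recessive    = as.integer(g == 2),   # BB versus AA or AB
  overdominant = as.integer(g == 1),   # AB versus AA or BB
  log_additive = g                     # per-allele trend
)

null_fit <- glm(y ~ 1, family = binomial())

# Likelihood ratio test of a fitted model against the intercept-only model.
lrt_p <- function(fit) {
  pchisq(null_fit$deviance - fit$deviance,
         df = null_fit$df.residual - fit$df.residual,
         lower.tail = FALSE)
}

p_models <- sapply(codings, function(x) lrt_p(glm(y ~ x, family = binomial())))
# Codominant model: genotype as a three-level factor (2 d.f.).
p_models["codominant"] <- lrt_p(glm(y ~ factor(g), family = binomial()))
round(p_models, 4)
```

Comparing the p-values across codings shows which inheritance pattern fits a given SNP best, at the price of testing several models per SNP.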
SNPassoc: Multiple-SNP Analysis
SNP–SNP Interaction
interactionPval(): epistasis analysis between all pairs of SNPs (and covariates).
Haplotype Analysis
haplo.glm(): using the R package haplo.stats, association analysis of haplotypes with a response via GLM
haplo.interaction(): interactions between haplotypes (and covariates)
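For a single pair of SNPs, an epistasis test can be written as a likelihood ratio test between a model with main effects only and a model that adds the interaction term. The base R sketch below does this on simulated data with a log-additive coding, which is an assumption; it is not the interactionPval() code, and the variable names are hypothetical.

```r
set.seed(11)
n <- 600
snp1 <- rbinom(n, 2, 0.3)
snp2 <- rbinom(n, 2, 0.4)
# Simulated binary trait with a built-in SNP-SNP interaction.
y <- rbinom(n, 1, plogis(-1 + 0.2 * snp1 + 0.2 * snp2 + 0.5 * snp1 * snp2))

fit_main <- glm(y ~ snp1 + snp2, family = binomial())   # main effects only
fit_int  <- glm(y ~ snp1 * snp2, family = binomial())   # adds snp1:snp2

# Likelihood ratio test for the interaction term.
lr_stat <- fit_main$deviance - fit_int$deviance
lr_df <- fit_main$df.residual - fit_int$df.residual
p_interaction <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_interaction
```

Scanning all pairs repeats this for every combination of two SNPs, so the number of tests grows with the square of the number of SNPs.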
SNPassoc: Computer Efficiency
1000 subjects X 3000 SNPs
5 min. import data
40 min. setupSNP()
30 min. scanWGassociation(): only p values (including permutation test)
Memory usage: 750 MB