Associating Genomic Variations with Phenotypes

Model comparison, rare variants, and analysis pipeline
Qunyuan Zhang
Division of Statistical Genomics & Genome Institute
Washington University School of Medicine
1
Data & Question
i     Y     X
1     y1    x11   x12   ...   x1m
2     y2    x21   x22   ...   x2m
...   ...   ...   ...   ...   ...
n     yn    xn1   xn2   ...   xnm
Y: phenotypes (quantitative, categorical)
X: genotypes (SNP, insertion, deletion, duplication, inversion, translocation, …)
Question: what is the relationship between X and Y?
2
Linkage & Association
i     Y     Genotypes
1     y1    x11   q12   x12   ...
2     y2    x21   q22   x22   ...
...   ...   ...   ...   ...   ...
n     yn    xn1   qn2   xn2   ...
Y: phenotype
X: observed marker genotypes
Q: putative QTL, unobservable (marker 1 --r1-- Q --r2-- marker 2)
Association: (Y, X)
Linkage: (Y, Q)
3
A Fixed-effect Mixture Model For Linkage
Commonly used in plant genetics
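Cross design: P1 x P2 -> F1 -> F2
Marker map: SNP A --r1-- Q --r2-- SNP B
$$f(y_i) = \sum_{j=1}^{3} P(Q_j \mid X_i, r)\, \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu_j}{\sigma_j}\right)^2\right]$$
$$L(Y) = \prod_{i=1}^{n} f(y_i)$$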
4
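The mixture likelihood on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the conditional QTL genotype probabilities P(Q_j | X_i, r) are taken as given inputs rather than derived from the marker map, and all numbers and function names are made up for the example.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_density(y, probs, mus, sigmas):
    """f(y_i) = sum_j P(Q_j | X_i, r) * N(y_i; mu_j, sigma_j^2)."""
    return sum(p * normal_pdf(y, m, s) for p, m, s in zip(probs, mus, sigmas))

def mixture_log_likelihood(ys, prob_rows, mus, sigmas):
    """log L(Y) = sum_i log f(y_i); prob_rows[i] holds P(Q_j | X_i, r), j = 1..3."""
    return sum(math.log(mixture_density(y, probs, mus, sigmas))
               for y, probs in zip(ys, prob_rows))

# Hypothetical inputs: three subjects, three QTL genotype classes.
ys = [10.2, 11.9, 14.1]
prob_rows = [[0.90, 0.09, 0.01],
             [0.05, 0.90, 0.05],
             [0.01, 0.09, 0.90]]
mus = [10.0, 12.0, 14.0]      # genotype-class means (assumed values)
sigmas = [1.0, 1.0, 1.0]      # genotype-class standard deviations (assumed values)
print(mixture_log_likelihood(ys, prob_rows, mus, sigmas))
```

In practice the means, variances and QTL position would be estimated by maximizing this log-likelihood, for example with an EM algorithm.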
A Variance-component Model For Linkage
Commonly used in human genetics
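Marker map: SNP A --r1-- Q --r2-- SNP B (Δ_Q at Q)
$$L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu)^T V^{-1} (Y-\mu)\right]$$
$$V = \mathrm{Cov}(Y) = \Delta_Q\sigma_Q^2 + \Delta_g\sigma_g^2 + I\sigma_e^2$$
Δ_Q: QTL IBD matrix
Δ_g: background IBD matrix
I: diagonal unit matrix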
5
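A small Python sketch of the variance-component likelihood above, assuming a scalar mean μ and using a Cholesky factorization to obtain both |V| and the quadratic form. The IBD matrices and variance values are hypothetical, chosen only to make the example run; it is an illustration, not the code behind the slides.

```python
import math

def cholesky(V):
    """Lower-triangular L with L L^T = V, for symmetric positive-definite V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def covariance(delta_q, delta_g, var_q, var_g, var_e):
    """V = Delta_Q * var_q + Delta_g * var_g + I * var_e."""
    n = len(delta_q)
    return [[delta_q[i][j] * var_q + delta_g[i][j] * var_g
             + (var_e if i == j else 0.0) for j in range(n)] for i in range(n)]

def mvn_log_likelihood(y, mu, V):
    """log of (2 pi)^(-n/2) |V|^(-1/2) exp(-(y-mu)^T V^-1 (y-mu) / 2)."""
    n = len(y)
    L = cholesky(V)
    z = []  # solve L z = (y - mu); then z^T z equals the quadratic form
    for i in range(n):
        z.append((y[i] - mu - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

# Hypothetical sib pair: sharing 0.75 at the QTL, 0.5 in the background.
delta_q = [[1.0, 0.75], [0.75, 1.0]]
delta_g = [[1.0, 0.5], [0.5, 1.0]]
V = covariance(delta_q, delta_g, var_q=1.0, var_g=2.0, var_e=1.0)
print(mvn_log_likelihood([1.0, 3.0], 2.0, V))
```

Setting var_q to zero gives the null model without a QTL; comparing the two maximized log-likelihoods is the usual basis for a linkage test.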
Variance-component Model
= Random-effect Linear Model
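$$Y = \mu + Z_Q\gamma_Q + Z_g\gamma_g + e$$
Random effects:
$$\gamma_Q \sim MVN(0, \Delta_Q\sigma_Q^2), \quad \gamma_g \sim MVN(0, \Delta_g\sigma_g^2), \quad e \sim N(0, \sigma_e^2)$$
$$V = \Delta_Q\sigma_Q^2 + \Delta_g\sigma_g^2 + I\sigma_e^2$$
$$L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu)^T V^{-1} (Y-\mu)\right]$$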
6
From Linkage to Association
Linkage model (γ_Q: QTL effect(s)):
$$Y = \mu + Z_Q\gamma_Q + Z_g\gamma_g + e$$
Family-based association model (β: marker fixed effect(s)):
$$Y = \mu + X\beta + Z_g\gamma_g + e$$
$$V = \Delta_g\sigma_g^2 + I\sigma_e^2$$
$$L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu-X\beta)^T V^{-1} (Y-\mu-X\beta)\right]$$
7
A Simple Association Model
For Unrelated Subjects
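$$Y = \mu + X\beta + e$$
$$V = I\sigma_e^2$$
$$L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu-X\beta)^T V^{-1} (Y-\mu-X\beta)\right] = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu - X_i\beta}{\sigma_e}\right)^2\right]$$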
8
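For unrelated subjects and a single marker, maximizing the likelihood above over μ and β is the same as ordinary least squares. The sketch below fits the model and returns a t statistic for H0: β = 0. The genotype coding (0, 1, 2 copies of an allele) and the data are assumptions made for illustration.

```python
import math

def simple_association(y, x):
    """Least-squares fit of y = mu + x*beta + e; returns (mu, beta, se_beta, t)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    mu = y_bar - beta * x_bar
    rss = sum((yi - mu - beta * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_e = rss / (n - 2)              # residual variance estimate
    se_beta = math.sqrt(sigma2_e / sxx)
    return mu, beta, se_beta, beta / se_beta

# Hypothetical data: allele counts and a quantitative phenotype for six subjects.
x = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
mu, beta, se_beta, t = simple_association(y, x)
print(mu, beta, se_beta, t)
```

The t statistic would be compared with a t distribution on n - 2 degrees of freedom; in a genome-wide scan the same fit is repeated for every marker.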
Covariate(s):
Adjusting For Confounder(s)
$$Y = \mu + X\beta + X_C\beta_C + e$$
Observed confounders: age, sex etc.
Hidden confounders: population structure
Population structure can be estimated by:
-PCA
-Clustering
-Admixture/ancestry
9
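As an illustration of the PCA option, the sketch below computes the first principal component of a small genotype matrix; the resulting scores could then enter the model as a column of X_C. Power iteration is used here as a simple stand-in for a full eigendecomposition, and the genotypes (two made-up subpopulations of three subjects each) are hypothetical.

```python
import math

def first_principal_component(genotypes, iterations=200):
    """Unit-length PC1 scores of subjects from the column-centred genotype matrix.

    Power iteration on the n x n matrix K = Xc Xc^T.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    xc = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    k = [[sum(a * b for a, b in zip(xc[i], xc[l])) for l in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]   # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(k[i][l] * v[l] for l in range(n)) for i in range(n)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v

# Hypothetical allele counts: rows are subjects, columns are SNPs.
genotypes = [[2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],   # subpopulation A
             [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 1, 2]]   # subpopulation B
pc1 = first_principal_component(genotypes)
print(pc1)
```

With real data one would normally use many markers, standardize each column, and keep several leading components rather than one.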
Modeling Hidden Genetic Correlation
Between Subjects
$$Y = \mu + X\beta + X_C\beta_C + Z_g\gamma_g + e$$
Xβ: marker fixed effect(s)
X_Cβ_C: covariate fixed effect(s)
Z_gγ_g: genetic background random effects
$$V = \Delta_g\sigma_g^2 + I\sigma_e^2$$
Family data, pedigree => IBD matrix
Population data, hidden, marker data => IBS matrix
10
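For population data, an IBS matrix can be computed directly from marker genotypes. The sketch below uses one common definition, assumed here: at each marker two subjects share 2 - |g1 - g2| alleles identical by state, and the proportion shared is averaged over markers. The genotypes are made up for the example.

```python
def ibs_matrix(genotypes):
    """Mean proportion of alleles shared identical-by-state for each pair.

    genotypes[i][j] is the allele count (0, 1 or 2) of subject i at marker j.
    """
    n, m = len(genotypes), len(genotypes[0])
    return [[sum(1.0 - abs(genotypes[a][j] - genotypes[b][j]) / 2.0
                 for j in range(m)) / m
             for b in range(n)] for a in range(n)]

# Hypothetical allele counts for three subjects at three markers.
example_genotypes = [[0, 1, 2],
                     [0, 1, 2],
                     [2, 1, 0]]
ibs = ibs_matrix(example_genotypes)
for row in ibs:
    print(row)
```

The resulting matrix would play the role of Δ_g in V; other similarity estimators (for example, allele-frequency-weighted kinship) are used in practice as well.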
Modeling Rare Variants
$$Y = \mu + X\beta + X_C\beta_C + Z_g\gamma_g + e$$
Common variants are tested individually, H0: β1 = 0. One p-value per variant.
$$Y = \mu + \beta_1 X_1 + \dots$$
Rare variants are tested as an entire group (burden test), usually by gene, H0: β1 = β2 = … = βk = 0. One p-value per group of variants.
$$Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots$$
The group model can be combined with variable selection, using loose criteria.
β can be treated as random effects (variance-components test) and can be weighted by prior information.
11
Collapsing Model
$$Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots$$
$$Y = \mu + \beta X + \dots$$
Collapsing multiple variables into one:
subject   X1   X2   X3   X
1         0    0    0    0
2         0    1    1    1
3         1    0    0    1
12
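The collapsing step amounts to a carrier indicator: X is 1 when a subject carries at least one rare variant in the group and 0 otherwise. A minimal sketch with illustrative genotypes for three subjects (the function name is an assumption):

```python
def collapse(row):
    """Collapsed indicator: 1 if the subject carries any rare variant, else 0."""
    return 1 if any(x > 0 for x in row) else 0

# Illustrative genotypes at three rare variants (X1, X2, X3), keyed by subject.
rare_genotypes = {1: [0, 0, 0], 2: [0, 1, 1], 3: [1, 0, 0]}
collapsed = {subject: collapse(row) for subject, row in rare_genotypes.items()}
print(collapsed)
```

The collapsed variable then replaces the k separate variant columns in the regression, so a single β is tested per group.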
Weighted Sum Model
$$Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots$$
$$Y = \mu + \beta \sum_{j=1}^{k} (w_j X_j) + \dots$$
$$Y = \mu + \beta S + \dots$$
Weights: w1 = 0.2, w2 = 0.5, w3 = 0.3
subject   X1   X2   X3   S
1         0    0    0    0.0
2         0    1    1    0.8
3         1    0    0    0.2
13
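The weighted sum S is a dot product between a subject's variant genotypes and the variant weights. A short sketch using the weights 0.2, 0.5 and 0.3 with illustrative genotypes (names are assumptions):

```python
def weighted_sum(row, weights):
    """S = sum_j w_j * X_j for one subject."""
    return sum(w * x for w, x in zip(weights, row))

weights = [0.2, 0.5, 0.3]
# Illustrative genotypes at three rare variants (X1, X2, X3), keyed by subject.
rare_genotypes = {1: [0, 0, 0], 2: [0, 1, 1], 3: [1, 0, 0]}
scores = {subject: weighted_sum(row, weights)
          for subject, row in rare_genotypes.items()}
print(scores)
```

Unlike the collapsed indicator, S distinguishes subjects by which variants they carry and by how many.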
Weighting Variants
- Based on allele frequency: continuous or binary (0,1) weight, variable threshold;
- Based on functional annotation/prediction;
- Based on sequencing quality (coverage, mapping quality, genotyping quality, validated or not, etc.);
- Data-driven: weights (including effect directions) learned from both genotype and phenotype data, requiring a permutation test;
- Any combination …
Grouping Variants
- By gene
- By transcript
- By gene set / pathway
- By exon
- By protein domain
- …
14
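Two simple allele-frequency weighting schemes, sketched under stated assumptions: a binary weight with a fixed frequency cutoff, and a continuous weight that grows as the allele becomes rarer. The 1/sqrt(p(1-p)) form and the 0.01 cutoff are illustrative choices, not necessarily the ones used in this pipeline.

```python
import math

def threshold_weight(maf, cutoff=0.01):
    """Binary weight: 1 for variants rarer than the cutoff, else 0."""
    return 1.0 if maf < cutoff else 0.0

def continuous_weight(maf):
    """Continuous weight 1/sqrt(maf*(1-maf)); larger for rarer variants."""
    return 1.0 / math.sqrt(maf * (1.0 - maf))

# Hypothetical minor allele frequencies for three variants.
mafs = [0.001, 0.005, 0.05]
print([threshold_weight(p) for p in mafs])
print([continuous_weight(p) for p in mafs])
```

A variable-threshold approach would repeat the binary weighting over several cutoffs and keep the strongest signal, which is why it needs a permutation test to obtain a valid p-value.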
Modeling More Data Types
Generalized Linear (Mixed) Model
$$g(Y) = \mu + X\beta + \dots + e$$
g is the link function. For binary Y, logistic model:
$$g(Y) = \mathrm{logit}(Y) = \log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right]$$
$$P(Y=1) = \frac{\exp(\mu + X\beta + \dots + e)}{\exp(\mu + X\beta + \dots + e) + 1}$$
15
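The logit link and its inverse are easy to check numerically. The sketch below maps a linear predictor to P(Y = 1) for the three genotype classes of one marker; the values of μ and β are made up for illustration.

```python
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """P(Y = 1) = exp(eta) / (exp(eta) + 1) for linear predictor eta."""
    return math.exp(eta) / (math.exp(eta) + 1.0)

mu, beta = -1.0, 0.5          # hypothetical intercept and per-allele log odds ratio
for x in (0, 1, 2):           # allele count at the marker
    print(x, inverse_logit(mu + beta * x))
```

Here exp(β) is the odds ratio per additional copy of the allele, which is the quantity usually reported for a binary trait.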
Longitudinal Data (quantitative)
[Figure: plot with axis label "Time"]
Fixed effect, time as covariate
Repeated measures, random effect, correlation within subjects
16
Longitudinal Data (binary)
[Figure: plot with axis label "Time"]
Linear model, time as covariate
Survival analysis, CoxPH model etc.
17
Tools
SAS Procedures: REG, LOGISTIC, GENMOD, MIXED, HPMIXED, GLIMMIX, PHREG/LIFETEST
R Functions/Packages: lm(), glm(), gee, nlme, kinship2/coxme, lme4, survival
Other Programs: SOLAR, MMAP, EMMA, EMMAX, SKAT
18
Pipeline
Input (data + options)
=> Job generating/submitting module
=> Job number controlling module
=> LSF bsub: job 1, job 2, ….., job N
   Options.jobi => self-programmed modules (SAS, R, …)
   Options.jobi => external program modules (MMAP, SKAT, …)
=> Result 1, Result 2, ….., Result N
=> Job status monitoring module (all done?)
   No: wait …
   Yes: Result summarizing module
19
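The control flow of the pipeline (submit jobs, cap how many run at once, wait until all are done, then summarize) can be imitated locally with a thread pool. This is only an analogy: the real pipeline submits to LSF with bsub, whereas run_job below is a hypothetical stand-in that returns a fake result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_job(job_id):
    """Hypothetical stand-in for one analysis job (an LSF job in the slides)."""
    return {"job": job_id, "n_tests": 10 * job_id}

def run_pipeline(job_ids, max_jobs):
    """Run all jobs with at most max_jobs active at once, then summarize."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:       # job number control
        futures = {pool.submit(run_job, j): j for j in job_ids}  # job submission
        for done in as_completed(futures):                       # status monitoring
            results[futures[done]] = done.result()
    # Result summarizing step: reached only after every job has finished.
    total_tests = sum(r["n_tests"] for r in results.values())
    return total_tests, results

total, results = run_pipeline(range(1, 6), max_jobs=2)
print(total, sorted(results))
```

The max_jobs argument plays the role of a cap on concurrently running cluster jobs, so one analysis does not flood the queue.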
gwas.sh options.gwa
#!/bin/sh
OPFILE=$1
...
…
Pheno   type   covar     program   analysis   run
Bmi     qt     age,sex   SASGLM    mixed      YES
Obes    ql     NA        SASGLM    gee        YES
HD      ql     age       SASGLM    gee        NO
Age     …
Sex     …
…

Program   language   location                  Maintainer
SASGLM    SAS        /dsg1/code/sas/glm.sas    Q.Zhang
GSTAT     R          /dsg1/code/R/gstat.R      Q.Zhang
MMAP      C          /dsg1/code/sas/mmap.sh    J. Czajkowski
…
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
genotype_file=
phenotype_file=fhs100
markerinfo_file=mapall
marker_selection=MAF>0.01
pedigree_file=pediall
subjectID=subject
pedgreeID=famid
markername=snp
…
[ANALYSIS]
phenolist_file=
pheno_list=bmi/qt
covariates=
program=SASGLM
analysis=mixed
[OUTPUT]
output_dir=/dsguser/qunyuan/fhs/bmi
output_file=
output_replace=no
[RUN]
clusterjobname=bmimixed
memsize=1000M
maxjobn=300
…
20
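The options file follows an INI-like layout (bracketed sections, key=value lines), so it can be read with a standard INI parser. The sketch below uses Python's configparser on a shortened copy of the file with the elided "…" lines left out. How gwas.sh itself parses the file is not shown in the slides, so this is only one possible approach.

```python
import configparser

# Shortened copy of the options file; elided lines are omitted.
OPTIONS_TEXT = """
[DATA]
database=SAS
genotype_file=
phenotype_file=fhs100
marker_selection=MAF>0.01
subjectID=subject

[ANALYSIS]
pheno_list=bmi/qt
program=SASGLM
analysis=mixed

[RUN]
clusterjobname=bmimixed
memsize=1000M
maxjobn=300
"""

def read_options(text):
    """Parse the options text into {section: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.optionxform = str      # keep key case (the default lower-cases keys)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

options = read_options(OPTIONS_TEXT)
phenotype, trait_type = options["ANALYSIS"]["pheno_list"].split("/")
max_jobs = int(options["RUN"]["maxjobn"])
print(phenotype, trait_type, max_jobs)
```

Empty values such as genotype_file= come back as empty strings, which a driver script could treat as "use the default".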
Thanks !
21