Statistical Analysis of Gene Expression Data (A Large

Download Report

Transcript Statistical Analysis of Gene Expression Data (A Large

First step of making a story: Statistical significance
of a particular "Functional cluster"
•Suppose we have analyzed total of N genes, n of which turned out to be
differentially expressed/co-expressed (experimentally identified - call them
significant) - form the Cluster 1
•Suppose that y out of N total genes were classified into a specific
"Functional group" - FCluster 1
•Suppose that x are in both Cluster1 and FCluster1
•Q1: Is this FCluster1 significantly associated with Cluster1?
•Q2: Are significant genes overrepresented in this functional group when
compared to their overall frequency among all analyzed genes?
•Q3: If we randomly draw n out of N genes to be put in Cluster1, what is
the chance of getting x or more of them to fall into FCluster1?
•Q4: If we randomly draw y out of N genes to be in FCluster1, what is the
chance of getting x of them or more to fall into Cluster1
Testing correlations between genes groups and
functional clusters
•Deriving the Fisher's exact test (hypergeometric distribution) a typical approach to deriving probability distribution functions
for discrete data
•Simulation-based Fisher's test - data is simulated according
the "null" distribution many times and the sampling
distribution of the test hypothesis under the null hypothesis is
calculated empirically
•Simulation-based testing in computational biology is
extremely common
•Recreating the true "null" distribution by simulations is often
much easier than deriving the true null distribution
External Validation
•Correlate clusters or just groups of differentially expressed genes with
"functional clusters"
Expression Clusters
Functional Clusters
•f
Genes in
FCluster 1
Genes NOT in
FCluster 1
Genes in
Cluster 1
x
n-x
Genes NOT in
Cluster 1
y-x
N-y-(n-x) N-n
y
N-y
n
N
Statistical significance of a particular "Functional
cluster" - cont
Before finding differentially expressed genes we know that y out of N genes
belong to FCluster1(red boxes)
g1
...
gy+1
gy
...
gN
If there is no association between Cluster1 and FCluster1, number of genes in FCluster1 that turn
out to be differentially expressed is not significantly greater than number we can get by randomly
drawing n out of N genes (boxes with blue border) and counting how many belong to FCluster
Observed
g1
...
gx
gx+1
...
gy
gy+1
...
gn+(y-x)
gn+y+1
...
gN
Outcomes (o1,...,oT): A set of n genes was randomly selected from the list of N
genes and borders of their boxes are painted blue.
Event of interest (E): Set of all outcomes for which the number of red boxes with
blue border is equal to x
Since drawing is random all outcomes are equally probable
P ( o 1 )  P ( o 2 )  ...  P ( o T )
T

t 1
P (ot )  1

P (ot ) 
1
T
, for all t  1,..., T
Statistical significance of a particular "Functional
cluster" - cont
Outcome (o1,...,oT): A set of y genes with selected from the list of N genes
Event of interest (E): Set of all outcomes for which the number of red boxes
among the y boxes drawn is equal to x
M
E  {o ,..., o
E
1
E
M
)

P (E) 

m 1
M
o
E
m

1
T
m 1

M
T
All we have to do is calculating M and N where:
T=number of different sets we can draw a set of y genes out of total of N genes
N 
N ( N  1)( N  2 )...( N  y  1)
T    
1 * 2 * ... * y
 y
Comes from the fact that order in which
we pick genes does not matter
M=number of different ways to obtain x red boxes (significant genes) when
drawing y boxes (genes) out of total of N boxes (genes), x of which are red
(significant)
 n  N  n 

M    

 x  y  x 
First pick x red boxes. For each such set
of x red boxes pick a set of y-x non-red
boxes
Statistical significance of a particular "Functional
cluster" - p-value
Expression Clusters
Functional Clusters
Genes in
FCluster 1
Genes NOT in
FCluster 1
Genes in
Cluster 1
x
n-x
Genes NOT in
Cluster 1
y-x
N-y-(n-x) N-n
y
N-y
n
N
P-value: Probability of observing x or more significant genes under the
null hypothesis
 n  N  n 
 

min( y , n ) 



 r  y  r 
p  value  p ( x )  p ( x  1)  ...  p (min( y , n )) 
N 
rx
 
 y
 

Fisher's exact test or the "hypergeometric" test
Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR< 0.01
GO Term
1
FDR for the category= 0.0442416
GOID = GO:0006936
Term = muscle contraction
Definition = A process leading to shortening and/or development of
tension in muscle tissue. Muscle contraction occurs by a sliding
filament mechanism whereby actin filaments slide inward among the
myosin filaments.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,]
3
12
[2,]
33 9268
GO Term
2
FDR for the category= 0.1769315
GOID = GO:0006937
Term = regulation of muscle contraction
Definition = Any process that modulates the frequency, rate or extent
of muscle contraction.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,]
2
13
[2,]
11 9290
Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR< 0.05
GO Term
1
FDR for the category= 0.006130206
GOID = GO:0005576
Term = extracellular region
Synonym = extracellular
Definition = The space external to the outermost structure of a cell.
For cells without external protective or external encapsulating
structures this refers to space outside of the plasma membrane.
This term covers the host cell environment outside an
intracellular parasite.
Ontology = CC
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 160 544
[2,] 1381 7231
GO Term
2
FDR for the category= 0.006130206
GOID = GO:0005615
Term = extracellular space
Synonym = intercellular space
Definition = That part of a multicellular organism outside the cells
proper, usually taken to be outside the plasma membranes, and
occupied by fluid.
Ontology = CC
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 149 555
[2,] 1266 7346
Statistically Significant GO Categories
> fisher.test(matrix(c(3,12,33,9268),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(3, 12, 33, 9268), byrow = T, ncol = 2)
p-value = 2.336e-05
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
16.1434
Inf
sample estimates:
odds ratio
69
> fisher.test(matrix(c(160,544,1381,7231),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(160, 544, 1381, 7231), byrow = T, ncol = 2)
p-value = 6.089e-06
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.310533
Inf
sample estimates:
odds ratio
1.539936
Statistically Significant GO Categories
> fisher.test(matrix(c(4,1752,0,9268),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(4, 1752, 0, 9268), byrow = T, ncol = 2)
p-value = 0.000642
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
4.74044
Inf
sample estimates:
odds ratio
Inf
> fisher.test(matrix(c(5,1751,0,9268),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(5, 1751, 0, 9268), byrow = T, ncol = 2)
p-value = 0.0001021
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
6.443299
Inf
sample estimates:
odds ratio
Inf
>
Re-sampling based testing
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,]
3
12
[2,]
33 9268
p-value = 2.336e-05
Randomly Drawing 15 "significant" genes out of the total of
9316 genes
> sample(1:9316,15,
[1] 3387 5063 7111
> sample(1:9316,15,
[1] 4042 1783 2416
> sample(1:9316,15,
[1] 8592 4307 7419
replace =
6228 108
replace =
860 3338
replace =
4042 6643
FALSE, prob = NULL)
5507 710 6341 6145 3930 8396 939 813 1578 7629
FALSE, prob = NULL)
654 1980 3382 8375 7508 1378 5844 5404 7550 2881
FALSE, prob = NULL)
7142 485 1318 1578 1643 5763 1384 2997 7299 1127
Without a loss of generality, can assume that first 36 genes
belong the FCluster1
Overall Strategy:
•Generate Many Random Samples
•For Random Sample, Count How Many Genes Belong to first 36 (x)
•Calculate the proportion of samples for which x3
Re-sampling based testing
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,]
3
12
[2,]
33 9268
p-value = 2.336e-05
Randomly Drawing 15 "significant" genes out of the total of
9316 genes
> for(i in 1:1000000){
+ x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36)
+ ngreater3<-ngreater3+(x>=3)
+ }
> ngreater3/1000000
[1] 1.7e-05
The approximate, resampling-based, p-value is close to the pvalue based on the Fisher's test and the hypergeometric
distribution
Approximating very small p-values is difficult using resampling
methods-need huge number of replicates to be precise
Re-sampling based testing
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,]
1
14
[2,]
35 9266
> fisher.test(matrix(c(1,14,35,9266),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(1, 14, 35, 9266), byrow = T, ncol = 2)
p-value = 0.05646
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.8822488
Inf
sample estimates:
odds ratio
18.87886
> ngreater3<-0
> for(i in 1:1000000){
+ x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36)
+ ngreater3<-ngreater3+(x>=1)
+ }
> ngreater3/1000000
[1] 0.056758
>
GO Significance
#Loading Annotation Libraries
library(annotate)
library(mouseLLMappings)
library(GO)
#Forming lists of all GO terms
GO<-as.list(GOTERM)
AllGOTerms<-names(GO)
#Getting LocusLink (i.e. Entrez Gene IDs) genes on the microarray
ACC2LL <- as.list(mouseLLMappingsACCNUM2LL)
AllGenesLL<-ACC2LL[as.character(LimmaDataNickel$genes[,"Name"])]
AllGenesLL<-unique(unlist(AllGenesLL[!is.na(names(AllGenesLL))]))
nGenesLL<-length(AllGenesLL)
#Getting frequencies of genes on the microarray in each GO category
GenesInGO<-sapply(AllGOTerms,
function(x) {llsGO<-intersect(get(x,GOALLLOCUSID),AllGenesLL);nlls<-length(llsGO);nlls})
#Selecting categories with at least 5 genes
AllGE5<-AllGOTerms[GenesInGO>=5]
GO Significance
sig.genes<-(EBayesFDRD<FDR)
sum(sig.genes,na.rm=T)
sig.genes.LL<-ACC2LL[as.character(LimmaDataNickel$genes[sig.genes,"Name"])]
sig.genes.LL<-unique(unlist(sig.genes.LL[!is.na(names(sig.genes.LL))]))
nSigGenes<-length(sig.genes.LL)
nNonSigGenes<-nGenesLL-nSigGenes
nSigGenesInGO<-length(intersect(sig.genes.LL,get(AllGE5[1],GOALLLOCUSID)))
nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
nNonSigGenesInGO<length(intersect(setdiff(AllGenesLL,sig.genes.LL),get(AllGE5[[1]][1],GOALLLOCUSID)))
nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
twoBytwo<matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2)
FisherPValues<-sapply(AllGE5,function(x) {
currentLL<-get(x,GOALLLOCUSID)
nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL))
nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL))
nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
twoBytwo<matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2)
fisher.test(twoBytwo, alternative = "greater")$p.value
})
FisherFDR<-p.adjust(FisherPValues,method="fdr")})
GO Significance
SigGOs<-(FisherFDR<0.05)
MostSigGOs<-order(FisherFDR)
nSigGOs<-2
cat("\n\nTop ",nSigGOs," GO Categories for genes with FDR<",FDR,"\n")
for(TopGO in 1:nSigGOs){
GO<-AllGE5[MostSigGOs[TopGO]]
currentLL<-get(GO,GOALLLOCUSID)
nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL))
nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL))
nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
twoBytwo<matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2)
cat("GO Term ",TopGO,"\n")
cat("FDR for the category= ",FisherFDR[MostSigGOs[TopGO]],"\n")
print(get(GO,GOTERM))
cat("\nTwo By Two matrix of gene memberships in this category\n")
print(twoBytwo)
GO Significance
•This is a bit complicated for our microarray data
•For microarrays with an "annotation package" everything is much
easier
•For example, all Affymetrix arrays have appropriate annotation
packages
•Can use GOstats package to calculate significances directly
•For a nice example that you can follow by simply copying and pasting
commands into your R session:
http://genomebiology.com/2004/5/10/R80