Talk2PreservationHorvath

Module preservation statistics
Steve Horvath
University of California, Los Angeles
Module preservation is often an
essential step in a network analysis
Construct a network
Rationale: make use of interaction patterns between genes
Identify modules
Rationale: module (pathway) based analysis
Relate modules to external information
Array Information: Clinical data, SNPs, proteomics
Gene Information: gene ontology, EASE, IPA
Rationale: find biologically interesting modules
Study Module Preservation across different data
Rationale:
• Same data: to check robustness of module definition
• Different data: to find interesting modules
Find the key drivers of interesting modules
Rationale: experimental validation, therapeutics, biomarkers
Motivational example: Studying the
preservation of human brain co-expression
modules in chimpanzee brain expression
data.
Modules defined as clusters
(branches of a cluster tree)
Data from Oldham et al 2006
Preservation of modules between human and
chimpanzee brain networks
Standard cross-tabulation based statistics
have severe disadvantages
Disadvantages
1. only applicable for modules defined via a
clustering procedure
2. ill suited for making the strong statement
that a module is not preserved
We argue that network based approaches are
superior when it comes to studying
module preservation
Is my network module preserved
and reproducible?
Langfelder et al PloS Comp Biol. 7(1): e1001057.
Broad definition of a module




Abstract definition of module=subset of nodes in a
network.
Thus, a module forms a sub-network in a larger network
Example: module (set of genes or proteins) defined using
external knowledge: KEGG pathway, GO ontology category
Example: modules defined as clusters resulting from
clustering the nodes in a network
• Module preservation statistics can be used to evaluate
whether a given module defined in one data set (reference
network) can also be found in another data set (test
network)
Network of cholesterol biosynthesis genes
Message: the female liver network (reference) looks most similar to the male liver network
Question
• How to measure relationships between different networks (e.g. how similar is the
female liver network to the male network)?
Answer: network concepts aka statistics
Connectivity (aka degree)
• Node connectivity = row sum of the adjacency
matrix
– For unweighted networks=number of direct neighbors
– For weighted networks= sum of connection strengths
to other nodes
Connectivity: $k_i = \sum_{j \neq i} a_{ij}$
Scaled connectivity: $K_i = \frac{k_i}{\max(k)}$
Density
• Density= mean adjacency
• Highly related to mean connectivity


$\text{Density} = \frac{\sum_i \sum_{j \neq i} a_{ij}}{n(n-1)} = \frac{\text{mean}(k)}{n-1}$
where n is the number of network nodes.
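The connectivity and density definitions can be checked on a toy network. The following is a minimal sketch in plain Python, not the WGCNA implementation; the 4-node adjacency matrix `A` and all function names are invented for illustration.

```python
# Illustrative sketch: connectivity, scaled connectivity and density
# of a small weighted network stored as a symmetric adjacency matrix.

def connectivity(adj):
    """k_i = sum of a_ij over all j != i (the diagonal is ignored)."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

def scaled_connectivity(adj):
    """K_i = k_i / max(k)."""
    k = connectivity(adj)
    k_max = max(k)
    return [k_i / k_max for k_i in k]

def density(adj):
    """Mean off-diagonal adjacency: sum_i sum_{j != i} a_ij / (n (n - 1))."""
    n = len(adj)
    return sum(connectivity(adj)) / (n * (n - 1))

# Hypothetical 4-node weighted network (values invented for illustration).
A = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.6, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

k = connectivity(A)
mean_k = sum(k) / len(k)
print("connectivity:", k)
print("scaled connectivity:", scaled_connectivity(A))
print("density:", density(A), "mean(k)/(n-1):", mean_k / (len(A) - 1))
```

The last line prints the density computed both ways, which illustrates why density is so closely related to mean connectivity.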
Network concepts to measure
relationships between networks
Numerous network concepts can be used to
measure the preservation of network
connectivity patterns between a reference
network and a test network
• E.g. Density in the test set
• cor.k=cor(kref,ktest)
• cor(Aref,Atest)
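As a rough sketch of how cor.k and cor(Aref, Atest) could be computed for two small networks, the plain-Python example below correlates the connectivity vectors and the off-diagonal adjacencies. Using only the upper triangle for cor(Aref, Atest) is an assumption of this sketch, and both matrices are invented toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def connectivity(adj):
    """Row sums of the adjacency matrix, excluding the diagonal."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

def upper_triangle(adj):
    """Off-diagonal adjacencies a_ij with i < j, as a flat list."""
    n = len(adj)
    return [adj[i][j] for i in range(n) for j in range(i + 1, n)]

def cor_k(adj_ref, adj_test):
    return pearson(connectivity(adj_ref), connectivity(adj_test))

def cor_adj(adj_ref, adj_test):
    return pearson(upper_triangle(adj_ref), upper_triangle(adj_test))

# Toy reference network and a test network with the same pattern at half strength.
A_ref = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.6, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
A_test = [[1.0 if i == j else 0.5 * A_ref[i][j] for j in range(4)] for i in range(4)]

print("cor.k   =", cor_k(A_ref, A_test))
print("cor.ADJ =", cor_adj(A_ref, A_test))
```

Because the test network keeps the same connectivity pattern at lower strength, both correlations are essentially 1 even though the density differs, which is why density and connectivity preservation are assessed separately.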
Module preservation
in different types of networks


One can study module preservation in general
networks specified by an adjacency matrix, e.g.
protein-protein interaction networks.
However, particularly powerful statistics are
available for correlation networks

weighted correlation networks are particularly
useful for detecting subtle changes in connectivity
patterns. But the methods are also applicable to
unweighted networks (i.e. graphs)
Network-based
module preservation statistics

Input: module assignment in reference data.

Adjacency matrices in reference Aref and test data Atest

Network preservation statistics assess preservation of

1. network density: Does the module remain densely
connected in the test network?

2. connectivity: Is hub gene status preserved between
reference and test networks?

3. separability of modules: Does the module remain
distinct in the test data?
Several connectivity preservation statistics
For general networks, i.e. input adjacency matrices

cor.kIM=cor(kIMref,kIMtest)


correlation of intramodular connectivity across module nodes
cor.ADJ=cor(Aref,Atest)

correlation of adjacency across module nodes
For correlation networks, i.e. input sets are variable
measurements

cor.Cor=cor(corref,cortest)

cor.kME=cor(kMEref,kMEtest)
One can derive relationships among these statistics in the case of
weighted correlation networks
Choosing thresholds for preservation
statistics based on permutation test




For correlation networks, we study 4 density and 4
connectivity preservation statistics that take on values <= 1
Challenge: Thresholds could depend on many factors
(number of genes, number of samples, biology, expression
platform, etc.)
Solution: Permutation test. Repeatedly permute the gene
labels in the test network to estimate the mean and standard
deviation under the null hypothesis of no preservation.
Next we calculate a Z statistic:
$Z = \frac{\text{observed} - \text{mean}_{\text{permuted}}}{\text{sd}_{\text{permuted}}}$
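The permutation procedure can be sketched for a single statistic. The example below is a simplified illustration, not the WGCNA `modulePreservation` code: it uses mean adjacency within the module as the only statistic, a toy test network with a planted "blue" module, and hypothetical function names and parameters (`n_perm`, `seed`).

```python
import random
from statistics import mean, stdev

def module_density(adj, nodes):
    """Mean adjacency among the given nodes (off-diagonal pairs only)."""
    pairs = [(i, j) for i in nodes for j in nodes if i < j]
    return sum(adj[i][j] for i, j in pairs) / len(pairs)

def permutation_z(adj_test, labels, module, n_perm=200, seed=1):
    """Z = (observed - mean of permuted values) / sd of permuted values."""
    rng = random.Random(seed)
    nodes = [i for i, lab in enumerate(labels) if lab == module]
    observed = module_density(adj_test, nodes)
    shuffled = list(labels)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the gene labels in the test network
        perm_nodes = [i for i, lab in enumerate(shuffled) if lab == module]
        null_values.append(module_density(adj_test, perm_nodes))
    return (observed - mean(null_values)) / stdev(null_values)

# Toy test network: 20 genes, the first 5 form a densely connected module.
n = 20
labels = ["blue"] * 5 + ["grey"] * 15
adj_test = [
    [0.9 if labels[i] == "blue" and labels[j] == "blue" else 0.1 for j in range(n)]
    for i in range(n)
]

z = permutation_z(adj_test, labels, "blue")
print("Z for density of the blue module:", z)
```

Since the module is far denser than a random set of 5 genes, the Z score comes out well above the threshold of 2.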
Permutation test for estimating Z scores
Gene modules in adipose

For each preservation measure we report the observed value
and the permutation Z score to measure significance.
$Z = \frac{\text{observed} - \text{mean}_{\text{permuted}}}{\text{sd}_{\text{permuted}}}$



Each Z score provides an answer to “Is the module significantly
better than a random sample of genes?”
Summarize the individual Z scores into a composite measure
called Zsummary
• Zsummary < 2 indicates no preservation
• 2 < Zsummary < 10 indicates weak to moderate evidence of preservation
• Zsummary > 10 indicates strong evidence of preservation
Overview of Preservation statistics
Module preservation statistics are
often closely related
Message: it makes sense to aggregate the statistics
into “composite preservation statistics”
Clustering module preservation statistics based on correlations across
modules
Red=density statistics
Blue: connectivity statistics
Green: separability statistics
Cross-tabulation based statistics
Composite statistic in correlation
networks based on Z statistics
Permutation test allows one to estimate a Z version of each statistic, e.g. for module q:
$Z_{cor.Cor}^{(q)} = \frac{cor.Cor^{(q)} - E(cor.Cor^{(q)} \mid \text{null})}{\sqrt{Var(cor.Cor^{(q)} \mid \text{null})}}$
Composite connectivity based statistics for correlation networks
$Z_{connectivity}^{(q)} = \text{median}(Z_{cor.Cor}^{(q)}, Z_{cor.kME}^{(q)}, Z_{cor.A}^{(q)}, Z_{cor.kIM}^{(q)})$
Composite density based statistics for correlation networks
$Z_{density}^{(q)} = \text{median}(Z_{meanCor}^{(q)}, Z_{meanAdj}^{(q)}, Z_{propVarExpl}^{(q)}, Z_{meanKME}^{(q)})$
Composite statistic of density and connectivity preservation
$Z_{summary}^{(q)} = \frac{Z_{connectivity}^{(q)} + Z_{density}^{(q)}}{2}$
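The aggregation into a composite score can be sketched as follows. This is a small illustration in plain Python, assuming the individual Z scores are already available; the numeric values are invented, and the handling of scores exactly at 2 or 10 is an assumption, since the thresholds quoted earlier in the talk use strict inequalities.

```python
from statistics import median

def z_summary(z_connectivity_stats, z_density_stats):
    """Average of the median connectivity Z and the median density Z."""
    z_connectivity = median(z_connectivity_stats)
    z_density = median(z_density_stats)
    return (z_connectivity + z_density) / 2

def interpret(z):
    """Thresholds from the talk; boundary handling is an assumption."""
    if z < 2:
        return "no preservation"
    if z <= 10:
        return "weak to moderate evidence of preservation"
    return "strong evidence of preservation"

# Invented Z scores for one module (hypothetical values).
z_conn = {"cor.Cor": 12.0, "cor.kME": 8.0, "cor.A": 10.0, "cor.kIM": 14.0}
z_dens = {"meanCor": 6.0, "meanAdj": 4.0, "propVarExpl": 5.0, "meanKME": 9.0}

summary = z_summary(z_conn.values(), z_dens.values())
print(summary, "->", interpret(summary))
```

Here the median connectivity Z is 11.0 and the median density Z is 5.5, so the composite is 8.25, which falls in the weak-to-moderate range.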
Analogously define composite statistic: medianRank
Gene modules in adipose

Based on the ranks of the observed preservation
statistics

Does not require a permutation test

Very fast calculation

Typically, it shows no dependence on the module
size
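A rank-based composite of this kind might be computed as in the sketch below. It assumes that rank 1 goes to the module with the largest observed value of each statistic, that ranks are summarized by the median within the density and connectivity groups, and that the two medians are averaged; ties are not handled. Module names and values are invented.

```python
from statistics import median

def ranks(values_by_module):
    """Rank 1 = largest observed value (ties are not handled in this sketch)."""
    ordered = sorted(values_by_module, key=values_by_module.get, reverse=True)
    return {module: position + 1 for position, module in enumerate(ordered)}

def median_over(stats):
    """Median rank of each module across a group of statistics."""
    per_stat = [ranks(values) for values in stats.values()]
    modules = list(per_stat[0])
    return {m: median(r[m] for r in per_stat) for m in modules}

def median_rank(density_stats, connectivity_stats):
    dens = median_over(density_stats)
    conn = median_over(connectivity_stats)
    return {m: (dens[m] + conn[m]) / 2 for m in dens}

# Invented observed preservation statistics for three modules.
density_stats = {
    "meanCor": {"A": 0.8, "B": 0.5, "C": 0.2},
    "meanAdj": {"A": 0.7, "B": 0.6, "C": 0.1},
}
connectivity_stats = {
    "cor.kIM": {"A": 0.9, "B": 0.3, "C": 0.4},
    "cor.Cor": {"A": 0.85, "B": 0.2, "C": 0.5},
}

result = median_rank(density_stats, connectivity_stats)
print(result)
```

Under these assumptions a lower medianRank means stronger relative preservation, and no permutations are needed because only the observed values are ranked.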
Summary preservation
• Network based preservation statistics measure different
aspects of module preservation
– Density-, connectivity-, separability preservation
• Two types of composite statistics: Zsummary and medianRank.
• Composite statistic Zsummary based on a permutation test
– Advantages: thresholds can be defined, R function also calculates
corresponding permutation test p-values
– Example: Zsummary<2 indicates that the module is *not* preserved
– Disadvantages: i) Zsummary is computationally intensive since it is
based on a permutation test, ii) often depends on module size
• Composite statistic medianRank
– Advantages: i) fast computation (no need for permutations), ii) no
dependence on module size.
– Disadvantage: only applicable for ranking modules (i.e. relative
preservation)
Application:
Modules defined as KEGG
pathways.
Comparison of human brain (reference) versus
chimp brain (test) gene expression data.
The connectivity pattern (adjacency matrix) is defined
as a signed weighted co-expression network.
Preservation of KEGG pathways
measured using the composite preservation
statistics Zsummary and medianRank
• Humans versus chimp brain co-expression modules
Apoptosis module is least preserved
according to both composite preservation statistics
Visually inspect connectivity patterns of the
apoptosis module in humans and chimpanzees
Weighted gene co-expression module.
Red lines = positive correlations, green lines = negative correlations
Note that the connectivity patterns look very different.
Preservation statistics are ideally suited to measure differences
in connectivity preservation
Literature validation:
Neuron apoptosis is known to differ
between humans and chimpanzees
• It has been hypothesized that natural selection
for increased cognitive ability in humans led to a
reduced level of neuron apoptosis in the human
brain:
– Arora et al (2009) Did natural selection for increased
cognitive ability in humans lead to an elevated risk of
cancer? Med Hypotheses 73: 453–456.
• Chimpanzee tumors are extremely rare and
biologically different from human cancers
• A scan for positively selected genes in the
genomes of humans and chimpanzees found
that a large number of genes involved in
apoptosis show strong evidence for positive
selection (Nielsen et al 2005 PloS Biol).
Application: Studying the
preservation of a female mouse
liver module in different
tissue/gender combinations.
Module: genes of cholesterol biosynthesis pathway
Network: signed weighted co-expression network
Reference set: female mouse liver
Test sets: other tissue/gender combinations
Data provided by Jake Lusis
Network of cholesterol biosynthesis genes
Message: the female liver network (reference) looks most similar to the male liver network
Note that Zsummary is highest in the male liver network
Jeremy Miller, et al Dan Geschwind (2010)
Divergence of human and mouse brain
transcriptome highlights Alzheimer
disease pathways.
PNAS 2010
Slide from Jeremy Miller
Why compare human and mouse
brain transcription?
• 1) Module membership (kME) in
conserved modules may be used to
identify reliable markers for cell types and
cellular components.
• 2) Studying differences in network
organization could provide a basis for
better understanding diseases enriched in
human populations, such as Alzheimer’s
Disease
Slide from J. Miller
Co-expression modules
based on multiple human
and mouse gene expression data
Human Brain Modules
Mouse Brain Modules
Human modules M7h and M9h were enriched with AD genes.
These modules could not be found in mouse brains.
Preservation of human network modules in mouse brains
Zsummary
Human specific modules
M9h and M7h are related to AD
• Module preservation analysis identified two highly human-specific modules, M9h and M7h
– No clear functional annotation
• Guilt by association approaches show these modules are
related to neurodegenerative dementias
– M9h showed significant overlap with an Alzheimer’s
disease module that was identified using independent
data sets run on different brain regions, on different
platforms, and in different labs
– M7h contained two intramodular hub genes related to AD
and frontotemporal dementia (FTD) in humans: GSK3β
and tau
• These two modules provide key targets for
furthering our understanding of neurodegenerative
dementias
Genetic Programs in Human and Mouse
Early Embryos Revealed
by Single-Cell RNA-Sequencing
Zhigang Xue, Kevin Huang, Xiaofei Ye,
et al
Guoping Fan
Background
• Mammalian preimplantation development
is a complex process involving dramatic
changes in the transcriptional architecture.
• Through single-cell RNA-sequencing
(RNA-seq), we report here a
comprehensive analysis of transcriptome
dynamics from oocyte to morula in both
human and mouse embryos.
PCA of RNA seq data reveals
known trajectory
WGCNA analysis
Module eigengenes vs stages
Module preservation analysis
Implementation and R software
tutorials, WGCNA R library
• General information on weighted correlation networks
• Google search
– “WGCNA”
– “weighted gene co-expression network”


R function modulePreservation is part of the WGCNA
package
Tutorials: preservation between human and chimp brains
www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ModulePreservation
Network Methods for
Describing Sample
Relationships in Genomic
Datasets: Application to
Huntington's Disease
• Michael C Oldham et al BMC Syst Biol.
2012 PMID: 22691535
Rich but complex HD data
• Affymetrix microarray data from “the HD study”
– Hodges et al: Regional and cellular gene expression
changes in human Huntington’s disease brain. Hum Mol
Genet 2006, 15(6):965-977
• Brain samples of patients with HD (n = 44 individuals) and
unaffected controls (n = 36 individuals, matched for age and
sex)
• caudate nucleus (CN), cerebellum (CB), primary motor
cortex (Brodmann’s area 4; BA4), and prefrontal cortex
(Brodmann’s area 9; BA9)
• Samples span five disease grades according to Vonsattel’s neuropathological criteria
• Further covariates: age, sex, the country where the experiment was
performed (samples were processed in the United States
and New Zealand) and the microarray hybridization batch
Defining sample adjacency
Why define this sample network
adjacency measure?
• Our proposed sample adjacency measure (based
on β = 2) also has several other advantages.
– it preserves the sign of the correlation
– while any other power β could be used, the
choice of β = 2 results in an adjacency measure
that is close to the correlation when the
correlation is large (e.g. larger than 0.6, which is
often the case among samples in microarray
data).
• The adjacency measure allows one to define
network concepts.
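The properties listed above can be illustrated numerically. The sketch below assumes the signed form a = ((1 + cor) / 2)^β for the sample adjacency; that formula is not shown in this transcript and is an assumption here, as is the function name `sample_adjacency`.

```python
def sample_adjacency(correlation, beta=2):
    """Assumed signed adjacency: ((1 + cor) / 2) ** beta, with beta = 2 by default."""
    return ((1 + correlation) / 2) ** beta

# Compare adjacency and correlation over a range of sample correlations.
for r in (-0.9, -0.5, 0.0, 0.6, 0.8, 0.9, 1.0):
    print(f"cor = {r:5.2f}  adjacency = {sample_adjacency(r):.4f}")
```

With β = 2, a correlation of 0.6 maps to 0.64 and 0.9 maps to about 0.90, so the adjacency stays close to the correlation in the range typical for microarray samples, while negative correlations map to small adjacencies.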
Connectivity
• Gene connectivity = row sum of the adjacency
matrix
– For unweighted networks=number of direct neighbors
– For weighted networks= sum of connection strengths
to other nodes
$k_i = \sum_{j \neq i} a_{ij}$
– Scaled connectivity: $K_i = \frac{k_i}{\max(k)}$
Clustering Coefficient
Measures the cliquishness of a particular node
« A node is cliquish if its neighbors know each other »
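$\text{ClusterCoef}_i = \frac{\sum_{l \neq i} \sum_{m \neq i,l} a_{il} a_{lm} a_{mi}}{\left(\sum_{l \neq i} a_{il}\right)^2 - \sum_{l \neq i} a_{il}^2}$
[Figure: two example networks. In the first, the clustering coefficient of the black node is 0; in the second, the clustering coefficient is 1.]
This generalizes directly to weighted networks (Zhang and Horvath 2005)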
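The weighted clustering coefficient of Zhang and Horvath (2005) can be tried on small examples. The sketch below is a plain-Python illustration with invented toy networks; returning 0 for a node with fewer than two neighbors is a convention chosen for this sketch, not something stated in the talk.

```python
def clustering_coefficient(adj, i):
    """Weighted clustering coefficient of node i for a symmetric adjacency matrix."""
    others = [l for l in range(len(adj)) if l != i]
    numerator = sum(
        adj[i][l] * adj[l][m] * adj[m][i]
        for l in others
        for m in others
        if m != l
    )
    sum_a = sum(adj[i][l] for l in others)
    sum_a_squared = sum(adj[i][l] ** 2 for l in others)
    denominator = sum_a ** 2 - sum_a_squared
    # Convention for this sketch: undefined cases (fewer than 2 neighbors) give 0.
    return numerator / denominator if denominator > 0 else 0.0

# Unweighted triangle: every pair of neighbors is connected.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# Star: node 0 is connected to three leaves that are not connected to each other.
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
# Weighted triangle with all connection strengths equal to 0.5.
weak_triangle = [[0, 0.5, 0.5], [0.5, 0, 0.5], [0.5, 0.5, 0]]

print(clustering_coefficient(triangle, 0))
print(clustering_coefficient(star, 0))
print(clustering_coefficient(weak_triangle, 0))
```

The triangle and star reproduce the two extreme cases of a fully cliquish node and a hub whose neighbors do not know each other, and the weighted triangle shows that weaker connections among neighbors lower the coefficient.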
C(k) curve is a plot of scaled clustering
coefficient versus scaled connectivity
Sample network concepts reveal the profound
effect of Huntington’s disease in caudate nucleus.
Summary sample network
• Z.k is a very useful measure for finding array outliers.
• The correlation cor(K,C) between the connectivity
and the clustering coefficient (two important network
concepts) is a sensitive indicator of homogeneity
among biological samples.
– It can distinguish biologically meaningful relationships
among subgroups of samples.
– Advantage: This measure can highlight differences that
cannot be found using differential expression
– Disadvantage: It requires some work to figure out which
genes lead to this effect.
– Here: effect is concentrated in specific modules of genes
• Sample network approach is implemented in an R
function and tutorial
Acknowledgement
Current and former lab members:
• Peter Langfelder first author on many related
articles
• Jason Aten, Chaochao (Ricky) Cai, Jun Dong, Tova Fuller,
Ai Li, Wen Lin, Michael Mason, Jeremy Miller, Mike
Oldham, Chris Plaisier, Anja Presson, Lin Song, Kellen
Winden, Yafeng Zhang, Andy Yip, Bin Zhang
• Colleagues/Collaborators
• Neuroscience: Dan Geschwind, Giovanni
Coppola, Jeremy Miller, Mike Oldham, Roel
Ophoff
• Mouse: Jake Lusis, Tom Drake