
Introduction to Time-Course
Gene Expression Data
STAT 675
R Guerra
April 21, 2008
Outline
• The Data
• Clustering – nonparametric, model-based
• A case study
• A new model
The Data
• DNA Microarrays: collections of
microscopic DNA spots, often representing
single genes, attached to a solid surface
The Data
• Gene expression changes over time due to
environmental stimuli or changing needs of
the cell
• Measuring gene expression against time
leads to time-course data sets
Time-Course Gene Expression
• Each row represents a single gene
• Each column represents a single time point
• These data sets can be massive, analyzing many genes simultaneously
Time-Course Gene Expression
• K-means clustering
• “in the budding yeast Saccharomyces cerevisiae, clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…” Eisen et al. (1998)
Clustering Expression Data
• When these data sets first became available, it was common to cluster using nonparametric clustering techniques like K-means and hierarchical clustering
Yeast Data Set
• Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae)
– 18 equally spaced time points
– Of 6300 genes, nearly 800 were categorized as cell-cycle regulated
– A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods
– Original and follow-up papers clustered genes using K-means and hierarchical clustering
Spellman et al. (1998)
[Figure: yeast cell-cycle heat map; rows = genes (row labels = cell-cycle group), columns = time points (column labels = experiments)]
Yeast Data Set (Spellman et al.)
[Figure: K-means vs. hierarchical clusterings of the yeast data]
Which method gives the “right” result?
Non-Parametric Clustering
1. Data curves
2. Apply distance metric to get distance matrix
3. Cluster
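A minimal sketch of these three steps in R (assuming expr is a genes × time-points matrix; the variable name is illustrative, not from the talk):

# 1. data curves: expr is a genes x time-points numeric matrix
d  <- dist(expr, method = "euclidean")  # 2. distance matrix between curves
hc <- hclust(d, method = "average")     # 3. cluster (here, hierarchical with average linkage)
clusters <- cutree(hc, k = 4)           # cut the tree into, e.g., 4 clusters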
Issues with Non-Parametric Clustering
• Technical
– Require the number of clusters to be chosen a priori
– Do not take into account the time-ordering of the data
– Hard to incorporate covariate data, e.g., gene ontology
• The yeast analysis had the number of clusters chosen based on the number of cell-cycle groups… no statistical validation showed that these were the best clustering assignments
Model-Based Clustering
• In response to limitations of nonparametric methods, model-based methods were proposed
– Time series
– Spline methods
– Hidden Markov models
– Bayesian clustering models
• Little consensus over which method is “best” to cluster time-course data
K-Means Clustering
Relocation method: the number of clusters is pre-determined and curves can change clusters at each iteration
– Initially, data are assigned at random to k clusters
– A centroid is computed for each cluster
– Data are reassigned to the cluster whose centroid is closest
– The algorithm repeats until there is no further change in the assignment of data to clusters
– The Hartigan rule is used to select the “optimal” number of clusters
K-means: Hartigan Rule
• For n curves, let k1 = k groups and k2 = k + 1 groups.
• If E1 and E2 are the sums of the within-cluster sums of squares for k1 and k2 respectively, then add the extra group if:

(E1/E2 − 1)(n − k − 1) > 10
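A one-line R helper for this rule (a sketch; E1, E2, n, and k as defined above):

# Hartigan statistic: add the (k+1)-th cluster when this exceeds 10
hartigan <- function(E1, E2, n, k) (E1 / E2 - 1) * (n - k - 1)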
K-means: Distance Metric
• Euclidean Distance
• Pearson Correlation
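In R, both choices can be expressed as distance matrices (a sketch; expr as in the earlier snippet):

d_euc <- dist(expr, method = "euclidean")  # Euclidean distance between curves
d_cor <- as.dist(1 - cor(t(expr)))         # Pearson correlation converted to a distance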
K-means: Starting Chains
• Initially, data are randomly assigned to k clusters, and this choice of k initial cluster centers can have an effect on the final clustering
• The R implementation of K-means allows the number of initial starting chains to be chosen; the run with the smallest sum of within-cluster sums of squares is the run given as output
K-Means: Starting Chains
For j = 1 to B
  random assignment j → k clusters
  → wj = within-cluster sum-of-squares
End j
Pick the clustering with min(wj)
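In R’s kmeans() this loop is the nstart argument (a sketch; the cluster count is illustrative):

# 25 random starting chains; kmeans() keeps the run with the
# smallest total within-cluster sum of squares
km <- kmeans(expr, centers = 4, nstart = 25)
km$tot.withinss  # the minimized w_j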
Hierarchical Clustering
• Hierarchical clustering is an agglomerative (bottom-up) method.
• Initially each curve is assigned its own cluster
– The two closest clusters are joined into one branch to create a clustering tree
– The clustering tree stops growing when the algorithm terminates via a stopping rule
Hierarchical Clustering
• Nearest neighbor: distance between two clusters is the minimum of all distances between all pairs of curves, one from each cluster
• Furthest neighbor: distance between two clusters is the maximum of all distances between all pairs of curves, one from each cluster
• Average linkage: distance between two clusters is the average of all distances between all pairs of curves, one from each cluster
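These three linkages map directly onto hclust()’s method argument (a sketch, reusing the distance matrix d from the earlier snippet):

hc_nn  <- hclust(d, method = "single")    # nearest neighbor (single linkage)
hc_fn  <- hclust(d, method = "complete")  # furthest neighbor (complete linkage)
hc_avg <- hclust(d, method = "average")   # average linkage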
Hierarchical Clustering
• Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold
– There is no universal stopping rule of thumb for finding an optimal number of clusters with this algorithm.
Model-Based Clustering
• Many use mixture models, with splines or piecewise polynomial functions approximating the curves
• Can better incorporate covariate information
Models using Splines
• Time-course profiles are assumed to be observations from some underlying smooth expression curve
• Each data curve is represented as the sum of:
– a smooth population mean spline (dependent on time and cluster assignment)
– a spline function representing individual (gene) effects
– Gaussian measurement noise
SSCLUST software (Pan)
MCLUST software: Yeung et al. (2001), “Model-based clustering and data transformations for gene expression data,” Bioinformatics 17:977–987
Validation Methods
BIC(C) = 2L(C) − m_C log(n)

• L(C) is the maximized log-likelihood for the model with C clusters, m_C is the number of independent parameters to be estimated, and n is the number of genes
• Strikes a balance between goodness-of-fit and model complexity
• The non-model-based methods have no such validation method
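As one concrete example, the mclust R package fits mixture models over a range of cluster counts and picks the one with the best BIC (a sketch; expr and the range of G are illustrative):

library(mclust)
fit <- Mclust(expr, G = 2:10)  # fit mixture models with 2 to 10 clusters
fit$G                          # number of clusters selected by BIC
plot(fit, what = "BIC")        # BIC across models and cluster counts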
Clustering Yeast Data using SSClust
Clustering Yeast Data in MCLUST
Comparison of Methods
• Ma et al. (2006)
• Smoothing Spline Clustering (SSClust)
• Simulation study
• SSClust better than MCLUST & nonparametric methods
• Comparison: misclassification rates
Functional Forms of Ma et al. (2006) Simulation Cluster Centers
MR and OSR
• Misclassification Rate:

MR = (# of misclassified curves) / (total # of curves)

• Overall Success Rate:

OSR = (% correct # of clusters found) × (1 − MR)

– To calculate OSR, the MR is taken only over the cases where the correct number of clusters is found
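A sketch of both quantities in R, assuming cluster labels have already been matched to the true labels (the label-matching step is omitted):

# misclassification rate: fraction of curves assigned to the wrong cluster
mr <- function(assigned, truth) mean(assigned != truth)

# overall success rate across simulation runs:
# found_k flags runs that recovered the correct number of clusters;
# run_mr holds the MR of each such run
osr <- function(found_k, run_mr) mean(found_k) * (1 - mean(run_mr))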
Comparison of Methods
• From the Ma et al. (2006) paper.

Clustering Method   Distance Metric   MR (%)   Correct # of Clusters (%)   OSR (%)
K-means             Euclidean          9.73    N/A                         N/A
K-means             Pearson            2.64    N/A                         N/A
MCLUST              N/A                0.38    77                          69.5
SSClust             N/A                0.13    100                         98.7
SSClust Methods Paper
• Concluded that SSClust was the superior clustering method
• Looking at the data, the differences in scale between the four true curves are large
– Typical time-course clusters differ in location and spread, but not in scale to this extreme
– Their conclusions are based on a data set that is not representative of the type of data this clustering method would be used for
Alternative Simulation
[Figure: functional forms for the five cluster centers]
Example of SSClust Breaking Down
[Figure: linear curves joined into one cluster while sine curves are arbitrarily split into 2 clusters]
Simulation Configuration
• Distance metric
– Euclidean or Pearson
• Number of curves
– Small (100) or large (3000)
• Resolution of time points
– 13 or 25 time points
– Evenly spaced or unevenly spaced
• Types of underlying curves
– Small (4) or large (8)
Simulation Configuration
• Distribution of curves across clusters
– Equally distributed versus unequally distributed
• Noise level
– Small (< 0.5 × SD of the data set)
– Large (> 0.5 × SD of the data set)
• For these cases, we found the misclassification rates and the percent of times that the correct number of clusters was found
Functional Forms of 7 Cluster Centers
Simulation Analysis
Conclusions from Simulations
• MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters
• Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points, or the distribution of curves across clusters
Effect of the Number of Profiles on OSR
Comparison based on Real Data
• Applied these same clustering techniques to real data
• Different numbers of clusters were found by the different methods for each real data set
Yeast Data
Human Fibroblast Data
Simulations Based on Real Data
– Start with real data, like the yeast data set
– Cluster the data using a given clustering method
– Perturb the original data (add noise at each point)
– Evaluate how different the new clustering is in comparison to the original clustering
• Use MR and OSR
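A minimal sketch of one perturbation round in R; the noise SD and cluster count are illustrative, and mclust’s adjustedRandIndex stands in here as one concrete agreement measure (the talk itself scores agreement with MR and OSR):

base  <- kmeans(expr, centers = 8, nstart = 10)$cluster   # original clustering
noisy <- expr + rnorm(length(expr), sd = 0.3)             # perturb every point
new   <- kmeans(noisy, centers = 8, nstart = 10)$cluster  # re-cluster
mclust::adjustedRandIndex(base, new)                      # agreement with the original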
Simulations Based on Yeast Data
Simulations Based on HF Data
Conclusions from these Simulations
• SSClust better than MCLUST and K-means
– This was in contrast to the prior simulations
where MCLUST was best
Gene Ontology
• So far I’ve described my work analyzing and comparing clustering results on gene expression data
• Some, like Pan (2006), have argued that clustering methods, even newer model-based clustering methods, are incomplete because they ignore gene function and other biological aspects in the clustering
Gene Ontology
• The expectation is that incorporating biological data with the expression data will yield better clustering
Gene Ontology
• The Gene Ontology project (Ashburner et al. 2000) provides a structured vocabulary to describe genes and gene products in organisms
• Three ontologies have been developed:
– Biological Process (e.g. …)
– Molecular Function (e.g. …)
– Cellular Component (e.g. …)
Annotations
• Gene Ontology annotations are associations made between gene products and the GO terms describing them
• [Figure: directed acyclic graph for a gene from the HF data set, using GO molecular function annotation]
Clustering using GO Data
• First, we need a distance metric
• Two metrics used are based on the Union-Intersection distance and the longest-path distance, both developed in Gentleman (2005) and extended by Christian (2007)
• I used the Union-Intersection distance in my clustering
GO Distances
• The union-intersection distance between the induced GO graphs of two genes, with node sets V1 and V2, is defined as

d_UI = 1 − |V1 ∩ V2| / |V1 ∪ V2|

• Show example using two DAGs
– Min = 0 when the two DAGs are identical
– Max = 1 when the two DAGs have nothing in common
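A direct R implementation of this definition (a sketch; the inputs are the GO term IDs making up each gene’s induced DAG, and the example IDs are illustrative):

# union-intersection distance between two genes' induced GO graphs
ui_dist <- function(v1, v2)
  1 - length(intersect(v1, v2)) / length(union(v1, v2))

ui_dist(c("GO:0003674", "GO:0005488"), c("GO:0003674", "GO:0005488"))  # 0: identical
ui_dist("GO:0003674", "GO:0005575")                                    # 1: nothing in common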
Showing UI Distance
Clustering Using All Data
• It is an open question how to cluster genes using both time-course expression data and gene ontology data together
• Two of the methods I used are from Boratyn et al. (2007) and from Fang et al. (2006)
Boratyn et al (2007) Method
• Clusters are based on adding individually scaled distance matrices (see the sketch after this list)
– Take the distance matrix from expression clustering and the distance matrix from gene ontology clustering
– Put them on the same scale, [0, 1]
– Add the scaled distance matrices together
– Cluster using this new distance matrix, which captures differences in both expression profiles and gene function
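A sketch of this combination in R; d_expr and d_go stand for the two precomputed distance matrices (the names are mine, not from the paper):

scale01 <- function(m) (m - min(m)) / (max(m) - min(m))  # rescale to [0, 1]
d_comb  <- as.dist(scale01(as.matrix(d_expr)) +
                   scale01(as.matrix(d_go)))             # sum of scaled distances
hc <- hclust(d_comb, method = "average")                 # cluster on the combined metric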
Yeast: 12 Clusters on Combined
Distance Metric
Fang et al (2006) Method
• In this method,
– gene ontology is a guide for clustering the expression profiles
– Biological Process is the GO annotation used
– the mean squared residual score is used to assess the expression correlation of genes within a cluster obtained from the GO-based clustering
Effect of the Choice of Ontologies
• Examined the effect of the choice of ontology (BP, CC, or MF) on my clustering
• Fang et al. (2006) use BP in their method, as it has tended to be the ontology most closely correlated with gene function among the three
Effect of Choice of Ontology
Conclusions from GO Chapter
• Clustering using expression and ontology data together proved to provide expression clusters as good as or better than when expression data are clustered alone, with the added bonus of a biological basis for filtering out potentially nonsensical clusterings
Conclusions from Paper as a Whole
• Expression clustering by model-based and non-model-based methods has no uniform “best clustering method” across all cases
– But methods are robust in terms of data apportionment per cluster and the number of curves per dataset (important for massive gene data banks)
• Clustering using expression and GO data together improves upon expression clustering, and again methods vary in complexity, performance, and ease of use
Further Extensions
• The GO analysis all used K-means and hierarchical clustering
– Extend GO clustering to model-based clustering techniques like MCLUST and SSClust (currently, GO data can be used as initial conditions in these models but not as some notion of prior model parameters)
P. falciparum:
Examination of Correlation Between Spatial
Location and Temporal Expression of Genes
Motivations:
• Evidence for correlation in literature
– Printing artifact
– Biological
• Develop a visualization and statistical
testing methodology
Biological Motivations
[Diagram: three modes of coordinate gene regulation.
Operon control (bacteria): a single promoter drives ORF1 and ORF2, transcribed as one mRNA.
Upstream Activating Sequences (yeast): UAS1 drives ORF1 and UAS2 drives ORF2, yielding separate mRNAs.
Locus Control Region (mammalian globin cluster): one LCR controls ORF1 and ORF2, yielding separate mRNAs.]
Hypothesis and Statistic
• Statistical: is there correlation between chromosomal location and gene expression?
• Biological: is gene order random?
• H0: no correlation between location on chromosome and expression
• Consider correlations in partitions
Approach
• Covariogram: general tool
• Partition chromosome, develop statistic
• Permutation testing framework
• Check for confounding factors
• Biological significance
Issues
• Confounding (printing) or other artifacts
• Account for inter-gene distances (as
opposed to adjacent pairwise correlation)
• Significance of correlation
Methods: Data
• Need gene information (plasmodb.org
has annotated fastA files):
TCAAGCAATTGTTAGATGAGAACAATAGGAAGAATTTAAATTTTAATGAT
CTGGTTATACACCCTTGGTGGTCTTATAAGAATTAA
>Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger (protein coding) hypothetical protein Location=join(124752..124823,124961..125719)
ATGATATTTCATAAATGCTTTAAAATTTGTTCGCTCTCTTGTACTGTTTT
ATGGGTTACCGCCATATCATCGATCATTCAACCAGACAAACAACAAGAAA
• Normalized gpr files (2-D loess,
centered and scaled)
Methods: Data
[Diagram: assembling the analysis data set.
FastA sequence: 5400 predicted genes (e.g., PFA0135w, 124752:125719 bp)
QC microarray: 3800 genes, 5100 probes (e.g., PFA0135w, probe a16122_1, time points t1, t2, …, t48)
Intersection: 3500 genes with a common gene name (e.g., PFA0135w, 124752:125719 bp, probe a16122_1, t1, t2, …, t48)]
Methods: Covariograms

ρ(x, y; da, db) = Ave[ ρ(x, y) : da ≤ dist(x, y) ≤ db ]

• Covariogram 1: distance is chromosomal location:

d(gi, gj) = | gi,midpt(chrloc) − gj,midpt(chrloc) |

• Covariogram 2: distance is printed microarray location:

d(gi, gj) = sqrt( (gi,x − gj,x)² + (gi,y − gj,y)² )
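A sketch of the binned-average computation in R; R_mat is the gene × gene correlation matrix of expression profiles and D the matching distance matrix under either metric (names and bin widths are illustrative):

# average pairwise correlation over gene pairs whose distance lies in [da, db)
covariogram_bin <- function(R_mat, D, da, db) {
  sel <- upper.tri(D) & D >= da & D < db  # count each unordered pair once
  mean(R_mat[sel])
}

# trace the covariogram over a grid of distance bins
breaks <- seq(0, 200000, by = 10000)
rho <- mapply(covariogram_bin, da = head(breaks, -1), db = tail(breaks, -1),
              MoreArgs = list(R_mat = R_mat, D = D))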
[Figures: covariograms 1 and 2 for Chr 10; covariograms 1 and 2 for Chr 6]
Methods: Partitioning
[Figure: a chromosome segment partitioned at 0 kb, 60 kb, and 120 kb]
• Partition the chromosome into intervals
• Within each interval, average all pairwise Pearson correlations:
– Interval 1: 7 genes give C(7,2) = 21 pairwise correlations, so r̄1 = (1/21) Σ_{i=1}^{21} r_i
– Interval 2: 3 genes give C(3,2) = 3 pairwise correlations, so r̄2 = (1/3) Σ_{i=1}^{3} r_i
Methods: Partitioning
• Chr 6, 40 kb partition
• Significant?
Methods: Permutation Test
• r̄ ≈ 0.50 in a 40 kb interval on chr 6
• Permutation test
• Null distribution
• Estimated p-values

gene   obs   Perm(1)   Perm(2)   …   Perm(n)
g1     e1    e4        e3        …   e2
g2     e2    e2        e4        …   e3
g3     e3    e1        e2        …   e1
g4     e4    e3        e1        …   e4
…
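A sketch of the resulting p-value in R, reusing avg_pairwise_cor from the partitioning sketch; shuffling expression profiles across genome positions is, for a single interval, equivalent to drawing a random gene set of the same size (names are mine):

perm_pval <- function(expr, in_interval, B = 1000) {
  obs  <- avg_pairwise_cor(expr[in_interval, ])
  null <- replicate(B, {
    idx <- sample(nrow(expr), sum(in_interval))  # random genes, same count
    avg_pairwise_cor(expr[idx, ])
  })
  mean(null >= obs)                              # estimated p-value
}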
Methods: Permutation Test
• Distribution of r̄ in a 40 kb interval: r̄_obs = 0.57, n_genes = 2, p-value = 0.22
Methods: Permutation Test
• Distribution of r̄ in a 40 kb interval: r̄_obs = 0.72, n_genes = 6, p-value = 0.001
Methods: Permutation Test
• Distribution of r̄ in a 40 kb interval: r̄_obs = 0.49, n_genes = 9, p-value = 0.002
Methods: Permutation Test
• Distribution of r̄ in a 40 kb interval: r̄_obs = 0.018, n_genes = 12, p-value = 0.475
Significant Intervals (Chr 7)
[Figure, repeated over several slides with progressive highlighting: significant intervals on chromosome 7 at partition sizes 100 kb, 80 kb, 60 kb, 40 kb, 20 kb, and 10 kb]
MAL6P1.257: hypothetical protein
MAL6P1.258: malate:quinone oxidoreductase
MAL6P1.259: hypothetical protein
MAL6P1.260: hypothetical protein
MAL6P1.263: hypothetical protein
MAL6P1.265: pyridoxine kinase
MAL6P1.266: hypothetical protein
MAL6P1.267: hypothetical protein
MAL6P1.268: hypothetical protein
MAL6P1.271: cdc2-like protein kinase
MAL6P1.272: ribonuclease
MAL6P1.273: hypothetical protein
Results: Summary Table
         10kb     60kb    100kb   10kb in 60kb
Chr 3    3/400    0/68    0/40    0
Chr 4    10/476   5/80    2/48    4
Chr 5    6/528    1/88    3/56    0
Chr 14   4/1304   2/220   1/132   0
Conclusions
• Statistical: Significance for both small
regions of strong correlation and large
regions of weak correlation
• Biological: Evidence for regulation at
multiple levels