Computer Science
TRICLUSTER
An Effective Algorithm for Mining
Coherent Clusters in 3D
Microarray Data
Mohammed J. Zaki & Lizhuang Zhao
Department of Computer Science,
Rensselaer Polytechnic Institute (RPI), Troy, NY
{zhaol2, zaki}@cs.rpi.edu
Microarray Data
Essential source of information about the Gene
Expression within a cell
Typically 2D: Genes x Samples (Genes x Time)
Measure the expression level of genes in different
samples
Labeled samples: Classification (cancer vs. non-cancer)
Non-labeled samples: Clustering (Bi-clusters)
Goal: Identify the “expression” patterns, providing
clues to the gene regulatory networks within a cell
Why Biclustering?
Some genes are similarly expressed in some samples
Full-space cluster: (g2, g4, g5)
      s1   s2   s3   s4   s5
g1
g2    v21  v22  v23  v24  v25
g3
g4    v41  v42  v43  v44  v45
g5    v51  v52  v53  v54  v55
Bicluster: (g2, g4, g5)×(s2, s3, s5)
      s1   s2   s3   s4   s5
g1
g2         v22  v23       v25
g3
g4         v42  v43       v45
g5         v52  v53       v55
Different “Homogeneity” or
Similarity Criteria
Criteria, from less to more general: Constant, Scaling/Shifting, Order Preserving
Constant (Row, Col, All), example 3x3 matrices:
  1 2 5      1 1 1      2 2 2
  1 2 5      2 2 2      2 2 2
  1 2 5      5 5 5      2 2 2
Scaling/Shifting:
  Scale=1.4          Shift=0.4
  1.0 1.4 2.0        1.0 1.4 2.0
  2.0 2.8 4.0        2.0 2.4 3.0
  2.5 3.5 5.0        2.5 2.9 3.5
Order Preserving:
  Order: 2 1 3
         4 1 7
         3 2 5
         6 3 8
Note: small noise is allowed in all expression values
Why TriCluster?
Typical microarray data is 2D (gene x sample)
Temporal expression very important tool
How does gene expression evolve in time?
Find clusters over genes x samples x time
Spatial expression also of interest
How does gene expression differ in space (e.g.,
different regions of mouse brain)?
Find clusters over gene x samples x space
Combine temporal and spatial expression
Find clusters over gene x time x space, etc.
There is an emerging need to mine 3D data
TriCluster:
Our Contributions
First algorithm to mine tri-clusters in 3D microarray
data
Complete and deterministic
Mine maximal clusters satisfying given homogeneity
criteria
Constant: column, row, all
Scaling & Shifting
Clusters can be overlapping; optionally delete/merge
clusters having large overlap
Propose a set of metrics for cluster evaluation
Use Gene Ontology (GO) to assess biological
significance
Definitions
$G$ is a set of genes $\{g_0, g_1, \ldots, g_{n-1}\}$
$S$ is a set of samples $\{s_0, s_1, \ldots, s_{m-1}\}$
$T$ is a set of time courses $\{t_0, t_1, \ldots, t_{l-1}\}$
3D real-valued dataset $D = \{d_{ijk}\} = G \times S \times T$
$d_{ijk}$ is the expression value of gene $g_i$ in sample $s_j$ at time $t_k$
triCluster is a maximal submatrix of $D$ that satisfies some homogeneity conditions
$C = X \times Y \times Z = \{c_{ijk}\}$
$X \subseteq G$, $Y \subseteq S$, $Z \subseteq T$
Given homogeneity conditions
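Scaling triCluster Example
[Figure: a scaling triCluster over Genes x Samples x Time, shown as three time slices; rows are genes, columns are samples]
Gene ratios 1 : 2 : 5, sample ratios 1 : 3 : 4, time ratios 1 : 4 : 2
Time slice with ratio 1:
   1   3   4
   2   6   8
   5  15  20
Time slice with ratio 4:
   4  12  16
   8  24  32
  20  60  80
Time slice with ratio 2:
   2   6   8
   4  12  16
  10  30  40
Note: small noise is allowed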
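The numbers in this example are consistent with an outer product of per-gene, per-sample and per-time scale factors. The following is a minimal Python sketch of that reading, assuming the factors (1, 2, 5), (1, 3, 4) and (1, 4, 2) read off the slide; the function name `scaling_tricluster` is hypothetical and not part of TriCluster.

```python
def scaling_tricluster(gene_f, sample_f, time_f):
    """Return cube[t][g][s] = time_f[t] * gene_f[g] * sample_f[s]."""
    return [[[t * g * s for s in sample_f] for g in gene_f] for t in time_f]

# Factors assumed from the slide: genes 1:2:5, samples 1:3:4, time 1:4:2.
cube = scaling_tricluster([1, 2, 5], [1, 3, 4], [1, 4, 2])

for time_factor, time_slice in zip([1, 4, 2], cube):
    print("time ratio", time_factor)
    for row in time_slice:
        print("  ", row)
```

Every row, column and time fibre of such a cube is a scalar multiple of any other parallel one, which is what makes it a (noise-free) scaling cluster.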
TriCluster Concepts
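$C = X \times Y \times Z = \{c_{ijk}\}$ is a triCluster iff
$C$ is maximal (no $C' \supset C$)
$C$ has sufficient size: $|X| \ge m_g$, $|Y| \ge m_s$, $|Z| \ge m_t$
Noise/error threshold $\epsilon$ is satisfied for any $C_{22}$
$C_{22} = \begin{pmatrix} c_{ia} & c_{ib} \\ c_{ja} & c_{jb} \end{pmatrix}$ is an arbitrary 2x2 submatrix of $C$
Let $r_i = |c_{ia}/c_{ib}|$ and $r_j = |c_{ja}/c_{jb}|$
$\max(r_i, r_j) / \min(r_i, r_j) - 1 \le \epsilon$
Range threshold $\delta^a$ is satisfied for each dim $a \in \{g, s, t\}$
$\delta = |c_{ijk} - c_{xyz}|$
If $j=y$, $k=z$, then $\delta \le \delta^g$ (similarly define $\delta^s$, $\delta^t$)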
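To make the 2x2 noise condition concrete, here is a small Python sketch that checks it on a single 2D slice (rows as genes, columns as samples). It is an illustration only: the helper name `ratio_coherent` is hypothetical, all values are assumed to be nonzero, and a full triCluster check would apply the same test along every pair of dimensions.

```python
from itertools import combinations

def ratio_coherent(matrix, eps):
    """True if every 2x2 submatrix satisfies max(ri, rj)/min(ri, rj) - 1 <= eps.

    For a fixed column pair (a, b), checking all row pairs is the same as
    comparing the largest and smallest ratio |c_ia / c_ib| over all rows.
    Assumes every entry is nonzero.
    """
    n_cols = len(matrix[0])
    for a, b in combinations(range(n_cols), 2):
        ratios = [abs(row[a] / row[b]) for row in matrix]
        if max(ratios) / min(ratios) - 1 > eps:
            return False
    return True

scaling = [[1, 3, 4], [2, 6, 8], [5, 15, 20]]    # rows are multiples of 1 3 4
broken = [[1, 3, 4], [2, 6, 8], [5, 15, 30]]     # last entry breaks the pattern

print(ratio_coherent(scaling, 0.01))
print(ratio_coherent(broken, 0.05))
```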
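TriCluster Flexibility
Cluster definition is symmetric
Any ordering of dimensions allowed
A/C≈B/D ↔ A/B≈C/D ↔ AD≈BC
$\begin{pmatrix} A & C \\ B & D \end{pmatrix}^T = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$
Can mine several types of clusters
Typically $\epsilon$ is set close to 0 to allow small noise/error
Approx constant cluster: $\delta^g \approx 0$ and $\delta^s \approx 0$ and $\delta^t \approx 0$
Approx single dim constant: $\delta^g \approx 0$ or $\delta^s \approx 0$ or $\delta^t \approx 0$
Approx two dim constant: ($\delta^g \approx 0$ and $\delta^s \approx 0$) or
($\delta^g \approx 0$ and $\delta^t \approx 0$) or ($\delta^s \approx 0$ and $\delta^t \approx 0$)
Scaling cluster: $\delta^g$ and $\delta^s$ and $\delta^t$ are unconstrained
Shifting cluster: if $e^C$ is a scaling cluster, then $C$ is a shifting cluster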
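The shifting-to-scaling reduction follows from $e^{x+c} = e^{c} \cdot e^{x}$: a constant offset between columns becomes a constant ratio after exponentiation. A short Python sketch, using the shifting example matrix from the similarity-criteria slide (Shift=0.4); the helper `column_ratios` is a hypothetical name used only for this illustration.

```python
import math

# Shifting example from the slides: column 1 = column 0 + 0.4 in every row.
shifting = [[1.0, 1.4, 2.0],
            [2.0, 2.4, 3.0],
            [2.5, 2.9, 3.5]]

# Exponentiate every entry: offsets turn into multiplicative factors.
exp_c = [[math.exp(v) for v in row] for row in shifting]

def column_ratios(m, a, b):
    """Ratio of column a to column b, one value per row."""
    return [row[a] / row[b] for row in m]

ratios = column_ratios(exp_c, 1, 0)
print(ratios)           # each value is approximately exp(0.4)
print(math.exp(0.4))
```

So a miner that only handles scaling clusters can find shifting clusters by running on the exponentiated data.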
TriCluster Algorithm
Compute maximal biclusters on G x S for
each time slice t ∈ T
Construct range multigraph
Find maximal cliques
Compute triclusters from biclusters
Construct new multigraph (T x biclusters)
Find maximal cliques
Merge/Prune overlapping clusters
Maximal Biclusters
Mine each GxS time-slice for maximal
biclusters
For each pair of samples, get valid ratio
ranges within ε and gene-sets
Construct a Range Multigraph
Mine maximal cliques
Each clique/cluster can contribute to some
valid tricluster
Valid Ratio Ranges:
Each Column Pair
[Figure: Range Example, showing the original data and the data after row/col permutation]
Take the ratio of s0 and s6 and construct valid ranges:
A range contains at least mg values within ε (noise threshold)
With ε=0.05 and mg=3: 3.0×(1+ε)=3.15, so range = [3, 3.15]
Other ranges = [3.3, 3.465], and so on
Construct gene-sets: [3, 3.15] has genes {g1, g4, g8}
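The range construction for one column pair can be sketched as a sliding window over the sorted ratios. This is a simplified illustration, not the authors' exact procedure: the function name `valid_ratio_ranges` and the input data are hypothetical, ratios are assumed to be positive, and only windows not contained in an earlier recorded window are kept.

```python
def valid_ratio_ranges(col_a, col_b, eps, mg):
    """Return [(low, low*(1+eps), genes)] for windows holding >= mg genes.

    col_a and col_b hold the expression values of two samples, indexed by gene.
    """
    ratios = sorted((a / b, g) for g, (a, b) in enumerate(zip(col_a, col_b)))
    ranges, last_end = [], 0
    for start in range(len(ratios)):
        low = ratios[start][0]
        limit = low * (1 + eps)
        end = start
        while end < len(ratios) and ratios[end][0] <= limit:
            end += 1
        # Keep the window if it is big enough and extends past the last kept one.
        if end - start >= mg and end > last_end:
            genes = sorted(g for _, g in ratios[start:end])
            ranges.append((low, limit, genes))
            last_end = end
    return ranges

# Hypothetical values for two samples over seven genes.
sample_a = [3.0, 3.1, 3.3, 3.05, 6.0, 3.4, 3.35]
sample_b = [1.0] * 7
result = valid_ratio_ranges(sample_a, sample_b, eps=0.05, mg=3)
for low, high, genes in result:
    print(low, high, genes)
```

With ε=0.05 and mg=3 this yields one range starting at 3.0 and one starting at 3.3, mirroring the [3, 3.15] and [3.3, 3.465] ranges on the slide.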
Range Multigraph:
pair of samples
Construct valid ratios & gene-sets for s1/s4
Ratio = 1/1, gene-set = {g2, g6, g0, g9, g7}
Ratio = 5/4, gene-set = {g4, g8, g1}
Construct ratios/gene-sets for other pairs
[Figure: Multigraph]
Range Multigraph: complete
Construct ratios/gene-sets for all sample pairs
Maximal Clique Mining
[Figure: range multigraph over samples s0, s1, s2, s3, s4, s5, s6]
Perform recursive depth-first search
Maintain valid gene-sets for each node
Intersect gene-sets with each outgoing edge
{g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9}
Prune if various criteria not met (size, dim range)
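The depth-first search with gene-set intersection can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the authors' implementation: the function name `mine_biclusters` and the edge layout (a dict from a sample pair to a list of gene-sets, one per valid ratio range) are hypothetical, only the size criteria are used for pruning, and the consistency of the ratios along a clique is not checked.

```python
def mine_biclusters(samples, edges, mg, ms):
    """Depth-first clique search; edges[(sa, sb)] with sa < sb lists gene-sets."""
    found = set()

    def extend(genes, clique):
        if len(clique) >= ms:
            found.add((frozenset(genes), tuple(clique)))
        last = clique[-1]
        for nxt in samples:
            if nxt <= last:
                continue
            for gene_set in edges.get((last, nxt), []):
                cand = genes & gene_set          # intersect with the outgoing edge
                if len(cand) < mg:
                    continue                     # prune: too few genes left
                # Every earlier sample needs some edge to nxt covering cand.
                if all(any(cand <= gs for gs in edges.get((s, nxt), []))
                       for s in clique[:-1]):
                    extend(cand, clique + [nxt])

    all_genes = frozenset().union(*(gs for sets in edges.values() for gs in sets))
    for s in samples:
        extend(all_genes, [s])

    # Keep only maximal clusters (not contained in another found cluster).
    return [c for c in found
            if not any(c != d and c[0] <= d[0] and set(c[1]) <= set(d[1])
                       for d in found)]

# Hypothetical multigraph over three samples (0, 1, 2).
edges = {
    (0, 1): [{"g2", "g6", "g0", "g9", "g7"}],
    (1, 2): [{"g2", "g6", "g0", "g9"}],
    (0, 2): [{"g2", "g6", "g0", "g9", "g4"}],
}
clusters = mine_biclusters([0, 1, 2], edges, mg=3, ms=2)
for genes, clique in clusters:
    print(sorted(genes), clique)
```

The three-sample clique keeps exactly the intersection {g2, g6, g0, g9} from the slide, and the two-sample cluster on samples (1, 2) is dropped because it is contained in it.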
Mine triClusters
Let Bt be the set of maximal biclusters for
time slice t
Construct new multigraph
Each time point is a vertex
Each pair of highly overlapping biclusters
(gene-set, samples) forms an edge
between time ti and tj
Call maximal clique mining to obtain maximal
triclusters
Constructing triClusters
[Figure sequence: three slides over time slices ti, tj, tk]
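Prune and Merge
[Figure: three cases involving clusters A, Ai, Aj and B]
Prune B if $|L_{B-A}| / |L_B|$ is below a threshold
Prune B if $|L_B - \bigcup_i L_{A_i}| / |L_B|$ is below a threshold
Merge A & B if $|L_{(A+B)-A-B}| / |L_{A+B}|$ is below a threshold
Cluster Span:
LC = {(i,j,k) | gi, sj, tk ∈ C}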
LAB = LA LB
LA-B = LA – LB
LA+B = (LA – LB) (LB – LA) (LA LB)
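The span-based tests can be illustrated with Python sets. This is a sketch under assumptions: a cluster is represented as a tuple (X, Y, Z) of index sets, the function names are hypothetical, and A+B is taken to be the bounding cluster (XA ∪ XB) × (YA ∪ YB) × (ZA ∪ ZB), which is one possible reading of the slide.

```python
from itertools import product

def span(cluster):
    """Set of (gene, sample, time) cells covered by cluster = (X, Y, Z)."""
    return set(product(*cluster))

def prune_ratio(a, b):
    """|L_B - L_A| / |L_B|: fraction of B's cells not covered by A."""
    la, lb = span(a), span(b)
    return len(lb - la) / len(lb)

def merge_ratio(a, b):
    """Fraction of the merged cluster's cells lying in neither A nor B.

    Assumption: the merged cluster is the bounding box of A and B.
    """
    merged = tuple(set(p) | set(q) for p, q in zip(a, b))
    lm = span(merged)
    return len(lm - span(a) - span(b)) / len(lm), merged

A = ({0, 1, 2, 3}, {0, 1}, {0, 1})
B = ({2, 3, 4}, {0, 1}, {0, 1})
print(prune_ratio(A, B))      # 4 of B's 12 cells lie outside A
print(merge_ratio(A, B)[0])   # the bounding cluster adds no new cells
```

A small prune ratio means B adds little beyond A; a small merge ratio means merging introduces few cells that neither cluster covered.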
Metrics for Measuring
Clustering Quality
NumClusters
Number of Clusters
Span
Span (X×Y×Z)=|X|×|Y|×|Z|
ElementSum
Sum of all cluster Spans (count multiple times)
Coverage
Union of all cluster Spans (count once)
Overlap
(ElementSum - Coverage) / Coverage
We want high coverage with small overlap
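These metrics are straightforward to compute once each cluster is expanded into its set of cells. A minimal Python sketch, assuming clusters are given as tuples (X, Y, Z) of index sets and that at least one cluster exists; the function name `clustering_metrics` is hypothetical.

```python
from itertools import product

def clustering_metrics(clusters):
    """Compute NumClusters, ElementSum, Coverage and Overlap for (X, Y, Z) clusters."""
    spans = [set(product(x, y, z)) for x, y, z in clusters]
    element_sum = sum(len(s) for s in spans)      # cells counted once per cluster
    coverage = len(set().union(*spans))           # distinct cells, counted once
    overlap = (element_sum - coverage) / coverage
    return {
        "NumClusters": len(clusters),
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": overlap,
    }

# Two hypothetical clusters sharing gene 1.
metrics = clustering_metrics([
    ({0, 1}, {0, 1}, {0}),
    ({1, 2}, {0, 1}, {0}),
])
print(metrics)
```

Here each cluster has span 4, the union has 6 cells, and the overlap is (8 - 6) / 6.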
Synthetic Data
Generation
Experiments: 1.4 GHz, 448 MB, Linux/VMware
Synthetic data for parameter evaluation
Input parameters:
|G|=4000, |S|=30, |T|=20
Number of clusters to embed = 10
Overlap % among clusters = 20%
Noise for expression values = 3%
Cluster size range = 150x6x4 (some
variation)
Generate clusters with values within some range
Fill rest of cells with random noise
Do random permutations along each dimension
We vary one parameter and keep others fixed
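A generator along these lines can be sketched in a few lines of Python. The details are assumptions rather than the authors' generator: the function name `make_dataset` is hypothetical, only one cluster is embedded, the embedded cluster is a scaling pattern built from random per-axis factors, and choosing random indices stands in for the permutation step.

```python
import random

def make_dataset(n_genes, n_samples, n_times, cluster_shape, noise=0.03, seed=0):
    """Return (data, cluster) where data[g][s][t] holds one embedded scaling cluster."""
    rng = random.Random(seed)
    # Background: fill every cell with random values.
    data = [[[rng.uniform(0, 10) for _ in range(n_times)]
             for _ in range(n_samples)]
            for _ in range(n_genes)]
    cg, cs, ct = cluster_shape
    # Random positions along each axis (plays the role of the permutation step).
    genes = rng.sample(range(n_genes), cg)
    samples = rng.sample(range(n_samples), cs)
    times = rng.sample(range(n_times), ct)
    # Assumed pattern: product of per-axis factors, perturbed by relative noise.
    gf = {g: rng.uniform(1, 5) for g in genes}
    sf = {s: rng.uniform(1, 5) for s in samples}
    tf = {t: rng.uniform(1, 5) for t in times}
    for g in genes:
        for s in samples:
            for t in times:
                jitter = 1 + rng.uniform(-noise, noise)
                data[g][s][t] = gf[g] * sf[s] * tf[t] * jitter
    return data, (sorted(genes), sorted(samples), sorted(times))

# Small sizes for illustration; the slides use |G|=4000, |S|=30, |T|=20.
data, cluster = make_dataset(40, 6, 5, cluster_shape=(10, 3, 2))
print(len(data), len(data[0]), len(data[0][0]), cluster)
```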
Results on Synthetic Datasets
[Figure: six plots of Time (sec) versus Number of Genes, Number of Samples, Number of Time-points, Number of Clusters, Overlap (%), and Variation (%)]
Results on Yeast Cell
Cycle Dataset
http://genome-www.stanford.edu/cellcycle
Elutriation Experiment
7679 genes
14 time points (0 to 390mins @ 30 min gaps)
No real samples: use raw expression values of 13
attributes as samples (Cyc3, Cyc5, ratios, etc)
GxSxT = 7679 x 13 x 14
Note: actual 3D data will become publicly available
soon (e.g. Mouse Brain Atlas: genes x space x time)
Run TriCluster: mg=50, ms= 4, mt= 5, ε = 0.03
Found 5 clusters in 28s, overlap=0, coverage=6250
2D view of cluster C0 (51x4x5) shown next
2D Views of cluster C0 on yeast data
[Figure: panels Sample Curves, Time Curves, and Gene Curves; axes Genes, Time points, Expression Values; samples s=CH2I, CH2D, CH2IN, CH2DN; time points t=120, 210, 270, 330, 390]
Results on Yeast Cell Cycle
Dataset: Gene Ontology
Cluster C0, #Genes 51
  Process: ubiquitin cycle (n=3, p=0.00346), protein polyubiquitination (n=2, p=0.00796), carbohydrate biosynthesis (n=3, p=0.00946)
Cluster C1, #Genes 52
  Process: G1/S transition of mitotic cell cycle (n=3, p=0.00468), mRNA polyadenylylation (n=2, p=0.00826)
  Function: protein phosphatase regulator activity (n=2, p=0.00397), phosphatase regulator activity (n=2, p=0.00397)
Cluster C2, #Genes 57
  Process: lipid transport (n=2, p=0.0089)
  Function: oxidoreductase activity (n=7, p=0.00239), lipid transporter activity (n=2, p=0.00627), antioxidant activity (n=2, p=0.00797)
  Cellular Location: cytoplasm (n=41, p=0.00052), microsome (n=2, p=0.00627), vesicular fraction (n=2, p=0.00627), microbody (n=3, p=0.00929), peroxisome (n=3, p=0.00929)
Cluster C3, #Genes 97
  Process: physiological process (n=76, p=0.0017), organelle organization and biogenesis (n=15, p=0.00173), localization (n=21, p=0.00537)
  Function: MAP kinase activity (n=2, p=0.00209), deaminase activity (n=2, p=0.00804), hydrolase activity, acting on carbon-nitrogen, but not peptide, bonds (n=4, p=0.00918), receptor signaling protein serine/threonine kinase activity (n=2, p=0.00964)
  Cellular Location: membrane (n=29, p=9.36e-06), cell (n=86, p=0.0003), endoplasmic reticulum (n=13, p=0.00112), vacuolar membrane (n=6, p=0.0015), cytoplasm (n=63, p=0.00169), intracellular (n=79, p=0.00209), endoplasmic reticulum membrane (n=6, p=0.00289), integral to endoplasmic reticulum membrane (n=3, p=0.00328), nuclear envelope-endoplasmic reticulum network (n=6, p=0.00488)
Cluster C4, #Genes 66
  Process: pantothenate biosynthesis (n=2, p=0.00246), pantothenate metabolism (n=2, p=0.00245), transport (n=16, p=0.00332), localization (n=16, p=0.00453)
  Function: ubiquitin conjugating enzyme activity (n=2, p=0.00833), lipid transporter activity (n=2, p=0.00833)
  Cellular Location: Golgi vesicle (n=2, p=0.00729)
Significant (p-value < 0.01) Shared Gene Ontology (GO) Terms
(Process, Function, Location) for Genes in Different Clusters
Results on Yeast Cell Cycle
Specific Cluster
Cluster C3, #Genes 97
Process: physiological process (n=76, p=0.0017), organelle organization and biogenesis (n=15, p=0.00173), localization (n=21, p=0.00537)
Different clusters show different shared terms
Results could be potentially biologically significant
Summary
Contributions
First algorithm to mine triclusters from 3D
microarrays
Complete, deterministic
Allows small noise
Flexible: constant, single/two dim, scaling, shifting
Allows arbitrary overlap (merge/prune)
Potentially biologically significant clusters (GO)!
Future Work
Extend from 3-D to k-D datasets
Allow different pattern types along different axes
(scaling along GxS, shifting along T, etc.)
Enhance clique mining step from multigraphs