Computer Science
TRICLUSTER
An Effective Algorithm for Mining
Coherent Clusters in 3D
Microarray Data
Mohammed J. Zaki & Lizhuang Zhao
Department of Computer Science,
Rensselaer Polytechnic Institute (RPI), Troy, NY
{zhaol2, zaki}@cs.rpi.edu
Microarray Data
 Essential source of information about the Gene
Expression within a cell
 Typically 2D: Genes x Samples (Genes x Time)
   Measure the expression level of genes in different samples
 Labeled samples: Classification (cancer vs. non-cancer)
 Non-labeled samples: Clustering (Bi-clusters)
 Goal: Identify the “expression” patterns, providing
clues to the gene regulatory networks within a cell
Why Biclustering?
 Some genes are similarly expressed in some samples
[Figure: genes g1 to g5 by samples s1 to s5. Full-space cluster: genes (g2, g4, g5) over all samples s1 to s5 (values v21..v25, v41..v45, v51..v55). Bicluster: (g2, g4, g5)×(s2, s3, s5) (values v22, v23, v25; v42, v43, v45; v52, v53, v55).]
Different “Homogeneity” or
Similarity Criteria
Cluster types range from Constant (All, Row, Col) to the more general Scaling/Shifting and Order Preserving

Constant (All):
2 2 2
2 2 2
2 2 2

Constant (Row):
1 1 1
2 2 2
5 5 5

Constant (Col):
1 2 5
1 2 5
1 2 5

Scaling (Scale=1.4):
1.0 1.4 2.0
2.0 2.8 4.0
2.5 3.5 5.0

Shifting (Shift=0.4):
1.0 1.4 2.0
2.0 2.4 3.0
2.5 2.9 3.5

[Figure: order-preserving example matrix]
Note: small noise ε is allowed in all expression values
Why TriCluster?
 Typical microarray data is 2D (gene x sample)
 Temporal expression is a very important tool
 How does gene expression evolve in time?
 Find clusters over genes x samples x time
 Spatial expression also of interest
 How does gene expression differ in space (e.g.,
different regions of mouse brain)?
 Find clusters over gene x samples x space
 Combine temporal and spatial expression
 Find clusters over gene x time x space, etc.
 There is an emerging need to mine 3D data
TriCluster:
Our Contributions
 First algorithm to mine tri-clusters in 3D microarray
data
 Complete and deterministic
 Mine maximal clusters satisfying given homogeneity criteria
   Constant: column, row, all
   Scaling & Shifting
 Clusters can be overlapping; optionally delete/merge
clusters having large overlap
 Propose a set of metrics for cluster evaluation
 Use Gene Ontology (GO) to assess biological significance
Definitions
 G is a set of genes {g0, g1, …, gn-1}
 S is a set of samples {s0, s1, …, sm-1}
 T is a set of time courses {t0, t1, …, tl-1}
 3D real-valued dataset D = {dijk} over G x S x T
 dijk is the expression value of gene gi in sample sj at time tk
 triCluster is a maximal submatrix of D that satisfies some homogeneity conditions
   C = X x Y x Z = {cijk}
   X ⊆ G, Y ⊆ S, Z ⊆ T
   Given homogeneity conditions
Scaling triCluster
Example
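[Figure: a scaling triCluster over Genes x Samples x Time. Gene ratios 1 : 2 : 5, sample ratios 1 : 3 : 4, time ratios 1 : 2 : 4.]

Time ratio 1 (rows = genes, columns = samples):
1 3 4
2 6 8
5 15 20

Time ratio 2:
2 6 8
4 12 16
10 30 40

Time ratio 4:
4 12 16
8 24 32
20 60 80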
Note: small noise ε is allowed
TriCluster Concepts
 C = X x Y x Z = {cijk} is a triCluster iff
   C is maximal (there is no valid cluster C' ⊃ C)
   C has sufficient size: |X| ≥ mg, |Y| ≥ ms, |Z| ≥ mt
   Noise/error threshold ε is satisfied for any C22, where C22 = [c_ia c_ib; c_ja c_jb] is an arbitrary 2x2 submatrix of C
     Let r_i = |c_ia / c_ib| and r_j = |c_ja / c_jb|
     max(r_i, r_j) / min(r_i, r_j) - 1 ≤ ε
   Range threshold δ^a is satisfied for each dimension a
     Let δ = |c_ijk - c_xyz|
     If j=y, k=z, then δ ≤ δ^g (similarly define δ^s, δ^t)
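The noise condition on 2x2 submatrices can be illustrated with a minimal Python sketch for a single 2D slice. The function name `is_scaling_coherent` is hypothetical, the matrix is a plain list of rows, and all values are assumed to be nonzero; this is an illustration of the ratio test, not the authors' implementation.

```python
from itertools import combinations


def is_scaling_coherent(matrix, eps):
    """Check the ratio condition on every 2x2 submatrix of a 2D slice.

    Assumes every value is nonzero (a zero would make a ratio undefined).
    """
    for row_i, row_j in combinations(matrix, 2):
        for a, b in combinations(range(len(row_i)), 2):
            r_i = abs(row_i[a] / row_i[b])
            r_j = abs(row_j[a] / row_j[b])
            # max(r_i, r_j) / min(r_i, r_j) - 1 must not exceed eps
            if max(r_i, r_j) / min(r_i, r_j) - 1 > eps:
                return False
    return True


scaling = [[1.0, 1.4, 2.0], [2.0, 2.8, 4.0], [2.5, 3.5, 5.0]]
shifting = [[1.0, 1.4, 2.0], [2.0, 2.4, 3.0], [2.5, 2.9, 3.5]]
print(is_scaling_coherent(scaling, 0.01))
print(is_scaling_coherent(shifting, 0.01))
```

The check is quadratic in both rows and columns, so it is only meant for verifying a candidate cluster, not for searching the full dataset.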

TriCluster Flexibility
 Cluster definition is symmetric
   Any ordering of dimensions is allowed
   Transposing a 2x2 submatrix preserves coherence: [A C; B D]^T = [A B; C D], and A/C≈B/D ↔ A/B≈C/D ↔ AD≈BC
 Can mine several types of clusters
   Typically ε ≈ 0 to allow small noise/error
   Approx constant cluster: δ^g ≈ 0 and δ^s ≈ 0 and δ^t ≈ 0
   Approx single dim constant: δ^g ≈ 0 or δ^s ≈ 0 or δ^t ≈ 0
   Approx two dim constant: (δ^g ≈ 0 and δ^s ≈ 0) or (δ^g ≈ 0 and δ^t ≈ 0) or (δ^s ≈ 0 and δ^t ≈ 0)
   Scaling cluster: δ^g and δ^s and δ^t are unconstrained
   Shifting cluster: if e^C is a scaling cluster, then C is a shifting cluster
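The link between shifting and scaling clusters can be checked numerically: exponentiating a shifting matrix should give a matrix whose 2x2 submatrices satisfy AD≈BC. The sketch below assumes a plain list-of-rows matrix; the helper names `exp_transform` and `satisfies_ad_bc` are hypothetical, and the relative tolerance `eps` stands in for the noise threshold.

```python
import math
from itertools import combinations


def exp_transform(matrix):
    """Apply e^x to every cell, turning additive shifts into factors."""
    return [[math.exp(v) for v in row] for row in matrix]


def satisfies_ad_bc(matrix, eps):
    """Check AD ≈ BC (relative difference at most eps) on every 2x2 submatrix."""
    for row_i, row_j in combinations(matrix, 2):
        for a, b in combinations(range(len(row_i)), 2):
            ad = row_i[a] * row_j[b]
            bc = row_i[b] * row_j[a]
            if abs(ad / bc - 1) > eps:
                return False
    return True


shifting = [[1.0, 1.4, 2.0], [2.0, 2.4, 3.0], [2.5, 2.9, 3.5]]
print(satisfies_ad_bc(shifting, 1e-9))                 # raw values
print(satisfies_ad_bc(exp_transform(shifting), 1e-9))  # after e^x
```

With this transform, a miner that only handles scaling clusters could in principle be reused for shifting clusters by running it on the exponentiated data.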
TriCluster Algorithm
 Compute maximal biclusters on G x S for each time slice t ∈ T
   Construct range multigraph
   Find maximal cliques
 Compute triclusters from biclusters
   Construct new multigraph (T x biclusters)
   Find maximal cliques
 Merge/Prune overlapping clusters
Maximal Biclusters
 Mine each GxS time-slice for maximal biclusters
   For each pair of samples, get valid ratio ranges within ε and gene-sets
   Construct a Range Multigraph
   Mine maximal cliques
   Each clique/cluster can contribute to some valid tricluster
Valid Ratio Ranges:
Each Column Pair
[Figure: range example, showing the original data and the data after row/col permutation]
 Take the ratios of columns s0 and s6 and construct valid ranges:
   Range contains at least mg values within ε (noise threshold)
   ε=0.05, mg=3, then 3.0×(1+ε)=3.15 → range = [3, 3.15]
   Other ranges = [3.3, 3.465], and so on
 Construct gene-sets: [3, 3.15] has genes {g1, g4, g8}
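The range construction for one column pair can be sketched with a sliding window over the sorted ratios. This is a simplified illustration, not the authors' exact procedure: the function name `valid_ratio_ranges` is hypothetical, all values are assumed positive, and only maximal windows [low, low×(1+ε)] holding at least mg genes are reported.

```python
def valid_ratio_ranges(col_a, col_b, eps, mg):
    """Return (low, high, genes) for each maximal window of sorted ratios.

    A window starting at ratio `low` holds all ratios up to low * (1 + eps).
    Assumes positive values in both columns.
    """
    ratios = sorted((a / b, g) for g, (a, b) in enumerate(zip(col_a, col_b)))
    ranges = []
    end = 0        # exclusive end of the current window
    prev_end = 0   # exclusive end of the last reported window
    for start in range(len(ratios)):
        low = ratios[start][0]
        end = max(end, start)
        while end < len(ratios) and ratios[end][0] <= low * (1 + eps):
            end += 1
        # Report only windows that are large enough and not contained
        # in a previously reported window.
        if end - start >= mg and end > prev_end:
            genes = sorted(g for _, g in ratios[start:end])
            ranges.append((low, ratios[end - 1][0], genes))
            prev_end = end
    return ranges


col_a = [1.0, 3.0, 5.0, 7.0, 3.1, 9.0, 11.0, 13.0, 3.05]
col_b = [1.0] * 9
print(valid_ratio_ranges(col_a, col_b, eps=0.05, mg=3))
```

Each reported gene list would become the gene-set attached to one edge between the two samples in the range multigraph.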
Range Multigraph:
pair of samples
 Construct valid ratios & gene-sets for s1/s4
   Ratio = 1/1, gene-set = {g2, g6, g0, g9, g7}
   Ratio = 5/4, gene-set = {g4, g8, g1}
 Construct ratios/gene-sets for other pairs
[Figure: range multigraph for the sample pair s1, s4]
Range Multigraph: complete
 Construct ratios/gene-sets for all sample pairs
Maximal Clique Mining
[Figure: range multigraph over samples s0 to s6]
 Perform recursive depth-first search
 Maintain valid gene-sets for each node
 Intersect gene-sets with each outgoing edge
   {g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9}
 Prune if various criteria not met (size, dim range)
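The depth-first search with gene-set intersection can be sketched as follows. This is a deliberately simplified version under stated assumptions: the function name `mine_maximal_biclusters` and the `edges` layout are hypothetical, the search only follows edges from the most recently added sample (a full clique check would also validate each new sample against all earlier ones), and only the size criteria mg and ms are used for pruning.

```python
def mine_maximal_biclusters(edges, num_samples, mg, ms):
    """Sketch of DFS mining over a range multigraph.

    edges maps a sample pair (a, b) with a < b to a list of gene-sets,
    one per valid ratio range (parallel edges of the multigraph).
    Returns a set of (gene frozenset, sample tuple) candidates that are
    not contained in any other candidate.
    """
    all_genes = frozenset().union(*(g for sets in edges.values() for g in sets))
    candidates = set()

    def dfs(samples, genes):
        if len(samples) >= ms:
            candidates.add((genes, tuple(samples)))
        last = samples[-1]
        for nxt in range(last + 1, num_samples):
            for gene_set in edges.get((last, nxt), []):
                common = genes & gene_set      # intersect with the edge
                if len(common) >= mg:          # prune gene-sets that are too small
                    dfs(samples + [nxt], common)

    for start in range(num_samples):
        dfs([start], all_genes)

    # Keep only candidates not subsumed by another candidate.
    return {
        c for c in candidates
        if not any(c != o and c[0] <= o[0] and set(c[1]) <= set(o[1])
                   for o in candidates)
    }


edges = {
    (0, 1): [frozenset({2, 6, 0, 9, 7}), frozenset({4, 8, 1})],
    (1, 2): [frozenset({2, 6, 0, 9})],
}
for genes, samples in sorted(mine_maximal_biclusters(edges, 3, mg=3, ms=2),
                             key=lambda c: (c[1], sorted(c[0]))):
    print(samples, sorted(genes))
```

Recording every sufficiently large candidate and filtering afterwards keeps the sketch short; an efficient miner would check maximality during the search instead.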
Mine triClusters
 Let Bt be the set of maximal biclusters for
time slice t
 Construct new multigraph
 Each time point is a vertex
 Each pair of highly overlapping biclusters
(gene-set, samples) forms an edge
between time ti and tj
 Call maximal clique mining to obtain maximal
triclusters
Constructing triClusters
[Figure: biclusters at time slices ti, tj, tk]
Prune and Merge
 Prune B if it is mostly covered by a cluster A: |LB-A| / |LB| < η
 Prune B if it is mostly covered by a set of clusters Ai, Aj: |LB-∪Ai| / |LB| < η
 Merge A & B: |L(A+B)-A-B| / |L(A+B)| < γ
Cluster Span:

LC = {(i,j,k) | gi, sj, tk ∈ C}

LA∪B = LA ∪ LB

LA-B = LA – LB

LA+B = (LA – LB)  (LB – LA)  (LA  LB)
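The single-cluster prune test can be sketched directly from the span definition. The names `span`, `should_prune`, and the threshold parameter `eta` are hypothetical; a cluster is assumed to be a triple of gene, sample, and time index sets.

```python
from itertools import product


def span(cluster):
    """Set of (gene, sample, time) cells covered by a cluster."""
    genes, samples, times = cluster
    return set(product(genes, samples, times))


def should_prune(a, b, eta):
    """True if the part of B not covered by A is a small fraction of B.

    eta is an assumed name for the prune threshold.
    """
    span_b = span(b)
    uncovered = span_b - span(a)
    return len(uncovered) / len(span_b) < eta


a = ({0, 1, 2}, {0, 1}, {0})
b = ({2, 3}, {0, 1}, {0})
print(should_prune(a, b, eta=0.6))
print(should_prune(a, b, eta=0.4))
```

Materializing spans as sets is fine for small clusters; for large ones the uncovered fraction could be computed from the index sets without enumerating cells.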
Metrics for Measuring
Clustering Quality
 NumClusters: number of clusters
 Span: Span(X×Y×Z) = |X|×|Y|×|Z|
 ElementSum: sum of all cluster Spans (cells in several clusters are counted multiple times)
 Coverage: size of the union of all cluster Spans (each cell is counted once)
 Overlap: (ElementSum - Coverage) / Coverage
We want high coverage with small overlap
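These metrics are simple to compute once each cluster is given as a triple of index sets. The sketch below assumes that representation; the function name `cluster_metrics` and the returned dictionary layout are illustrative choices, not part of the original work.

```python
from itertools import product


def cluster_metrics(clusters):
    """Compute NumClusters, ElementSum, Coverage, and Overlap.

    Each cluster is a triple (genes, samples, times) of index sets.
    """
    spans = [len(x) * len(y) * len(z) for x, y, z in clusters]
    covered = set()
    for x, y, z in clusters:
        covered.update(product(x, y, z))   # each cell counted once
    element_sum = sum(spans)               # shared cells counted repeatedly
    coverage = len(covered)
    overlap = (element_sum - coverage) / coverage if coverage else 0.0
    return {
        "NumClusters": len(clusters),
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": overlap,
    }


clusters = [({0, 1}, {0, 1}, {0}), ({1, 2}, {0, 1}, {0})]
print(cluster_metrics(clusters))
```

Disjoint clusters give ElementSum equal to Coverage, so Overlap is 0 in that case.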
Synthetic Data
Generation
 Experiments: 1.4 GHz, 448 MB, Linux/VMware
 Synthetic data for parameter evaluation
 Input parameters:
   |G|=4000, |S|=30, |T|=20
   Number of clusters to embed = 10
   Overlap % among clusters = 20%
   Noise for expression values = 3%
   Cluster size range = 150x6x4 (some variation)
 Generate clusters with values within some range
 Fill rest of cells with random noise
 Do random permutations along each dimension
 We vary one parameter and keep others fixed
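A small generator in the spirit of these steps might look like the sketch below. It is an assumption-laden illustration: the function name `make_synthetic`, the value ranges, and the multiplicative construction of one scaling cluster are all hypothetical, and choosing random cell positions stands in for the explicit permutation step.

```python
import random


def make_synthetic(n_genes, n_samples, n_times, cluster_shape, noise=0.03, seed=0):
    """Build a 3D dataset with one embedded scaling cluster.

    Returns (data, (genes, samples, times)), where data[g][s][t] is a float
    and the index lists give the embedded cluster's position.
    """
    rng = random.Random(seed)
    # Background: fill every cell with random values.
    data = [[[rng.uniform(0.0, 10.0) for _ in range(n_times)]
             for _ in range(n_samples)]
            for _ in range(n_genes)]
    cg, cs, ct = cluster_shape
    # Random positions play the role of permuting each dimension.
    genes = rng.sample(range(n_genes), cg)
    samples = rng.sample(range(n_samples), cs)
    times = rng.sample(range(n_times), ct)
    gene_f = [rng.uniform(1.0, 5.0) for _ in genes]
    sample_f = [rng.uniform(1.0, 5.0) for _ in samples]
    time_f = [rng.uniform(1.0, 5.0) for _ in times]
    for i, g in enumerate(genes):
        for j, s in enumerate(samples):
            for k, t in enumerate(times):
                jitter = 1.0 + rng.uniform(-noise, noise)  # per-cell noise
                data[g][s][t] = gene_f[i] * sample_f[j] * time_f[k] * jitter
    return data, (sorted(genes), sorted(samples), sorted(times))


data, (genes, samples, times) = make_synthetic(40, 8, 6, (5, 3, 2))
print(len(data), len(data[0]), len(data[0][0]), genes, samples, times)
```

Varying one argument at a time while fixing the others mirrors the experimental setup described in the list above, at a much smaller scale.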
Results on Synthetic Datasets
[Figures: running time (sec) versus number of genes, number of samples, number of time-points, number of clusters, overlap (%), and variation (%)]
Results on Yeast Cell
Cycle Dataset
 http://genome-www.stanford.edu/cellcycle
 Elutriation Experiment
 7679 genes
 14 time points (0 to 390 min at 30 min intervals)
 No real samples: use raw expression values of 13
attributes as samples (Cyc3, Cyc5, ratios, etc)
 GxSxT = 7679 x 13 x 14
 Note: actual 3D data will become publicly available
soon (e.g. Mouse Brain Atlas: genes x space x time)
 Run TriCluster: mg=50, ms= 4, mt= 5, ε = 0.03
 Found 5 clusters in 28s, overlap=0, coverage=6250
 2D view of cluster C0 (51x4x5) shown next
[Figure: 2D views of cluster C0 on yeast data. Panel titles: Sample Curves, Time Curves, Gene Curves. Axes: Expression Values against Genes or Time points. Curve labels: t=120, t=210, t=270, t=330, t=390; s=CH2I, s=CH2D, s=CH2IN, s=CH2DN.]
Results on Yeast Cell Cycle
Dataset: Gene Ontology
Significant (p-value < 0.01) Shared Gene Ontology (GO) Terms (Process, Function, Location) for Genes in Different Clusters

Cluster C0 (51 genes)
  Process: ubiquitin cycle (n=3, p=0.00346); protein polyubiquitination (n=2, p=0.00796); carbohydrate biosynthesis (n=3, p=0.00946)

Cluster C1 (52 genes)
  Process: G1/S transition of mitotic cell cycle (n=3, p=0.00468); mRNA polyadenylylation (n=2, p=0.00826)
  Function: protein phosphatase regulator activity (n=2, p=0.00397); phosphatase regulator activity (n=2, p=0.00397)

Cluster C2 (57 genes)
  Process: lipid transport (n=2, p=0.0089)
  Function: oxidoreductase activity (n=7, p=0.00239); lipid transporter activity (n=2, p=0.00627); antioxidant activity (n=2, p=0.00797)
  Cellular Location: cytoplasm (n=41, p=0.00052); microsome (n=2, p=0.00627); vesicular fraction (n=2, p=0.00627); microbody (n=3, p=0.00929); peroxisome (n=3, p=0.00929)

Cluster C3 (97 genes)
  Process: physiological process (n=76, p=0.0017); organelle organization and biogenesis (n=15, p=0.00173); localization (n=21, p=0.00537)
  Function: MAP kinase activity (n=2, p=0.00209); deaminase activity (n=2, p=0.00804); hydrolase activity, acting on carbon-nitrogen, but not peptide, bonds (n=4, p=0.00918); receptor signaling protein serine/threonine kinase activity (n=2, p=0.00964)
  Cellular Location: membrane (n=29, p=9.36e-06); cell (n=86, p=0.0003); endoplasmic reticulum (n=13, p=0.00112); vacuolar membrane (n=6, p=0.0015); cytoplasm (n=63, p=0.00169); intracellular (n=79, p=0.00209); endoplasmic reticulum membrane (n=6, p=0.00289); integral to endoplasmic reticulum membrane (n=3, p=0.00328); nuclear envelope-endoplasmic reticulum network (n=6, p=0.00488)

Cluster C4 (66 genes)
  Process: pantothenate biosynthesis (n=2, p=0.00246); pantothenate metabolism (n=2, p=0.00245); transport (n=16, p=0.00332); localization (n=16, p=0.00453)
  Function: ubiquitin conjugating enzyme activity (n=2, p=0.00833); lipid transporter activity (n=2, p=0.00833)
  Cellular Location: Golgi vesicle (n=2, p=0.00729)
Results on Yeast Cell Cycle
Specific Cluster
Cluster C3 (97 genes)
  Process: physiological process (n=76, p=0.0017); organelle organization and biogenesis (n=15, p=0.00173); localization (n=21, p=0.00537)
Different clusters show different shared terms
Results could be potentially biologically significant
Summary
 Contributions
 First algorithm to mine triclusters from 3D
microarrays
 Complete, deterministic
 Allows small noise
 Flexible: constant, single/two dim, scaling, shifting
 Allows arbitrary overlap (merge/prune)
 Potentially biologically significant clusters (GO)!
 Future Work
 Extend from 3-D to k-D datasets
 Allow different pattern types along different axes
(scaling along GxS, shifting along T, etc.)
 Enhance clique mining step from multigraphs