Project Phase I
Due on 9/22; send it to me by email
2-10 pages
Free style in writing (use 11pt font or larger)
Project description
Overview
Problem definition
Why it is important
Some review of existing work
Objectives to achieve
Gene Expression
Data Analyses
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline
Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering
Gene expression profiles
Expression (levels relative to a reference point at 0)
Time/Condition
Goal of Microarray Experiments
Regulation/function in pathway/cellular state/phenotype
Disease diagnosis / disease gene identification
Gene expression
Microarray data
Biological pathway
What Microarray Can Tell Us
Differentially expressed genes
Under different conditions
Different genotypes (mutant vs. wild type)
Co-expression and gene function inference
Regulatory network inference
Regulatory Networks
Which gene controls what?
Current methods for network reconstruction
Boolean networks
qualitative representation (on/off relationship)
computationally more manageable
Differential equations
give “detailed” dynamic properties of networks
mathematically/computationally more problematic
Bayesian networks
define regulatory relationship
Widely used
E-Cell Project (http://www.c-cell.org/): network modeling
Similarity between Profiles
Similarity measure:
Euclidean distance
Correlation coefficient
Trend
…
Correlation coefficient often works better.
[Figure: expression profiles, expression vs. time]
Pearson Correlation Coefficient
Compares scaled profiles!
Can detect inverse relationships
Most commonly used
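$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
$n$ = number of conditions
$\bar{x}$ = average expression of gene x in all n conditions
$\bar{y}$ = average expression of gene y in all n conditions
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y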
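The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson` and the three profiles are made up for the example, and the sample standard deviation (dividing by n - 1) is assumed, to match the 1/(n-1) factor in the formula.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (n - 1 in the denominator, as in the formula).
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)

# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [20.0, 40.0, 60.0, 80.0]   # same shape, different scale
gene_c = [8.0, 6.0, 4.0, 2.0]       # inverse of gene_a

r_ab = pearson(gene_a, gene_b)   # close to +1: scaling does not matter
r_ac = pearson(gene_a, gene_c)   # close to -1: inverse relationship detected
```

Because each profile is centered and divided by its own standard deviation, the result depends only on the shape of the profile, which is why scaled profiles compare equal and inverse profiles show up as negative values.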
Correlation Pitfalls
[Figure: raw data, expression of Gene A and Gene B across chips 1-7; correlation = 0.97]
[Figure: normalized data, expression of Gene A and Gene B across chips 1-7]
Euclidean Distance
Scaled versus unscaled
Cannot detect inverse relationships
For Gene X = (x1, x2, …, xn) and Gene Y = (y1, y2, …, yn):
$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}$
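A minimal sketch of the Euclidean distance between two profiles follows. The function name and the example profiles are hypothetical; they are chosen to show that a shifted copy of a profile is close while an inverse profile is simply far away, so the inverse relationship is not detected.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles over three conditions.
gene_x = [1.0, 2.0, 3.0]
gene_shifted = [1.5, 2.5, 3.5]   # same trend, small offset
gene_inverse = [3.0, 2.0, 1.0]   # opposite trend

d_shifted = euclidean(gene_x, gene_shifted)
d_inverse = euclidean(gene_x, gene_inverse)
```

The distance also changes when a profile is rescaled, which is the "scaled versus unscaled" issue: whether to normalize the profiles first is a decision to make before clustering.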
Data-Mining through Clustering
Assumptions for clustering analysis:
Expression level of a gene reflects the gene’s activity.
Genes involved in same biological process exhibit
statistical relationship in their expression profiles.
[Figure labels: Degradation, Synthesis, Chromatin, Glycolysis]
Idea of Clustering
Clustering: group objects into clusters so that
o objects in each cluster have “similar” features;
o objects of different clusters have “dissimilar” features
Methods of Clustering
•discriminant analysis (Fisher, 1931)
•K-means (Lloyd, 1948)
•hierarchical clustering
•self-organizing maps (Kohonen, 1980)
•support vector machines (Vapnik, 1985)
•single linkage (dendrogram)
•minimum spanning tree based clustering
Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
Which Clustering Method Should I Use?
What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?
K-means clustering for expression profiles
Step 1: Transform the n (genes) * m (experiments) matrix into an n (genes) * n (genes) distance matrix

         Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

         Gene A  Gene B  Gene C
Gene A   0       ?       ?
Gene B           0       ?
Gene C                   0

To transform the n*m matrix into an n*n matrix, use a similarity (distance) metric.
Step 2: Cluster genes based on a k-means clustering algorithm
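Step 1 can be sketched as follows. The expression values and the helper names `euclidean` and `distance_matrix` are hypothetical, and Euclidean distance is used only as one possible metric; any similarity or distance function with the same signature could be passed in.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(profiles, metric):
    """Turn an n x m expression matrix into an n x n distance matrix."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric, so each pair is computed once.
            d[i][j] = d[j][i] = metric(profiles[i], profiles[j])
    return d

# Hypothetical 3 genes x 4 experiments expression matrix.
expression = [
    [1.0, 2.0, 3.0, 4.0],   # Gene A
    [1.0, 2.0, 3.0, 5.0],   # Gene B
    [4.0, 3.0, 2.0, 1.0],   # Gene C
]
D = distance_matrix(expression, euclidean)
```

The diagonal is zero and the "?" entries of the table are the pairwise distances; only the upper triangle needs to be computed.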
K-means algorithm
The most popular algorithm for clustering
What is so attractive?
•Simple
•Fast
•Mathematically correct
•Invariant to dimension
•Easy to implement
K-Means Clustering
Basic idea: use cluster centres (means) to represent clusters.
Assign each data element to the closest cluster (centre).
Goal: minimize the squared error (intra-class dissimilarity): $\sum_i d^2(x_i, C(x_i))$
There is no hierarchy.
Must supply the number of clusters (k) into which the data are to be grouped.
K-means Clustering : Procedure (1)
Initialization 1
Specify the number of clusters k, for example k = 4
[Figure: expression matrix (genes by conditions); each point is called a “gene”]
K-means Clustering : Procedure (2)
Initialization 2
Genes are randomly assigned to one of the k clusters, or random starting centers are chosen
K-means Clustering : Procedure (3)
Calculate the mean of each cluster
$m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i$
Example: $m_{\mathrm{BLUE}} = \frac{1}{4}[(6,7) + (3,4) + \ldots]$
[Figure: points labeled (6,7), (1,2), (3,4), (3,2)]
K-means Clustering : Procedure (4)
Each gene is reassigned to the nearest cluster
Gene i is assigned to cluster c: $c = \arg\min_j |m_j - g_i|^2$
K-means Clustering : Procedure (5)
Iterate until the means converge
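The five procedure steps can be put together in a short sketch. It is one possible reading of the slides, not a reference implementation: it assumes the "random starting centers" variant of initialization, squared Euclidean distance, and a stopping rule of "assignments no longer change"; all names and the toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(genes, k, seed=0, max_iter=100):
    """Return (labels, centers, squared_error) for one random start."""
    rng = random.Random(seed)
    # Initialization: choose k random genes as starting centers.
    centers = [list(g) for g in rng.sample(genes, k)]
    labels = None
    for _ in range(max_iter):
        # Reassign each gene to the nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(g, centers[j]))
                      for g in genes]
        if new_labels == labels:      # assignments are stable: converged
            break
        labels = new_labels
        # Recompute the mean of each cluster.
        for j in range(k):
            members = [g for g, c in zip(genes, labels) if c == j]
            if members:               # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(sq_dist(g, centers[c]) for g, c in zip(genes, labels))
    return labels, centers, error

# Toy data: two well-separated groups of three points each.
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers, error = kmeans(toy, k=2)
```

On data this well separated every start ends in the same partition; on real expression data different seeds can give different local minima.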
Convergence of K-means algorithm
•For each set of starting centers we’ll get a local minimum
Increase number of starts!
Example:
111 data points in 9-dimensional space
N = # of starts for achieving global solution

# of clusters   2      3       4       20      30
N               1000   10000   30000   40000   1000000
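"Increase number of starts" can be sketched as a loop that runs K-means from several random initializations and keeps the run with the smallest squared error. The compact K-means inside is an assumed variant (random starting centers, squared Euclidean distance), and the names, the data, and the default of 20 starts are hypothetical. Many starts improve the chance of reaching the global minimum but do not guarantee it.

```python
import random

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from random starting centers: (labels, error)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, sum(sq_dist(p, centers[c]) for p, c in zip(points, labels))

def kmeans_restarts(points, k, n_starts=20, seed=0):
    """Run K-means n_starts times; return the best run and all errors."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    best = min(runs, key=lambda run: run[1])   # smallest squared error wins
    return best, [run[1] for run in runs]

# Toy data: three pairs of nearby points.
pts = [[0, 0], [0, 1], [10, 0], [10, 1], [5, 10], [5, 11]]
(best_labels, best_error), all_errors = kmeans_restarts(pts, k=3)
```

Comparing `all_errors` across starts shows how many distinct local minima were reached, which is a rough indication of whether more starts are needed.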
Hierarchical clustering (1)
Step 1: Transform the genes * experiments matrix into a genes * genes distance matrix

         Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

         Gene A  Gene B  Gene C
Gene A   0       ?       ?
Gene B           0       ?
Gene C                   0

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains
Hierarchical clustering (2)
Distance matrix for five genes:

        G1    G2    G3    G4    G5
G1      0
G2      2     0
G3      6     5     0
G4      10    9     4     0
G5      9     8     5     3     0

After merging G1 and G2:

        G(12)  G3    G4    G5
G(12)   0
G3      6      0
G4      10     4     0
G5      9      5     3     0

After merging G4 and G5:

        G(12)  G3    G(45)
G(12)   0
G3      6      0
G(45)   10     5     0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]
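The worked example can be replayed with a small agglomerative clustering sketch. The merged distances in the tables (6, 10 and 9 for G(12)) equal the maximum over the members, so complete linkage is assumed here; the slides do not name the linkage rule. The function name is hypothetical.

```python
def agglomerate(dist, linkage=max):
    """Merge the closest pair of clusters until one remains.

    dist is a full symmetric matrix; genes are numbered from 1.
    Returns (stages, heights): the groups after each step and the
    distance at which each merge happened.
    """
    clusters = [(i,) for i in range(1, len(dist) + 1)]
    stages, heights = [list(clusters)], []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage by default: largest member-to-member distance.
                d = linkage(dist[i - 1][j - 1]
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = sorted([c for idx, c in enumerate(clusters)
                           if idx not in (a, b)] + [merged])
        stages.append(list(clusters))
        heights.append(d)
    return stages, heights

# Distance matrix for G1..G5 from the example, written out symmetrically.
D = [
    [0, 2, 6, 10, 9],
    [2, 0, 5, 9, 8],
    [6, 5, 0, 4, 5],
    [10, 9, 4, 0, 3],
    [9, 8, 5, 3, 0],
]
stages, heights = agglomerate(D)
```

The stages correspond to P5 through P1, and the merge heights are the branch heights one would draw in the dendrogram. Passing `linkage=min` gives single linkage instead.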
Hierarchical Clustering Results
K-Means vs Hierarchical Clustering
Graph Representation
Represent a set of n-dimensional points as a graph
o each data point (gene) represented as a node
o each pair of genes represented as an edge with a weight
defined by the “dissimilarity” between the two genes
Distance matrix (first rows):
0    1    1.5  2    5    6    7    9
1    0    2    1    6.5  6    8    8
1.5  2    0    1    4    4    6    5.5
...
[Figure: n-D data points, distance matrix, and graph representation]
Minimum Spanning Tree
Spanning tree: a sub-graph that has all nodes
connected and has no cycles
[Figure: panels (a), (b), (c)]
Minimum spanning tree (MST): a spanning tree with
the minimum total distance
How to Construct a Minimum Spanning Tree
Prim’s algorithm and Kruskal’s algorithm
Kruskal’s algorithm
step 1: select the edge with the smallest distance from the graph
step 2: add it to the tree as long as no cycle is formed
step 3: remove the edge from the graph
step 4: repeat steps 1-3 until all nodes are connected in the tree.
[Figure: panels (a)-(e), weighted graph example]
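Kruskal's steps can be sketched with a union-find structure, which is one common way to test whether adding an edge would form a cycle. The graph below is a hypothetical four-node example, not the one from the figure.

```python
def kruskal(n, edges):
    """Minimum spanning tree of a connected graph with nodes 0..n-1.

    edges is a list of (weight, u, v) tuples; returns the tree edges.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):        # smallest distance first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:             # different components: no cycle
            parent[root_u] = root_v
            tree.append((w, u, v))
            if len(tree) == n - 1:       # all nodes are connected
                break
    return tree

# Hypothetical weighted graph on four nodes.
graph_edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 2, 3), (5, 1, 3)]
mst = kruskal(4, graph_edges)
total = sum(w for w, _, _ in mst)
```

Sorting the edges once replaces the repeated "select the smallest edge and remove it" of steps 1 and 3.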
Foundation of MST Approach
Significantly simplifies the data clustering problem, while
losing very little essential information for clustering.
We have mathematically proved:
A multi-dimensional clustering problem is
equivalent to a tree-partitioning problem!
Clustering by Cutting Long Edge
Hierarchical cutting
1st cut: longest edge
2nd cut: second longest edge
…
Works well for “easy” cases.
Produces many single-element clusters in some “difficult” cases.
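Cutting the longest edges can be sketched as: drop the k - 1 longest edges of the spanning tree and read off the connected pieces that remain. The function name and the example tree are hypothetical; the tree is assumed to be given as (weight, u, v) tuples over nodes 0..n-1.

```python
def cut_longest_edges(n, mst_edges, k):
    """Split a spanning tree into k clusters by removing its k-1 longest edges."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    kept = sorted(mst_edges)[: len(mst_edges) - (k - 1)]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in kept:                 # reconnect nodes along kept edges
        parent[find(u)] = find(v)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# Hypothetical spanning tree on five nodes with one long edge (2-3).
tree = [(1, 0, 1), (1, 1, 2), (8, 2, 3), (2, 3, 4)]
two_clusters = cut_longest_edges(5, tree, k=2)
three_clusters = cut_longest_edges(5, tree, k=3)
```

With k = 3 the second cut isolates a single node in this example, which is the kind of single-element cluster mentioned for “difficult” cases.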
Tree-Based Clustering
For each edge, calculate the assessment value
Find the edge that gives the minimum assessment value as the place to cut
Clustering using an iterative method
Guaranteed to find the global optimum using tree-based dynamic programming
Automated Selection of Number of Clusters
Select the “transition point” in the assessment value as the “correct” number of clusters.
Transition Profiles
indicator[n] = (A[n-1] – A[n]) / (A[n] – A[n+1])
A[k] is the assessment value for partition with k clusters
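The indicator can be computed directly from a table of assessment values. In this sketch the values in `A` are invented for illustration, and taking the n with the largest indicator as the transition point is an assumption; the slides only say to select the transition point.

```python
def transition_indicators(A):
    """indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1]).

    A maps the number of clusters k to its assessment value. Only n with
    both neighbours present and a non-zero denominator get an indicator.
    """
    indicators = {}
    for n in sorted(A):
        if n - 1 in A and n + 1 in A:
            denominator = A[n] - A[n + 1]
            if denominator != 0:
                indicators[n] = (A[n - 1] - A[n]) / denominator
    return indicators

# Hypothetical assessment values: large gains up to 3 clusters, small after.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 16.5}
indicators = transition_indicators(A)
suggested_k = max(indicators, key=indicators.get)   # assumed selection rule
```

A large indicator at n means that going from n - 1 to n clusters improved the assessment value much more than going from n to n + 1 does.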
Our clustering of yeast data
Reading Assignments (1)
Suggested reading:
Chapter 10 in Neil C. Jones and Pavel A. Pevzner: An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). MIT Press, 2004.
Chapter 11 in Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press, 2002.
Reading Assignments (2)
Optional reading:
1. Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. 18:526-535, 2002.
2. Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research. 31:5582-5589, 2003.
Project Assignment
Develop a program that implements the K-means clustering algorithm.
1. Allow several random initializations, and compare their clustering results. Choose the one that has the best value for the objective function $\sum_i d^2(x_i, C(x_i))$.
2. Test the program using the gene expression data sent to the mailing list.
3. Output gene IDs for each cluster.