Project Phase I
Due on 9/22; send it to me by email
• 2-10 pages
• Free style in writing (use 11pt font or larger)
• Project description
  o Overview
  o Problem definition
  o Why it is important
  o Some review of existing work
  o Objectives to achieve
Gene Expression
Data Analyses
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline

Gene expression

Similarity between gene expression profiles

Concept of clustering

K-Means clustering

Hierarchical clustering

Minimum spanning tree-based clustering
Gene expression profiles
[Figure: expression levels (relative to a reference point at 0) plotted against time/condition]
Goal of Microarray Experiments
Regulation/function in pathway/cellular state/phenotype
Disease diagnosis / disease gene identification
[Figure: microarray data, gene expression, biological pathway]
What Microarray Can Tell Us
• Differentially expressed genes
  o Under different conditions
  o Different genotypes (mutant vs. wild type)
• Co-expression and gene function inference
• Regulatory network inference
Regulatory Networks
• Which gene controls what?
• Current methods for network reconstruction
  o Boolean networks
    - qualitative representation (on/off relationship)
    - computationally more manageable
  o Differential equations
    - give “detailed” dynamic properties of networks
    - mathematically/computationally more problematic
  o Bayesian networks
    - define regulatory relationship
    - widely used
• E-Cell Project (http://www.c-cell.org/): network modeling
Similarity between Profiles
Similarity measure:
• Euclidean distance
• Correlation coefficient
• Trend
• …
Correlation coefficient often works better.
[Figure: expression profiles, expression vs. time]
Pearson Correlation Coefficient
• Compares scaled profiles!
• Can detect inverse relationships
• Most commonly used

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

n = number of conditions
$\bar{x}$ = average expression of gene x in all n conditions
$\bar{y}$ = average expression of gene y in all n conditions
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y
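The Pearson formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pearson` and the three toy profiles are assumptions made for the example, and the standard deviations use the n - 1 (sample) form so that they match the 1/(n - 1) factor in the formula.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divide by n - 1), matching s_x and s_y.
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # Average product of the scaled (standardized) profiles.
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)

# Hypothetical toy profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]       # inverse of gene_a

r_scaled = pearson(gene_a, gene_b)
r_inverse = pearson(gene_a, gene_c)
```

Because each profile is centered and scaled before comparison, `gene_a` and `gene_b` correlate perfectly despite the tenfold difference in magnitude, and the inverse profile `gene_c` shows up as a correlation of -1.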
Correlation Pitfalls
[Figure: raw data for Gene A and Gene B across chips 1-7; correlation = 0.97]
[Figure: normalized data for Gene A and Gene B across chips 1-7]
Euclidean Distance
• Scaled versus unscaled
• Cannot detect inverse relationships
For Gene X = (x1, x2, …, xn) and Gene Y = (y1, y2, …, yn):

$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$
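As a small sketch of the Euclidean distance between two profiles, the snippet below uses a hypothetical helper `euclidean` and made-up toy profiles. It illustrates both points from this slide: the distance depends on scale, and an inverse profile is not flagged as related.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical toy profiles over three conditions.
gene_x = [1.0, 2.0, 3.0]
gene_y = [2.0, 4.0, 6.0]   # same trend as gene_x, twice the magnitude
gene_z = [3.0, 2.0, 1.0]   # inverse trend of gene_x

d_xy = euclidean(gene_x, gene_y)   # sqrt(1 + 4 + 9)
d_xz = euclidean(gene_x, gene_z)   # sqrt(4 + 0 + 4)
```

Here the inversely related `gene_z` ends up closer to `gene_x` than the co-varying but unscaled `gene_y`, which is why profiles are often scaled before Euclidean distance is used.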
Data-Mining through Clustering
Assumptions for clustering analysis:
• Expression level of a gene reflects the gene’s activity.
• Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.
[Figure: groups labeled Degradation, Synthesis, Chromatin, Glycolysis]
Idea of Clustering
Clustering: group objects into clusters so that
o objects in each cluster have “similar” features;
o objects of different clusters have “dissimilar” features
Methods of Clustering
• discriminant analysis (Fisher, 1931)
• K-means (Lloyd, 1948)
• hierarchical clustering
• self-organizing maps (Kohonen, 1980)
• support vector machines (Vapnik, 1985)
• single linkage (dendrogram)
• minimum spanning tree based clustering
Issues in Cluster Analysis
• A lot of clustering algorithms
• A lot of distance/similarity metrics
• Which clustering algorithm runs faster and uses less memory?
• How many clusters after all?
• Are the clusters stable?
• Are the clusters meaningful?
Which Clustering Method Should I Use?
• What is the biological question?
• Do I have a preconceived notion of how many clusters there should be?
• How strict do I want to be? Split or join?
• Can a gene be in multiple clusters?
• Hard or soft boundaries between clusters?
K-means clustering for expression profiles
Step 1: Transform the n (genes) * m (experiments) matrix into an n (genes) * n (genes) distance matrix

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

          Gene A  Gene B  Gene C
Gene A    0       ?       ?
Gene B            0       ?
Gene C                    0

To transform the n*m matrix into an n*n matrix, use a similarity (distance) metric.
Step 2: Cluster genes based on a k-means clustering algorithm
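Step 1 can be sketched as follows. The example assumes Euclidean distance as the metric (any of the similarity measures discussed earlier could be substituted), and the function name `distance_matrix` and the 3 x 4 expression values are invented for illustration.

```python
import math

def distance_matrix(profiles):
    """Return the symmetric n x n matrix of Euclidean distances between rows."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(profiles[i], profiles[j])))
            dist[i][j] = dist[j][i] = d   # fill both triangles
    return dist

# Hypothetical 3 genes x 4 experiments expression matrix.
expression = [
    [0.0, 0.0, 0.0, 0.0],   # Gene A
    [3.0, 4.0, 0.0, 0.0],   # Gene B
    [0.0, 0.0, 0.0, 2.0],   # Gene C
]
dist = distance_matrix(expression)
```

Only the upper triangle needs to be computed because the matrix is symmetric with a zero diagonal, which is why the slide shows just the entries above the diagonal.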
K-means algorithm
The most popular algorithm for clustering
What is so attractive?
•Simple
•Fast
•Mathematically correct
•Invariant to dimension
•Easy to implement
K-Means Clustering
• Basic idea: use the cluster centre (mean) to represent each cluster
• Assign data elements to the closest cluster (centre)
• Goal: minimize the square error (intra-class dissimilarity):
  $\sum_i d^2(x_i, C(x_i))$
  where $C(x_i)$ is the centre of the cluster to which $x_i$ is assigned
• There is no hierarchy.
• The number of clusters (k) into which the data are to be grouped must be supplied.
K-means Clustering : Procedure (1)
Initialization 1
Specify the number of clusters k
-- for example, k = 4
[Figure: expression matrix (genes by conditions); each point in the plot is called a “gene”]
K-means Clustering : Procedure (2)
Initialization 2
Genes are randomly assigned to one of the k clusters, or random starting centers are chosen
K-means Clustering : Procedure (3)
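Calculate the mean of each cluster:

$m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i$

where $N_c$ is the number of genes in cluster c and $g_i$ are the genes in that cluster.
Example: $m_{BLUE} = \frac{1}{4} [(6,7) + (3,4) + \dots]$
[Figure: clustered points, including (6,7), (1,2), (3,4), (3,2)]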
K-means Clustering : Procedure (4)
Each gene is reassigned to the nearest cluster
Gene i to cluster c:

$c = \arg\min_j \lVert m_j - g_i \rVert^2$
K-means Clustering : Procedure (5)
Iterate until the means converge
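The five procedure steps can be put together in a short sketch. It takes the "random starting centers" option for initialization and stops when the assignments no longer change. The function names, the `seed` and `max_iter` parameters, and the eight 2-D points are assumptions made for this example, not part of the lecture.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(genes, k, seed=0, max_iter=100):
    """Single k-means run; returns (labels, centers, square error)."""
    rng = random.Random(seed)
    # Initialization: choose k distinct genes as random starting centers.
    centers = [list(g) for g in rng.sample(genes, k)]
    labels = None
    for _ in range(max_iter):
        # Reassign each gene to the nearest center: c = argmin_j |m_j - g_i|^2.
        new_labels = [min(range(k), key=lambda j: squared_distance(g, centers[j]))
                      for g in genes]
        if new_labels == labels:
            break   # assignments are stable, so the means have converged
        labels = new_labels
        # Recompute each center as the mean of its member genes.
        for j in range(k):
            members = [g for g, c in zip(genes, labels) if c == j]
            if members:   # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(squared_distance(g, centers[c]) for g, c in zip(genes, labels))
    return labels, centers, error

# Hypothetical data: two well-separated groups of four points each.
genes = [(1, 2), (3, 2), (3, 4), (1, 4),
         (100, 100), (102, 100), (100, 102), (102, 102)]
labels, centers, error = kmeans(genes, k=2)
```

The returned `error` is the square error that the algorithm minimizes, so it can be used to compare runs started from different random centers.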
Convergence of K-means algorithm
• For each set of starting centers we get a local minimum
• Increase the number of starts!
Example: 111 data points in 9-dimensional space
N = # of starts for achieving the global solution

# of clusters   2      3       4       20      30
N               1000   10000   30000   40000   1000000
Hierarchical clustering (1)
Step 1: Transform the genes * experiments matrix into a genes * genes distance matrix

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

          Gene A  Gene B  Gene C
Gene A    0       ?       ?
Gene B            0       ?
Gene C                    0

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains
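Hierarchical clustering (2)
Initial distance matrix:

        G1    G2    G3    G4    G5
G1      0     2     6     10    9
G2            0     5     9     8
G3                  0     4     5
G4                        0     3
G5                              0

After merging G1 and G2 (distance 2):

        G(12) G3    G4    G5
G(12)   0     6     10    9
G3            0     4     5
G4                  0     3
G5                        0

After merging G4 and G5 (distance 3):

        G(12) G3    G(45)
G(12)   0     6     10
G3            0     5
G(45)               0

In this example the distance to a merged cluster equals the largest pairwise distance to any of its members (complete linkage).

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]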
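The merging stages for the five-gene example can be reproduced with a short agglomerative sketch. It assumes complete linkage (cluster distance = largest pairwise distance), which is the rule consistent with the numbers in the example matrices. The function name `complete_linkage` is hypothetical.

```python
def complete_linkage(dist, names):
    """Agglomerative clustering; returns the grouping after each merge."""
    clusters = [[i] for i in range(len(names))]
    stages = [[[names[i] for i in c] for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest distance between any two members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        clusters.sort()   # keep a stable, readable ordering of groups
        stages.append([[names[i] for i in c] for c in clusters])
    return stages

# Symmetric form of the G1..G5 distance matrix from the example.
dist = [
    [0, 2, 6, 10, 9],
    [2, 0, 5, 9, 8],
    [6, 5, 0, 4, 5],
    [10, 9, 4, 0, 3],
    [9, 8, 5, 3, 0],
]
stages = complete_linkage(dist, ["G1", "G2", "G3", "G4", "G5"])
for groups in stages:
    print(groups)
```

The printed groupings follow stages P5 through P1: G1 and G2 merge first, then G4 and G5, then G3 joins G(45), and finally everything merges into one cluster.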
Hierarchical Clustering Results
K-Means vs Hierarchical Clustering
Graph Representation
Represent a set of n-dimensional points as a graph
o each data point (gene) represented as a node
o each pair of genes represented as an edge with a weight defined by the “dissimilarity” between the two genes
Distance matrix (first rows shown):

0     1     1.5   2     5     6     7     9
1     0     2     1     6.5   6     8     8
1.5   2     0     1     4     4     6     5.5
...

[Figure: n-D data points, distance matrix, graph representation]
Minimum Spanning Tree
• Spanning tree: a sub-graph that has all nodes connected and has no cycles
[Figure: panels (a), (b), (c)]
• Minimum spanning tree (MST): a spanning tree with the minimum total distance
How to Construct Minimum Spanning Tree
Prim’s algorithm and Kruskal’s algorithm
Kruskal’s algorithm
• step 1: select the edge with the smallest distance from the graph
• step 2: add it to the tree as long as no cycle is formed
• step 3: remove the edge from the graph
• step 4: repeat steps 1-3 until all nodes are connected in the tree
[Figure: panels (a)-(e), step-by-step construction of the tree on a weighted graph]
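Kruskal's steps can be sketched with a sorted edge list and a union-find structure for the cycle check. The union-find is an implementation choice made for this example, since the slides do not say how cycles are detected, and the 5-node graph is hypothetical.

```python
def kruskal(n_nodes, edges):
    """edges: list of (weight, u, v). Returns the edges of a minimum spanning tree."""
    parent = list(range(n_nodes))

    def find(u):
        # Find the representative of u's component (with path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges):        # smallest remaining edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:             # different components: no cycle is formed
            parent[root_u] = root_v
            tree.append((w, u, v))
            if len(tree) == n_nodes - 1: # all nodes are connected
                break
    return tree

# Hypothetical weighted graph on nodes 0..4.
edges = [(4, 0, 1), (8, 1, 2), (7, 2, 3), (3, 3, 4),
         (5, 0, 4), (6, 1, 4), (10, 1, 3)]
mst = kruskal(5, edges)
total = sum(w for w, _, _ in mst)
```

In this example the edge of weight 6 is skipped because its endpoints are already connected through the edges of weight 4, 5 and 3, so adding it would close a cycle.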
Foundation of MST Approach
• Significantly simplifies the data clustering problem, while losing very little essential information for clustering.
• We have mathematically proved: a multi-dimensional clustering problem is equivalent to a tree-partitioning problem!
Clustering by Cutting Long Edge
Hierarchical cutting
• 1st cut: longest edge
• 2nd cut: second longest edge
• …
Works well for “easy” cases.
Produces many single-element clusters for some “difficult” cases.
[Figure: tree with cut edges labeled 1 and 2]
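Cutting the longest edges of a spanning tree can be sketched as below. The input tree is a made-up MST given directly as an edge list, and `cut_longest_edges` is a hypothetical helper that removes the k - 1 longest edges and collects the connected components that remain.

```python
def cut_longest_edges(n_nodes, tree_edges, k):
    """Remove the k - 1 longest tree edges and return the resulting clusters."""
    kept = sorted(tree_edges)[: len(tree_edges) - (k - 1)]
    adjacency = {u: [] for u in range(n_nodes)}
    for _, u, v in kept:
        adjacency[u].append(v)
        adjacency[v].append(u)
    clusters, seen = [], set()
    for start in range(n_nodes):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters

# Hypothetical MST on nodes 0..5, as (weight, u, v).
mst = [(1, 0, 1), (2, 1, 2), (9, 2, 3), (1, 3, 4), (7, 4, 5)]
two = cut_longest_edges(6, mst, 2)     # cuts the edge of weight 9
three = cut_longest_edges(6, mst, 3)   # also cuts the edge of weight 7
```

With k = 3 the second cut isolates node 5 on its own, a small instance of the single-element clusters that this simple strategy can produce.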
Tree-Based Clustering
• For each edge, calculate the assessment value
• Find the edge that gives the minimum assessment value as the place to cut
[Figure label: g*]
• Clustering using an iterative method
• Guaranteed to find the global optimum using tree-based dynamic programming
Automated Selection of Number of Clusters
Select the “transition point” in the assessment value as the “correct” number of clusters.
Transition Profiles
indicator[n] = (A[n-1] – A[n]) / (A[n] – A[n+1])
A[k] is the assessment value for partition with k clusters
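The indicator formula can be evaluated directly once assessment values are available. In the sketch below the values in `A` are made up, and choosing the number of clusters with the largest indicator is an assumption about how the "transition point" is identified. The slides do not spell out that rule.

```python
def transition_indicators(A):
    """A maps k (number of clusters) to the assessment value A[k]."""
    indicators = {}
    for n in sorted(A):
        if n - 1 in A and n + 1 in A:
            denominator = A[n] - A[n + 1]
            if denominator != 0:   # skip points where the ratio is undefined
                indicators[n] = (A[n - 1] - A[n]) / denominator
    return indicators

# Hypothetical assessment values for partitions with 1..6 clusters.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 17.0, 6: 16.5}
indicators = transition_indicators(A)
# Assumption: treat the largest indicator as the transition point.
best_k = max(indicators, key=indicators.get)
```

In this toy profile the assessment value drops steeply up to 3 clusters and flattens afterwards, so the indicator peaks at n = 3.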
Our clustering of yeast data
Reading Assignments (1)
Suggested reading:
• Chapter 10 in Neil C. Jones and Pavel A. Pevzner: An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). MIT Press, 2004.
• Chapter 11 in Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press, 2002.
Reading Assignments (2)
Optional reading:
1. Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. 18:526-535, 2002.
2. Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research. 31:5582-5589, 2003.
Project Assignment
Develop a program that implements the K-means clustering algorithm
1. Allow several random initializations, and compare their clustering results. Choose the one that has the best value for the objective function $\sum_i d^2(x_i, C(x_i))$.
2. Test the program using the gene expression data sent to the mailing list.
3. Output gene IDs for each cluster.