Transcript Document

Pictorial Demonstration
Rescale features to minimize the LOO bound R^2/M^2.
[Figure: points in the (x1, x2) feature space before rescaling (R^2/M^2 > 1) and after rescaling (R^2/M^2 = 1, M = R).]
SVM Functional
To the SVM classifier we add extra scaling parameters for feature selection. The SVM parameters (including the bias b) are computed by maximizing a functional that is equivalent to maximizing the margin.
Radius Margin Bound
Jaakkola-Haussler Bound
Span Bound
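As an illustration of the radius-margin idea above, here is a minimal Python sketch (not the authors' exact algorithm; the function name and the centroid-based radius proxy are assumptions) that scores a per-feature scaling vector sigma by R^2/M^2 for a linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

def radius_margin_score(X, y, sigma, C=1e3):
    """Rough R^2/M^2 proxy for a given per-feature scaling vector sigma."""
    Xs = X * sigma                                  # element-wise feature rescaling
    clf = SVC(kernel="linear", C=C).fit(Xs, y)      # large C approximates a hard margin
    w = clf.coef_.ravel()
    margin_sq = 1.0 / np.dot(w, w)                  # M^2 = 1 / ||w||^2
    center = Xs.mean(axis=0)
    radius_sq = np.max(np.sum((Xs - center) ** 2, axis=1))  # crude stand-in for R^2
    return radius_sq / margin_sq                    # smaller is better
```

Minimizing this score over sigma (by gradient steps or a simple search) and discarding features whose scale is driven toward zero mirrors the feature-selection scheme described above.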
The Algorithm
Computing Gradients
Toy Data
Linear problem with 6 relevant dimensions out of 202.
Nonlinear problem with 2 relevant dimensions out of 52.
Face Detection
On the CMU test set consisting of 479 faces and 57,000,000 non-faces we compare ROC curves obtained for different numbers of selected features. We see that using more than 60 features does not help.
Molecular Classification of Cancer
Dataset                        Total Samples   Class 0        Class 1
Leukemia Morphology (train)    38              27 ALL         11 AML
Lymphoma Morphology            77              19 FSC         58 DLCL
Leukemia Morphology (test)     34              20 ALL         14 AML
Lymphoma Outcome               58              20 Low risk    14 High risk
Leukemia Lineage (ALL)         23              15 B-Cell      8 T-Cell
Brain Morphology               41              14 Glioma      27 MD
Leukemia Outcome (AML)         15              8 Low risk     7 High risk
Brain Outcome                  50              38 Low risk    12 High risk
Morphology Classification
Dataset                                  Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number of Genes
Leukemia Morphology (test), AML vs ALL   SVM         35              0/35           0/21             0/14             40
                                         WV          35              2/35           1/21             1/14             50
                                         k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL), B vs T           SVM         23              0/23           0/15             0/8              10
                                         WV          23              0/23           0/15             0/8              9
                                         k-NN        23              0/23           0/15             0/8              10
Lymphoma, FS vs DLCL                     SVM         77              4/77           2/32             2/35             200
                                         WV          77              6/77           1/32             5/35             30
                                         k-NN        77              3/77           1/32             2/35             250
Brain, MD vs Glioma                      SVM         41              1/41           1/27             0/14             100
                                         WV          41              1/41           1/27             0/14             3
                                         k-NN        41              0/41           0/27             0/14             5
Outcome Classification
Dataset                           Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number of Genes
Lymphoma, LBC treatment outcome   SVM         58              13/58          3/32             10/26            100
                                  WV          58              15/58          5/32             10/26            12
                                  k-NN        58              15/58          8/32             7/26             15
Brain, MD treatment outcome       SVM         50              7/50           6/12             1/38             50
                                  WV          50              13/50          6/12             7/38             6
                                  k-NN        50              10/50          6/12             4/38             5
Outcome Classification
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.
[Figure: Kaplan-Meier survival curves for the Lymphoma and Medulloblastoma outcome predictions; p-val = 0.0015 and p-val = 0.00039.]
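To reproduce this kind of analysis, here is a minimal sketch assuming the lifelines package, with hypothetical survival times and event indicators for the predicted low-risk and high-risk groups:

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def compare_risk_groups(time_low, event_low, time_high, event_high):
    """Plot Kaplan-Meier curves for two predicted groups and return a log-rank p-value."""
    kmf = KaplanMeierFitter()
    kmf.fit(time_low, event_observed=event_low, label="predicted low risk")
    ax = kmf.plot_survival_function()
    kmf.fit(time_high, event_observed=event_high, label="predicted high risk")
    kmf.plot_survival_function(ax=ax)
    result = logrank_test(time_low, time_high,
                          event_observed_A=event_low, event_observed_B=event_high)
    return result.p_value
```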
Part 4
Clustering Algorithms
Hierarchical Clustering
Hierarchical clustering
Step 1: Transform the genes x experiments matrix into a genes x genes distance matrix.
[Figure: an expression matrix (Genes A-C across Exps 1-4) is converted into a symmetric gene-gene distance matrix with zeros on the diagonal.]
Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains.
Hierarchical clustering (continued)
To transform the genes x experiments matrix into a genes x genes matrix, use a gene similarity metric (Eisen et al. 1998 PNAS 95:14863-14868). The metric is exactly the same as Pearson's correlation except that an offset term replaces the gene's mean: Gi is the (log-transformed) primary data for gene G in condition i, for any two genes X and Y observed over a series of N conditions, and Goffset is set to 0, corresponding to a fluorescence ratio of 1.0.
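A small Python sketch of this similarity score as defined in the cited Eisen et al. paper (the function name is ours):

```python
import numpy as np

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Gene similarity of Eisen et al. (1998).

    With the offsets set to the means of x and y this is exactly Pearson's
    correlation; with offsets of 0 (as above, a fluorescence ratio of 1.0)
    it is an uncentered correlation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    phi_x = np.sqrt(np.sum((x - x_offset) ** 2) / n)
    phi_y = np.sqrt(np.sum((y - y_offset) ** 2) / n)
    return np.sum((x - x_offset) * (y - y_offset)) / (n * phi_x * phi_y)
```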
Hierarchical clustering (continued)
Pearson's correlation example
What if genome expression is clustered
based on negative correlation?
Hierarchical clustering (continued)
Worked example (complete linkage / furthest neighbor). Initial distance matrix:

        G1   G2   G3   G4   G5
  G1     0    2    6   10    9
  G2          0    5    9    8
  G3               0    4    5
  G4                    0    3
  G5                         0

The closest pair, G1 and G2 (distance 2), is merged into G(12):

        G(12)   G3   G4   G5
  G(12)     0    6   10    9
  G3             0    4    5
  G4                  0    3
  G5                       0

The closest remaining pair, G4 and G5 (distance 3), is merged into G(45):

        G(12)   G3   G(45)
  G(12)     0    6     10
  G3             0      5
  G(45)                 0

Merging continues until a single node remains, giving the dendrogram stages P1-P5 with groups:

  [1], [2], [3], [4], [5]
  [1 2], [3], [4], [5]
  [1 2], [3], [4 5]
  [1 2], [3 4 5]
  [1 2 3 4 5]
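A minimal SciPy sketch that reproduces the merge order of this worked example with complete (furthest-neighbor) linkage; the condensed vector is simply the upper triangle of the initial distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Upper triangle of the G1..G5 distance matrix, row by row:
# d(1,2)=2, d(1,3)=6, d(1,4)=10, d(1,5)=9, d(2,3)=5, d(2,4)=9, d(2,5)=8,
# d(3,4)=4, d(3,5)=5, d(4,5)=3
condensed = np.array([2, 6, 10, 9, 5, 9, 8, 4, 5, 3], dtype=float)

Z = linkage(condensed, method="complete")  # agglomerate until a single node remains
print(Z)                                   # each row: the two clusters merged and their distance
dendrogram(Z, labels=["G1", "G2", "G3", "G4", "G5"])  # draws the dendrogram (matplotlib)
```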
Part 5
Clustering Algorithms
k-means Clustering
K-means clustering
This method differs from hierarchical clustering in many ways. In particular,
- There is no hierarchy, the data are partitioned. You will be
presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the
data are to be grouped.
K-means clustering (continued)
Step 1: Transform the n (genes) x m (experiments) matrix into an n (genes) x n (genes) distance matrix.
[Figure: an expression matrix (Genes A-C across Exps 1-4) is converted into a symmetric gene-gene distance matrix with zeros on the diagonal.]
Step 2: Cluster genes based on a k-means clustering algorithm.
K-means clustering (continued)
To transform the n x m matrix into an n x n matrix, use a similarity (distance) metric (Tavazoie et al. Nature Genetics. 1999 Jul;22(3):281-5), for example the Euclidean distance

  d(X, Y) = sqrt( sum over i = 1..M of (Xi - Yi)^2 )

for any two genes X and Y observed over a series of M conditions.
K-means clustering (continued)
[Figure: worked example showing four genes (Gene 1-4), their pairwise distances, and their positions in a two-dimensional space.]
K-means clustering algorithm
Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on the distance matrix.
Step 2: The first cluster center (red) is chosen randomly, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k = 3.
K-means clustering algorithm (continued)
Step 3: Each point is assigned to the cluster associated with the closest representative center.
Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, by computing a new cluster representative.
K-means clustering algorithm (continued)
Step 5: Repeat steps 3 and 4 with the new representatives.
Run steps 3, 4, and 5 until no further changes occur (see the sketch below).
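A minimal NumPy sketch of steps 1-5, assuming X holds one row of expression values per gene; distances are computed directly rather than from a precomputed distance matrix:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: first center chosen at random, subsequent centers are the points
    # farthest from the centers already chosen
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Step 3: assign each point to the closest representative center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 5: stop when no further changes occur
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```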
Part 6
Clustering Algorithms
Principal Component Analysis
Principal component analysis (PCA)
PCA is a variable reduction procedure. It is useful when you
have obtained data on a large number of variables, and believe
that there is some redundancy in those variables.
PCA (continued)
PCA (continued)
PCA (continued)
- Items 1-4 are collapsed into a single new variable that reflects
the employees’ satisfaction with supervision, and items 5-7 are
collapsed into a single new variable that reflects satisfaction
with pay.
- General form of the formula to compute scores on the first component:
C1 = b11(X1) + b12(X2) + ... + b1p(Xp)
where
C1 = the subject’s score on principal component 1
b1p = the regression coefficient (or weight) for observed variable p,
as used in creating principal component 1
Xp = the subject’s score on observed variable p.
PCA (continued)
For example, you could determine each subject’s score on
principal component 1 (satisfaction with supervision) and
principal component 2 (satisfaction with pay)
by C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4)
+ .02 (X5) + .01 (X6) + .03(X7)
C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4)
+ .48(X5) + .31 (X6) + .39(X7)
These weights can be calculated using a special type of equation called an eigenequation.
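As a concrete illustration of the eigenequation, here is a minimal NumPy sketch (the function name is ours): the component weights b are eigenvectors of the covariance matrix of the observed variables, and the component scores are the weighted sums shown above.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """X has one row per subject and one column per observed variable."""
    Xc = X - X.mean(axis=0)                       # center each variable
    cov = np.cov(Xc, rowvar=False)                # variables x variables covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # solve the eigenequation
    order = np.argsort(eigvals)[::-1]             # components with largest variance first
    weights = eigvecs[:, order[:n_components]]    # the b coefficients, one column per component
    scores = Xc @ weights                         # C = b1(X1) + b2(X2) + ... for each subject
    return scores, weights
```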
PCA (continued)
(Alter et al., 2000 PNAS 97(18):10101-10106)
PCA (continued)
Part 7
Clustering Algorithms
Self-Organizing Maps
Clustering
Goals
• Find natural classes in the data
• Identify new classes / gene correlations
• Refine existing taxonomies
• Support biological analysis / discovery
• Different Methods
– Hierarchical clustering, SOMs, etc.
Self organizing maps (SOM)
- A data visualization technique invented by Professor Teuvo Kohonen which reduces the dimensions of data through the use of self-organizing neural networks.
- A method for producing ordered low-dimensional
representations of an input data space.
- Typically such input data is complex and high-dimensional
with data elements being related to each other in a nonlinear
fashion.
SOM (continued)
SOM (continued)
- The cerebral cortex of the brain is arranged as a two-dimensional plane of neurons, and spatial mappings are used to model complex data structures.
- Topological relationships in external stimuli are preserved and
complex multi-dimensional data can be represented in a lower
(usually two) dimensional space.
SOM (continued)
(Tamayo et al., 1999 PNAS 96:2907-2912)
- One chooses a geometry of "nodes", for example, a 3 x 2 grid.
- The nodes are mapped into k-dimensional
space, initially at random, and then iteratively
adjusted.
- Each iteration involves randomly selecting a
data point P and moving the nodes in the
direction of P.
SOM (continued)
- The closest node NP is moved the most,
whereas other nodes are moved by
smaller amounts depending on their
distance from NP in the initial geometry.
- In this fashion, neighboring points in
the initial geometry tend to be mapped to
nearby points in k-dimensional space.
The process continues for 20,000-50,000
iterations.
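A minimal NumPy sketch of this update rule (the grid size, learning rate, and neighborhood width are illustrative choices, not those of the cited paper):

```python
import numpy as np

def som(data, grid_shape=(3, 2), n_iter=20000, lr=0.1, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Nodes start at (random) positions in the k-dimensional data space
    nodes = data[rng.choice(len(data), size=rows * cols, replace=False)].astype(float)
    for t in range(n_iter):
        p = data[rng.integers(len(data))]                     # random data point P
        winner = np.argmin(np.sum((nodes - p) ** 2, axis=1))  # closest node N_P
        # Nodes near the winner in the grid geometry move more than distant ones
        grid_dist_sq = np.sum((grid - grid[winner]) ** 2, axis=1)
        h = np.exp(-grid_dist_sq / (2 * sigma ** 2))
        decay = 1.0 - t / n_iter                              # step size shrinks over time
        nodes += (lr * decay * h)[:, None] * (p - nodes)
    return nodes, grid
```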
SOM (continued)
Yeast Cell Cycle SOM
- The 828 genes that
passed the variation
filter were grouped
into 30 clusters.
SOM analysis of yeast gene expression data during the diauxic shift [2]. Data were analyzed by a prototype of the GenePoint software.
•a: Genes with a similar expression profile
are clustered in the same neuron of a 16 x
16 matrix SOM and genes with closely
related profiles are in neighboring
neurons. Neurons contain between 10 and
49 genes
•b: Magnification of four neurons similarly
colored in a. The bar graph in each
neuron displays the average expression of
genes within the neuron at 2-h intervals
during the diauxic shift
•c: SOM modified with Sammon's
mapping algorithm. The distance between
two neurons corresponds to the difference
in gene expression pattern between two
neurons and the circle size to the number
of genes included in the neuron. Neurons
marked in green, yellow (upper left
Result of SOM clustering of
Dictyostelium expression data with
a 6 x 4 structure of centroids. The 6 x 4 = 24 clusters are the minimum number of centroids needed to
resolve the three clusters revealed
by percolation clustering
(encircled, from top to bottom:
down-regulated genes, early
upregulated genes, and late
upregulated genes). The remaining
21 clusters are formed by forceful
partitioning of the remaining noninformative noisy data. Similarity of
expression within these 21 clusters
is random, and is biologically
meaningless.
SOM clustering
• SOM - self organizing maps
• Preprocessing
– filter away genes with insufficient biological
variation
– normalize gene expression (across samples) to
mean 0, st. dev 1, for each gene separately.
• Run SOM for many iterations
• Plot the results
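A minimal sketch of this preprocessing, assuming an expression matrix X with one row per gene and one column per sample; the variation filter shown (a max-minus-min threshold) is one possible choice, not necessarily the one used in the original analyses:

```python
import numpy as np

def preprocess(X, min_range=2.0):
    # Variation filter: keep genes whose expression changes enough across samples
    keep = (X.max(axis=1) - X.min(axis=1)) >= min_range
    Xf = X[keep]
    # Normalize each gene across samples to mean 0 and standard deviation 1
    Xf = (Xf - Xf.mean(axis=1, keepdims=True)) / Xf.std(axis=1, keepdims=True)
    return Xf, keep
```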
SOM results
Large grid 10x10
3 cells
Clustering visualization
2D SOM visualization
SOM output visualization
The Y-Cluster
Part 8
Beyond Clustering
Support vector machines
Used for classification of genes according to function (sketched in code below):
1) Choose positive and negative examples (label +/-)
2) Transform input space to feature space
3) Construct a maximum-margin hyperplane
4) Classify new genes as members / non-members
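A minimal scikit-learn sketch of these four steps; the expression matrix and class labels below are random placeholders and the kernel choice is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# 1) Choose positive and negative examples (label +/-) for one functional class
X_train = rng.random((100, 79))           # placeholder expression profiles
y_train = rng.choice([-1, 1], size=100)   # placeholder membership labels

# 2) + 3) The kernel implicitly maps inputs to feature space, and SVC fits the
# maximum-margin hyperplane in that space
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# 4) Classify new genes as members / non-members of the class
X_new = rng.random((5, 79))
print(clf.predict(X_new))
```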
Support vector machines (continued)
(Brown et al., 2000 PNAS 97(1), 262-267)
- Using the class definitions made by the MIPS yeast genome
database, SVMs were trained to recognize six functional classes:
tricarboxylic acid (TCA) cycle, respiration, cytoplasmic
ribosomes, proteasome, histones, and helix-turn-helix proteins.
Support vector machines (continued)
Examples of predicted functional classifications for previously unannotated genes by the
SVMs
Classes represented: TCA, Resp, Ribo, Prot.

Gene      Locus   Comments
YHR188C           Conserved in worm, Schizosaccharomyces pombe, human
YKL039W   PTM1    Major transport facilitator family; likely integral membrane protein
YKR016W           Not highly conserved, possible homolog in S. pombe
YKR046C           No convincing homologs
YKL056C           Homolog of translationally controlled tumor protein, abundant, fingers
YNL053W   MSG5    Protein-tyrosine phosphatase, bypasses growth arrest by mating factor
YDR330W           Ubiquitin regulatory domain protein, S. pombe homolog
YJL036W           Member of sorting nexin family
YDL053C           No convincing homologs
YLR387C           Three C2H2 zinc fingers, similar YBR267W not coregulated
Automatic discovery of regulatory patterns in
promoter region
(Juhl and Knudsen, 2000 Bioinformatics, 16:326-333)
From SGD:
All 6269 ORFs: up- and downstream 200 bp.
5097 ORFs: upstream 500 bp.
DNA chip: 91 data sets. These data sets consist of the 500 bp upstream regions and the red-green ratios.
Automatic discovery of regulatory patterns in
promoter region (continued)
- Sequence patterns correlated with whole-cell expression data were found by Kolmogorov-Smirnov tests.
- Regulatory elements were identified by systematically calculating the significance of the correlation between words found in the functional annotation of genes and DNA words occurring in their promoter regions.
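A minimal sketch of the Kolmogorov-Smirnov step described above, assuming a dictionary of upstream sequences and one of expression log-ratios (the names and data layout are hypothetical):

```python
from scipy.stats import ks_2samp

def word_significance(word, promoters, log_ratios):
    """promoters: gene -> upstream sequence; log_ratios: gene -> expression log-ratio."""
    with_word = [log_ratios[g] for g, seq in promoters.items() if word in seq]
    without = [log_ratios[g] for g, seq in promoters.items() if word not in seq]
    stat, p_value = ks_2samp(with_word, without)   # do the two distributions differ?
    return p_value
```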
Bayesian networks analysis
(Friedman et al. 2000 J. Comp. Biol., 7:601-620)
- Graph-based model of joint multi-variate probability distributions
- The model can capture properties of conditional independence between variables.
- Can describe complex stochastic processes.
- Provides clear methodologies for learning from (noisy) observations.
Bayesian networks analysis (continued)
Bayesian networks analysis (continued)
- 76 gene expression measurements of 6177 yeast ORFs.
- 800 genes whose expression varied over cell-cycle stages were selected.
- Learned networks whose variables were the expression levels of each of these 800 genes.
Movie
http://www.dkfz-heidelberg.de/abt0840/whuber/mamovie.html
Part 9
Concluding Remarks
Future directions
• Algorithms optimized for small samples (the no.
of samples will remain small for many tasks)
• Integration with other data
– biological networks
– medical text
– protein data
• cost-sensitive classification algorithms
– error cost depends on outcome (don’t want to miss
treatable cancer), treatment side effects, etc.
Summary
• Microarray Data Analysis -- a revolution in
life sciences!
• Beware of false positives
• Principled methodology can produce good
results