Data Mining Techniques 1


Centre for Integrative Bioinformatics VU
Lecture 6
Machine Learning
Bioinformatics Data Analysis and Tools
[email protected]
Supervised Learning
[Diagram: observations of an unknown system with a property of interest form a training dataset; an ML algorithm learns a model from it, and the model predicts (classifies) the property for a new observation.]
2
Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data.
Data mining, clustering
3
What is your question?
• What are the target genes for my knock-out gene?
• Look for genes that have different time profiles between different cell types.
  Gene discovery, differential expression
• Is a specified group of genes all up-regulated under a specified condition?
  Gene set, differential expression
• Can I use the expression profiles of cancer patients to predict survival?
• Identification of groups of genes that are predictive of a particular class of tumors?
  Class prediction, classification
• Are there tumor sub-types not previously identified?
• Are there groups of co-expressed genes?
  Class discovery, clustering
• Detection of gene regulatory mechanisms.
• Do my genes group into previously undiscovered pathways?
  Clustering. Often expression data alone is not enough; sequence and other information needs to be incorporated.
4
Basic principles of discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
• Aim: predict Y from X.
[Diagram: objects are assigned to predefined classes {1, 2, …, K}. An example object has class label Y = 2 and feature vector X = {colour, shape}. Given a new object with X = {red, square}, what is Y? A classification rule is needed.]
5
Discrimination and Prediction
[Diagram: Discrimination: a learning set (data with known classes) is fed to a classification technique, which produces a classification rule. Prediction: data with unknown classes is passed through the classification rule to obtain a class assignment.]
6
Example: A Classification Problem
• Categorize images of fish, say "Atlantic salmon" vs. "Pacific salmon"
• Use features such as length, width, lightness, fin shape & number, mouth position, etc.
• Steps:
  1. Preprocessing (e.g., background subtraction)
  2. Feature extraction
  3. Classification
(example from Duda & Hart)
7
Classification in Bioinformatics
• Computational diagnostics: early cancer detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding site prediction
• Gene function prediction
• …
8
Learning set
[Diagram: objects are arrays; feature vectors are gene expression profiles. Predefined classes from clinical outcome: bad prognosis (recurrence < 5 yrs) vs. good prognosis (recurrence > 5 yrs). A classification rule is learned and then applied to a new array (good prognosis, metastasis > 5 yrs?).]
Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
9
Classification Techniques
• K Nearest Neighbor classifier
• Support Vector Machines
• …
10
Instance Based Learning
Key idea: just store all training examples <xi, f(xi)>
Nearest neighbor:
• Given query instance xq, first locate the nearest training example xn, then estimate f(xq) = f(xn)
K-nearest neighbor:
• Given xq, take a vote among its k nearest neighbors (if discrete-valued target function)
• Take the mean of the f values of the k nearest neighbors (if real-valued): f(xq) = (1/k) Σ_{i=1}^{k} f(xi)
11
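To make this concrete, here is a minimal sketch of both variants (majority vote for a discrete target, mean for a real-valued one); the function name and toy data are invented for illustration and are not part of the lecture.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, discrete=True):
    """Plain k-nearest-neighbor prediction using Euclidean distance."""
    # Distance from the query to every stored training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    if discrete:
        # Majority vote for a discrete-valued target function
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Mean of the neighbors' f values for a real-valued target: f(xq) = (1/k) * sum f(xi)
    return y_train[nearest].mean()

# Toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.05]), k=3))   # -> 1
```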
K-Nearest Neighbor
• A lazy learner …
• Issues:
– How many neighbors?
– What similarity measure?
12
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity between two data objects
• Two main classes of metric:
  – Correlation coefficients (similarity)
    • Compare the shape of expression curves
    • Types of correlation: centered, un-centered, rank correlation
  – Distance metrics (dissimilarity)
    • City Block (Manhattan) distance
    • Euclidean distance
13
Correlation (a measure between -1 and 1)
• Pearson Correlation Coefficient (centered correlation):
  r = 1/(n-1) Σ_{i=1}^{n} [(xi - x̄)/Sx] [(yi - ȳ)/Sy]
  where Sx = standard deviation of x and Sy = standard deviation of y
• You can use absolute correlation to capture both positive and negative correlation.
[Plots: examples of positive correlation and negative correlation.]
14
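A minimal NumPy sketch of this centered correlation (the toy profiles below are invented); it is equivalent to np.corrcoef(x, y)[0, 1]:

```python
import numpy as np

def pearson(x, y):
    """Centered (Pearson) correlation between two expression profiles."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std(ddof=1)   # (xi - x̄) / Sx
    zy = (y - y.mean()) / y.std(ddof=1)   # (yi - ȳ) / Sy
    return np.sum(zx * zy) / (len(x) - 1)

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]          # same shape, different scale
print(pearson(a, b))              # -> 1.0
print(abs(pearson(a, b[::-1])))   # absolute correlation also captures negative correlation (-> 1.0)
```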
Potential pitfalls
[Plot: a potential pitfall case in which the correlation is still 1.]
15
Distance metrics
• City Block (Manhattan) distance:
  – Sum of absolute differences across dimensions
  – Less sensitive to outliers
  – Diamond-shaped clusters
  d(X, Y) = Σ_i |xi - yi|
• Euclidean distance:
  – Most commonly used distance
  – Sphere-shaped clusters
  – Corresponds to the geometric distance in the multidimensional space
  d(X, Y) = sqrt( Σ_i (xi - yi)^2 )
where gene X = (x1, …, xn) and gene Y = (y1, …, yn)
[Plots: the two distances between genes X and Y shown in the Condition 1 vs. Condition 2 plane.]
16
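The two metrics in code, as a small sketch (the toy expression vectors are invented for the example):

```python
import numpy as np

def manhattan(x, y):
    """City Block distance: sum of absolute differences across dimensions."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def euclidean(x, y):
    """Euclidean distance: geometric distance in the multidimensional space."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

gene_x = [1.0, 3.0, 2.0]   # expression of gene X across three conditions
gene_y = [2.0, 1.0, 2.5]
print(manhattan(gene_x, gene_y))   # 1 + 2 + 0.5 = 3.5
print(euclidean(gene_x, gene_y))   # sqrt(1 + 4 + 0.25) ≈ 2.29
```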
Euclidean vs Correlation (I)
• Euclidean distance
• Correlation
17
When to Consider Nearest Neighbors
• Instances map to points in R^N
• Fewer than 20 attributes per instance
• Lots of training data
Advantages:
• Training is very fast
• Can learn complex target functions
• Does not lose information
Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes
18
Voronoi Diagram
[Plot: Voronoi partition of the training points, with query point qf and its nearest neighbor qi.]
19
3-Nearest Neighbors
[Plot: the 3 nearest neighbors of query point qf: 2 x's and 1 o.]
20
7-Nearest Neighbors
[Plot: the 7 nearest neighbors of query point qf: 3 x's and 4 o's.]
21
Nearest Neighbor (continuous)
1-nearest neighbor
22
Nearest Neighbor (continuous)
3-nearest neighbor
23
Nearest Neighbor (continuous)
5-nearest neighbor
24
Nearest Neighbor
• Approximate the target function f(x) at the single query point x = xq
• Locally weighted regression = a generalization of instance-based learning (IBL)
25
Curse of Dimensionality
Imagine instances are described by 20 attributes, but only 10 are relevant to the target function.
Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional.
One approach: weight the features according to their relevance (see the sketch after this slide)!
• Stretch the j-th axis by weight zj, where z1, …, zn are chosen to minimize prediction error
• Use cross-validation to automatically choose the weights z1, …, zn
• Note that setting zj to zero eliminates this dimension altogether (feature subset selection)
26
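A sketch of the axis-stretching idea, with hand-picked weights purely for illustration (in practice z would be chosen by cross-validation, as the slide says); the function and data below are invented:

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, x_query, z, k=3):
    """k-NN in which the j-th axis is stretched by weight z[j] before distances are computed.
    Setting z[j] = 0 removes feature j entirely (feature subset selection)."""
    dists = np.linalg.norm((X_train - x_query) * z, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: the first feature separates the classes, the second is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.r_[np.zeros(10), np.ones(10)], rng.normal(size=20)])
y = np.r_[np.zeros(10, dtype=int), np.ones(10, dtype=int)]
# Down-weighting the noisy axis lets the query land among class-1 neighbors.
print(weighted_knn_predict(X, y, np.array([0.9, 3.0]), z=np.array([1.0, 0.0])))  # -> 1
```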
Practical implementations
• Weka – IBk
• Optimized – Timbl
27
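Besides Weka's IBk and TiMBL, scikit-learn (not mentioned on the slide) offers an equivalent classifier; a minimal usage sketch with invented toy data:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]]
y = [0, 0, 1, 1]

# k-NN with Euclidean distance; metric="manhattan" would give City Block distance instead
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X, y)
print(clf.predict([[0.95, 1.05]]))   # -> [1]
```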
Example: Tumor Classification
• Reliable and precise classification is essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; it is likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarrays)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
28
Tumor Classification Using Gene Expression Data
Three main types of ML problems are associated with tumor classification:
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
• Classification of malignancies into known classes (supervised learning – discrimination)
• Identification of "marker" genes that characterize the different tumor classes (feature or variable selection)
29
Learning set
[Diagram: objects are arrays; feature vectors are gene expression profiles. Predefined classes by tumor type: B-ALL, T-ALL, AML. A classification rule is learned and then applied to a new array (T-ALL?).]
Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
30
Nearest neighbor rule
31
SVM
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
• The most popular optimization algorithms for SVMs are SMO [Platt '99] and SVMlight [Joachims '99]; both use decomposition to hill-climb over a subset of the αi's at a time.
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.
32
SVM
• In order to discriminate between two classes, given a training dataset:
  – Map the data to a higher-dimensional space (feature space)
  – Separate the two classes using an optimal linear separator
33
Feature Space Mapping
• Map the original data to some higher-dimensional feature space where the training set is linearly separable:
  Φ: x → φ(x)
34
The “Kernel Trick”
• The linear classifier relies on the inner product between vectors: K(xi, xj) = xi^T xj
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)^T φ(xj)
• A kernel function is a function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xi^T xj)^2.
  We need to show that K(xi, xj) = φ(xi)^T φ(xj):
  K(xi, xj) = (1 + xi^T xj)^2 = 1 + xi1^2 xj1^2 + 2 xi1 xj1 xi2 xj2 + xi2^2 xj2^2 + 2 xi1 xj1 + 2 xi2 xj2
            = [1, xi1^2, √2 xi1 xi2, xi2^2, √2 xi1, √2 xi2]^T [1, xj1^2, √2 xj1 xj2, xj2^2, √2 xj1, √2 xj2]
            = φ(xi)^T φ(xj), where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]
35
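A quick numerical check of this identity (the sample vectors are arbitrary, chosen only to illustrate it):

```python
import numpy as np

def poly_kernel(xi, xj):
    """Quadratic kernel K(xi, xj) = (1 + xi·xj)^2 on 2-D inputs."""
    return (1.0 + np.dot(xi, xj)) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces the kernel."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi, xj = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(poly_kernel(xi, xj))          # kernel value computed directly
print(np.dot(phi(xi), phi(xj)))     # same value via the explicit feature map
```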
Linear Separators
36
Optimal hyperplane
Support vectors uniquely characterize the optimal hyperplane.
[Plot: the optimal hyperplane, its margin ρ, and the support vectors.]
37
Optimal hyperplane: geometric view
w · xi + b ≥ +1 for yi = +1
w · xi + b ≤ -1 for yi = -1
38
Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
[Plot: two examples violating the margin, with slack variables ξj and ξk.]
39
Weakening the constraints
Allow that the objects do not strictly obey the constraints
Introduce ‘slack’-variables
40
Influence of C
Erroneous objects can still have a (large) influence on the solution.
41
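A hedged sketch of how the regularization parameter C controls this, using scikit-learn's SVC (not part of the slides; the toy data and the deliberately mislabeled point are invented):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated clusters plus one "erroneous" point placed inside the other class
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2)), [[2.0, 2.0]]])
y = np.r_[np.zeros(20, dtype=int), np.ones(20, dtype=int), [0]]   # last point mislabeled on purpose

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C keeps the margin soft, so the erroneous point is largely ignored;
    # a large C penalizes its slack heavily and lets it pull the boundary.
    print(C, clf.n_support_, clf.decision_function([[0.0, 0.0]]))
```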
SVM
• Advantages:
  – maximizes the margin between the two classes in the feature space characterized by a kernel function
  – robust with respect to high input dimension
• Disadvantages:
  – difficult to incorporate background knowledge
  – sensitive to outliers
42
SVM and outliers
[Plot: effect of a single outlier on the SVM solution.]
43
Classifying new examples
• Given a new point x, its class membership is sign[f(x, α*, b*)], where
  f(x, α*, b*) = w*·x + b* = Σ_{i=1}^{N} αi* yi (xi·x) + b* = Σ_{i∈SV} αi* yi (xi·x) + b*
Data enters only in the form of dot products!
In general:
  f(x, α*, b*) = Σ_{i∈SV} αi* yi K(xi, x) + b*
where K is the kernel function.
44
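As a sanity check (not from the lecture), this formula can be reproduced from a fitted scikit-learn SVC, whose dual_coef_ already stores αi* yi for the support vectors; the toy data are invented:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

def decision(x):
    """f(x) = sum over support vectors of (alpha_i* y_i) K(xi, x) + b*."""
    # RBF kernel K(xi, x) = exp(-gamma * ||xi - x||^2) with gamma = 1.0, as set above
    k = np.exp(-1.0 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return float(np.dot(clf.dual_coef_[0], k) + clf.intercept_[0])

x_new = np.array([0.9, 1.1])
print(np.sign(decision(x_new)))                               # manual sign[f(x)]
print(clf.predict([x_new]), clf.decision_function([x_new]))   # library's own answer
```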
Classification: CV error
• Training error
  – Empirical error
• Error on an independent test set
  – Test error
• Cross-validation (CV) error
  – Leave-one-out (LOO)
  – n-fold CV
[Diagram: the N samples are repeatedly split so that 1/n of them are used for testing and (n-1)/n for training; the errors are counted over the folds and summarized as the CV error rate.]
45
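A short sketch of estimating the CV error with scikit-learn (synthetic data; the choice of a 3-NN classifier is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

# n-fold CV: each fold holds out 1/n of the samples for testing
acc_5fold = cross_val_score(clf, X, y, cv=5)
# Leave-one-out: N folds, each with a single test sample
acc_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("5-fold CV error rate:", 1 - acc_5fold.mean())
print("LOO CV error rate:  ", 1 - acc_loo.mean())
```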