Data Mining Techniques 1


Centre for Integrative Bioinformatics VU

Lecture 3: Machine Learning
(Elena Marchiori's slides adapted)

Bioinformatics Data Analysis and Tools
[email protected]
Supervised Learning

[Diagram: observations of an unknown system, labelled with the property of interest, form a training dataset; an ML algorithm learns a model from it, and the model predicts the class of a new observation. Keywords: classification, prediction.]
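To make the diagram concrete, here is a minimal sketch of the supervised-learning loop: labelled observations form a training set, an ML algorithm fits a model, and the model predicts the class of a new observation. The toy data and the choice of a k-nearest-neighbor classifier (introduced later in this lecture) are assumptions for illustration, not part of the slides.

```python
# Minimal sketch of the supervised-learning workflow in the diagram above.
# The data and the choice of k-NN are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training dataset: observations (feature vectors) with the property of interest (class labels).
X_train = np.array([[1.0, 0.2], [0.9, 0.1], [0.2, 1.1], [0.1, 0.9]])
y_train = np.array([0, 0, 1, 1])

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # ML algorithm -> model

x_new = np.array([[0.15, 1.0]])   # new observation from the (unknown) system
print(model.predict(x_new))        # classification / prediction -> [1]
```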
Unsupervised Learning

ML for unsupervised learning attempts to discover interesting structure in the available data.
Keywords: data mining, clustering.
What is your question?

• What are the target genes for my knock-out gene?
• Look for genes that have different time profiles between different cell types.
  – Gene discovery, differential expression
• Is a specified group of genes all up-regulated in a specified condition?
  – Gene set, differential expression
• Can I use the expression profile of cancer patients to predict survival?
• Identification of groups of genes that are predictive of a particular class of tumors?
  – Class prediction, classification
• Are there tumor sub-types not previously identified?
• Are there groups of co-expressed genes?
  – Class discovery, clustering
• Detection of gene regulatory mechanisms.
• Do my genes group into previously undiscovered pathways?
  – Clustering. Often expression data alone is not enough; functional and other information needs to be incorporated.
Basic principles of discrimination
•Each object associated with a class label (or response) Y  {1, 2, …,
K} and a feature vector (vector of predictor variables) of G
measurements: X = (X1, …, XG)
Aim: predict Y from X.
1
2
K
Predefined
Class
{1,2,…K}
Objects
Y = Class Label = 2
X = Feature vector
{colour, shape}
Classification rule ?
X = {red, square}
Y=?
5
Discrimination and Prediction

[Diagram: a learning set (data with known classes) is fed to a classification technique, which produces a classification rule (discrimination); applying the rule to data with unknown classes yields a class assignment (prediction).]
Example: A Classification Problem

• Categorize images of fish, say "Atlantic salmon" vs. "Pacific salmon"
• Use features such as length, width, lightness, fin shape & number, mouth position, etc.
• Steps:
  1. Preprocessing (e.g., background subtraction)
  2. Feature extraction/feature weighting
  3. Classification

(Example from Duda & Hart)
Classification in Bioinformatics

• Computational diagnostics: early cancer detection
• Tumor biomarker discovery
• Protein structure prediction (threading)
• Protein-protein binding site prediction
• Gene function prediction
• …
Learning set

[Diagram: arrays (objects) with gene-expression feature vectors are labelled with predefined classes derived from clinical outcome: bad prognosis (recurrence < 5 yrs) vs. good prognosis (recurrence > 5 yrs). A classification rule learned from this set assigns a class (good prognosis, recurrence > 5 yrs?) to a new array.]

Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Classification Techniques
• K Nearest Neighbor classifier
• Support Vector Machines
• …
Instance Based Learning (IBL)

Key idea: just store all training examples <xi, f(xi)>.

Nearest neighbor:
• Given query instance xq, first locate the nearest training example xn, then estimate f(xq) = f(xn).

k-nearest neighbor:
• Given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued).
• Take the mean of the values of the k nearest neighbors (if real-valued): f(xq) = Σ_{i=1..k} f(xi) / k
K-Nearest Neighbor

• The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms.
• An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors.
• k is a positive integer, typically small. If k = 1, then the object is simply assigned to the class of its nearest neighbor.
• k-NN can do multi-class prediction (more than two cancer subtypes, etc.).
• In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes.

(Adapted from Wikipedia)
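To make the majority vote concrete, here is a short from-scratch sketch; the coordinates and class labels are invented for illustration.

```python
# From-scratch sketch of k-NN classification by majority vote (illustrative only):
# find the k closest training points and return the most common class among them.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])                    # count class labels among the neighbors
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array(["blue", "blue", "red", "red", "red"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.8]), k=3))  # -> "red" (an odd k avoids ties)
```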
K-Nearest Neighbor

• A lazy learner …
• Issues:
  – How many neighbors?
  – What similarity measure?

[Figure (from Wikipedia): example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 it is classified to the second class, because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is classified to the first class (3 squares vs. 2 triangles inside the outer circle).]
Which similarity or dissimilarity measure?

• A metric is a measure of the similarity or dissimilarity between two data objects
• Two main classes of metric:
  – Correlation coefficients (similarity)
    • Compare the shape of expression curves
    • Types of correlation: centered, un-centered, rank correlation
  – Distance metrics (dissimilarity)
    • City-block (Manhattan) distance
    • Euclidean distance
Correlation (a measure between -1 and 1)

• Pearson correlation coefficient (centered correlation):

  r(x, y) = (1 / (n − 1)) Σ_{i=1..n} [(xi − x̄) / Sx] · [(yi − ȳ) / Sy]

  where Sx = standard deviation of x and Sy = standard deviation of y.

• You can use the absolute correlation to capture both positive and negative correlation.

[Figure: examples of positive and negative correlation.]
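A small sketch of how the centered correlation above could be computed; the two expression profiles are invented for illustration.

```python
# Centered (Pearson) correlation, written to follow the formula on the slide.
import numpy as np

def pearson(x, y):
    """Sum of the products of the z-scores, divided by n - 1 (sample standard deviations)."""
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)   # (x_i - x̄) / S_x
    zy = (y - y.mean()) / y.std(ddof=1)   # (y_i - ȳ) / S_y
    return np.sum(zx * zy) / (n - 1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly 2*x: strong positive correlation
print(pearson(x, y))                       # close to +1
print(pearson(x, -y))                      # close to -1
print(abs(pearson(x, -y)))                 # absolute correlation treats both alike
```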
Potential pitfalls

[Figure: two rather different expression profiles that nevertheless have correlation = 1; correlation reflects the shape of the relationship, not the absolute expression levels.]
Distance metrics

• City-block (Manhattan) distance:
  – Sum of the (absolute) differences across dimensions
  – Less sensitive to outliers
  – Diamond-shaped clusters

  d(X, Y) = Σ_i |xi − yi|

• Euclidean distance:
  – Most commonly used distance
  – Sphere-shaped clusters
  – Corresponds to the geometric distance in the multidimensional space

  d(X, Y) = √( Σ_i (xi − yi)² )

where gene X = (x1, …, xn) and gene Y = (y1, …, yn).

[Figure: genes X and Y plotted against Condition 1 and Condition 2, with the two distances indicated.]
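The two distance metrics translate directly into code; the sketch below uses made-up expression vectors for genes X and Y.

```python
# The two distance metrics from the slide, on two hypothetical expression vectors.
import numpy as np

def manhattan(x, y):
    """City-block distance: sum of absolute differences across dimensions."""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return np.sqrt(np.sum((x - y) ** 2))

X = np.array([2.0, 4.0, 1.0, 3.0])   # gene X = (x1, ..., xn)
Y = np.array([1.0, 5.0, 0.5, 2.0])   # gene Y = (y1, ..., yn)
print(manhattan(X, Y))               # 3.5
print(euclidean(X, Y))               # ≈ 1.80
```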
Euclidean vs Correlation (I)

[Figure: comparison of Euclidean distance and correlation as (dis)similarity measures.]
When to Consider Nearest Neighbors

• Instances map to points in R^N
• Fewer than 20 attributes per instance
• Lots of training data

Advantages:
• Training is very fast
• Can learn complex target functions
• Does not lose information

Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes
Voronoi Diagrams

• Voronoi diagrams partition a space with objects in the same way as happens when you throw a number of pebbles in water: you get concentric circles that start touching and, in doing so, delineate the area for each pebble (object).
• The area assigned to each object can then be used for weighting purposes.
• A nice example from sequence analysis is by Sibbald, Vingron and Argos (1990).

Sibbald, P. and Argos, P. (1990) Weighting aligned protein or nucleic acid sequences to correct for unequal representation. JMB 216: 813-818.

[Figure: Voronoi diagram with query point qf and its nearest neighbor qi.]
3-Nearest Neighbors

[Figure: query point qf and its 3 nearest neighbors (2 x, 1 o). The Voronoi areas can be used for weighting.]
7-Nearest Neighbors

[Figure: query point qf and its 7 nearest neighbors (3 x, 4 o).]
k-Nearest Neighbors

• The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct.
• A good k can be selected by various heuristic techniques, for example cross-validation (see the sketch below). If k = 1, the algorithm is called the nearest neighbor algorithm.
• The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance.
• Much research effort has been put into selecting or scaling features to improve classification, e.g. using evolutionary algorithms to optimize feature scaling.
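The cross-validation bullet can be made concrete with scikit-learn, which is an assumed tool choice here (the lecture itself mentions Weka and Timbl on a later slide); the data are synthetic placeholders.

```python
# Selecting k by cross-validation: try several candidate values and keep the one
# with the best cross-validated accuracy. Synthetic data, assumed setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},   # candidate values of k
    cv=5,                                              # 5-fold cross-validation
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("CV accuracy:", round(search.best_score_, 3))
```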
Nearest Neighbor

• Approximate the target function f(x) at the single query point x = xq
• Locally weighted regression = generalization of IBL
Curse of Dimensionality

Imagine instances are described by 20 attributes (features), but only 10 are relevant to the target function.

Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional.

One approach: weight the features according to their relevance (a sketch follows this list).
• Stretch the j-th axis by weight zj, where z1, …, zn are chosen to minimize prediction error
• Use cross-validation to automatically choose the weights z1, …, zn
• Note that setting zj to zero eliminates this dimension altogether (feature subset selection)
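A rough sketch of the feature-weighting idea, under assumed synthetic data and scikit-learn: each axis j is stretched by zj before running nearest neighbor, weight settings are compared by cross-validation, and zj = 0 drops feature j entirely.

```python
# Feature weighting for k-NN: scale axis j by z_j and compare settings by CV error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 20 features, only 10 of which carry information about the target
# (with shuffle=False the informative columns come first).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=0)

def cv_accuracy(weights):
    """Cross-validated accuracy of 3-NN after stretching axis j by z_j."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=3), X * weights, y, cv=5).mean()

z_uniform = np.ones(20)                          # all features weighted equally
z_informed = np.r_[np.ones(10), np.zeros(10)]    # zero weight = feature subset selection
print("uniform weights        :", round(cv_accuracy(z_uniform), 3))
print("irrelevant axes dropped:", round(cv_accuracy(z_informed), 3))
```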
Practical implementations
• Weka – IBk
• Optimized – Timbl
Example: Tumor Classification

• Reliable and precise classification is essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; it is likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarrays)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
Tumor Classification Using Gene Expression Data

Three main types of ML problems associated with tumor classification:
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
• Classification of malignancies into known classes (supervised learning – discrimination)
• Identification of "marker" genes that characterize the different tumor classes (feature or variable selection)
Example

Leukemia experiments (Golub et al. 1999)

Goal: to identify genes which are differentially expressed in acute lymphoblastic leukemia (ALL) tumours in comparison with acute myeloid leukemia (AML) tumours.
• 38 tumour samples: 27 ALL, 11 AML.
• Data from Affymetrix chips, some pre-processing.
• Originally 6,817 genes; 3,051 after reduction.
The data are therefore a 3,051 × 38 array of expression values.

Acute lymphoblastic leukemia (ALL) is the most common malignancy in children 2-5 years of age, representing nearly one third of all pediatric cancers.
Acute myeloid leukemia (AML) is the most common form of myeloid leukemia in adults (chronic lymphocytic leukemia is the most common form of leukemia in adults overall). In contrast, acute myeloid leukemia is an uncommon variant of leukemia in children. The median age at diagnosis of acute myeloid leukemia is 65 years.
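Purely to illustrate how such data are typically arranged for classification (genes as features, the 38 samples as objects), here is a sketch with random stand-in numbers of the same shape; it does not use the real Golub data.

```python
# Illustrative layout of a genes-by-samples expression matrix for classification.
# The numbers are random placeholders, NOT the Golub et al. (1999) data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
expr = rng.normal(size=(3051, 38))                  # genes x samples, as on the slide
labels = np.array(["ALL"] * 27 + ["AML"] * 11)      # 27 ALL, 11 AML

X = expr.T                                          # classifiers expect samples x features
clf = KNeighborsClassifier(n_neighbors=3).fit(X[:-1], labels[:-1])   # train on 37 samples
print(clf.predict(X[-1:]), "true:", labels[-1])     # predict the held-out sample
```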
Learning set

[Diagram: arrays (objects) with gene-expression feature vectors are labelled with predefined classes from tumor type: B-ALL, T-ALL, AML. A classification rule learned from this set assigns a class (T-ALL?) to a new array.]

Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Nearest neighbor rule

[Figure: illustration of the nearest neighbor classification rule.]
SVM

• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
• The most popular optimization algorithms for SVMs are SMO [Platt '99] and SVMlight [Joachims '99]; both use decomposition to hill-climb over a subset of the αi's at a time.
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.
SVM

• In order to discriminate between two classes, given a training dataset:
  – Map the data to a higher-dimensional space (feature space)
  – Separate the two classes using an optimal linear separator
Feature Space Mapping

• Map the original data to some higher-dimensional feature space where the training set is linearly separable:
  Φ: x → φ(x)
The "Kernel Trick"

• The linear classifier relies on the inner product between vectors: K(xi, xj) = xiᵀxj
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)ᵀφ(xj)
• A kernel function is a function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
  We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):

  K(xi, xj) = (1 + xiᵀxj)²
            = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
            = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
            = φ(xi)ᵀφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
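The identity in the example can be checked numerically: the sketch below compares the kernel computed in the 2-dimensional input space with the inner product of the explicit 6-dimensional feature maps (the test vectors are arbitrary).

```python
# Numerical check of the polynomial-kernel identity on the slide:
# (1 + x.y)^2 equals the dot product of the explicit feature maps phi(x), phi(y).
import numpy as np

def poly_kernel(x, y):
    """Kernel computed directly in the original 2-D input space."""
    return (1.0 + x @ y) ** 2

def phi(x):
    """Explicit feature map from the slide: [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi = np.array([0.5, -1.2])   # arbitrary illustrative vectors
xj = np.array([2.0, 0.3])

print(poly_kernel(xi, xj))   # kernel value in input space
print(phi(xi) @ phi(xj))     # same value via the explicit 6-D mapping
```

Both print the same value, which is the point of the kernel trick: the expanded feature space never has to be constructed explicitly.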
Linear Separators

[Figure: several candidate linear separators for the same two classes. Which one is the best?]
Optimal hyperplane

Support vectors uniquely characterize the optimal hyperplane.

[Figure: the optimal hyperplane with margin ρ; the support vectors lie on the margin.]
Optimal hyperplane: geometric view

w · xi + b ≥ +1  for yi = +1  (the first class)
w · xi + b ≤ −1  for yi = −1  (the second class)
Soft Margin Classification

• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

[Figure: misclassified examples with slack variables ξj and ξk.]
Weakening the constraints

• Allow that the objects do not strictly obey the constraints
• Introduce 'slack' variables
Influence of C

• Erroneous objects can still have a (large) influence on the solution.
• C is the penalty parameter that controls how strongly violations of the margin (the slack variables ξi) are penalized.
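A hedged sketch of the influence of C, using scikit-learn's SVC on synthetic data (both assumed, not from the lecture): a small C tolerates more margin violations, while a large C penalizes the slack heavily and fits the erroneous points more closely.

```python
# Effect of the penalty parameter C in a soft-margin SVM on synthetic 2-D data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # More support vectors typically indicates a wider, more tolerant margin.
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```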
SVM

• Advantages:
  – Maximizes the margin between the two classes in the feature space characterized by a kernel function
  – Robust with respect to high input dimension
• Disadvantages:
  – Difficult to incorporate background knowledge
  – Sensitive to outliers
SVM and outliers

[Figure: an outlier and its effect on the SVM solution.]
Classifying new examples

• Given a new point x, its class membership is sign[f(x, α*, b*)], where

  f(x, α*, b*) = w* · x + b* = Σ_{i=1..N} αi* yi (xi · x) + b* = Σ_{i ∈ SV} αi* yi (xi · x) + b*

• The data enter only in the form of dot products! In general, with a kernel function K,

  f(x, α*, b*) = Σ_{i ∈ SV} αi* yi K(xi, x) + b*
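The support-vector expansion above can be reproduced from a fitted model. The sketch below (scikit-learn and synthetic data are assumptions, not the lecture's setup) rebuilds f(x) from the support vectors, their coefficients αi* yi and b*, and compares it with the library's own decision function.

```python
# The SVM decision function really is a sum over support vectors only:
#   f(x) = sum_{i in SV} alpha_i* y_i K(x_i, x) + b*
# In scikit-learn, dual_coef_ holds alpha_i* y_i and intercept_ holds b*.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=4, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_new = X[:1]  # pretend this is a new observation

# Rebuild f(x) by hand from the support vectors and the RBF kernel.
rbf = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_[0] @ rbf + clf.intercept_[0]

print(f_manual, clf.decision_function(x_new)[0])   # the two values agree
print("predicted class:", int(f_manual > 0))        # the sign gives the class membership
```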
Classification: CV error

• Training error
  – Empirical error on the N training samples
• Error on an independent test set
  – Test error
• Cross-validation (CV) error
  – Leave-one-out (LOO)
  – n-fold CV

[Diagram: the N samples are repeatedly split into N(n−1)/n samples for training and N/n samples for testing; the errors on the test folds are counted and summarized as the CV error rate.]
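A brief sketch of n-fold and leave-one-out CV as in the diagram above; the synthetic data and the 3-NN classifier are assumed placeholders.

```python
# n-fold cross-validation: train on N(n-1)/n samples, test on the remaining N/n,
# count the errors on each fold and summarize them as the CV error rate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

n = 5
errors = 0
for train_idx, test_idx in KFold(n_splits=n, shuffle=True, random_state=0).split(X):
    clf.fit(X[train_idx], y[train_idx])
    errors += np.sum(clf.predict(X[test_idx]) != y[test_idx])   # count errors on this fold
print("5-fold CV error rate:", errors / len(y))

# Leave-one-out is the special case n = N (one sample per test fold).
loo_errors = sum(
    clf.fit(X[tr], y[tr]).predict(X[te])[0] != y[te][0]
    for tr, te in LeaveOneOut().split(X)
)
print("LOO CV error rate:", loo_errors / len(y))
```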