Introduction to Bioinformatics 1. Course Overview


Functional Genomics and Microarray Analysis (2)
Version 1.0 – 19 Jan 2009
Data Clustering
Lecture Overview

• Introduction: What is Data Clustering
• Key Terms & Concepts
  – Dimensionality
  – Centroids & Distance
  – Distance & Similarity measures
  – Data Structures Used
  – Hierarchical & non-hierarchical
• Hierarchical Clustering
  – Algorithm
  – Single/complete/average linkage
  – Dendrograms
• K-means Clustering
  – Algorithm
• Other Related Concepts
  – Self Organising Maps (SOM)
  – Dimensionality Reduction: PCA & MDS
Introduction
Analysis of Gene Expression Matrices

• In a gene expression matrix, rows represent genes and columns represent measurements from different experimental conditions, measured on individual arrays. The values at each position in the matrix characterise the expression level (absolute or relative) of a particular gene under a particular experimental condition.

[Figure: gene expression matrix – genes as rows, samples as columns, cell values are the gene expression levels]
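As a concrete (made-up) illustration, such a matrix can be held as a small table in Python; the gene and sample names below are hypothetical:

# A tiny, made-up gene expression matrix: genes as rows, samples as columns,
# values are (absolute or relative) expression levels.
import pandas as pd

expr = pd.DataFrame(
    [[2.3, 0.4, 1.1],
     [0.9, 1.8, 2.2],
     [1.5, 1.6, 0.2]],
    index=["geneA", "geneB", "geneC"],          # rows = genes
    columns=["sample1", "sample2", "sample3"],  # columns = samples/arrays
)
print(expr)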
Introduction
Identifying Similar Patterns

• The goal of microarray data analysis is to find relationships and patterns in the data that give insight into the underlying biology.

• Clustering algorithms can be applied to the resulting data to find groups of similar genes or groups of similar samples.
  – e.g. groups of genes with "similar expression profiles" (co-expressed genes) --- similar rows in the gene expression matrix
  – or groups of samples (disease cell lines/tissues/toxicants) with "similar effects" on gene expression --- similar columns in the gene expression matrix
Introduction
What is Data Clustering

• Clustering is a method by which a large set of data is grouped into clusters (groups) of smaller sets of similar data.

• Example: there are a total of 10 balls of three different colours, and we are interested in clustering the balls into three different groups.

• An intuitive solution is that balls of the same colour are clustered (grouped) together by colour.

• Identifying similarity by colour was easy; however, we want to extend this to numerical values so that we can deal with gene expression matrices, and also to cases where there are more features (not just colour).
Introduction
Clustering Algorithms

• A clustering algorithm attempts to find natural groups of components (or data) based on some notion of similarity over the features describing them.

• The clustering algorithm also finds the centroid of each group of data points.

• To determine cluster membership, many algorithms evaluate the distance between a point and the cluster centroids.

• The output from a clustering algorithm is basically a statistical description of the cluster centroids, together with the number of components in each cluster.
Key Terms and Concepts
Dimensionality of gene expression matrix

• Clustering algorithms work by calculating distances (or alternatively similarities) in higher-dimensional spaces, i.e. when the elements are described by many features (e.g. colour, size, smoothness, etc. in the balls example).

• A gene expression matrix of N genes x M samples can be viewed as:
  – N genes, each represented in an M-dimensional space
  – M samples, each represented in an N-dimensional space

• We will show graphical examples mainly in 2-D spaces
  – i.e. when N = 2 or M = 2

[Figure: gene expression matrix – genes as rows, samples as columns, cell values are the gene expression levels]
Key Terms and Concepts
Centroid and Distance
[Figure: two scatter plots illustrating a cluster centroid (left, 25 points marked '+') and the distance between two points, gene A and gene B (right)]

• In the first example (2 genes & 25 samples) the expression values of the 2 genes are plotted for 25 samples, and the centroid is shown.

• In the second example (2 genes & 2 samples) the distance between the expression values of the 2 genes is shown.
Key Terms and Concepts
Centroid and Distance

Cluster centroid:
The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster.

Distance:
Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as:

  d(p, q) = \sqrt{ \sum_i (p_i - q_i)^2 }
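A minimal Python sketch of these two definitions (the points are illustrative only):

import numpy as np

# Cluster centroid: the mean of the parameter values of all points in the cluster.
cluster = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])
centroid = cluster.mean(axis=0)          # -> [2.0, 3.0]

# Euclidean distance between two points p and q.
p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
d = np.sqrt(((p - q) ** 2).sum())        # same as np.linalg.norm(p - q) -> 5.0
print(centroid, d)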
Key Terms and Concepts
Distance/Similarity Measures

For two points (x1, y1) and (x2, y2):

• Euclidean (L2) distance
• Manhattan (L1) distance
• Lm: (|x1-x2|^m + |y1-y2|^m)^(1/m)
• L∞: max(|x1-x2|, |y1-y2|)
• Inner product: x1·x2 + y1·y2
• Correlation coefficient
• Spearman rank correlation coefficient

For simplicity we will concentrate on Euclidean and Manhattan distances in this course.
Key Terms and Concepts
Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features:

  x = (x_1, x_2, ..., x_p)
  y = (y_1, y_2, ..., y_p)

The Minkowski metric is defined by:

  d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}
Key Terms
Commonly Used Minkowski Metrics
1. r = 2 (Euclidean distance):

  d(x, y) = \sqrt{ \sum_{i=1}^{p} |x_i - y_i|^2 }

2. r = 1 (Manhattan distance):

  d(x, y) = \sum_{i=1}^{p} |x_i - y_i|

3. r = ∞ ("sup" distance):

  d(x, y) = \max_{1 \le i \le p} |x_i - y_i|
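A short sketch of the Minkowski family for two feature vectors (the data values are arbitrary):

import numpy as np

def minkowski(x, y, r):
    # d(x, y) = ( sum_i |x_i - y_i|^r )^(1/r); r = infinity gives the "sup" distance.
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(r):
        return diff.max()
    return (diff ** r).sum() ** (1.0 / r)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 2))       # Euclidean distance -> 5.0
print(minkowski(x, y, 1))       # Manhattan distance -> 7.0
print(minkowski(x, y, np.inf))  # sup distance       -> 4.0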
Key Terms and Concepts
Distance/Similarity Matrices

• Gene Expression Matrix
  – N Genes x M Samples

• Clustering is based on distances; this leads to a new, useful data structure:

• Similarity/Dissimilarity matrix
  – Represents the distances between either the N genes (NxN) or the M samples (MxM)
  – Only half the matrix is needed, since it is symmetrical

Data matrix:

  \begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}

Dissimilarity matrix:

  \begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}
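A sketch of building such a dissimilarity matrix with SciPy (random numbers stand in for a real gene expression matrix):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 3)                  # 5 genes x 3 samples (made-up values)
condensed = pdist(X, metric="euclidean")  # only one half of the symmetric matrix is stored
D = squareform(condensed)                 # full 5 x 5 matrix with zeros on the diagonal
print(D.round(2))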
Key Terms
Hierarchical vs. Non-hierarchical

• Hierarchical clustering is the most commonly used method for identifying groups of closely related genes or tissues. It successively links genes or samples with similar profiles to form a tree structure – much like a phylogenetic tree.

• K-means clustering is a method for non-hierarchical (flat) clustering that requires the analyst to supply the number of clusters in advance, and then allocates genes and samples to clusters appropriately.
Hierarchical Clustering
Algorithm
• Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
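A minimal Python sketch of these four steps, using single linkage (the data points and function names are illustrative, not taken from the slides):

import itertools
import numpy as np

def cluster_distance(points, c1, c2):
    # Single linkage: the shortest point-to-point distance across the two clusters.
    return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

def agglomerative(points):
    # Step 1: each item starts in its own cluster.
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        # Step 2: find and merge the closest pair of clusters.
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: cluster_distance(points, clusters[pair[0]], clusters[pair[1]]))
        clusters[a] += clusters.pop(b)
        merges.append((a, b))
        # Steps 3-4: cluster distances are recomputed on the next pass; repeat until one cluster remains.
    return merges

points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 4.5]])
print(agglomerative(points))   # e.g. [(0, 1), (2, 3), (0, 2)]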
Hierarchical Cluster Analysis

[Figure: worked example on three genes – a 3x3 distance matrix for genes 1, 2 and 3 is scanned for its minimum entry, the two closest genes (2 & 3) are joined into a single node (2&3), and the matrix is then updated]

• Scan the matrix for the minimum
• Join the genes into one node
• Update the matrix
Hierarchical Clustering
Distance Between Two Clusters
Whereas it is straightforward to calculate the distance between two points, we have various options when calculating the distance between clusters:

• Single-Link Method / Nearest Neighbor (minimum distance)
• Complete-Link / Furthest Neighbor (maximum distance)
• Their Centroids
• Average of all cross-cluster pairs (average distance)
Key Terms
Linkage Methods for hierarchical clustering

• Single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

• Complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

• Average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
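These three definitions can be written directly as small helper functions (a sketch; clusters are given as lists of points):

import numpy as np

def cross_distances(c1, c2):
    return [np.linalg.norm(p - q) for p in c1 for q in c2]

def single_link(c1, c2):    # shortest cross-cluster distance
    return min(cross_distances(c1, c2))

def complete_link(c1, c2):  # longest cross-cluster distance
    return max(cross_distances(c1, c2))

def average_link(c1, c2):   # average over all cross-cluster pairs
    return float(np.mean(cross_distances(c1, c2)))

c1 = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
c2 = [np.array([4.0, 0.0])]
print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2))   # 3.0 4.0 3.5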
Single-Link Method
Euclidean Distance

Initial distance matrix for points a, b, c, d:

        b   c   d
    a   2   5   6
    b       3   5
    c           4

(1) Merge a and b (distance 2). With single linkage, the distance from {a,b} to another cluster is the shortest cross-cluster distance:

          c   d
    a,b   3   5
    c         4

(2) Merge c into {a,b} (distance 3):

            d
    a,b,c   4

(3) Merge d into {a,b,c} (distance 4), leaving the single cluster a,b,c,d.
Complete-Link Method
Euclidean Distance

Initial distance matrix for points a, b, c, d:

        b   c   d
    a   2   5   6
    b       3   5
    c           4

(1) Merge a and b (distance 2). With complete linkage, the distance from {a,b} to another cluster is the longest cross-cluster distance:

          c   d
    a,b   5   6
    c         4

(2) Merge c and d (distance 4):

          c,d
    a,b   6

(3) Merge {a,b} and {c,d} (distance 6), leaving the single cluster a,b,c,d.
Key Terms and Concepts
Dendrograms and Linkage
The resulting tree structure is usually referred to as a dendrogram.
In a dendrogram, the length of each tree branch represents the distance between the clusters it joins.
Different dendrograms may arise when different linkage methods are used.

[Figure: dendrograms for a, b, c, d under single-link and complete-link clustering, with a distance axis running from 0 to 6]
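As a sketch, SciPy can reproduce the a–d example from the previous slides and draw the corresponding dendrograms (this assumes the same distance matrix: ab=2, ac=5, ad=6, bc=3, bd=5, cd=4):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Condensed distance matrix in the order d(a,b), d(a,c), d(a,d), d(b,c), d(b,d), d(c,d).
condensed = np.array([2.0, 5.0, 6.0, 3.0, 5.0, 4.0])

for method in ("single", "complete"):
    Z = linkage(condensed, method=method)
    dendrogram(Z, labels=["a", "b", "c", "d"])
    plt.title(f"{method}-link dendrogram")
    plt.show()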
Two Way Hierarchical Clustering
Note that we can do two-way clustering by performing clustering on both the rows and the columns.
It is common to visualise the data as shown using a heatmap.
Don't confuse the heatmap with the colours of a microarray image.
They are different!
Why?
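A sketch of two-way clustering visualised as a heatmap, assuming seaborn is available (random numbers stand in for real expression values):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

expr = pd.DataFrame(np.random.rand(20, 6),
                    index=[f"gene{i}" for i in range(20)],
                    columns=[f"sample{j}" for j in range(6)])
# Rows and columns are each clustered hierarchically; the reordered matrix is drawn as a heatmap.
sns.clustermap(expr, method="average", metric="euclidean", cmap="viridis")
plt.show()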
K-Means Clustering

• Basic idea: use cluster centroids (means) to represent the clusters.

• Assign data elements to the closest cluster (centroid).

• Goal: minimise the squared error (intra-class dissimilarity):

  \sum_i d(x_i, C(x_i))
K-means Clustering
Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest centroid
3) Compute the new centroid of each cluster:

  C(S) = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad \{X_1, ..., X_n\} = S

4) Repeat steps 2 and 3 until no object changes cluster
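A minimal sketch of steps 1–4 in Python (random data, k chosen arbitrarily; library implementations such as scikit-learn's KMeans do the same with more care):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initial partition: pick k data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) Assign each object to the cluster with the closest centroid.
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # 3) Compute the new centroid C(S) of each cluster (keep the old one if a cluster is empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # 4) Repeat steps 2 and 3 until the assignments (and hence centroids) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))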
The K-Means Clustering Method
Example
[Figure: worked example of K-means on 2-D data – successive panels show the objects being assigned to the nearest centroid and the centroids being recomputed until the assignments no longer change]
Summary

• Clustering algorithms are used to find similarity relationships between genes, diseases, tissues or samples
• Different similarity metrics can be used (mainly Euclidean and Manhattan)
• Hierarchical clustering
  – Similarity matrix
  – Algorithm
  – Linkage methods
• K-means clustering algorithm
Data Classification
Lecture Overview

• Introduction: Diagnostic and Prognostic Tools
• Data Classification
• Classification vs. Clustering
• Examples of Simple Classification Algorithms
  – Centroid-based
  – K-NN
• Decision Trees
  – Basic Concept
  – Algorithm
  – Entropy and Information Gain
  – Extracting rules from trees
• Bayesian Classifiers
• Evaluating Classifiers
Introduction
Predictive Modelling

• Diagnostic Tools: One of the most exciting areas of microarray research is the use of microarrays to find groups of genes that can be used diagnostically to determine the disease from which an individual is suffering.
  – Tissue Classification Tools: a simple example is, given measurements from one tissue type, to be able to ascertain whether the tissue has markers of cancer or not, and if so which type of cancer.

• Prognostic Tools: Another exciting area is, given measurements from an individual's sample, to predict prognostically the success of a course of a particular therapy.

• In both cases we can train a classification algorithm on previously collected data so as to obtain a predictive modelling tool. The aim of the algorithm is to find a small set of features and their values (e.g. a set of genes and their expression values) that can be used in future predictions (or classifications) on unseen samples.
Classification:
Obtaining a labeled training data set

• Goal: identify a subset of genes that distinguish between treatments, tissues, etc.

• Method
  – Collect several samples grouped by type (e.g. Diseased vs. Healthy) or by treatment outcome (e.g. Success vs. Failure).
  – Use genes as "features"
  – Build a classifier to distinguish treatments

  ID   G1      G2      G3     G4      Cancer
  1    11.12   1.34    1.97   11.0    No
  2    12.34   2.01    1.22   11.1    No
  3    13.11   1.34    1.34   2.0     Yes
  4    13.34   11.11   1.38   2.23    Yes
  5    14.11   13.10   1.06   2.44    Yes
  6    11.34   14.21   1.07   1.23    No
  7    21.01   12.32   1.97   1.34    Yes
  8    66.11   33.3    1.97   1.34    Yes
  9    33.11   44.1    1.96   11.23   Yes

To predict categorical class labels, construct a model based on the training set, and then use the model to classify new unseen data.
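A sketch of this set-up in Python, using the values from the table above (genes G1–G4 are the features, Cancer is the class label):

import pandas as pd

data = pd.DataFrame({
    "G1": [11.12, 12.34, 13.11, 13.34, 14.11, 11.34, 21.01, 66.11, 33.11],
    "G2": [1.34, 2.01, 1.34, 11.11, 13.10, 14.21, 12.32, 33.3, 44.1],
    "G3": [1.97, 1.22, 1.34, 1.38, 1.06, 1.07, 1.97, 1.97, 1.96],
    "G4": [11.0, 11.1, 2.0, 2.23, 2.44, 1.23, 1.34, 1.34, 11.23],
    "Cancer": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
})
X = data[["G1", "G2", "G3", "G4"]]   # features: gene expression values
y = data["Cancer"]                   # class labels used to train a classifier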
Classification:
Generating a predictive model

• The output of a classifier is a predictive model that can be used to classify unseen samples based on the values of their gene expression.

• The model shown below is a special type of classification model, known as a Decision Tree.

[Figure: decision tree – the root node tests G1 (<=22 vs. >22); one branch then tests G3 (<=12 vs. >12) and the other tests G4 (<=52 vs. >52), with Yes/No class labels at the leaves]
Classification
Overview

• Task: determine which of a fixed set of classes an example belongs to
• Inductive Learning System:
  – Input: a training set of examples annotated with class values
  – Output: induced hypotheses (model/concept description/classifiers)

Learning: induce classifiers from training data

  Training Data → Inductive Learning System → Classifiers (Derived Hypotheses)
Classification
Overview

• Using a Classifier for Prediction
  Using the hypothesis for prediction: classifying any example described in the same manner as the data used in training the system (i.e. the same set of features).

  Data to be classified → Classifier → Decision on class assignment
Classification
Examples in all walks of life

[Figure: decision tree for the weather example – Outlook at the root; Sunny → test Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → test Wind (true → No, false → Yes)]

The values of the features in the table can be categorical or numerical. However, we only deal with categorical variables in this course.
The class value has to be categorical.

  Outlook    Temperature  Humidity  Windy  Class
  sunny      hot          high      false  N
  sunny      hot          high      true   N
  overcast   hot          high      false  P
  rain       mild         high      false  P
  rain       cool         normal    false  P
  rain       cool         normal    true   N
  overcast   cool         normal    true   P
  sunny      mild         high      false  N
  sunny      cool         normal    false  P
  rain       mild         normal    false  P
  rain       mild         normal    true   P
  overcast   mild         high      true   P
  overcast   hot          normal    false  P
  rain       mild         high      true   N
Classification vs. Clustering
Classification                               Clustering
• known number of classes                    • unknown number of classes
• based on a training set                    • no prior knowledge
• used to classify future observations       • used to understand (explore) the data
• a form of supervised learning              • a form of unsupervised learning
Typical Classification Algorithms

• Centroid Classifiers
• kNN: k Nearest Neighbours
• Bayesian Classification: Naïve Bayes and Bayesian Networks
• Decision Trees
• Neural Networks
• Linear Discriminant Analysis
• Support Vector Machines
• …
Types of Classifiers
Linear vs. Non-linear

A linear discriminant in 2-D is a straight line; in N dimensions it is a hyperplane.

[Figure: scatter plots of two classes (* and o) in the G1–G2 plane – one separated by a linear classifier with the decision rule a*G1 + b*G2 > t → o, the other by a non-linear classifier with a curved boundary]

• Linear classifiers are easier to develop, e.g. the Linear Discriminant Analysis (LDA) method, which tries to find a good regression line by minimising the squared errors on the training data.

• Linear classifiers, however, may produce models that are not perfect on the training data.

• Non-linear classifiers tend to be more accurate, but may over-fit the data.

• By over-fitting the data, they may actually perform worse on unseen data.
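The linear rule from the figure (a*G1 + b*G2 > t → class o) can be written directly; the coefficient values below are arbitrary placeholders:

def linear_classifier(g1, g2, a=1.0, b=1.0, t=10.0):
    # Predict class "o" on one side of the line a*G1 + b*G2 = t, class "*" on the other.
    return "o" if a * g1 + b * g2 > t else "*"

print(linear_classifier(8.0, 7.0))   # "o"
print(linear_classifier(2.0, 3.0))   # "*"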
Types of Classifiers
K-Nearest Neighbour Classifiers

• K-NN works by assigning a data point to the class of its k closest neighbours (e.g. based on Euclidean or Manhattan distance).

• K-NN returns the most common class label among the k training examples nearest to x.

• We usually set k > 1 to avoid outliers.

• Variations:
  – We can also use a radius threshold rather than k.
  – We can also set a weight for each neighbour that takes into account how far it is from the query point.

[Figure: a query point x surrounded by training points labelled + and –; its class is decided by the labels of its k nearest neighbours]

• Model Training: none.

• Classification:
  – Given a data point, locate the k nearest points.
  – Assign the majority class of the k points.
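A minimal K-NN sketch (Euclidean distance, majority vote over the k nearest training points; the data values are made up):

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Locate the k nearest training points to x ...
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # ... and assign the majority class among them.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array(["-", "-", "-", "+", "+", "+"])
print(knn_predict(X_train, y_train, np.array([7.5, 8.0]), k=3))   # "+"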
Types of Classifiers
Decision Trees

[Figure: the weather decision tree – Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (true → No, false → Yes)]

• Decision tree
  – A flow-chart-like tree structure
  – Internal nodes denote a test on an attribute
  – Branches represent the outcomes of the test
  – Leaf nodes represent class labels or class distributions

• Decision tree generation
  – At the start, all the training examples are at the root
  – Examples are partitioned recursively based on selected attributes

• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
Types of Classifiers
Decision Tree Construction

[Figure: the weather decision tree shown again for reference]

• General idea:
  • Using the training data, choose the best feature to be used for the logical test at the root of the tree.
  • Partition the training data into sub-groups based on the values of the logical test.
  • Recursively apply the same procedure (select attribute and split), terminating when all the data elements in one branch are of the same class.

• The key to success is how to choose the best feature at each step:
  • The basic approach to selecting an attribute is to examine each attribute and evaluate its likelihood of improving the overall decision performance of the tree.
  • The most widely used node-splitting evaluation functions work by reducing the degree of randomness or "impurity" in the current node.
Decision Tree Construction
Algorithm

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down recursive manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g. information gain)

• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  – There are no samples left
Decision Tree
Example
In the simple example shown, the expression values, which are usually numbers, have been made into discrete values.
There are more complex methods that can deal with numeric features, but they are beyond this course.
  G1       G2      G3   G4    diseased
  <=30     high    no   low   no
  <=30     high    no   high  no
  31…40    high    no   low   yes
  >40      medium  no   low   yes
  >40      low     yes  low   yes
  >40      low     yes  high  no
  31…40    low     yes  high  yes
  <=30     medium  no   low   no
  <=30     low     yes  low   yes
  >40      medium  yes  low   yes
  <=30     medium  yes  high  yes
  31…40    medium  no   high  yes
  31…40    high    yes  low   yes
  >40      medium  no   high  no
In the example, I have chosen to use 3 discrete ranges for gene 1, two ranges (high/low) for genes 2 and 4, and expressed (yes/no) for gene 3.
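As a sketch (assuming scikit-learn is available), a decision tree can be induced from the table above; the categorical values are one-hot encoded because sklearn trees expect numeric inputs, and criterion="entropy" corresponds to the information-gain measure introduced on the next slides:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "G1": ["<=30", "<=30", "31…40", ">40", ">40", ">40", "31…40",
           "<=30", "<=30", ">40", "<=30", "31…40", "31…40", ">40"],
    "G2": ["high", "high", "high", "medium", "low", "low", "low",
           "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "G3": ["no", "no", "no", "no", "yes", "yes", "yes",
           "no", "yes", "yes", "yes", "no", "yes", "no"],
    "G4": ["low", "high", "low", "low", "low", "high", "high",
           "low", "low", "low", "high", "high", "low", "high"],
})
diseased = ["no", "no", "yes", "yes", "yes", "no", "yes",
            "no", "yes", "yes", "yes", "yes", "yes", "no"]

X = pd.get_dummies(df)                                   # one indicator column per categorical value
clf = DecisionTreeClassifier(criterion="entropy").fit(X, diseased)
print(export_text(clf, feature_names=list(X.columns)))   # the induced tree as nested tests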
Decision Trees
Using Information Gain

• Select the attribute with the highest information gain

• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and n elements of class N
  – The amount of information (entropy) is:

  I(p, n) = - \frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}
Information Gain in Decision Tree Construction

• Assume that, using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}
  – If Si contains pi examples of P and ni examples of N, the expected information (total entropy) of all the subtrees Si generated by partitioning via A is:

  E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)

• The encoding information that would be gained by branching on A is:

  Gain(A) = I(p, n) - E(A)
Attribute Selection by Information Gain Computation

• Class P: diseased = "yes"
• Class N: diseased = "no"
• I(p, n) = I(9, 5) = 0.940

• Compute the entropy for G1:

  G1       pi   ni   I(pi, ni)
  <=30     2    3    0.971
  31…40    4    0    0
  >40      3    2    0.971

  E(G1) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69

• Hence Gain(G1) = I(p, n) - E(G1)

• Similarly:
  Gain(G2) = 0.029
  Gain(G3) = 0.151
  Gain(G4) = 0.048
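A short sketch that reproduces the numbers above (I(9, 5) ≈ 0.940 and E(G1) ≈ 0.69):

from math import log2

def entropy(p, n):
    # I(p, n): entropy of a set with p elements of class P and n elements of class N.
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def expected_info(partition):
    # E(A): weighted entropy of the subsets {(p_i, n_i)} produced by attribute A.
    total = sum(p + n for p, n in partition)
    return sum((p + n) / total * entropy(p, n) for p, n in partition)

I_pn = entropy(9, 5)                              # ~0.940
E_G1 = expected_info([(2, 3), (4, 0), (3, 2)])    # ~0.69
print(round(I_pn, 3), round(E_G1, 3), round(I_pn - E_G1, 3))   # Gain(G1) = I(p, n) - E(G1)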
Extracting Classification Rules from Trees

• Decision trees can be simplified by representing the knowledge in the form of IF-THEN rules, which are easier for humans to understand
  – One rule is created for each path from the root to a leaf
  – Each attribute-value pair along a path forms a conjunction
  – The leaf node holds the class prediction

• Example
  IF G1 = "<=30" AND G3 = "no"   THEN diseased = "no"
  IF G1 = "<=30" AND G3 = "yes"  THEN diseased = "yes"
  IF G1 = "31…40"                THEN diseased = "yes"
  IF G1 = ">40" AND G4 = "high"  THEN diseased = "no"
  IF G1 = ">40" AND G4 = "low"   THEN diseased = "yes"
Further Notes

• We have mainly used examples with two classes; however, most classification algorithms can work with many class values so long as they are discrete.

• We have also mainly concentrated on examples that work on discrete feature values.

• Note that in many cases the data may be of very high dimensionality; this may cause problems for the algorithms, and we might need to use dimensionality reduction methods.
Summary

• Classification algorithms can be used to develop diagnostic and prognostic tools based on collected data, by generating predictive models that can label unseen data into existing classes.

• Simple classification methods: LDA, centroid-based classifiers and k-NN

• Decision Trees:
  – Decision tree induction works by choosing the best logical test for each tree node, one at a time, recursively splitting the data and applying the same procedure
  – Entropy and Information Gain are the key concepts to apply

• Not all classifiers achieve 100% accuracy; confusion matrices can be used to evaluate their accuracy.