Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 2: Clustering, Classification and Feature Selection
Sohrab Shah
Centre for Translational and Applied Genomics
Molecular Oncology Breast Cancer Research Program
BC Cancer Agency
[email protected]
Module Overview
• Introduction to clustering
– distance metrics
– hierarchical, partitioning and model based clustering
• Introduction to classification
– building a classifier
– avoiding overfitting
– cross validation
• Feature Selection in clustering and classification
Introduction to clustering
• What is clustering?
– unsupervised learning
– discovery of patterns in data
– class discovery
• Grouping together “objects” that are most similar (or
least dissimilar)
– objects may be genes, or samples, or both
• Example question: Are there samples in my cohort
that can be subgrouped based on molecular
profiling?
– Do these groups correlate with clinical outcome?
Distance metrics
• In order to perform clustering, we need to have a way
to measure how similar (or dissimilar) two objects are
• Euclidean distance:
$d_{xy} = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
• Manhattan distance:
$d_{xy} = \sum_{i=1}^{p} |x_i - y_i|$
• 1 − correlation:
$d_{xy} = 1 - \mathrm{cor}(x, y)$
– proportional to Euclidean distance, but invariant to the range of measurement from one sample to the next (see the R sketch below)
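A minimal R sketch (not from the slides), assuming a small made-up matrix X with samples in rows and genes in columns, showing how each of these dissimilarities can be computed:

  # Hypothetical expression matrix: 5 samples (rows) x 20 genes (columns)
  set.seed(1)
  X <- matrix(rnorm(5 * 20), nrow = 5,
              dimnames = list(paste0("sample", 1:5), paste0("gene", 1:20)))

  d.euc <- dist(X, method = "euclidean")   # Euclidean distances between samples
  d.man <- dist(X, method = "manhattan")   # Manhattan distances between samples
  d.cor <- as.dist(1 - cor(t(X)))          # 1 - Pearson correlation as a dissimilarity

  round(as.matrix(d.cor), 2)               # inspect the correlation-based distances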
Distance metrics compared
[Figure: the same data clustered using Euclidean, Manhattan and 1-Pearson distances]
Conclusion: distance matters!
Other distance metrics
• Hamming distance for ordinal, binary or
categorical data:
$d_{xy} = \sum_{i=1}^{p} I(x_i \neq y_i)$
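A minimal R sketch (not from the slides) of the Hamming distance between two hypothetical vectors of categorical calls:

  hamming <- function(x, y) sum(x != y)   # count the positions where the calls differ

  x <- c("A", "B", "B", "A", "C")
  y <- c("A", "B", "C", "A", "A")
  hamming(x, y)                           # 2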
Approaches to clustering
• Partitioning methods
– K-means
– K-medoids (partitioning around medoids)
– Model based approaches
• Hierarchical methods
– nested clusters
• start with pairs
• build a tree up to the root
Partitioning methods
• Anatomy of a partitioning based method
– data matrix
– distance function
– number of groups
• Output
– group assignment of every object
Partitioning based methods
• Choose K groups
– initialise group centers
• aka centroid, medoid
– assign each object to the
nearest centroid according to
the distance metric
– reassign (or recompute)
centroids
– repeat the last 2 steps until the assignment stabilizes (see the R sketch below)
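A minimal R sketch (not from the slides) of the partitioning loop above, running kmeans and pam (from the cluster package) on a small made-up data matrix with two obvious groups:

  library(cluster)   # provides pam(), partitioning around medoids

  set.seed(1)
  X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))   # 40 objects, 2 groups

  km <- kmeans(X, centers = 2, nstart = 25)           # K-means, 25 random restarts
  pm <- pam(dist(X, method = "euclidean"), k = 2)     # K-medoids from a distance matrix

  table(kmeans = km$cluster, pam = pm$clustering)     # compare the group assignments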
K-medoids in action
K-means vs K-medoids
K-means (R: kmeans)
– Centroids are the ‘mean’ of the clusters
– Centroids need to be recomputed every iteration
– Initialisation is difficult, as the notion of a centroid may be unclear before beginning

K-medoids (R: pam)
– Centroids are an actual object that minimizes the total within-cluster distance
– The centroid can be determined by a quick look-up into the distance matrix
– Initialisation is simply K randomly selected objects
Partitioning based methods
Advantages
– The number of groups is well defined
– A clear, deterministic assignment of each object to a group
– Simple algorithms for inference

Disadvantages
– Have to choose the number of groups
– Sometimes objects do not fit well into any cluster
– Can converge on locally optimal solutions, and often require multiple restarts with random initializations
Agglomerative hierarchical clustering
Hierarchical clustering
• Anatomy of hierarchical clustering
– distance matrix
– linkage method
• Output
– dendrogram
• a tree that defines the relationships between objects and
the distance between clusters
• a nested sequence of clusters
Linkage methods
[Figure: illustration of linkage criteria — single, complete, centroid (distance between centroids), and average]
Linkage methods
• Ward (1963)
– form partitions that minimize the loss associated with each grouping
– loss defined as the error sum of squares (ESS)
– consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
$ESS_{\text{one group}} = (2 - 2.5)^2 + (6 - 2.5)^2 + \dots + (0 - 2.5)^2 = 50.5$
On the other hand, if the 10 objects are classified according to their scores into four sets,
{0,0,0}, {2,2,2,2}, {5}, {6,6}
the ESS can be evaluated as the sum of four separate error sums of squares:
$ESS_{\text{four groups}} = ESS_{\text{group1}} + ESS_{\text{group2}} + ESS_{\text{group3}} + ESS_{\text{group4}} = 0.0$
Thus, clustering the 10 scores into 4 clusters results in no loss of information (checked in the short R snippet below).
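The numbers above can be verified with a short R check (not from the slides):

  ess <- function(x) sum((x - mean(x))^2)          # error sum of squares of one group

  scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
  ess(scores)                                      # one group: 50.5

  groups <- list(c(0, 0, 0), c(2, 2, 2, 2), 5, c(6, 6))
  sum(sapply(groups, ess))                         # four groups: 0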
Linkage methods in action
• clustering based on single linkage:
  single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
  plot(single)
Linkage methods in action
• clustering based on complete linkage:
  complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
  plot(complete)
Linkage methods in action
• clustering based on centroid linkage:
  centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
  plot(centroid)
Linkage methods in action
• clustering based on average linkage:
  average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
  plot(average)
Linkage methods in action
• clustering based on Ward linkage:
  ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward")
  plot(ward)
Linkage methods in action
Conclusion: linkage matters!
Hierarchical clustering analyzed
Advantages
– There may be small clusters nested inside large ones
– No need to specify the number of groups ahead of time
– Flexible linkage methods

Disadvantages
– Clusters might not be naturally represented by a hierarchical structure
– It is necessary to ‘cut’ the dendrogram in order to produce clusters (see the cutree sketch below)
– Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be ‘undone’
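A minimal R sketch (not from the slides) of cutting a dendrogram into a fixed number of groups, reusing the exprMatSub matrix from the linkage examples:

  complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
  groups   <- cutree(complete, k = 3)   # cut the tree into 3 clusters
  table(groups)                         # number of objects in each cluster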
Model based approaches
• Assume the data are ‘generated’ from a mixture of K distributions
– What cluster assignment and parameters of the K distributions best
explain the data?
• ‘Fit’ a model to the data
• Try to get the best fit
• Classical example: mixture of Gaussians (mixture of normals)
• Take advantage of probability theory and well-defined distributions in statistics (see the mclust sketch below)
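A minimal R sketch (not from the slides) of fitting a mixture of Gaussians with the mclust package; the data are made up, and BIC is used to choose the number of components:

  library(mclust)

  set.seed(1)
  X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))   # two overlapping 2-D groups

  fit <- Mclust(X, G = 1:5)            # consider 1 to 5 mixture components
  summary(fit)                         # chosen model, number of groups, BIC
  plot(fit, what = "classification")   # cluster assignments in the data space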
Model based clustering: array CGH
Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity making rarer shared
CNAs hard to detect
Approach: Cluster the data by extending the profiling to the multi-group setting
A mixture of HMMs: HMM-Mix
[Figure: HMM-Mix graphical model — raw data for patient p, CNA calls, hidden states (state k, state c), sparse profiles for each group g, and the distribution of calls in a group]
Shah et al (Bioinformatics, 2009)
Advantages of model based approaches
• In addition to clustering patients into groups, we output
a ‘model’ that best represents the patients in a group
• We can then associate each model with clinical
variables and simply output a classifier to be used on
new patients
• Choosing the number of groups becomes a model selection problem (e.g. using the Bayesian Information Criterion)
– see Yeung et al Bioinformatics (2001)
Clustering 106 follicular lymphoma
patients with HMM-Mix
[Figure: initialisation, converged profiles and clinical annotation for the clustered cohort]
– Recapitulates known FL subgroups
– Subgroups have clinical relevance
Feature selection
• Most features (genes, SNP probesets, BAC clones) in high
dimensional datasets will be uninformative
– examples: unexpressed genes, housekeeping genes, ‘passenger
alterations’
• Clustering (and classification) has a much higher chance of
success if uninformative features are removed
• Simple approaches (a short sketch follows below):
– select intrinsically variable genes
– require a minimum level of expression in a proportion of samples
– genefilter package (Bioconductor): Lab 1
• Return to feature selection in the context of classification
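A minimal R sketch (not from the slides) of two simple filters applied to a hypothetical expression matrix exprMat (rows = genes, columns = samples); the cut-offs are made up:

  library(genefilter)   # Bioconductor package used in Lab 1

  # 1) keep the most intrinsically variable genes: top 1000 by interquartile range
  iqr     <- apply(exprMat, 1, IQR)
  exprVar <- exprMat[head(order(iqr, decreasing = TRUE), 1000), ]

  # 2) require expression above 6 in at least 25% of samples
  flist    <- filterfun(kOverA(k = ceiling(0.25 * ncol(exprMat)), A = 6))
  exprFilt <- exprMat[genefilter(exprMat, flist), ]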
Advanced topics in clustering
• Top down clustering
• Bi-clustering or ‘two-way’ clustering
• Principal components analysis
• Choosing the number of groups
– model selection
• AIC, BIC
• Silhouette coefficient
• The Gap curve
• Joint clustering and feature selection
What Have We Learned?
• There are three main types of clustering approaches
– hierarchical
– partitioning
– model based
• Feature selection is important
– reduces computational time
– more likely to identify well-separated groups
• The distance metric matters
• The linkage method matters in hierarchical clustering
• Model based approaches offer principled probabilistic methods
Module Overview
• Clustering
• Classification
• Feature Selection
Classification
• What is classification?
– Supervised learning
– discriminant analysis
• Work from a set of objects with predefined classes
– e.g. basal vs luminal, or good responder vs poor responder
• Task: learn from the features of the objects: what is the
basis for discrimination?
• Statistically and mathematically heavy
Classification
[Figure: three patients labelled ‘poor response’ and three labelled ‘good response’ are used to learn a classifier, which is then applied to a new patient]
What is the most likely response?
Example: DLBCL subtypes
Wright et al, PNAS (2003)
DLBCL subtypes
Wright et al, PNAS (2003)
Classification approaches
• Wright et al PNAS (2003)
• Weighted features are combined in a linear predictor score (LPS): $LPS(X) = \sum_{j} a_j X_j$
– $a_j$: weight of gene j, determined by the t-test statistic
– $X_j$: expression value of gene j
• Assume there are 2 distinct distributions of LPS:
1 for ABC, 1 for GCB
Wright et al, DLBCL, cont’d
• Use Bayes’ rule to determine a probability that a sample
comes from group 1:
$P(\text{group 1} \mid LPS) = \dfrac{\phi(LPS;\, \hat{\mu}_1, \hat{\sigma}_1)}{\phi(LPS;\, \hat{\mu}_1, \hat{\sigma}_1) + \phi(LPS;\, \hat{\mu}_2, \hat{\sigma}_2)}$
– $\phi(\cdot;\, \hat{\mu}_1, \hat{\sigma}_1)$: probability density function (a normal density) that represents group 1
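A minimal R sketch (not from the paper) of the two-group rule above, with made-up group means and standard deviations for the LPS distributions:

  posterior.group1 <- function(lps, mu1, sd1, mu2, sd2) {
    d1 <- dnorm(lps, mean = mu1, sd = sd1)   # density under the group 1 distribution
    d2 <- dnorm(lps, mean = mu2, sd = sd2)   # density under the group 2 distribution
    d1 / (d1 + d2)                           # Bayes' rule with equal priors
  }

  posterior.group1(lps = 1.2, mu1 = 2, sd1 = 1, mu2 = -1, sd2 = 1.5)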
Learning the classifier, Wright et al
• Choosing the genes (feature selection):
– use cross validation
– Leave one out cross validation
• Pick a set of samples
• Use all but one of the samples for training, leaving one out for testing
• Fit the model using the training data
• Can the classifier correctly pick the class of the remaining case?
• Repeat exhaustively, leaving out each sample in turn
– Repeat using different sets and numbers of genes, chosen based on the t-statistic
– Pick the set of genes that gives the highest accuracy (see the LOOCV sketch below)
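A minimal R sketch (not from the paper) of the leave-one-out loop; fit_fn and predict_fn are placeholders for whatever training and prediction functions are being evaluated:

  loocv_accuracy <- function(X, y, fit_fn, predict_fn) {
    n    <- nrow(X)
    pred <- rep(NA, n)
    for (i in seq_len(n)) {
      model   <- fit_fn(X[-i, , drop = FALSE], y[-i])      # train on all but sample i
      pred[i] <- predict_fn(model, X[i, , drop = FALSE])   # predict the held-out sample
    }
    mean(pred == y)   # proportion of held-out samples classified correctly
  }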
Overfitting
• In many cases in biology, the number of features is much larger
than the number of samples
• Important features may not be represented in the training data
• This can result in overfitting
– when a classifier discriminates well on its training data, but does
not generalise to orthogonally derived data sets
• Validation is required in at least one external cohort to believe
the results
• example: the expression subtypes for breast cancer have been
repeatedly validated in numerous data sets
Overfitting
• To reduce the problem of overfitting, one can use
Bayesian priors to ‘regularize’ the parameter
estimates of the model
• Some methods now integrate feature selection and
classification in a unified analytical framework
– see Law et al IEEE (2005): Sparse Multinomial Logistic
Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
• Cross validation should always be used in training a
classifier
Evaluating a classifier
• The receiver operating characteristic (ROC) curve
– plots the true positive rate (TPR) against the false positive rate (FPR)
• Given ground truth and a probabilistic classifier:
– for a range of probability thresholds
– compute the TPR: the proportion of true positives that are predicted positive
– compute the FPR: the proportion of true negatives that are incorrectly predicted positive
– (a minimal sketch follows below)
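A minimal R sketch (not from the slides) that computes TPR and FPR over a grid of thresholds for made-up labels and predicted probabilities:

  roc_points <- function(labels, probs, thresholds = seq(0, 1, by = 0.01)) {
    t(sapply(thresholds, function(th) {
      pred <- probs >= th
      c(tpr = sum(pred & labels) / sum(labels),     # TP / (TP + FN)
        fpr = sum(pred & !labels) / sum(!labels))   # FP / (FP + TN)
    }))
  }

  set.seed(1)
  labels <- rbinom(100, 1, 0.5) == 1                           # made-up ground truth
  probs  <- ifelse(labels, rbeta(100, 3, 1), rbeta(100, 1, 3)) # made-up classifier scores
  pts    <- roc_points(labels, probs)
  plot(pts[, "fpr"], pts[, "tpr"], type = "l", xlab = "FPR", ylab = "TPR")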
Other methods for classification
• Support vector machines
• Linear discriminant analysis
• Logistic regression
• Random forests
• See:
– Ma and Huang Briefings in Bioinformatics (2008)
– Saeys et al Bioinformatics (2007)
Questions?
Lab: Clustering and feature selection
• Get familiar with clustering and plotting tools:
– Feature selection methods
– Distance matrices
– Linkage methods
– Partition methods
• Try to reproduce some of the figures from
Chin et al using the freely available data
Module 2: Lab
Coffee break
Back at: 15:00