CSE 181 Project guidelines


I519 Introduction to Bioinformatics, 2011

Microarray & clustering algorithms

Yuzhen Ye ([email protected]) School of Informatics & Computing, IUB

Outline

 

• Microarrays
• Clustering algorithms
  – Hierarchical clustering
  – K-means clustering

Microarrays and expression analysis

• Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer function.
• Microarrays measure the activity (expression level) of genes under varying conditions or time points.
• Expression level is estimated by measuring the amount of mRNA for that particular gene.
  – A gene is active if it is being transcribed.
  – More mRNA usually indicates more gene activity.

Steps in microarray experiment

• Experimental design
• Signal extraction
  – Image analysis
  – Normalization: remove the artifacts across arrays
• Data analysis
  – Selection of differentially expressed genes
  – Clustering and classification

Microarray experiments

• Produce cDNA from mRNA (DNA is more stable)
• Attach phosphor to cDNA to see when a particular gene is expressed
• Different color phosphors are available to compare many samples at once
• Hybridize cDNA over the microarray
• Scan the microarray with a phosphor-illuminating laser
• Illumination reveals transcribed genes
• Scan the microarray multiple times for the different color phosphors

Microarray experiments (cont'd)

High-density oligonucleotide arrays (the other type is cDNA arrays): www.affymetrix.com

Microarray experiment design

Type I: (n = 2)
– How is this gene expressed in target 1 as compared to target 2? (e.g., treated versus untreated; disease versus normal)
– Which genes show up/down regulation between the two targets?

Type II: (n > 2)
– How does the expression of gene A vary over time, tissues, or treatments?
– Do any of the expression profiles exhibit similar patterns of expression?

Using microarrays (type I)

   

• Green: expressed only in the control
• Red: expressed only in the experimental cell
• Yellow: equally expressed in both samples
• Black: not expressed in either control or experimental cells

Using microarrays (type II)

• Track the sample over a period of time to see gene expression over time
• Track two different samples under the same conditions to see the difference in gene expression

Each box represents one gene's expression over time.

Computational & statistical problems

• Image and data processing
  – Normalization
  – Noise reduction
• Data analysis
  – Clustering
  – Classification
  – Network inference

Differential expression

• Look for genes with vastly different expression under different conditions (type I)
  – Is the difference significant? (statistical analysis)

[Figure: scatter plot titled "Gene 1 vs Gene 2"; x-axis: Gene 1; both axes range from 0 to 60000]

  

Statistical considerations are essential for analysis of microarray data

• Data quality & background subtraction
  – Poorer-quality or uninformative data should be removed
• Data normalization
  – To remove uneven variation between two labels (e.g., Cy5 vs Cy3) in one slide or between slides
  – An observed intensity is the sum of contributions from variables such as slide-to-slide variation, dye variation, etc.
  – Lowess normalization (data within a small window of expression values are fitted to a straight line by linear regression)
• Differential expression
  – Analysis of variance (ANOVA)
  – T-test
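A minimal sketch of the t-test step for a single gene, assuming log-scale intensities from replicate arrays. It uses Welch's unequal-variance form of the t statistic, which is a choice made here for illustration (the slide only says "T-test"). The function name welch_t and the replicate values are hypothetical; turning the statistic into a p-value would additionally need the t distribution, which is not included.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    # Per-sample variance of the mean (sample variance divided by sample size).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical log2 intensities of one gene in replicate arrays.
treated = [8.1, 8.4, 7.9, 8.3]
untreated = [6.0, 6.2, 5.8, 6.1]
t_stat, dof = welch_t(treated, untreated)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large |t| relative to the degrees of freedom suggests the gene is differentially expressed; with thousands of genes tested at once, some correction for multiple testing is usually also needed.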

Coordinated gene expression

Time series microarray data

• Microarray data are usually transformed into an intensity matrix (below).
• The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related.
• This is where clustering comes into play.

Intensity (expression level) of each gene at each measured time:

Gene      Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3

Applications of clustering

• Viewing and analyzing vast amounts of biological data as a whole set can be perplexing.
• It is easier to interpret the data if they are partitioned into clusters combining similar data points.

Clustering of microarray data

• Plot each datum as a point in N-dimensional space
• Make a distance matrix for the distance between every two gene points in the N-dimensional space
• Genes with a small distance share the same expression characteristics and might be functionally related or similar.

Clustering reveals groups of functionally related genes
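The distance-matrix step can be sketched as follows, assuming Euclidean distance and using the five-gene, three-time-point intensity values from the time series slide. The variable names (intensity, genes, dist) are illustrative only.

```python
import math

# Rows are genes; columns are the measurements at Time X, Time Y, Time Z.
intensity = {
    "Gene 1": [10, 8, 10],
    "Gene 2": [10, 0, 9],
    "Gene 3": [4, 8.6, 3],
    "Gene 4": [7, 8, 3],
    "Gene 5": [1, 2, 3],
}

genes = list(intensity)

# dist[i][j] is the Euclidean distance between gene i and gene j,
# each treated as a point in 3-dimensional space.
dist = [
    [math.dist(intensity[g], intensity[h]) for h in genes]
    for g in genes
]

for name, row in zip(genes, dist):
    print(name, " ".join(f"{value:5.1f}" for value in row))
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal carry information.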

Clustering of microarray data (cont'd)

Two key factors determine the clusters:
1) What distance measure is used
2) What principle is used to construct clusters

Distance measure

• Correlation coefficient (between two variables X and Y)
  – Pearson correlation coefficient (sensitive to outliers): $r = \mathrm{cov}(X, Y) / \sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}$. Values range from -1 to 1; a value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases.
  – Spearman correlation coefficient (nonparametric version; uses the ranks)
• Absolute value of correlation coefficient, |r|
• Euclidean distance
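A small sketch of the two correlation coefficients, written in plain Python so the definitions stay visible. The helper names (pearson, ranks, spearman) are illustrative; ties in the Spearman ranking are given average ranks, which is a common convention assumed here.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either variable is constant.
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotone but nonlinear relationship.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(pearson(x, y), spearman(x, y))
```

For the monotone but nonlinear example, the Spearman coefficient is 1 while the Pearson coefficient is slightly below 1, which illustrates the difference between the two measures.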

Homogeneity and separation principles

  

• Homogeneity: elements within a cluster are close to each other
• Separation: elements in different clusters are further apart from each other
• …clustering is not an easy task!

Given these points, a clustering algorithm might make two distinct clusters as follows

Bad clustering

This clustering violates both the Homogeneity and Separation principles:
• Points in separate clusters are close to each other
• Points in the same cluster are far from each other

Good clustering

This clustering satisfies both Homogeneity and Separation principles

  

Clustering techniques

• Agglomerative: Start with every element in its own cluster, and iteratively join clusters together
• Divisive: Start with one cluster and iteratively divide it into smaller clusters
• Hierarchical: Organize elements into a tree; leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees

Hierarchical clustering

Hierarchical clustering (cont'd)

• Hierarchical clustering is often used to reveal evolutionary history

Hierarchical clustering algorithm

1.  Hierarchical Clustering (d, n)
2.    Form n clusters, each with one element
3.    Construct a graph T by assigning one vertex to each cluster
4.    while there is more than one cluster
5.      Find the two closest clusters C1 and C2
6.      Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
7.      Compute the distance from C to all other clusters
8.      Add a new vertex C to T and connect it to vertices C1 and C2
9.      Remove the rows and columns of d corresponding to C1 and C2
10.     Add a row and column to d corresponding to the new cluster C
11.   return T

The algorithm takes an n x n distance matrix d of pairwise distances as an input.

Different ways to define distances between clusters may lead to different clusterings
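The agglomerative procedure can be sketched in Python as below. This is an illustrative version under stated assumptions, not the slide's exact pseudocode: instead of removing and adding rows of d, it recomputes cluster distances from the original matrix on every pass (simpler, but slower), and it returns the tree T as nested tuples. The names hierarchical_clustering and single_link are hypothetical.

```python
def single_link(d, c1, c2):
    """Cluster distance: smallest pairwise distance between members."""
    return min(d[i][j] for i in c1 for j in c2)


def hierarchical_clustering(d, cluster_distance=single_link):
    """Merge the two closest clusters until one remains; return the tree.

    d is an n x n matrix of pairwise distances. Leaves of the returned
    nested tuple are element indices 0..n-1.
    """
    n = len(d)
    clusters = {i: (i,) for i in range(n)}  # cluster id -> member indices
    tree = {i: i for i in range(n)}         # cluster id -> subtree
    next_id = n
    while len(clusters) > 1:
        ids = list(clusters)
        # Find the two closest clusters.
        a, b = min(
            ((p, q) for x, p in enumerate(ids) for q in ids[x + 1:]),
            key=lambda pq: cluster_distance(d, clusters[pq[0]], clusters[pq[1]]),
        )
        # Merge them into a new cluster and join their subtrees.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        tree[next_id] = (tree.pop(a), tree.pop(b))
        next_id += 1
    (root,) = tree.values()
    return root


# Four points on a line: two tight pairs that are far from each other.
positions = [0, 1, 5, 6]
d = [[abs(p - q) for q in positions] for p in positions]
print(hierarchical_clustering(d))
```

Passing a different cluster_distance function changes which clusters count as "closest" and can therefore change the resulting tree.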

Hierarchical clustering: Recomputing distances 

• $d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$
  – Distance between two clusters is the smallest distance between any pair of their elements
• $d_{\mathrm{avg}}(C, C^*) = \frac{1}{|C^*|\,|C|} \sum_{x \in C,\, y \in C^*} d(x, y)$
  – Distance between two clusters is the average distance between all pairs of their elements
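The two cluster-distance definitions translate directly into code. The sketch below assumes clusters are given as lists of indices into a pairwise distance matrix d; the function names d_min and d_avg mirror the notation above, and the four example points are made up.

```python
def d_min(d, c1, c2):
    """Smallest distance between any element of c1 and any element of c2."""
    return min(d[x][y] for x in c1 for y in c2)


def d_avg(d, c1, c2):
    """Average distance over all pairs (x in c1, y in c2)."""
    return sum(d[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))


# Hypothetical 1-dimensional points and their pairwise distances.
points = [0.0, 1.0, 5.0, 9.0]
d = [[abs(p - q) for q in points] for p in points]

left, right = [0, 1], [2, 3]
print(d_min(d, left, right))  # closest pair: points 1.0 and 5.0
print(d_avg(d, left, right))  # mean of the four cross-cluster distances
```

Here the pairwise cross-cluster distances are 5, 9, 4 and 8, so the minimum is 4 and the average is 6.5; the two measures can rank candidate merges differently.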

K-Means clustering problem

 

• Input: A set, V, consisting of n points and a parameter k
• Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X

Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point from X.

Given a set of n data points V = {v_1, …, v_n} and a set of k points X, define the Squared Error Distortion

$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
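The squared error distortion can be computed directly from its definition. A minimal sketch, assuming points and centers are tuples of coordinates; the function names and the example data are illustrative.

```python
import math


def distance_to_set(v, centers):
    """d(v, X): Euclidean distance from v to the closest point in X."""
    return min(math.dist(v, x) for x in centers)


def squared_error_distortion(points, centers):
    """d(V, X): mean of the squared distances d(v_i, X)^2 over all points."""
    return sum(distance_to_set(v, centers) ** 2 for v in points) / len(points)


# Two well-separated pairs of points on a line.
V = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

two_centers = [(1.0, 0.0), (11.0, 0.0)]
one_center = [(6.0, 0.0)]

print(squared_error_distortion(V, two_centers))
print(squared_error_distortion(V, one_center))
```

With a center in the middle of each pair, every point is at distance 1 from its closest center, so the distortion is 1; with a single center at (6, 0) the distortion rises to 26.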

1-Means clustering problem

• Input: A set, V, consisting of n points
• Output: A single point x (cluster center) that minimizes the squared error distortion d(V, x) over all possible choices of x

The 1-Means clustering problem is easy.
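One way to see why the single-center case is easy: the squared error distortion for one center is minimized by the center of gravity (coordinate-wise mean) of the points, which can be computed in a single pass. The sketch below checks this numerically on a made-up example; centroid and distortion are hypothetical helper names.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def distortion(points, x):
    """Squared error distortion d(V, x) for a single center x."""
    return sum(math.dist(v, x) ** 2 for v in points) / len(points)


V = [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]
center = centroid(V)

print(center)
print(distortion(V, center))
print(distortion(V, (2.0, 3.0)))  # a different candidate center
```

Moving the center away from the centroid, here from (2, 2) to (2, 3), increases the distortion from 32/3 to 35/3.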

However, it becomes very difficult (NP-hard) for more than one center. An efficient heuristic method for K-Means clustering is the Lloyd algorithm.

K-Means clustering: Lloyd algorithm

1. Lloyd Algorithm
2.   Arbitrarily assign the k cluster centers
3.   while the cluster centers keep changing
4.     Assign each data point to the cluster C_i corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
5.     After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster, that is, the new cluster representative is ∑ v / |C| over all v in C, for every cluster C

*This may lead to merely a locally optimal clustering.
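A compact sketch of the Lloyd iteration in plain Python. Several details are assumptions made for illustration and are not specified on the slide: the initial centers are k data points drawn at random with a fixed seed, an empty cluster keeps its previous center, and max_iter caps the number of passes. The function name lloyd and the example points are hypothetical.

```python
import math
import random


def lloyd(points, k, seed=0, max_iter=100):
    """Return (centers, clusters) after the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrary initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for v in points:
            nearest = min(range(k), key=lambda i: math.dist(v, centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the center of gravity of its cluster.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Two well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
centers, clusters = lloyd(points, 2)
print(sorted(centers))
```

On well-separated data like this the iteration settles on one center per group. On harder data the result depends on the initial centers, so running it from several seeds and keeping the clustering with the lowest distortion is a common precaution.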

[Figure: four panels plotting the data points and the cluster centers x1, x2, x3; x-axis: expression in condition 1; both axes range from 0 to 5]

  

Biclustering

If two genes are related (have similar functions or are co-regulated), their expression profiles should be similar (e.g. low Euclidean distance or high correlation).

However, they may have similar expression patterns only under some conditions (e.g. they respond similarly to a certain external stimulus, but each of them has some distinct functions at other times).

Similarly, for two related conditions, some genes may exhibit different expression patterns (e.g. two tumor samples of different sub-types).

Biclustering

• As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a “checkerboard” structure.
• In reality, each gene/condition may participate in multiple clusters.

Co-clustering: simultaneous clustering of the rows and columns of a matrix

 

Biclustering

• To discover such data patterns, some “biclustering” methods have been proposed to cluster both genes and conditions simultaneously.
• Differences from projected clustering (by observation, not by definition):
  – Projected clustering has a primary clustering target; biclustering usually treats rows and columns equally.
  – Most projected clustering methods define attribute relevance based on value distances; most biclustering methods define biclusters based on other measures.
  – Some biclustering methods do not have the concept of irrelevant attributes.

Sample classification

• Classify samples based on gene expression pattern
  – AML: acute myeloid leukemia
  – ALL: acute lymphoblastic leukemia
  – c: idealized expression pattern
  – c*: random idealized expression pattern
• The prediction of a new sample is based on "weighted votes" of a set of informative genes (each gene votes for AML or ALL depending on its expression level)

Ref: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring (Science, 1999)

Packages for microarray data analysis

• Packages in R
  – LIMMA, a library for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression; part of Bioconductor
  – Bioconductor is open source software for bioinformatics; it consists of 352 packages (BioC 2.5)

Working with big matrices

[Figure: a matrix whose rows are genes (g1, g2, …, gn) and whose columns are times, tissues, samples, or environments (t1, t2, …); entries are gene expression measured by microarray or RNAseq, or gene abundance]