Introduction to microarray
Bin Yao
[email protected]
Types of Microarray
• Affymetrix GeneChip (Oligo)
• Spotted array (cDNA /Oligo)
Affymetrix GeneChip
• in-situ Synthesis: photolithography and
combinatorial chemistry.
• Each probe set contains 13-21 pairs of 25-mer oligo
probes.
• Each pair consists of a PM (perfect match) probe and an MM (mismatch) probe.
Spotted array
cDNA or oligos are printed on glass slides
using an arrayer.
Procedures
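[Figure: Sample 1 mRNA and Sample 2 mRNA are labeled with Cy3 and Cy5 and hybridized to the array. The array is scanned with a laser, the PMT detects the signal, and the ADC converts it into an image, which is then quantified into data.]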
Image quantification
• Pixel value
• Image: 16-bit grayscale image. Values range from 0 to 65535
(2^16 values). Any signal above 65535 is saturated.
• Image segmentation: separate signal, background and
contamination
• Output data files: Spotted array
– Signal Mean
– Background Mean
– Signal Median
– Background Median
– Signal Stdev
– Background Stdev
• Output data files: Affymetrix
– .DAT: Pixel data
– .CEL: Intensity information for a given probe on an array
– .EXP: Experiment information
– .CHP: Analysis result from a Microarray Suite analysis
Get gene expression value
from probe level data
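Consolidate the 26 values of a probe set (13 PM and 13 MM) into one
gene expression value
1. MAS (4 & 5): Affymetrix algorithm
Gene expression = weighted average of (PM − MM)
2. dChip: model-based expression index
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
(array i, probe j; $\theta_i$ is the expression index, $\phi_j$ the probe sensitivity)
with invariant set normalization
3. RMA: robust multi-array average
$\log(\text{normalized}(PM_{ij} - BKG)) = \mu_i + \alpha_j + \varepsilon_{ij}$
(array i, probe j; $\mu_i$ is the log-scale expression, $\alpha_j$ the probe effect)
with quantile normalization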
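The additive model in item 3 is usually described as being fitted by median polish, a robust way to split a table into row and column effects. The sketch below is only an illustration under stated assumptions: the values are taken to be already background-corrected, normalized and log2-transformed, and the function name `median_polish` and the toy numbers are invented for the example.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping out medians.

    Rows are arrays (i) and columns are probes (j), as in the additive model.
    """
    resid = [list(map(float, row)) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Move each row median from the residuals into the row effect.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Move each column median from the residuals into the column effect.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff

# Hypothetical log2 intensities: 2 arrays (rows) x 3 probes (columns).
log_pm = [[7.0, 8.0, 9.0],
          [9.0, 10.0, 11.0]]
overall, array_eff, probe_eff = median_polish(log_pm)
expression = [overall + a for a in array_eff]  # one summary value per array
print(expression)
```

Because medians are used instead of means, a single outlying probe has little influence on the per-array summary.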
Data analysis
• What are the problems in microarray data analysis?
– Different sources of variance
– Large number of genes (many false positives)
– Small number of replicates (low sensitivity)
Data pre-processing
• Background correction: The signal of a spot contains specific
binding signal, non-specific binding signal and
background signal.
• Background estimation: local background, global
background and negative control spots.
• Data filtering: low-signal spots and contaminated spots.
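The pre-processing steps above can be sketched as follows. This is a minimal illustration that assumes a local background estimate and a contamination flag are available for every spot; the function name, the tuple layout and the threshold are hypothetical.

```python
def background_correct(spots, min_signal=0.0):
    """Subtract the local background and drop flagged or low-signal spots.

    spots maps a spot name to (signal, local_background, flagged).
    """
    kept = {}
    for name, (signal, background, flagged) in spots.items():
        corrected = signal - background
        if flagged or corrected <= min_signal:
            continue  # filtered out: contaminated or too close to background
        kept[name] = corrected
    return kept

# Made-up spot measurements.
spots = {
    "geneA": (1200.0, 200.0, False),
    "geneB": (210.0, 205.0, False),   # barely above background
    "geneC": (5000.0, 180.0, True),   # flagged as contaminated
}
filtered = background_correct(spots, min_signal=50.0)
print(filtered)
```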
Data transformation
Ratio is not symmetric:
2-fold decrease = 0.5, no change = 1, 2-fold increase = 2
Log ratio is symmetric:
log2(2-fold decrease) = -1, log2(no change) = 0, log2(2-fold increase) = 1
Multiplicative in ratio becomes additive in logarithm: log(A/B) = log A − log B
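A short numeric check of the symmetry argument, using only the standard library. The intensities `a` and `b` are made-up values.

```python
from math import log2, isclose

ratios = {"2-fold decrease": 0.5, "no change": 1.0, "2-fold increase": 2.0}
log_ratios = {label: log2(r) for label, r in ratios.items()}
print(log_ratios)  # the decrease and the increase are now equally far from 0

# A ratio of two intensities becomes a difference of their logs.
a, b = 800.0, 200.0
print(isclose(log2(a / b), log2(a) - log2(b)))
```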
[Figure: two histograms, "Fold change distribution" and "Log(fold) distribution"; y-axis: Frequency.]
Sources of Variance
• Printing pin
• Scanning (laser and detector, PMT, focus)
• Hybridization (temperature, time, mixing, etc.)
• Probe labeling
• RNA preparation
• Biological variability
Normalization
Many effects besides the treatment effect (systematic errors)
can also change gene signal values. Normalization
removes these systematic errors so that gene signals can be
compared directly.
Numerous normalization methods are available. How to
choose?
1. Understand the sources of variation in your data.
2. Understand the assumptions behind each method.
3. Check diagnostic plots.
Normalization methods
• Dividing by mean or median
Normalized signal = (signal of a spot on an
array) / (mean or median intensity of all spots on the array)
The mean or median can be computed on a subset of genes, e.g. excluding
genes whose intensity is in the top 10% or bottom 10%,
to minimize the effect of outliers or
differentially expressed genes.
• Subtracting mean: used for log-transformed data
• Z-transformation
Normalized signal = (signal of a spot − mean signal of
the array) / (standard deviation of the signals on the array)
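The three formulas above can be sketched in a few lines. The function names and the `trim` parameter are illustrative choices, and the sample standard deviation is assumed for the Z-transformation.

```python
from statistics import mean, median, stdev

def scale_by_median(signals, trim=0.0):
    """Divide by the array median; the median can ignore the top and
    bottom `trim` fraction of spots (trim must be below 0.5)."""
    ordered = sorted(signals)
    k = int(len(ordered) * trim)
    core = ordered[k:len(ordered) - k] if k else ordered
    m = median(core)
    return [s / m for s in signals]

def center_log(log_signals):
    """Subtract the array mean; intended for log-transformed data."""
    m = mean(log_signals)
    return [s - m for s in log_signals]

def z_transform(signals):
    """Subtract the array mean and divide by the array standard deviation."""
    m, sd = mean(signals), stdev(signals)
    return [(s - m) / sd for s in signals]

print(scale_by_median([100.0, 200.0, 400.0]))
print(center_log([1.0, 2.0, 3.0]))
print(z_transform([1.0, 2.0, 3.0]))
```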
Normalization methods
• Quantile normalization: force the signal values of every array to follow the same distribution
• Housekeeping gene
Normalized signal = (signal of a spot) / (signal of
housekeeping gene(s))
• Intensity-dependent normalization
Use local regression to correct non-linear intensity
dependency.
[Figure: intensity-dependent normalization; panels "Before Normalization" and "After Normalization".]
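Quantile normalization, listed among the methods above, is commonly implemented by replacing each value with the mean of the values that share its rank across arrays. The following is a minimal sketch of that idea; it assumes all arrays have the same number of spots and does not treat ties specially, and the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of signals per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda idx: a[idx])  # indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two made-up arrays with three spots each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(result)
```

After normalization every array contains exactly the same set of values; only their assignment to spots differs.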
Which genes are differentially
expressed?
One goal of a microarray experiment is to find lists of
genes that are up- or down-regulated between treatments.
• Fold change:
Simple
Low sensitivity
High false positive rate
• Hypothesis test
Takes into account both the magnitude of the change
and the uncertainty of the measurement.
T-test: two-group comparison
– Student's t-test: assumes equal variance and normal
distribution.
– Welch's method: assumes normal distribution; the variances
need not be equal.
– Wilcoxon and Mann-Whitney: non-parametric, no
assumption about the distribution
• Analysis of Variance (ANOVA):
– Compare multiple groups: which genes are differentially
expressed in at least one condition? A post hoc test finds the
condition(s) that change gene expression.
– Two-way or higher-way ANOVA
One-way ANOVA tests only one factor, the treatment effect. In
a microarray experiment there is more than one factor. Some of these
factors are of no interest but cannot be avoided.
An ANOVA model for two-color microarray
Y = A + D + G + A*D + G*T
where A = array effect, D = dye effect, G = gene effect, T = treatment
effect, A*D = array-dye interaction, G*T = gene-treatment
interaction (usually the term of interest)
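For the two-group and multi-group comparisons listed above, the test statistics themselves are easy to compute by hand. The sketch below uses only the standard library and returns the statistic and its degrees of freedom; turning them into a p-value needs the t or F distribution, which is not included here. Function names and data are made up, and the one-way F statistic stands in for the simplest ANOVA case, not the two-color model.

```python
from math import sqrt
from statistics import mean, variance

def student_t(x, y):
    """Two-sample t statistic with pooled (equal) variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    """Two-sample t statistic without the equal-variance assumption."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

control = [1.0, 2.0, 3.0]   # hypothetical log signals of one gene
treated = [3.0, 4.0, 5.0]
t_s, df_s = student_t(control, treated)
t_w, df_w = welch_t(control, treated)
f, df1, df2 = one_way_f([control, treated])
print(t_s, df_s, t_w, df_w, f)
```

With exactly two groups the one-way F statistic equals the square of the Student t statistic, which is a handy sanity check.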
Multiple test and p value adjustment
If the probability of a false positive in a t-test
for a single gene is p = 0.05, then for 5000 genes you can
expect 5000 × 0.05 = 250 false positives.
To keep the probability of making even one mistake over the
entire 5000 genes at 0.05 (family-wise error rate), the
p-value for each gene needs to be adjusted.
Bonferroni adjustment: simple but conservative
p* = min{p × N, 1}, where p is the raw p-value and N is the
total number of tests.
Holm or step-down Bonferroni: less conservative
Westfall and Young's permutation: takes possible correlations
between genes into account. Slow.
False discovery rate: expected proportion of false positives
in the gene list.
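The adjustments above can be sketched as follows. Bonferroni and Holm follow the definitions given here; for the false discovery rate the slide names no procedure, so the Benjamini-Hochberg step-up adjustment is used as an assumed, commonly used choice. The p-values are invented.

```python
def bonferroni(pvals):
    """p* = min(p * N, 1) for every test."""
    n = len(pvals)
    return [min(p * n, 1.0) for p in pvals]

def holm(pvals):
    """Step-down Bonferroni: the k-th smallest p-value is multiplied by N - k + 1."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min((n - rank) * pvals[i], 1.0))
        adjusted[i] = running_max  # keeps adjusted p-values monotone
    return adjusted

def benjamini_hochberg(pvals):
    """FDR adjustment (assumed procedure): p * N / rank, made monotone."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # 1-based rank among ascending p-values
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

expected_false_positives = 5000 * 0.05  # the slide's example
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```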
Cluster Analysis
• First used by Tryon in 1939 to organize observed
data into meaningful structures
• Find genes that have similar expression profiles
• Types of cluster analysis: hierarchical clustering and
k-means clustering
Hierarchical cluster
A dendrogram (tree) shows the hierarchical relationship.
– Bottom-up (agglomerative): Start from
individual genes. Measure the distance between all pairs of
genes/nodes, join the two genes/nodes with the
shortest distance, and iterate until all genes are
joined.
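[Figure: agglomerative clustering of four genes g1-g4. Find the minimum of the pairwise distances {d1…d6} and join g1 and g2 into node g12; find the minimum of {d1'…d3'} and join g12 and g4 into node g124; finally join g124 and g3. The resulting dendrogram orders the genes g1, g2, g4, g3.]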
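The bottom-up procedure can be sketched in a few lines. This illustration assumes Euclidean distance and single linkage (the distance between two nodes is the distance between their closest members); the function name and the four toy profiles are made up. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist

def agglomerate(profiles):
    """profiles: dict mapping a gene name to its expression vector.
    Returns the merge history as (left, right, distance) tuples."""
    clusters = {name: [vec] for name, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        # Measure the distance between all pairs of nodes.
        for a in range(len(names)):
            for b in range(a + 1, len(names)):
                d = min(dist(p, q)
                        for p in clusters[names[a]]
                        for q in clusters[names[b]])
                if best is None or d < best[0]:
                    best = (d, names[a], names[b])
        d, left, right = best
        # Join the two closest nodes into a new node.
        clusters[f"({left},{right})"] = clusters.pop(left) + clusters.pop(right)
        merges.append((left, right, d))
    return merges

profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0], "g4": [0.0, 3.0]}
merges = agglomerate(profiles)
for step in merges:
    print(step)
```

The merge distances are the heights at which branches join in the dendrogram.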
• K-means clustering: find k clusters that are separated as far as
possible.
– Start from k random clusters and move elements
between clusters to minimize the variability within
clusters and maximize the variability between clusters.
Iterate until convergence or until a specified number of iterations
is reached.
– Some methods have been developed to estimate the number of
clusters, e.g. the silhouette plot. However, there is no
completely satisfactory method for determining the
number of clusters.
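A minimal k-means sketch, assuming the common variant that starts from k randomly chosen data points as cluster centers and alternates between assigning points to the nearest center and recomputing the centers. The function name, the `seed` parameter and the data are illustrative. It needs Python 3.8 or later for `math.dist`.

```python
import random
from math import dist

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random starting centers
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: no element changed cluster
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = kmeans(points, k=2)
print(assignment)
```

Different seeds can give different cluster labels, and on less well-separated data different clusters, which is one reason the result should be checked over several runs.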
Distance measurement
• Euclidean distance
$\text{distance}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
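• City-block (Manhattan) distance
$\text{distance}(x,y) = \sum_{i=1}^{n} |x_i - y_i|$
[Figure: the city-block path between points A and B is made of the segments a, b, c and d, so d(A,B) = a + b + c + d.]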
The result is similar to Euclidean distance, but the effect of a single outlier is
smaller.
Both methods measure geometric distance.
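Both geometric distances are one-liners. The sketch below uses made-up vectors; the second example shows how a single outlying coordinate weighs more heavily in the Euclidean distance because differences are squared.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean([0.0, 0.0], [3.0, 4.0]), manhattan([0.0, 0.0], [3.0, 4.0]))

# Share of the total distance contributed by the last (outlying) coordinate.
x = [1.0, 1.0, 1.0, 1.0]
y = [2.0, 2.0, 2.0, 9.0]
outlier_share_manhattan = abs(x[3] - y[3]) / manhattan(x, y)
outlier_share_euclidean = (x[3] - y[3]) ** 2 / euclidean(x, y) ** 2
print(outlier_share_manhattan, outlier_share_euclidean)
```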
•Angle distance
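Euclidean distance depends on the magnitude of the vectors.
Angle distance depends only on the angle between two
vectors: moving A and B along their
lines (to A' and B') changes the Euclidean distance (d becomes d') but not the angle distance between A and B.
$d(x,y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
This is the cosine of the angle between x and y.
[Figure: vectors A and B in the x-y plane with Euclidean distance d, and the rescaled vectors A' and B' with distance d'; the angle between them is unchanged.]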
• Pearson correlation
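Measures how closely two genes change in the same way.
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
$r_{xy}$ is between −1 and 1. If $r_{xy} < 0$, the two genes change in opposite ways.
Distance is defined as $1 - |r_{xy}|$.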
• Spearman correlation
A non-parametric (rank-based) method, similar to Pearson correlation
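The angle and correlation measures can be sketched with the standard library alone. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values given their average rank; all function names and vectors are illustrative.

```python
from math import sqrt
from statistics import mean

def cosine(x, y):
    """Cosine of the angle between x and y; unaffected by rescaling either vector."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def correlation_distance(x, y):
    """1 - |r|: strongly anti-correlated genes also count as close."""
    return 1 - abs(pearson(x, y))

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

up = [1.0, 2.0, 3.0, 4.0]
squared = [1.0, 4.0, 9.0, 16.0]   # monotone but not linear in `up`
down = [4.0, 3.0, 2.0, 1.0]
print(cosine(up, [2 * v for v in up]))
print(pearson(up, squared), spearman(up, squared))
print(correlation_distance(up, down))
```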
Linkage
Determines the distance between clusters (nodes).
– Single linkage (nearest neighbor)
The distance between two nodes is the
distance between the two closest objects (nearest
neighbors) in the different nodes.
– Complete linkage (furthest neighbor)
The distance between two nodes is the
greatest distance between any two objects ("furthest
neighbors") in the different nodes.
– Average (centroid)
The centroid of a node is the average point in the
multidimensional space, i.e. the center of the node.
The distance between two nodes is the
distance between their centroids.
[Figure: comparison of 1. single linkage, 2. average linkage and 3. complete linkage.]
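The three linkage rules differ only in how the distance between two nodes is computed from their members. A minimal sketch, assuming Euclidean distance between objects and using the centroid definition given above for the average case; the node contents are made up. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist

def single_linkage(a, b):
    """Distance between the two closest objects of nodes a and b."""
    return min(dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the two furthest objects of nodes a and b."""
    return max(dist(p, q) for p in a for q in b)

def centroid_linkage(a, b):
    """Distance between the centroids (average points) of nodes a and b."""
    ca = [sum(col) / len(a) for col in zip(*a)]
    cb = [sum(col) / len(b) for col in zip(*b)]
    return dist(ca, cb)

node_a = [[0.0, 0.0], [0.0, 2.0]]
node_b = [[3.0, 0.0], [5.0, 2.0]]
print(single_linkage(node_a, node_b),
      centroid_linkage(node_a, node_b),
      complete_linkage(node_a, node_b))
```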
Self-Organizing Map
The Self-Organizing Map (SOM) was introduced by Teuvo
Kohonen in 1982.
It is an artificial neural network in which neurons forming a one-
or two-dimensional elastic lattice are trained with the
input data. The neurons compete to approximate the
density of the data. After training, the input
data vectors map onto n adjacent map neurons.
[Figure: the input layer connected to the lattice of neurons.]
Neurons compete for the input pattern; the winner takes all.
The winner and its neighbors move toward the input pattern.
Neighborhood: which neurons move with the winner.
Learning rate: how much the winner moves each time.
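The competition and update rule can be sketched for a one-dimensional lattice. This is a simplified illustration: the learning rate and neighborhood radius are kept constant, whereas practical SOM training usually shrinks both over time, and all names and numbers are made up. It needs Python 3.8 or later for `math.dist`.

```python
import random
from math import dist

def som_step(neurons, x, learning_rate=0.5, radius=1):
    """One training step: find the winner and move it and its lattice
    neighbors toward x. Updates `neurons` in place, returns the winner index."""
    winner = min(range(len(neurons)), key=lambda i: dist(neurons[i], x))
    lo = max(0, winner - radius)
    hi = min(len(neurons), winner + radius + 1)
    for i in range(lo, hi):
        neurons[i] = [w + learning_rate * (xi - w) for w, xi in zip(neurons[i], x)]
    return winner

def train_som(data, n_neurons, epochs=20, seed=0):
    """Train a 1-D lattice starting from random weights in [0, 1)."""
    rng = random.Random(seed)
    dim = len(data[0])
    neurons = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(epochs):
        for x in data:
            som_step(neurons, x)
    return neurons

# A single step on a hand-made lattice of four 1-D neurons.
lattice = [[0.0], [1.0], [2.0], [10.0]]
winner = som_step(lattice, [1.2], learning_rate=0.5, radius=1)
print(winner, lattice)
```

In the single step shown, the winner and its two lattice neighbors move halfway toward the input, while the distant neuron stays where it was.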
Other methods
• Principal component analysis (PCA)
– Reduces the dimensionality of the data matrix by finding
new variables. Intended to narrow the number of variables
down to only those that are important.
[Figure: data points A and B shown in the original axes x, y and in the new axes x', y'.]
• Machine learning: trained with a data set of known
classification, then used to predict or classify a new data set.
Biological data mining
Gene Ontology (GO): gene functions are classified into
hierarchical structures. The three top-level categories are molecular
function, biological process and cellular component.
• Tools using GO: Onto-Express, EASE, eGOn,
GoSurfer
Pathway: KEGG, GenMAPP
Regulatory region analysis:
• Tools for regulatory region analysis: Genomatix,
TRANSFAC
Gene network:
• Tools for gene network: Pathway Assist, iHOP
Microarray Standard
MIAME: Minimum Information About a Microarray Experiment.
Defining data standards
Information required to interpret and replicate an experiment:
• Experimental Design
• Array Design
• Biological Samples
• Hybridizations
• Measurements
• Data Normalization and Transformation
• MIAME checklist: http://www.mged.org/Workgroups/MIAME/miame_checklist.html
• Public databases
– ArrayExpress (EBI)
– GEO (NCBI)
– CIBEX (DDBJ)
• Other microarray databases: BASE, SMD, Oncomine,
YMD