Statistical Analysis of Gene Expression Data


Introduction to Statistical Analysis of Gene Expression Data
Feng Hong
Beespace meeting
April 20, 2005
The Central Dogma
DNA
Transcription
RNA
Translation
Protein
Source: http://www.accessexcellence.org/



A gene is a sequence of nucleotides that codes for a protein
All cells of an organism carry the same genetic information in their DNA, but only a subset of genes is expressed in any given cell
The presence of a gene's mRNA in a cell indicates that the gene is active
Microarray Technology
http://www.accessexcellence.org/RC/VL/GG/microArray.html
Microarray



Examine the activity of thousands of genes at once
Fluorescent-dye-labeled mRNA from different samples hybridizes to the DNA on the array
Fluorescence intensity indicates the expression level of the gene in the sample
Steps in a Microarray Experiment
Experimental Design
Signal Extraction
  Image Analysis
  Normalization: remove the artifacts across arrays
Data Analysis
  Selection of differentially expressed genes
  Clustering and classification
Experimental Design


In a two-color cDNA experiment, only two mRNA samples can be hybridized on one array
Factors influencing the choice of experimental design
  Number of different samples
  Aim of the experiment: which comparisons are of primary interest
  Resource constraints
  Power of the experiment
Experimental Design

Direct Comparison:
  Compare only two mRNA samples
  Dye-swap is recommended to minimize the ...
Reference Sample:
  Compare several samples with a reference
  Indirect comparison between the samples
Loop Design:
  Used in time course experiments
Saturated Design:
  More than two mRNA samples
  All comparisons are of interest
More complicated designs
Design used in Whitfield et al. (2003)
Source: Whitfield, Cziko, Robinson, 2003, Gene Expression Profiles in the brain predict behavior in individual honey bees, Science, supplementary materials
Gene expression measurements


Gene expression data are noisy
Sources of error
  Microarray manufacturing
  Preparation of mRNA from biological samples
  Hybridization
  Scanning
  Imaging
Image Analysis



Preprocess the raw scanned image
Gridding, edge detection, segmentation, summarization of pixel intensities
Output: foreground intensities (R, G), background intensities (Rb, Gb), and “flagged” spots
Statistical Analysis of the Data

Objective: identify as many genes as possible that are differentially expressed across conditions, while keeping the probability of falsely declaring differential expression acceptably low
Software for statistical microarray analysis

Generic statistical platforms
  SAS
  Splus
  R
  Matlab
Specific packages for microarray data analysis
  Maanova
  Bioconductor (www.bioconductor.org): limma, etc.
  Our own programs
Visualize data and check quality


Look at original image
Use an MA plot (log fold change vs. average log intensity)
  y-axis: M = log2(R) - log2(G)
  x-axis: A = (1/2) [log2(R) + log2(G)]
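A minimal sketch of how M and A could be computed from the foreground intensities. The function name `ma_values` and the four intensity pairs are made up for illustration, and A is taken here as the average log intensity, which is the usual convention for MA plots.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green foreground intensities."""
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_vals.append(log_r - log_g)          # M: log fold change
        a_vals.append(0.5 * (log_r + log_g))  # A: average log intensity
    return m_vals, a_vals

# Hypothetical intensities for four spots (illustrative only)
red = [1024.0, 512.0, 256.0, 2048.0]
green = [512.0, 512.0, 1024.0, 2048.0]
m, a = ma_values(red, green)
```

Plotting `m` against `a` gives the MA plot. Spots with M near 0 have similar intensity in both channels.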
Raw image
MA plot
Normalization




“to adjust microarray data for effects which arise from variation in the technology rather than from biological differences between RNA samples” (Smyth and Speed, 2003)
“an iterative process of visualization, identification of likely artifacts and removal of artifacts when feasible” (Parmigiani et al., 2003)
Two places
  Within-array normalization
  Across-array normalization
Method: check the MA plot, then transform the data: loess transformation, lin-log transformation, etc.
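The loess idea can be sketched in plain Python: fit a smooth trend of M as a function of A and subtract it, so that the intensity-dependent bias is removed. This is a simplified local linear fit with tricube weights, not the routine from any particular package. The function names, the `frac` parameter and the data are assumptions made for illustration.

```python
import math

def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at every x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))        # points used in each local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        # Bandwidth: distance to the k-th nearest point (counting x0 itself)
        h = sorted(dist)[k - 1] or 1e-12
        w = [max(0.0, 1.0 - (d / h) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalize_within_array(m_vals, a_vals, frac=0.5):
    """Subtract the fitted intensity-dependent trend from M."""
    trend = loess_fit(a_vals, m_vals, frac)
    return [m - t for m, t in zip(m_vals, trend)]

# Hypothetical array whose M values contain only an intensity-dependent dye bias
a_vals = [6.0 + 0.5 * i for i in range(20)]
m_vals = [0.8 - 0.05 * a for a in a_vals]
m_norm = normalize_within_array(m_vals, a_vals)
```

In this toy case the bias is exactly linear in A, so the normalized values come out essentially zero. On real data the trend is usually curved and the residual M values carry the biological signal.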
Examples of Normalization
ANOVA (Analysis of Variance) Model
Let $y_{ijkg}$ be the fluorescent intensity measured from Array $i$, Dye $j$, Variety $k$, and Gene $g$, on the appropriate scale (such as log). A typical analysis of variance (ANOVA) model is:
$$y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkg}$$
• $\mu$, $A$, $D$, $V$ are “normalization” terms
• $G$ are the overall gene effects
• $(AG)$ are “spot” effects
• $(DG)$ are gene-specific dye effects
• $(VG)$ are the effects of interest. They capture the expression of genes specifically attributable to varieties.
• $\epsilon$ is random error
Two stage ANOVA
Global ANOVA model
$$y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkg}$$
However, fitting the global model is computationally prohibitive. Instead, break the model into two stages.
Two-stage approach
• Fit the “normalization model”
$$y_{ijkg} = \mu + A_i + D_j + V_k + r_{ijkg}$$
• Fit the residuals on a per-gene basis (for each gene $g$)
$$r_{ijkg} = G_g + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkg}$$
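The two stages can be sketched as follows. This is an illustration under a strong assumption: the design is balanced (two dye-swap pairs), so each effect can be estimated from marginal means. An unbalanced design would need a proper least-squares fit. The function names, the design and the noise-free data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def centred_means(rows, factor, value):
    """Mean of `value` at each level of `factor`, minus the overall mean."""
    overall = mean(r[value] for r in rows)
    groups = defaultdict(list)
    for r in rows:
        groups[r[factor]].append(r[value])
    return {level: mean(vals) - overall for level, vals in groups.items()}

def two_stage_anova(rows):
    """rows: dicts with keys array, dye, variety, gene, y (log intensity)."""
    # Stage 1: normalization model  y = mu + A_i + D_j + V_k + r
    mu = mean(r["y"] for r in rows)
    A = centred_means(rows, "array", "y")
    D = centred_means(rows, "dye", "y")
    V = centred_means(rows, "variety", "y")
    resid = [dict(r, resid=r["y"] - mu - A[r["array"]] - D[r["dye"]] - V[r["variety"]])
             for r in rows]
    # Stage 2: per-gene model on the residuals. Only (VG) is returned here;
    # (AG) and (DG) could be estimated the same way.
    by_gene = defaultdict(list)
    for r in resid:
        by_gene[r["gene"]].append(r)
    return {g: centred_means(g_rows, "variety", "resid") for g, g_rows in by_gene.items()}

# Hypothetical balanced design: two dye-swap pairs, as (array, dye, variety)
design = [(1, "R", "V1"), (1, "G", "V2"), (2, "R", "V2"), (2, "G", "V1"),
          (3, "R", "V1"), (3, "G", "V2"), (4, "R", "V2"), (4, "G", "V1")]
# Made-up, noise-free truth: gene -> (gene level, V1 effect; V2 gets the opposite sign)
truth = {"g1": (0.5, 1.0), "g2": (-0.5, -1.0), "g3": (1.0, 0.0), "g4": (0.0, 0.0)}
rows = []
for gene, (level, shift) in truth.items():
    for array, dye, variety in design:
        y = 10.0 + 0.2 * array + (0.3 if dye == "R" else -0.3) + level
        y += shift if variety == "V1" else -shift
        rows.append({"array": array, "dye": dye, "variety": variety, "gene": gene, "y": y})

vg = two_stage_anova(rows)
```

Here `vg["g1"]` holds the estimated variety effects for gene g1. With real, noisy data each (VG) estimate would be compared against the gene's residual variation to get a test statistic.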

Report significant genes: Multiple Testing Adjustment

P-values
  P-value = the probability, if the gene is not differentially expressed, of observing a result at least as extreme as the one we observed. The smaller the p-value, the more significant the result.
  If we set the cutoff at 0.05, test 8000 genes, and assume that none of the genes is differentially expressed, we expect to declare 400 genes significant (8000 × 0.05).
  Adjusted p-values
Posterior probability
False Discovery Rate (FDR)
  FDR = E(# genes falsely declared diff. expr. / # genes declared diff. expr.)
Ranking the genes
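A short sketch of the arithmetic and of two common p-value adjustments. The slides do not say which adjustment was used, so Bonferroni and Benjamini-Hochberg appear here only as well-known examples, and the five raw p-values are made up.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values (controls the family-wise error rate)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # largest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position                       # ascending rank, 1-based
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Expected false positives when all 8000 genes are null and the cutoff is 0.05
expected_false_positives = 8000 * 0.05

# Hypothetical raw p-values for five genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
p_bonf = bonferroni(pvals)
p_bh = benjamini_hochberg(pvals)
significant = [i for i, p in enumerate(p_bh) if p <= 0.05]
```

With these made-up values, only the first two genes stay below 0.05 after the Benjamini-Hochberg adjustment, although four raw p-values were below the cutoff.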
Clustering



After selecting the list of differentially expressed genes, we want to investigate the relationships among these genes
Look at the “profile” of each gene's expression across the samples
Group the selected genes into clusters, so that genes with similar profiles fall in the same cluster
  K-means
  Hierarchical clustering
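As an illustration of the first method, here is a bare-bones K-means (Lloyd's algorithm) on expression profiles. The naive initialization from the first k profiles, the function names and the five profiles are assumptions made for this sketch. Practical use would rely on random restarts or an established library.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(profiles, k, n_iter=20):
    """Lloyd's algorithm; centroids start at the first k profiles (naive init)."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene joins its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid becomes the mean profile of its members
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical log-ratio profiles of five genes across four samples
profiles = [
    [2.0, 2.0, -2.0, -2.0],
    [-2.0, -2.0, 2.0, 2.0],
    [1.8, 2.1, -1.9, -2.2],
    [-2.1, -1.8, 2.2, 1.9],
    [2.2, 1.9, -2.1, -1.8],
]
labels, centroids = kmeans(profiles, k=2)
```

Genes that are up in the first two samples and down in the last two end up in one cluster, and genes with the opposite pattern in the other.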
Example of Clustering from Whitfield et al. (2003)
Principal Component Analysis





Reduce high-dimensional data to a small number of summary variables (principal components).
Use the correlation matrix
  1st component is the direction along which there is the greatest variation in the data
  2nd component is orthogonal to the 1st component and represents the greatest remaining variation after accounting for the 1st component
Can be used to visually identify clusters or assist classification (for example, Whitfield et al. 2003)
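The steps above can be sketched in plain Python: standardize the variables, form the correlation matrix, and extract the first two components by power iteration with deflation. Power iteration is used here only to keep the example dependency-free. The function names and the six-sample, three-gene data set are hypothetical.

```python
import math

def standardize(data):
    """Centre each column and scale it to unit sample standard deviation."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in data]

def correlation_matrix(z):
    """Correlation matrix of already standardized columns."""
    n, p = len(z), len(z[0])
    return [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def leading_eigenpair(mat, n_iter=1000):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(mat)
    v = [float(j + 1) for j in range(p)]   # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(mat[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[a] * mat[a][b] * v[b] for a in range(p) for b in range(p))
    return eigval, v

def first_two_components(data):
    z = standardize(data)
    corr = correlation_matrix(z)
    lam1, v1 = leading_eigenpair(corr)
    # Deflation: remove the 1st component, then repeat to get the 2nd
    p = len(corr)
    deflated = [[corr[a][b] - lam1 * v1[a] * v1[b] for b in range(p)]
                for a in range(p)]
    lam2, v2 = leading_eigenpair(deflated)
    # Scores: coordinates of each sample on the two components
    scores = [[sum(zi * vi for zi, vi in zip(row, v)) for v in (v1, v2)] for row in z]
    return (lam1, v1), (lam2, v2), scores

# Hypothetical data: six samples (rows) by three genes (columns)
data = [
    [1.0, 1.1, 0.5],
    [2.0, 1.9, -0.4],
    [3.0, 3.2, 0.3],
    [4.0, 3.9, -0.6],
    [5.0, 5.1, 0.6],
    [6.0, 6.0, -0.5],
]
(lam1, v1), (lam2, v2), scores = first_two_components(data)
```

Each eigenvalue divided by the number of variables gives the share of total variance carried by that component. Plotting the two columns of `scores` against each other is the usual way to look for clusters of samples.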
Example of PCA
Source: Whitfield, Cziko, Robinson, 2003, Gene Expression Profiles in the brain predict behavior in individual honey bees, Science