High-dimensional data analysis: Microarrays

High-dimensional data analysis:
Microarrays and multiple testing
Mark van de Wiel1,2
1. Dep. of Mathematics, VU University Amsterdam
2. Dep. of Biostatistics & Dep. of Pathology, VU University
medical center, Amsterdam
Genomics: a short history (1)
Some history
1. Watson & Crick: double helix structure of DNA (1953)
Source: http://ghr.nlm.nih.gov/handbook/illustrations/
Genomics: a short history (2)
2. Human Genome Project: identification of all 20,000-25,000 human
genes (1990-2003)
THE WHITE HOUSE, Office of the Press Secretary
For Immediate Release, June 25, 2000
PRESIDENT CLINTON ANNOUNCES THE COMPLETION OF THE FIRST SURVEY OF THE ENTIRE HUMAN GENOME
Hails Public and Private Efforts Leading to This Historic Achievement
June 26, 2000
Today, at a historic White House event with British Prime Minister Tony Blair, President Clinton announced that
the international Human Genome Project and Celera Genomics Corporation have both completed
an initial sequencing of the human genome -- the genetic blueprint for human beings.
Genomics: a short history (3)
3a. 1961 DNA hybridisation discovered
3b. 1994 Introduction of robotics (Hoheisel et al.)
3c. 1995 First microarray publication (Schena et al.)
3d. 1997 First whole genome microarray experiments (De Risi et al.)
3e. 1999 First publication on microarrays for cancer
classification (Golub et al.): Leukemia / Affymetrix arrays
Central dogma
1. DNA is the same in each cell (tumours are an exception)
2. Function of the cell is determined by proteins
3. The path from DNA to proteins goes via messenger RNA (mRNA)
4. DNA is transcribed to mRNA according to the needs of that cell
5. mRNA contains the instructions for what proteins to build
DNA → mRNA → protein
Microarrays measure the amount of mRNA
Microarrays (1)
Source: http://research.yale.edu/ysm/
Source: http://www.cottongenomics.org/
Microarrays (2)
1. Isolation of mRNA (usually converted to single-stranded cDNA; represents the expressed genes)
2. Labeling with a color (fluorescent dye) molecule
3. Chip contains probes which uniquely correspond to genes
4. Hybridization to the chip
5. Laser to read labeled molecules
6. Image analysis converts colors to numbers, intensities
7. Result: data matrix with 2 intensities for each array
Microarray Movie
The result
Probe ID      Gene           m1_g      m2_r      m3_g      m4_r      m5_g      m6_r      m7_g      m8_r
A_52_P616356  Ccr1           34.46396  38.87202  39.8253   37.60986  34.46396  39.74775  43.21416  41.64688
A_52_P580582  Nppa           68.61412  63.78335  64.54471  59.00334  68.61412  66.14105  67.13218  58.91294
A_52_P403405  Aqp7           54.3694   43.58079  48.42171  40.02895  54.3694   44.35261  50.96373  44.11335
A_52_P819156  AK046412       40.35896  40.19367  42.21101  39.46673  40.35896  40.97604  46.80699  45.51824
A_51_P331831  Hvcn1          1139.168  1239.731  1331.944  1201.655  1139.168  1491.437  1109.039  1516.419
A_51_P430630  Gpr33          35.93206  33.36196  35.95886  34.107    35.93206  34.09339  40.05874  39.5299
A_52_P502357  C230086J09Rik  34.30417  34.11315  41.22859  34.82119  34.30417  35.15055  45.26332  39.64404
A_52_P299964  Maml2          33.37359  34.92393  41.02386  36.41753  33.37359  36.563    45.34714  41.52166
A_51_P356389  A330106F07Rik  37.64724  38.64861  41.41     37.02367  37.64724  39.60467  45.60977  43.02076
A_52_P684402  Ptdss2         96.73227  114.3885  115.1037  93.56061  96.73227  107.1197  179.2954  120.0925
A_51_P414208  1110014K05Rik  39.31122  40.92528  41.12306  37.04293  39.31122  43.00788  45.41691  41.51804
A_51_P280918  Itfg1          42.2577   47.87027  58.40548  50.16758  42.2577   55.26483  63.48501  69.59664
A_52_P613688  Elmo1          51.93495  41.54138  47.34302  66.62057  51.93495  43.62045  53.96656  59.24562
A_52_P258194  Crtac1         42.62472  43.35425  42.28466  41.94087  42.62472  44.39193  46.12511  44.2117
A_52_P229271  Pnpt1          34.99725  38.55899  39.07093  37.96528  34.99725  40.0605   42.41069  42.81282
A_52_P214630  Sox9           35.1932   33.39075  37.21131  34.75895  35.1932   34.67025  41.13792  40.00255
A_52_P579519  Tmem144        35.21073  34.24402  40.30091  37.07755  35.21073  35.37331  43.38546  41.40743
A_52_P979997  AK039768       34.48014  39.14762  35.53075  35.00453  34.48014  39.19561  39.61501  39.7067
A_52_P453864  Syne1          35.06627  37.30331  38.93411  38.40689  35.06627  37.66834  43.29757  41.74719
• The number of rows (e.g. 44,000) is determined by the number of probes (> number of genes)
• More genes than samples: high-dimensional setting
Statistical issues before data analysis
1. Design of the experiment (not discussed)
2. Quality control (not discussed)
3. Normalization
Data visualized by MA plot
Use of different dyes (colours) may lead to a non-linear dye bias.
This bias needs to be removed since it is artificial.
M = log2(R/G) = log2(R) - log2(G)
A = log2(R*G) = log2(R) + log2(G)
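As a minimal sketch of the two definitions above, the following computes M and A from red and green intensities with NumPy. The function name `ma_values` and the three example intensities are hypothetical and not taken from the slides. Note that many references define A as the average log-intensity, (log2(R) + log2(G))/2; the sketch follows the definition given here.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A) with M = log2(R) - log2(G) and A = log2(R) + log2(G)."""
    log_r = np.log2(np.asarray(red, dtype=float))
    log_g = np.log2(np.asarray(green, dtype=float))
    return log_r - log_g, log_r + log_g

# Hypothetical intensities for three spots (illustration only).
R = np.array([8.0, 16.0, 4.0])
G = np.array([4.0, 16.0, 16.0])
M, A = ma_values(R, G)
print(M)  # log-ratios: positive means more red than green
print(A)  # summed log-intensities
```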
Normalization
Purpose: remove artificial dye effects to obtain unbiased M values.
Most popular method: Loess.
Assumption: mean M value equals 0 for all intensity ranges.
Algorithm
1. Sort A values: A’1, ..., A’p.
2. For A’i, window Wi = [A’i – L, A’i + L]
3. For each Wi linearly regress:
M = a + bA + ε
4. M’i(pred) = ai + bi A’i
5. Subtract M’i(pred) from M’i.
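The five steps above can be sketched as follows. This is an illustration of the moving-window regression exactly as listed, under the assumption that every point in the window gets equal weight; standard loess implementations additionally down-weight points far from the window centre. The function name `window_normalize`, the half-width value and the synthetic data are hypothetical.

```python
import numpy as np

def window_normalize(M, A, L):
    """Subtract a local linear fit of M on A from each M value."""
    M = np.asarray(M, dtype=float)
    A = np.asarray(A, dtype=float)
    M_norm = np.empty_like(M)
    for i in np.argsort(A):                      # step 1: visit spots in order of A
        in_window = np.abs(A - A[i]) <= L        # step 2: window [A_i - L, A_i + L]
        a_w, m_w = A[in_window], M[in_window]
        if a_w.max() - a_w.min() == 0:
            pred = m_w.mean()                    # no spread in A: fall back to the mean
        else:
            b, a = np.polyfit(a_w, m_w, 1)       # step 3: M = a + b*A (slope first)
            pred = a + b * A[i]                  # step 4: predicted M at A_i
        M_norm[i] = M[i] - pred                  # step 5: remove the predicted bias
    return M_norm

# Synthetic example: a purely intensity-dependent bias, no real signal.
A_demo = np.linspace(6.0, 14.0, 50)
M_demo = 0.5 + 0.1 * A_demo
M_corrected = window_normalize(M_demo, A_demo, L=1.0)
```

Because the synthetic bias is exactly linear in A, the corrected M values are zero up to rounding error, in line with the assumption that the mean M value equals 0 in every intensity range.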
Loess
[Figure: MA plots before and after loess normalization]
After normalization
Probe ID      Gene           p1        p2        p3        p4
A_52_P616356  Ccr1           -0.17364  0.082575  -0.20578  0.053296
A_52_P580582  Nppa           0.105326  0.129502  0.05296   0.18842
A_52_P403405  Aqp7           0.319102  0.27461   0.293776  0.208256
A_52_P819156  AK046412       0.005921  0.096982  -0.02189  0.040279
A_51_P331831  Hvcn1          -0.12205  0.14851   -0.38872  -0.45136
A_51_P430630  Gpr33          0.107068  0.076279  0.07578   0.019173
A_52_P502357  C230086J09Rik  0.008056  0.24368   -0.03516  0.191238
A_52_P299964  Maml2          -0.06551  0.17183   -0.13168  0.127147
A_51_P356389  A330106F07Rik  -0.03787  0.161531  -0.07313  0.08431
A_52_P684402  Ptdss2         -0.24187  0.298961  -0.14715  0.578193
A_51_P414208  1110014K05Rik  -0.05805  0.150749  -0.12966  0.129491
A_51_P280918  Itfg1          -0.17992  0.219348  -0.38715  -0.1326
A_52_P613688  Elmo1          0.322157  -0.49282  0.251701  -0.13464
A_52_P258194  Crtac1         -0.02448  0.011778  -0.05861  0.061124
A_52_P229271  Pnpt1          -0.13983  0.041415  -0.19494  -0.01361
A_52_P214630  Sox9           0.075848  0.098357  0.021599  0.040377
A_52_P579519  Tmem144        0.040163  0.120267  -0.00665  0.067322
A_52_P979997  AK039768       -0.18316  0.021527  -0.18493  -0.00333
A_52_P453864  Syne1          -0.08922  0.019669  -0.10327  0.052607
Log2-ratios are used for further analysis. Taking ratios cancels out the
experimental spot effect; taking logs gives a symmetric scale. Nowadays,
however, log-intensities (of both dyes) are used more and more often.
Data
Gene expression matrix X = (X_ij), i = 1, ..., p, j = 1, ..., n; p > n
Response vector y = (y_j), j = 1, ..., n, with y_j ∈ R
Type of response
• Nominal. E.g. tumor type. R = {Benign, Malignant}
• Ordinal. E.g. stage of a tumor. R = {1, 2, 3, 4}
• Continuous. E.g. disease severity score. R = ℝ+
• Censored. E.g. survival. R = ℝ+ × {0, 1}
Typical data analyses for microarrays (1)
Multivariate
• Unsupervised Clustering
• Principal component analysis
• Classification (statistical learning, discriminant analysis,
supervised clustering)
• Multivariate regression with penalty for overfitting (eg
Lasso / Ridge regression)
• Prognostic multivariate survival models
Typical data analyses for microarrays (2)
Univariate
• Inference (hypothesis testing). Expression of each gene is related to
clinical response using, for example,
– ANOVA
– Linear regression
– Cox regression (survival)
– Permutation (nonparametric) tests
Hybrid
• Inference for sets of genes that are functionally related
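To make the univariate approach concrete, here is a minimal sketch of one of the listed options, a permutation test applied to each gene separately for a two-group response. The function name `permutation_pvalues`, the choice of the difference in group means as test statistic, and the simulated data are all assumptions made for illustration; they are not prescribed by the slides.

```python
import numpy as np

def permutation_pvalues(X, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value per gene (row of X) for a binary label y."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)

    def mean_diff(labels):
        # Difference in group means, computed for all genes at once.
        return X[:, labels].mean(axis=1) - X[:, ~labels].mean(axis=1)

    observed = np.abs(mean_diff(y))
    exceed = np.zeros(X.shape[0])
    for _ in range(n_perm):
        # Shuffle the sample labels; the gene expression matrix stays fixed.
        exceed += np.abs(mean_diff(rng.permutation(y))) >= observed
    # Add-one correction so that no p-value is exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Simulated example: 5 genes, 10 samples, only gene 0 differs between groups.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 10))
y_demo = np.array([0] * 5 + [1] * 5)
X_demo[0, y_demo == 1] += 10.0
p_demo = permutation_pvalues(X_demo, y_demo, n_perm=500)
```

Each gene yields one raw p-value, so a real array produces tens of thousands of them, which is what makes multiple testing an issue.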
Two-step ANOVA (1)
y_{acdg} = μ + α_a + τ_c + δ_d + u_{acdg}    (1)
u_{acdg} = γ_g + (γα)_{ga} + (γδ)_{gd} + (γτ)_{gc} + ε_{acdg}    (2)
Indices a: array; c: condition; d: dye; g: gene
(1) is the normalization model; it contains no gene-specific terms, so the
residual u absorbs all gene-specific effects.
(2) is the differential expression model, fitted to the residuals u.
Two-step ANOVA (2)
y_{acdg} = μ + α_a + τ_c + δ_d + u_{acdg}    (1)
u_{acdg} = γ_g + (γα)_{ga} + (γδ)_{gd} + (γτ)_{gc} + ε_{acdg}    (2)
Use of the two-step ANOVA: first fit (1) on all data, then estimate
residuals u for each gene, then fit (2) for each gene separately.
Main advantage with respect to one-level model: computational.
One-level model would require fitting many parameters
simultaneously in one ANOVA.
Computation of raw p-values is the same as for usual ANOVA.
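The two-step procedure described above might be sketched as below, using ordinary least squares with dummy-coded factors. This is an illustration under stated assumptions, not the implementation used for the slides: the helper names (`dummies`, `design`, `two_step_anova`), the reference-level coding and the noise-free dye-swap layout are all hypothetical.

```python
import numpy as np

def dummies(codes):
    """Dummy-code integer labels, dropping the first level as the reference."""
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, 1:]).astype(float)

def design(array, cond, dye):
    """Intercept plus array, condition and dye effects, in that column order."""
    n = len(array)
    return np.hstack([np.ones((n, 1)), dummies(array), dummies(cond), dummies(dye)])

def two_step_anova(y, array, cond, dye, gene):
    # Step 1: fit the normalization model (1) on all data and keep residuals u.
    X1 = design(array, cond, dye)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    u = y - X1 @ beta
    # Step 2: fit model (2) to the residuals of each gene separately.
    fits = {}
    for g in np.unique(gene):
        rows = gene == g
        Xg = design(array[rows], cond[rows], dye[rows])
        coef, *_ = np.linalg.lstsq(Xg, u[rows], rcond=None)
        fits[int(g)] = coef
    return u, fits

# Hypothetical noise-free layout: 4 arrays, 2 dyes, 2 conditions, 3 genes.
a, d, g = (v.ravel() for v in np.meshgrid(np.arange(4), np.arange(2),
                                          np.arange(3), indexing="ij"))
c = (a + d) % 2                                  # dye swap on odd arrays
y = 1.0 + 0.1 * a + 0.5 * d + 0.3 * g + 2.0 * (g == 0) * c  # only gene 0 responds
u, fits = two_step_anova(y, a, c, d, g)
# Column 4 of each per-gene fit is the gene-specific condition effect.
cond_effect = {gene_id: coef[4] for gene_id, coef in fits.items()}
```

In this toy layout gene 0 gets a positive gene-specific condition effect and the other genes a negative one, because step 1 has already removed the condition effect averaged over all genes. Each per-gene fit involves only a handful of parameters, which reflects the computational advantage mentioned above.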
Multiple Testing, Motivation.
Histogram of 20,000 p-values generated under H0
Even when all 20,000 null hypotheses are true, we expect
20,000 × 0.05 = 1,000 p-values smaller than α = 0.05!
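The expected count above can be checked with a small simulation. The sketch assumes that p-values under H0 are uniformly distributed on [0, 1], which holds for continuous test statistics; the variable names and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha = 20_000, 0.05

# Under H0, p-values of a continuous test are uniform on [0, 1].
p_null = rng.uniform(size=m)

n_false_positives = int((p_null < alpha).sum())
expected = m * alpha
print(n_false_positives, expected)  # the observed count fluctuates around the expectation
```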
Multiple Testing. Illustration of Benjamini-Hochberg procedure
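The illustration itself did not survive in this transcript. As a stand-in, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k with p_(k) <= q*k/m, and reject the k smallest. The function name `benjamini_hochberg`, the level q = 0.05 and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of hypotheses rejected at false discovery rate q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                         # ranks 1..m of the p-values
    thresholds = q * np.arange(1, m + 1) / m      # q*k/m for k = 1..m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest rank meeting the bound
        rejected[order[:k + 1]] = True            # reject it and all smaller p-values
    return rejected

# Hypothetical p-values for ten genes.
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_example, q=0.05)
```

The step-up character matters: a p-value above its own threshold is still rejected when a larger p-value further down the sorted list meets its threshold.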
Multiple Testing
M
