Transcript Document

Statistics for Microarrays

Differential Expression and Tree-based Modeling
Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/

cDNA gene expression data: data on G genes for n mRNA samples (genes in rows, samples in columns), e.g.

          sample1  sample2  sample3  sample4  sample5  …
Gene 1      0.46     0.30     0.80     1.51     0.90   ...
Gene 2     -0.10     0.49     0.24     0.06     0.46   ...
Gene 3      0.15     0.74     0.04     0.10     0.20   ...
Gene 4     -0.45    -1.03    -0.79    -0.56    -0.32   ...
Gene 5     -0.06     1.06     1.35     1.09    -1.09   ...

Gene expression level of gene i in mRNA sample j = (normalized) log(red intensity / green intensity)

Identifying Differentially Expressed Genes

• Goal: identify genes associated with a covariate or response of interest
• Examples:
  – Qualitative covariates or factors: treatment, cell type, tumor class
  – Quantitative covariates: dose, time
  – Responses: survival, cholesterol level
  – Any combination of these!

Differentially Expressed Genes

• Simultaneously test m null hypotheses, one for each gene j:
  H_j: no association between the expression level of gene j and the covariate/response
• Combine expression data from the different slides and estimate the effects of interest
• Compute a test statistic T_j for each gene j
• Adjust for multiple hypothesis testing

Test statistics

• Qualitative covariates: e.g. two-sample t-statistic, Mann-Whitney statistic, F-statistic
• Quantitative covariates: e.g. standardized regression coefficient
• Survival response: e.g. score statistic for the Cox model

QQ-Plot

• Used to assess whether a sample follows a particular (e.g. normal) distribution, or to compare two samples
• A method for looking for outliers when the data are mostly normal
• Each sample quantile is plotted against the corresponding theoretical quantile; e.g. the observation at sample quantile 0.125 is plotted against the value from the normal distribution which yields a quantile of 0.125 (= -1.15)
• Recall that for the normal distribution, approximately 68% of values lie within 1 SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs
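Not in the original slides: a minimal R sketch of how such a QQ-plot is drawn, using simulated data (mostly normal, with a few outliers):

```r
## Simulated example: mostly normal data plus a few outliers
set.seed(1)
x <- c(rnorm(200), rnorm(5, mean = 6))

qqnorm(x)  # sample quantiles vs theoretical normal quantiles
qqline(x)  # reference line; outliers show up "off the line"
```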

Typical Deviations from Straight Line Patterns

• Outliers
• Curvature at both ends (long or short tails)
• Convex/concave curvature (asymmetry)
• Horizontal segments, plateaus, gaps

[Figures: example QQ-plots illustrating outliers, long tails, short tails, asymmetry, and plateaus/gaps]

Example: Apo AI experiment

(Callow et al., Genome Research, 2000)

Goal: identify genes with altered expression in the livers of one line of mice with very low HDL cholesterol levels, compared to inbred control mice.

Experiment:
• Apo AI knock-out mouse model
• 8 knockout (ko) mice and 8 control (ctl) mice (C57Bl/6)
• 16 hybridisations: mRNA from each of the 16 mice is labelled with Cy5; pooled mRNA from the control mice is labelled with Cy3

Probes: ~6,000 cDNAs, including 200 related to lipid metabolism

Which genes have changed?

This method can be used with replicated data:
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G)
2. For each gene, form the t-statistic:
   t = (average of the 8 ko Ms - average of the 8 ctl Ms) / sqrt( (1/8)(SD of the 8 ko Ms)² + (1/8)(SD of the 8 ctl Ms)² )
3. Form a histogram of the 6,000 t-values
4. Make a normal QQ-plot; look for values "off the line"
5. Adjust for multiple testing
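A minimal R sketch of steps 1-4, assuming a hypothetical 6,000 × 16 matrix M of log2(R/G) values with the 8 ko arrays in columns 1-8 and the 8 ctl arrays in columns 9-16:

```r
## M: genes x arrays matrix of M = log2(R/G) values (hypothetical layout:
## columns 1-8 = knockout arrays, columns 9-16 = control arrays)
ko  <- 1:8
ctl <- 9:16

## Step 2: two-sample t-statistic for each gene
tstat <- apply(M, 1, function(m)
  (mean(m[ko]) - mean(m[ctl])) / sqrt(var(m[ko]) / 8 + var(m[ctl]) / 8))

hist(tstat, breaks = 50)       # step 3: histogram of the ~6,000 t-values
qqnorm(tstat); qqline(tstat)   # step 4: look for values "off the line"
```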

[Figures: histogram and normal QQ-plot of the Apo AI t-statistics]

Assigning p-values to measures of change

• Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics.
• For each of the 12,870 (= 16 choose 8) permutations of the trt/ctl labels, compute the two-sample t-statistic t* for each gene.
• The unadjusted p-value for a particular gene is estimated by the proportion of t*'s greater than the observed t in absolute value.
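A sketch of the permutation p-value for a single gene, reusing the hypothetical M, ko, ctl layout above; all 12,870 balanced relabellings are enumerated with combn:

```r
two.sample.t <- function(m, a, b)
  (mean(m[a]) - mean(m[b])) / sqrt(var(m[a]) / length(a) + var(m[b]) / length(b))

perm.p <- function(m) {
  t.obs  <- two.sample.t(m, 1:8, 9:16)
  sets   <- combn(16, 8)  # all 12,870 ways to choose the 'ko' labels
  t.star <- apply(sets, 2, function(s) two.sample.t(m, s, setdiff(1:16, s)))
  mean(abs(t.star) >= abs(t.obs))  # proportion of |t*| >= observed |t|
}

p.unadj <- apply(M, 1, perm.p)  # unadjusted permutation p-value per gene
```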

Multiple Testing

            # not rejected   # rejected    totals
# true H         U           V (false +)    m_0
# false H        T (false -) S              m_1
totals           m - R       R              m

• Per-comparison error rate = E(V)/m
• Family-wise error rate = P(V ≥ 1)
• Per-family error rate = E(V)
• False discovery rate = E(V/R)
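The slides show no code, but as an illustration, standard adjustments (not necessarily the one used on these slides) can be computed with R's p.adjust, given the unadjusted p-values from the permutation sketch above:

```r
p.bonf <- p.adjust(p.unadj, method = "bonferroni")  # controls family-wise error rate
p.fdr  <- p.adjust(p.unadj, method = "BH")          # controls false discovery rate
sum(p.fdr <= 0.01)  # number of genes called significant at FDR 0.01
```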

Apo AI: adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics

Genes with adjusted p-value ≤ 0.01

Gene                   Adjusted p-value      t     Num    Den
Apo AI                        0           -22.9   -3.2   0.14
Sterol C5 desaturase          0           -13.1   -1.1   0.08
Apo AI                        0           -12.2   -1.9   0.16
Apo CIII                      0           -11.9   -1.0   0.09
Apo AI                        0           -11.4   -3.1   0.2
EST                           0            -9.1   -1.0   0.11
Apo CIII                      0            -8.4   -1.0   0.12
Sterol C5 desaturase          0            -7.7   -1.0   0.13

(Num and Den are the numerator and denominator of the t-statistic.)

Single-slide methods

• Model-dependent rules for deciding whether (R, G) corresponds to a differentially expressed gene
• Amounts to drawing two curves in the (R, G)-plane and calling a gene differentially expressed if it falls outside the region between the two curves
• At this time, not enough is known about the systematic and random variation within a microarray experiment to justify these strong modeling assumptions
• n = 1 slide may not be enough (!)

Single-slide methods

• Chen et al.: each (R, G) is assumed to be normally and independently distributed with constant CV; decision based on R/G only (purple)
• Newton et al.: Gamma-Gamma-Bernoulli hierarchical model for each (R, G) (yellow)
• Roberts et al.: each (R, G) is assumed to be normally and independently distributed with variance depending linearly on the mean
• Sapir & Churchill: each log R/G is assumed to follow a mixture of normal and uniform distributions; decision based on R/G only (turquoise)

[Figure: the difficulty of assigning valid p-values from a single slide, illustrated on Matt Callow's Srb1 dataset (#8) using Newton's, Sapir & Churchill's, and Chen's single-slide methods]

Another example: Survival analysis with expression data

• Bittner et al. looked at differences in survival between two groups (the 'cluster' and the 'unclustered' samples)
• The 'cluster' group seemed to have longer survival

[Figure: Kaplan-Meier survival curves for the 'cluster' and 'unclustered' groups, Bittner et al.]

[Figure: average linkage hierarchical clustering, survival samples only]

[Figure: Kaplan-Meier survival curves, reduced grouping]

Identification of genes associated with survival

For each gene j, j = 1, …, 3613, model the instantaneous failure rate, or hazard function, h(t) with the Cox proportional hazards model:

h(t) = h_0(t) exp(β_j x_ij)

and look for genes with both:
• a large effect size β_j
• a large standardized effect size β_j / SE(β_j)
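A minimal sketch with R's survival package, assuming hypothetical objects: an expression matrix X (genes in rows) and vectors time and status:

```r
library(survival)

## For each gene j, fit h(t) = h0(t) exp(beta_j * x_ij) and record the
## effect size and standardized effect size (X, time, status are hypothetical)
cox.stats <- apply(X, 1, function(x.j) {
  fit  <- coxph(Surv(time, status) ~ x.j)
  beta <- unname(coef(fit))
  c(beta = beta, z = beta / sqrt(vcov(fit)[1, 1]))
})
```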

Findings

• The top 5 genes found by this method are not in the Bittner et al. 'weighted gene list'. Why?
  – The weighted gene list is based on the entire sample; our method used only half
  – The weighting relies on the Bittner et al. cluster assignment
  – Other possibilities?

Statistical Significance of Cox Model Coefficients

Limitations of Single Gene Tests

• May be too noisy in general to show much
• Do not reveal coordinated effects of positively correlated genes
• Hard to relate to pathways

Some ideas for further work

• Expand models to include more genes and possibly two-way interactions
• Nonparametric tree-based subset selection – would require much larger sample sizes

(BREAK)

Trees

• Provide a means to express knowledge
• Can aid in decision making
• Can be portrayed graphically, or by means of a chart or 'key', e.g. the MASS space shuttle autolander key:

stability  error  sign  wind  magnitude      visibility  DECISION
any        any    any   any   any            no          auto
xstab      any    any   any   any            yes         noauto
stab       LX     any   any   any            yes         noauto
stab       XL     any   any   any            yes         noauto
stab       MM     nn    tail  any            yes         noauto
any        any    any   any   Out of range   yes         noauto
(etc.)
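Such a key can also be grown automatically from data; a minimal sketch using the shuttle data frame shipped with the MASS package and the rpart package:

```r
library(MASS)   # provides the shuttle data frame (256 cases)
library(rpart)

## Classification tree for the autolander decision (use = auto / noauto)
fit <- rpart(use ~ stability + error + sign + wind + magn + vis,
             data = shuttle, method = "class")
print(fit)  # text rendering of the fitted key
```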

Tree-based Methods – References

• Breiman, Friedman, Olshen & Stone (1984), Classification and Regression Trees
• Ripley (1996), Pattern Recognition and Neural Networks
• Venables & Ripley (1999), Modern Applied Statistics with S-PLUS (MASS)
• Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning

Tree-based Methods

• Automatic construction of decision trees dates from social science work in the early 1960s (AID)
• Breiman et al. (1984) proposed new algorithms for tree construction (CART)
• Tree construction can be seen as a type of variable selection

Response types

• Categorical outcome – classification tree
• Continuous outcome – regression tree
• Survival outcome – survival tree
• Software – available R packages include tree and rpart (tssa is available in S)

Trees Partition the Feature Space

• The end point of a tree is a (labelled) partition of the (feature) space X of possible observations
• Tree-based methods partition X into rectangular regions and try to make the (average) responses in each box as different as possible
• In logical problems it is assumed that there exists a partition of X that will correctly classify all observations; the task is to find a tree that succinctly describes this partition

Partitions and CART

[Figure: a recursive binary partition of the (X1, X2) plane into rectangular regions R1-R5, together with the corresponding tree: first split on X1 ≤ t1; the left branch splits on X2 ≤ t2 into R1 and R2; the right branch splits on X1 ≤ t3, giving R3 and then a final split on X2 ≤ t4 into R4 and R5]

Tree Comparison

• Measure how well the partition created by a tree corresponds to the correct decision rule (classification)
• For a logical problem, count the number of errors
• For a statistical problem, the class distributions usually overlap, so that no partition unambiguously describes the classes – instead, estimate the misclassification probability

Three Aspects of Tree Construction

• Split selection rule
• Split-stopping rule
• Assignment of predicted values

Split Selection

• Binary splits
• Look only one step ahead – avoids massive computation time by not attempting to optimize whole-tree performance
• Choose an impurity measure to optimize at each split – the Gini index or entropy, rather than the misclassification rate, for a classification tree; deviance (squared error) for a regression tree (see the sketch below)
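For concreteness, a small illustrative sketch (not rpart's internal code) of the two impurity measures for a vector of class labels:

```r
## Gini index: 1 - sum of squared class proportions
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

## Entropy: -sum p * log2(p), taking 0 * log(0) = 0
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

gini(c("a", "a", "b", "b"))     # 0.5: maximally impure two-class node
entropy(c("a", "a", "a", "a"))  # 0: pure node
```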

Split-stopping

• Issue: a very large tree will tend to overfit the data (and therefore lack generalizability), while too small a tree might not capture important structure
• Usual solution: grow a large/maximal tree (stop splitting only when some minimum node size, say 5 or 10, is reached), followed by (cost-complexity) pruning

Pruning

• Consider a sequence of rooted subtrees
• Measure R_i (e.g. deviance) at each leaf, with R = Σ R_i
• Minimize the cost-complexity measure R_α = R + α × size
• α governs the tradeoff between goodness of fit and tree size
• Choose α to minimize the cross-validated error (misclassification or deviance); see the sketch below
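With rpart, this corresponds to choosing the complexity parameter cp from the built-in cross-validation table; a sketch continuing the shuttle example:

```r
library(MASS); library(rpart)

## Grow a deliberately large tree, then prune on cost-complexity
big <- rpart(use ~ ., data = shuttle, method = "class",
             control = rpart.control(cp = 0, minsplit = 5))
printcp(big)  # cp table, including the cross-validated error (xerror)

## Prune back to the subtree minimizing cross-validated error
best.cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best.cp)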

Assignment of Predicted Values

• Assign a value to each leaf (terminal node)
• Classification: (weighted) voting among the observations in the node
• Regression: mean of the observations in the node
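In rpart, the leaf values are read off with predict; continuing the pruned shuttle tree above:

```r
predict(pruned, newdata = shuttle, type = "class")  # majority-vote class per case
predict(pruned, newdata = shuttle, type = "prob")   # class proportions in the leaf
```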

Other Issues (I)

• Loss matrix – procedures can be modified for asymmetric losses
• Missing predictor values
  – can create a 'missing' category
  – surrogate splits exploit correlation between predictors
• Linear combination splits

Other Issues (II)

• Tree instability
  – small changes in the data can result in a very different series of splits, causing difficulties in interpretation
  – aggregate trees to reduce instability (e.g. bagging; see the sketch below)
• Lack of smoothness
  – more of an issue in regression trees
  – addressed by Multivariate Adaptive Regression Splines (MARS)
• Difficulty in capturing additive structure with binary trees
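A bare-bones bagging sketch over the shuttle tree (illustrative only; B = 25 bootstrap resamples, majority vote across trees):

```r
library(MASS); library(rpart)

B <- 25
votes <- replicate(B, {
  idx <- sample(nrow(shuttle), replace = TRUE)   # bootstrap resample
  fit <- rpart(use ~ ., data = shuttle[idx, ], method = "class")
  as.character(predict(fit, newdata = shuttle, type = "class"))
})

## Aggregate: majority vote over the B trees for each case
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
```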

Acknowledgements

• Sandrine Dudoit • Jane Fridlyand • Yee Hwa (Jean) Yang • Debashis Ghosh • Erin Conlon • Ingrid Lonnstedt • Terry Speed