Automated Causal Inference


Report on IHMC-CMU-Pitt
Research
Full Report
NRA A2-37143
“Automated Discovery Procedures for Gene Expression
and Regulation from Microarray and Serial Analysis of
Gene Expression Data”
NCC 2-1295
“Multi-Domain Network Learning Algorithms of Latent
Variable Interpretation and Discovering Genetic Regulation”
April 2001 – April 2002
http://www.phil.cmu.edu/projects/genegroup
1
Research Team









William Buckles (Ph.D, Professor, Tulane)
Tianjiao Chu (Ph.D Student, Logic,
Methodology and Computation, CMU)
Greg Cooper (M.D., Ph.D, Associate Professor,
School of Medicine, Pitt)
David Danks (Ph.D, Research Scientist,
IHMC)
Clark Glymour (Ph.D, P.I., Senior Research
Scientist and John Pace Scholar, IHMC;
Alumni University Professor, CMU)
Dan Handley (M.S. Student, Logic,
Methodology and Computation, CMU)
Subramani Mani (Ph.D Student, Biomedical
Informatics, Pitt)
Rob O’Doherty (Ph.D, Assistant Professor,
School of Medicine, Pitt)
Dave Peters (Ph.D, Human Genetics, Pitt)










Joseph Ramsey (Ph.D, Research Programmer,
CMU)
Jaime Robins (M.D., School of Public Health,
Harvard)
Raul Saavedra (Ph.D Student, Computer
Science, Tulane)
Richard Scheines (Ph.D, Associate Professor,
CMU)
Nicoleta Servan (Ph.D Student, Statistics,
CMU)
Ricardo Silva (Ph.D student, Computer
Science, CMU)
Peter Spirtes (Ph.D, Research Scientist,
IHMC; Professor, CMU)
Larry Wasserman (Ph.D, Professor, CMU)
Frank Wimberly (Ph.D, Research
Programmer, IHMC)
Changwon Yoo (Ph.D Student, Biomedical
Informatics, Pitt)
2
Two Related Goals


Investigating the prospects for more rapid and
accurate determination of genetic regulatory
networks using recently developed technologies
(microarrays and SAGE)
Investigating the prospects for determining the
underlying components of measured phenomena,
and the influences such components have on one
another
3
Background on Genetics




Proteins do most of the work in the cell
Cell reproduction, metabolism, and responses to
the environment are all controlled by proteins
Each gene is a machine for constructing
(approximately) a single protein
The rate at which a gene constructs proteins is
influenced by concentrations of regulator proteins
4
Gene Regulatory Networks



Some genes manufacture proteins which control
the rate at which other genes manufacture proteins
(either promoting or suppressing)
Hence some genes indirectly (via the proteins they
create) regulate other genes, which in turn regulate
the operation of the cell
The system by which genes regulate each other is
called the genetic regulatory network, and can be
represented by a directed graph (which is a special
case of a Bayes network)
5
Measuring Gene Expression Levels


A gene’s “expression level” is an approximate measure of
the concentration of mRNA transcripts and a more
indirect measure of the rate of synthesis of corresponding
proteins.
Recently developed technologies--microarrays and Serial
Analysis of Gene Expression, or SAGE--allow thousands
of gene expression levels to be measured simultaneously
 The
kinds of measurement errors that these technologies
introduce are not well understood
 The best way to use these tools to discover gene regulatory
networks is not known
6
Relevance to NASA

Gene expression in microgravity has been shown
to differ significantly from expression in Earth
gravity
 Understanding gene regulation in plants, animals and
humans is likely to be important for long term
extraterrestrial habitation
 Determining regulatory structure is at present laborious,
slow and costly
 Need for systematic study of the reliability and
accuracy of scores of proposals for applying
statistical/machine learning procedures to speed up the
process
7
Background on Latent Structure
Analysis



Measurements are often of effects of other
scientifically interesting variables not directly
measured.
Number and identity of underlying causal or
compositional variables may not be entirely
known.
Measured effects can influence other measured
effects (e.g., through between channel signal
leakage in multi-channel
8
Background on Latent Structure
Analysis

With no prior cluster information and with the possibility of
measured-measured and latent-latent influences, none of the
standard data analysis procedures (e.g., factor analysis,
principal components, independent components) give
reliable (i.e., asymptotically correct) information about all
of:
 Number of latent variables
 Clustering of measured variables
 Causal or compositional relations among latent variables
9
Relevance to NASA

NASA collects vast quantities of observational
data on the Earth, the solar system and the cosmos,
much of it spectral
 Need for automated, fast, reliable procedures extracting
relevant causal information from diverse datasets —
procedures that integrate expert knowledge
 Inadequacy of current methods (model specific,
clustering algorithms) for this task
 Principled procedures using Bayes network methods
offer promising alternatives
 They have succeeded in other spectral applications
 (J. Ramsey, et al., “Automated Identification of Carbonate Composition from
Reflectance Spectra,” Data Mining and Knowledge Discovery, in press.)
10
Structure of the Projects

Statistical Foundations



Search Algorithms



Different kinds of inputs
Different assumptions about background knowledge
Experiments



Multiple testing problem
Measurement error models
Microarray
SAGE
Testing


Application to known genetic regulatory networks
Application to simulated data
11
First Year Results: Algorithms



Many algorithms for inferring causal networks that have been applied
to inferring gene regulatory networks assume the input is associations
between measured features of individuals
But microarrays and SAGE measure average gene expression levels
over many cells rather than for a single cell
What is the feasibility of inferring regulatory networks from
associations between averages?
 Feasibility for linear and local-linear regulatory functions
 Impossibility for the mathematical form of the regulatory function
of sea urchin Endo 16 gene, one of the best established.
 T. Chu, C. Glymour, R. Scheines and P. Spirtes, “A Statistical
Problem for Inference to Regulatory Structure from
Associations of Gene Expression Measurements with
Microarrays,” Bioinformatics, submitted.
12
First Year Results: Statistics



Current methods for determining from SAGE
measurements which genes are changing in response to
experimental manipulations are incorrect
Correct method requires estimating additional experimental
parameters, and leads to the conclusion that many fewer
genes are changing than had been previously thought
 T. Chu, “Computation of Variance in SAGE
Measurements of Gene Expression” Technical Report,
Logic, Methodology and Computation, 2002.
Future plan – apply the new method to SAGE
measurements of the response of genes to shear stress (data
already gathered)
13
First Year Results: Statistics



Standard techniques for testing whether a gene expression
level has changed due to an experimental manipulation
were not designed to be applied to test thousands of genes
simultaneously
Recent developments (False Discovery Rate tests) do allow
simultaneous testing of thousands of genes
Further improvements of the False Discovery Rate
procedure have been made
 C. Genovese, and L. Wasserman, “Bayesian and
Frequentist Multiple Testing”, CMU Department of
Statistics Technical Report 764, April, 2002.
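For illustration only, the baseline False Discovery Rate procedure can be sketched as below. This is the standard Benjamini-Hochberg step-up rule, not the improved procedure of the cited technical report. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (step-up rule)."""
    m = len(p_values)
    # Sort p-values while remembering which gene each one belongs to.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values, one per gene.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, q=0.05)
```

Only the first two genes pass here, although five p-values fall below 0.05 individually. That gap is the multiple testing problem the slide refers to.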
14
First Year Results: Algorithms



Implementation and testing (on simulated data) of a correct (under
explicit assumptions) algorithm for causal clustering and for
determining latent structure
 R. Silva, CMU Master’s Thesis, Center for Automated Learning
and Discovery
Extension to time series of learning algorithms for dynamical Bayes
Nets
 D. Danks, “Constraint-Based Learning Algorithm for Dynamical
Bayes Nets,” Conference on Uncertainty in Artificial Intelligence,
submitted.
Development and proof of correctness for an improved algorithm for
inferring Bayes networks across distinct data sets with overlapping
variable sets
 D. Danks, “Efficient Learning of Bayes Nets from Databases with
Overlapping Variables,” IHMC Technical Report, 2002.
15
First Year Results: Algorithms



Development and testing of algorithms for maximizing
information obtained from “knockout” experiments
 R. Silva, C. Glymour, D. Danks, “Inferring Genetic
Regulatory Structure from First and Second Moments,”
Technical Report, Logic, Methodology and Computation,
2002.
Development, implementation and testing of a genetic algorithm
for linear Bayes networks (structural equation models)
 S. Harwood and R. Scheines, “Learning Linear Causal
Structure Equation Models with Genetic Algorithms” (2001)
Tech Report CMU-PHIL-128, submitted to Conference on
Knowledge Discovery and Data Mining.
 S. Harwood and R. Scheines, “Genetic Algorithm Search over
Causal Models” (2001) Tech Report CMU-PHIL-131,
submitted to Conference on Uncertainty in Artificial
Intelligence.
Development of an algorithm for regulatory structure from mixed
observational and knockout data
16
First Year Results: Testing



Very few genetic regulatory networks are known, and
even fewer details about the functional relationships
among the genes are known
How can the accuracy of a causal discovery algorithm
be tested?
Generate simulated data from made up gene regulatory
networks, so that the generating mechanism is known
17
First Year Results: Testing


Implementation of a flexible program for generating
simulated microarray data that allows the user to
conveniently specify many different
 Functional relationships between cells
 Measurement errors
 Averaging over different numbers of cells
 Gene regulatory network structures (including
varying time lags)
 J. Ramsey and R. Scheines, (2001) “Simulating
Genetic Regulatory Networks,” Technical Report
CMU-PHIL-124.
Implementation of half a dozen algorithms proposed in
the literature for inferring regulatory structure from
expression associations in microarray measurements
(more to be implemented)
18
First Year Results: Experiments
Fat cells from mice are treated with
troglitazone, which increases the efficiency
of the biological actions of insulin in
diabetes and obesity
 Which genes are activated?
 Microarray chips used to make 47
measurements of gene expression level at
35 time points for 5355 genes

19
First Year Results: Experiments


Normalize data to
remove chip-to-chip
effects
Perform statistical
tests to determine
which genes are
changing, adjusting for
multiple tests
Comparing 20 genes that change
most with 20 that change least
20
Current Work: Experiments




Remove outlying genes
Improve the test performed for whether a gene is
changing over time
Introduce clustering methods for data
Use slower but more accurate measurement
techniques (Northern Blots) to
 Test the hypotheses about which genes change
according to the microarray analysis
 Learn about errors in measurement when using
microarrays
21
Gene Research Plans: May 2002 – May 2003
Study statistical properties of multiple decisions and of conditional independence among
averaged variables
Develop new algorithms for optimal information extraction and implement algorithms
proposed in the literature
Implement Simulator
Test algorithms on real and simulated data
Make Predictions
Laboratory SAGE and microarray study
of expression under varying surface
flows and drug treatments
Where we are
Analyze data
Where we will be
Knockout Experiments
Overall Evaluation
22
Latent Structure Research Plans, 2002–2003
Improve efficiency
 Test on large simulated data sets
 Prove asymptotic correctness
 Investigate non-linear generalizations

23
Supplementary Material – Outline






Discovering the Structure of Genetic Regulatory
Networks
Testing Algorithms – Simulator
Analysis of Gene Expression Levels Averaged
Over Many Cells
Analysis of SAGE Data
Latent Structure---Causal Clustering
Experiments
 Experiment 1 – Microarray analysis
 Experiment 2 – SAGE analysis
24
Discovering the Structure of
Genetic Regulatory Networks
25
Simplified Gene Regulatory Network
Environment
G1
G2
G3
G4
mRNA1
mRNA2
mRNA3
mRNA4
protein1
protein2
protein3
protein4
G5
mRNA5
protein5
G6
mRNA6
protein6
26
Still More Simplified
Environment
G1
G2
G5
G3
G4
G6
27
Two Strategies for Discovering
Gene Regulatory Networks
(Difference) Enhance or suppress specific genes
and measure the changes in expression levels of
other genes. Infer effects of manipulated gene
from differences in expression levels of other
genes versus unmanipulated controls
 (Association). Use wild-type cells or cells with
specific enhanced or suppressed levels of other
genes. Infer effects from associations of
expression levels of all genes

28
Measurement Techniques




Microarray techniques allow measurements of
relative mRNA concentrations from multiple
tissue sources
mRNA concentrations for thousands of genes can
be measured simultaneously
Measurements can be taken in time sequence,
every few minutes
Serial Analysis of Gene Expression (SAGE)
allows estimation of concentrations of mRNA
transcripts for essentially the entire genome—does
not require prior knowledge of all genes
29
Difference Method


Several examples of partial identification of part of the
regulatory network for several species
Limitations:


Laborious and expensive
Each experiment can only tell us which genes are regulated by a
manipulated gene, nothing about the pathway of regulation
 E.g., if gene A is suppressed and genes B and C change in
consequence, the experiment does not distinguish among:
A → B → C
A → C → B
C ← A → B
30
Difference Method -
Fundamental Problems
How to make optimal multiple statistical
decisions about expression differences
 How to efficiently extract all information from
an experiment
 How to dynamically schedule experiments for
maximal information

31
Association Method

An example or two of recovery of
regulatory structure previously established
by Difference methods. No novel
discoveries so far.
 Requires larger number of experimental
repetitions
 Depends on statistical methods for implicitly or
explicitly estimating conditional probability
relations among cellular expression levels
32
Testing Algorithms - Simulator
33
Simulator

User specifies
 Functional relationships between cells
 Measurement errors
 Averaging over different numbers of cells
 Gene regulatory network structures (including
varying time lags)
 Type of experiment

This provides a known structure to test
algorithms on, under a variety of assumptions
about how genes are related
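A toy version of such a simulator, to make the idea concrete. It is not the Tetrad implementation. The three-gene chain, the lag-1 linear regulation, the Gaussian noise, and every parameter name are assumptions chosen for illustration.

```python
import random


def simulate_dish(n_cells, n_steps, a=0.8, b=0.8, noise_sd=1.0,
                  meas_sd=0.1, seed=0):
    """Simulate a chain g1 -> g2 -> g3 with lag-1 linear regulation in each
    cell, and return per-time-step dish averages with measurement error."""
    rng = random.Random(seed)
    # Each cell holds the current levels of (g1, g2, g3).
    cells = [[0.0, 0.0, 0.0] for _ in range(n_cells)]
    readings = []
    for _ in range(n_steps):
        for cell in cells:
            g1, g2, _ = cell                              # previous levels
            cell[0] = rng.gauss(0.0, noise_sd)            # exogenous input
            cell[1] = a * g1 + rng.gauss(0.0, noise_sd)   # g1 regulates g2
            cell[2] = b * g2 + rng.gauss(0.0, noise_sd)   # g2 regulates g3
        # The "chip" sees only the average over cells, plus measurement error.
        readings.append(tuple(
            sum(c[k] for c in cells) / n_cells + rng.gauss(0.0, meas_sd)
            for k in range(3)))
    return readings


# Aggregated measurements for dishes of 10 and 100 cells.
small_dish = simulate_dish(n_cells=10, n_steps=50)
large_dish = simulate_dish(n_cells=100, n_steps=50)
```

Because the generating structure is known, the output of any discovery algorithm run on `small_dish` or `large_dish` can be compared with it directly.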
34
Simulating MicroArray Data
Tetrad 4 (www.phil.cmu.edu/projects/tetrad)
Network structure
Functional form
Parameters
35
Specifying the Network Structure
36
Specifying the Parameters
37
Data Output
Cell by Cell: Raw data
Aggregated Measurements
38
Simulating MicroArray Data
Simulated correlation between genes 1 and 3, using
different sizes averaged over (10, 100, and 1,000
cells/dish) over 450 time steps
[Figure: rho(g1, g3), on an axis from 0 to 0.9, plotted against time steps (0 to 500) for the series agg10, agg100 and agg1000]
39
Analysis of Gene Expression Levels
Averaged Over Many Cells
40
Averaging and Association



Goal is to discover the structure of a regulatory
network from associations among expression
levels of each pair of genes, and their
associations conditional on values of other genes
But we measure only concentrations—
averages—formed from the mRNA of many cells
For many systems, conditional associations are
altered by averaging
41
The Endo 16 Regulatory Function

Regulation of the Endo16 gene of the sea urchin (from C. Yuh, H. Bolouri, E.
Davidson, “Genomic Cis-Regulatory Logic: Experimental and Computational
Analysis of a Sea Urchin Gene,” Science, 1998, March 20; 279: 1896-1902)
42
The Endo16 Regulatory Function
43
The Endo 16 Regulatory Function,
Slightly More Algebraically
If (CG1 * P) (B(t) + G(t)) > 0, then
Q(t) = 2 (1 – (F + E + CD) Z) (1 + CG2 * CG3 * CG4) (CG1 * P) (B(t) + G(t))
Else
Q(t) = 2 (1 – (F + E + CD) Z) (1 + CG2 * CG3 * CG4) Otx(t)
and “+” is Boolean sum
44
Conditional Independence Is Not
Invariant in a Simplified Form of Endo
16 Regulation



X takes values in a discrete set, say {0,1,2,3,4}
Y = g(X), g nonlinear, say Y = X^2
Z = a Y*W, a real, W Boolean (values in {0,1}),
with a Bernoulli distribution
X
Y
Z
W
45
Conditional Independence Is Not
Invariant in a Simplified Form of Endo
16 Regulation



X is independent of Z conditional on Y,
but….
Σ X is not independent of Σ Y conditional
on Σ Z, where the sum is over values in n
= 4 or more identically and independently
distributed units
For large n this result generalizes to all
cases in which the range of X is finite (but
not binary), g is polynomial, and W is as
above
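The effect of summation can be checked exactly on a small case. The sketch below assumes X uniform on {0,1,2,3,4}, Y = X^2, Z = a Y W with a = 1 and W Bernoulli(1/2). Those distributional choices and the function names are illustrative assumptions. It tests whether the unit-level relation (X independent of Z conditional on Y) still holds between the sums over n units.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def joint_of_sums(n, x_values=(0, 1, 2, 3, 4), a=1, p_w=Fraction(1, 2)):
    """Exact joint distribution of (sum X, sum Y, sum Z) over n i.i.d. units,
    with X uniform on x_values, Y = X**2, Z = a * Y * W, W ~ Bernoulli(p_w)."""
    p_x = Fraction(1, len(x_values))
    joint = defaultdict(Fraction)
    for xs in product(x_values, repeat=n):
        ys = [x * x for x in xs]
        for ws in product((0, 1), repeat=n):
            prob = p_x ** n
            for w in ws:
                prob *= p_w if w else 1 - p_w
            zs = [a * y * w for y, w in zip(ys, ws)]
            joint[(sum(xs), sum(ys), sum(zs))] += prob
    return joint


def first_indep_of_third_given_second(joint):
    """Check P(x, y, z) * P(y) == P(x, y) * P(y, z) for every combination."""
    p_y = defaultdict(Fraction)
    p_xy = defaultdict(Fraction)
    p_yz = defaultdict(Fraction)
    for (x, y, z), p in joint.items():
        p_y[y] += p
        p_xy[(x, y)] += p
        p_yz[(y, z)] += p
    for (x, y), pxy in p_xy.items():
        for (y2, z), pyz in p_yz.items():
            if y2 == y and joint.get((x, y, z), 0) * p_y[y] != pxy * pyz:
                return False
    return True


single_unit = first_indep_of_third_given_second(joint_of_sums(1))
four_units = first_indep_of_third_given_second(joint_of_sums(4))
```

With these settings the check holds for a single unit and fails for sums over four units. For example, a sum of Y equal to 16 can come from X values (4,0,0,0) or (2,2,2,2), and the two cases allow different sums of Z.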
46
General Pessimistic Conclusion
(not a Theorem)


Conditional probability relations that hold among
regulator and regulated gene transcript
concentrations at the cellular level will not be
preserved in probability relations as measured in
microarray measurements taken from multiple cell sources
They will be preserved for linear systems and
“locally linear” systems (see Chu, et al.), but no
regulatory systems are as yet known to have such
a structure
47
Analysis of SAGE Data
48
Difference Strategy and SAGE
Estimating whether expression levels of
genes change in different environments, or
with other genes removed, requires a
comparison of expression levels across
samples
 Decision must be made as to whether
observed differences are or are not due to
chance

49
SAGE and Variance


Decisions as to whether differences in expression
levels are or are not due to chance depend on the
estimate of the variance of the underlying
probability distribution
Standardly, a multinomial model is used which
gives a very large variance—meaning decisions
about the constancy of a gene’s expression across
environments cannot be reliably made
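For reference, a sketch of the standard sampling model only, not the corrected method described on the next slide. Under multinomial sampling, each tag count is marginally binomial, which gives the variance and two-library test below. The function names and the tag counts are hypothetical.

```python
import math


def tag_proportion_variance(count, total):
    """Variance of the estimated tag proportion under multinomial sampling."""
    p = count / total
    return p * (1 - p) / total


def two_library_z(count_a, total_a, count_b, total_b):
    """z statistic for a difference in tag proportion between two libraries,
    using the pooled proportion under the null hypothesis of no change."""
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (count_a / total_a - count_b / total_b) / se


# Hypothetical counts: 30 tags of 10,000 in one library, 60 of 10,000 in another.
variance = tag_proportion_variance(30, 10_000)
z = two_library_z(30, 10_000, 60, 10_000)
```

The variance estimate enters the denominator of the test, so any decision about whether a gene has changed depends directly on the variance model assumed.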
50
SAGE and Variance





One step in SAGE measurements is an amplification of the
amount of mRNA measured through PCR amplification
The multinomial model does not correctly represent the
statistics of PCR
A correct estimate of variance requires an approximate
estimate of the original total number of transcripts before
PCR amplification
Relevant measurements can easily be made
This leads to a much lower estimate of variance of SAGE
estimates
51
Causal Clustering
52
The General Problem
Given data on a number of variables, find
features of the underlying processes that
generated the data
 Example: Spectral measurements of solar
radiation intensities. Variables are intensities
at each measured frequency

53
The Most Common Solution: Principal
Components Factor Analysis




Explains data by new “theoretical” variables that
are linear functions of linear combinations of
measured variables
Chooses “theoretical variables” to account for as
much of the variance of measured variables as
possible
“Theoretical” variables are not unique—
appropriate transformations will do as well
Gives no clues to dependencies among real
underlying factors — assumes they are
independent of one another
54
General Problems with Clustering
Algorithms
Tend to give misleading results if some of
the measured variables influence other
measured variables (e.g., through signal
leakage between channels)
 Assume no correlations among the
underlying factors
 E.g., Independent Components algorithms

55
A New Approach: General
Considerations





For the time being, consider only linear models
Think graphically and let the algebra take care of
itself
Be willing to make multiple hypothesis tests on
the same data set
Insist on computational tractability, but be
adventurous
Require asymptotic reliability under specifiable
assumptions
56
Think Graphically
A system represented by the equations:
Xi = ai T + ei, ai a real constant, ei random, i = 1,…,m
ei independent of ek for i not equal to k, is
represented as
[Graph: T with a directed edge to each of X1, X2, …, Xm-1, Xm]
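One well-known consequence of this one-factor structure, given here as an illustration rather than as part of the procedure on the slides, is that all tetrad differences among the measured variables vanish. The sketch builds the population covariance matrix implied by Xi = ai T + ei and evaluates them. The loadings and function names are hypothetical.

```python
def implied_covariance(loadings, var_t=1.0, error_vars=None):
    """Population covariance matrix implied by X_i = a_i * T + e_i,
    with mutually independent errors e_i that are independent of T."""
    m = len(loadings)
    error_vars = error_vars or [1.0] * m
    return [[loadings[i] * loadings[j] * var_t
             + (error_vars[i] if i == j else 0.0)
             for j in range(m)] for i in range(m)]


def tetrad_differences(cov, i, j, k, l):
    """The three tetrad differences among four measured variables."""
    return (cov[i][j] * cov[k][l] - cov[i][k] * cov[j][l],
            cov[i][j] * cov[k][l] - cov[i][l] * cov[j][k],
            cov[i][k] * cov[j][l] - cov[i][l] * cov[j][k])


# Hypothetical loadings a_1..a_4 for four indicators of one latent T.
cov = implied_covariance([0.5, 1.0, 2.0, 1.5])
tetrads = tetrad_differences(cov, 0, 1, 2, 3)
```

Each off-diagonal covariance is ai aj var(T), so every product of two covariances over four distinct indices equals a1 a2 a3 a4 var(T)^2 and the differences cancel.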
57
Causal Clustering

Assumptions (for some while)
 Linear systems
 Non recursive (acyclic graph)
 Independent noises or error terms
 Normal distributions of error variables
 Independent, identically distributed cases
 Faithfulness: vanishing partial correlations, if
any, hold for all values of the linear coefficients
58
Input
Values for variables X1 … Xn for a number
of cases
 Significance level (α level) to be used in
hypothesis tests
 Nothing else

59
Output



Disjoint clusters of some of the observables; a set
of directed acyclic graphs (DAGs) among
theoretical variables, one variable for each cluster
Each DAG determines a linear model
Just write each variable (node) in the graph as a
linear functional of its parent variables in the
graph and add an error term for each equation
60
The True Graph
61
Purify: Start of Round 1
62
Purify: Round 1


For each measured variable X, do a test of the
one factor model, with latent common cause
T0, and with all measured variables except X,
against the one factor model with all
measured variables including X (Difference of
chi squares)
If the model without X is not rejected, put X in
set Hold For 1
63
Purify: Steps into Round 1
…
64
Purify: End of Round 1
65
Washdown, Round 1
 Put all measured variables in
Hold For 1 in a new cluster with
a single common latent factor, T1
 Correlate the new factor with the
previous latent factor, T0
 Empty Hold For 1
66
Washdown: Round 1
67
Purify Round 2
 Repeat the Purify procedure on all
measured variables remaining in the
first cluster. Put any rejected
variables in Hold For 1
 Apply the Purify procedure to all
measured variables in the second
cluster. Put any rejected variables in
Hold For 2
68
Purify: Round 2
69
Washdown, Round 2
 Add variables in Hold For 1 to the
remaining variables in cluster T1
 Form a new cluster, with a new latent
common cause T2 with the variables
in Hold For 2
 Correlate all of the latent variables
 Empty Hold For 1 and Hold For 2
70
Washdown: Round 2
71
Purify/Washdown Output
(after 5 rounds)
72
Clean Up
 Remove any clusters with fewer
than 3 observed variables
73
Determining Latent Structure


For each pair of latent variables, Tj and Tk,
and their measured effects, test the model in
which there is a directed edge Tj → Tk against
the model in which there is no directed edge
If the model with a directed edge is not
rejected, keep an undirected edge Tj – Tk.
If the model with a directed edge is
rejected, remove the Tj – Tk undirected edge
74
MIMBuild
Step 1: Testing Marginal Independencies
Testing whether T2 and T3 are marginally independent:
χ² = 11.42, df = 8 versus χ² = 12.42, df = 9
Not significantly different (α = 0.05)
Keep edge
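The comparison of the two fit statistics can be reproduced as a chi-square difference test. The sketch assumes the two models are nested and differ by one free parameter, which is how df = 8 versus df = 9 is read here. The function names are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def nested_model_p_value(chi2_small, df_small, chi2_big, df_big):
    """p-value of the chi-square difference test for two nested models that
    differ by exactly one free parameter."""
    if df_big - df_small != 1:
        raise ValueError("this sketch only handles a 1-df difference")
    return chi2_sf_1df(chi2_big - chi2_small)


# Fit statistics reported on the slide.
p_value = nested_model_p_value(11.42, 8, 12.42, 9)
significant = p_value < 0.05
```

The difference of 1.00 on one degree of freedom gives a p-value of about 0.32, which agrees with the slide's "not significantly different" at the 0.05 level.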
75
Testing for Conditional Independence
To test if Tj is independent of Tk conditional
on Tm, form the complete graph among Tj,
Tk and Tm (with measured variable effects)
and test against the same model without the
Tj – Tk edge
 Similarly for conditioning on multiple
variables

76
MIMBuild
Step N: Testing Independencies
Conditioned on a Set of Size N
Other example: N = 3, testing whether T0 is independent of
T4 given {T1, T2, T3}
[Figure: the two models compared]
77
Orienting Edges
 If, for example, there is a structure T0 – T1 – T2
but no T0 – T2 edge, and the T0 – T2 edge was removed
without conditioning on T1, orient T0 – T1 – T2 as
T0 → T1 ← T2 (as a collider)
 Orient undirected edges adjacent to a collider
node away from a collider
[Figure: example graph over T0, T1, T2, T3]
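The two orientation rules can be sketched on an undirected skeleton plus a record of which conditioning set removed each edge. The data structures, the function names and the example graph (T1 adjacent to T0, T2 and T3) are assumptions for illustration. This is a simplified fragment, not the full PC orientation phase.

```python
from itertools import combinations


def orient_colliders(adjacent, sepset):
    """Orient unshielded triples a - b - c as a -> b <- c when b was not in
    the conditioning set that removed the a - c edge."""
    directed = set()
    for b in adjacent:
        for a, c in combinations(sorted(adjacent[b]), 2):
            if c in adjacent[a]:
                continue  # a and c are adjacent: the triple is shielded
            if b not in sepset.get(frozenset((a, c)), set()):
                directed.add((a, b))
                directed.add((c, b))
    return directed


def orient_away_from_colliders(adjacent, directed):
    """If a -> b, b - c is still undirected and a, c are not adjacent,
    orient b -> c so that no new collider is created at b."""
    result = set(directed)
    changed = True
    while changed:
        changed = False
        for a, b in list(result):
            for c in adjacent[b]:
                if c == a or c in adjacent[a]:
                    continue
                if (b, c) in result or (c, b) in result:
                    continue
                result.add((b, c))
                changed = True
    return result


# Hypothetical skeleton and separating sets.
skeleton = {"T0": {"T1"}, "T1": {"T0", "T2", "T3"}, "T2": {"T1"}, "T3": {"T1"}}
sepsets = {frozenset(("T0", "T2")): set(),
           frozenset(("T0", "T3")): {"T1"},
           frozenset(("T2", "T3")): {"T1"}}
colliders = orient_colliders(skeleton, sepsets)
edges = orient_away_from_colliders(skeleton, colliders)
```

In this example the first rule gives T0 → T1 ← T2 and the second then orients T1 → T3.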
78
Final Outcome
Purify/Washdown/MIMBuild
output
True graph
79
General Idea
 Measured variables are assigned to clusters by testing
whether the one factor model fits the data better with them
or without them
 Every rejected variable is tested on each succeeding
cluster until it fits
 The latent structure is determined by the PC algorithm
(Spirtes, et al. 1993) , known to be asymptotically correct
under the Faithfulness assumption, and (in this case) under
the assumption that there are no unmeasured causes of the
latent cluster factors
80
Generalizations

Using another algorithm for latent structure, the FCI
algorithm, procedure can be applied when there may be
unmeasured common causes of cluster latent factors




Can be used with any distribution family for which there are good
tests of conditional independence (not that there are many)
The algorithm can be easily integrated with prior substantive
knowledge about the actual structure
For linear systems, can be generalized to latent structures with
cyclic graphs (feedback systems)
Improved performance expected if Bayesian search algorithms
supplement constraint based search, or with genetic algorithms
81
Limitations

Only works for unmeasured causes having at least
3 unconfounded measured variables
 But if there is a known or suspected common cause of
all measures (or any set of measures), it can be
estimated and partialed out




Does not give orientations of all edges
Requires large sample sizes
Computationally intensive
No error probabilities are possible
82
Experiments
83
Experiment 1 – Microarray analysis
84
Background of the Experiment





Fat cells from mice are treated with troglitazone, which is a
member of the family of drugs known as thiazolidinediones
(TZD’s)
TZD’s are used in humans to increase the efficiency of the
biological actions of insulin in diabetes and obesity
Decreased insulin sensitivity is a hallmark of both diabetes
and obesity
The action is to activate the expression of specific genes
At the end of a particular incubation the cells were quickly
frozen to stop all biological processes in the cell
85
cDNA Microarray Analysis of
the 3T3-L1 Adipocyte response to
Troglitazone




3T3-L1 pre-adipocytes cultured in vitro
3T3-L1 pre-adipocytes differentiated into mature
adipocytes by addition of insulin and
dexamethasone
Mature adipocytes exposed to 10μM Troglitazone
for durations of between 15 minutes and 24 hours
Cells harvested directly in Trizol reagent and total
cellular RNA extracted by standard procedures
86
cDNA Microarray Analysis of
the 3T3-L1 Adipocyte response to
Troglitazone



First strand cDNA synthesized by Reverse
Transcriptase in the presence of α-33P-dCTP
cDNA hybridized to Research Genetics GF400
(mouse) Gene Filters using standard methods
Hybridized signal captured using Storm
(Molecular Dynamics) phosphorimager and gene-specific
signal intensity extracted using Pathways
4™ software (Research Genetics).
87
Data Scheme






20 array chips with 47 measurements
3 uses for each chip: 20 for the first hybridization, 20 for
the second hybridization, 7 for the third hybridization
3 treatments: control without DMSO, control with DMSO,
test sample (drug + DMSO)
35 time points
5355 genes
The data contains information about background,
chromosomes, release plates, the coordinates of each spot
on the plate, etc.
88
Normalization

The data was logged because:
 it gives a better sense of the amount of variation
 the amount of variance in a gene expression level was
proportional to the gene expression level


Each chip was adjusted to have median zero in
order to remove global chip-to-chip variations
Outliers were removed because very high and very low
gene intensities are not reliably measured
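The first two normalization steps can be sketched as follows. The base-2 logarithm, the function name and the example intensities are assumptions. The slides do not state the log base or the outlier rule, so outlier removal is left out.

```python
import math
import statistics


def normalize_chip(intensities):
    """Log-transform one chip's intensities and shift them to median zero."""
    logged = [math.log2(v) for v in intensities]
    chip_median = statistics.median(logged)
    # Subtracting the chip median removes a global chip-to-chip offset.
    return [v - chip_median for v in logged]


# Hypothetical raw intensities for five spots on one chip.
normalized = normalize_chip([1, 2, 4, 8, 16])
```

After this step every chip has median zero on the log scale, so a chip that is uniformly brighter or dimmer no longer looks like a change in expression.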
89
Determine the Effect of the Drug Treatment
on the Gene Expression Level Over Time
Compare 20 genes with highest variability in use-1
data with 20 genes with lowest variability
Perform statistical tests of hypothesis that genes are
not changing, adjusted for multiple testing problem
90
Are the Measurements for the
Second Use reliable?



Chips are supposed to be re-usable
However, the second measurement on each chip
resembles the first measurement on each chip
more closely than it resembles measurements that
occurred at the same time
Figure in next slide shows close resemblance
between different measurements on same chip, but
taken at different times
91
Are the Measurements for the
Second Use Reliable?
92
Concerns
Is it an experimental error?
 Should we use the chips only once?
 Is at least the use-1 data set reliable?
 We are using other more reliable, but more
expensive tests to evaluate these hypotheses

93
Future Plans
Remove outlying genes
 Improve the test performed for data in use-1
 Clustering methods for data in use-1
 Check the data for use-2

94
Experiment 2 – SAGE Analysis
95
Serial Analysis of Gene Expression
(SAGE)






Analysis of the effect of laminar shear stress on gene expression in
the vascular endothelium
Primary coronary artery endothelial cells (HCAEC) grown to
confluency on glass microscope slides
Slides placed in parallel plate flow chamber and cells exposed to
laminar shear stress for 0, 4, 8, 12, 20 and 24 hours
Cells harvested directly into Trizol reagent (InVitrogen) and total
RNA extracted
RNA used as substrate for construction of SAGE library and SAGE
tags analyzed by automated DNA sequencing
SAGE tag data analyzed using SAGE2000 software and gene
expression measurement recorded for all genes present
96
Preliminary Clustering Analysis of
Genes Regulated >2-fold
[Figure: clustered heat map of the SAGE libraries 0h, 4h, 8h, 12h, 20h and 24h shear]
NB: Samples are clustered using the Pearson correlation. Red, yellow and blue bars
indicate high, medium and low levels of gene expression respectively.
97
Flow Loop
Flow chamber
Flow Direction
Upper
Reservoir
Flow Cell
Flow
Direction
Peristaltic Pump
Flow Meters
Flow Direction
Wave Driver
Lower
Reservoir
Flow
Regulators
Function Generator
98
Parallel Plate Flow Cell
Reservoir
Flow In
Cells
Flow Out
Reservoir
99