Computational Discovery of Communicable Knowledge


Computational Discovery of
Communicable Scientific Knowledge
Pat Langley
Institute for the Study of Learning and Expertise
Palo Alto, California
and
Center for the Study of Language and Information
Stanford University, Stanford, California
http://www.isle.org/~langley
[email protected]
Thanks to S. Bay, V. Brooks, S. Klooster, A. Pohorille, C. Potter, K. Saito, J. Shrager,
M. Schwabacher, and A. Torregrosa.
Motivations for Computational Discovery
Humans strive to discover new knowledge from experience so that
they can:
 better predict and control future events
 understand both previous and future events
 communicate that understanding to others
Computational techniques should let us automate and/or assist this
discovery process.
Recent research on computer-aided discovery has focused on some
of these issues but downplayed others.
The Data Mining Paradigm
One computational discovery paradigm, known as data mining or
KDD, can be best characterized as:
 emphasizing the availability of vast amounts of data;
 drawing on heuristic search methods to find regularities in
these data;
 using formalisms like decision trees, association rules, and
Bayes nets to describe those regularities.
Thus, most KDD researchers favor their own formalisms over
those used by scientists and engineers.
As a result, their discoveries are seldom very communicable to
members of those communities.
Myths About Understandability
Within the data mining paradigm, one quite popular myth is that:
 decision trees and rules are inherently understandable
 because logical formalisms are easier to interpret than other
notations.
However, Kononenko found that doctors felt that naïve Bayesian
classifiers were easier to interpret than decision trees.
Conclusion: Any formalism’s understandability depends on the
interpreter’s familiarity with that formalism.
Myths About Understandability
Another popular myth in the data mining community is that:
 connectionist methods produce results that are opaque
 because the set of weights they learn cannot be easily
interpreted.
However, Saito and Nakano (1997) have shown that one can use
such methods to discover explicit numeric equations.
Conclusion: Understandability depends on the resulting formalism,
not on the search method used to discover knowledge.
Computational Scientific Discovery
An older paradigm, computational scientific discovery, can be
characterized as:
 drawing on heuristic search to find regularities in scientific
data, either historical or novel;
 using formalisms like numeric laws, structural models, and
reaction pathways to describe regularities.
Thus, researchers in this framework favor representations used by
scientists and engineers.
As a result, their systems’ discoveries are more communicable to
members of those communities.
Time Line for Research on
Computational Scientific Discovery
[Figure: time line, 1979–2000, of computational scientific discovery systems, grouped in the original slide into numeric laws, qualitative laws, structural models, and process models. Systems shown include Bacon.1–Bacon.5, Abacus, Coper, AM, Glauber, Dendral, Dalton, Stahl, Hume, ARC, DST, GPN, LaGrange, IDSQ, Live, NGlauber, Stahlp, Revolver, IE, Fahrenheit, E*, Tetrad, IDSN, Gell-Mann, BR-3, Mendel, RL, Progol, Pauli, Coast, Phineas, AbE, Kekada, SDS, HR, BR-4, Mechem, CDP, SSF, RF5, LaGramge, Astra, and GPM.]
Successes of Computational Scientific Discovery
Over the past decade, systems of this type have helped discover
new knowledge in many scientific fields:
• stellar taxonomies from infrared spectra (Cheeseman et al., 1989)
• qualitative chemical factors in mutagenesis (King et al., 1996)
• quantitative laws of metallic behavior (Sleeman et al., 1997)
• qualitative conjectures in number theory (Colton et al., 2000)
• temporal laws of ecological behavior (Todorovski et al., 2000)
• reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)
Each of these has led to publications in the refereed literature of
the relevant scientific field (see Langley, 2000).
The Developer’s Role in Computational Discovery
[Figure: the developer's roles in computational discovery: problem formulation, representation engineering, data manipulation, algorithm manipulation, algorithm invocation, and filtering and interpretation of results.]
Themes of the Research
We aim to extend previous approaches to computational scientific
discovery by:
 generating explanations that involve hidden objects/variables
 revising existing models rather than starting from scratch
 drawing on domain knowledge to constrain the search process
 developing interactive discovery tools for use by scientists
As in earlier work, the notation for discovered knowledge will be
the same as that used by domain scientists.
Two promising fields in which to pursue this research agenda are
Earth science and molecular biology.
Some Interesting Questions in Earth Science
 What environmental variables determine the production of
carbon and the generation of various gases?
 What functional forms relate these predictive variables to the
ones they influence?
 How do extreme values of these variables affect behavior of
the ecosystem?
 Are the Earth ecosystem's parameters constant, or have their
values changed in recent years?
The Task of Ecological Model Revision
Given: A model of Earth’s ecosystem (CASA) stated as equations
that involve observable and hidden variables.
Given: Inferred values for global parameters and intrinsic properties
associated with discrete variables (e.g., ground cover).
Given: Observations about numeric variables (rainfall, sunlight,
temperature, NPPc) as they change over space and time.
Find: A revised ecosystem model with altered equations and/or
parametric values that fits the data better.
The NPPc Portion of CASA
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt − 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt − Tempc − 10))) · (1 + e^(0.3 · (Tempc − Topt − 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M   if Tempc > 0
PET = 0                                       if Tempc < 0
A = 0.00000068 · AHI^3 − 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min[(SR-FAS − 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI − 1000)
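To make the submodel's structure concrete, the following is a minimal Python sketch that transcribes the equations above. The record field names, the SR lookup table, and the guard against PET = 0 are assumptions for illustration only, not CASA's actual code.

```python
import math

def npp_c(months, sr_table):
    """Transcription of the NPPc equations above (illustrative sketch).

    `months` is a sequence of per-month records holding the observed inputs;
    `sr_table` maps a UMD-VEG class to its intrinsic SR value.
    """
    total = 0.0
    for m in months:
        t1 = 0.8 + 0.02 * m["Topt"] - 0.0005 * m["Topt"] ** 2
        t2 = 1.18 / ((1 + math.exp(0.2 * (m["Topt"] - m["Tempc"] - 10))) *
                     (1 + math.exp(0.3 * (m["Tempc"] - m["Topt"] - 10))))
        a = (0.00000068 * m["AHI"] ** 3 - 0.000077 * m["AHI"] ** 2
             + 0.018 * m["AHI"] + 0.49)
        pet = (1.6 * (10 * m["Tempc"] / m["AHI"]) ** a * m["PET_TW_M"]
               if m["Tempc"] > 0 else 0.0)
        # Guard against PET = 0 (the published equations leave W undefined there).
        w = 0.5 + 0.5 * m["EET"] / pet if pet > 0 else 0.5
        e = 0.56 * t1 * t2 * w
        sr_fas = (m["Mon_FAS_NDVI"] + 1000) / (m["Mon_FAS_NDVI"] - 1000)
        fpar_fas = min((sr_fas - 1.08) / sr_table[m["UMD_VEG"]], 0.95)
        ipar = 0.5 * fpar_fas * m["Monthly_Solar"] * m["Sol_Conver"]
        total += max(e * ipar, 0.0)
    return total
```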
The NPPc Portion of CASA
[Figure: dependency graph for the NPPc portion of CASA, relating the variables in the equations above: NPPc, E, e_max, T1, T2, W, EET, PET, A, AHI, PET-TW-M, Tempc, Topt, IPAR, SOLAR, FPAR, SR, NDVI, and VEG.]
Improving the NPPc Portion of CASA
One way to improve the NPPc model’s fit to observed data is to:
1. Transform the model into a multilayer neural network that
makes the same predictions.
2. Identify portions of the model that are candidates for revision.
3. Use an error-driven connectionist learning algorithm to revise
those portions of the model.
4. Transform the revised multilayer network back into numeric
equations using the improved components.
This approach is similar to Towell’s (1991) method for revising
qualitative models.
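As an illustration of steps 1 and 3, the sketch below exposes the constants of the T2 term as trainable weights initialized from the published model and adjusts them by plain gradient descent on squared error. This is only a stand-in for the actual network encoding and the BPQ algorithm; the function names, learning rate, and step counts are assumptions.

```python
import numpy as np

def t2(params, topt, tempc):
    """T2 stress term with its constants exposed as trainable parameters.

    params = (scale, k1, k2, c1, c2); the published values (1.18, 0.2, 0.3,
    10, 10) initialize the search, following the strategy described above.
    """
    scale, k1, k2, c1, c2 = params
    return scale / ((1 + np.exp(k1 * (topt - tempc - c1))) *
                    (1 + np.exp(k2 * (tempc - topt - c2))))

def revise_t2(topt, tempc, target, steps=2000, lr=1e-3, eps=1e-4):
    """Error-driven revision of the T2 component (illustrative stand-in for BPQ)."""
    params = np.array([1.18, 0.2, 0.3, 10.0, 10.0])     # start from the model
    def loss(p):
        return np.mean((t2(p, topt, tempc) - target) ** 2)
    for _ in range(steps):                               # numerical gradients
        grad = np.array([(loss(params + eps * np.eye(5)[i]) -
                          loss(params - eps * np.eye(5)[i])) / (2 * eps)
                         for i in range(5)])
        params -= lr * grad
    return params                                        # revised constants
```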
The RF6 Discovery Algorithm
Saito and Nakano (2000) describe RF6, a discovery system that:
1. Creates a multilayer neural network that links predictive with
predicted variables using additive and product units.
2. Invokes the BPQ algorithm to search through the weight space
defined by this network.
3. Transforms the resulting network into a polynomial equation
of the form y = Σ_i c_i · Π_j x_j^d_ij.
They have shown this approach can discover an impressive class
of numeric equations from noisy data.
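The sketch below shows how a network of additive and product units corresponds to equations of this form: a product unit computes exp(Σ_j d_ij · log x_j), so its weights read off directly as exponents. The function name, array shapes, and example coefficients are illustrative assumptions, not RF6's actual code.

```python
import numpy as np

def power_law_sum(X, c, D):
    """Evaluate y = sum_i c_i * prod_j x_j ** d_ij, the RF6 output form.

    X: (n_samples, n_vars) array of positive inputs;
    c: (n_terms,) additive coefficients; D: (n_terms, n_vars) exponents.
    """
    log_terms = np.log(X) @ D.T        # each product unit, computed in log space
    return np.exp(log_terms) @ c       # additive output unit

# Hypothetical example: y = 2 * x1**1.5 * x2**-0.5 + 0.3 * x2**2
X = np.array([[1.0, 2.0], [4.0, 1.0]])
c = np.array([2.0, 0.3])
D = np.array([[1.5, -0.5], [0.0, 2.0]])
print(power_law_sum(X, c, D))          # approximately [2.614, 16.3]
```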
Three Facets of Model Revision
We have adapted RF6 to revise an existing quantitative model in
three distinct ways:
 Altering the value of parameters in a specified equation;
 Changing the associated values for an intrinsic property; and
 Replacing the equation for a term with another expression.
Rather than initializing weights randomly, the system starts with
weights based on parameters in the original model.
We have applied this strategy to revise six different portions of the
NPPc submodel.
Altering Parameters in the NPPc Model
Initial model:
T2 = 1.18 / [(1 + e^(0.2 · (Topt − Tempc − 10))) · (1 + e^(0.3 · (Tempc − Topt − 10)))]
Cross-validated RMSE = 467.910
Behavior: Gaussian-like function of temperature difference.
Revised model:
T2 = 1.80 / [(1 + e^(0.05 · (Topt − Tempc − 10.8))) · (1 + e^(0.3 · (Tempc − Topt − 90.33)))]
Cross-validated RMSE = 461.466 [ one percent reduction ]
Behavior: a nearly flat function over the actual range of temperature differences.
Conclusion: The T2 temperature stress term contributes little to the
overall predictive ability of the NPPc submodel.
Revising Intrinsic Values in the Model
The NPPc submodel includes one intrinsic property, SR, associated with
the variable for vegetation type, UMD-VEG.
The corresponding RF6 network includes one hidden node for SR and
one dummy input variable for each vegetation type.
Veg type   A     B     C     D     E     F     G     H     I     J     K
Initial    3.06  4.35  4.35  4.05  5.09  3.06  4.05  4.05  4.05  5.09  4.05
Revised    2.57  4.77  2.20  3.99  3.70  3.46  2.34  0.34  2.72  3.46  1.60
RMSE = 467.910 for the original model;
RMSE = 448.376 for the revised model, an improvement of four percent.
Observation: Nearly all intrinsic values are lower in the revised model.
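The dummy-variable encoding described above can be sketched as follows: each sample's vegetation type becomes a one-hot vector whose dot product with a weight vector, initialized to the published SR values, gives the SR node's output, so the intrinsic values can be revised by the same error-driven procedure as the other weights. The names and encoding details are assumptions for illustration.

```python
import numpy as np

veg_types = list("ABCDEFGHIJK")                        # the 11 UMD-VEG classes
sr_weights = np.array([3.06, 4.35, 4.35, 4.05, 5.09, 3.06,
                       4.05, 4.05, 4.05, 5.09, 4.05])  # initial intrinsic values

def sr_node(veg_type, weights=sr_weights):
    """SR hidden node: dot product of a one-hot dummy vector with the weights."""
    one_hot = np.array([1.0 if v == veg_type else 0.0 for v in veg_types])
    return float(one_hot @ weights)

# During revision, `sr_weights` is updated by the error-driven learner along
# with the rest of the network, yielding per-class values like those above.
print(sr_node("E"))   # 5.09 before revision
```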
Revising Equations in the NPPc Model
Initial model:
E = 0.56 · T1 · T2 · W
Cross-validated RMSE = 467.910
Behavior: Each stress term decreases the photosynthetic efficiency E.
Revised model:
E = 0.521 · T1^0.00 · T2^0.03 · W^0.00
Cross-validated RMSE = 446.270 [ five percent reduction ]
Behavior: T1 and W have no effect on E, and T2 has only a minor effect.
Conclusion: The stress terms are not useful to the NPPc model, most
likely because of recent improvements in NDVI measures.
Future Work on Ecological Model Revision
 Apply the revision method to other parts of the NPPc submodel
and to other static parts of the CASA model.
 Extend the revision method to improve parts of CASA that
involve difference equations.
 Develop software for visualizing both spatial and temporal
anomalies, as well as relating them to the model.
 Implement an interactive system that lets scientists direct
high-level search for improved ecosystem models.
Visualizing an Improved Model
One way to visualize a model involves plotting its rules spatially.
Our Earth science collaborators found this useful, as regions often
correspond to recognizable ecological zones.
Some Interesting Biological Questions
 How do organisms acclimate to increased temperature or
ultraviolet radiation?
 Why do we observe bleaching of plant cells under high
light conditions?
 What differences in biological processes exist between a
mutant organism and the original?
 What are the effects on an organism’s biological processes
when one of its important genes is removed?
Modeling Microarray Results on Photosynthesis
Given: Qualitative knowledge about reactions and regulations for
Cyanobacteria in a high light situation.
Given: Knowledge about the genes in Cyanobacteria relevant to
the photosynthetic process.
Given: Observed expression levels, over time, of the organism’s
genes in the presence of high ultraviolet light.
Find: A revised model with altered reactions and regulations that
explains the expression levels and bleaching.
A Model of Photosynthesis Regulation
How do plants modify their photosynthetic apparatus in high light?
[Figure: initial regulatory model relating Light, RR, NBLR, NBLA, DFR, PBS, psbA1, psbA2, cpcB, Health, and Photo, with + and − labels marking activating and inhibiting links.]
Collecting Data on Photosynthetic Processes
[Figure: continuous culture (chemostat) experiment; the health of the culture is tracked over time through a stress (e.g., high light), an adaptation period, and an equilibrium period, with mRNA/cDNA sampled for microarray analysis. Image sources: www.affymetrix.com, wwwscience.murdoch.edu.au/teach]
Microarray Data on Photosynthetic Regulation
[Figure: expression levels (roughly 0 to 4) over six time points for NBLR, NBLA, cpcB, psbA2, psbA1, DFR, and PBS.]
Revising a Model of Gene Regulation
Our approach carries out heuristic search through the model space,
guided by candidates’ abilities to explain the data:
Starting state: Initial model proposed by the biologist
Operators: Add a link, delete a link, determine sign on a link
Control: Greedy search for N steps to determine link structure;
Exhaustive search to determine best signs on links
Evaluation: Agreement with predicted relations among partial
correlations, similar to those used in Tetrad
To reduce variance, the system repeats this process using bootstrap
sampling and only makes changes that occur in 75% of the models.
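A minimal sketch of this revision loop follows. The scoring function stands in for the test of agreement with predicted partial-correlation relations, and all names, defaults, and the treatment of sign changes are assumptions rather than the system's actual code.

```python
import itertools
import random
from collections import Counter

def revise_model(initial_links, genes, data, score,
                 n_steps=3, n_boot=20, threshold=0.75):
    """Greedy, bootstrap-filtered revision of a signed link structure (sketch).

    initial_links: dict {(regulator, target): sign} from the biologist's model.
    score(links, sample): how well a candidate explains the expression data.
    """
    def greedy(sample):
        links = dict(initial_links)
        for _ in range(n_steps):                      # greedy structure search
            candidates = [{e: s for e, s in links.items() if e != edge}
                          for edge in links]                     # delete a link
            for src, dst in itertools.permutations(genes, 2):    # add a link
                if (src, dst) not in links:
                    candidates.append({**links, (src, dst): +1})
            links = max(candidates, key=lambda m: score(m, sample))
        best = links        # exhaustive sign search (feasible for small models)
        for signs in itertools.product([+1, -1], repeat=len(links)):
            trial = dict(zip(links.keys(), signs))
            if score(trial, sample) > score(best, sample):
                best = trial
        return best

    votes = Counter()                                 # bootstrap to cut variance
    for _ in range(n_boot):
        sample = [random.choice(data) for _ in data]
        revised = greedy(sample)
        for edge in set(initial_links) ^ set(revised):  # structural changes only
            votes[edge] += 1
    return {e for e, v in votes.items() if v / n_boot >= threshold}
```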
Greedy Search Through a Space of Models
[Figure: greedy search tree expanding the initial model into successive candidate revisions (1.1–1.4, 2.1–2.4, 3.1–3.4).]
A Revised Model of Photosynthesis Regulation
Changes to the model improve its match to the expression data.
[Figure: revised regulatory model; relative to the initial model, some links among Light, RR, NBLR, NBLA, DFR, PBS, psbA1, psbA2, cpcB, Health, and Photo are removed (marked ×) and some signs are changed.]
Similar changes adapt the model to expression data from mutants.
Future Work on Biological Modeling
 Add more knowledge about biochemical pathways and use it to
interpret other microarray data (e.g., rat metabolism, cancer).
 Introduce taxonomic knowledge to limit the search process and
improve final models.
 Expand modeling formalism to support biological mechanisms
in addition to abstract processes.
 Implement an interactive system that lets scientists direct high-level search for improved biological process models.
Concluding Remarks
In summary, unlike work in the data mining paradigm, our research
on computational discovery:
 attempts to move beyond description and prediction to both
explanation and understanding;
 uses domain knowledge to initialize the search and to characterize
how the revised model differs from the original;
 presents the new knowledge in some communicable notation
that is familiar to domain experts.
This approach seems especially appropriate for manipulating and
understanding complex scientific and engineering data.
In Memoriam
Earlier this year, computational scientific discovery lost two of
its founding fathers:
 Herbert A. Simon (1916 – 2001)
 Jan M. Zytkow (1945 – 2001)
Both contributed to the field in many ways: posing new problems,
inventing methods, training students, and organizing meetings.
Moreover, both were interdisciplinary researchers who contributed
to computer science, psychology, philosophy, and statistics.
Herb Simon and Jan Zytkow were excellent role models that we
should aim to emulate.
A Closing Quotation
We would like to imagine that the great discoverers, the scientists
whose behavior we are trying to understand, would be pleased with
this interpretation of their activity as normal (albeit high-quality)
human thinking. . .
But science is concerned with the way the world is, not with how
we would like it to be. So we must continue to try new experiments,
to be guided by new evidence, in a heuristic search that is never
finished but always fascinating.
Herbert A. Simon, Envoi to Scientific Discovery, 1987.
Visualizing Errors in the Model
We can easily plot an improved model’s errors in spatial terms.
Such displays can help suggest causes for prediction errors and thus
ways to further improve the model.
Related Research on Discovery
Our approach to computational scientific discovery borrows ideas
from earlier work on:
 equation discovery (Langley et al., 1983; Zytkow et al., 1990;
Washio & Motoda, 1998; Todorovski & Dzeroski, 1997);
 revision of qualitative models (Ourston & Mooney, 1990;
Towell, 1991);
 revision of quantitative models (Glymour et al., 1987; Chown
& Dietterich, 2000).
However, our work combines these ideas in novel ways to produce
a discovery system with new functionality.