Computational Discovery of Communicable Knowledge


Interactive Software Environments for
Computational Modeling and Discovery
Pat Langley
Institute for the Study of Learning and Expertise
Palo Alto, California
and
Center for the Study of Language and Information
Stanford University, Stanford, California
http://www.isle.org/~langley
[email protected]
Thanks to S. Bay, V. Brooks, L. Chrisman, S. Klooster, A. Pohorille, C. Potter, K. Saito,
H. Spencer, J. Shrager, M. Schwabacher, and A. Torregrosa.
The Challenge of Systems Science
As a field of science matures, researchers move beyond accounts
of simple, isolated phenomena to:
 develop models of complex systems with many components;
 compare these models to observational data from the systems;
 evaluate their models’ ability to fit these observations; and
 improve their models in response to detected anomalies.
Developing, testing, and revising such models is a challenging
endeavor that would benefit from computational aids.
Our research goal is to design, construct, evaluate, and understand
such computational tools for systems science.
Lessons about Scientific Knowledge Discovery
Our research collaborations in Earth science and microbiology
have suggested some important lessons:
1. Traditional notations from machine learning and data mining are not
communicated easily to domain scientists.
2. Scientists often want models that move beyond description to provide
explanations of their data.
3. Scientists often have initial models and background knowledge that
should influence the discovery process.
4. Scientific data are often rare and difficult to obtain rather than being
plentiful, making variance reduction a key issue.
5. Scientists often want computational assistance rather than automated
discovery systems.
These observations suggest clear needs for additional research in
computational approaches to scientific knowledge discovery.
Inductive Process Modeling
[Figure: induction takes training data and background knowledge and produces learned knowledge]

background knowledge (generic processes):
process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ] * P
process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ] * P * (1 - P / [0, 1, ])
process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ]
process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ] * P1 * nutrient_P2,
             d[P2,t] = - [0, 1, ] * P1 * nutrient_P2
process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P
process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ])

learned knowledge (induced model):
model AquaticEcosystem
variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
observables: nitro, phyto, zoo
process phyto_exponential_growth
  equations: d[phyto,t] = 0.1 * phyto
process zoo_logistic_growth
  equations: d[zoo,t] = 0.1 * zoo * (1 - zoo / 1.5)
process phyto_nitro_consumption
  equations: d[nitro,t] = -1 * phyto * nutrient_nitro,
             d[phyto,t] = 1 * phyto * nutrient_nitro
process phyto_nitro_no_saturation
  equations: nutrient_nitro = nitro
process zoo_phyto_consumption
  equations: d[phyto,t] = -1 * zoo * nutrient_phyto,
             d[zoo,t] = 1 * zoo * nutrient_phyto
process zoo_phyto_saturation
  equations: nutrient_phyto = phyto / (phyto + 0.5)
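To make the AquaticEcosystem model concrete, here is a minimal Python sketch that simulates it with forward Euler steps. It takes logistic growth in the standard form rate * P * (1 - P / capacity) and gives the consumed quantity in each consumption process a negative rate; both are assumed readings of the slide's notation. The function names, initial values, and step size are hypothetical and chosen only for illustration.

```python
def derivatives(state):
    """Sum the contributions of every process to each variable's rate of change."""
    nitro, phyto, zoo = state["nitro"], state["phyto"], state["zoo"]

    # Algebraic processes: phyto_nitro_no_saturation and zoo_phyto_saturation.
    nutrient_nitro = nitro
    nutrient_phyto = phyto / (phyto + 0.5)

    # Each consumption process moves one flux from the consumed to the consumer.
    # Assumption: the consumed variable's rate is the negative of the consumer's.
    phyto_eats_nitro = 1.0 * phyto * nutrient_nitro
    zoo_eats_phyto = 1.0 * zoo * nutrient_phyto

    return {
        "nitro": -phyto_eats_nitro,
        # phyto_exponential_growth plus both consumption processes.
        "phyto": 0.1 * phyto + phyto_eats_nitro - zoo_eats_phyto,
        # zoo_logistic_growth (standard logistic form, assumed) plus consumption.
        "zoo": 0.1 * zoo * (1 - zoo / 1.5) + zoo_eats_phyto,
    }


def simulate(initial, dt=0.01, steps=1000):
    """Forward-Euler integration; returns the list of visited states."""
    state = dict(initial)
    trajectory = [state]
    for _ in range(steps):
        rates = derivatives(state)
        state = {name: value + dt * rates[name] for name, value in state.items()}
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    start = {"nitro": 1.0, "phyto": 0.1, "zoo": 0.1}  # hypothetical initial values
    final = simulate(start)[-1]
    print({name: round(value, 3) for name, value in final.items()})
```

Because several processes can write to the same variable (phyto appears in three of them), the sketch adds their contributions, which is one way to realize the modularity that induction over processes relies on.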
Why Are Process Models Interesting?
Process models are good targets for knowledge discovery because:
 they incorporate scientific formalisms rather than AI notations,
   which are easily communicable to scientists and engineers;
 they move beyond descriptive generalization to explanation,
   while retaining the modularity needed to support induction.
These reasons point to process models as an ideal representation
for scientific and engineering knowledge.
Process models are an important alternative to formalisms used
currently in machine learning and data mining.
Three Challenging Scientific Domains
Earth ecosystem:
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)
gene regulation:
[Figure: qualitative network with signed (+/-) links among Light, NBLR, NBLA, RR, DFR, PBS, psbA1, psbA2, cpcB, Photo, and Health]
human activities:
[Figure: network with signed (+/-) links among activity level, heart rate, heart activity, heart capacity, lung activity, lung capacity, resp. rate, and GSR]
Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
 process models characterize behavior of dynamical systems;
 variables are mainly continuous and data are unsupervised;
 observations are not independently and identically distributed;
 process models contain unobservable processes and variables;
 multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and
the availability of background knowledge.
An Environment for Interactive Process Modeling
We plan to develop an interactive environment that lets users:
 specify process models of static and dynamic systems;
 display and edit a model’s structure and details graphically;
 utilize a model to simulate a system’s behavior over time;
 incorporate background knowledge cast as generic processes;
 indicate which processes to consider during model revision;
 invoke a revision module that improves a model’s fit to data.
Our initial implementation focuses on quantitative processes, but
future versions should also support qualitative models.
A Process Model for Carbon Production
model npp;
variables NPPc, E, IPAR, T1, T2, W, Topt, tempc, eet, PET, PETTWM,
ahi, A, FPARFAS, monthlySolar, SolConver, MONFASNDVI, umd_veg;
observable ahi,eet,tempc,Topt,MONFASNDVI,monthlySolar,PETTWM,umd_veg;
process CarbonProd;
equations NPPc = E * IPAR;
process PhotoEfficiency;
equations E = (0.389 * (T1 * (T2 * W)));
process TempStress1;
equations T1 = (0.8 + ((0.02 * Topt) - (0.0005 * (Topt ^ 2))));
process TempStress2;
equations T2 = ((1.1814 /
(1 + (2.718281828 ^ (0.2 * (Topt - 10 - tempc))))) /
(1 + (2.718281828 ^ (0.3 * (tempc - 10 - Topt)))));
process WaterStress;
conditions PET!=0;
equations W = (0.5 + (0.5 * (eet / PET)));
process WSNoEvapoTrans;
conditions PET==0;
equations W = 0.5;
process EvapoTrans;
conditions tempc>0;
equations PET = 1.6 * (10 * tempc / ahi) ^ A * PETTWM;
•
•
•
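The conditions in this model select among alternative processes for the same variable: WaterStress applies when PET is nonzero and WSNoEvapoTrans otherwise. A minimal Python sketch of that idea follows. The `Process` class, the `apply_processes` function, and the dictionary-based representation are hypothetical illustrations, not the environment's actual implementation; only the three equations and the two conditions come from the model above.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Values = Dict[str, float]


def always(values: Values) -> bool:
    """Default condition for processes that declare none."""
    return True


@dataclass
class Process:
    name: str
    target: str                          # variable the equation defines
    equation: Callable[[Values], float]
    condition: Callable[[Values], bool] = always


# Three processes transcribed from the npp model; the rest are omitted here.
PROCESSES = [
    Process("TempStress1", "T1",
            lambda v: 0.8 + (0.02 * v["Topt"] - 0.0005 * v["Topt"] ** 2)),
    Process("WaterStress", "W",
            lambda v: 0.5 + 0.5 * (v["eet"] / v["PET"]),
            condition=lambda v: v["PET"] != 0),
    Process("WSNoEvapoTrans", "W",
            lambda v: 0.5,
            condition=lambda v: v["PET"] == 0),
]


def apply_processes(processes, values):
    """Evaluate, in order, every process whose condition holds."""
    result = dict(values)
    for process in processes:
        if process.condition(result):
            result[process.target] = process.equation(result)
    return result


if __name__ == "__main__":
    print(apply_processes(PROCESSES, {"Topt": 20.0, "eet": 2.0, "PET": 4.0}))
    print(apply_processes(PROCESSES, {"Topt": 20.0, "eet": 2.0, "PET": 0.0}))
```

Keeping each process as a separate object with its own condition is one way to let a revision step add, remove, or alter a single process without touching the others.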
Viewing and Editing a Process Model
Initial Results on Ecosystem Model Revision
Initial model:
E = 0.56 · T1 · T2 · W
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
Cross-validated RMSE = 465.212 and r^2 = 0.799
Revised model:
• E = 0.353 · T1^0.00 · T2^0.08 · W^0.00
• T2 = 0.83 / [(1 + e^(1.0 · (Topt – Tempc – 6.34))) · (1 + e^(1.0 · (Tempc – Topt – 11.52)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
• SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
Cross-validated RMSE = 397.306 and r^2 = 0.853 [15% reduction]
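The bracketed reduction can be checked from the two reported RMSE values. The short Python sketch below defines root mean squared error in the usual way and computes the relative reduction; the function names are hypothetical, and the slides do not say how the cross-validation folds were formed, so only the final arithmetic is reproduced.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def percent_reduction(before, after):
    """Relative reduction in error, as a percentage of the initial error."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # RMSE values reported for the initial and revised models.
    print(round(percent_reduction(465.212, 397.306), 1))  # 14.6
```

The result, about 14.6 percent, rounds to the 15 percent quoted on the slide.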
A Qualitative Model of Gene Regulation
How do plants modify their photosynthetic apparatus in high light?
[Figure: qualitative regulatory network with signed (+/-) links among Light, NBLR, NBLA, RR, dspA, PBS, psbA1, psbA2, cpcB, Photo, and Health]
This model is qualitative but relates continuous variables, much like
formalisms from qualitative physics (e.g., Forbus, 1984).
Fields Contributing to the Proposed Research
 computational scientific discovery
 qualitative reasoning
 human-computer interaction
 simulation languages, numerical analysis
 biology, physiology, Earth science
Plans for Experimental Evaluation
Our evaluation plans combine several methods:
 demonstrating new functionality in each of three domains;
 collecting and analyzing traces of users’ interactions;
 formulating hypotheses about the human-computer system;
 running lesion studies with synthetic data to test those hypotheses;
 revising the environment based on the results of experiments.
Taken together, these studies should uncover the design principles
that produce successful modeling and discovery environments.
The methodology for evaluating intelligent assistants is not yet
mature, so we must develop it along the way.
Some Legitimate Reviewer Concerns
1. A general-purpose modeling environment may not be justified
given the differences in the proposed application domains.
2. We should take a closer look at existing modeling environments
like STELLA and link our work to them if possible.
3. The research plan for modeling human activities is vague.
4. The schedule of work follows a standard software life cycle,
rather than giving detail about tasks relevant to the project.
Less Legitimate Reviewer Concerns
5. We may not need to develop a new modeling formalism, since
inductive logic programming can handle most of our needs.
6. The proposed research program will not use a "cutting-edge AI
approach" because it relies on the heuristic search metaphor.
7. We should not incorporate qualitative physics because it did not
scale well, has made little progress, and has had little impact.
8. The proposal reads like a CYC project for scientists.
9. The work plan is sketchy and, since the main task is developing
the modeling environment, one postdoc may not be enough.
Less Legitimate Reviewer Concerns
10. The proposal makes little commitment to data-mining
methods and it does not offer timely advances. There is no
conceptual novelty, and the framework is not "radically new".
11. No work is cited for keeping qualitative, quantitative, verbal,
and visual representations consistent.
12. There is a fundamental assumption that automated discovery
tools are inferior to interactive ones.
13. We should take advantage of recent advances in genetic
methods and ones for learning generative models.
14. The research seems unlikely to have a big commercial impact.
Planned Collaborations
Likely collaborations with current UCC researchers include:
 using constraints to control search for models (Freuder et al.)
 learning numeric constraints from observations (Freuder et al.)
 using methods for case adaptation to revise models (Bridge)
 modeling regulation of apoptotic cell death (Cotter, Higgins)
 modeling behavior of Irish ecosystems (O’Kane)
We also plan to continue ongoing collaborations with scientists at:
 Stanford University, ISLE, and NASA Ames (USA)
 Jožef Stefan Institute (Slovenia)
 NTT Communication Science Laboratories (Japan)
Proposed Research Staff
Principal Investigator – Oversight of entire research project
Senior Scientist – Oversight of environment design/implementation
Postdoc – Implementing and maintaining modeling environment
Postdocs – One for each scientific application domain
Postdoc – Experimental evaluation of modeling environment
PhD students – Two for each scientific application domain
Laboratory manager – Responsible for general operations
Computer manager – Responsible for computing environment
Technical writer – Prepare manuals and co-author research reports
Concluding Remarks
In summary, unlike work in the data-mining paradigm, our research
on computational modeling and discovery:
 moves beyond description and prediction to explanatory models;
 uses domain knowledge to initialize and constrain search for
improved models;
 provides an interactive environment that lets the user specify
initial models and direct the revision process;
 presents the revised knowledge in some communicable notation
that is familiar to domain experts.
This approach holds great potential to aid the modeling of complex
systems in science and engineering.
The NPPc Portion of CASA
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)
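As a worked illustration, the following Python sketch evaluates the efficiency and evapotranspiration equations and sums NPPc over months. It reads the exponents in the usual way (Topt squared, AHI cubed and squared, the ratio raised to the power A, and e raised to the bracketed terms). Two choices are assumptions rather than part of the equations: Tempc = 0 is grouped with the PET = 0 case, and W falls back to 0.5 when PET is zero, following the WSNoEvapoTrans process in the carbon production model earlier in the deck. Function and argument names are hypothetical.

```python
import math


def temp_stress_1(topt):
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2


def temp_stress_2(topt, tempc):
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))


def pet(tempc, ahi, pet_tw_m):
    # The equations give PET = 0 for Tempc < 0; Tempc == 0 is grouped here (assumption).
    if tempc <= 0:
        return 0.0
    a = 0.00000068 * ahi ** 3 - 0.000077 * ahi ** 2 + 0.018 * ahi + 0.49
    return 1.6 * (10 * tempc / ahi) ** a * pet_tw_m


def water_stress(eet, pet_value):
    # Assumption: W = 0.5 when PET is zero, avoiding division by zero.
    if pet_value == 0:
        return 0.5
    return 0.5 + 0.5 * eet / pet_value


def efficiency(topt, tempc, eet, pet_value):
    return (0.56 * temp_stress_1(topt) * temp_stress_2(topt, tempc)
            * water_stress(eet, pet_value))


def ipar(sr_fas, sr_umd_veg, monthly_solar, sol_conver):
    fpar_fas = min((sr_fas - 1.08) / sr_umd_veg, 0.95)
    return 0.5 * fpar_fas * monthly_solar * sol_conver


def nppc(monthly_e, monthly_ipar):
    """Sum over months of max(E * IPAR, 0)."""
    return sum(max(e * i, 0.0) for e, i in zip(monthly_e, monthly_ipar))
```

Each function corresponds to one equation, so a change to a single term (for example, the 0.56 coefficient in E) stays local to one definition.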
The NPPc Portion of CASA
[Figure: dependency graph of the NPPc model, with nodes NPPc, E, e_max, W, A, PET, AHI, PETTWM, IPAR, T1, T2, EET, Tempc, SOLAR, Topt, SR, NDVI, FPAR, and VEG]
History of Research on
Computational Scientific Discovery
[Figure: timeline from 1979 to 2000 of computational discovery systems, with a legend distinguishing numeric laws, qualitative laws, structural models, and process models. Systems shown: Abacus, Coper, Bacon.1–Bacon.5, AM, Glauber, Dendral, Dalton, Stahl, Hume, ARC, DST, GPN, LaGrange, IDSQ, Live, NGlauber, Stahlp, Revolver, IE, Fahrenheit, E*, Tetrad, IDSN, Gell-Mann, BR-3, Mendel, RL, Progol, Pauli, Coast, Phineas, AbE, Kekada, SDS, HR, BR-4, Mechem, CDP, SSF, RF5, LaGramge, Astra, GPM]