Computational Discovery of Communicable Knowledge

Computational Discovery of
Explanatory Process Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, A. Pohorille, J. Sanchez, K. Saito, and
J. Shrager for their contributions to this research.
Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit
knowledge from data.
The data mining movement develops computational methods that:
 induce predictive models from large, often business, data sets;
 cast models as decision trees, logical rules, or other notations
invented by AI researchers.
In contrast, computational scientific discovery focuses on:
 constructing models from (often small) scientific data sets;
 stating them in formalisms invented by scientists and engineers.
Both approaches draw on heuristic search to find regularities in
data, but they differ considerably in their emphases.
In Memoriam
Three years ago, computational scientific discovery lost two of
its founding fathers:
 Herbert A. Simon (1916 – 2001)
 Jan M. Zytkow (1945 – 2001)
Both contributed to the field in many ways: posing new problems,
inventing methods, training students, and organizing meetings.
Moreover, both were interdisciplinary researchers who contributed
to computer science, psychology, philosophy, and statistics.
Herb Simon and Jan Zytkow were excellent role models whom we
should all aim to emulate.
Time Line for Research on
Computational Scientific Discovery
[Figure: timeline, 1979 to 2000, of computational scientific discovery
systems. The legend groups them by the kind of knowledge discovered:
numeric laws, qualitative laws, structural models, and process models.
Systems shown: Abacus, Coper, Bacon.1–Bacon.5, AM, Glauber, Dendral,
Dalton, Stahl, Hume, ARC, DST, GPN, LaGrange, IDSQ, Live, NGlauber,
Stahlp, Revolver, IE, Fahrenheit, E*, Tetrad, IDSN, Gell-Mann, BR-3,
Mendel, RL, Progol, Pauli, Coast, Phineas, AbE, Kekada, SDS, HR, BR-4,
Mechem, CDP, SSF, RF5, LaGramge, Astra, GPM.]
Successes of Computational Scientific Discovery
Over the past decade, systems of this type have helped discover
new knowledge in many scientific fields:
 qualitative chemical factors in mutagenesis (King et al., 1996)
 quantitative laws of metallic behavior (Sleeman et al., 1997)
 qualitative conjectures in number theory (Colton et al., 2000)
 temporal laws of ecological behavior (Todorovski et al., 2000)
 reaction pathways in catalytic chemistry (Valdes-Perez, 1994)
Each has led to publications in the refereed scientific literature
(e.g., Langley, 2000), but they did not focus on systems science.
The Nature of Systems Science
Disciplines like Earth science and computational biology differ
from traditional fields in that they:
 focus on synthesis rather than analysis in their operation;
 rely on computer modeling as one of their central methods;
 develop system-level models with many variables and relations;
 evaluate their models on observational, not experimental, data.
Developing and testing such models are complex tasks that would
benefit from computational aids.
However, existing methods for computational scientific discovery
were not designed with systems science in mind.
Observations from the Ross Sea
Inductive Process Modeling
Our response is to design, construct, and evaluate computational
methods for inductive process modeling, which:
 represent scientific models as sets of quantitative processes;
 use these models to predict and explain observational data;
 search a space of process models to find good candidates;
 utilize background knowledge to constrain this search.
This framework has great potential for aiding systems science,
but it raises new computational challenges.
Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
 process models characterize behavior of dynamical systems;
 variables are continuous but can have discontinuous behavior;
 observations are not independently and identically distributed;
 models may contain unobservable processes and variables;
 multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and
the availability of background knowledge.
Issue 1: Representing Scientific Models
To assist system scientists’ modeling efforts, we must first encode
candidate models that:
 address observational rather than experimental data;
 deal with dynamic systems that change over time;
 have an explanatory rather than a descriptive character;
 are causal in that they describe chains of effects;
 contain quantitative relations and qualitative structure.
We need some formal way to represent such models that can be
interpreted computationally.
Why Are Existing Formalisms Inadequate?
[Figure: four existing formalisms, each shown with a small example.]
regression trees
  (tree with tests B>6, C>0, and C>4 and leaf values 14.3, 18.7, 11.5, 16.9)
hidden Markov models
  (state diagram labeled with probabilities 0.7, 0.3, and 1.0)
Horn clause programs
  gcd(X,X,X).
  gcd(X,Y,D) :- X<Y, Z is Y-X, gcd(X,Z,D).
  gcd(X,Y,D) :- Y<X, gcd(Y,X,D).
systems of equations
  d[ice_mass,t] = −(18 · heat) / 6.02
  d[water_mass,t] = (18 · heat) / 6.02
A Process Model for an Aquatic Ecosystem
model Ross_Sea_Ecosystem
  variables: phyto, nitro, residue, light, growth_rate, effective_light, ice_factor
  observables: phyto, nitro, light, ice_factor
  process phyto_loss
    equations: d[phyto,t,1] = −0.1 · phyto
               d[residue,t,1] = 0.1 · phyto
  process phyto_growth
    equations: d[phyto,t,1] = growth_rate · phyto
  process phyto_uptakes_nitro
    conditions: nitro > 0
    equations: d[nitro,t,1] = −1 · 0.204 · growth_rate · phyto
  process growth_limitation
    equations: growth_rate = 0.23 · min(nitrate_rate, light_rate)
  process nitrate_availability
    equations: nitrate_rate = nitrate / (nitrate + 5)
  process light_availability
    equations: light_rate = effective_light / (effective_light + 50)
  process light_attenuation
    equations: effective_light = light · ice_factor
Advantages of Quantitative Process Models
Process models are a good target for discovery systems because:
 they embed quantitative relations within qualitative structure;
 they refer to notations and mechanisms familiar to scientists;
 they provide dynamical predictions of changes over time;
 they offer causal and explanatory accounts of phenomena;
 they retain the modularity needed to support induction.
Quantitative process models provide an important alternative to
formalisms used currently in computational discovery.
Issue 2: Generating Predictions and Explanations
To utilize or evaluate a given process model, we must simulate its
behavior over time:
 specify initial values for input variables and time step size;
 on each time step, determine which processes are active;
 solve active algebraic/differential equations with known values;
 propagate values and recursively solve other active equations;
 when multiple processes influence the same variable, assume
their effects are additive.
This performance method makes specific predictions that we can
compare to observations.
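The simulation loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles only differential equations (algebraic equations such as growth_rate are replaced by a fixed, hypothetical constant GROWTH_RATE), it uses a forward-Euler update, and the names Process, step, and simulate are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Process:
    name: str
    # variable name -> function returning that variable's rate of change
    derivatives: Dict[str, Callable[[State], float]]
    # the process is active only while this predicate holds
    condition: Callable[[State], bool] = lambda state: True


def step(state: State, processes: List[Process], dt: float) -> State:
    """Advance the state by one forward-Euler step of size dt."""
    rates: Dict[str, float] = {}
    for proc in processes:
        if not proc.condition(state):
            continue  # inactive processes contribute nothing on this step
        for var, rate in proc.derivatives.items():
            # effects of several processes on one variable are summed
            rates[var] = rates.get(var, 0.0) + rate(state)
    new_state = dict(state)
    for var, total in rates.items():
        new_state[var] = state[var] + dt * total
    return new_state


def simulate(initial: State, processes: List[Process],
             dt: float, n_steps: int) -> List[State]:
    """Return the list of states visited, starting with the initial one."""
    trajectory = [dict(initial)]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], processes, dt))
    return trajectory


GROWTH_RATE = 0.2  # hypothetical constant standing in for the algebraic equations

processes = [
    Process("phyto_loss", {
        "phyto": lambda s: -0.1 * s["phyto"],
        "residue": lambda s: 0.1 * s["phyto"],
    }),
    Process("phyto_growth", {"phyto": lambda s: GROWTH_RATE * s["phyto"]}),
    Process("phyto_uptakes_nitro",
            {"nitro": lambda s: -0.204 * GROWTH_RATE * s["phyto"]},
            condition=lambda s: s["nitro"] > 0),
]

trajectory = simulate({"phyto": 1.0, "nitro": 0.01, "residue": 0.0},
                      processes, dt=1.0, n_steps=2)
```

In this run the loss and growth processes both act on phyto, so their rates are added, and the uptake process switches off once nitro is no longer positive.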
Issue 3: Encoding Background Knowledge
To constrain candidate models, we can utilize available background
knowledge about the domain.
Previous work has encoded background knowledge in terms of:
 Horn clause programs (e.g., Towell & Shavlik, 1990)
 context-free grammars (e.g., Dzeroski & Todorovski, 1997)
 prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations are familiar to domain scientists,
which suggests the need for another approach.
Generic Processes as Background Knowledge
Our framework casts background knowledge as generic processes
that specify:
 the variables involved in a process and their types;
 the parameters appearing in a process and their ranges;
 the forms of conditions on the process; and
 the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose
a specific process model.
Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = −1 · α · S
             d[D,t,1] = α · S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π · D
             d[D,t,1] = −1 · π · D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ · ρ · S1
             d[D,t,1] = (1 − γ) · ρ · S1
             d[S2,t,1] = −1 · ρ · S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ · S
             d[N,t,1] = −1 · β · μ · S
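One way to picture how typed generic processes become specific ones is to enumerate the bindings of their variable roles to model variables of the matching type. The sketch below is illustrative only: the class GenericProcess, the function instantiate, the parameter names ("rate", "efficiency"), and the rule that one variable may fill at most one role in a process are all assumptions, not details taken from the slides.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GenericProcess:
    name: str
    roles: Tuple[Tuple[str, str], ...]                # (role name, required type)
    parameters: Tuple[Tuple[str, float, float], ...]  # (name, lower, upper)


def instantiate(generic: GenericProcess,
                variable_types: Dict[str, str]) -> List[Dict[str, str]]:
    """Return every role -> variable binding that respects the type constraints."""
    candidates = [
        [var for var, vtype in variable_types.items() if vtype == role_type]
        for _, role_type in generic.roles
    ]
    bindings = []
    for combo in product(*candidates):
        # assumption: a variable may fill at most one role in a process
        if len(set(combo)) == len(combo):
            roles = [role for role, _ in generic.roles]
            bindings.append(dict(zip(roles, combo)))
    return bindings


# Hypothetical variable typing for an aquatic ecosystem.
variable_types = {
    "phyto": "species",
    "zoo": "species",
    "nitro": "nutrient",
    "residue": "detritus",
}

grazing = GenericProcess(
    name="grazing",
    roles=(("S1", "species"), ("S2", "species"), ("D", "detritus")),
    parameters=(("rate", 0.0, 1.0), ("efficiency", 0.0, 1.0)),
)
nutrient_uptake = GenericProcess(
    name="nutrient_uptake",
    roles=(("S", "species"), ("N", "nutrient")),
    parameters=(("threshold", 0.0, float("inf")),),
)

grazing_bindings = instantiate(grazing, variable_types)
uptake_bindings = instantiate(nutrient_uptake, variable_types)
```

With two species, grazing can be bound in two ways (either species grazing on the other), and nutrient_uptake can be bound once per species.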
Issue 4: Inducing Process Models
[Figure: induction takes training data and generic processes as input
and produces a process model.]
generic processes
  process exponential_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] · P
  process logistic_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] · P · (1 − P / [0, 1, ∞])
  process constant_inflow
    variables: I {inorganic_nutrient}
    equations: d[I,t] = [0, 1, ∞]
  process consumption
    variables: P1 {population}, P2 {population}, nutrient_P2
    equations: d[P1,t] = [0, 1, ∞] · P1 · nutrient_P2,
               d[P2,t] = −[0, 1, ∞] · P1 · nutrient_P2
  process no_saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P
  process saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P / (P + [0, 1, ∞])
process model
  model AquaticEcosystem
    variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
    observables: nitro, phyto, zoo
    process phyto_exponential_growth
      equations: d[phyto,t] = 0.1 · phyto
    process zoo_logistic_growth
      equations: d[zoo,t] = 0.1 · zoo · (1 − zoo / 1.5)
    process phyto_nitro_consumption
      equations: d[nitro,t] = −1 · phyto · nutrient_nitro,
                 d[phyto,t] = 1 · phyto · nutrient_nitro
    process phyto_nitro_no_saturation
      equations: nutrient_nitro = nitro
    process zoo_phyto_consumption
      equations: d[phyto,t] = −1 · zoo · nutrient_phyto,
                 d[zoo,t] = 1 · zoo · nutrient_phyto
    process zoo_phyto_saturation
      equations: nutrient_phyto = phyto / (phyto + 0.5)
A Method for Process Model Induction
We have implemented the IPM algorithm, which induces process
models from generic components in four stages:
1. Find all ways to instantiate known generic processes with
specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models
subject to additional constraints (e.g., number of processes);
3. For each generic model, carry out search through parameter
space to find good coefficients;
4. Return the parameterized model with the best overall score.
The evaluation metric can be squared error or description length
(e.g., M_D = (M_V + M_C) · log(n) + n · log(M_E)).
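The description-length score M_D = (M_V + M_C) · log(n) + n · log(M_E) is easy to compute once its terms are fixed. The slide does not define them, so the sketch below assumes that M_V is the number of variables in the model, M_C the number of constants (parameters), n the number of observations, and M_E the mean squared error, and that the logarithm is natural. The function name description_length is hypothetical.

```python
import math
from typing import Sequence


def description_length(n_variables: int, n_constants: int,
                       errors: Sequence[float]) -> float:
    """Score a model: lower is better (assumed reading of the M_D formula)."""
    n = len(errors)
    mse = sum(e * e for e in errors) / n  # assumed meaning of M_E
    if mse <= 0.0:
        raise ValueError("log(M_E) is undefined for a perfect fit")
    return (n_variables + n_constants) * math.log(n) + n * math.log(mse)


# Two hypothetical models with identical residuals on 100 observations:
residuals = [1.0] * 100
simple_score = description_length(3, 2, residuals)
complex_score = description_length(5, 6, residuals)
```

When two structures fit equally well, the first term makes the one with fewer variables and constants score lower, which is how a criterion of this kind penalizes unnecessary processes.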
Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the
IPM algorithm:
1. Selects random initial values that fall within the ranges
specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt
method until it reaches a local optimum;
3. Generates new candidate values through random jumps along
dimensions of the parameter vector and continues the search;
4. Restarts the search from a new random initial point if no
improvement occurs after N jumps.
This multi-level method gives reasonable fits to time-series data
from a number of domains, but it is computationally intensive.
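The control flow of this multi-level search can be sketched as follows. To keep the example dependency-free, the local optimizer is a simple coordinate search with a shrinking step, used here as a stand-in for Levenberg-Marquardt; it is not the method named on the slide. The function names, the default settings, and the toy decay-rate problem are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]
Loss = Callable[[List[float]], float]


def local_improve(loss: Loss, x: List[float], bounds: Bounds,
                  step: float = 0.1, tol: float = 1e-6):
    """Coordinate search with a shrinking step (stand-in local optimizer)."""
    x = list(x)
    best = loss(x)
    while step > tol:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for delta in (step, -step):
                trial = list(x)
                trial[i] = min(hi, max(lo, x[i] + delta))
                value = loss(trial)
                if value < best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return x, best


def fit(loss: Loss, bounds: Bounds, n_restarts: int = 5,
        max_jumps: int = 10, seed: int = 0):
    """Random starts, local improvement, random jumps, and restarts."""
    rng = random.Random(seed)
    overall_x, overall = None, float("inf")
    for _ in range(n_restarts):
        # 1. random initial values within the declared ranges
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        # 2. improve to a local optimum
        x, best = local_improve(loss, x, bounds)
        failures = 0
        # 3. random jumps along single dimensions of the parameter vector
        while failures < max_jumps:
            trial = list(x)
            i = rng.randrange(len(bounds))
            trial[i] = rng.uniform(*bounds[i])
            trial, value = local_improve(loss, trial, bounds)
            if value < best:
                x, best, failures = trial, value, 0
            else:
                failures += 1
        # 4. max_jumps failures in a row: fall through to a fresh restart
        if best < overall:
            overall_x, overall = x, best
    return overall_x, overall


# Toy problem: recover the loss rate 0.1 in d[phyto,t] = -rate * phyto.
def decay_series(rate: float, n_steps: int = 20) -> List[float]:
    values = [1.0]
    for _ in range(n_steps):
        values.append(values[-1] - rate * values[-1])
    return values


observed = decay_series(0.1)


def squared_error(params: List[float]) -> float:
    predicted = decay_series(params[0])
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


best_params, best_error = fit(squared_error, bounds=[(0.0, 1.0)])
```

Every call to the loss function runs a full simulation, which is one reason a search of this shape is computationally intensive.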
More Issues in Process Model Induction
Inductive process modeling raises a number of issues that have
clear analogues in other paradigms:
 identifying conditions on component processes
 inferring initial values of unobservable variables
 keeping the structural search space tractable
 reducing variance to mitigate overfitting effects
We have demonstrated promising responses to these problems
within the IPM framework.
Evaluation of the IPM Algorithm
To demonstrate IPM's ability to induce process models, we ran it
on synthetic data for a known system:
1. We used the aquatic ecosystem model to generate data sets
over 100 time steps for the variables nitro and phyto;
2. We replaced each ‘true’ value x with x · (1 + r · n), where r
followed a Gaussian distribution (μ = 0, σ = 1) and n > 0;
3. We ran IPM on these noisy data, giving it type constraints and
generic processes as background knowledge.
In two experiments, we let IPM determine the initial values and
thresholds given the correct structure; in a third study, we let it
search through a space of 256 generic model structures.
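The noise model in step 2 is multiplicative: each value x becomes x · (1 + r · n) with r drawn from a standard Gaussian. A minimal sketch follows; the function name add_relative_noise and the example series are hypothetical, and treating n = 0.05 as "5% noise" is an assumption about how the noise levels in the experiments map onto n.

```python
import random
from typing import List, Optional, Sequence


def add_relative_noise(values: Sequence[float], noise_level: float,
                       seed: Optional[int] = None) -> List[float]:
    """Replace each x with x * (1 + r * n), r ~ Gaussian(mean 0, std 1)."""
    rng = random.Random(seed)
    return [x * (1.0 + rng.gauss(0.0, 1.0) * noise_level) for x in values]


clean = [1.0, 1.1, 1.21, 1.331]  # hypothetical noise-free series
noisy = add_relative_noise(clean, noise_level=0.05, seed=42)
```

Because the perturbation scales with x, larger values receive proportionally larger absolute noise.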
Experimental Results with IPM
The main results of our studies with IPM on synthetic data were:
1. The system infers accurate estimates for the initial values of
unobservable variables like zoo and residue;
2. The system induces estimates of condition thresholds on nitro
that are close to the target values; and
3. The MDL criterion selects the correct model structure in all
runs with 5% noise, but only 40% of runs with 10% noise.
These results suggest that the basic approach is sound, but that we
should consider more MDL schemes and other responses to overfitting.
Observations from the Ross Sea
Results on Training Data from Ross Sea
Results on Test Data from Ross Sea
Collecting Data on Photosynthetic Processes
[Figure: data collection setup. Labels: Continuous Culture (Chemostat);
External stimuli (e.g., light); Health of Culture; Time; Adaptation
Period; Equilibrium Period; Sampling mRNA/cDNA; Microarray; Trace.
Image sources: www.affymetrix.com/ and /wwwscience.murdoch.edu.au/teach]
Gene Expressions for Cyanobacteria
Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α · M
generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R
generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: β [−1, 1]
  equations: R = β · S
generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: β [−1, 1], δ [0, 1]
  equations: R = β · S
             d[S,t,1] = −1 · δ · S
generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: δ [0, 1]
  equations: d[C,t,1] = −1 · δ · C
generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: δ [0, 1]
  equations: d[D,t,1] = −1 · δ · E
             d[E,t,1] = −1 · δ · E
generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: ρ [0, 1], σ [0, 1]
  equations: d[R,t,1] = ρ · L · P
             d[S,t,1] = σ · L · P
A Process Model for Photosynthetic Regulation
model photo_regulation
  variables: light, mRNA, protein, ROS, redox, transcription_rate
  observables: light, mRNA
  process photosynthesis
    equations: d[redox,t,1] = 0.0155 · light · protein
               d[ROS,t,1] = 0.019 · light · protein
  process protein_translation
    equations: d[protein,t,1] = 7.54 · mRNA
  process mRNA_transcription
    equations: d[mRNA,t,1] = transcription_rate
  process regulate_one_1
    equations: transcription_rate = 0.99 · light
  process regulate_two_2
    equations: transcription_rate = 1.203 · redox
               d[redox,t,1] = −0.0002 · redox
  process automatic_degradation_1
    conditions: protein > 0
    equations: d[protein,t,1] = −1.91 · protein
  process controlled_degradation_1
    conditions: redox > 0, ROS > 0
    equations: d[redox,t,1] = −0.0003 · ROS
               d[ROS,t,1] = −0.0003 · ROS
Predictions from Best Parameterized Model
Electric Power on the International Space Station
Results on Battery Test Data
Results on Data from Ringkøbing Fjord
Issue 5: Interfacing with Scientists
Because few scientists want to be replaced, we are developing an
interactive environment that lets users:
 specify a quantitative process model of the target system;
 display and edit the model’s structure and details graphically;
 simulate the model’s behavior over time and situations;
 compare the model’s predicted behavior to observations;
 invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and
evaluating models but lets the user retain control.
Viewing and Editing a Process Model
Results of Revising the NPP Model
Initial model:
E = 0.56 · T1 · T2 · W
T2 = 1.18 / [(1 + e^(0.2 · (Topt − Tempc − 10))) · (1 + e^(0.3 · (Tempc − Topt − 10)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
RMSE on training data = 465.212 and r² = 0.799
Revised model:
• E = 0.353 · T1^0.00 · T2^0.08 · W^0.00
• T2 = 0.83 / [(1 + e^(1.0 · (Topt − Tempc − 6.34))) · (1 + e^(1.0 · (Tempc − Topt − 11.52)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
• SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
Cross-validated RMSE = 397.306 and r² = 0.853 [15% reduction]
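The quoted reduction can be checked directly from the two RMSE figures on this slide. The short sketch below shows the arithmetic along with the usual definition of RMSE; the function names and the three-point example are hypothetical. Note that the slide compares a training-data RMSE with a cross-validated one.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length series."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error measure dropped."""
    return (before - after) / before


example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # hypothetical data
reduction = relative_reduction(465.212, 397.306)
print(f"{100 * reduction:.1f}% reduction")  # about 14.6%, reported as 15%
```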
Intellectual Influences
Our approach to computational discovery incorporates ideas from
many traditions:
 computational scientific discovery (e.g., Langley et al., 1983);
 theory revision in machine learning (e.g., Towell, 1991);
 qualitative physics and simulation (e.g., Forbus, 1984);
 languages for scientific simulation (e.g., STELLA, MATLAB);
 interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning,
AI, programming languages, and human-computer interaction.
Contributions of the Research
In summary, our work on computational scientific discovery has, in
responding to various challenges, produced:
 a new formalism for representing scientific process models;
 a computational method for simulating these models’ behavior;
 an encoding for background knowledge as generic processes;
 an algorithm for inducing process models from time-series data;
 an interactive environment for model construction/utilization.
We have demonstrated this approach to model creation on domains
from Earth science, microbiology, and engineering.
Directions for Future Research
Despite our progress to date, we need further work in order to:
 produce additional results on other scientific data sets
 develop improved methods for fitting model parameters
 extend the approach to handle data sets with missing values
 implement heuristic methods for searching the structure space
 utilize knowledge of subsystems to further constrain search
 augment the modeling environment to make it more usable
Inductive process modeling has great potential to speed progress
in systems science and engineering.
End of Presentation