Computational Discovery of Communicable Knowledge


Computational Revision of
Ecological Process Models
Nima Asgharbeygi, Pat Langley, Stephen Bay
Center for the Study of Language and Information
Stanford University
Kevin Arrigo
Department of Geophysics
Stanford University
Thanks to S. Dzeroski, J. Sanchez, K. Saito, J. Shrager, and L. Todorovski for their
contributions to this research, which is funded by the US National Science Foundation.
Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit
knowledge from data.
The data mining movement develops computational methods that:
• induce predictive models from large (often business) data sets;
• represent models in notations invented by AI researchers.
In contrast, computational scientific discovery focuses on:
• constructing models from (often small) scientific data sets;
• stated in formalisms invented by scientists themselves.
This talk focuses on applications of the second framework to
environmental and ecosystem modeling.
Observations from the Ross Sea
A Model of Ross Sea Ecosystem
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
Inductive Revision of Ecosystem Models
[Diagram: observations and an initial model feed into Revision, which outputs a revised model.]
initial model:
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
revised model:
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
A Space of Ecosystem Models
[Figure: the space of candidate ecosystem models]
Model revision requires ways to constrain search through this space.
Phytoplankton Loss in Ross Sea Ecosystem
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
Phytoplankton loss is a process that affects two variables; no model
should include one influence without the other.
Grazing in the Ross Sea Ecosystem
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
We can view an ecosystem model as a set of processes that provide
an alternative way to encode its assumptions.
Process Model of Ross Sea Ecosystem
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
process phyto_loss
  equations: d[phyto,t,1] = -0.307 * phyto
             d[residue,t,1] = 0.307 * phyto
process zoo_loss
  equations: d[zoo,t,1] = -0.251 * zoo
             d[residue,t,1] = 0.251 * zoo
process zoo_phyto_grazing
  equations: d[zoo,t,1] = 0.615 * 0.495 * zoo
             d[residue,t,1] = 0.385 * 0.495 * zoo
             d[phyto,t,1] = -0.495 * zoo
process nitro_uptake
  equations: d[phyto,t,1] = 0.411 * phyto
             d[nitro,t,1] = -0.098 * 0.411 * phyto
process nitro_remineralization
  equations: d[nitro,t,1] = 0.005 * residue
             d[residue,t,1] = -0.005 * residue
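The process view and the equation view encode the same model. A minimal Python sketch of that correspondence: each process contributes terms to one or more derivatives, and summing the contributions per variable recovers the four differential equations. The dictionary layout and the helper name combine are illustrative assumptions, not part of the modeling language.

```python
from collections import defaultdict

# Hypothetical encoding for this sketch:
# process -> affected variable -> list of (coefficient, driving variable).
PROCESSES = {
    "phyto_loss": {"phyto": [(-0.307, "phyto")], "residue": [(0.307, "phyto")]},
    "zoo_loss": {"zoo": [(-0.251, "zoo")], "residue": [(0.251, "zoo")]},
    "zoo_phyto_grazing": {
        "zoo": [(0.615 * 0.495, "zoo")],
        "residue": [(0.385 * 0.495, "zoo")],
        "phyto": [(-0.495, "zoo")],
    },
    "nitro_uptake": {
        "phyto": [(0.411, "phyto")],
        "nitro": [(-0.098 * 0.411, "phyto")],
    },
    "nitro_remineralization": {
        "nitro": [(0.005, "residue")],
        "residue": [(-0.005, "residue")],
    },
}


def combine(processes):
    """Add up the influences of all processes into net coefficients per variable."""
    net = defaultdict(lambda: defaultdict(float))
    for terms_by_target in processes.values():
        for target, terms in terms_by_target.items():
            for coefficient, driver in terms:
                net[target][driver] += coefficient
    return {target: dict(drivers) for target, drivers in net.items()}


NET = combine(PROCESSES)
for target in ("phyto", "zoo", "residue", "nitro"):
    terms = " + ".join(f"{c:.4f} * {v}" for v, c in sorted(NET[target].items()))
    print(f"d[{target},t,1] = {terms}")
```

Because every influence belongs to a process, deleting a process removes all of its terms at once, which is the constraint the preceding slides motivate.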
Inductive Revision of Process Models
[Diagram: observations, an initial model, and generic processes feed into Revision, which outputs a revised model.]
initial model:
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
generic processes:
process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P
process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P * (1 - P / [0, 1, ∞])
process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]
process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] * P1 * nutrient_P2,
             d[P2,t] = -[0, 1, ∞] * P1 * nutrient_P2
process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P
process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])
revised model:
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue
observables: phyto, nitro
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S
A Method for Process Model Revision
We have implemented RPM, an algorithm that revises an initial
process model in four main stages:
1. Find all ways to instantiate available generic processes with
specific variables, subject to type constraints;
2. Generate candidate model structures by deleting the current
processes and adding new ones, subject to complexity limits;
3. For each generic model, carry out search through parameter
space to find good coefficients [difficult];
4. Return a list of revised models ordered by their overall scores.
The evaluation metric can be squared error or description length
based on error and distance from the initial model.
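Stage 2 and the evaluation metric can be sketched in a few lines of Python. This is a simplified illustration, not the RPM implementation: the process names, the limits on deletions and additions, and the additive penalty standing in for the description-length score are all assumptions made for the example.

```python
from itertools import combinations


def candidate_structures(initial, available, max_deletions, max_additions):
    """Enumerate structures reachable by deleting and adding a bounded number of processes."""
    initial = frozenset(initial)
    removable = sorted(initial)
    addable = sorted(set(available) - initial)
    for n_removed in range(max_deletions + 1):
        for removed in combinations(removable, n_removed):
            for n_added in range(max_additions + 1):
                for added in combinations(addable, n_added):
                    yield (initial - set(removed)) | set(added)


def revision_distance(structure, initial):
    """Number of processes added or deleted relative to the initial model."""
    return len(set(structure) ^ set(initial))


def score(squared_error, structure, initial, weight=1.0):
    """Assumed stand-in for the metric: error plus a penalty for distance from the initial model."""
    return squared_error + weight * revision_distance(structure, initial)


# Hypothetical process instances, named only for this example.
INITIAL = {"phyto_loss", "zoo_loss", "zoo_phyto_grazing"}
AVAILABLE = INITIAL | {"nitro_uptake", "nitro_remineralization"}

for structure in candidate_structures(INITIAL, AVAILABLE, max_deletions=1, max_additions=1):
    print(sorted(structure), revision_distance(structure, INITIAL))
```

Bounding the number of deletions and additions is one simple way to impose the complexity limits mentioned in stage 2; each structure would then go to parameter search in stage 3 before being scored.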
Observations from the Ross Sea
Revised Model of Ross Sea Ecosystem
model RossSeaEcosystem
variables: phyto, zoo, nitro, residue, light, G, growth_rate, nitro_rate, light_rate
observables: phyto, nitro, light
d[phyto,t,1] = -0.307 * phyto - G * zoo + growth_rate * phyto
d[zoo,t,1] = 0.615 * G * zoo
d[residue,t,1] = 0.307 * phyto + 0.385 * G * zoo - 0.083 * residue
d[nitro,t,1] = -1 * n_to_c * growth_rate * phyto + 0.083 * n_to_c * residue
G = 0.415 * (1 - exp(-1 * 0.27 * phyto))
growth_rate = r_max * min(nitro_rate, light_rate)
nitro_rate = nitro / (nitro + 4.33)
light_rate = light / (light + 11.67)
n_to_c = 0.251, r_max = 0.194, remin_rate = 0.0676
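The algebraic equations of the revised model are easy to evaluate directly. A small Python sketch using the constants from the slide; the function and argument names (saturation, k_nitro, k_light, g_max, g_slope) are hypothetical labels introduced here, and the closing parenthesis in the equation for G is assumed.

```python
import math


def saturation(amount, half_saturation):
    """Saturating response x / (x + k); equals 0.5 when x == k."""
    return amount / (amount + half_saturation)


def growth_rate(nitro, light, r_max=0.194, k_nitro=4.33, k_light=11.67):
    """Growth is limited by whichever resource is scarcer (the min term)."""
    nitro_rate = saturation(nitro, k_nitro)
    light_rate = saturation(light, k_light)
    return r_max * min(nitro_rate, light_rate)


def grazing_rate(phyto, g_max=0.415, g_slope=0.27):
    """G = 0.415 * (1 - exp(-0.27 * phyto)); rises from 0 toward g_max."""
    return g_max * (1.0 - math.exp(-g_slope * phyto))


# Example values chosen arbitrarily for illustration.
print(growth_rate(nitro=4.33, light=100.0))
print(grazing_rate(phyto=2.0))
```

With nitro equal to its half-saturation constant and light abundant, nitrate is the limiting factor and growth runs at half of r_max.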
Initial Results on Ross Sea Training Data
The best revised model reproduces the observations quite well.
Initial Results on Ross Sea Test Data
But the model predicts nearly the same behavior for both years.
Revised Results on Ross Sea Test Data
Refitting initial values for zooplankton gives better generalization.
Results on Data from Protist Study
Results on Data from Ringkøbing Fjord
Interfacing with Scientists
Because few scientists want to be replaced, we are developing
PROMETHEUS, an interactive environment that lets users:
• specify a quantitative process model of the target system;
• display and edit the model’s structure and details graphically;
• simulate the model’s behavior over time and situations;
• compare the model’s predicted behavior to observations;
• invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and
evaluating models but lets the user retain control.
Viewing and Editing a Process Model
Intellectual Influences
Our approach to computational discovery incorporates ideas from
many traditions:
• computational scientific discovery (e.g., Langley et al., 1983);
• theory revision in machine learning (e.g., Towell, 1991);
• qualitative physics and simulation (e.g., Forbus, 1984);
• languages for scientific simulation (e.g., STELLA, MATLAB);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines ideas from machine learning, AI, programming
languages, and human-computer interaction.
Directions for Future Research
Despite our progress to date, we need further work in order to:
• produce additional results on other ecosystem modeling tasks;
• develop improved methods for fitting model parameters;
• implement heuristic methods for searching the structure space;
• utilize knowledge of subsystems to further constrain search;
• augment the modeling environment to make it more usable.
Process modeling has great potential to aid model development
in environmental science.
Contributions of the Research
In summary, our work on computational discovery has produced:
• a new formalism for representing scientific process models;
• an encoding for background knowledge as generic processes;
• an algorithm for revising process models with time-series data;
• an interactive environment for model construction/utilization.
We have demonstrated this approach to model revision on both
ecosystem modeling and an environmental domain.
The PROMETHEUS modeling/revision environment is available at:
http://www.isle.org/process.html
End of Presentation
The Challenge of Systems Science
Disciplines like Earth science differ from traditional disciplines by:
• focusing on synthesis rather than analysis in their operation;
• using computer modeling as one of their central methods;
• developing system-level models with many variables / relations;
• evaluating models on observational, not experimental, data.
Constructing such models is a complex task that would benefit
from computational aids, but existing methods are insufficient.
Why Are Process Models Interesting?
Process models are a crucial target for machine learning because:
• they incorporate scientific formalisms rather than AI notations;
• that are easily communicable to scientists and engineers;
• they move beyond descriptive generalization to explanation;
• while retaining the modularity needed to support induction.
These reasons point to process models as an ideal representation
for scientific and engineering knowledge.
Process models are an important alternative to formalisms used
currently in machine learning.
Advantages of Quantitative Process Models
Process models offer scientists a promising framework because:
• they embed quantitative relations within qualitative structure;
• that refer to notations and mechanisms familiar to experts;
• they provide dynamical predictions of changes over time;
• they offer causal and explanatory accounts of phenomena;
• while retaining the modularity needed to support induction.
Quantitative process models provide an important alternative to
formalisms used currently in ecosystem modeling.
Inductive Process Modeling
Our response is to design, construct, and evaluate computational
methods for inductive process modeling, which:
• represent scientific models as sets of quantitative processes;
• use these models to predict and explain observational data;
• search a space of process models to find good candidates;
• utilize background knowledge to constrain this search.
This framework has great potential to aid environmental science,
but it raises new computational challenges.
Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
• process models characterize behavior of dynamical systems;
• variables are continuous but can have discontinuous behavior;
• observations are not independently and identically distributed;
• models may contain unobservable processes and variables;
• multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and
the availability of background knowledge.
Generating Predictions and Explanations
To utilize or evaluate a given process model, we must simulate its
behavior over time:
• specify initial values for input variables and time step size;
• on each time step, determine which processes are active;
• solve active algebraic/differential equations with known values;
• propagate values and recursively solve other active equations;
• when multiple processes influence the same variable, assume
  their effects are additive.
This performance method makes specific predictions that we can
compare to observations.
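The simulation loop can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a fixed-step forward Euler update, handles only differential equations (algebraic equations and their propagation are omitted), and the Process class, its field names, and the residue > 0 condition are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Process:
    name: str
    condition: Callable[[State], bool]  # is the process active in this state?
    rates: Callable[[State], State]     # its contributions to d[var,t,1]


def step(state: State, processes: List[Process], dt: float) -> State:
    """Advance one time step, adding up the effects of all active processes."""
    totals = {var: 0.0 for var in state}
    for process in processes:
        if process.condition(state):
            for var, rate in process.rates(state).items():
                totals[var] += rate  # influences on the same variable are additive
    return {var: state[var] + dt * totals[var] for var in state}


def simulate(initial: State, processes: List[Process], dt: float, n_steps: int) -> List[State]:
    """Return the trajectory of states, starting with the initial one."""
    trajectory = [dict(initial)]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], processes, dt))
    return trajectory


phyto_loss = Process(
    "phyto_loss",
    lambda s: True,
    lambda s: {"phyto": -0.307 * s["phyto"], "residue": 0.307 * s["phyto"]},
)
remineralization = Process(
    "nitro_remineralization",
    lambda s: s["residue"] > 0,  # assumed condition, for illustration only
    lambda s: {"nitro": 0.005 * s["residue"], "residue": -0.005 * s["residue"]},
)

trajectory = simulate(
    {"phyto": 1.0, "nitro": 0.0, "residue": 0.0},
    [phyto_loss, remineralization],
    dt=0.1,
    n_steps=50,
)
print(trajectory[-1])
```

The resulting trajectory is what gets compared with the observed time series when a model is evaluated.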
Generic Processes as Background Knowledge
Our framework casts background knowledge as generic processes
that specify:
• the variables involved in a process and their types;
• the parameters appearing in a process and their ranges;
• the forms of conditions on the process; and
• the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose
a specific process model.
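Composing a model starts by binding the typed variables of each generic process to variables of the target system. A short Python sketch of that step, assuming the Ross Sea variables are typed as shown (phyto and zoo as species, nitro as nutrient, residue as detritus); the data layout and the function name instantiate are illustrative.

```python
from itertools import permutations

# Variable slots of each generic process, listed by required type.
GENERIC_PROCESSES = {
    "exponential_loss": ["species", "detritus"],
    "remineralization": ["nutrient", "detritus"],
    "grazing": ["species", "species", "detritus"],
    "constant_inflow": ["nutrient"],
    "nutrient_uptake": ["species", "nutrient"],
}

# Assumed typing of the model variables for this example.
VARIABLE_TYPES = {
    "phyto": "species",
    "zoo": "species",
    "nitro": "nutrient",
    "residue": "detritus",
}


def instantiate(generic_processes, variable_types):
    """Return every (process, binding) whose variables satisfy the type constraints."""
    instances = []
    for name, slot_types in generic_processes.items():
        for binding in permutations(variable_types, len(slot_types)):
            if all(variable_types[v] == t for v, t in zip(binding, slot_types)):
                instances.append((name, binding))
    return instances


for name, binding in instantiate(GENERIC_PROCESSES, VARIABLE_TYPES):
    print(name, binding)
```

Type constraints prune most bindings: of all ordered variable tuples, only eight instances survive here, including grazing in both directions between phyto and zoo.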
Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the
IPM algorithm:
1. Selects random initial values that fall within ranges specified
in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt
method until it reaches a local optimum;
3. Generates new candidate values through random jumps along
dimensions of the parameter vector and continues the search;
4. If no improvement occurs after N jumps, it restarts the search
from a new random initial point.
This multi-level method gives reasonable fits to time-series data
from a number of domains, but it is computationally intensive.
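The control structure of this search can be sketched in pure Python. Note the substitution: to keep the example dependency-free, the local improvement step below is a simple coordinate search with a shrinking step, standing in for the Levenberg-Marquardt method. The toy fitting task (an exponential decay with two parameters), the parameter ranges, and all function names are assumptions made for illustration.

```python
import math
import random


def make_loss(times, observed):
    """Squared error of x(t) = x0 * exp(-rate * t) against the observations."""
    def loss(params):
        x0, rate = params
        return sum((x0 * math.exp(-rate * t) - y) ** 2 for t, y in zip(times, observed))
    return loss


def local_search(params, loss, ranges, step=0.1, tol=1e-6):
    """Stand-in for Levenberg-Marquardt: coordinate search with a shrinking step."""
    params, best = list(params), loss(params)
    while step > tol:
        improved = False
        for i, (low, high) in enumerate(ranges):
            for delta in (step, -step):
                trial = list(params)
                trial[i] = min(high, max(low, trial[i] + delta))
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return params, best


def estimate(loss, ranges, restarts=3, max_jumps=5, seed=0):
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(restarts):
        # 1. Random initial values within the specified ranges.
        start = [rng.uniform(low, high) for low, high in ranges]
        # 2. Local improvement until no further progress.
        params, value = local_search(start, loss, ranges)
        failures = 0
        # 3. Random jumps along one dimension, then continue the search.
        # 4. After max_jumps consecutive failures, restart from a new point.
        while failures < max_jumps:
            jumped = list(params)
            i = rng.randrange(len(ranges))
            jumped[i] = rng.uniform(*ranges[i])
            candidate, candidate_value = local_search(jumped, loss, ranges)
            if candidate_value < value - 1e-9:
                params, value, failures = candidate, candidate_value, 0
            else:
                failures += 1
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


TIMES = list(range(10))
OBSERVED = [2.0 * math.exp(-0.307 * t) for t in TIMES]  # synthetic, noise-free data
RANGES = [(0.0, 5.0), (0.0, 1.0)]
params, error = estimate(make_loss(TIMES, OBSERVED), RANGES)
print(params, error)
```

In a process model the loss would come from simulating the model and comparing it with the time series, which is why each evaluation, and hence the whole search, is computationally intensive.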
A Process Model for an Aquatic Ecosystem
model Ross_Sea_Ecosystem
variables: phyto, nitro, residue, light, growth_rate, effective_light, ice_factor
observables: phyto, nitro, light, ice_factor
process phyto_loss
  equations: d[phyto,t,1] = -0.1 * phyto
             d[residue,t,1] = 0.1 * phyto
process phyto_growth
  equations: d[phyto,t,1] = growth_rate * phyto
process phyto_uptakes_nitro
  conditions: nitro > 0
  equations: d[nitro,t,1] = -1 * 0.204 * growth_rate * phyto
process growth_limitation
  equations: growth_rate = 0.23 * min(nitrate_rate, light_rate)
process nitrate_availability
  equations: nitrate_rate = nitrate / (nitrate + 5)
process light_availability
  equations: light_rate = effective_light / (effective_light + 50)
process light_attenuation
  equations: effective_light = light * ice_factor
Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S
Inductive Process Modeling
[Diagram: training data and generic processes feed into Induction, which outputs a process model.]
generic processes:
process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P
process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P * (1 - P / [0, 1, ∞])
process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]
process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] * P1 * nutrient_P2,
             d[P2,t] = -[0, 1, ∞] * P1 * nutrient_P2
process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P
process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])
process model:
model AquaticEcosystem
variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
observables: nitro, phyto, zoo
process phyto_exponential_growth
  equations: d[phyto,t] = 0.1 * phyto
process zoo_logistic_growth
  equations: d[zoo,t] = 0.1 * zoo / (1 - zoo / 1.5)
process phyto_nitro_consumption
  equations: d[nitro,t] = -1 * phyto * nutrient_nitro,
             d[phyto,t] = 1 * phyto * nutrient_nitro
process phyto_nitro_no_saturation
  equations: nutrient_nitro = nitro
process zoo_phyto_consumption
  equations: d[phyto,t] = -1 * zoo * nutrient_phyto,
             d[zoo,t] = 1 * zoo * nutrient_phyto
process zoo_phyto_saturation
  equations: nutrient_phyto = phyto / (phyto + 0.5)
The NPPc Portion of CASA
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)
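A short Python sketch of the efficiency term E and the monthly sum, to make the structure of these equations concrete. It assumes the exponents in T2 apply to the full parenthesized expressions and that the squared term in T1 is Topt squared; the function names are illustrative, and the input values in the example are arbitrary rather than taken from the CASA data.

```python
import math


def t1(topt):
    """Temperature stress term T1, a quadratic in the optimal temperature."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2


def t2(topt, tempc):
    """Temperature stress term T2, penalizing departures from the optimum."""
    return 1.18 / (
        (1.0 + math.exp(0.2 * (topt - tempc - 10.0)))
        * (1.0 + math.exp(0.3 * (tempc - topt - 10.0)))
    )


def w(eet, pet):
    """Water stress term W; requires pet > 0 (PET is 0 when Tempc < 0)."""
    return 0.5 + 0.5 * eet / pet


def efficiency(topt, tempc, eet, pet):
    """E = 0.56 * T1 * T2 * W."""
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet)


def nppc(monthly_e, monthly_ipar):
    """Sum over months of max(E * IPAR, 0)."""
    return sum(max(e * ipar, 0.0) for e, ipar in zip(monthly_e, monthly_ipar))


# Arbitrary illustrative inputs: temperature at its optimum, no water stress.
print(efficiency(topt=20.0, tempc=20.0, eet=3.0, pet=3.0))
```

The constants 0.56, 1.18, 0.2, 0.3 and the two offsets of 10 are the kind of parameters that the revision results on the next slide adjust.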
Results of Revising the NPP Model
Initial model:
E = 0.56 · T1 · T2 · W
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
RMSE on training data = 465.212 and r^2 = 0.799
Revised model:
• E = 0.353 · T1^0.00 · T2^0.08 · W^0.00
• T2 = 0.83 / [(1 + e^(1.0 · (Topt – Tempc – 6.34))) · (1 + e^(1.0 · (Tempc – Topt – 11.52)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
• SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
Cross-validated RMSE = 397.306 and r^2 = 0.853 [15% reduction]
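The quoted reduction can be checked from the two RMSE values on this slide. A small Python calculation, assuming the figure compares the initial model's training RMSE with the revised model's cross-validated RMSE:

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


reduction = relative_reduction(465.212, 397.306)
print(f"{reduction:.1%}")  # prints 14.6%
```

The slide rounds this to 15 %.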
Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α * M
generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R
generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: β [-1, 1]
  equations: R = β * S
generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: μ [-1, 1], δ [0, 1]
  equations: R = μ * S
             d[S,t,1] = -1 * δ * S
generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: α [0, 1]
  equations: d[C,t,1] = -1 * α * C
generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: ω [0, 1]
  equations: d[D,t,1] = -1 * ω * E
             d[E,t,1] = -1 * ω * E
generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: α [0, 1], β [0, 1]
  equations: d[R,t,1] = α * L * P
             d[S,t,1] = β * L * P
A Process Model for Photosynthetic Regulation
model photo_regulation
variables: light, mRNA, protein, ROS, redox, transcription_rate
observables: light, mRNA
process photosynthesis
  equations: d[redox,t,1] = 0.0155 * light * protein
             d[ROS,t,1] = 0.019 * light * protein
process protein_translation
  equations: d[protein,t,1] = 7.54 * mRNA
process mRNA_transcription
  equations: d[mRNA,t,1] = transcription_rate
process regulate_one_1
  equations: transcription_rate = 0.99 * light
process regulate_two_2
  equations: transcription_rate = 1.203 * redox
             d[redox,t,1] = -0.0002 * redox
process automatic_degradation_1
  conditions: protein > 0
  equations: d[protein,t,1] = -1.91 * protein
process controlled_degradation_1
  conditions: redox > 0, ROS > 0
  equations: d[redox,t,1] = -0.0003 * ROS
             d[ROS,t,1] = -0.0003 * ROS