
Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem

Lecture 1

01/08/2008 Sven Zenker

Schedule/Course overview (Part 1)

01/08/2008  Introduction/Basics: Problem setup, maximum likelihood estimation and nonlinear programming (NLP), Technical considerations
01/15/2008  Non-linear least squares as NLP, approaches to NLP 1: unconstrained minimization
01/22/2008  Approaches to NLP 2: constraints, duality
01/29/2008  NLP 3: Modern techniques: non-monotonous globalization; Global optimization
02/05/2008  (no class)
02/12/2008  ODE-specific methods 1: Multiple shooting
02/19/2008  ODE-specific methods 2: Collocation
02/26/2008  Iterative estimators, generalized smoothing
03/04/2008  The stochastic viewpoint revisited: philosophy and mathematical formulation, Bayesian and frequentist viewpoints, review of probability basics

Schedule/Course overview (Part 2)

03/11/2008  (no class, Spring Recess)
03/18/2008  Numerical representation of probability distributions and operations on them, density estimation, bandwidth adaptation as an optimization problem
03/25/2008  Sampling 1: Monte Carlo, Rejection sampling, Random walks: MCMC: Metropolis Hastings and its variations
04/01/2008  MCMC diagnostics, Sampling 2: Importance sampling (IF) and its relatives, particle filters
04/08/2008  "Population Monte Carlo", mixing MCMC and IF, Hybrid Monte Carlo, Langevin methods
04/15/2008  Utilizing the samples: model selection criteria, sampling the model space: reversible jump MCMC
04/22/2008  Summary & Outlook

Literature

Nonlinear Programming. Dimitri P. Bertsekas, Athena Scientific, 1999.

Inverse Problem Theory and Model Parameter Estimation. Albert Tarantola, SIAM, 2005.

+ various other sources

General framework: steps in developing mathematical models of the physical world

1. Model development and parametrization
2. Solution of the forward problem
3. Solution of the inverse problem

Typically, this will be iteratively repeated.

Interaction between experiment and mathematical model

Experimental models of all kinds supply data for the identification of relationships and for validation or falsification (Scientific Method!); mathematical models in turn supply hypotheses, predictions, and the realization of experimentally not realizable scenarios.

The model development process yields

$M : X \subseteq \mathbb{R}^{n} \to Y \subseteq \mathbb{R}^{m}, \quad x \mapsto y = M(x)$

where M is the (deterministic) nonlinear model, X the parameter or model space, and Y the data or observation space.

Terminology: Forward & inverse problem

The model maps parameter vectors x in a model-dependent parameter space X to vectors of observable quantities y in the observation space Y. This “prediction” of observations is termed the solution of the forward problem.

Terminology: Forward & inverse problem

Inferring parameter values from observations is termed the solution of the inverse problem.

Stochasticity?

Although we will, for the purpose of this course, restrict ourselves to deterministic models, the presence of measurement error may still necessitate stochastic considerations.

Sources of uncertainty in the forward and inverse problems, a more complete picture

“Forward”/“Prediction”: a single state vector (the quantitative representation of the system: system states and parameters) is mapped by the mathematical model of the system to measurement results; measurement error and model stochasticity (if present) introduce uncertainty, so the prediction is a probability density function on measurement space.

“Inverse”/“Inference”: a single measurement vector is mapped back to a probability density function on state and parameter space; measurement error, model stochasticity, and ill-posedness introduce uncertainty.

Mathematical models: classification by formulation of forward problem (1)

•Closed form solution
•The convenient situation
•Rarely the case for practically relevant scenarios

Mathematical models: classification by formulation of forward problem (2)

•Solution available only by numerical approximation
•Special case: Model is described by a system of non-linear differential equations
•Special case: Solution of nonlinear algebraic equations
•Other common scenarios: DDEs, PDEs, solutions of optimization problems, etc.

Approach for part 1 of this course: Nonlinear programming formulation

Turn the problem of finding the parameter vector x from the observation vector y into a nonlinear programming (NLP) problem, that is, an optimization problem.

Nonlinear programming formulation

Advantages:

•Relatively simple formulation
•Excellent (even real-time) performance possible through exploitation of derivative information
•Relatively straightforward extensions to optimal experimental design and control possible

Nonlinear programming formulation

Disadvantages:

•Strong assumptions required
•These are usually met partially at best
•Whether the failure to meet the assumptions is practically relevant in a given scenario is not always evident from the results

From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

Basic idea: Given
•a model and a set of parameters specifying it completely
•the distribution of measurement error
•a set of experimental observations
we can compute the probability of the data being observed given the model specification and identify this with the likelihood of the given parameter set.

From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

The deterministic model $y = M(x)$ and the measurement error distribution together yield a parametrized family of conditional probability distributions on observation space Y (described, in the continuous case, by probability density functions $f(y \mid x)$).

The likelihood function is the above interpreted as a function of the 2nd argument, x: $L(x) = f(\tilde{y} \mid x)$.

From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE)

The likelihood function is the above interpreted as a function of the 2nd argument, x,

$L(x) = f(\tilde{y} \mid x),$

for a fixed observation $\tilde{y}$ representing our data. For k independent observations we have

$L(x) = \prod_{i=1}^{k} f(\tilde{y}_i \mid x).$

From the inverse problem to a nonlinear program: The maximum likelihood estimate (MLE) Note that L(x) does NOT define a probability density function (PDF) on parameter space (or a probability distribution in the discrete case).

To obtain a bona fide PDF, we need to apply Bayes’ theorem (for PDFs):

$f(x \mid \tilde{y}) = \frac{f(\tilde{y} \mid x)\,\pi(x)}{\int_X f(\tilde{y} \mid x')\,\pi(x')\,dx'}$

which requires knowledge (or assumption) of a prior probability density $\pi(x)$ on parameter space.

From the inverse problem to a nonlinear program: The actual program

The maximum likelihood estimate (MLE) is now found by computing

$x_{MLE} = \arg\max_{x} L(x).$

If prior information on the parameter distribution is available, the maximum a posteriori estimate (MAP) can be computed as

$x_{MAP} = \arg\max_{x} L(x)\,\pi(x),$

since the integral term in the Bayes’ expression is independent of x and thus does not play a role in the optimization.
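In practice, these arg max problems are solved numerically, usually by minimizing the negative (log-)likelihood. As a minimal sketch (illustrative data and a trivial constant-mean "model" that are my own assumptions, not lecture code), the MLE of a Gaussian mean and standard deviation can be found with a generic derivative-free optimizer:

```matlab
% Minimal sketch (illustrative data and model, not lecture code):
% MLE by numerically minimizing the negative log-likelihood.
% "Model": constant mean mu; Gaussian error with unknown sigma.
% Parametrize sigma = exp(x(2)) so the optimizer cannot make it negative.
y = [1.2; 0.8; 1.1; 0.9; 1.3];          % hypothetical observations

% negative log-likelihood of iid Gaussian data, up to an additive constant
negLogL = @(x) numel(y)*x(2) + sum((y - x(1)).^2) / (2*exp(2*x(2)));

x0   = [0; 0];                          % initial guess [mu; log(sigma)]
xhat = fminsearch(negLogL, x0);         % generic derivative-free minimizer
fprintf('mu_MLE = %.3f, sigma_MLE = %.3f\n', xhat(1), exp(xhat(2)));
```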

From the inverse problem to a nonlinear program The previously described expressions constitute generic nonlinear programs and can, in principle, be applied for arbitrary error distributions. We will discuss numerical methods that can indeed handle such arbitrary programs (with certain fundamental limitations to be discussed).

The likelihood function becomes the objective function of the nonlinear program.

The results are POINT ESTIMATES. Techniques for estimating full posterior distributions will be the subject of the 2nd part of this course.

From the inverse problem to a nonlinear program: The actual program

Logarithmic transformation

Note that for likelihood functions arising from multiple independent observations, i.e.

$L(x) = \prod_{i=1}^{k} f(\tilde{y}_i \mid x),$

the logarithm, as a strictly increasing function, can be applied to the likelihood function without changing the location of the optimum, yielding a computationally convenient sum:

$\ln L(x) = \sum_{i=1}^{k} \ln f(\tilde{y}_i \mid x),$

which often helps to avoid floating point computation related precision losses, among other things.
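A quick numerical illustration of that last point (a sketch with made-up residuals, not lecture code): for more than a few hundred observations the product of Gaussian densities underflows in double precision, while the sum of log-densities remains well behaved.

```matlab
% Minimal sketch (made-up residuals): the product of many densities
% underflows in double precision, the sum of log-densities does not.
sigma = 1;
r     = randn(2000, 1);                                   % hypothetical residuals
dens  = exp(-r.^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);

L     = prod(dens);                                       % underflows to 0
logL  = sum(-r.^2/(2*sigma^2) - log(sqrt(2*pi)*sigma));   % remains finite
fprintf('product form: %g, log form: %g\n', L, logL);
```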

From the inverse problem to a nonlinear program
Example & special case: normally distributed errors

If the measurement errors are normally distributed, i.e.

$L(x) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(\tilde{y}_i - M_i(x))^2}{2\sigma_i^2}\right),$

the log likelihood takes a familiar form:

$-\ln L(x) = \sum_{i=1}^{k} \frac{(\tilde{y}_i - M_i(x))^2}{2\sigma_i^2} + \text{const},$

the sum of squared residuals, weighted by the respective standard deviations.

From the inverse problem to a nonlinear program Example & special case: normally distributed errors In the special case of independent observations with normally distributed errors, finding the MLE corresponds to the solution of the (weighted) nonlinear least squares problem.

This will be particularly convenient in the estimation process for reasons to be discussed.

From the inverse problem to a nonlinear program: A (slightly) more specific example Estimating parameters and initial conditions for a nonlinear system of ordinary differential equations, given measurements of (functions of) system states at a number of timepoints.

From the inverse problem to a nonlinear program: A more specific example

Let the model be specified as an initial value problem for a system of 1st order nonlinear ordinary differential equations

$\dot{s}(t) = f(t, s, p), \quad s \in \mathbb{R}^{n_s}, \; p \in \mathbb{R}^{n_p},$

with initial conditions $s(0) = s_0$, for which, under regularity assumptions for the RHS, a unique solution $s(t)$ exists.

From the inverse problem to a nonlinear program: A more specific example For now, let us assume that we are able to find the (existing and unique) s(t) by numerical integration for arbitrary initial conditions s(0) and parameter vectors p.

Let us further assume that our experimental data takes the form of a finite set of observation vectors

$\tilde{o}_i \in \mathbb{R}^{n_o}, \quad i = 1, \ldots, n_t,$

taken at times $t_1, \ldots, t_{n_t}$.

From the inverse problem to a nonlinear program: A more specific example

… and that observations can be computed from the solutions of our ODE system through a (in general nonlinear, parameter and time dependent) observation function

$T : \mathbb{R} \times \mathbb{R}^{n_s} \times \mathbb{R}^{n_p} \to \mathbb{R}^{n_o}, \quad (t, s, p) \mapsto T(t, s, p).$

From the inverse problem to a nonlinear program: A more specific example

To formulate the MLE problem, we lump the unknown parameters and initial conditions into one “parameter” vector to be estimated,

$x := \begin{pmatrix} s(0) \\ p \end{pmatrix} \in \mathbb{R}^{n_s + n_p},$

and can now view the solution of our ODE system as a function of time, initial conditions and parameters,

$s(t; x) := s(t; s(0), p).$

From the inverse problem to a nonlinear program: A more specific example

… which allows us to write down the log likelihood for Gaussian errors as a function of the sought after “parameter” vector x:

$-\ln L(x) = \sum_{i=1}^{n_t} \frac{[\tilde{o}_i - T(t_i, s(t_i; x), p)]^2}{2\sigma_i^2} + \text{const}.$

The MLE thus corresponds to the solution of the weighted least squares problem

$x_{MLE} = \arg\min_{x} \sum_{i=1}^{n_t} \frac{[\tilde{o}_i - T(t_i, s(t_i; x), p)]^2}{\sigma_i^2}.$
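To make this concrete, here is a minimal sketch (my own toy problem, not the lecture's example system): a scalar decay model with identity observation function, hypothetical data, known standard deviations, and a crude derivative-free minimization of the weighted sum of squared residuals.

```matlab
% Minimal sketch (my own toy problem, not the lecture's example system):
% scalar decay model ds/dt = -p*s, identity observation function T(t,s,p) = s,
% hypothetical data o~_i with known standard deviations sigma_i.
tObs  = [0.5 1 2 4];                    % observation times t_i
oObs  = [0.78 0.62 0.35 0.14];          % hypothetical measurements
sigma = 0.05 * ones(size(oObs));        % known measurement error std devs

% lumped "parameter" vector x = [s0; p]: initial condition and rate constant
predict = @(x) deval(ode45(@(t,s) -x(2)*s, [0, max(tObs)], x(1)), tObs);

% weighted least squares objective (= negative log-likelihood up to a constant)
obj = @(x) sum(((oObs - predict(x)) ./ sigma).^2);

xMLE = fminsearch(obj, [1; 0.3]);       % crude derivative-free minimization
fprintf('s0_MLE = %.3f, p_MLE = %.3f\n', xMLE(1), xMLE(2));
```

In practice one would prefer a least-squares-aware solver (e.g., lsqnonlin) and supply derivative information, which is exactly what the remainder of this lecture is about.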

From the inverse problem to a nonlinear program: A more specific example Why introduce the observation function T?

Example: fluid flow between compartments:
•ODE formulation in terms of mass/volume conservation at nodes: state variables are volumes
•Parametrization in terms of pressure-volume relationship
•Observable: intracompartmental pressure

Will become interesting when we discuss implementation issues, in particular related to computing derivative information…

From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

We have thus far assumed that our observation data is explicable by a single parameter/initial condition set (single experiment case). However, nothing prevents us from considering multi-experiment setups in which a subset of parameters/initial conditions is allowed to vary between subsets of the observations.

From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

In a “multi-experiment” setting, we have multiple observations from $n_{ex}$ experiments,

$\tilde{o}_i^{\,j} \in \mathbb{R}^{n_o}, \quad i = 1, \ldots, n_t^{\,j}, \quad j = 1, \ldots, n_{ex},$

taken at times $t_1^{\,j}, \ldots, t_{n_t^{j}}^{\,j}$.

From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups We then need to decide which components of our lumped “parameter” vector x we will allow to vary between experiments (example: initial conditions could be different, parameters the same), resulting in a combined parameter vector for the overall problem…

From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups

Overall parameter vector:

$x_{total} = \begin{pmatrix} x_{const} \\ x_{var}^{1} \\ \vdots \\ x_{var}^{n_{ex}} \end{pmatrix}$

In the Gaussian case, we then need to solve

$x_{MLE} = \arg\min_{x_{total}} \sum_{j=1}^{n_{ex}} \sum_{i=1}^{n_t^{j}} \frac{[\tilde{o}_i^{\,j} - T(t_i^{\,j}, s(t_i^{\,j}; x_{const}, x_{var}^{\,j}), x_{const}, x_{var}^{\,j})]^2}{\sigma_i^2}.$
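A minimal sketch of how such an objective might be assembled in code (a hypothetical function of mine, reusing the toy decay model and data conventions from the earlier sketch; only the rate constant is shared, the initial condition varies per experiment):

```matlab
% Minimal sketch (hypothetical function, toy decay model as before): building
% the multi-experiment objective.  xTotal = [p; s0_exp1; s0_exp2; ...], i.e.
% the rate constant p is shared (x_const), the initial conditions vary (x_var^j).
% This would live in its own file multiExpObjective.m.
function J = multiExpObjective(xTotal, tObs, oObs, sigma)
    % tObs, oObs: cell arrays, one row vector of times/measurements per experiment
    p = xTotal(1);                      % shared parameter block x_const
    J = 0;
    for j = 1:numel(tObs)               % outer sum over experiments
        s0  = xTotal(1 + j);            % experiment-specific block x_var^j
        sol = ode45(@(t,s) -p*s, [0, max(tObs{j})], s0);
        res = (oObs{j} - deval(sol, tObs{j})) / sigma;
        J   = J + sum(res.^2);          % inner sum over that experiment's time points
    end
end
% usage, e.g.:
% fminsearch(@(x) multiExpObjective(x, {[1 2 4],[0.5 1 3]}, ...
%                  {[0.60 0.38 0.13],[0.45 0.33 0.12]}, 0.05), [0.3; 1; 1])
```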

From the inverse problem to a nonlinear program: Objective functions for multi-experiment setups This is also worth mentioning since it introduces a particular structure into the objective function that can be exploited during the optimization process (block structured Jacobian, will discuss later).

The role of derivatives To fully exploit the advantages of the nonlinear programming formulation of the inverse problem, derivative information is required.

Interlude: Vector calculus
Notation

For a totally differentiable function

$f : \mathbb{R}^{n} \to \mathbb{R}^{m}, \quad x \mapsto y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{pmatrix},$

the total differential is a linear map described by the m x n Jacobi matrix (“Jacobian”)

$Df(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(x) & \cdots & \frac{\partial f_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1}(x) & \cdots & \frac{\partial f_m}{\partial x_n}(x) \end{pmatrix}.$

Interlude: Vector calculus Notation We will use the subscript to denote with respect to which argument we are differentiating when a function formally depends on multiple vector valued arguments (board…).

Interlude: Vector calculus
Reminder: Chain rule

For 2 totally differentiable functions

$f : \mathbb{R}^{m} \to \mathbb{R}^{n}, \quad y \mapsto z = f(y)$

and

$g : \mathbb{R}^{l} \to \mathbb{R}^{m}, \quad x \mapsto y = g(x),$

the total differential of the composite function $h := f \circ g : \mathbb{R}^{l} \to \mathbb{R}^{n}$ is the product of the Jacobians:

$Dh(x) = Df(g(x)) \, Dg(x), \quad (n \times m)(m \times l) = n \times l.$

Basic stuff, I know, but checking dimensions can often serve as a plausibility check when things get messy during implementation…
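In that spirit, a tiny numerical check (toy functions chosen by me, not lecture code) of both the chain rule and the dimensions:

```matlab
% Minimal sketch (toy functions of my own choosing): verify the chain rule
% Dh(x) = Df(g(x))*Dg(x) and its dimensions numerically.
g  = @(x) [x(1)*x(2); sin(x(1)); x(2)^2];          % g: R^2 -> R^3
f  = @(y) [y(1) + y(3); y(2)*y(3)];                % f: R^3 -> R^2
Dg = @(x) [x(2), x(1); cos(x(1)), 0; 0, 2*x(2)];   % 3 x 2 Jacobian of g
Df = @(y) [1, 0, 1; 0, y(3), y(2)];                % 2 x 3 Jacobian of f
x  = [0.3; -1.2];
Dh = Df(g(x)) * Dg(x);                             % (2x3)*(3x2) = 2x2 Jacobian

% finite-difference check of the composite h = f(g(x))
h    = @(x) f(g(x));  e = 1e-6;
DhFD = [(h(x+[e;0]) - h(x-[e;0]))/(2*e), (h(x+[0;e]) - h(x-[0;e]))/(2*e)];
maxErr = max(abs(Dh(:) - DhFD(:)));                % small (finite-difference error)
```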

Back to our MLE setup

Straightforward application of the chain rule, e.g., allows us to compute the Jacobian of our predicted observables in the previous example from the Jacobian of the observation function with respect to the system states and the Jacobian of the system states with respect to model parameters and initial conditions, as well as the gradient of the likelihood function with respect to the parameters, if we so desire.

The key issue: Model Jacobian

For all models for which the forward problem can be solved numerically (a generic sketch follows below):
•Finite differences
  •Forward: O(h)
  •Backward: O(h)
  •Central: O(h²)
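For instance, a central-difference Jacobian of an arbitrary model function can be computed with a few lines of generic code (a sketch of my own, not lecture code; the function name and step size are my choices):

```matlab
% Minimal sketch (generic helper of my own, would live in fdJacobian.m):
% central-difference approximation of the Jacobian of M: R^n -> R^m.
function J = fdJacobian(M, x, h)
    % M: function handle returning a column vector, x: column vector, h: step size
    y0 = M(x);
    J  = zeros(numel(y0), numel(x));
    for j = 1:numel(x)
        e       = zeros(numel(x), 1);
        e(j)    = h;
        J(:, j) = (M(x + e) - M(x - e)) / (2*h);   % O(h^2) accurate
    end
end
% usage: J = fdJacobian(@(x) [x(1)^2; x(1)*x(2)], [1; 2], 1e-6)
```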

The key issue: Model Jacobian

For models available in closed form: can be done exactly, either symbolically
•Manual
•Computer algebra system (Maple, Mathematica, Derive, etc., etc.)
or via
•Automatic/Algorithmic differentiation

Interlude: Algorithmic differentiation

•Basic idea: compute derivatives along with regular values as program code is run using “augmented arithmetic”
•Accuracy limited only by machine precision
•2 extreme cases: forward and backward “mode”, performance depends on number of input and output values
•For m x n Jacobian: n forward, m backward sweeps

Technical implementation (a minimal operator-overloading sketch follows below):
•Code transformation
•Operator overloading
•Expression templates
•Will discuss practical issues (choice of implementation, seeding, etc.) when we get to them
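As a flavor of the operator-overloading approach, here is a minimal forward-mode sketch (my own toy class, not a library; in MATLAB the classdef would live in its own file Dual.m, and only a handful of operations are overloaded):

```matlab
% Minimal sketch of forward-mode AD via operator overloading. Each Dual
% carries a value and the derivative of that value with respect to one
% chosen input; running ordinary code on Dual objects propagates exact
% derivatives alongside the values.
classdef Dual
    properties
        val   % function value
        der   % derivative carried along with the value
    end
    methods
        function obj = Dual(v, d)
            obj.val = v;
            obj.der = d;
        end
        function c = plus(a, b)                  % a + b
            [a, b] = Dual.lift(a, b);
            c = Dual(a.val + b.val, a.der + b.der);
        end
        function c = mtimes(a, b)                % a * b, product rule
            [a, b] = Dual.lift(a, b);
            c = Dual(a.val * b.val, a.der * b.val + a.val * b.der);
        end
        function c = sin(a)                      % elementary function rule
            c = Dual(sin(a.val), cos(a.val) * a.der);
        end
    end
    methods (Static)
        function [a, b] = lift(a, b)             % promote plain numbers
            if ~isa(a, 'Dual'), a = Dual(a, 0); end
            if ~isa(b, 'Dual'), b = Dual(b, 0); end
        end
    end
end
% usage ("seed" the input with derivative 1):
%   x = Dual(2, 1);  y = sin(x) * x;   % y.der equals sin(2) + 2*cos(2)
```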

The key issue: Model Jacobian
Jacobian of ODE solutions

For models specified as ordinary differential equations and solved using numerical integration:
•Modify integrator (fast):
  •Internal numerical differentiation (Bock)
  •Iterative approximation using directional derivatives
•Solve sensitivity equations (error control!)

The key issue: Model Jacobian Jacobian of ODE solutions

A few remarks on numerical ODE integration (my personal take)

•ALWAYS use adaptive schemes with error control in an optimization setting
•Read the solver manual, understand the error control scheme and the meaning of the tunable parameters (at least superficially)
•If integration is slow using an explicit method, try an implicit method (stiff integrator)
•If an implicit method fails or shows poor performance, provide exact Jacobians (analytical, AD)
•If it still fails, closely inspect the point of failure (typical issues: physically positive state becomes negative due to roundoff error accumulation, system explodes, etc.)
•Think about RHS discontinuities; performance integrators will NOT let you be sloppy on that account
(A small illustration of several of these points follows below.)
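As an illustration (a sketch using the stock stiff van der Pol example with mu = 1000; the specific tolerance values are arbitrary choices of mine): tighten the error control via odeset, use a stiff solver, and hand it an exact Jacobian.

```matlab
% Minimal sketch (stock stiff example, arbitrary tolerance values of my own):
% tighten error control, use a stiff solver, and supply an exact Jacobian.
mu   = 1000;
rhs  = @(t, s) [s(2); mu*(1 - s(1)^2)*s(2) - s(1)];            % van der Pol RHS
Jac  = @(t, s) [0, 1; -2*mu*s(1)*s(2) - 1, mu*(1 - s(1)^2)];   % exact d(rhs)/ds
opts = odeset('RelTol', 1e-8, 'AbsTol', 1e-10, 'Jacobian', Jac);
[t, s] = ode15s(rhs, [0 3000], [2; 0], opts);                  % implicit solver
```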

Jacobian of ODE solutions
Forward sensitivity equations

For an ODE system of the form

$\dot{s} = f(t, s, p), \quad s \in \mathbb{R}^{n_s}, \; p \in \mathbb{R}^{n_p},$

the chain rule of differentiation yields, for each parameter $p_i$, a set of $n_s$ differential equations describing the temporal evolution of the partial derivatives of the solution components of the ODE with respect to that parameter:

$\frac{d}{dt}\hat{s}_i = D_s f(t, s(t), p)\,\hat{s}_i + \frac{\partial f}{\partial p_i}, \qquad \hat{s}_i := \frac{\partial s}{\partial p_i}, \qquad \hat{s}_i(0) = \frac{\partial s_0(p)}{\partial p_i}.$

To obtain sensitivities to initial conditions, introduce the initial conditions as additional parameters, for which $\partial f / \partial p_i = 0$ and $\hat{s}_j(0) = e_j$ for the sensitivity to the j-th initial condition component.

Jacobian of ODE solutions
Forward sensitivity equations

We thus end up with an augmented ODE system with a total of $n_s(n_p + 1)$ equations (plus a further $n_s^2$ if sensitivities with respect to all initial conditions are required) that need to be solved.

While this is expensive, the full range of error control methods for the solution of ODE systems is available.

In addition, integrators exist (CVODES, e.g.) that exploit the special structure of this system to solve it more efficiently than would be possible for a general system (at least for the stiff case).

Can nevertheless become prohibitively expensive for large systems.
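A minimal sketch of what this looks like in practice (my own toy system with one state and one parameter, not lecture code): the augmented system carries the state together with its sensitivities to p and to s(0), and the result can be checked against the analytical solution.

```matlab
% Minimal sketch (toy system of my own: one state, one parameter):
% forward sensitivities for ds/dt = -p*s, s(0) = s0.
% Augmented state z = [s; ds/dp; ds/ds0] with, by the chain rule,
%   d/dt(ds/dp)  = (df/ds)*(ds/dp) + df/dp = -p*(ds/dp) - s,   ds/dp(0)  = 0
%   d/dt(ds/ds0) = (df/ds)*(ds/ds0)        = -p*(ds/ds0),      ds/ds0(0) = 1
p  = 0.7;  s0 = 2;
augRhs = @(t, z) [ -p*z(1); ...
                   -p*z(2) - z(1); ...
                   -p*z(3) ];
[t, z] = ode45(augRhs, [0 5], [s0; 0; 1]);

% analytical check: s = s0*exp(-p*t), ds/dp = -t*s0*exp(-p*t), ds/ds0 = exp(-p*t)
maxErr = max(abs(z(:,2) - (-t .* s0 .* exp(-p*t))));   % small (limited by tolerances)
```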

Jacobian of ODE solutions Adjoint sensitivity analysis Alternative: adjoint sensitivity analysis.

•Useful when the gradient with respect to the parameters is needed only for relatively few scalar function(al)s of the ODE solution.

•Idea: integrate ODE system first, use solution to integrate backward to find sensitivities.

•One backwards integration per scalar function •Derivation a little more complicated, implementation requires solver that supports efficient backward integration => will discuss when we are ready to use it…

Jacobian of ODE solutions Why can’t we just AD it?

•Differentiability…
•Adaptive iterative codes contain branching points where differentiability criteria may not be met
•This can cause automatically differentiated codes to be unreliable
•Since reasonable other methods are available, not the best option from my point of view
•However, some successful applications of AD to large legacy PDE codes have been reported

General remarks on implementation issues

•Tradeoff between development time and performance (i.e., MATLAB vs. C implementation)
•Related to the above: tradeoff between robustness and performance
•High performance code is more error prone (memory handling, etc.)
•Recommended procedure (my opinion): prototype in a high level interpreted language such as MATLAB, obtain reference results, then gradually transition to a performance implementation (if even necessary…)
•Test cases are essential!!
•Use plenty of comments; understanding your own reasonably complex code a few months after writing it may be very hard…

Inner workings of NLP codes Why bother?

NLP is a mature field of research. We will probably be unable to improve on existing codes with justifiable time investment.

BUT

Understanding the underlying principles of typical methods, in particular with respect to
•Proof of convergence
•Analysis of rate of convergence
•Numerical implementation

Inner workings of NLP codes Why bother?

…should enable us to better
•Select an appropriate method for a given problem
•Understand and monitor its behavior (what is the meaning of all those iteratively updated, algorithm specific numbers, and what do they tell us about the solution we obtain?)
•Diagnose failure if it occurs (typically when assumptions are not met)
•Perform technical performance tuning (e.g., adaptation to tuned linear algebra codes, parallelization, etc.)

Reminder & outlook: fundamental issue

Non-linear optimization/programming: Efficient techniques make use of local information about the objective function, i.e. Jacobian, Hessian => they can only find local minima, and the result depends on the “basin of attraction” you start in (“local” vs. “global” minima). At least partially amenable to technical solutions (multiple shooting for ODEs, global optimization algorithms).

Basins of attraction: Algorithm dependence

Toy model: superposition of 2 sine waves, 2 free parameters, 20 “measurements”. Residual error after convergence as a function of initial guess, for 3 different algorithms.

Assignment (1) MATLAB & ODE warm-up exercises

Consider a sinusoidally forced van der Pol oscillator

$\ddot{y} - \mu (1 - y^2)\,\dot{y} + y = a \cos(\omega t + \varphi)$

a: amplitude of forcing
ω: frequency of forcing
φ: phase of forcing

Assignment (1) MATLAB & ODE warm-up

•Implement this oscillator (hint: a non-forced version is part of the MATLAB examples) and solve it for values of mu=1 and mu=1000, and at least 1 setting of the forcing parameters where an effect is noticeable (use a=0 as test case; can you find interesting settings?), using ode45 and ode15s; observe & describe the effect of stiffness (hint: timescale for mu=1 is 10s of seconds, for mu=1000, 1000s of seconds)
•Compute sensitivities by deriving forward sensitivity equations and integrating these (feel free to use a CAS if you wish)
•Download & install sens_analysis_1.1 ( http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=1480&objectType=FILE )
•Use it to evaluate sensitivities for the same parameter settings for which you integrated the sensitivity equations and compare results
•Compute values and sensitivities with respect to parameters and initial conditions for the following vector valued observable for mu=1, a=100, omega=pi, phi=0, t=[0,20]: [vector-valued observable as given on the original slide]

Assignment (2) OPTIONAL, WILL NOT BE GRADED AT THIS POINT

•Implement (or provide an existing implementation of) an ODE system that either is related to your research or interests you for other reasons and for which you wish to attempt parameter estimation
•Next week's interactive component will in part be for you to briefly (3-5 minutes, 3-5 slides) describe the system, show solutions for example parameter sets and any data that you may wish to fit to
•The idea here is that, if suitable, we will use these systems for assignments, performance comparisons, etc., so that the implementation work you do for the course does not go to waste
•Do not hesitate if you think the problem difficult; grades will be assigned based on the quality of the work, not on speed of success (and there will always be toy models as alternative options)