Class 14. Nonlinear Models and Nonlinear Optimization


Part 14: Nonlinear Models [ 1/84]
Econometric Analysis of Panel Data
William Greene
Department of Economics
Stern School of Business
Part 14: Nonlinear Models [ 2/84]
Nonlinear Models
- Estimation Theory for Nonlinear Models
  - Estimators
  - Properties
- M Estimation
  - Nonlinear Least Squares
  - Maximum Likelihood Estimation
- GMM Estimation
  - Minimum Distance Estimation
  - Minimum Chi-square Estimation
- Computation – Nonlinear Optimization
  - Nonlinear Least Squares
  - Newton-like Algorithms; Gradient Methods
(Background: JW, Chapters 12-14; Greene, Chapters 12-14)
Part 14: Nonlinear Models [ 3/84]
What is a ‘Model?’
- Unconditional ‘characteristics’ of a population
- Conditional moments: E[g(y)|x]: median, mean, variance, quantile, correlations, probabilities…
- Conditional probabilities and densities
- Conditional means and regressions
- Fully parametric and semiparametric specifications
  - Parametric specification: known up to parameter θ
  - Parameter spaces
  - Conditional means: E[y|x] = m(x, θ)
Part 14: Nonlinear Models [ 4/84]
What is a Nonlinear Model?
Model: E[g(y)|x] = m(x, θ)
Objective:
- Learn about θ from y, X
- Usually “estimate” θ
Linear model: closed form, θ̂ = h(y, X)
Nonlinear model:
- Not wrt m(x, θ). E.g., y = exp(θ′x + ε)
- Wrt the estimator: implicitly defined, h(y, X, θ̂) = 0. E.g., E[y|x] = exp(θ′x)
Part 14: Nonlinear Models [ 5/84]
What is an Estimator?
Point and interval:
θ̂ = f(data | model)
I(θ̂) = θ̂ ± sampling variability
Classical and Bayesian:
θ̂ = E[θ | data, prior f(θ)] = expectation from the posterior
I(θ̂) = narrowest interval from the posterior density containing the specified probability (mass)
Part 14: Nonlinear Models [ 6/84]
Parameters
- Model parameters
- The parameter space
- The true parameter(s)
Example: f(y_i | x_i) = exp(−y_i/λ_i)/λ_i,  λ_i = exp(β′x_i)
Model parameters: β
Conditional mean: E(y_i | x_i) = λ_i = exp(β′x_i)
Part 14: Nonlinear Models [ 7/84]
The Conditional Mean Function
m(x, θ0) = E[y|x] for some θ0 in Θ.
A property of the conditional mean:
E_y,x[(y − m(x, θ))²] is minimized by E[y|x].
(Proof, pp. 343-344, JW)
Part 14: Nonlinear Models [ 8/84]
M Estimation
Classical estimation method:
θ̂ = argmin (1/n) Σi q(data_i, θ)
Example: nonlinear least squares,
θ̂ = argmin (1/n) Σi [y_i − E(y_i | x_i, θ)]²
Part 14: Nonlinear Models [ 9/84]
An Analogy Principle for M Estimation
The estimator θ̂ minimizes q̄ = (1/n) Σi q(data_i, θ).
The true parameter θ0 minimizes q* = E[q(data, θ)].
The weak law of large numbers:
q̄ = (1/n) Σi q(data_i, θ) →p q* = E[q(data, θ)]
Part 14: Nonlinear Models [ 10/84]
Estimation
q̄ = (1/n) Σi q(data_i, θ) →p q* = E[q(data, θ)]
Estimator θ̂ minimizes q̄.
True parameter θ0 minimizes q*.
q̄ →p q*.
Does this imply θ̂ →p θ0?
Yes, if ...
Part 14: Nonlinear Models [ 11/84]
Identification
Uniqueness :
If 1  0 , then m(x ,1 )  m( x , 0 ) for some x
Examples when this does not occur
(1) Multicollinearity generally
(2) Need for normalization E[y|x] = m(x /)
(3) Indeterminacy m(x ,)=1  2 x  3 x 4 when 3 = 0
Part 14: Nonlinear Models [ 12/84]
Continuity
q(datai , ) is
(a) Continuous in  for all datai and all 
(b) Continuously differentiable. First derivatives
are also continuous
(c) Twice differentiable. Second derivatives
must be nonzero, though they need not
be continuous functions of . (E.g. Linear LS)
Part 14: Nonlinear Models [ 13/84]
Consistency
q̄ = (1/n) Σi q(data_i, θ) →p q* = E[q(data, θ)]
Estimator θ̂ minimizes q̄.
True parameter θ0 minimizes q*.
q̄ →p q*.
Does this imply θ̂ →p θ0?
Yes. Consistency follows from identification and continuity, together with the other assumptions.
Part 14: Nonlinear Models [ 14/84]
Asymptotic Normality of M Estimators
First order conditions:
(1/n) Σi ∂q(data_i, θ̂)/∂θ̂ = 0
(1/n) Σi ∂q(data_i, θ̂)/∂θ̂ = (1/n) Σi g(data_i, θ̂) = ḡ(data, θ̂)
For any θ̂, this is the mean of a random sample. We apply the Lindeberg-Feller CLT to assert the limiting normal distribution of √n ḡ(data, θ̂).
Part 14: Nonlinear Models [ 15/84]
Asymptotic Normality
A Taylor series expansion of the derivative, with θ̄ = some point between θ̂ and θ0:
ḡ(data, θ̂) = ḡ(data, θ0) + H(θ̄)(θ̂ − θ0) = 0
H(θ̄) = (1/n) Σi ∂²q(data_i, θ̄)/∂θ̄∂θ̄′
Then (θ̂ − θ0) = −[H(θ̄)]⁻¹ ḡ(data, θ0) and
√n (θ̂ − θ0) = −[H(θ̄)]⁻¹ √n ḡ(data, θ0)
Part 14: Nonlinear Models [ 16/84]
Asymptotic Normality
√n (θ̂ − θ0) = −[H(θ̄)]⁻¹ √n ḡ(data, θ0)
[H(θ̄)]⁻¹ converges to its expectation (a matrix).
√n ḡ(data, θ0) converges to a normally distributed vector (Lindeberg-Feller).
This implies a limiting normal distribution for √n (θ̂ − θ0):
the limiting mean is 0;
the limiting variance is to be obtained.
The asymptotic distribution is then obtained by the usual means.
Part 14: Nonlinear Models [ 17/84]
Asymptotic Variance
θ̂ ≈ θ0 − [H(θ0)]⁻¹ ḡ(data, θ0)
Asymptotically normal, mean θ0.
Asy.Var[θ̂] = [H(θ0)]⁻¹ Var[ḡ(data, θ0)] [H(θ0)]⁻¹
(A sandwich estimator, as usual.)
What is Var[ḡ(data, θ0)]? It is (1/n) E[g(data_i, θ0) g(data_i, θ0)′].
Not known what it is, but it is easy to estimate:
(1/n)(1/n) Σi g(data_i, θ̂) g(data_i, θ̂)′
Part 14: Nonlinear Models [ 18/84]
Estimating the Variance
Asy.Var[θ̂] = [H(θ0)]⁻¹ Var[ḡ(data, θ0)] [H(θ0)]⁻¹
Estimate [H(θ0)] with (1/n) Σi ∂²q(data_i, θ̂)/∂θ̂∂θ̂′
Estimate Var[ḡ(data, θ0)] with (1/n)(1/n) Σi [∂q(data_i, θ̂)/∂θ̂][∂q(data_i, θ̂)/∂θ̂]′
E.g., if this is linear least squares with q(data_i, θ) = (1/2)(y_i − x_i′β)²:
(1/n) Σi ∂²q(data_i, θ̂)/∂θ̂∂θ̂′ = X′X/n
(1/n)(1/n) Σi [∂q(data_i, θ̂)/∂θ̂][∂q(data_i, θ̂)/∂θ̂]′ = (1/n²) Σi e_i² x_i x_i′
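For the linear least squares case above, the two estimated pieces are X′X/n and (1/n²) Σi e_i² x_i x_i′, so the sandwich is White's heteroscedasticity-robust covariance matrix. A minimal numpy sketch of that computation follows; the simulated data and variable names are illustrative assumptions, not part of the course example.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.25])
y = X @ beta + rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))   # heteroscedastic errors

b = np.linalg.solve(X.T @ X, X.T @ y)        # minimizes (1/n) sum (1/2)(y_i - x_i'b)^2
e = y - X @ b                                # residuals
H = X.T @ X / n                              # (1/n) sum of second derivatives
S = (X * e[:, None] ** 2).T @ X / n**2       # (1/n)(1/n) sum e_i^2 x_i x_i'
V = np.linalg.inv(H) @ S @ np.linalg.inv(H)  # the sandwich
print(np.sqrt(np.diag(V)))                   # robust standard errors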
Part 14: Nonlinear Models [ 19/84]
Nonlinear Least Squares
Gauss-Marquardt Algorithm
The conditional mean function: m(x_i, θ)
Pseudo-regressors: x_i0 = ∂m(x_i, θ)/∂θ
x_i0 is one row of the pseudo-regressor matrix X0.
Algorithm, iteration k:
θ̂(k+1) = θ̂(k) + [X0′X0]⁻¹ X0′e0
where e0 is the vector of current residuals, y_i − m(x_i, θ̂(k)).
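A short Python sketch of this iteration for an exponential conditional mean, m(x, θ) = exp(x′θ); the function, the convergence cutoff, and the simulated data are illustrative assumptions. The course application itself is run with the NLOGIT commands shown below.

import numpy as np

def gauss_marquardt(y, X, theta0, tol=1e-10, max_iter=100):
    """Iterate theta(k+1) = theta(k) + (X0'X0)^(-1) X0'e0 for m(x, theta) = exp(x'theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        mu = np.exp(X @ theta)                        # conditional mean m(x_i, theta)
        X0 = X * mu[:, None]                          # pseudo-regressors dm/dtheta = mu_i * x_i
        e0 = y - mu                                   # current residuals
        step = np.linalg.solve(X0.T @ X0, X0.T @ e0)
        crit = e0 @ X0 @ step                         # 'gradient' measure e0'X0 (X0'X0)^(-1) X0'e0
        theta = theta + step
        if crit < tol:
            break
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = np.exp(X @ np.array([0.5, 0.3])) * rng.exponential(size=500)
print(gauss_marquardt(y, X, [np.log(y.mean()), 0.0]))   # start at (log ybar, 0)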
Part 14: Nonlinear Models [ 20/84]
Application - Income
German Health Care Usage Data, 7,293 Individuals, Varying Numbers of Periods
Data downloaded from the Journal of Applied Econometrics Archive. This is an unbalanced panel with 7,293 individuals and 27,326 observations in all; the number of observations per person ranges from 1 to 7 (frequencies: 1=1525, 2=2158, 3=825, 4=926, 5=1051, 6=1000, 7=987). The data can be used for regression, count models, binary choice, ordered choice, and bivariate binary choice. The variable NUMOBS tells how many observations there are for each person; it is repeated in each row of that person's data.
Variables in the file are:
HHNINC = household nominal monthly net income in German marks / 10000.
(4 observations with income=0 were dropped)
HHKIDS = children under age 16 in the household = 1; otherwise = 0
EDUC = years of schooling
AGE = age in years
Part 14: Nonlinear Models [ 21/84]
Income Data
[Figure: kernel density estimate for INCOME (HHNINC); density plotted over income values from 0 to 5.]
Part 14: Nonlinear Models [ 22/84]
Exponential Model
f(Income | Age, Educ, Married) = (1/λ_i) exp(−HHNINC_i/λ_i)
λ_i = exp(a0 + a1 Educ + a2 Married + a3 Age)
E[HHNINC | Age, Educ, Married] = λ_i
Starting values for the iterations:
E[y_i | nothing else] = exp(a0)
Start a0 = log(mean of HHNINC), a1 = a2 = a3 = 0
Part 14: Nonlinear Models [ 23/84]
Conventional Variance Estimator
σ̂² (X0′X0)⁻¹, where σ̂² = Σi [y_i − m(x_i, θ̂)]² / (n − #parameters)
(The degrees of freedom correction is sometimes omitted.)
Part 14: Nonlinear Models [ 24/84]
Variance Estimator for the M Estimator
q_i = (1/2)[y_i − exp(x_i′θ)]² = (1/2)(y_i − λ_i)²
g_i = −e_i λ_i x_i
H_i = y_i λ_i x_i x_i′
The estimator is
[Σi H_i]⁻¹ [Σi g_i g_i′] [Σi H_i]⁻¹
= [Σi y_i λ_i x_i x_i′]⁻¹ [Σi e_i² λ_i² x_i x_i′] [Σi y_i λ_i x_i x_i′]⁻¹
This is the White estimator. See JW, p. 359.
Part 14: Nonlinear Models [ 25/84]
Computing NLS
Reject ; hhninc=0$
Calc   ; b0=log(xbr(hhninc))$
Nlsq   ; lhs = hhninc
       ; fcn = exp(a0+a1*educ+a2*married+a3*age)
       ; start = b0, 0, 0, 0
       ; labels = a0,a1,a2,a3$
Name   ; x = one,educ,married,age$
Create ; thetai = exp(x'b) ; ei = hhninc - thetai
       ; gi = ei*thetai
       ; hi = hhninc*thetai$
Matrix ; varM = <x'[hi] x> * x'[gi^2]x * <x'[hi] x> $
Matrix ; stat(b,varm,x)$
Part 14: Nonlinear Models [ 26/84]
Iterations
'gradient' = e0′X0 (X0′X0)⁻¹ X0′e0
Part 14: Nonlinear Models [ 27/84]
NLS Estimates with Different
Variance Estimators
Part 14: Nonlinear Models [ 28/84]
Hypothesis Tests for M Estimation
Null hypothesis: c(θ) = 0 for some set of J functions that are
(1) continuous,
(2) differentiable: ∂c(θ)/∂θ′ = R(θ), the J × K Jacobian,
(3) functionally independent: Rank R(θ) = J.
Wald: given θ̂ and V̂ = Est.Asy.Var[θ̂],
W = Wald distance = [c(θ̂) − c(θ0)]′ {R(θ̂) V̂ R(θ̂)′}⁻¹ [c(θ̂) − c(θ0)]  →d  chi-squared[J],
where c(θ0) = 0 under the null.
Part 14: Nonlinear Models [ 29/84]
Change in the Criterion Function
q̄ = (1/n) Σi q(data_i, θ) →p q* = E[q(data, θ)]
Estimator θ̂ minimizes q̄.
Estimator θ̂0 minimizes q̄ subject to the restrictions c(θ) = 0.
q̄0 ≥ q̄.
2n(q̄0 − q̄) →d chi-squared[J]
Part 14: Nonlinear Models [ 30/84]
Score Test
LM Statistic
Derivative of the objective function (the score vector):
∂[(1/n) Σi q(data_i, θ)]/∂θ = ḡ(data, θ)
Without restrictions, ḡ(data, θ̂) = 0.
With the null hypothesis c(θ) = 0 imposed, ḡ(data, θ̂0) is generally not equal to 0. Is it close? (Within sampling variability?)
LM = Wald distance = [ḡ(data, θ̂0)]′ {Var[ḡ(data, θ̂0)]}⁻¹ [ḡ(data, θ̂0)]  →d  chi-squared[J]
Part 14: Nonlinear Models [ 31/84]
Exponential Model
f(Income | Age, Educ, Married) = (1/λ_i) exp(−HHNINC_i/λ_i)
λ_i = exp(a0 + a1 Educ + a2 Married + a3 Age)
Test H0: a1 = a2 = a3 = 0
Part 14: Nonlinear Models [ 32/84]
Wald Test
Matrix ; List ; R = [0,1,0,0 / 0,0,1,0 / 0,0,0,1]
       ; c = R*b ; Vc = R*Varb*R'
       ; Wald = c'<Vc>c $

Matrix R     has 3 rows and 4 columns.
      .0000000D+00   1.00000        .0000000D+00   .0000000D+00
      .0000000D+00   .0000000D+00   1.00000        .0000000D+00
      .0000000D+00   .0000000D+00   .0000000D+00   1.00000
Matrix C     has 3 rows and 1 columns.
      .05472
      .23756
      .00081
Matrix VC    has 3 rows and 3 columns.
      .1053686D-05   .4530603D-06   .3649631D-07
      .4530603D-06   .5859546D-04  -.3565863D-06
      .3649631D-07  -.3565863D-06   .6940296D-07
Matrix WALD  has 1 rows and 1 columns.
      3627.17514
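The Wald distance can be verified directly from the matrices listed above; a small numpy computation with the printed values reproduces Matrix WALD up to rounding.

import numpy as np

c = np.array([.05472, .23756, .00081])
VC = np.array([[ .1053686e-05,  .4530603e-06,  .3649631e-07],
               [ .4530603e-06,  .5859546e-04, -.3565863e-06],
               [ .3649631e-07, -.3565863e-06,  .6940296e-07]])
W = c @ np.linalg.solve(VC, c)    # c' VC^(-1) c
print(W)                          # approximately 3627, as in Matrix WALD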
Part 14: Nonlinear Models [ 33/84]
Change in Function
Calc ; b0 = log(xbr(hhninc)) $
Nlsq ; lhs = hhninc ; labels = a0,a1,a2,a3
     ; start = b0,0,0,0
     ; fcn = exp(a0+a1*educ+a2*married+a3*age)$
Calc ; qbar = sumsqdev/n $
Nlsq ; lhs = hhninc ; labels = a0,a1,a2,a3
     ; start = b0,0,0,0 ; fix = a1,a2,a3
     ; fcn = exp(a0+a1*educ+a2*married+a3*age)$
Calc ; cm = 2*n*(Sumsqdev/n - qbar) $
(Sumsqdev = 763.767; Sumsqdev_0 = 854.682, so cm = 2(854.682 − 763.767) = 181.8, far above the chi-squared[3] critical value of 7.81.)
Part 14: Nonlinear Models [ 34/84]
Constrained Estimation
Part 14: Nonlinear Models [ 35/84]
LM Test
Function: q_i = (1/2)[y_i − exp(a0 + a1 Educ + ...)]²
Derivative: g_i = −e_i λ_i x_i
LM statistic:
LM = (Σi g_i)′ [Σi g_i g_i′]⁻¹ (Σi g_i)
All evaluated at â = (log ȳ, 0, 0, 0).
Part 14: Nonlinear Models [ 36/84]
LM Test
Name   ; x = one,educ,married,age$
Create ; thetai = exp(x'b)$
Create ; ei = hhninc - thetai$
Create ; gi = ei*thetai $
Matrix ; list
       ; LM = gi'x * <x'[gi^2]x> * x'gi $

Matrix LM    has 1 rows and 1 columns.
         1|   1915.03286
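The same statistic can be written compactly in matrix form. A numpy sketch, with y, X and the restricted estimate (log ȳ, 0, 0, 0) standing in for the NLOGIT objects above; the names are illustrative.

import numpy as np

def lm_statistic(y, X, b_restricted):
    theta = np.exp(X @ b_restricted)     # fitted conditional mean at the restricted estimate
    e = y - theta                        # residuals
    g = (e * theta)[:, None] * X         # rows are g_i' = e_i * lambda_i * x_i' (sign cancels)
    gbar = g.sum(axis=0)                 # sum of scores
    return gbar @ np.linalg.solve(g.T @ g, gbar)   # (sum g_i)' [sum g_i g_i']^(-1) (sum g_i)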
Part 14: Nonlinear Models [ 37/84]
Maximum Likelihood Estimation
Fully parametric estimation:
- Density of y_i is fully specified.
- The likelihood function = the joint density of the observed random variables.
Example: density for the exponential model,
f(y_i | x_i) = (1/λ_i) exp(−y_i/λ_i), λ_i = exp(x_i′β)
E[y_i | x_i] = λ_i,  Var[y_i | x_i] = λ_i²
The NLS (M) estimator examined earlier operated only on E[y_i | x_i] = λ_i.
Part 14: Nonlinear Models [ 38/84]
The Likelihood Function
f(y_i | x_i) = (1/λ_i) exp(−y_i/λ_i), λ_i = exp(x_i′β)
Likelihood = f(y_1, ..., y_n | x_1, ..., x_n)
By independence,
L(β | data) = Πi (1/λ_i) exp(−y_i/λ_i), λ_i = exp(x_i′β)
The MLE, β̂_MLE, maximizes the likelihood function.
Part 14: Nonlinear Models [ 39/84]
Log Likelihood Function
f(y_i | x_i) = (1/λ_i) exp(−y_i/λ_i), λ_i = exp(x_i′β)
L(β | data) = Πi (1/λ_i) exp(−y_i/λ_i), λ_i = exp(x_i′β)
The MLE, β̂_MLE, maximizes the likelihood function.
The log is a monotonic function, so β̂_MLE also maximizes the log likelihood function:
logL(β | data) = Σi [−log λ_i − y_i/λ_i]
Part 14: Nonlinear Models [ 40/84]
Conditional and Unconditional Likelihood
Unconditional joint density: f(y_i, x_i | θ, ω)
θ = our parameters of interest
ω = parameters of the marginal density of x_i
Unconditional likelihood function:
L(θ, ω | y, X) = Πi f(y_i, x_i | θ, ω)
f(y_i, x_i | θ, ω) = f(y_i | x_i, θ, ω) g(x_i | θ, ω)
Assuming the parameter space partitions,
logL(θ, ω | y, X) = Σi log f(y_i | x_i, θ) + Σi log g(x_i | ω)
= conditional log likelihood + marginal log likelihood
Part 14: Nonlinear Models [ 41/84]
Concentrated Log Likelihood
θ̂_MLE maximizes logL(θ | data).
Consider a partition into two parts, θ = (α, β).
The maximum occurs where ∂logL/∂θ = 0; the joint solution equates both sets of derivatives to 0.
If ∂logL/∂α = 0 admits an implicit solution for α in terms of β, α̂_MLE = α̂(β), then write
logL_c(β, α̂(β)) = a function only of β.
The concentrated log likelihood can be maximized for β, then the solution computed for α.
The solution must occur where α̂_MLE = α̂(β̂), so restrict the search to this subspace of the parameter space.
Part 14: Nonlinear Models [ 42/84]
Concentrated Log Likelihood
Fixed effects exponential regression: λ_it = exp(α_i + x_it′β)
logL = Σi Σt (−log λ_it − y_it/λ_it)
     = Σi Σt [−(α_i + x_it′β) − y_it exp(−α_i − x_it′β)]
∂logL/∂α_i = Σt [−1 + y_it exp(−α_i − x_it′β)]
           = −T + Σt y_it exp(−α_i − x_it′β)
           = −T + exp(−α_i) Σt y_it exp(−x_it′β) = 0
Solve this for α_i:
α_i = log[ (Σt y_it exp(−x_it′β)) / T ] = α̂_i(β)
The concentrated log likelihood has
λ_it^c = [ (Σt y_it exp(−x_it′β)) / T ] exp(x_it′β)
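A sketch of the concentrated log likelihood as a function of β alone, for a balanced panel held as arrays; the shapes and names are assumptions made for illustration.

import numpy as np

def concentrated_loglik(beta, y, X):
    """y: (n, T) array, X: (n, T, K) array. Returns logL_c(beta) with alpha_i concentrated out."""
    xb = X @ beta                                     # index beta'x_it, shape (n, T)
    alpha = np.log((y * np.exp(-xb)).mean(axis=1))    # alpha_i(beta) = log[(1/T) sum_t y_it exp(-beta'x_it)]
    lam = np.exp(alpha[:, None] + xb)                 # lambda_it evaluated at the concentrated alpha_i
    return np.sum(-np.log(lam) - y / lam)
# e.g., maximize with scipy.optimize.minimize applied to lambda b: -concentrated_loglik(b, y, X)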
Part 14: Nonlinear Models [ 43/84]
ML and M Estimation
logL(θ) = Σi log f(y_i | x_i, θ)
θ̂_MLE = argmax Σi log f(y_i | x_i, θ) = argmin −(1/n) Σi log f(y_i | x_i, θ)
The MLE is an M estimator. We can use all of the previous results for M estimation.
Part 14: Nonlinear Models [ 44/84]
‘Regularity’ Conditions
- Conditions for the MLE to be consistent, etc.
- Augment the continuity and identification conditions for M estimation.
- Regularity:
  - Three times continuous differentiability of the log density
  - Finite third moments of the log density
  - Conditions needed to obtain expected values of derivatives of the log density are met
(See Greene, Chapter 14.)
Part 14: Nonlinear Models [ 45/84]
Consistency and Asymptotic Normality of the MLE
- Conditions are identical to those for M estimation.
- Terms in the proofs are the log density and its derivatives.
- Nothing new is needed:
  - Law of large numbers
  - Lindeberg-Feller central limit theorem applied to the derivatives of the log likelihood
Part 14: Nonlinear Models [ 46/84]
Asymptotic Variance of the MLE
Based on the results for M estimation:
Asy.Var[θ̂_MLE] = {−E[Hessian]}⁻¹ {Var[first derivative]} {−E[Hessian]}⁻¹
              = {−E[∂²logL/∂θ∂θ′]}⁻¹ Var[∂logL/∂θ] {−E[∂²logL/∂θ∂θ′]}⁻¹
Part 14: Nonlinear Models [ 47/84]
The Information Matrix Equality
Fundamental result for MLE: the variance of the first derivative equals the negative of the expected second derivative,
−E[∂²logL/∂θ∂θ′] = the information matrix.
Asy.Var[θ̂_MLE] = {−E[∂²logL/∂θ∂θ′]}⁻¹ {−E[∂²logL/∂θ∂θ′]} {−E[∂²logL/∂θ∂θ′]}⁻¹
              = {−E[∂²logL/∂θ∂θ′]}⁻¹
Part 14: Nonlinear Models [ 48/84]
Three Variance Estimators
- Negative inverse of the expected second derivatives matrix. (Usually not known.)
- Negative inverse of the actual second derivatives matrix.
- Inverse of the variance of the first derivatives.
Part 14: Nonlinear Models [ 49/84]
Asymptotic Efficiency
- An M estimator based on the conditional mean is semiparametric. Not necessarily efficient.
- The MLE is fully parametric. It is efficient among all consistent and asymptotically normal estimators when the density is as specified.
- This is the Cramer-Rao bound.
- Note the implied comparison to nonlinear least squares for the exponential regression model.
Part 14: Nonlinear Models [ 50/84]
Invariance
A useful property of MLE:
If γ = g(θ) is a continuous function of θ, the MLE of γ is g(θ̂_MLE).
E.g., in the exponential FE model, the MLE of θ_i = exp(−α_i) is exp(−α̂_i,MLE).
Part 14: Nonlinear Models [ 51/84]
Application: Exponential
Regression – MLE and NLS
MLE assumes E[y|x] = exp(β′x) – Note sign reversal.
Part 14: Nonlinear Models [ 52/84]
Variance Estimators
logL = Σi [−log λ_i − y_i/λ_i], λ_i = exp(x_i′β)
g = ∂logL/∂β = Σi [−x_i + (y_i/λ_i) x_i] = Σi [(y_i/λ_i) − 1] x_i
Note, E[y_i | x_i] = λ_i, so E[g] = 0.
H = ∂²logL/∂β∂β′ = −Σi (y_i/λ_i) x_i x_i′
E[H] = −Σi x_i x_i′ = −X′X  (known for this particular model)
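Using exactly these derivatives, Newton's method for the exponential MLE takes only a few lines. A numpy sketch under the parameterization above, λ_i = exp(x_i′β) as the conditional mean; the starting value and data handling are illustrative (the course application uses NLOGIT's Loglinear command).

import numpy as np

def exponential_mle(y, X, max_iter=50, tol=1e-10):
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                  # start at the intercept-only solution
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        w = y / lam
        g = X.T @ (w - 1.0)                     # gradient: sum [(y_i/lam_i) - 1] x_i
        H = -(X * w[:, None]).T @ X             # Hessian: -sum (y_i/lam_i) x_i x_i'
        step = np.linalg.solve(H, g)
        beta = beta - step                      # Newton update
        if abs(g @ step) < tol:                 # scale-free measure g'[H]^(-1) g
            break
    return beta, -np.linalg.inv(H)              # estimate and actual-Hessian covariance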
Part 14: Nonlinear Models [ 53/84]
Three Variance Estimators
Berndt-Hall-Hall-Hausman (BHHH), based on outer products of first derivatives:
[Σi g_i g_i′]⁻¹ = {Σi [(y_i/λ̂_i) − 1]² x_i x_i′}⁻¹
Based on actual second derivatives:
[−Σi H_i]⁻¹ = {Σi (y_i/λ̂_i) x_i x_i′}⁻¹
Based on expected second derivatives:
{−E[Σi H_i]}⁻¹ = {Σi x_i x_i′}⁻¹ = (X′X)⁻¹
Part 14: Nonlinear Models [ 54/84]
Variance Estimators
Loglinear ; Lhs=hhninc ; Rhs=x ; Model = Exponential $
Create ; thetai=exp(x'b) ; hi=hhninc*thetai ; gi2=(hi-1)^2$
Matrix ; he=<x'x> ; ha=<x'[hi]x> ; bhhh=<x'[gi2]x>$
Matrix ; stat(b,ha) ; stat(b,he) ; stat(b,bhhh)$
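A numpy sketch of the same three matrices, mirroring the NLOGIT lines above. Here thetai = exp(x′b) follows the Loglinear parameterization, so hi = y_i*thetai plays the role of y_i/λ_i; the function and argument names are illustrative.

import numpy as np

def three_variance_estimators(y, X, b):
    thetai = np.exp(X @ b)
    hi = y * thetai                                  # y_i / lambda_i
    gi2 = (hi - 1.0) ** 2                            # squared scores for BHHH
    he = np.linalg.inv(X.T @ X)                      # expected Hessian:  (X'X)^(-1)
    ha = np.linalg.inv((X * hi[:, None]).T @ X)      # actual Hessian:    [sum (y_i/lam_i) x_i x_i']^(-1)
    bhhh = np.linalg.inv((X * gi2[:, None]).T @ X)   # outer products:    [sum g_i g_i']^(-1)
    return he, ha, bhhh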
Part 14: Nonlinear Models [ 55/84]
Hypothesis Tests
Trinity of tests for nested hypotheses:
- Wald
- Likelihood ratio
- Lagrange multiplier
All as defined for the M estimators.
Part 14: Nonlinear Models [ 56/84]
Example Exponential vs. Gamma
Gamma distribution: f(y_i | x_i, β, P) = exp(−y_i/λ_i) y_i^(P−1) / [λ_i^P Γ(P)]
Exponential: P = 1
P>1
Part 14: Nonlinear Models [ 57/84]
Log Likelihood
logL = Σi [−P log λ_i − log Γ(P) − y_i/λ_i + (P − 1) log y_i]
Γ(1) = 0! = 1
logL = Σi [−log λ_i − y_i/λ_i + (P − 1) log y_i − (P − 1) log λ_i − log Γ(P)]
     = Σi [−log λ_i − y_i/λ_i + (P − 1) log(y_i/λ_i) − log Γ(P)]
     = exponential logL + a part due to P ≠ 1
Part 14: Nonlinear Models [ 58/84]
Estimated Gamma Model
Part 14: Nonlinear Models [ 59/84]
Testing P = 1
- Wald: W = (5.10591 − 1)² / .04233² = 9408.5
- Likelihood ratio:
  logL | (P = 1) = 1539.31
  logL | P̂ = 14240.74
  LR = 2(14240.74 − 1539.31) = 25402.86
- Lagrange multiplier…
Part 14: Nonlinear Models [ 60/84]
Derivatives for the LM Test
logL = Σi [−log λ_i − y_i/λ_i + (P − 1) log(y_i/λ_i) − log Γ(P)]
∂logL/∂β = Σi (y_i/λ_i − P) x_i = Σi g_x,i x_i
∂logL/∂P = Σi [log(y_i/λ_i) − ψ(P)] = Σi g_P,i
ψ(1) = −.5772156649
For the LM test, we compute these at the exponential MLE and P = 1.
Part 14: Nonlinear Models [ 61/84]
Psi Function
[Figure: ψ(P), the log-derivative of the gamma function, plotted for P between 0 and 3.5.]
Part 14: Nonlinear Models [ 62/84]
Score Test
Test the hypothesis that the derivative vector equals zero when evaluated for the larger model with the restricted coefficient vector.
Estimator of zero is ḡ = (1/n) Σi g_i.
Statistic = chi squared = ḡ′ [Var ḡ]⁻¹ ḡ
Use (1/n)(1/n) Σi g_i g_i′ for Var ḡ (the n's will cancel):
chi squared = [Σi g_i]′ [Σi g_i g_i′]⁻¹ [Σi g_i]
Part 14: Nonlinear Models [ 63/84]
Calculated LM Statistic
Loglinear ; Lhs = hhninc ; Rhs = x ; Model = Exponential $
Create ; thetai=exp(x'b) ; gi=(hhninc*thetai - 1) $
Create ; gpi=log(hhninc*thetai)-psi(1)$
Create ; g1i=gi;g2i=gi*educ;g3i=gi*married;g4i=gi*age;g5i=gpi$
Namelist ; ggi=g1i,g2i,g3i,g4i,g5i$
Matrix ; list ; lm = 1'ggi * <ggi'ggi> * ggi'1 $

Matrix LM    has 1 rows and 1 columns.
         1|   23468.7

? Use built-in procedure.
? LM is computed with actual Hessian instead of BHHH
Loglinear ; Lhs = hhninc ; Rhs = x ; Model = Exponential $
Logl ; lhs=hhninc ; rhs=x ; model=gamma ; start=b,1 ; maxit=0 $
| LM Stat. at start values      9604.33 |
Part 14: Nonlinear Models [ 64/84]
Clustered Data and Partial Likelihood
Panel data: y_it | x_it, t = 1, ..., T_i
Some connection across observations within a group.
Assume the marginal density for y_it | x_it is f(y_it | x_it, θ).
The joint density for individual i is
f(y_i1, ..., y_iTi | X_i) = Πt f(y_it | x_it, θ)
"Pseudo-log likelihood" = Σi log Πt f(y_it | x_it, θ) = Σi Σt log f(y_it | x_it, θ)
Just the pooled log likelihood, ignoring the panel aspect of the data.
Not the correct log likelihood. Does maximizing with respect to θ work? Yes, if the marginal density is correctly specified.
Part 14: Nonlinear Models [ 65/84]
Inference with ‘Clustering’
(1) The estimator is consistent.
(2) The asymptotic covariance matrix needs adjustment:
Asy.Var[θ̂] = [Hessian]⁻¹ Var[gradient] [Hessian]⁻¹
H = Σi Σt H_it
g = Σi g_i, where g_i = Σt g_it
Terms in g_i are not independent, so estimation of the variance cannot be done with Σi Σt g_it g_it′.
But terms across i are independent, so we estimate Var[g] with Σi (Σt g_it)(Σt g_it)′:
Est.Var[θ̂_PMLE] = [Σi Σt Ĥ_it]⁻¹ [Σi (Σt ĝ_it)(Σt ĝ_it)′] [Σi Σt Ĥ_it]⁻¹
(Stata inserts a term n/(n−1) before the middle term.)
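A sketch of this cluster-corrected sandwich in numpy: the scores are summed within each cluster before the outer product is formed. The array shapes and names are assumptions made for illustration.

import numpy as np

def cluster_robust_cov(H, scores, groups):
    """H: (K, K) sum of Hessian terms; scores: (N, K) rows g_it'; groups: (N,) cluster ids."""
    K = H.shape[0]
    meat = np.zeros((K, K))
    for i in np.unique(groups):
        gi = scores[groups == i].sum(axis=0)    # g_i = sum_t g_it within cluster i
        meat += np.outer(gi, gi)
    # (Stata would multiply the middle matrix by n/(n-1), with n = the number of clusters.)
    Hinv = np.linalg.inv(H)
    return Hinv @ meat @ Hinv                   # [sum H_it]^(-1) [sum g_i g_i'] [sum H_it]^(-1)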
Part 14: Nonlinear Models [ 66/84]
Cluster Estimation
Part 14: Nonlinear Models [ 67/84]
On Clustering
- The theory is very loose.
- That the marginals would be correctly specified while there is ‘correlation’ across observations is ambiguous.
- It seems to work pretty well in practice (anyway).
- BUT… it does not imply that one can safely just pool the observations in a panel and ignore unobserved common effects.
Part 14: Nonlinear Models [ 68/84]
‘Robust’ Estimation
- If the model is misspecified in some way, then the information matrix equality does not hold.
- Assuming the estimator remains consistent, the appropriate asymptotic covariance matrix is the ‘robust’ matrix, which is actually the original one:
Asy.Var[θ̂_MLE] = [E[Hessian]]⁻¹ Var[gradient] [E[Hessian]]⁻¹
(Software can be coerced into computing this by telling it that clusters all have one observation in them.)
Part 14: Nonlinear Models [ 69/84]
Two Step Estimation and Murphy/Topel
The likelihood function is defined over two parameter vectors:
logL = Σi log f(y_i | x_i, z_i, β, α)
(1) Maximize the whole thing (FIML).
(2) Typical situation: two steps. E.g.,
f(HHNINC | educ, married, age, IfKids) = (1/λ_i) exp(−y_i/λ_i),
λ_i = exp(β0 + β1 Educ + β2 Married + β3 Age + β4 Pr[IfKids])
IfKids | age, bluec: logistic regression,
Pr[IfKids] = exp(α0 + α1 Age + α2 Bluec) / [1 + exp(α0 + α1 Age + α2 Bluec)]
(3) Two step strategy: fit the stage one model (α) by MLE first, insert the results in logL(β, α̂), and estimate β.
Part 14: Nonlinear Models [ 70/84]
Two Step Estimation
(1) Does it work? Yes, with the usual identification conditions, continuity, etc. The first step estimator is assumed to be consistent and asymptotically normally distributed.
(2) The asymptotic covariance matrix at the second step that takes α̂ as if it were known is too small.
(3) Repair the covariance matrix with the Murphy-Topel result (the one published verbatim twice by JBES).
Part 14: Nonlinear Models [ 71/84]
Murphy-Topel - 1
logL1(α) defines the first step estimator. Let
V̂1 = estimated asymptotic covariance matrix for α̂
g_i1 = ∂log f_i1(..., α)/∂α.  (V̂1 might = [Σi ĝ_i1 ĝ_i1′]⁻¹.)
logL(β, α̂) defines the second step estimator, using the estimated value of α.
V̂2 = estimated asymptotic covariance matrix for β̂ | α̂
g_i2 = ∂log f_i2(..., β, α̂)/∂β.  (V̂2 might = [Σi ĝ_i2 ĝ_i2′]⁻¹.)
V̂2 is too small.
Part 14: Nonlinear Models [ 72/84]
Murphy-Topel - 2
V̂1 = estimated asymptotic covariance matrix for α̂
g_i1 = ∂log f_i1(..., α)/∂α.  (V̂1 might = [Σi ĝ_i1 ĝ_i1′]⁻¹.)
V̂2 = estimated asymptotic covariance matrix for β̂ | α̂
g_i2 = ∂log f_i2(..., β, α̂)/∂β.  (V̂2 might = [Σi ĝ_i2 ĝ_i2′]⁻¹.)
h_i2 = ∂log f_i2(..., β, α̂)/∂α̂
Ĉ = Σi ĝ_i2 ĥ_i2′   (the off diagonal block in the Hessian)
R̂ = Σi ĝ_i2 ĝ_i1′   (cross products of derivatives for the two logL's)
M&T: corrected V̂2 = V̂2 + V̂2 [Ĉ V̂1 Ĉ′ − Ĉ V̂1 R̂′ − R̂ V̂1 Ĉ′] V̂2
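At the matrix level the correction is a one-liner. A numpy sketch, taking V1, V2, C and R as already computed, for example by the code on the next slide.

import numpy as np

def murphy_topel(V2, V1, C, R):
    """Corrected second-step covariance: V2 + V2 [C V1 C' - C V1 R' - R V1 C'] V2."""
    Q = C @ V1 @ C.T - C @ V1 @ R.T - R @ V1 @ C.T
    return V2 + V2 @ Q @ V2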
Part 14: Nonlinear Models [ 73/84]
Application of M&T
Reject    ; hhninc = 0 $
Logit     ; lhs=hhkids ; rhs=one,age,bluec ; prob=prifkids $
Matrix    ; v1=varb$
Names     ; z1=one,age,bluec$
Create    ; gi1=hhkids-prifkids$
Loglinear ; lhs=hhninc;rhs=one,educ,married,age,prifkids;model=e$
Matrix    ; v2=varb$
Names     ; z2=one,educ,married,age,prifkids$
Create    ; gi2=hhninc*exp(z2'b)-1$
Create    ; hi2=gi2*b(5)*prifkids*(1-prifkids)$
Create    ; varc=gi1*gi2 ; varr=gi1*hi2$
Matrix    ; c=z2'[varc]z1 ; r=z2'[varr]z1$
Matrix    ; q=c*v1*c'-c*v1*r'-r*v1*c'
          ; mt=v2+v2*q*v2;stat(b,mt)$
Part 14: Nonlinear Models [ 74/84]
M&T Application
+---------------------------------------------+
| Multinomial Logit Model                     |
| Dependent variable               HHKIDS     |
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Characteristics in numerator of Prob[Y = 1]
Constant     2.61232320      .05529365      47.245   .0000
AGE          -.07036132      .00125773     -55.943   .0000   43.5271942
BLUEC        -.02474434      .03052219       -.811   .4175    .24379621
+---------------------------------------------+
| Exponential (Loglinear) Regression Model    |
| Dependent variable               HHNINC     |
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Parameters in conditional mean function
Constant    -3.79588863      .44440782      -8.541   .0000
EDUC         -.05580594      .00267736     -20.844   .0000   11.3201838
MARRIED      -.20232648      .01487166     -13.605   .0000    .75869263
AGE           .08112565      .00633014      12.816   .0000   43.5271942
PRIFKIDS     5.23741034      .41248916      12.697   .0000    .40271576
+---------+--------------+----------------+--------+---------+
B_1         -3.79588863      .44425516      -8.544   .0000
B_2          -.05580594      .00267540     -20.859   .0000
B_3          -.20232648      .01486667     -13.609   .0000
B_4           .08112565      .00632766      12.821   .0000
B_5          5.23741034      .41229755      12.703   .0000
Why so little change? N = 27,000+. No new variation.
Part 14: Nonlinear Models [ 75/84]
GMM Estimation
ḡ(β) = (1/N) Σi m_i(y_i, x_i, β)
Asy.Var[ḡ(β)] is estimated with W = (1/N)(1/N) Σi m_i(y_i, x_i, β) m_i(y_i, x_i, β)′
The GMM estimator of β then minimizes
q = [(1/N) Σi m_i(y_i, x_i, β)]′ W⁻¹ [(1/N) Σi m_i(y_i, x_i, β)].
Est.Asy.Var[β̂_GMM] = [G′W⁻¹G]⁻¹,  G = ∂ḡ(β)/∂β′
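A generic numpy/scipy sketch of this recipe: build W from the moments evaluated at a first-round estimate, then minimize q. The function moments(beta, data), returning the N x L matrix with rows m_i′, is a placeholder to be supplied by the user; everything here is an illustrative sketch rather than a canned routine.

import numpy as np
from scipy.optimize import minimize

def gmm(moments, beta0, data):
    def gbar(beta):
        return moments(beta, data).mean(axis=0)          # (1/N) sum m_i
    # First round with an identity weighting matrix
    beta1 = minimize(lambda b: gbar(b) @ gbar(b), beta0, method="BFGS").x
    m = moments(beta1, data)
    W = (m.T @ m) / len(m) ** 2                          # (1/N)(1/N) sum m_i m_i'
    Winv = np.linalg.inv(W)
    q = lambda b: gbar(b) @ Winv @ gbar(b)               # GMM criterion
    return minimize(q, beta1, method="BFGS").x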
Part 14: Nonlinear Models [ 76/84]
GMM Estimation-1
- GMM is broader than M estimation and ML estimation.
- Both M and ML are GMM estimators:
ḡ(β) = (1/n) Σi ∂log f(y_i | x_i, β)/∂β   for MLE
ḡ(β) = (1/n) Σi e_i ∂E(y_i | x_i, β)/∂β   for NLSQ
Part 14: Nonlinear Models [ 77/84]
GMM Estimation - 2
Exactly identified GMM problems:
When ḡ(β) = (1/N) Σi m_i(y_i, x_i, β) = 0 is K equations in K unknown parameters (the exactly identified case), the weighting matrix in
q = [(1/N) Σi m_i(y_i, x_i, β)]′ W⁻¹ [(1/N) Σi m_i(y_i, x_i, β)]
is irrelevant to the solution, since we can set ḡ(β) = 0 exactly, so q = 0. And the asymptotic covariance matrix (estimator) is the product of 3 square matrices and becomes
[G′W⁻¹G]⁻¹ = G⁻¹WG′⁻¹
Part 14: Nonlinear Models [ 78/84]
Optimization - Algorithms
Maximize or minimize (optimize) a function F(θ).
Algorithm = a rule for searching for the optimizer.
Iterative algorithm: θ(k+1) = θ(k) + Update(k)
Derivative (gradient) based algorithm: θ(k+1) = θ(k) + Update(g(k)); the update is a function of the gradient.
Compare to ‘derivative free’ methods (for discontinuous criterion functions).
Part 14: Nonlinear Models [ 79/84]
Optimization Algorithms
Iteration: θ(k+1) = θ(k) + Update(k)
General structure: θ(k+1) = θ(k) + λ(k) W(k) g(k)
g(k) = derivative vector; W(k) g(k) = direction vector, which points to a better value than θ(k)
λ(k) = ‘step size’
W(k) = a weighting matrix
Algorithms are defined by the choices of λ(k) and W(k).
Part 14: Nonlinear Models [ 80/84]
Algorithms
Steepest ascent: W(k) = I,  λ(k) = −g(k)′g(k) / [g(k)′H(k)g(k)]
  g(k) = first derivative vector, H(k) = second derivatives matrix
Newton's method: λ(k) = −1, W(k) = [H(k)]⁻¹
  (Sometimes called Newton-Raphson.)
Method of scoring: λ(k) = −1, W(k) = [E[H(k)]]⁻¹
  (Scoring uses the expected Hessian. Usually inferior to Newton's method; takes more iterations.)
BHHH method (for MLE): λ(k) = −1, W(k) = [Σi g_i(k) g_i(k)′]⁻¹
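A toy Python sketch of the common structure θ(k+1) = θ(k) + λ(k) W(k) g(k), with the Newton and steepest ascent choices coded side by side on a simple concave function; everything here is illustrative.

import numpy as np

def F(t):    return -(t[0] - 1.0) ** 2 - 2.0 * (t[1] + 0.5) ** 2      # concave toy objective
def grad(t): return np.array([-2.0 * (t[0] - 1.0), -4.0 * (t[1] + 0.5)])
def hess(t): return np.array([[-2.0, 0.0], [0.0, -4.0]])

def newton_step(t):                      # lambda = -1, W = H^(-1)
    return -np.linalg.solve(hess(t), grad(t))

def steepest_ascent_step(t):             # W = I, lambda = -g'g / g'Hg
    g = grad(t)
    return (-(g @ g) / (g @ hess(t) @ g)) * g

theta = np.zeros(2)
for _ in range(25):
    theta = theta + newton_step(theta)   # swap in steepest_ascent_step to compare
    g = grad(theta)
    if abs(g @ np.linalg.solve(hess(theta), g)) < 1e-12:   # delta = g'[H]^(-1) g
        break
print(theta, F(theta))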
Part 14: Nonlinear Models [ 81/84]
Line Search Methods
Squeezing: essentially trial and error; try λ(k) = 1, 1/2, 1/4, 1/8, ... until the function improves.
Golden section: interpolate between λ(k) and λ(k−1).
Others: many different methods have been suggested.
Part 14: Nonlinear Models [ 82/84]
Quasi-Newton Methods
How to construct the weighting matrix:
Variable metric methods: W(k) = W(k−1) + E(k−1), W(1) = I
Rank one updates: W(k) = W(k−1) + a(k−1) a(k−1)′  (Davidon-Fletcher-Powell)
There are rank two updates (Broyden) and higher.
Part 14: Nonlinear Models [ 83/84]
Stopping Rule
When to stop iterating: ‘convergence’
(1) Derivatives are small? Not good. The maximizer of F(θ) is the same as that of .0000001 F(θ), but the derivatives are small right away.
(2) Small absolute change in parameters from one iteration to the next? Problematic because it is a function of the stepsize, which may be small.
(3) The commonly accepted ‘scale free’ measure: δ = g(k)′ [H(k)]⁻¹ g(k)
Part 14: Nonlinear Models [ 84/84]
For Example
Nonlinear Estimation of Model Parameters
Method=BFGS ; Maximum iterations= 4
Convergence criteria: gtHg  .1000D-05  chg.F  .0000D+00  max|dB|  .0000D+00
Start values:  -.10437D+01  .00000D+00  .00000D+00  .00000D+00  .10000D+01
1st derivs.    -.23934D+05 -.26990D+06 -.18037D+05 -.10419D+07  .44370D+05
Parameters:    -.10437D+01  .00000D+00  .00000D+00  .00000D+00  .10000D+01
Itr 1 F= .3190D+05 gtHg= .1078D+07 chg.F= .3190D+05 max|db|= .1042D+13
Try = 0 F= .3190D+05 Step= .0000D+00 Slope= -.1078D+07
Try = 1 F= .4118D+06 Step= .1000D+00 Slope= .2632D+08
Try = 2 F= .5425D+04 Step= .5214D-01 Slope= .8389D+06
Try = 3 F= .1683D+04 Step= .4039D-01 Slope= -.1039D+06
1st derivs.    -.45100D+04 -.45909D+05 -.18517D+04 -.95703D+05 -.53142D+04
Parameters:    -.10428D+01  .10116D-01  .67604D-03  .39052D-01  .99834D+00
Itr 2 F= .1683D+04 gtHg= .1064D+06 chg.F= .3022D+05 max|db|= .4538D+07
Try = 0 F= .1683D+04 Step= .0000D+00 Slope= -.1064D+06
Try = 1 F= .1006D+06 Step= .4039D-01 Slope= .7546D+07
Try = 2 F= .1839D+04 Step= .4702D-02 Slope= .1847D+06
Try = 3 F= .1582D+04 Step= .1855D-02 Slope= .7570D+02
...
1st derivs.    -.32179D-05 -.29845D-04 -.28288D-05 -.16951D-03  .73923D-06
Itr 20 F= .1424D+05 gt<H>g= .13893D-07 chg.F= .1155D-08 max|db|= .1706D-08
* Converged
Normal exit from iterations. Exit status=0.
Function= .31904974915D+05 at entry, -.14237328819D+05 at exit