Prediction - Home

Chapter 4
Prediction and Bayesian Inference
• 4.1 Estimators versus predictors
• 4.2 Prediction for one-way ANOVA models
– Shrinkage estimation, types of predictions
• 4.3 Best linear unbiased predictors (BLUPs)
• 4.4 Mixed model predictors
• 4.5 Bayesian inference
• 4.6 Case study: Forecasting lottery sales
• 4.7 Credibility Theory
• Appendix 4A Linear unbiased predictors
4.1 Estimators versus predictors
• In the longitudinal data model, $y_{it} = z_{it}' \alpha_i + x_{it}' \beta + \varepsilon_{it}$, the
variables $\{\alpha_i\}$ describe subject-specific effects.
• Given the data $\{y_{it}, z_{it}, x_{it}\}$, in some problems it is of
interest to “summarize” subject effects.
– We have discussed how to estimate the fixed, unknown
parameters $\beta$.
– It is also of interest to summarize subject-specific
effects, such as those described by the random variable
$\alpha_i$.
• Predictors are “estimators” of random variables.
– Like estimators, predictors are said to be linear if they
are formed from a linear combination of the response y.
Applications of prediction
• In animal and plant breeding, one wishes to predict the
production of milk for cows based on (1) their lineage
(random) and (2) herds (fixed)
• In credibility theory, one wishes to predict expected claims
for a policyholder given exposure to several risk factors
• In sample surveys, one wishes to predict the size of a
specific age-sex-race cohort within a small geographical
area (known as “small area estimation”).
• In a survey article, Robinson (1991) also cites (1) ore
reserve estimation in geological surveys, (2) measuring the
quality of a production plan and (3) ranking baseball
players’ abilities.
4.2 Prediction for one-way ANOVA models
• Consider the traditional one-way random effects ANOVA
(analysis of variance) model:
$$y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$$
– Suppose that we wish to summarize the subject-specific
conditional mean, $\mu_\alpha + \alpha_i$.
• For contrast, first consider using the fixed effects model
with $\mu_\alpha = 0$.
– Here, we have that the subject average $\bar{y}_i$ is the “best” (Gauss-Markov)
estimate of $\alpha_i$.
– This estimate is unbiased, that is, $E\,\bar{y}_i = \alpha_i$.
– This estimate has minimum variance among all linear
unbiased estimators (BLUE).
Shrinkage estimator
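• Use the one-way random effects model.
– Consider an “estimator” of $\mu_\alpha + \alpha_i$ that is a linear
combination of $\bar{y}_i$ and $\bar{y}$, that is, $c_1 \bar{y}_i + c_2 \bar{y}$
for constants $c_1$ and $c_2$.
• Calculations show that the best values of $c_1$ and $c_2$ that
minimize $E\left(c_1 \bar{y}_i + c_2 \bar{y} - (\mu_\alpha + \alpha_i)\right)^2$ are $c_2 = 1 - c_1$ and
$$c_1 = \frac{T_i^* \sigma_\alpha^2}{\sigma_\varepsilon^2 + T_i^* \sigma_\alpha^2}, \qquad T_i^* = \frac{1 - \dfrac{2T_i}{N} + \dfrac{1}{N^2}\sum_{j=1}^{n} T_j^2}{T_i^{-1} - N^{-1}},$$
with $N = \sum_{j=1}^{n} T_j$ the total number of observations.
• For large n, we have the shrinkage estimator, or predictor, of
$\mu_\alpha + \alpha_i$ to be $\bar{y}_{i,s} = \zeta_i \bar{y}_i + (1 - \zeta_i)\bar{y}$, where
$$\zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}$$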
Example of shrinkage estimator
Hypothetical Run Times for Three Machines
Machine   Run Times         Average Run Time
1         14, 12, 10, 12    $\bar{y}_1 = 12$
2         9, 16, 15, 12     $\bar{y}_2 = 13$
3         8, 10, 7, 7       $\bar{y}_3 = 8$
– Notation: $y_{ij}$ means the jth run from the ith machine.
– For example, $y_{21} = 9$ and $y_{23} = 15$.
• Are there real differences among machines?
Example - Continued
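• To see the “shrinkage” effect, compare each machine average with its
shrinkage estimator, which is pulled towards the overall mean $\bar{y} = 11$:
Machine   Average $\bar{y}_i$   Shrinkage estimator $\bar{y}_{i,s}$
3         8                     8.525
1         12                    11.825
2         13                    12.650
Figure 4.1 Comparison of Subject-Specific Means to
Shrinkage Estimators.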
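The shrinkage values in Figure 4.1 can be reproduced with a few lines of arithmetic. The sketch below is in Python. The weight 0.825 is an assumption backed out from the figure values (for example, 8.525 = 0.825 × 8 + 0.175 × 11); the slides do not state the variance components behind it, and the function name is illustrative.

```python
# Hypothetical run times from the three-machine example.
run_times = {1: [14, 12, 10, 12], 2: [9, 16, 15, 12], 3: [8, 10, 7, 7]}

# Assumed shrinkage weight: backed out from the figure, not given in the slides.
ZETA = 0.825


def shrinkage_estimates(groups, zeta):
    """Return {subject: zeta * subject mean + (1 - zeta) * overall mean}."""
    all_obs = [y for obs in groups.values() for y in obs]
    grand_mean = sum(all_obs) / len(all_obs)
    return {
        i: zeta * (sum(obs) / len(obs)) + (1 - zeta) * grand_mean
        for i, obs in groups.items()
    }


estimates = shrinkage_estimates(run_times, ZETA)
for machine, value in sorted(estimates.items()):
    print(machine, round(value, 3))
```

Every machine keeps its rank, but the spread between the largest and smallest value narrows, which is the sense in which the averages are "shrunk".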
More on shrinkage estimators
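• Under the random effects model, $\bar{y}_i$ is an unbiased
predictor of $\mu_\alpha + \alpha_i$ in the sense that $E\left(\bar{y}_i - (\mu_\alpha + \alpha_i)\right) = 0$.
– However, $\bar{y}_i$ is inefficient in the sense that $\bar{y}_{i,s}$ has a
smaller mean square error than $\bar{y}_i$.
– Here, $\bar{y}_i$ has been “shrunk” towards the stable estimator $\bar{y}$.
– The “estimator” $\bar{y}_{i,s}$ is said to “borrow strength” from the
stable estimator $\bar{y}$.
• Recall
$$\zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}$$
• Note that $\zeta_i \to 1$ as either (i) $T_i \to \infty$ or (ii) $\sigma_\alpha^2 / \sigma_\varepsilon^2 \to \infty$.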
Best predictors
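• From Section 3.1, it is easy to check that the generalized
least squares estimator of $\mu_\alpha$ is
$$m_{\alpha,GLS} = \frac{\sum_{i=1}^{n} \zeta_i \bar{y}_i}{\sum_{i=1}^{n} \zeta_i}, \qquad \zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}$$
• The linear predictor of $\mu_\alpha + \alpha_i$ that has minimum variance
is $\bar{y}_{i,BLUP} = \zeta_i \bar{y}_i + (1 - \zeta_i)\, m_{\alpha,GLS}$.
– Here, the acronym BLUP stands for best linear unbiased
predictor.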
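A minimal Python sketch of the two formulas above for an unbalanced design, where the weights differ by subject. The data, the variance components and the function name `one_way_blup` are hypothetical, and the variance components are treated as known rather than estimated.

```python
def one_way_blup(groups, sigma2_alpha, sigma2_eps):
    """BLUPs of mu_alpha + alpha_i in the one-way random effects model.

    groups maps a subject label to its list of responses; both variance
    components are treated as known.
    """
    means = {i: sum(obs) / len(obs) for i, obs in groups.items()}
    zeta = {
        i: len(obs) * sigma2_alpha / (len(obs) * sigma2_alpha + sigma2_eps)
        for i, obs in groups.items()
    }
    # GLS estimator of the overall mean: zeta-weighted average of subject means.
    m_gls = sum(zeta[i] * means[i] for i in groups) / sum(zeta.values())
    blups = {i: zeta[i] * means[i] + (1 - zeta[i]) * m_gls for i in groups}
    return m_gls, zeta, blups


# Hypothetical unbalanced data and variance components, for illustration only.
data = {"A": [14, 12, 10, 12], "B": [9, 16], "C": [8, 10, 7, 7, 8, 8]}
m_gls, zeta, blups = one_way_blup(data, sigma2_alpha=8.25, sigma2_eps=7.0)
for label in data:
    print(label, round(zeta[label], 3), round(blups[label], 3))
```

Subjects with more observations get a weight closer to one, so their BLUP stays closer to their own average. With balanced data the GLS estimator reduces to the overall mean and the BLUP coincides with the shrinkage estimator.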
Types of Predictors
• We have now introduced the BLUP of $\mu_\alpha + \alpha_i$. This
quantity is a linear combination of global parameters and
subject-specific effects.
• Two other types of predictors are of interest.
– Residuals. Here, we wish to “predict” $\varepsilon_{it}$. The BLUP
residual turns out to be
$$e_{it,BLUP} = y_{it} - \bar{y}_{i,BLUP}$$
– Forecasts. Here, we wish to predict, for “L” lead time
units into the future,
$$y_{i,T_i+L} = \mu_\alpha + \alpha_i + \varepsilon_{i,T_i+L}$$
– Without serial correlation, the predictor is the same as
the predictor of $\mu_\alpha + \alpha_i$. However, we will see that the
mean square error turns out to be larger.
4.3 Best linear unbiased predictors
• This section develops best linear unbiased predictors in the
context of mixed linear models, then specializes the
consideration to longitudinal data mixed models.
• BLUPs are developed by examining the minimum mean
square error predictor of a random variable, w.
– We give a development due to Harville (1976).
– The argument is originally due to Goldberger (1962),
who coined the phrase best linear unbiased predictor.
– The acronym was first used by Henderson (1973).
• BLUPs can also be developed as conditional expectations
using multivariate normality
• BLUPs can also be developed in a Bayesian context.
Mixed linear models
• Suppose that we observe an $N \times 1$ random vector $y$ with mean
$E\,y = X\beta$ and variance $\mathrm{Var}\,y = V$.
– We wish to predict a random variable $w$ that has mean
$E\,w = \lambda'\beta$ and $\mathrm{Var}\,w = \sigma_w^2$.
– Denote the $1 \times N$ vector of covariances between $w$ and $y$ as
$\mathrm{Cov}(w, y) = \mathrm{cov}_{wy}$.
• Assuming known regression parameters ($\beta$), the best linear (in
$y$) predictor of $w$ is
$$w^* = E\,w + \mathrm{cov}_{wy} V^{-1}(y - E\,y) = \lambda'\beta + \mathrm{cov}_{wy} V^{-1}(y - X\beta).$$
– If $w, y$ are multivariate normal, then $w^*$ equals $E(w \mid y)$ and
hence is a minimum mean square predictor of $w$.
– Without the assumption of normality, $w^*$ is still the minimum
mean square predictor of $w$ among linear predictors. See Appendix
4A.1.
BLUP’s as predictors
• To develop the BLUP,
– define $b_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}y$ to be the generalized
least squares (GLS) estimator of $\beta$.
– This is the best linear unbiased estimator (BLUE).
– Replace $\beta$ by $b_{GLS}$ in the definition of $w^*$ to get the BLUP
$$w_{BLUP} = \lambda' b_{GLS} + \mathrm{cov}_{wy} V^{-1}(y - X b_{GLS}) = (\lambda' - \mathrm{cov}_{wy} V^{-1}X)\, b_{GLS} + \mathrm{cov}_{wy} V^{-1} y.$$
– See Appendix 4A.2 for a check, establishing $w_{BLUP}$ as
the best linear unbiased predictor of $w$.
• From Appendix 4A.3, we also have the form for the
minimum mean square error:
$$\mathrm{Var}(w_{BLUP} - w) = (\lambda' - \mathrm{cov}_{wy} V^{-1}X)(X'V^{-1}X)^{-1}(\lambda' - \mathrm{cov}_{wy} V^{-1}X)' - \mathrm{cov}_{wy} V^{-1} \mathrm{cov}_{wy}' + \sigma_w^2.$$
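The matrix formulas for the BLUP and its mean square error can be checked numerically. The following is a sketch in plain Python (no external libraries) that treats `cov_wy` as the vector of covariances between w and y. The helper names `solve`, `dot` and `blup` are illustrative, and the variance components 8.25 and 7 are hypothetical values chosen so that the weight for four observations is 0.825.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def blup(y, X, V, lam, cov_wy, var_w):
    """BLUP of a scalar w and its mean square error.

    X is a list of N rows with p entries; lam has length p; cov_wy has length N.
    """
    p = len(lam)
    cols = [[row[k] for row in X] for k in range(p)]   # columns of X
    Vinv_cols = [solve(V, c) for c in cols]            # V^{-1} X, column by column
    XtVinvX = [[dot(cols[j], Vinv_cols[k]) for k in range(p)] for j in range(p)]
    Vinv_y = solve(V, y)
    b_gls = solve(XtVinvX, [dot(c, Vinv_y) for c in cols])
    resid = [yi - dot(row, b_gls) for yi, row in zip(y, X)]
    Vinv_cov = solve(V, cov_wy)                        # V is symmetric
    w_blup = dot(lam, b_gls) + dot(Vinv_cov, resid)
    a = [lam[k] - dot(cov_wy, Vinv_cols[k]) for k in range(p)]
    mse = dot(a, solve(XtVinvX, a)) - dot(cov_wy, Vinv_cov) + var_w
    return w_blup, mse


# One-way model for the three-machine data, predicting the mean of machine 3.
runs = [[14, 12, 10, 12], [9, 16, 15, 12], [8, 10, 7, 7]]
s2_a, s2_e = 8.25, 7.0                                 # hypothetical values
y = [v for g in runs for v in g]
group = [i for i, g in enumerate(runs) for _ in g]
N = len(y)
V = [[(s2_a if group[r] == group[c] else 0.0) + (s2_e if r == c else 0.0)
      for c in range(N)] for r in range(N)]
X = [[1.0] for _ in range(N)]
cov_wy = [s2_a if g == 2 else 0.0 for g in group]
w3, mse3 = blup(y, X, V, [1.0], cov_wy, s2_a)
print(round(w3, 3), round(mse3, 4))
```

Under these assumed variance components the general formula returns the same value for machine 3 as the one-way shrinkage form, which is a useful sanity check before applying it to models with covariates.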
Example: One-way model
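• Recall $y_{it} = \mu_\alpha + \alpha_i + \varepsilon_{it}$.
– Thus, $y_i = 1_i(\mu_\alpha + \alpha_i) + \varepsilon_i$, so that $X_i = 1_i$. Thus,
$$b_{GLS} = m_{\alpha,GLS} = \frac{\sum_{i=1}^{n} \zeta_i \bar{y}_i}{\sum_{i=1}^{n} \zeta_i} \quad \text{and} \quad V_i^{-1} = \frac{1}{\sigma_\varepsilon^2}\left(I_i - \frac{\sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}\, J_i\right).$$
– With this, we note that
$$V_i^{-1}(y_i - X_i b_{GLS}) = \frac{1}{\sigma_\varepsilon^2}\left((y_i - 1_i m_{\alpha,GLS}) - \zeta_i 1_i(\bar{y}_i - m_{\alpha,GLS})\right).$$
– Thus, for predicting $w = \mu_\alpha + \alpha_i$ we have $\lambda = 1$ and
$\mathrm{Cov}(w, y_i) = \sigma_\alpha^2 1_i'$ for the ith subject, 0 otherwise. Thus,
$$\begin{aligned}
w_{BLUP} &= m_{\alpha,GLS} + \mathrm{Cov}(w, y_i) V_i^{-1}(y_i - X_i b_{GLS}) \\
&= m_{\alpha,GLS} + \frac{\sigma_\alpha^2}{\sigma_\varepsilon^2}\, 1_i' \left((y_i - 1_i m_{\alpha,GLS}) - \zeta_i 1_i(\bar{y}_i - m_{\alpha,GLS})\right) \\
&= m_{\alpha,GLS} + \zeta_i(\bar{y}_i - m_{\alpha,GLS}) = \bar{y}_{i,BLUP}.
\end{aligned}$$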
Random effect ANOVA model
• For predicting residuals $\varepsilon_{it}$ we have $\lambda = 0$ and $\mathrm{Cov}(w, y_{is}) = \sigma_\varepsilon^2$
for the ith subject and time period $s = t$, 0 otherwise.
• Let $1_{it}$ be a $T_i \times 1$ vector with a 1 in the tth position, 0
otherwise. Thus,
$$w_{BLUP} = \sigma_\varepsilon^2\, 1_{it}' V_i^{-1}(y_i - X_i b_{GLS}) = y_{it} - \bar{y}_{i,BLUP}$$
is our BLUP residual.
4.4 Mixed model predictors
• Recall the longitudinal data mixed model
$$y_i = Z_i \alpha_i + X_i \beta + \varepsilon_i$$
• As described in Section 3.3, this is a special case of the
mixed linear model. We use
$V = \text{block diagonal}(V_1, \ldots, V_n)$,
where $V_i = Z_i D Z_i' + R_i$, and
$X = (X_1', \ldots, X_n')'$.
• For BLUP calculations, note that
$$\mathrm{cov}_{wy} = \left(\mathrm{Cov}(w, y_1), \ldots, \mathrm{Cov}(w, y_n)\right)$$
Longitudinal data mixed model BLUP
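• Recall that the random variable $w$ has mean $E\,w = \lambda'\beta$ and $\mathrm{Var}\,w = \sigma_w^2$.
• The BLUP is
$$w_{BLUP} = \lambda' b_{GLS} + \sum_{i=1}^{n} \mathrm{Cov}(w, y_i) V_i^{-1}(y_i - X_i b_{GLS})$$
• The mean square error is
$$\begin{aligned}
\mathrm{Var}(w_{BLUP} - w) = {} & \left(\lambda' - \sum_{i=1}^{n} \mathrm{Cov}(w, y_i) V_i^{-1} X_i\right) \left(\sum_{i=1}^{n} X_i' V_i^{-1} X_i\right)^{-1} \left(\lambda' - \sum_{i=1}^{n} \mathrm{Cov}(w, y_i) V_i^{-1} X_i\right)' \\
& - \sum_{i=1}^{n} \mathrm{Cov}(w, y_i) V_i^{-1} \mathrm{Cov}(w, y_i)' + \sigma_w^2
\end{aligned}$$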
BLUP special cases
• Global parameters and subject-specific effects.
– Suppose that the interest is in predicting linear
combinations of global parameters $\beta$ and subject-specific effects $\alpha_i$.
– Consider linear combinations of the form
$w = c_1'\alpha_i + c_2'\beta$.
• Residuals. Here, $w = \varepsilon_{it}$.
• Forecasts. Suppose that the ith subject is included in the data
set; predict
$$y_{i,T_i+L} = z_{i,T_i+L}'\alpha_i + x_{i,T_i+L}'\beta + \varepsilon_{i,T_i+L}$$
– for L lead time units in the future.
Predicting global parameters and
subject-specific effects
• Consider linear combinations of the form $w = c_1'\alpha_i + c_2'\beta$.
• Straightforward calculations show that
– $E\,w = c_2'\beta$ so that $\lambda = c_2$,
– $\mathrm{Cov}(w, y_j) = c_1' D Z_i'$ for $j = i$,
– $\mathrm{Cov}(w, y_j) = 0$ for $j \ne i$.
• Thus, $w_{BLUP} = c_2' b_{GLS} + c_1' D Z_i' V_i^{-1}(y_i - X_i b_{GLS})$.
Special case 1
• Take $c_2 = 0$. Because the mean and variance expressions
hold for all vectors $c_1$, we may write this in vector
notation to get the BLUP of $\alpha_i$, the vector
$$a_{i,BLUP} = D Z_i' V_i^{-1}(y_i - X_i b_{GLS}).$$
• This is unbiased in the sense that $E(a_{i,BLUP} - \alpha_i) = 0$.
• This predictor has minimum variance among all linear
unbiased predictors (BLUP).
• In the case of the error components model ($z_{it} = 1$), this
reduces to
$$a_{i,BLUP} = \zeta_i(\bar{y}_i - \bar{x}_i' b_{GLS})$$
• For comparison, recall the fixed effects parameter estimate,
$$a_i = \bar{y}_i - \bar{x}_i' b$$
Motivating BLUP’s
• We can also motivate BLUP’s using normal theory:
– Consider the case where $\alpha_i$ and $\varepsilon_i$ are multivariate normally
distributed.
– Then, it can be shown that $E(\alpha_i \mid y_i) = D Z_i' V_i^{-1}(y_i - X_i \beta)$.
– To motivate this, consider asking the question: what
realization of $\alpha_i$ could be associated with $y_i$? The
expectation!
– The BLUP is the BLUE of $E(\alpha_i \mid y_i)$. (That is, replace $\beta$ by
$b_{GLS}$.)
Special case 2
• As another example, it is of interest to predict
$$w = E(y_{i,T_i+1} \mid \alpha_i) = z_{i,T_i+1}'\alpha_i + x_{i,T_i+1}'\beta$$
• Choose $c_1 = z_{i,T_i+1}$ and $c_2 = x_{i,T_i+1}$.
• This yields
$$w_{BLUP} = z_{i,T_i+1}'\, a_{i,BLUP} + x_{i,T_i+1}'\, b_{GLS}$$
• This predictor is of interest in actuarial science, where it is
known as the credibility estimator.
BLUP Residuals
• Here, $w = \varepsilon_{it}$. Because $E\,w = 0$, it follows that $\lambda = 0$.
• Straightforward calculations show that
– $\mathrm{Cov}(w, y_j) = \sigma_\varepsilon^2 1_{it}'$ for $j = i$ and
– $\mathrm{Cov}(w, y_j) = 0$ for $j \ne i$.
– Here, the symbol $1_{it}$ denotes a $T_i \times 1$ vector that has a
“one” in the tth position and is zero otherwise.
• Thus
$$e_{it,BLUP} = \sigma_\varepsilon^2\, 1_{it}' V_i^{-1}(y_i - X_i b_{GLS}).$$
• This can also be expressed as
$$e_{it,BLUP} = y_{it} - \left(z_{it}'\, a_{i,BLUP} + x_{it}'\, b_{GLS}\right)$$
Predicting future observations
• Suppose that the ith subject is included in the data set; predict
$$y_{i,T_i+L} = z_{i,T_i+L}'\alpha_i + x_{i,T_i+L}'\beta + \varepsilon_{i,T_i+L}$$
– for L lead time units in the future.
• We will assume that $x_{i,T_i+L}$ and $z_{i,T_i+L}$ are known.
• It follows that $\lambda = x_{i,T_i+L}$.
• Straightforward calculations show that
$$\mathrm{Cov}(w, y_j) = \begin{cases} z_{i,T_i+L}' D Z_i' + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i) & \text{for } j = i \\ 0 & \text{for } j \ne i \end{cases}$$
• Thus, the forecast of $y_{i,T_i+L}$ is
$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}'\, b_{GLS} + z_{i,T_i+L}'\, a_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i) R_i^{-1} e_{i,BLUP}$$
• Thus, the forecast is the estimate of the conditional mean
plus the serial correlation correction factor
$\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i) R_i^{-1} e_{i,BLUP}$.
Predicting future observations
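• To illustrate, consider the special case where we have
autoregressive of order 1 (AR(1)), serially correlated errors.
• Thus, we have
$$R = \sigma^2 \begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{pmatrix}$$
$$R^{-1} = \frac{1}{\sigma^2(1 - \rho^2)} \begin{pmatrix}
1 & -\rho & 0 & \cdots & 0 & 0 \\
-\rho & 1 + \rho^2 & -\rho & \cdots & 0 & 0 \\
0 & -\rho & 1 + \rho^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 + \rho^2 & -\rho \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{pmatrix}$$
• After some algebra, the L step forecast is
$$\hat{y}_{i,T_i+L} = x_{i,T_i+L}'\, b_{GLS} + z_{i,T_i+L}'\, a_{i,BLUP} + \rho^L e_{iT_i,BLUP}$$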
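The "some algebra" behind the AR(1) forecast amounts to showing that the serial correlation correction puts weight $\rho^L$ on the last BLUP residual and zero weight on all earlier ones. A small Python sketch can confirm this without inverting R: if the weight vector is $(0, \ldots, 0, \rho^L)$, then multiplying it by R must reproduce the covariance between the future error and the observed errors. The function names and the values of T, L and rho below are hypothetical.

```python
def ar1_covariance(T, rho, sigma2=1.0):
    """T x T covariance matrix R of a stationary AR(1) error series."""
    return [[sigma2 * rho ** abs(r - s) for s in range(T)] for r in range(T)]


def cov_future_with_past(T, L, rho, sigma2=1.0):
    """Cov(e_{T+L}, (e_1, ..., e_T)) for the same AR(1) series."""
    return [sigma2 * rho ** (T + L - t) for t in range(1, T + 1)]


def forecast(conditional_mean, last_blup_residual, rho, L):
    """L-step forecast: conditional mean plus rho^L times the last BLUP residual."""
    return conditional_mean + rho ** L * last_blup_residual


T, L, rho = 5, 3, 0.6                      # hypothetical values
R = ar1_covariance(T, rho)
weights = [0.0] * (T - 1) + [rho ** L]     # claimed value of Cov(e_{T+L}, e) R^{-1}

# Check the claim without inverting R: weights times R must equal Cov(e_{T+L}, e).
lhs = [sum(weights[r] * R[r][s] for r in range(T)) for s in range(T)]
rhs = cov_future_with_past(T, L, rho)
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_gap)
```

Because $|\rho| < 1$, the correction term decays geometrically in the lead time L, so long-horizon forecasts revert to the estimated conditional mean.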
4.5 Bayesian Inference
• With Bayesian statistical models, one views both the model
parameters and the data as random variables.
– We assume distributions for each type of random variable.
• Given the parameters β and α, the response model is
$$y = X\beta + Z\alpha + \varepsilon$$
– Specifically, we assume that the responses y conditional on
α and β are normally distributed and that
$E(y \mid \alpha, \beta) = Z\alpha + X\beta$ and $\mathrm{Var}(y \mid \alpha, \beta) = R$.
• Assume that α is distributed normally with mean $\mu_\alpha$ and
variance $D$ and that β is distributed normally with mean $\mu_\beta$
and variance $\Sigma_\beta$, each independent of the other.
Distributions
• The joint distribution of (α, β) is known as the prior
distribution.
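• To summarize, the joint distribution of (α, β, y) is
$$\begin{pmatrix} \alpha \\ \beta \\ y \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_\alpha \\ \mu_\beta \\ Z\mu_\alpha + X\mu_\beta \end{pmatrix}, \begin{pmatrix} D & 0 & DZ' \\ 0 & \Sigma_\beta & \Sigma_\beta X' \\ ZD & X\Sigma_\beta & V + X\Sigma_\beta X' \end{pmatrix} \right)$$
• where $V = R + Z D Z'$.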
Posterior Distribution
• The distribution of parameters given the data is known as
the posterior distribution.
• The posterior distribution of (α, β) given y is normal.
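• The conditional moments are
$$E\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \Big|\; y \right) = \begin{pmatrix} \mu_\alpha + DZ'(V + X\Sigma_\beta X')^{-1}(y - Z\mu_\alpha - X\mu_\beta) \\ \mu_\beta + \Sigma_\beta X'(V + X\Sigma_\beta X')^{-1}(y - Z\mu_\alpha - X\mu_\beta) \end{pmatrix}$$
and
$$\mathrm{Var}\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \Big|\; y \right) = \begin{pmatrix} D & 0 \\ 0 & \Sigma_\beta \end{pmatrix} - \begin{pmatrix} DZ' \\ \Sigma_\beta X' \end{pmatrix} (V + X\Sigma_\beta X')^{-1} \begin{pmatrix} ZD & X\Sigma_\beta \end{pmatrix}$$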
Relation with BLUPs
• In longitudinal data applications, one typically has more
information about the global parameters β than subject-specific parameters α.
• Consider first the case $\Sigma_\beta = 0$, so that $\beta = \mu_\beta$ with probability
one.
– Intuitively, this means that β is precisely known, generally
from collateral information.
– Assuming that $\mu_\alpha = 0$, it is easy to check that the best linear
unbiased estimator (BLUE) of $E(\alpha \mid y)$ is
$$a_{BLUP} = D Z' V^{-1}(y - X b_{GLS})$$
– Recall from equation (4.11) that $a_{BLUP}$ is also the best
linear unbiased predictor in the frequentist (non-Bayesian)
model framework.
Relation with BLUPs
• Consider second the case where $\Sigma_\beta^{-1} = 0$.
– In this case, prior information about the parameter β is
vague; this is known as using a diffuse prior.
– Assuming $\mu_\alpha = 0$, one can show that
$$E(\alpha \mid y) = a_{BLUP}$$
• It is interesting that in both extreme cases, we arrive at the
statistic $a_{BLUP}$ as a predictor of α.
– This analysis assumes $D$ and $R$ are matrices of fixed
parameters.
– It is also possible to assume distributions for these
parameters; typically, independent Wishart distributions are
used for $D^{-1}$ and $R^{-1}$ as these are conjugate priors.
– The general strategy of substituting point estimates for
certain parameters in a posterior distribution is called
empirical Bayes estimation.
Example – One-way random effects
ANOVA model
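• The posterior means turn out to be
$$\hat{\beta} = E(\beta \mid y) = (1 - \zeta_\beta)\,\mu_\beta + \zeta_\beta\, \bar{y}$$
$$\hat{\alpha}_i = E(\alpha_i \mid y) = \zeta(\bar{y}_i - \mu_\beta) - \zeta_\beta\, \zeta\, (\bar{y} - \mu_\beta)$$
• where
$$\zeta = \frac{T\sigma_\alpha^2}{\sigma_\varepsilon^2 + T\sigma_\alpha^2}, \qquad \zeta_\beta = \frac{nT\sigma_\beta^2}{\sigma_\varepsilon^2 + T\sigma_\alpha^2 + nT\sigma_\beta^2}$$
• Note that $\zeta_\beta$ measures the precision of knowledge about β.
Specifically, we see that $\zeta_\beta$ approaches one as $\sigma_\beta^2 \to \infty$, and
approaches zero as $\sigma_\beta^2 \to 0$.
• Adding the two posterior means gives
$$\hat{\alpha}_i + \hat{\beta} = (1 - \zeta_\beta)\left((1 - \zeta)\mu_\beta + \zeta \bar{y}_i\right) + \zeta_\beta\left((1 - \zeta)\bar{y} + \zeta \bar{y}_i\right)$$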
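The two limiting cases of prior knowledge about β can be illustrated numerically. The sketch below, in Python, assumes a balanced design with a normal prior on β (mean `mu_beta`, variance `s2_beta`) and mean-zero random effects; the function name, the prior mean of 10 and the variance components are hypothetical, and the data are the three-machine run times used earlier in the chapter.

```python
def posterior_means(groups, mu_beta, s2_alpha, s2_eps, s2_beta):
    """Posterior means of beta and each alpha_i in the balanced one-way model."""
    sizes = {len(obs) for obs in groups.values()}
    if len(sizes) != 1:
        raise ValueError("the formulas assume a balanced design")
    T = sizes.pop()
    n = len(groups)
    means = {i: sum(obs) / T for i, obs in groups.items()}
    grand = sum(means.values()) / n
    zeta = T * s2_alpha / (s2_eps + T * s2_alpha)
    zeta_beta = n * T * s2_beta / (s2_eps + T * s2_alpha + n * T * s2_beta)
    beta_hat = (1 - zeta_beta) * mu_beta + zeta_beta * grand
    alpha_hat = {
        i: zeta * (m - mu_beta) - zeta_beta * zeta * (grand - mu_beta)
        for i, m in means.items()
    }
    return beta_hat, alpha_hat


machines = {1: [14, 12, 10, 12], 2: [9, 16, 15, 12], 3: [8, 10, 7, 7]}

# Diffuse prior: a very large prior variance stands in for the limiting case.
beta_d, alpha_d = posterior_means(machines, 10.0, 8.25, 7.0, 1e12)
# Sharp prior: beta is known to equal mu_beta.
beta_s, alpha_s = posterior_means(machines, 10.0, 8.25, 7.0, 0.0)

print(round(beta_d + alpha_d[3], 3), round(beta_s + alpha_s[3], 3))
```

With the diffuse prior the predictor for machine 3 shrinks its average towards the overall sample mean, while with the sharp prior it shrinks towards the prior mean instead.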
4.6 Wisconsin Lottery Sales
• T = 40 weeks of sales from n = 50 ZIP codes
Table 4.1. Lottery, Economic and Demographic
Characteristics of 50 Wisconsin ZIP Codes
Lottery Characteristics
ZOLSALES    Online lottery sales to individual consumers
NRETAIL     Number of listed retailers
Economic and Demographic Characteristics
PERPERHH    Persons per household
MEDSCHYR    Median years of schooling
MEDHVL      Median home value in $1000s for owner-occupied homes
PRCRENT     Percent of housing that is renter occupied
PRC55P      Percent of population that is 55 or older
HHMEDAGE    Household median age
MEDINC      Estimated median household income, in $1000s
POPULAT     Population
Lottery Sales Data Analysis
• Cross-sectional analysis shows that population size heavily
influences sales, with Kenosha as an outlier
• Multiple time series plots
– show the effect of jackpots that is common to all postal
codes
– show the heterogeneity among postal codes (reaffirmed
by a pooling test)
– show the heteroscedasticity that is accommodated
through a logarithmic transformation
Lottery Sales Model Selection
• In-sample results show that
– The one-way error components model dominates pooled cross-sectional models.
– An AR(1) error specification significantly improves the
fit.
– The best model is probably the two-way error
components model with an AR(1) error specification (not
yet documented).
• Out-of-sample analysis suggests that
– logarithmic sales is the preferred choice of response; it
outperforms sales and percentage change.
4.7 What is Credibility?
• Hickman’s (1975) Analogy
– In politics, leaders begin with a reservoir of credibility
which decreases as executive experience is compiled.
– Insurance behaves in a reverse fashion!
– Here, credibility increases as experience increases.
Credibility Theory
• Credibility is a technique for predicting future expected
claims for a risk class, given past claims of that and related
risk classes.
• Importance
– Credibility is widely used for pricing property and
casualty, workers’ compensation and health care
coverages.
– According to Rodermund (1989), “the concept of
credibility has been the casualty actuaries’ most
important and enduring contribution to casualty actuarial
science.”
History
• Mowbray (1914 - PCAS)
– Asked the question, “how extensive is an
exposure necessary to give a dependable pure
premium?”
– This approach is now known as the “limited
fluctuation” or “American” credibility
• Question 1 – do we have enough exposure to give
full weight to the risk class under consideration?
• Question 2 – if not, how can we combine information
from this and related risk classes?
More History
• Whitney (1918 - PCAS)
– introduced the idea of using a weighted average
of average claims of (1) a given risk class and
(2) all risk classes.
– The weight is known as the credibility factor.
– It is of the form
New Premium =
Z × Claims Experience + (1 – Z) × Old Premium.
Example - Balanced Bühlmann
• Consider the model
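$$y_{it} = \beta + \alpha_i + \varepsilon_{it}.$$
• The credibility factor is
$$\zeta = \frac{T}{T + \sigma_\varepsilon^2 / \sigma_\alpha^2}$$
• The traditional credibility estimator is
$$w_{BLUP} = (1 - \zeta)\,\bar{y} + \zeta\, \bar{y}_i.$$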
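To see how the credibility factor behaves as experience accumulates, here is a short Python sketch of the balanced Bühlmann formulas. The function names and the variance components are hypothetical; only the ratio of the two variances enters the factor.

```python
def credibility_factor(T, s2_eps, s2_alpha):
    """Balanced Buhlmann credibility factor: T / (T + s2_eps / s2_alpha)."""
    return T / (T + s2_eps / s2_alpha)


def credibility_premium(own_mean, overall_mean, zeta):
    """Weighted average of a risk class's own experience and the overall mean."""
    return zeta * own_mean + (1 - zeta) * overall_mean


# Hypothetical variance components; only their ratio matters.
S2_EPS, S2_ALPHA = 7.0, 8.25
factors = {T: credibility_factor(T, S2_EPS, S2_ALPHA) for T in (1, 4, 16, 64)}
for T, zeta in factors.items():
    print(T, round(zeta, 3))
```

The factor rises towards one as the number of periods T grows, which matches the idea that credibility increases as experience increases. It has the same structure as Whitney's weighted average, with the risk class's own average in the role of claims experience.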
Example
Hypothetical Claims for Three Towns
Town   Claims            Average Claim
1      14, 12, 10, 12    $\bar{y}_1 = 12$
2      9, 16, 15, 12     $\bar{y}_2 = 13$
3      8, 10, 7, 7       $\bar{y}_3 = 8$
• Are there real differences among towns?
• Mowbray - does Town 3 have enough data to
support its own estimator of pure premiums?
• Whitney - how can I use the information in Towns 1
and 2 to help determine my rate for Town 3?
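Response to Whitney
• Known as the “shrinkage” effect: each town average is pulled
towards the overall mean $\bar{y} = 11$.
Town   Average $\bar{y}_i$   Credibility estimator
3      8                     8.525
1      12                    11.825
2      13                    12.650
• Comparison of Subject-Specific Means to
Credibility Estimators.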
Why study credibility theory?
• Long history of applications – “a business necessity”
– More recently, many theoretical advances with fewer
innovative applications
• Credibility techniques required in legal statutes and
standards of practice
– Standard of Practice 25 by the Actuarial Standards Board
of the American Academy of Actuaries
– Wisconsin statutes on credibility insurance and disability
income
• Advanced techniques are critical for keeping up with
competition (health insurance – health economists)
• Innovative techniques enhance the “credibility” of the
profession