Principal Components Analysis

Eric Vaagen, FCAS
Assistant Actuary
September 5, 2008
Agenda
• Motivation
• What is PCA?
• Background
• Simple example
• Is PCA right for you?
Motivation
• Forecast average premium by coverage
• Explanatory variables
– Vehicle use, territory, driving record
– Breakdown of change in average premium
• Multicollinearity exists
Average Premium 2002-2006
[Chart: Avg. Premium at 2007 Rate Level (y-axis, 620 to 628) by Year (x-axis, 2002 to 2006)]
Modeling Procedure
• Explanatory Variables → Variable Selection → Chosen Variables → Model
• Response Variable → Model
Modeling Procedure
• Vehicle Use, Territory, Drv. Record → Variable Selection → Chosen Variables → Multiple Regression
• Average Premium → Multiple Regression
Variable Selection Methods
• Stepwise regression
– Forward, backward
• PCA
– Unsupervised
• Partial least squares
– Supervised
• GLM
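As an illustration of the first method in the list, here is a minimal sketch of forward stepwise selection. The slides do not give a selection criterion or stopping rule, so this sketch assumes the variable that most reduces the residual sum of squares is added at each step, up to a fixed count. The function names `rss` and `forward_select` and the data are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, max_vars):
    """Greedy forward selection (assumed rule: add the column that lowers RSS the most)."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while remaining and len(chosen) < max_vars:
        best = min(remaining, key=lambda c: rss(X, y, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical data (made-up numbers): y depends strongly on column 2, weakly on column 0.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))
y = 3.0 * X[:, 2] + 0.5 * X[:, 0] + 0.05 * rng.normal(size=40)

chosen = forward_select(X, y, max_vars=2)
print(chosen)
```

Backward selection would run the same loop in reverse, starting from all variables and dropping the least useful one at each step.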
Background
• First described in 1901 by Karl Pearson
– Find the best lines and planes to fit a set of points
• What else did he discover?
– Pearson’s χ²
– Linear regression
– Classification of distributions (exponential family)
PCA Example
Explanatory Variables
• Vehicle use
– Pleasure
– Commute
– Business
• Territory
– Rural
– Suburban
– Urban
Vehicle Use 2002-2006
[Chart: % of Policies (y-axis, 0% to 100%) by Year (x-axis, 2002 to 2006); series: Pleasure, Commute, Business]
Territory 2002-2006
[Chart: % of Policies (y-axis, 0% to 100%) by Year (x-axis, 2002 to 2006); series: Rural, Suburban, Urban]
Example – Average Premium
[Chart: response variable, Avg. Premium at 2007 Rate Level, by Year (2002 to 2006)]
Modeling Procedure
• Vehicle Use, Territory → PCA → Chosen PCs → Multiple Regression
• Average Premium → Multiple Regression
PCA Procedure
• PCs
– No multicollinearity
– The 1st PC has the most variance
• Output
– Weights to create the PCs
– Variability of each PC
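The two properties and two outputs listed above can be checked numerically. The following is a minimal sketch, assuming PCA is computed by eigen-decomposition of the covariance matrix of the centered data (the slides do not say which implementation was used). The data are made-up numbers, not the presentation's.

```python
import numpy as np

# Hypothetical data (made-up numbers): 8 observations of 3 variables,
# where the first two are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(8, 1)),
    2.0 * base + 0.1 * rng.normal(size=(8, 1)),
    rng.normal(size=(8, 1)),
])

# Center each column, then eigen-decompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so that the 1st PC has the most variance.
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]    # output: variability of each PC
weights = eigvecs[:, order]   # output: column j holds the weights that create PC j+1

# Each PC is a weighted sum of the centered original variables.
scores = Xc @ weights

# No multicollinearity: the covariance matrix of the PCs is diagonal.
pc_cov = np.cov(scores, rowvar=False)
print(np.round(pc_cov, 6))
print(np.round(variances, 6))
```

The off-diagonal entries of `pc_cov` are zero up to round-off, and its diagonal matches `variances` in decreasing order.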
Modeling Procedure
• Vehicle Use, Territory (5 years x 6 variables) → Weights → PCA (5 years x 6 variables) → Variability → Chosen PCs
Example – Scree Plot
[Chart: Proportion of Total Variance (y-axis, 0% to 50%) by PC (x-axis, 1 to 6); used to choose the PCs]
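The quantity plotted in a scree plot is each PC's share of the total variance. A minimal sketch of that calculation follows, assuming the shares are taken from the eigenvalues of the covariance matrix. The function name `variance_proportions` and the data are hypothetical, and the output is a text bar chart rather than the slide's figure.

```python
import numpy as np

def variance_proportions(X):
    """Proportion of total variance captured by each PC, largest first."""
    cov = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # reverse ascending order to descending
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative round-off
    return eigvals / eigvals.sum()

# Hypothetical data (made-up numbers): 20 observations of 4 variables with different spreads.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

props = variance_proportions(X)
for k, p in enumerate(props, start=1):
    # One text bar per PC, longest bar first.
    print(f"PC {k}: {p:6.1%} " + "#" * int(round(p * 40)))
```

A common reading of such a plot is to keep the leading PCs up to the point where the bars flatten out, though the slides do not state which cutoff was applied.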
PC Calculation
            PC #1   PC #2   PC #3
Pleasure    -0.19   -0.54   -0.55
Commute      0.54    0.14    0.36
Business    -0.40    0.48    0.23
Rural        0.56   -0.20   -0.02
Suburban    -0.45   -0.31    0.47
Urban       -0.03    0.58   -0.55
PC Calculation
• Notation: P = Pleasure, C = Commute, B = Business, R = Rural, S = Suburban, U = Urban
• PC1 = -0.19P + 0.54C - 0.40B + 0.56R - 0.45S - 0.03U
• PC1 (2002) = -0.19(30%) + 0.54(50%) - 0.40(20%) + 0.56(20%) - 0.45(30%) - 0.03(50%)
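The 2002 calculation can be carried through as a short worked example. The weights and the 2002 shares are the ones shown on the slide; the variable names are only for illustration, and the shares are used as given, with no centering, as the slide's formula does.

```python
# PC #1 weights from the slide.
weights = {"Pleasure": -0.19, "Commute": 0.54, "Business": -0.40,
           "Rural": 0.56, "Suburban": -0.45, "Urban": -0.03}

# 2002 shares of policies from the slide, in percent.
shares_2002 = {"Pleasure": 30, "Commute": 50, "Business": 20,
               "Rural": 20, "Suburban": 30, "Urban": 50}

# PC1 is the weighted sum of the six shares.
pc1_2002 = sum(weights[k] * shares_2002[k] for k in weights)
print(round(pc1_2002, 2))  # 9.5, in percentage points
```

Term by term this is -5.7 + 27.0 - 8.0 + 11.2 - 13.5 - 1.5 = 9.5. Repeating the sum for each year gives the PC1 series used in the regression.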
Example - Modeling Procedure
• Vehicle Use, Territory → PCA → Chosen PCs → Multiple Regression
• Average Premium → Multiple Regression
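The last step of this procedure, a multiple regression on the chosen PCs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's model: PC weights are taken from an SVD of the centered data, the regression is ordinary least squares with an intercept, and the function names `pcr_fit` and `pcr_predict` and the data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    # Rows of Vt are the PC weight vectors, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T
    scores = Xc @ W
    # Ordinary least squares on an intercept plus the chosen PC scores.
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_mean, W, coef

def pcr_predict(model, X_new):
    """Apply the stored centering, PC weights, and regression coefficients."""
    x_mean, W, coef = model
    return coef[0] + ((X_new - x_mean) @ W) @ coef[1:]

# Hypothetical data (made-up numbers): 30 periods, 3 explanatory variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = 600.0 + X @ np.array([2.0, -1.0, 0.5])

model_full = pcr_fit(X, y, n_components=3)  # all PCs: same fit as a plain regression
model_two = pcr_fit(X, y, n_components=2)   # keep only the first two PCs
fitted_full = pcr_predict(model_full, X)
fitted_two = pcr_predict(model_two, X)
print(np.max(np.abs(fitted_full - y)))
```

Keeping every PC reproduces the ordinary regression fit. Dropping the low-variance PCs is what reduces dimensionality, at the cost of some fit.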
Example – Results
[Chart: Avg. Premium at 2007 Rate Level by Year (2002 to 2007); series: Actual, Stepwise F, Stepwise B, PCA]
ICBC Personal TPB
[Chart: quarterly points from Jan-03 to Oct-07 (x-axis); y-axis 600 to 660; series: Actual, PCA]
Advantages
• Eliminates multicollinearity
• Most of the original variance is captured in a few principal components
• More refined selection method
Disadvantages
• Can be hard to interpret the PCs
• PC weights may not be stable from year to year
• Difficult to explain
Is PCA Right For You?
• Concerned about multicollinearity?
• Confident in the set of explanatory variables?
• Want to reduce dimensionality, without throwing away variables?
For More Information
• 2008 Discussion Paper
– PCA and Partial Least Squares: Two Dimension Reduction Techniques for Regression
– http://www.casact.org/pubs/dpp/dpp08/08dpp76.pdf
• Predictive modeling seminar
– Oct 6-7, 2008 in San Diego, CA
– PCA and Partial Least Squares