Dimension Reduction in Workers Compensation — CAS Predictive Modeling Seminar — Louise Francis, FCAS, MAAA
Dimension Reduction in Workers Compensation
CAS Predictive Modeling Seminar
Louise Francis, FCAA, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
[email protected]
www.data-mines.com
Objectives
Answer the questions: What is dimension reduction, and why use it?
Introduce key methods of dimension reduction
Illustrate with examples in Workers Compensation
There will be some formulas, but the emphasis is on insight into the basic mechanisms of the procedures
Introduction
“How do mere observations become data for analysis?”
“Specific variable values are never immutable characteristics of the data”
– Jacoby, Data Theory and Dimensional Analysis, Sage Publications
Many of the dimension reduction/measurement techniques originated in the social sciences and dealt with how to create scales from responses on attitudinal and opinion surveys
Unsupervised learning
Dimension reduction methods are generally unsupervised learning
Supervised learning: there is a dependent or target variable
Unsupervised learning: no target variable; group like variables or like records together
The Data
BLS economic indexes: components of inflation, employment data, health insurance inflation
Texas Department of Insurance closed claim data for 2002 and 2003: employment-related injuries, excludes small claims, about 1,800 records
What is a dimension?
Jacoby – any information that adds significant variability
In many studies each variable is a dimension
However, we can also view each record in a database as a dimension
Dimensions

Year   Medical Care   Medical Services   Transportation   Electricity
1980    74.90          74.80              83.10            26.70
1981    82.90          82.80              93.20            31.55
1982    92.50          92.60              97.00            36.01
1983   100.60         100.70              99.30            37.18
1984   106.80         106.70             103.70            38.60
1985   113.50         113.20             106.40            38.98
1986   122.00         121.90             102.30            40.22
1987   130.10         130.00             105.40            40.02
1988   138.60         138.30             108.70            40.20
1989   149.30         148.90             114.10            40.83
1990   162.80         162.70             120.50            41.66
The Two Major Categories of Dimension Reduction
Variable reduction: Factor Analysis, Principal Components Analysis
Record reduction: Clustering
Other methods tend to be developments on these
Principal Components Analysis
A form of dimension (variable) reduction
Suppose we want to combine all the information related to the “inflation” dimension of insurance costs: medical care costs, employment (wage) costs, and other costs such as energy, transportation, and services
Principal Components
These variables are correlated, but not perfectly correlated
We replace many variables with a weighted sum of the variables
These weighted sums are then used as independent variables in a predictive model
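A minimal sketch of that weighted-sum idea in Python (numpy only). The five rows reuse the first five years of the CPI table shown earlier; everything else is illustrative, not the seminar's actual computation:

```python
import numpy as np

# First five years of the CPI table shown earlier:
# columns = Medical Care, Medical Services, Transportation, Electricity
X = np.array([
    [ 74.90,  74.80,  83.10, 26.70],
    [ 82.90,  82.80,  93.20, 31.55],
    [ 92.50,  92.60,  97.00, 36.01],
    [100.60, 100.70,  99.30, 37.18],
    [106.80, 106.70, 103.70, 38.60],
])

# Standardize, then decompose the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns eigenvalues in ascending order

# The first principal component is a weighted sum of the variables;
# it can stand in for all four as a single "inflation" predictor.
weights = eigvecs[:, -1]               # weights for the largest eigenvalue
pc1 = Z @ weights                      # one inflation score per year
```

The single column `pc1` would then enter a predictive model in place of the four original indexes.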
Factor Analysis: A Latent Factor
[Diagram: a latent “Social Inflation” factor with observed indicators: litigation rates, number of procedures, and an index of tort climate]
Factor/Principal Components Analysis
Linear methods – use the linear correlation matrix
The correlation matrix is decomposed to find a smaller number of factors that are related to the same underlying drivers
Highly correlated variables tend to load highly on the same factor
Factor/Principal Components Analysis

Correlations of the inflation variables (lower triangle):

                  Medical  Medical   Trans-    Elec-
                  Care     Services  portation tricity  Utility  Fuel Oil  Gas    Bread
Medical Care      1.000
Medical Services  1.000    1.000
Transportation    0.993    0.992     1.000
Electricity       0.888    0.884     0.910     1.000
Utility           0.872    0.873     0.875     0.771    1.000
Fuel Oil          0.448    0.451     0.468     0.281    0.704    1.000
Gas               0.586    0.592     0.601     0.402    0.752    0.926     1.000
Bread             0.983    0.983     0.975     0.844    0.847    0.459     0.595  1.000
Factor/Principal Components Analysis
Uses eigenvectors and eigenvalues
R is the correlation matrix, V the matrix of eigenvectors, Λ (lambda) the diagonal matrix of eigenvalues:
R V = V Λ
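The decomposition can be checked directly with numpy. The 3×3 matrix below reuses three of the inflation correlations (Medical Care, Transportation, Electricity); a sketch, not the seminar's actual computation:

```python
import numpy as np

# Sub-block of the inflation correlation matrix:
# Medical Care, Transportation, Electricity
R = np.array([
    [1.000, 0.993, 0.888],
    [0.993, 1.000, 0.910],
    [0.888, 0.910, 1.000],
])

lam, V = np.linalg.eigh(R)   # eigenvalues (ascending) and eigenvector matrix V

# Defining relation: R V = V * diag(lambda)
ok = np.allclose(R @ V, V @ np.diag(lam))
```

The eigenvalues also sum to the trace of R (here 3), which is what makes "proportion explained = eigenvalue / n" work later.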
Inflation Data

Component Matrix(a)

                  Component 1   Component 2
Medical Care        .986          -.086
Medical Services    .986          -.081
Transportation      .990          -.073
Electricity         .895          -.205
Utility             .877           .303
Fuel Oil            .551           .761
Gas                 .709           .639
Bread               .973          -.078
Eggs                .587           .337
Apples              .766           .077
Coffee              .457          -.644
Employment          .967          -.202
UEP                -.695           .521
EmpCost             .986          -.048

Extraction Method: Principal Component Analysis.
a. 2 components extracted.
Factor Rotation
Find simpler, more easily interpretable factors
Use the notion of factor complexity:

q_i = (1/r) Σ_{j=1..r} (b_ij² − b̄_i²)²

where r is the number of factors, b_ij is the loading of variable i on factor j, and b̄_i² is the mean squared loading on the factors for row i
Factor Rotation
Quartimax rotation: maximizes q
Varimax rotation: maximizes the variance of squared loadings for each factor rather than for each variable
Varimax Rotation

Rotated Component Matrix(a)

                  Component 1   Component 2
Medical Care        .834           .533
Medical Services    .831           .537
Transportation      .829           .546
Electricity         .835           .383
Utility             .510           .775
Fuel Oil           -.028           .939
Gas                 .172           .939
Bread               .818           .532
Eggs                .260           .625
Apples              .560           .529
Coffee              .755          -.232
Employment          .890           .429
UEP                -.869          -.011
EmpCost             .811           .563

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 3 iterations.
Plot of Loadings on Factors
How Many Factors to Keep?
Eigenvalues provide information on how much variance is explained
Proportion explained by a given component = corresponding eigenvalue / n, where n is the number of variables
Use a scree plot
Rule of thumb: keep all factors with eigenvalues > 1
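A small illustration of the eigenvalue > 1 rule and the proportion-explained calculation. The eigenvalues below are made up for illustration; they sum to n = 7, as the eigenvalues of a 7-variable correlation matrix must:

```python
import numpy as np

# Hypothetical eigenvalues from a PCA of 7 standardized variables
# (invented for illustration; they sum to n = 7, the trace of R)
eigenvalues = np.array([3.5, 1.5, 0.9, 0.5, 0.3, 0.2, 0.1])
n = eigenvalues.size

# Proportion of variance explained by each component = eigenvalue / n
prop_explained = eigenvalues / n

# Rule of thumb: keep components whose eigenvalue exceeds 1
keep = eigenvalues > 1
```

Here the first component alone explains half the variance, and the rule of thumb retains two components.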
WC Severity vs Factor 1
WC Severity vs Factor 2
What About Categorical Data?
Factor analysis is performed on numeric data
You could code data as binary dummy variables
Categorical variables from the Texas data: injury, cause of loss, business class, health insurance (Y/N)
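A quick sketch of the dummy-coding option with pandas; the claim records and category levels are invented for illustration:

```python
import pandas as pd

# Hypothetical claim records with categorical fields like those
# in the Texas data (the levels shown are made up)
claims = pd.DataFrame({
    "injury": ["back", "multiple", "back"],
    "cause": ["lifting", "fall", "fall"],
})

# One binary (0/1) dummy column per category level,
# suitable as numeric input to factor analysis
dummies = pd.get_dummies(claims, columns=["injury", "cause"])
```

With two levels per field this yields four dummy columns (injury_back, injury_multiple, cause_fall, cause_lifting); real injury and class codes would produce many more, which is part of why optimal scaling is attractive.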
Optimal Scaling
A method of dealing with categorical variables
Uses regression to assign numbers to categories and to fit regression coefficients: Y* = f(X*)
In each round of fitting, a new Y* and X* is created
Variable Correlations

Correlations, Original Variables
                 injury    cause    Business class
injury            1.000    -.019     .049
cause             -.019    1.000     .105
Business class     .049     .105    1.000

Dimension   Eigenvalue
1             1.109
2             1.014
3              .877

Correlations, Transformed Variables (Dimension: 1)
                 injury    cause    Business class
injury            1.000     .710     .433
cause              .710    1.000     .552
Business class     .433     .552    1.000

Dimension   Eigenvalue
1             2.138
2              .590
3              .272
Visualizations of Scaled Variables

Can we use scaled variables in prediction?

Mean object scores by quintile (NTILES) of Paidloss:

NTILE of Paidloss   Object scores, dimension 1   Object scores, dimension 2
1                     -.0271                       -.1361
2                      .0246                       -.0626
3                      .0156                       -.0721
4                      .0045                        .0562
5                     -.0158                        .2172
Total                  .0000                        .0000
Row Reduction: Cluster Analysis
Records are grouped into categories that have similar values on the variables
Examples:
Marketing: people with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
Text analysis: use words that tend to occur together to classify documents
Fraud modeling
Note: no dependent variable is used in the analysis
Clustering
Common methods: k-means, hierarchical
No dependent variable – records are grouped into classes with similar values on the variables
Start with a measure of similarity or dissimilarity
Maximize dissimilarity between members of different clusters
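A bare-bones k-means sketch (Lloyd's algorithm) in numpy, showing that only a dissimilarity measure is needed, never a target variable. The four records are invented to form two obvious groups:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest center, then move centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each record to its nearest center (squared Euclidean distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its members
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated groups of records (made-up data):
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
```

The first two records end up in one cluster and the last two in the other, regardless of which records are drawn as initial centers.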
Dissimilarity (Distance) Measure – Continuous Variables
(i, j = records; k = variable; m = number of variables)

Euclidean distance:
d_ij = [ Σ_{k=1..m} (x_ik − x_jk)² ]^(1/2)

Manhattan distance:
d_ij = Σ_{k=1..m} |x_ik − x_jk|
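Both distances are one-liners in numpy; the two records below are made up to give round answers:

```python
import numpy as np

def euclidean(x_i, x_j):
    # d_ij = ( sum_k (x_ik - x_jk)^2 )^(1/2)
    return np.sqrt(((x_i - x_j) ** 2).sum())

def manhattan(x_i, x_j):
    # d_ij = sum_k |x_ik - x_jk|
    return np.abs(x_i - x_j).sum()

# Two records measured on m = 2 variables (invented values):
x_i = np.array([1.0, 2.0])
x_j = np.array([4.0, 6.0])
```

For these records the Euclidean distance is 5.0 and the Manhattan distance is 7.0.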
Binary Variables

                        Column Variable
                          1       0
Row Variable    1         a       b       a+b
                0         c       d       c+d
                         a+c     b+d
Binary Variables
Sample matching (proportion of mismatches):
(b + c) / (a + b + c + d)

Rogers and Tanimoto:
2(b + c) / [(a + d) + 2(b + c)]
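The 2×2 cell counts and both coefficients can be computed directly; a sketch in numpy with two invented binary records:

```python
import numpy as np

def binary_counts(u, v):
    """Cells a, b, c, d of the 2x2 table for two binary records."""
    u, v = np.asarray(u), np.asarray(v)
    a = ((u == 1) & (v == 1)).sum()   # both 1
    b = ((u == 1) & (v == 0)).sum()   # mismatch
    c = ((u == 0) & (v == 1)).sum()   # mismatch
    d = ((u == 0) & (v == 0)).sum()   # both 0
    return a, b, c, d

def simple_matching(u, v):
    a, b, c, d = binary_counts(u, v)
    return (b + c) / (a + b + c + d)          # proportion of mismatches

def rogers_tanimoto(u, v):
    a, b, c, d = binary_counts(u, v)
    return 2 * (b + c) / ((a + d) + 2 * (b + c))

# Two invented binary records:
u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
```

For u and v above each cell count is 1, so sample matching gives 0.5 and Rogers and Tanimoto gives 2/3; Rogers and Tanimoto double-weights the mismatches.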
Example: Texas Data
Data from the 2002 and 2003 closed claim database of the Texas Insurance Department
Only claims over a threshold are included
Variables used for clustering: report lag, settlement lag, county (ranked by how often it appears in the data), injury, cause of loss, business class
Results Using Only Numeric Variables
Used squared distance measure

Final Cluster Centers
                                              Cluster 1   Cluster 2   Cluster 3
RANK of NCounty                                 10.741       5.158      14.500
RANK of SumLoss                                 25.155       7.342      53.000
RANK of numSuit                                 11.204       4.553      14.000
age                                              40.67       42.26       63.00
Elapsed time between date of injury and
  date reported to insurer                         233         391         172
Elapsed time between date of injury and
  date suit filed                                 8264        7439           0
Elapsed time between date of injury and
  date of trial                                  13893       13843       14627
BackInj                                            .39         .00         .00
MultInj                                            .35         .00         .00
Two Stage Clustering With
Categorical Variables
First compute distances
Then get clusters
Find optimum number of clusters
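One way to sketch this two-stage idea with scipy: compute the pairwise distances first, then cluster on them. This is an illustrative stand-in (with invented records), not the SPSS TwoStep procedure used for the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical records: two scaled numeric fields plus one binary dummy
X = np.array([
    [0.1, 0.2, 1.0],
    [0.0, 0.1, 1.0],
    [0.9, 1.0, 0.0],
    [1.0, 0.8, 0.0],
])

# Stage 1: pairwise distances (condensed vector of the 6 pairs);
# Manhattan ("cityblock") handles the mixed numeric/dummy columns sensibly
d = pdist(X, metric="cityblock")

# Stage 2: hierarchical clustering on those distances, cut the tree at k = 2
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Varying t and comparing solutions is one simple way to search for the optimum number of clusters.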
Loadings of Injuries on Cluster
Age and Cluster
County vs Cluster
Means of Financial Variables by Cluster

TwoStep Cluster Number   Paidloss   Amount for allocated expense     Amount for
                                    for in-house defense counsel     other ALAE
1                        257.112        964                            9.764
2                         78.187          0                            2.444
3                        263.851      2.533                           11.805
4                        174.739      1.540                            5.656
Total                    219.855      1.612                            8.421
Modern dimension reduction
The hidden layer in a neural network acts like a nonlinear principal components analysis
Projection pursuit regression – a nonlinear PCA
Kohonen self-organizing maps – a kind of neural network that does clustering
These can be understood as enhancements of factor analysis or clustering
Kohonen SOM for Fraud
[Figure: Kohonen self-organizing map output grid (cells S1–S16 by units 1–16), shaded by value bands 0–1, 1–2, 2–3, 3–4, 4–5]