Dimension Reduction in Workers Compensation, CAS Predictive Modeling Seminar, Louise Francis, FCAS, MAAA


Dimension Reduction in
Workers Compensation
CAS Predictive Modeling Seminar
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
[email protected]
www.data-mines.com
Objectives

- Answer the questions: what is dimension reduction and why use it?
- Introduce key methods of dimension reduction
- Illustrate with examples in Workers Compensation
- There will be some formulas, but the emphasis is on insight into the basic mechanisms of the procedures

Introduction

- "How do mere observations become data for analysis?"
- "Specific variable values are never immutable characteristics of the data"
  - Jacoby, Data Theory and Dimensional Analysis, Sage Publications
- Many of the dimension reduction/measurement techniques originated in the social sciences and dealt with how to create scales from responses on attitudinal and opinion surveys
Unsupervised learning

- Dimension reduction methods are generally unsupervised learning
- Supervised learning: there is a dependent or target variable
- Unsupervised learning: no target variable; group like variables or like records together
The Data

- BLS economic indexes
  - Components of inflation
  - Employment data
  - Health insurance inflation
- Texas Department of Insurance closed claim data for 2002 and 2003
  - Employment-related injury
  - Excludes small claims
  - About 1,800 records
What is a dimension?

- Jacoby: any information that adds significant variability
- In many studies each variable is a dimension
- However, we can also view each record in a database as a dimension
Dimensions

Year   Medical Care   Medical Services   Transportation   Electricity
1980   $  74.90       $  74.80           $  83.10         $ 26.70
1981      82.90          82.80              93.20           31.55
1982      92.50          92.60              97.00           36.01
1983     100.60         100.70              99.30           37.18
1984     106.80         106.70             103.70           38.60
1985     113.50         113.20             106.40           38.98
1986     122.00         121.90             102.30           40.22
1987     130.10         130.00             105.40           40.02
1988     138.60         138.30             108.70           40.20
1989     149.30         148.90             114.10           40.83
1990     162.80         162.70             120.50           41.66
The Two Major Categories of Dimension Reduction

- Variable reduction
  - Factor Analysis
  - Principal Components Analysis
- Record reduction
  - Clustering
- Other methods tend to be developments on these
Principal Components Analysis

- A form of dimension (variable) reduction
- Suppose we want to combine all the information related to the "inflation" dimension of insurance costs:
  - Medical care costs
  - Employment (wage) costs
  - Other: energy, transportation, services
Principal Components

- These variables are correlated, but not perfectly correlated
- We replace the many variables with a weighted sum of the variables
- These weighted sums are then used as independent variables in a predictive model (see the sketch below)
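As a sketch of this idea, the snippet below builds a few artificial, correlated index series (the data and variable roles are invented for illustration, not the seminar's data) and uses scikit-learn's PCA to replace them with weighted sums; the first component can then serve as a single "inflation" predictor.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four correlated index series standing in for medical care, medical
# services, transportation, and electricity (invented for illustration).
rng = np.random.default_rng(0)
trend = np.linspace(75.0, 160.0, 24)
X = np.column_stack([trend + rng.normal(0.0, 3.0, 24) for _ in range(4)])

# Standardize so the analysis works from the correlation matrix.
Z = StandardScaler().fit_transform(X)

# Each principal component is a weighted sum of the original variables.
pca = PCA(n_components=2).fit(Z)
inflation_score = pca.transform(Z)[:, 0]   # candidate predictor variable
print(pca.explained_variance_ratio_)       # share of variance per component
```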
Factor Analysis: A Latent Factor

[Diagram: a latent "Social Inflation" factor driving observed indicators such as litigation rates, the number of procedures, and an index of tort climate]
Factor/Principal Components Analysis

- Linear methods – use the linear correlation matrix
- The correlation matrix is decomposed to find a smaller number of factors that are related to the same underlying drivers
- Highly correlated variables tend to load highly on the same factor
Factor/Principal Components Analysis

Correlation matrix of the indexes (lower triangle):

                 Medical  Medical
                 Care     Services  Transp.  Electr.  Utility  Fuel Oil  Gas    Bread
Medical Care     1.000
MedicalServices  1.000    1.000
Transportation    .993     .992     1.000
Electricity       .888     .884      .910    1.000
Utility           .872     .873      .875     .771    1.000
Fuel Oil          .448     .451      .468     .281     .704    1.000
Gas               .586     .592      .601     .402     .752     .926    1.000
Bread             .983     .983      .975     .844     .847     .459     .595  1.000
Factor/Principal Components Analysis

- Uses eigenvectors and eigenvalues
- With R the correlation matrix, V the matrix of eigenvectors, and $\Lambda$ the diagonal matrix of eigenvalues (see the sketch below):

$$R V = V \Lambda$$
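A minimal numeric check of this relation, using an illustrative 3 × 3 correlation matrix (the values loosely echo the inflation correlations above but are not the seminar's data):

```python
import numpy as np

# Illustrative 3x3 correlation matrix R.
R = np.array([[1.000, 0.993, 0.888],
              [0.993, 1.000, 0.910],
              [0.888, 0.910, 1.000]])

# eigh suits symmetric matrices; the columns of V are the eigenvectors.
lam, V = np.linalg.eigh(R)
lam, V = lam[::-1], V[:, ::-1]             # largest eigenvalue first

# Verify the defining relation R V = V Lambda.
assert np.allclose(R @ V, V @ np.diag(lam))
print(lam)                                  # eigenvalues sum to 3 here
```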
Inflation Data

Component Matrix (a)

                  Component 1   Component 2
Medical Care          .986         -.086
MedicalServices       .986         -.081
Transportation        .990         -.073
Electricity           .895         -.205
Utility               .877          .303
Fuel Oil              .551          .761
Gas                   .709          .639
Bread                 .973         -.078
Eggs                  .587          .337
Apples                .766          .077
Coffee                .457         -.644
Employment            .967         -.202
UEP                  -.695          .521
EmpCost               .986         -.048

Extraction Method: Principal Component Analysis.
a. 2 components extracted.
Factor Rotation

- Find simpler, more easily interpretable factors
- Use the notion of the factor complexity of a variable:

$$q_i = \sum_{j=1}^{r} \left( b_{ij}^{2} - \bar{b}_{i}^{2} \right)^{2}$$

where $r$ is the number of factors, $b_{ij}$ is the loading of variable $i$ on factor $j$, and $\bar{b}_{i}$ is the mean loading for row (variable) $i$.
Factor Rotation

- Quartimax Rotation: maximize q
- Varimax Rotation: maximizes the variance of the squared loadings for each factor rather than for each variable (see the sketch below)
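A small sketch of both rotations on simulated two-factor data (the loading pattern and sample size are invented); scikit-learn's FactorAnalysis accepts rotation='varimax' or 'quartimax' in versions 0.24 and later:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulated two-factor data; the loading pattern below is invented.
rng = np.random.default_rng(1)
factors = rng.normal(size=(500, 2))
true_loadings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
Z = factors @ true_loadings.T + rng.normal(0.0, 0.3, size=(500, 4))

# Fit with each rotation and compare the rotated loading matrices.
for rotation in ("quartimax", "varimax"):
    fa = FactorAnalysis(n_components=2, rotation=rotation).fit(Z)
    print(rotation, np.round(fa.components_.T, 2))
```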
Varimax Rotation

Rotated Component Matrix (a)

                  Component 1   Component 2
Medical Care          .834          .533
MedicalServices       .831          .537
Transportation        .829          .546
Electricity           .835          .383
Utility               .510          .775
Fuel Oil             -.028          .939
Gas                   .172          .939
Bread                 .818          .532
Eggs                  .260          .625
Apples                .560          .529
Coffee                .755         -.232
Employment            .890          .429
UEP                  -.869         -.011
EmpCost               .811          .563

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 3 iterations.
Plot of Loadings on Factors
How Many Factors to Keep?

- Eigenvalues provide information on how much variance is explained
- Proportion explained by a given component = corresponding eigenvalue / n
- Use a scree plot
- Rule of thumb: keep all factors with eigenvalues > 1 (see the sketch after this list)
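A short sketch of these checks, with made-up eigenvalues standing in for real output:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up eigenvalues of a correlation matrix of n = 8 variables.
eigenvalues = np.array([5.6, 1.3, 0.4, 0.3, 0.2, 0.1, 0.06, 0.04])
n = len(eigenvalues)

# Proportion of variance explained = eigenvalue / n (they sum to n).
print(np.round(eigenvalues / n, 3))

# Rule of thumb: keep components with eigenvalue > 1.
print("components kept:", int(np.sum(eigenvalues > 1)))

# Scree plot: look for the elbow where the curve levels off.
plt.plot(np.arange(1, n + 1), eigenvalues, marker="o")
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()
```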
WC Severity vs Factor 1
WC Severity vs Factor 2
What About Categorical Data?

- Factor analysis is performed on numeric data
- You could code the data as binary dummy variables
- Categorical variables from the Texas data:
  - Injury
  - Cause of loss
  - Business class
  - Health insurance (Y/N)
Optimal Scaling

- A method of dealing with categorical variables
- Uses regression to:
  - Assign numbers to categories
  - Fit regression coefficients
  - Y* = f(X*)
- In each round of fitting, a new Y* and X* is created (a sketch of one round follows)
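The sketch below shows one quantification round of this alternating procedure on invented claim records: each category is scored by the mean response within it (the new X*), and the scores are standardized. Full optimal scaling (e.g., SPSS's CATPCA) iterates this quantification together with refitting the regression.

```python
import pandas as pd

# Invented claim records: a categorical predictor and a numeric response.
df = pd.DataFrame({
    "injury":   ["back", "multiple", "sprain", "back", "sprain", "multiple"],
    "paidloss": [12.0, 40.0, 5.0, 15.0, 7.0, 35.0],
})

# One quantification round: score each category by its mean response
# (the new X*), then standardize the scores.
x_star = df.groupby("injury")["paidloss"].transform("mean")
df["injury_scaled"] = (x_star - x_star.mean()) / x_star.std()
print(df)
```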
Variable Correlations

Correlations, Original Variables:

                 injury   cause   Business class
injury            1.000   -.019        .049
cause             -.019   1.000        .105
Business class     .049    .105       1.000

Dimension:      1       2       3
Eigenvalue:   1.109   1.014    .877
Correlations, Transformed Variables (Dimension: 1):

                 injury   cause   Business class
injury            1.000    .710        .433
cause              .710   1.000        .552
Business class     .433    .552       1.000

Dimension:      1       2       3
Eigenvalue:   2.138    .590    .272
Visualizations of Scaled Variables
Can we use scaled variables in prediction?

Mean object scores by paid-loss quintile:

NTILE of Paidloss   Object scores dim. 1   Object scores dim. 2
1                        -.0271                 -.1361
2                         .0246                 -.0626
3                         .0156                 -.0721
4                         .0045                  .0562
5                        -.0158                  .2172
Total                     .0000                  .0000
Row Reduction: Cluster Analysis

- Records are grouped into categories that have similar values on the variables
- Examples:
  - Marketing: people with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
  - Text analysis: use words that tend to occur together to classify documents
  - Fraud modeling
- Note: no dependent variable is used in the analysis
Clustering

- Common methods: k-means, hierarchical
- No dependent variable: records are grouped into classes with similar values on the variables
- Start with a measure of similarity or dissimilarity
- Maximize dissimilarity between members of different clusters (a k-means sketch follows)
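A minimal k-means sketch on invented claim features (the variable names echo the Texas example below, but the values are simulated):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric claim features; real data would come from the
# closed claim file.
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.exponential(200.0, 1800),   # report lag in days
    rng.exponential(900.0, 1800),   # settlement lag in days
    rng.normal(42.0, 12.0, 1800),   # claimant age
])

# Standardize first so no single variable dominates the distance measure.
Z = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
print(np.round(kmeans.cluster_centers_, 2))   # the "final cluster centers"
print(np.bincount(kmeans.labels_))            # records per cluster
```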
Dissimilarity (Distance) Measures – Continuous Variables

- Euclidean distance:

$$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^{2} \right)^{1/2}$$

- Manhattan distance:

$$d_{ij} = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|$$

where $i$ and $j$ index records and $k = 1, \dots, m$ indexes the variables. (A sketch of both measures follows.)
Binary Variables

Counts of matches and mismatches between two records on a set of binary variables:

                    Column Variable
Row Variable        1        0       Total
1                   a        b       a+b
0                   c        d       c+d
Total              a+c      b+d
Binary Variables

- Simple matching:

$$d_{ij} = \frac{b + c}{a + b + c + d}$$

- Rogers and Tanimoto:

$$d_{ij} = \frac{2(b + c)}{(a + d) + 2(b + c)}$$

(A sketch of both coefficients follows.)
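A sketch computing both coefficients from the a, b, c, d counts of the table above, cross-checked against SciPy's built-in Rogers–Tanimoto distance (the two records are invented):

```python
import numpy as np
from scipy.spatial.distance import rogerstanimoto

# Two invented records on six binary variables.
u = np.array([1, 0, 1, 1, 0, 1], dtype=bool)
v = np.array([1, 1, 0, 1, 0, 1], dtype=bool)

a = int(np.sum(u & v))        # both coded 1
b = int(np.sum(u & ~v))       # 1 on the row variable only
c = int(np.sum(~u & v))       # 1 on the column variable only
d = int(np.sum(~u & ~v))      # both coded 0

simple_matching = (b + c) / (a + b + c + d)
rogers_tanimoto = 2 * (b + c) / ((a + d) + 2 * (b + c))

assert np.isclose(rogers_tanimoto, rogerstanimoto(u, v))
print(simple_matching, rogers_tanimoto)
```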
Example: Texas Data

- Data from the 2002 and 2003 closed claim database of the Texas Insurance Department
- Only claims over a threshold are included
- Variables used for clustering:
  - Report lag
  - Settlement lag
  - County (ranked by how often it appears in the data)
  - Injury
  - Cause of loss
  - Business class
Results Using Only Numeric Variables

- Used a squared distance measure

Final Cluster Centers

                                              Cluster 1   Cluster 2   Cluster 3
RANK of NCounty                                 10.741       5.158      14.500
RANK of SumLoss                                 25.155       7.342      53.000
RANK of numSuit                                 11.204       4.553      14.000
Age                                              40.67       42.26       63.00
Elapsed time, injury to report to insurer          233         391         172
Elapsed time, injury to suit filed                 8264        7439           0
Elapsed time, injury to trial                     13893       13843       14627
BackInj                                            .39         .00         .00
MultInj                                            .35         .00         .00
Two Stage Clustering With Categorical Variables

- First compute distances
- Then get clusters
- Find the optimum number of clusters (see the sketch below)
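A sketch of the two stages on invented binary dummies: stage 1 computes a pairwise distance matrix (here with the Rogers–Tanimoto measure), stage 2 runs hierarchical clustering on those distances and tries several cluster counts:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Stage 1: pairwise distances over invented binary dummies standing in
# for categorical claim fields (injury, cause of loss, business class).
rng = np.random.default_rng(3)
dummies = rng.integers(0, 2, size=(60, 8)).astype(bool)
dist = pdist(dummies, metric="rogerstanimoto")

# Stage 2: hierarchical clustering on the precomputed distances; cut the
# tree at several sizes and keep whichever number of clusters looks best.
tree = linkage(dist, method="average")
for k in (2, 3, 4, 5):
    labels = fcluster(tree, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])
```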
Loadings of Injuries on Cluster
Age and Cluster
County vs Cluster
Means of Financial Variables by Cluster

TwoStep Cluster   Paidloss   Allocated expense,          Other ALAE
                             in-house defense counsel
1                  257.112          964                     9.764
2                   78.187            0                     2.444
3                  263.851        2.533                    11.805
4                  174.739        1.540                     5.656
Total              219.855        1.612                     8.421
Modern dimension reduction

- The hidden layer of a neural network acts like a nonlinear principal components analysis
- Projection pursuit regression: a nonlinear PCA
- Kohonen self-organizing maps: a kind of neural network that does clustering (a minimal sketch follows)
- These can be understood as enhancements of factor analysis or clustering
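A minimal Kohonen SOM sketch in plain NumPy (grid size, learning schedule, and data are all invented): each record pulls its best-matching unit, and that unit's grid neighbors, toward itself, so similar records end up on nearby map units.

```python
import numpy as np

# A 4x4 grid of units competes for records; winners and their grid
# neighbors move toward each record.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))                  # standardized claim features
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
W = rng.normal(size=(16, 3))                   # one weight vector per unit

steps = 2000
for t in range(steps):
    x = X[rng.integers(len(X))]
    bmu = np.argmin(np.sum((W - x) ** 2, axis=1))     # best-matching unit
    lr = 0.5 * (1.0 - t / steps)                      # decaying learning rate
    sigma = 0.5 + 2.0 * (1.0 - t / steps)             # shrinking neighborhood
    h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2.0 * sigma**2))
    W += lr * h[:, None] * (x - W)

# Each record maps to its best unit: a clustering laid out on a 2-D map.
labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
print(np.bincount(labels, minlength=16))
```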
Kohonen SOM for Fraud

[3-D surface plot over the SOM grid (units S1–S16 by 1–16); legend bands range from 0–1 up to 4–5]