Dimension Reduction in Workers Compensation
CAS Predictive Modeling Seminar
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
[email protected]
www.data-mines.com

Objectives
- Answer the questions: What is dimension reduction, and why use it?
- Introduce key methods of dimension reduction
- Illustrate the methods with examples from Workers Compensation
- There will be some formulas, but the emphasis is on insight into the basic mechanisms of the procedures

Introduction
- "How do mere observations become data for analysis?"
- "Specific variable values are never immutable characteristics of the data"
  (Jacoby, Data Theory and Dimensional Analysis, Sage Publications)
- Many of the dimension reduction and measurement techniques originated in the social sciences, where they dealt with how to create scales from responses to attitudinal and opinion surveys

Unsupervised Learning
- Dimension reduction methods are generally unsupervised learning
- Supervised learning: there is a dependent or target variable
- Unsupervised learning: there is no target variable; like variables or like records are grouped together

The Data
- BLS economic indexes: components of inflation, employment data, health insurance inflation
- Texas Department of Insurance closed claim data for 2002 and 2003:
  - Employment-related injuries
  - Excludes small claims
  - About 1,800 records
What Is a Dimension?
- Jacoby: any information that adds significant variability
- In many studies each variable is a dimension
- However, we can also view each record in a database as a dimension

Dimensions
Index values by year:

Year   Medical Care   Medical Services   Transportation   Electricity
1980       74.90           74.80              83.10          26.70
1981       82.90           82.80              93.20          31.55
1982       92.50           92.60              97.00          36.01
1983      100.60          100.70              99.30          37.18
1984      106.80          106.70             103.70          38.60
1985      113.50          113.20             106.40          38.98
1986      122.00          121.90             102.30          40.22
1987      130.10          130.00             105.40          40.02
1988      138.60          138.30             108.70          40.20
1989      149.30          148.90             114.10          40.83
1990      162.80          162.70             120.50          41.66

The Two Major Categories of Dimension Reduction
- Variable reduction: factor analysis, principal components analysis
- Record reduction: clustering
- Other methods tend to be developments of these

Principal Components Analysis
- A form of dimension (variable) reduction
- Suppose we want to combine all of the information related to the "inflation" dimension of insurance costs: medical care costs, employment (wage) costs, other energy, transportation, services

Principal Components
- These variables are correlated, but not perfectly correlated
- We replace the many variables with a weighted sum of the variables
- The resulting components are then used as independent variables in a predictive model

Factor Analysis: A Latent Factor
[Diagram: a latent "Social Inflation" factor driving the observed variables litigation rates, number of procedures, and an index of tort climate]

Factor/Principal Components Analysis
- Linear methods: they use the linear correlation matrix
- The correlation matrix is decomposed to find a smaller number of factors that are related to the same underlying drivers
- Highly correlated variables tend to load heavily on the same factor

Correlation matrix of the inflation variables (lower triangle):

                  MedCare  MedSvc  Transp  Elec   Util   FuelOil  Gas    Bread
Medical Care       1.000
Medical Services   1.000   1.000
Transportation     0.993   0.992   1.000
Electricity        0.888   0.884   0.910   1.000
Utility            0.872   0.873   0.875   0.771  1.000
Fuel Oil           0.448   0.451   0.468   0.281  0.704  1.000
Gas                0.586   0.592   0.601   0.402  0.752  0.926   1.000
Bread              0.983   0.983   0.975   0.844  0.847  0.459   0.595  1.000

- Factor/principal components analysis uses eigenvectors and eigenvalues
- With R the correlation matrix, V the matrix of eigenvectors, and Lambda the diagonal matrix of eigenvalues: R V = V Lambda

Inflation Data
Component matrix (extraction method: principal component analysis; 2 components extracted):

Variable           Component 1   Component 2
Medical Care           .986          -.086
Medical Services       .986          -.081
Transportation         .990          -.073
Electricity            .895          -.205
Utility                .877           .303
Fuel Oil               .551           .761
Gas                    .709           .639
Bread                  .973          -.078
Eggs                   .587           .337
Apples                 .766           .077
Coffee                 .457          -.644
Employment             .967          -.202
UEP                   -.695           .521
EmpCost                .986          -.048

Factor Rotation
- Find simpler, more easily interpretable factors
- Use the notion of factor complexity:
  q_i = (1/r) Σ_{j=1}^{r} (b_ij^2 − b̄_i^2)^2
  where r is the number of factors, b_ij is the loading of variable i on factor j, and b̄_i^2 is the mean squared loading for row (variable) i
- Quartimax rotation: maximizes q
- Varimax rotation: maximizes the variance of the squared loadings for each factor rather than for each variable

Varimax Rotation
Rotated component matrix (rotation method: varimax with Kaiser normalization; rotation converged in 3 iterations):

Variable           Component 1   Component 2
Medical Care           .834           .533
Medical Services       .831           .537
Transportation         .829           .546
Electricity            .835           .383
Utility                .510           .775
Fuel Oil              -.028           .939
Gas                    .172           .939
Bread                  .818           .532
Eggs                   .260           .625
Apples                 .560           .529
Coffee                 .755          -.232
Employment             .890           .429
UEP                   -.869          -.011
EmpCost                .811           .563

[Figure: plot of the variable loadings on the two factors]

How Many Factors to Keep?
- Eigenvalues provide information on how much variance is explained
- Proportion explained by a given component = corresponding eigenvalue / n
- Use a scree plot
- Rule of thumb: keep all factors with eigenvalues > 1

[Figures: WC severity vs. Factor 1; WC severity vs. Factor 2]
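The eigen decomposition and the eigenvalue-greater-than-one retention rule described above can be sketched in Python. This is a minimal illustration, not the presenter's code; it uses only numpy and the four index series from the Dimensions table:

```python
import numpy as np

# Index values from the Dimensions table, 1980-1990:
# columns are Medical Care, Medical Services, Transportation, Electricity
X = np.array([
    [ 74.90,  74.80,  83.10, 26.70],
    [ 82.90,  82.80,  93.20, 31.55],
    [ 92.50,  92.60,  97.00, 36.01],
    [100.60, 100.70,  99.30, 37.18],
    [106.80, 106.70, 103.70, 38.60],
    [113.50, 113.20, 106.40, 38.98],
    [122.00, 121.90, 102.30, 40.22],
    [130.10, 130.00, 105.40, 40.02],
    [138.60, 138.30, 108.70, 40.20],
    [149.30, 148.90, 114.10, 40.83],
    [162.80, 162.70, 120.50, 41.66],
])

R = np.corrcoef(X, rowvar=False)       # linear correlation matrix
eigvals, V = np.linalg.eigh(R)         # solves R V = V Lambda
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
eigvals, V = eigvals[order], V[:, order]

# proportion of variance explained = eigenvalue / number of variables
prop = eigvals / R.shape[0]
keep = eigvals > 1.0                   # rule of thumb: eigenvalue > 1

# component scores: standardized data times the retained eigenvectors
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ V[:, keep]
```

Because these series are so strongly correlated, the first component dominates, and here it is the only one with an eigenvalue above 1.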
What About Categorical Data?
- Factor analysis is performed on numeric data
- You could code the data as binary dummy variables
- Categorical variables from the Texas data: injury, cause of loss, business class, health insurance (Y/N)

Optimal Scaling
- A method for dealing with categorical variables
- Uses regression to assign numbers to the categories and to fit regression coefficients: Y* = f(X*)
- In each round of fitting, a new Y* and a new X* are created

Variable Correlations
Correlations of the original variables:

                 injury   cause   Business class
injury            1.000   -.019        .049
cause             -.019   1.000        .105
Business class     .049    .105       1.000

Eigenvalues: dimension 1 = 1.109, dimension 2 = 1.014, dimension 3 = .877

Correlations of the transformed variables (dimension 1):

                 injury   cause   Business class
injury            1.000    .710        .433
cause              .710   1.000        .552
Business class     .433    .552       1.000

Eigenvalues: dimension 1 = 2.138, dimension 2 = .590, dimension 3 = .272

Visualizations of the Scaled Variables
[Figure: plots of the scaled variables]

Can We Use the Scaled Variables in Prediction?
Mean object scores by quintile of paid loss:

Quintile of Paidloss   Dimension 1 score   Dimension 2 score
1                           -.0271              -.1361
2                            .0246              -.0626
3                            .0156              -.0721
4                            .0045               .0562
5                           -.0158               .2172
Total                        .0000               .0000

Row Reduction: Cluster Analysis
- Records are grouped into categories that have similar values on the variables
- Examples:
  - Marketing: people with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing purposes
  - Text analysis: words that tend to occur together are used to classify documents
  - Fraud modeling
- Note: no dependent variable is used in the analysis

Clustering
- Common methods: k-means, hierarchical
- No dependent variable: records are grouped into classes that have similar values on the variables
- Start with a measure of similarity or dissimilarity
- Maximize the dissimilarity between members of different clusters

Dissimilarity (Distance) Measures: Continuous Variables
With i, j indexing records and k = 1, ..., m indexing variables:

Euclidean distance:  d_ij = [ Σ_{k=1}^{m} (x_ik − x_jk)^2 ]^{1/2}

Manhattan distance:  d_ij = Σ_{k=1}^{m} |x_ik − x_jk|

Binary Variables
Cross-tabulation of two binary variables (cell counts a, b, c, d):

                       Column variable
                       1        0
Row variable    1      a        b       a+b
                0      c        d       c+d
                       a+c      b+d

Dissimilarity measures for binary variables:

Simple matching:      d = (b + c) / (a + b + c + d)

Rogers and Tanimoto:  d = 2(b + c) / [(a + d) + 2(b + c)]

Example: Texas Data
- Data from the 2002 and 2003 closed claim databases of the Texas Insurance Department
- Only claims over a threshold are included
- Variables used for clustering: report lag, settlement lag, county (ranked by how often it appears in the data), injury, cause of loss, business class

Results Using Only the Numeric Variables
- Used the squared (Euclidean) distance measure

Final cluster centers:

Variable                                       Cluster 1   Cluster 2   Cluster 3
Rank of NCounty                                  10.741       5.158      14.500
Rank of SumLoss                                  25.155       7.342      53.000
Rank of numSuit                                  11.204       4.553      14.000
Age                                               40.67       42.26       63.00
Elapsed time, injury to report to insurer           233         391         172
Elapsed time, injury to suit filed                 8264        7439           0
Elapsed time, injury to trial                     13893       13843       14627
BackInj                                             .39         .00         .00
MultInj                                             .35         .00         .00

Two-Stage Clustering with Categorical Variables
- First compute the distances
- Then form the clusters
- Find the optimum number of clusters

[Figures: loadings of injuries on the clusters; age and county vs. cluster]

Means of financial variables by TwoStep cluster number:

Cluster   Paidloss   ALAE, in-house defense counsel   Other ALAE
1          257.112              964                      9.764
2           78.187                0                      2.444
3          263.851            2.533                     11.805
4          174.739            1.540                      5.656
Total      219.855            1.612                      8.421

Modern Dimension Reduction
- The hidden layer in a neural network acts like a nonlinear principal components analysis
- Projection pursuit regression: a nonlinear PCA
- Kohonen self-organizing maps: a kind of neural network that does clustering
- These can be understood as enhancements of factor analysis or clustering

Kohonen SOM for Fraud
[Figure: Kohonen self-organizing map of the fraud data]
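The k-means procedure with Euclidean distance can be sketched as follows. This is a minimal numpy-only illustration, not the analysis run on the Texas data; the toy records and the choice of k = 2 are assumptions for demonstration:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each record to its nearest center under
    Euclidean distance, then recompute each center as the cluster mean."""
    rng = np.random.default_rng(seed)
    # initialize centers at k distinct randomly chosen records
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every record to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# toy records: two well-separated groups
X = np.array([[ 0.0,  0.0], [ 0.0,  1.0], [ 1.0,  0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans(X, k=2)
```

Squaring the distance, as in the Texas example, does not change the assignments, since the nearest center under Euclidean distance is also the nearest under squared distance.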