Predictive modeling competitions: making data science a sport
Anthony Goldbloom, CEO, Kaggle
e-mail [email protected]   twitter @antgoldbloom
1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

1. Motivation

Global competitions

Predicting HIV viral load (chart)
  State of the art: 70%
  After 1½ weeks: 70.8%
  Competition closes: 77%

Crowdsourcing
Mismatch between those with data and those with the skills to analyse it
Countless approaches. Hard to know which will work

Not MIT, not SAS … UoL?

Additional slides

Tourism Forecasting Competition (chart)
  Axis: Forecast Error (MASE)
  Timeline: Existing model, Aug 9, 2 weeks later, 1 month later, Competition End

Chess Ratings Competition (chart)
  Axis: Error Rate (RMSE)
  Timeline: Existing model (ELO), Aug 4, 1 month later, 2 months later, Today

Our User Base

Users apply different techniques
• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• adaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modeling

2. Why compete?

Why Participants Compete
1. Clean, real-world data
2. Professional reputation & experience
3. Interactions with experts in related fields
4. Prizes

3. How it works

Competitions are judged based on predictive accuracy

Competition Mechanics
Competitions are judged on objective criteria
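The two accuracy measures named on the competition charts (MASE for tourism forecasting, RMSE for chess ratings) can be sketched as below. This is a minimal illustration, not Kaggle's scoring code: the function names are hypothetical, and the MASE shown scales by the in-sample error of a naive "repeat the previous value" forecast, which is an assumption because the slides do not say which variant was used.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mase(train, actual, predicted, season=1):
    """Mean absolute scaled error.

    The forecast's mean absolute error is divided by the in-sample mean
    absolute error of a naive forecast that repeats the value observed
    `season` steps earlier (season=1 is an assumption, see lead-in).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if len(train) <= season:
        raise ValueError("training series is too short for this season")
    naive_errors = [abs(train[i] - train[i - season])
                    for i in range(season, len(train))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    if naive_mae == 0:
        raise ValueError("naive forecast has zero error; MASE is undefined")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return mae / naive_mae


if __name__ == "__main__":
    print(rmse([0, 0], [3, 4]))
    print(mase([1, 2, 3, 4], [5, 6], [5, 7]))
```

A MASE below 1 means the submitted forecast beats the naive baseline on average. For both measures, lower is better.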
4. R on Kaggle

R on Kaggle (chart)
  Tools compared: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, Other

R on Kaggle among academics (chart)
  Tools compared: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, Other

R on Kaggle among Americans (chart)
  Tools compared: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, Other

Who Uses R and How

Number  Name                        Winner
4       HIV Progression Prediction  Chris Raimondi
5       Informs 2010                Cole Harris
6       Chess Rating                Yannis Sismanis
7       Tourism Forecasting Part 2  Phil Brierley
10      R Package Recommendation    Max Lin
13      Ford Stay Alert             Edward

Packages: Caret (RFE and RandomForest), GLM, NNET, Forecast, Stats, ROCR, GGPlot, GGPlot2, Stats

5. The Heritage Health Prize

Mmm… how do I put this into R?

MembId  DaysInHospital
25872   0

MembId  AgeAtFirstClaim  Sex
25872   19-Oct           F

MembId  ProviderId  Vendor   PCP     YearSvc  Specialty   Place            PayDelay  LengthOfStay  DSFS         PrimaryConditionGroup  CharIndex  ClaimID
25872   171278567   7891165  294037  Y1       Internal    Office           22                      0- 1 month   RESPR4                 1-2        1
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  23                      0- 1 month   MSC2a3                 0          2
25872   171278567   7891165  294037  Y1       Internal    Office           16                      1- 2 months  RESPR4                 1-2        3
25872   171278567   7891165  294037  Y1       Internal    Office           19                      2- 3 months  RESPR4                 1-2        4
25872   171278567   7891165  294037  Y1       Internal    Office           21                      3- 4 months  RESPR4                 1-2        5
25872   171278567   7891165  294037  Y1       Internal    Office           21                      4- 5 months  RESPR4                 1-2        6
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  11                      7- 8 months  METAB3                 1-2        7

Some SQL Magic

Gives us a flat record

MembId  DaysInHospital  AgeAtFirstClaim  Sex  maxlos  numclaims  inhosp  urgent
25872   0               19-Oct           F            7          0       0

Voila, an entry!

What could the world’s best analysts find in your data?
e-mail [email protected]   phone +61438400053
Photo by gidzy, www.flickr.com/photos/gidzy
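The "Some SQL Magic" step in the Heritage Health Prize walkthrough, which collapses one member's claim rows into a single flat record, might look roughly like the sketch below. The slides do not show the actual query, so everything beyond the column names is an assumption: the table names (members, days, claims) are hypothetical, only a subset of the claim columns is loaded, and the definitions of maxlos (maximum LengthOfStay), numclaims (claim count), inhosp and urgent (claims whose Place is 'Inpatient Hospital' or 'Urgent Care') are guesses at what those column names mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: three tables mirroring the three tables on the slide,
# with only the claim columns this sketch needs.
conn.executescript("""
CREATE TABLE members (MembId INTEGER PRIMARY KEY, AgeAtFirstClaim TEXT, Sex TEXT);
CREATE TABLE days    (MembId INTEGER, DaysInHospital INTEGER);
CREATE TABLE claims  (MembId INTEGER, Place TEXT, LengthOfStay TEXT, ClaimID INTEGER);
""")

# Values copied from the slide; LengthOfStay is blank there, so it is NULL here.
conn.execute("INSERT INTO members VALUES (25872, '19-Oct', 'F')")
conn.execute("INSERT INTO days VALUES (25872, 0)")
places = ["Office", "Independent Lab", "Office", "Office",
          "Office", "Office", "Independent Lab"]
conn.executemany(
    "INSERT INTO claims VALUES (25872, ?, NULL, ?)",
    [(place, claim_id) for claim_id, place in enumerate(places, start=1)],
)

# One output row per member; the aggregate definitions are assumptions.
FLATTEN_SQL = """
SELECT m.MembId,
       d.DaysInHospital,
       m.AgeAtFirstClaim,
       m.Sex,
       MAX(c.LengthOfStay)                                AS maxlos,
       COUNT(c.ClaimID)                                   AS numclaims,
       COALESCE(SUM(c.Place = 'Inpatient Hospital'), 0)   AS inhosp,
       COALESCE(SUM(c.Place = 'Urgent Care'), 0)          AS urgent
FROM members m
JOIN days d        ON d.MembId = m.MembId
LEFT JOIN claims c ON c.MembId = m.MembId
GROUP BY m.MembId, d.DaysInHospital, m.AgeAtFirstClaim, m.Sex
"""

flat = conn.execute(FLATTEN_SQL).fetchall()
print(flat)
```

With the seven claims from the slide, the query returns one row for member 25872 with an empty maxlos, seven claims, and zero inpatient and urgent-care claims. A flat table of this shape can then be read into R or any other modelling tool as one row per member.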