Predictive modeling competitions making data science a sport Anthony Goldbloom CEO, Kaggle e-mail [email protected] twitter @antgoldbloom.

Predictive modeling competitions
making data science a sport
Anthony Goldbloom
CEO, Kaggle
e-mail [email protected]
twitter @antgoldbloom
1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize
Global competitions
Predicting HIV viral load
State of the art: 70%
After 1½ weeks: 70.8%
Competition close: 77%
Crowdsourcing
Mismatch between those with data and those with the skills to analyse it
Countless approaches. Hard to know which will work
Not MIT, not SAS … UoL?
Additional slides
Tourism Forecasting Competition
[Chart: forecast error (MASE) over time. Labels: existing model, Aug 9, 2 weeks later, 1 month later, competition end]
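The tourism chart reports forecast error as MASE (mean absolute scaled error). As a rough illustration of that metric, the sketch below divides a forecast's mean absolute error by the in-sample error of a naive forecast. The function name `mase`, the seasonal-period parameter `m`, and the toy numbers are assumptions made for illustration; this is not the competition's scoring code or data.

```python
def mase(train, actual, forecast, m=1):
    """Mean absolute scaled error.

    train    -- in-sample history used to scale the error
    actual   -- held-out true values
    forecast -- predictions for the held-out period
    m        -- seasonal period of the naive benchmark (1 = non-seasonal)
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) <= m:
        raise ValueError("train must be longer than the seasonal period m")

    # In-sample mean absolute error of the naive forecast y[t] = y[t - m].
    naive_errors = [abs(train[t] - train[t - m]) for t in range(m, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")

    # Out-of-sample mean absolute error of the forecast being scored.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Toy series (hypothetical numbers, not competition data).
history = [10, 12, 14, 16]
score = mase(history, actual=[18, 20], forecast=[17, 22])
print(score)
```

A value below 1 means the forecast has a smaller average error than the naive benchmark had on the training history.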
Chess Ratings Competition
[Chart: error rate (RMSE) over time. Labels: existing model (Elo), Aug 4, 1 month later, 2 months later, today]
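The chess chart compares entries against the existing Elo model using RMSE. A minimal sketch of both pieces follows: the standard Elo expected-score formula and a plain root-mean-square error over per-game predictions. The helper names and the three sample games are hypothetical, and the slide does not say how the competition aggregated games before computing RMSE, so per-game scoring here is an assumption.

```python
import math


def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B (standard Elo formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical games: (white rating, black rating, white's result),
# where the result is 1 for a win, 0.5 for a draw, 0 for a loss.
games = [(1500, 1500, 1.0), (1900, 1500, 1.0), (1500, 1500, 0.0)]
predictions = [elo_expected(white, black) for white, black, _ in games]
outcomes = [result for _, _, result in games]
print(rmse(predictions, outcomes))
```

Lower is better: a model whose predicted scores sit closer to the actual results than Elo's would show a smaller RMSE.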
Our User Base
Users apply different techniques
• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modeling
2. Why compete?
Why Participants Compete
1. Clean, real-world data
2. Professional reputation & experience
3. Interactions with experts in related fields
4. Prizes
3. How it works
Competition Mechanics
Competitions are judged on objective criteria
Competitions are judged based on predictive accuracy
4. R on Kaggle
R on Kaggle
[Chart legend: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, Other]
R on Kaggle among academics
[Chart legend: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, Other]
R on Kaggle among Americans
[Chart legend: R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata, Other]
Who Uses R and How
Number  Name                        Winner
4       HIV Progression Prediction  Chris Raimondi
5       INFORMS 2010                Cole Harris
6       Chess Rating                Yannis Sismanis
7       Tourism Forecasting Part 2  Phil Brierley
10      R Package Recommendation    Max Lin
13      Ford Stay Alert             Edward
Packages: Caret (RFE and RandomForest); GLM, NNET; Forecast; Stats, ROCR, GGPlot, GGPlot2; Stats
5. The Heritage Health Prize
Mmm… how do I put this into R?
MembId  DaysInHospital
25872   0

MembId  AgeAtFirstClaim  Sex
25872   19-Oct           F

MembId  ProviderId  Vendor   PCP     YearSvc  Specialty   Place            PayDelay  LengthOfStay  DSFS         PrimaryConditionGroup  CharIndex  ClaimID
25872   171278567   7891165  294037  Y1       Internal    Office           22                      0- 1 month   RESPR4                 1-2        1
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  23                      0- 1 month   MSC2a3                 0          2
25872   171278567   7891165  294037  Y1       Internal    Office           16                      1- 2 months  RESPR4                 1-2        3
25872   171278567   7891165  294037  Y1       Internal    Office           19                      2- 3 months  RESPR4                 1-2        4
25872   171278567   7891165  294037  Y1       Internal    Office           21                      3- 4 months  RESPR4                 1-2        5
25872   171278567   7891165  294037  Y1       Internal    Office           21                      4- 5 months  RESPR4                 1-2        6
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  11                      7- 8 months  METAB3                 1-2        7
Some SQL Magic
Gives us a flat record
MembId  DaysInHospital  AgeAtFirstClaim  Sex  maxlos  numclaims  inhosp  urgent
25872   0               19-Oct           F            7          0       0
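The "SQL magic" step is not shown on the slide. One way it might look is sketched below using Python's built-in `sqlite3` module: the three tables are joined on `MembId` and the claims are aggregated into one row per member. The table names and the definitions of `maxlos`, `numclaims`, `inhosp`, and `urgent` are assumptions inferred from the column names (here `inhosp` and `urgent` count claims whose `Place` is 'Inpatient Hospital' or 'Urgent Care'); only a subset of the claim columns is loaded to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DaysInHospital (MembId INTEGER, DaysInHospital INTEGER);
CREATE TABLE Members (MembId INTEGER, AgeAtFirstClaim TEXT, Sex TEXT);
CREATE TABLE Claims (MembId INTEGER, Specialty TEXT, Place TEXT,
                     PayDelay INTEGER, LengthOfStay INTEGER, ClaimID INTEGER);
""")

# Sample rows copied from the slide (member 25872); LengthOfStay is blank there.
conn.execute("INSERT INTO DaysInHospital VALUES (25872, 0)")
conn.execute("INSERT INTO Members VALUES (25872, '19-Oct', 'F')")
claims = [
    (25872, "Internal", "Office", 22, None, 1),
    (25872, "Laboratory", "Independent Lab", 23, None, 2),
    (25872, "Internal", "Office", 16, None, 3),
    (25872, "Internal", "Office", 19, None, 4),
    (25872, "Internal", "Office", 21, None, 5),
    (25872, "Internal", "Office", 21, None, 6),
    (25872, "Laboratory", "Independent Lab", 11, None, 7),
]
conn.executemany("INSERT INTO Claims VALUES (?, ?, ?, ?, ?, ?)", claims)

# One flat row per member: member attributes plus per-member claim aggregates.
# The aggregate definitions are assumptions, not taken from the slide.
FLATTEN_SQL = """
SELECT m.MembId,
       d.DaysInHospital,
       m.AgeAtFirstClaim,
       m.Sex,
       MAX(c.LengthOfStay)                 AS maxlos,
       COUNT(c.ClaimID)                    AS numclaims,
       SUM(c.Place = 'Inpatient Hospital') AS inhosp,
       SUM(c.Place = 'Urgent Care')        AS urgent
FROM Members m
JOIN DaysInHospital d ON d.MembId = m.MembId
LEFT JOIN Claims c ON c.MembId = m.MembId
GROUP BY m.MembId, d.DaysInHospital, m.AgeAtFirstClaim, m.Sex
"""
flat_rows = conn.execute(FLATTEN_SQL).fetchall()
print(flat_rows)
```

The resulting rows can be written to a CSV file and read into R as an ordinary data frame, with one row per member and the target `DaysInHospital` alongside the features.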
Voila, an entry!
What could the world’s best analysts find in your data?
e-mail [email protected]
phone +61438400053
Photo by gidzy, www.flickr.com/photos/gidzy