Statistical Prediction and Machine Learning in Football


By Andrew Finley
Research Question
 Is it possible to predict a football player’s professional performance
based on collegiate performance? That is, is it possible
to accurately predict a player’s NFL statistics using
only their collegiate statistics?
 Why – Too many “busts”
 How –
 Gather statistics for both NCAA and NFL players
 Use statistics and ML algorithms to train a program
 Use program to predict unseen examples
Presentation Outline
 Related Works
 Alternate applications of machine learning in sport
 My Approach
 Machine Learning - Classification
 Decision Tree Algorithm
 Implementation
 Statistics to predict
 Gather and Format Statistics
 Insert into Weka (ML software)
 Build Decision Tree
 Results and Analysis
 Cross-validation
 Feature Selection
Related Works
 Mr. NFL/NCAA (Predicts Games)
 Classification using Linear Regression on Team Statistics
 FFtoday.com (Predicts Fantasy Football Stats)
 Linear Regression on Fantasy Football Statistics
 Draft Tek (Predicts NFL Draft)
 Ranks college players and takes a matrix of team needs at
every position
 SABRmetrics
 Use statistical analysis to create new baseball statistics
 Example:

RUNS = (.41) 1B + (.82) 2B + (1.06) 3B + (1.42) HR
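The formula above is a weighted sum of hit counts. A minimal Python sketch of that arithmetic follows; the coefficients are copied from the formula, while the function name and the season totals are made up for illustration.

```python
# Coefficients copied from the RUNS formula; everything else is hypothetical.
WEIGHTS = {"1B": 0.41, "2B": 0.82, "3B": 1.06, "HR": 1.42}


def estimated_runs(hits):
    """Weighted sum of hit counts, one weight per hit type."""
    return sum(WEIGHTS[kind] * count for kind, count in hits.items())


# Made-up season totals, for illustration only.
season = {"1B": 100, "2B": 30, "3B": 5, "HR": 20}
print(round(estimated_runs(season), 1))
```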
Machine Learning
 Type – Supervised Learning (Classification)
 Program is given a set of examples (instances) from
which it learns to classify unseen examples
 Each instance is a set of attribute values with a
known class
 The goal is to generate a set of rules that will correctly
classify new examples
 Algorithm:

Decision Tree
Decision Tree
 Create a graph (tree) from the training data.
 The leaves are the classes, and branches are attribute
values
 Goal is to make the smallest tree possible that covers
all instances
 Use the tree to make a set of classification rules
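To make the tree-to-rules idea concrete, here is a small Python sketch. It does not build a tree from data; it takes a hand-written toy tree and shows how an instance is classified and how each root-to-leaf path becomes one rule. The attribute names, values, and class labels are hypothetical and are not taken from the project's data.

```python
# A hand-written toy tree: internal nodes test one attribute, branches are
# attribute values, and leaves are class labels. All names are hypothetical.
TREE = {
    "attribute": "rush_yds_per_game",
    "branches": {
        "high": {
            "attribute": "strength_of_schedule",
            "branches": {"strong": "starter", "weak": "non-starter"},
        },
        "low": "non-starter",
    },
}


def classify(tree, instance):
    """Follow the branches matching the instance's attribute values to a leaf."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def to_rules(tree, conditions=()):
    """Turn every root-to-leaf path into one (conditions, class) rule."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += to_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules


for conds, label in to_rules(TREE):
    tests = " AND ".join(f"{attr} = {val}" for attr, val in conds)
    print(f"IF {tests} THEN {label}")
```

A smaller tree yields fewer and shorter rules, which is one reason the goal is the smallest tree that still covers the training instances.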
My Data
 I narrowed my predictions down to just quarterbacks (QB) and
running backs (RB)
 Input (NCAA):
 Individual and team stats from every year of college play, as
well as team rankings and strength of schedule, and height
and weight
 Combine data not included due to lack of participation
 Output (NFL):
 RB: Yds/Carry, Total Rushing Yards, and Rushing TDs for
each of the first 3 seasons, plus whether the player is starting after 3 seasons
 QB: Total Passing Yards, Passing TDs, Interceptions, and QB
Rating for each of the first 3 seasons, plus whether the player is starting after 3 seasons
Data Retrieval
 Step 1 – Find statistics
 Online: NFL.com, NCAA.org
 Collegio Football: Database Software
 Step 2 – Extract data
 Python scripts parsed the necessary statistics from the websites
 Statistics from Collegio were exported manually
 Step 3 – Convert data into correct format
 Python scripts combined the data into 2 large .csv
files, one for RB and one for QB
 Missing data is filled in as accurately as possible
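Step 3 can be sketched roughly as follows. This is not the original script, only an illustration of joining college and professional rows on the player's name and writing one wide row per player. The function names, the `NFL_` column prefix, and the in-memory stand-in files are assumptions; the two yardage values are borrowed from the Ronnie Brown example.

```python
import csv
import io


def combine(ncaa_rows, nfl_rows, key="Player"):
    """Join NCAA and NFL rows on a shared key into one wide row per player."""
    nfl_by_player = {row[key]: row for row in nfl_rows}
    combined = []
    for ncaa in ncaa_rows:
        nfl = nfl_by_player.get(ncaa[key], {})
        merged = dict(ncaa)
        # Prefix NFL columns so they cannot clash with NCAA column names.
        merged.update({f"NFL_{k}": v for k, v in nfl.items() if k != key})
        combined.append(merged)
    return combined


def write_csv(rows, out):
    """Write rows using the union of all columns; absent values stay empty."""
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    writer = csv.DictWriter(out, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Tiny in-memory stand-ins for the scraped files.
ncaa = list(csv.DictReader(io.StringIO("Player,School,Rush Yds1\nRonnie Brown,Auburn,1008\n")))
nfl = list(csv.DictReader(io.StringIO("Player,RushYds1\nRonnie Brown,907\n")))
buffer = io.StringIO()
write_csv(combine(ncaa, nfl), buffer)
print(buffer.getvalue())
```

Players without an NFL row simply get empty NFL columns here; how those gaps are then filled is a separate decision that this sketch does not make.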
Example
Player: Ronnie Brown    School: Auburn

NCAA data (blue on the slide); each column name carries the suffix 1, 2, or 3
Stat         1       2       3
Year         2002    2003    2004
Pos          RB      RB      RB
Cl           So      Jr      Sr
G            12      6       12
Rush Yds     1008    446     913
Car          175     95      153
Rush TD      13      5       8
Yds/Car      5.76    4.7     5.97
RushYds/G    84      74.3    76.1
Rec Yds      166     80      313
Rec          9       8       34
Rec TD       1       0       1
Yds/Rec      18.4    10      9.2
Rec/G        0       1       2
RecYds/G     13.8    13.3    26.1
PR           0       0       0
PR Yds       0       0       0
PR TD        0       0       0
Yds/PR       0       0       0
PR/G         0       0       0
KR           0       0       0
KR Yds       0       0       0
KR TD        0       0       0
Yds/KR       0       0       0
KR/G         0       0       0
Ret TD       0       0       0
Tot Yds      1174    526     1226
Tot TD       14      5       9
TotYds/G     97.8    87.6    102.2
Height: 6'-1''    Weight: 230

NFL data (red on the slide); each column name carries the suffix 1, 2, or 3
Stat         1                 2                 3
Season       2005              2006              2007
Team         Miami Dolphins    Miami Dolphins    Miami Dolphins
G            15                13                7
GS           14                12                7
Att          207               241               119
RushYds      907               1008              602
RushAvg      4.4               4.2               5.1
RushLng      65                47                60
RushTD       4                 5                 4
Rec          32                33                39
RecYds       232               276               389
RecAvg       7.3               8.4               10
RecLng       38                24                43
RecTD        1                 0                 1
FUM          4                 4                 0
Lost         4                 2                 0
Starting     TRUE              TRUE              TRUE
Weka Data Processing
 Weka is a collection of machine learning algorithms
written in Java.
 It accepts .csv files only in a particular format.
 Preprocessing:
 Apply filters to fix missing stats
 Remove all NFL data except statistic being predicted
 Turn the desired statistic into classes: if numeric, separate it into
ranges; if nominal, separate it by values.
 Specify attributes
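Separating a numeric statistic into ranges amounts to binning. The sketch below shows the idea in plain Python rather than with Weka's own filters; the function name and the cut points (500 and 1000 yards) are arbitrary assumptions, and the three inputs are the NFL rushing-yard totals from the Ronnie Brown example.

```python
import bisect


def to_range_label(value, cut_points):
    """Map a numeric value to a half-open range label such as '[500, 1000)'."""
    edges = [float("-inf")] + sorted(cut_points) + [float("inf")]
    # Index of the range whose lower edge is the largest edge <= value,
    # clamped so that an infinite value still falls in the last range.
    i = min(bisect.bisect_right(edges, value) - 1, len(edges) - 2)
    return f"[{edges[i]}, {edges[i + 1]})"


# Hypothetical cut points; each season's total becomes a nominal class.
for yards in (907, 1008, 602):
    print(yards, "->", to_range_label(yards, [500, 1000]))
```

Where the cut points fall changes the class balance, so they are worth choosing with the resulting class sizes in mind.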
Building the Tree
 Tree is constructed from specified attributes.
 Weka converts tree to classification rules.
 Accuracy is measured using cross-validation.
 Cross-validation: break the training data into a
specified number of sets and use each set once as the test
data while the rest are used as training data.
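The splitting scheme just described can be sketched in a few lines of Python. This is an illustration, not Weka's implementation: it assigns instances to sets round-robin without shuffling or stratifying, and the function names and the `train_fn`/`predict_fn` parameters are hypothetical.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices); each instance is tested exactly once."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(instances, labels, k, train_fn, predict_fn):
    """Fraction of instances classified correctly when each is tested once."""
    correct = 0
    for train, test in k_fold_splits(len(instances), k):
        model = train_fn([instances[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, instances[i]) == labels[i] for i in test)
    return correct / len(instances)


for train, test in k_fold_splits(6, 3):
    print("train:", train, "test:", test)
```

Any learner can be plugged in through `train_fn` and `predict_fn`, so the same loop would serve to compare a decision tree against other algorithms.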
Initial Results
 Initial runs using all attributes failed: each produced a one-layer
tree that maps every instance to false for the predicted statistic.
 The accuracy varies greatly with slight changes to
attributes used.
 Tree size seems to increase as the number of attributes used
decreases.
Analysis
 The initial one-layer tree that was built gave an accuracy
of 68%.
 This tree uses none of the attributes, so I treat it as the baseline
and should be able to get better accuracy than this.
 Attribute selection needs to improve.
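A tree that answers false for every instance is correct exactly as often as false occurs in the data, so its accuracy is the majority-class share. The Python sketch below computes that baseline; the function name is hypothetical, and the 68/32 split over 100 instances is an assumed distribution chosen only to reproduce the 68% figure, not the real data set.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Assumed class distribution, picked only to match the reported 68%.
labels = [False] * 68 + [True] * 32
print(majority_baseline_accuracy(labels))
```

Comparing every later tree against this number shows whether the selected attributes add anything beyond the class distribution itself.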
Next
 Improve attribute selection to optimize accuracy.
 (If time) Implement other algorithms to compare
accuracy.
Questions?