Statistical Prediction and Machine Learning in Football
By Andrew Finley
Research Question
Is it possible to predict a football player’s professional
performance based on collegiate performance? That is, is it
possible to accurately predict a player’s NFL statistics
using only their collegiate statistics?
Why – Too many “busts”
How –
Gather statistics for both NCAA and NFL players
Use statistics and ML algorithms to train a program
Use program to predict unseen examples
Presentation Outline
Related Works
Alternate applications of machine learning in sport
My Approach
Machine Learning - Classification
Decision Tree Algorithm
Implementation
Statistics to predict
Gather and Format Statistics
Insert into Weka (ML software)
Build Decision Tree
Results and Analysis
Cross-validation
Feature Selection
Related Works
Mr. NFL/NCAA (Predicts Games)
Classification using Linear Regression on Team Statistics
FFtoday.com (Predicts Fantasy Football Stats)
Linear Regression on Fantasy Football Statistics
Draft Tek (Predicts NFL Draft)
Ranks college players and takes a matrix of team needs at
every position
SABRmetrics
Use statistical analysis to create new baseball statistics
Example:
RUNS = (.41) 1B + (.82) 2B + (1.06) 3B + (1.42) HR
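The formula above is a weighted sum of hit counts. A minimal sketch of that arithmetic, using the coefficients from the slide; the season line below is invented for illustration and is not real player data:

```python
# Coefficients copied from the RUNS formula on the slide.
WEIGHTS = {"1B": 0.41, "2B": 0.82, "3B": 1.06, "HR": 1.42}


def estimated_runs(hits):
    """Weighted sum of hit counts; hit types missing from `hits` count as zero."""
    return sum(WEIGHTS[kind] * hits.get(kind, 0) for kind in WEIGHTS)


# Hypothetical season line (assumption, for illustration only).
season = {"1B": 100, "2B": 30, "3B": 5, "HR": 20}
print(round(estimated_runs(season), 1))
```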
Machine Learning
Type – Supervised Learning (Classification)
Program is given a set of examples (instances) from
which it learns to classify unseen examples
Each instance is a set of attribute values with a
known class
The goal is to generate a set of rules that will correctly
classify new examples
Algorithm:
Decision Tree
Decision Tree
Create a graph (tree) from the training data.
The leaves are the classes, and branches are attribute
values
Goal is to make the smallest tree possible that covers
all instances
Use the tree to make a set of classification rules
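The steps above can be sketched in a few lines. This is an ID3-style sketch for nominal attributes, not Weka's implementation: it picks splits greedily by entropy, which tends toward small trees but does not guarantee the smallest one. The attribute names, values, and class labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def remainder(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=remainder)


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or an (attribute, {value: subtree}) node."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    rest = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset], rest)
    return (attr, branches)


def classify(tree, row, default=None):
    """Follow the branches matching the row's attribute values down to a leaf."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if row[attr] not in branches:
            return default  # attribute value never seen in training
        tree = branches[row[attr]]
    return tree


def rules(tree, path=()):
    """Flatten the tree into 'IF ... THEN class' rule strings."""
    if not isinstance(tree, tuple):
        cond = " AND ".join(f"{a} = {v}" for a, v in path) or "TRUE"
        return [f"IF {cond} THEN {tree}"]
    attr, branches = tree
    return [r for v, sub in branches.items()
            for r in rules(sub, path + ((attr, v),))]


# Hypothetical training instances (assumption, for illustration only).
rows = [
    {"yds_per_carry": "high", "conference": "major"},
    {"yds_per_carry": "high", "conference": "minor"},
    {"yds_per_carry": "low", "conference": "major"},
    {"yds_per_carry": "low", "conference": "minor"},
]
labels = ["starter", "starter", "bench", "bench"]

tree = build_tree(rows, labels, ["conference", "yds_per_carry"])
for rule in rules(tree):
    print(rule)
```

Here the leaves are class labels and each branch is one attribute value, matching the description on the slide. In this toy data one attribute separates the classes completely, so the tree has a single split.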
My Data
I narrowed my predictions down to just quarterbacks and
running backs
Input (NCAA):
Individual and team stats from every year of college play, as
well as team rankings and strength of schedule, and height
and weight
Combine data not included due to lack of participation
Output (NFL):
RB: Yds/Carry, Total Rushing Yards, and Rushing TDs for
each of the first 3 seasons, and whether the player is
starting after 3 seasons
QB: Total Passing Yards, Passing TDs, Interceptions, and QB
Rating for each of the first 3 seasons, and whether the
player is starting after 3 seasons
Data Retrieval
Step 1 – Find statistics
Online: NFL.com, NCAA.org
Collegio Football: Database Software
Step 2 – Extract data
Python scripts parsed the necessary statistics from the websites
Statistics from Collegio were exported manually
Step 3 – Convert data into correct format
Python scripts were used to combine the data into 2 large
.csv files, one for RB and one for QB
Missing data is filled in as accurately as possible
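A minimal sketch of what Step 3 might look like, assuming the NCAA and NFL statistics have already been parsed into one dictionary per player. The function names and the `NFL_` prefix are my own assumptions; the prefix is only there because some column names (such as `G1`) occur in both data sets. Missing values are written as empty cells, which is also an assumption.

```python
import csv
import io


def merge_player_rows(ncaa_rows, nfl_rows, key="Player"):
    """Join NCAA and NFL records on the player name, one output row per player."""
    nfl_by_player = {row[key]: row for row in nfl_rows}
    merged = []
    for ncaa in ncaa_rows:
        nfl = nfl_by_player.get(ncaa[key], {})
        combined = dict(ncaa)
        # Prefix NFL columns so they cannot overwrite NCAA columns of the same name.
        combined.update({f"NFL_{k}": v for k, v in nfl.items() if k != key})
        merged.append(combined)
    return merged


def write_csv(rows, out):
    """Write rows with a header that is the union of all columns, in first-seen order."""
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# A few values taken from the Ronnie Brown example row.
ncaa = [{"Player": "Ronnie Brown", "School": "Auburn", "G1": "12", "Rush Yds1": "1008"}]
nfl = [{"Player": "Ronnie Brown", "G1": "15", "RushYds1": "907"}]

buf = io.StringIO()
write_csv(merge_player_rows(ncaa, nfl), buf)
print(buf.getvalue())
```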
Example
One row of the RB .csv file, shown here with one column per
year. In the file, each per-year attribute carries the year
number as a suffix (e.g. Rush Yds1, Rush Yds2, Rush Yds3).
Player: Ronnie Brown
School: Auburn

NCAA data (blue on the slide)
Attribute    Year 1   Year 2   Year 3
Year         2002     2003     2004
Pos          RB       RB       RB
Cl           So       Jr       Sr
G            12       6        12
Rush Yds     1008     446      913
Car          175      95       153
Rush TD      13       5        8
Yds/Car      5.76     4.7      5.97
RushYds/G    84       74.3     76.1
Rec Yds      166      80       313
Rec          9        8        34
Rec TD       1        0        1
Yds/Rec      18.4     10       9.2
Rec/G        0        1        2
RecYds/G     13.8     13.3     26.1
PR           0        0        0
PR Yds       0        0        0
PR TD        0        0        0
Yds/PR       0        0        0
PR/G         0        0        0
KR           0        0        0
KR Yds       0        0        0
KR TD        0        0        0
Yds/KR       0        0        0
KR/G         0        0        0
Ret TD       0        0        0
Tot Yds      1174     526      1226
Tot TD       14       5        9
TotYds/G     97.8     87.6     102.2
Height: 6'-1''
Weight: 230

NFL data (red on the slide)
Attribute    Season 1         Season 2         Season 3
Season       2005             2006             2007
Team         Miami Dolphins   Miami Dolphins   Miami Dolphins
G            15               13               7
GS           14               12               7
Att          207              241              119
RushYds      907              1008             602
RushAvg      4.4              4.2              5.1
RushLng      65               47               60
RushTD       4                5                4
Rec          32               33               39
RecYds       232              276              389
RecAvg       7.3              8.4              10
RecLng       38               24               43
RecTD        1                0                1
FUM          4                4                0
Lost         4                2                0
Starting     TRUE             TRUE             TRUE
Weka Data Processing
Weka is a collection of machine learning algorithms
implemented in Java.
Accepts .csv files only in a particular format.
Preprocessing:
Apply filters to fix missing stats
Remove all NFL data except statistic being predicted
Turn the desired statistic into classes: if numeric, split
it into ranges; if nominal, use its values as the classes.
Specify attributes
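The step that turns a numeric statistic into classes can be sketched as below. In my project this is done with Weka filters; the Python version only illustrates the idea, and the cut points and class labels are hypothetical.

```python
from bisect import bisect_right


def to_class(value, cut_points, labels):
    """Map a numeric value to the label of the range it falls in.

    cut_points must be sorted, and len(labels) must be len(cut_points) + 1.
    A value equal to a cut point goes into the higher range.
    """
    return labels[bisect_right(cut_points, value)]


# Hypothetical cut points for total rushing yards in a season (assumption).
CUTS = [500, 1000]
LABELS = ["under_500", "500_to_999", "1000_plus"]

# Rushing yards for Ronnie Brown's first three NFL seasons, from the example row.
classes = [to_class(yards, CUTS, LABELS) for yards in (907, 1008, 602)]
print(classes)
```

Where the cut points go decides how many instances land in each class, so it also affects how hard the classification problem is.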
Building the Tree
Tree is constructed from specified attributes.
Weka converts tree to classification rules.
Accuracy is measured using cross validation.
Cross validation: Break the training data into a
specified number of sets, use each set once as the test
data, while the rest is used as training data.
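A minimal sketch of the cross-validation procedure described above. Weka does this internally; the function names here are my own, and the learner is a stand-in that always predicts the most common class in its training data. The labels are invented for illustration.

```python
from collections import Counter


def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) so each instance is tested exactly once."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(instances, labels, k, train_fn, predict_fn):
    """Fraction of instances classified correctly when each fold is held out once."""
    correct = 0
    for train, test in k_fold_splits(len(instances), k):
        model = train_fn([instances[i] for i in train],
                         [labels[i] for i in train])
        correct += sum(predict_fn(model, instances[i]) == labels[i] for i in test)
    return correct / len(instances)


def train_majority(rows, row_labels):
    """Stand-in learner: remember the most common class in the training data."""
    return Counter(row_labels).most_common(1)[0][0]


def predict_majority(model, row):
    """Predict the remembered class for every instance."""
    return model


# Hypothetical data: 10 instances, 7 labelled False and 3 labelled True.
instances = list(range(10))
labels = [False] * 7 + [True] * 3

accuracy = cross_validate(instances, labels, 5, train_majority, predict_majority)
print(accuracy)
```

Any learner can be plugged in through `train_fn` and `predict_fn`, for example a decision tree builder and its classification function.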
Initial Results
Initial runs using all attributes failed: they created a
1 layer tree that maps every instance to false for the
predicted statistic.
The accuracy varies greatly with slight changes to
attributes used.
Tree size seems to increase as the number of attributes
used decreases.
Analysis
The initial 1 layer tree that was built gave an accuracy
of 68%.
Since this tree predicts false for every instance, 68% is
a baseline rather than a useful result, so I should be
able to get better accuracy than this.
Attribute selection needs to improve.
Next
Improve attribute selection to optimize accuracy.
(If time) Implement other algorithms to compare
accuracy.
Questions?