presentation 2

Optimization of Batting Order
Frank R. Zheng
A Quick Introduction to Baseball
• Two teams alternate batting and fielding.
• The batting team tries to score runs.
• Runners must advance through first, second, and third base, in that order, to reach home.
• Runners are advanced by players getting hits, drawing walks, stealing bases, or errors by the opposing team’s defense.
• The team with the most runs at the end of the game wins.
Batting Order
• Before each game, the team’s coach must submit the team’s batting order.
• The batting order dictates the order in which players step up to the plate.
• Substitutions such as pinch hitters or pinch runners are allowed, but are relatively rare.
• The optimal batting order maximizes expected run production.
Batting Order Optimization as a Scheduling Problem
• Finding the optimal batting order for a team can be thought of as a single-machine scheduling problem.
• Each batter is modeled as a job, and the batting order is an ordered sequence of 9 such jobs.
• The objective is to maximize the run production of the lineup.
• This is a complicated function that requires simulation to evaluate.
Approach to Optimize Batting Order
• Each baseball team has a roster of ~15 batters, of which only 9 compose the batting order.
• Brute-forcing every possible lineup is impractical: there are 15!/6! ordered lineups to evaluate (over 1.8 billion).
• The solution is to combine a qualitative “conventional wisdom” approach with a data-driven quantitative methodology.
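The lineup count quoted above can be checked directly. This is a minimal sketch; the constant names are illustrative, and the roster size of 15 is the approximate figure from the slide rather than an exact value for any team.

```python
import math

ROSTER_SIZE = 15  # approximate roster size quoted on the slide
LINEUP_SIZE = 9   # batters in a batting order

# Ordered selections of 9 batters out of 15: 15! / (15 - 9)! = 15! / 6!
ordered_lineups = math.perm(ROSTER_SIZE, LINEUP_SIZE)

# For comparison: unordered choices of which 9 players start.
starting_groups = math.comb(ROSTER_SIZE, LINEUP_SIZE)

print(ordered_lineups)  # 1816214400
print(starting_groups)  # 5005
```

Each of the 5,005 starting groups can be arranged in 9! orders, which is where the 1.8 billion figure comes from.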
Batting Order Conventional Wisdom
• Over the many decades baseball has been played, coaches have dedicated much thought to finding the best lineup.
• Traditional lineups follow this general order:
  - 1-2 – batters who get on base a lot
  - 3-5 – batters who get a lot of extra-base hits
  - 6-8 – weak batters
  - 9 – pitcher, weak batter, or batter who gets on base a lot
• The key is to have players with a high realization value (lots of runs batted in) follow those with a high potential value (getting on base a lot).
  - i.e., get runners on base so your power hitters can drive them home
Underlying Causes of Run Production
• There is a limited set of events that have the potential to score runs.
• We refer to these as “Run-Producing Events” or RPEs.
• RPEs include:
  - Singles (1B)
  - Doubles (2B)
  - Triples (3B)
  - Home Runs (HR)
  - Bases on Balls/Hit by Pitch (BB+HBP)
  - Errors (ERR)
Batting Performance
• Does the model fully capture differences among player batting characteristics?
Event     Regression Value
OUT       -0.1040
1B         0.4659
BB+HBP     0.3255
2B         0.7613
3B         1.0456
HR         1.4031
ERR        0.4340
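One way to read the regression values is as per-event weights: a batter's expected run contribution per plate appearance is then the probability-weighted sum of these values. The sketch below assumes that interpretation. The function name and the example batter's outcome probabilities are hypothetical, not data from the presentation.

```python
# Regression run values per plate-appearance outcome (from the table above).
RUN_VALUE = {
    "OUT": -0.1040, "1B": 0.4659, "BB+HBP": 0.3255,
    "2B": 0.7613, "3B": 1.0456, "HR": 1.4031, "ERR": 0.4340,
}

def expected_run_value(outcome_probs):
    """Probability-weighted run value of one plate appearance.

    outcome_probs maps each outcome label to its probability; the
    probabilities must sum to 1.
    """
    if abs(sum(outcome_probs.values()) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * RUN_VALUE[event] for event, p in outcome_probs.items())

# Hypothetical batter (illustrative numbers only).
example_batter = {
    "OUT": 0.66, "1B": 0.15, "BB+HBP": 0.10,
    "2B": 0.05, "3B": 0.005, "HR": 0.03, "ERR": 0.005,
}
print(round(expected_run_value(example_batter), 4))
```

A single number like this ranks batters overall but says nothing about where in the order they are most useful, which is the question the next bullet raises.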
• How to distinguish between ‘table setters’ vs. ‘sluggers/cleanup hitters’?
Realization Value vs. Potential Value
• Realization Value is the expected number of runs each RPE actually scores.
• Potential Value is the effect each RPE has on the team’s chances to score additional runs in the same inning.
• Differentiating between these two metrics allows us to quantitatively determine which players create the potential for scoring runs and which ones are good at bringing those players to home plate.
Event     Realization Value   Potential Value   Total Value
OUT        0.0000             -0.1040           -0.1040
1B         0.2314              0.2345            0.4659
BB+HBP     0.0328              0.2927            0.3255
2B         0.5120              0.2493            0.7613
3B         0.7411              0.3045            1.0456
HR         1.7387             -0.3356            1.4031
ERR        0.1000              0.3340            0.4340
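The table decomposes each event's total value into a realization part and a potential part. The sketch below first checks that the two parts add up to the total for every event, then computes a per-player realization and potential value as probability-weighted averages. That averaging rule is an assumption about how player-level values could be derived; the presentation does not spell it out. The function name and example batter are hypothetical.

```python
REALIZATION = {"OUT": 0.0000, "1B": 0.2314, "BB+HBP": 0.0328,
               "2B": 0.5120, "3B": 0.7411, "HR": 1.7387, "ERR": 0.1000}
POTENTIAL = {"OUT": -0.1040, "1B": 0.2345, "BB+HBP": 0.2927,
             "2B": 0.2493, "3B": 0.3045, "HR": -0.3356, "ERR": 0.3340}
TOTAL = {"OUT": -0.1040, "1B": 0.4659, "BB+HBP": 0.3255,
         "2B": 0.7613, "3B": 1.0456, "HR": 1.4031, "ERR": 0.4340}

# Consistency check: Total Value = Realization Value + Potential Value.
for event in TOTAL:
    assert abs(REALIZATION[event] + POTENTIAL[event] - TOTAL[event]) < 1e-9

def player_values(outcome_probs):
    """Return (realization, potential) per plate appearance.

    Assumption: a player's values are the probability-weighted averages
    of the per-event values in the table.
    """
    rv = sum(p * REALIZATION[e] for e, p in outcome_probs.items())
    pv = sum(p * POTENTIAL[e] for e, p in outcome_probs.items())
    return rv, pv

# Hypothetical batter (illustrative numbers only).
example_batter = {"OUT": 0.66, "1B": 0.15, "BB+HBP": 0.10,
                  "2B": 0.05, "3B": 0.005, "HR": 0.03, "ERR": 0.005}
example_rv, example_pv = player_values(example_batter)
```

Note that a home run has the largest realization value but a negative potential value: it clears the bases, so it scores runs now at the expense of runners who might have scored later in the inning.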
Differentiating Players
• By comparing each individual’s realization value and potential value to the team’s overall averages, we can group players into one of four categories:
  - (R+, P+) Strong Hitters – players who bat in a lot of runs but also create the potential for more runs
  - (R+, P-) Run Producers – players who bat in a lot of runs
  - (R-, P+) Table Setters – players who create a lot of potential for more runs
  - (R-, P-) Weak Hitters – the team’s worst players
• This gives us the quantitative data we need to apply the conventional wisdom discussed earlier.
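The four-way grouping can be sketched as a comparison against team averages. Two details are assumptions here: the team average is taken as the unweighted mean over players (a plate-appearance-weighted mean would be an equally plausible reading), and a value exactly equal to the average counts as "+". The function name and the players' numbers are hypothetical.

```python
from statistics import mean

def classify(players):
    """Map each player name to its (R, P) category.

    players: dict of name -> (realization_value, potential_value).
    """
    avg_rv = mean(rv for rv, _ in players.values())
    avg_pv = mean(pv for _, pv in players.values())
    labels = {}
    for name, (rv, pv) in players.items():
        r = "R+" if rv >= avg_rv else "R-"
        p = "P+" if pv >= avg_pv else "P-"
        labels[name] = f"({r}, {p})"
    return labels

# Hypothetical players (illustrative numbers only).
demo_players = {
    "A": (0.14, 0.03),   # above average on both
    "B": (0.15, -0.01),  # drives in runs, creates little potential
    "C": (0.08, 0.04),   # creates potential, drives in few runs
    "D": (0.07, -0.02),  # below average on both
}
demo_labels = classify(demo_players)
print(demo_labels)
```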
Overview of Heuristic
• Now we have the tools we need to combine the holistic conventional wisdom with quantitative data.
• We adapted this heuristic from the work of Sokol.
• After determining which players fall into which set, we attempt to follow the conventional wisdom of placing batters with high realization values after a group of batters with high potential values.
  - We want to build up potential value and then release it with realization value.
• The optimal order of the four sets is:
  - (R-, P+) → (R+, P+) → (R+, P-) → (R-, P-)
Heuristic Steps
• Select the two batters with the highest P in the (R-, P+) set and assign them to the top two slots in the batting order, by order of increasing P.
• Place all batters in the (R+, P+) group in the next slots, ordered by decreasing P.
• Fill as many remaining slots as possible with batters from the (R+, P-) group, ordered by decreasing P.
• If there are any remaining slots, fill them with batters in the (R-, P-) group, ordered by increasing P.
• For each player left in the (R-, P+) group, replace a (R-, P-) player if possible, ordering the new (R-, P+) players by increasing P.
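The steps above can be sketched in code, taking the five bullets in order as steps 1 to 5. The slides leave some details open, so this sketch makes assumptions that are not from the presentation: when weak hitters compete for slots, those with the highest RV + PV are kept; leftover table setters are placed at the very end of the order, after any remaining weak hitters; and the lineup is truncated to 9 if the earlier groups overflow. The `Batter` class, the function name, and the demo roster are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Batter:
    name: str
    rv: float   # realization value
    pv: float   # potential value
    group: str  # "R-P+", "R+P+", "R+P-", or "R-P-"

def build_lineup(batters, size=9):
    groups = {g: [b for b in batters if b.group == g]
              for g in ("R-P+", "R+P+", "R+P-", "R-P-")}

    # Step 1: the two highest-P table setters lead off, in increasing P.
    setters = sorted(groups["R-P+"], key=lambda b: b.pv, reverse=True)
    lead, spare_setters = setters[:2], setters[2:]
    lineup = sorted(lead, key=lambda b: b.pv)

    # Step 2: strong hitters, in decreasing P.
    lineup += sorted(groups["R+P+"], key=lambda b: b.pv, reverse=True)

    # Step 3: run producers, in decreasing P, as many as fit.
    lineup += sorted(groups["R+P-"], key=lambda b: b.pv, reverse=True)
    lineup = lineup[:size]

    # Steps 4 and 5 combined: leftover table setters take tail slots in
    # place of weak hitters (assumption: highest-P setters go first, and
    # the weak hitters that remain are those with the highest RV + PV).
    open_slots = size - len(lineup)
    late_setters = spare_setters[:open_slots]
    n_weak = open_slots - len(late_setters)
    weak = sorted(groups["R-P-"], key=lambda b: b.rv + b.pv,
                  reverse=True)[:n_weak]
    lineup += sorted(weak, key=lambda b: b.pv)          # increasing P
    lineup += sorted(late_setters, key=lambda b: b.pv)  # increasing P
    return [b.name for b in lineup]

# Hypothetical 11-player roster (illustrative numbers only).
demo_roster = [
    Batter("T1", 0.08, 0.05, "R-P+"), Batter("T2", 0.08, 0.04, "R-P+"),
    Batter("T3", 0.08, 0.03, "R-P+"), Batter("S1", 0.15, 0.02, "R+P+"),
    Batter("P1", 0.14, -0.01, "R+P-"), Batter("P2", 0.14, -0.02, "R+P-"),
    Batter("P3", 0.13, -0.03, "R+P-"), Batter("P4", 0.13, -0.04, "R+P-"),
    Batter("W1", 0.09, -0.02, "R-P-"), Batter("W2", 0.07, -0.03, "R-P-"),
    Batter("W3", 0.06, -0.01, "R-P-"),
]
demo_lineup = build_lineup(demo_roster)
print(demo_lineup)
```

With this roster the result follows the set order (R-, P+) → (R+, P+) → (R+, P-) → (R-, P-), with the spare table setter in the last slot.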
Application to 2011 New York Yankees
• To see the effects of our heuristic, we applied it to the 2011 New York Yankees.
• First, we placed each player into the appropriate category.
[Figure: NYY 2011 - Realization Value vs. Potential Value (Difference from Team Average). Axes: Realization Value (RV) and Potential Value (PV). Quadrants: (R-, P+) Table Setters; (R+, P+) Strong Hitters; (R-, P-) Weak Hitters; (R+, P-) Run Producers. Players plotted: Brett Gardner, Derek Jeter, Nick Swisher, Eric Chavez, Jesus Montero, Alex Rodriguez, Eduardo Nunez, Francisco Cervelli, Andruw Jones, Chris Dickerson, Russell Martin, Jorge Posada, Curtis Granderson, Mark Teixeira, Robinson Cano.]
Simulation
• To determine the value of our objective function (the expected number of runs scored per game), we need to simulate a game of baseball using the designated lineup.
• Our simulation follows the structure of a normal game of baseball.
• At each point in time, the next batter steps up to the plate and either generates an RPE or gets out, depending on that player’s distribution.
• RPEs advance runners according to the rules of baseball or by probabilistic outcomes determined using data from the 2011 season.
• The number of outs and runs is recorded for each of 16,200 games.
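A simulation of this kind can be sketched as a small Monte Carlo loop. This is a simplified illustration, not the presenters' simulator: runner advancement here is deterministic (every runner moves as many bases as the batter, errors are treated like singles, and walks move only forced runners), whereas the presentation uses probabilistic advancement fitted to 2011 data. Steals, double plays, and skipped half-innings are ignored. All function names and the example batter are hypothetical.

```python
import random

EVENTS = ["OUT", "1B", "BB+HBP", "2B", "3B", "HR", "ERR"]
BASES_GAINED = {"1B": 1, "ERR": 1, "2B": 2, "3B": 3, "HR": 4}

def advance(bases, event):
    """Apply a run-producing event.

    bases is a tuple of three booleans (first, second, third).
    Returns (new_bases, runs_scored).
    """
    first, second, third = bases
    if event == "BB+HBP":
        # Only forced runners move on a walk or hit-by-pitch.
        if first and second and third:
            return (True, True, True), 1
        if first and second:
            return (True, True, True), 0
        if first:
            return (True, True, third), 0
        return (True, second, third), 0
    n = BASES_GAINED[event]
    # The batter starts at position 0; occupied bases are 1..3.
    runners = [0] + [i for i, occupied in enumerate(bases, start=1) if occupied]
    runs = sum(1 for r in runners if r + n >= 4)
    new_bases = tuple(any(r + n == b for r in runners) for b in (1, 2, 3))
    return new_bases, runs

def simulate_game(lineup, rng, innings=9):
    """lineup: list of dicts mapping event label -> probability."""
    runs, at_bat = 0, 0
    for _ in range(innings):
        outs, bases = 0, (False, False, False)
        while outs < 3:
            probs = lineup[at_bat % len(lineup)]
            weights = [probs.get(e, 0.0) for e in EVENTS]
            event = rng.choices(EVENTS, weights=weights)[0]
            at_bat += 1  # the order carries over between innings
            if event == "OUT":
                outs += 1
            else:
                bases, scored = advance(bases, event)
                runs += scored
    return runs

def expected_runs(lineup, games=16200, seed=0):
    """Average runs per game over many simulated games."""
    rng = random.Random(seed)
    return sum(simulate_game(lineup, rng) for _ in range(games)) / games

# Hypothetical batter repeated nine times (illustrative numbers only).
example_batter = {"OUT": 0.66, "1B": 0.15, "BB+HBP": 0.10,
                  "2B": 0.05, "3B": 0.005, "HR": 0.03, "ERR": 0.005}
print(expected_runs([example_batter] * 9, games=1000))
```

Comparing two batting orders then amounts to calling `expected_runs` on each with the same seed and number of games.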
Results of Analysis
Standard Lineup
Batting Order   Player              Set
1               Derek Jeter         R-, P+
2               Curtis Granderson   R+, P-
3               Robinson Cano       R+, P-
4               Alex Rodriguez      R+, P+
5               Mark Teixeira       R+, P-
6               Nick Swisher        R-, P+
7               Jorge Posada        R-, P-
8               Russell Martin      R-, P-
9               Brett Gardner       R-, P+
• This lineup generated an average of 5.68 runs per game, and is expected to have a 61.3% chance of winning a 5-game series against the Detroit Tigers.

Heuristic Lineup
Batting Order   Player              Set
1               Brett Gardner       R-, P+
2               Derek Jeter         R-, P+
3               Alex Rodriguez      R+, P+
4               Robinson Cano       R+, P-
5               Curtis Granderson   R+, P-
6               Andruw Jones        R+, P-
7               Mark Teixeira       R+, P-
8               Russell Martin      R-, P-
9               Nick Swisher        R-, P+
• This lineup generated an average of 5.84 runs per game, with a 64.7% chance of winning a 5-game series against the Detroit Tigers.
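The slides do not say how the series probabilities were obtained. One common way to turn a single-game win probability into a best-of-five probability, assuming independent games with a fixed per-game win probability p, is sketched below. The function name is hypothetical, and no claim is made that this reproduces the 61.3% or 64.7% figures.

```python
from math import comb

def series_win_prob(p, games=5):
    """Probability of winning a best-of-`games` series.

    Assumes each game is won independently with probability p.
    """
    needed = games // 2 + 1  # wins required to take the series
    # Sum over k = number of losses suffered before the clinching win.
    return sum(comb(needed - 1 + k, k) * p**needed * (1 - p)**k
               for k in range(needed))

print(series_win_prob(0.5))  # 0.5
```

Under this model a small per-game edge is amplified over a series, which is one reason a modest gain in expected runs can still matter.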
Conclusions and Other Applications
• The heuristic produced a lineup with only about a 3% increase in expected runs (5.68 to 5.84 runs per game).
• Statistical analysis is well established in baseball, so the NYY may have already studied this problem in great detail.
• Even though the gains in expected run production were minimal, there are other applications for our methodology:
  - Potential trades or acquisitions of new players can be evaluated by their effect on the team’s expected run production.
  - A game-theoretic approach can be applied to maximize the expected win rate by adjusting the distribution of the team’s run production to maximize the chance of winning a game against a specific team.