Baseball Myths: - Dartmouth College

Baseball Findings
The statistics behind the game
Harlan Thompson
Sungjin Cho
Ryan Fagan
An Introduction
• Throughout its long history, baseball has been the subject
of many statistical studies. It lends itself well to statistics
because very careful records are kept of everything that
happens in every game.
• The topics that have been studied range from the effect of
interleague play on team standings to the role of chance in
streaks and slumps.
• Other topics of study include records and predicting the
outcomes of games.
• We thought that looking at home runs and salary would be
interesting because the great number of home runs hit and
the inflation of salaries are both controversial topics.
Home Runs Per Year
- How has the total number of home runs in major league baseball changed
from year to year?

year   HR/game      year   HR/game
1975   1.3879       1987   2.1216
1976   1.1497       1988   1.501
1977   1.7303       1989   1.463
1978   1.4036       1990   1.575
1979   1.6301       1991   1.6064
1980   1.4658       1992   1.4639
1981   1.2685       1993   1.7778
1982   1.6044       1994   1.6399
1983   1.5674       1995   1.7994
1984   1.547        1996   2.0688
1985   1.7104       1997   2.0459
1986   1.8105       1998   2.084
                    1999   2.2749
Test #1
• We ran a regression with the year as the
independent variable and the number of
home runs per game as the dependent variable
to find out the rate at which home runs per
game in the league are increasing.
Scatterplot
Results
      Source |       SS       df       MS              Number of obs =      25
-------------+------------------------------           F(  1,    23) =   21.69
       Model |  .911664082     1  .911664082           Prob > F      =  0.0001
    Residual |  .966758514    23  .042032979           R-squared     =  0.4853
-------------+------------------------------           Adj R-squared =  0.4630
       Total |   1.8784226    24  .078267608           Root MSE      =  .20502

------------------------------------------------------------------------------
          hr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .0264817    .0056862    4.66    0.000    .0147189    .0382445
       _cons |   1.350108     .079607   16.96    0.000    1.185428    1.514787
------------------------------------------------------------------------------
Interpretation
• The 95% confidence interval for the coefficient of year (.0147 to
.0382) is entirely positive, so home runs per game show a
statistically significant upward trend over the period.
• The R2 value of .4853 means that year explains just under half of
the variation in home runs per game, a moderate relationship. Many
other factors can affect the number of home runs hit -- weather,
injuries to certain players, etc.
• The coefficient of year is .0264817, so each year about .02648
more home runs are hit in each game. Over a 162-game schedule,
this is over 4 more home runs per year.
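
The fit above can be reproduced with a few lines of Python. This is a minimal sketch, not the Stata session used for the slides. It assumes year was coded as years since 1975 (0 to 24), which is the coding that reproduces the reported intercept, and `ols_fit` is a helper name introduced here for illustration. The 162-game figure assumes a standard regular-season schedule.

```python
# HR/game for 1975-1999, copied from the "Home Runs Per Year" table.
hr_per_game = [
    1.3879, 1.1497, 1.7303, 1.4036, 1.6301, 1.4658, 1.2685, 1.6044,
    1.5674, 1.547, 1.7104, 1.8105, 2.1216, 1.501, 1.463, 1.575,
    1.6064, 1.4639, 1.7778, 1.6399, 1.7994, 2.0688, 2.0459, 2.084,
    2.2749,
]


def ols_fit(x, y):
    """Least-squares slope and intercept for a single predictor (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumption: 1975 is coded as 0, 1976 as 1, and so on.
years_since_1975 = list(range(len(hr_per_game)))
slope, intercept = ols_fit(years_since_1975, hr_per_game)

# Extra home runs over one 162-game schedule implied by one more year of trend.
extra_hr_per_162_games = slope * 162

print(f"slope = {slope:.7f}, intercept = {intercept:.6f}")
print(f"extra HR per 162 games = {extra_hr_per_162_games:.2f}")
```

The slope and intercept should agree with the `year` and `_cons` rows of the regression output to the precision shown there.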
Test #2
• We split up the home run data into 2 separate
groups: 1976-1987 and 1988-1999.
• Then we ran a hypothesis test on the two groups to
find out if their variances are equal to determine
whether or not we could use a paired t test on the
data.
• We used the following hypotheses:
H0 : var(HR(‘76-‘87)) = var(HR(‘88-‘99))
HA : var(HR(‘76-‘87)) ≠ var(HR(‘88-‘99))
Results
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hr1 |      12    1.584108     .073678     .255228    1.421944    1.746272
     hr2 |      12       1.775    .0807902    .2798653    1.597182    1.952818
---------+--------------------------------------------------------------------
   Comb. |      24    1.679554    .0570527    .2794998    1.561532    1.797577
------------------------------------------------------------------------------
Ho: sd(hr1) = sd(hr2)
F(11,11) observed   = F_obs           = 0.832
F(11,11) lower tail = F_L = F_obs     = 0.832
F(11,11) upper tail = F_U = 1/F_obs   = 1.202
Critical values at .05 significance level: (.288, 3.47)
Because the F statistic lies between these critical values, we cannot reject
the null hypothesis.
Interpretation
• The variance in home runs per game did not
differ significantly between 1976-1987 and 1988-1999.
• Therefore we can use these two sets of data
in a paired t test to determine whether or not
the number of home runs hit has increased.
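
The variance-ratio statistic can be recomputed from the yearly table as a check. The sketch below is illustrative only: it takes the critical values (.288, 3.47) from the slide rather than deriving them, and the variable names are introduced here.

```python
from statistics import variance

# HR/game by year, copied from the "Home Runs Per Year" table.
hr_1976_1987 = [1.1497, 1.7303, 1.4036, 1.6301, 1.4658, 1.2685,
                1.6044, 1.5674, 1.547, 1.7104, 1.8105, 2.1216]
hr_1988_1999 = [1.501, 1.463, 1.575, 1.6064, 1.4639, 1.7778,
                1.6399, 1.7994, 2.0688, 2.0459, 2.084, 2.2749]

# Ratio of the two sample variances (each with n - 1 = 11 degrees of freedom).
f_obs = variance(hr_1976_1987) / variance(hr_1988_1999)

# Critical values at the .05 level as quoted on the slide, not recomputed here.
lower_crit, upper_crit = 0.288, 3.47
reject_null = f_obs < lower_crit or f_obs > upper_crit

print(f"F_obs = {f_obs:.3f}, reject H0: {reject_null}")
```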
Test #3
• Because we found that the two groups did not
have an appreciable difference in variance, we can
use a paired t test to determine whether or not the
number of home runs hit per game has risen from
the period 1976-1987 to the period 1988-1999.
• So we ran a hypothesis test on the two groups with
the following hypotheses:
H0 : HR(‘76-‘87) = HR(‘88-‘99)
HA : HR(‘76-‘87) ≠ HR(‘88-‘99)
Results
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     hr1 |      12    1.584108     .073678     .255228    1.421944    1.746272
     hr2 |      12       1.775    .0807902    .2798653    1.597182    1.952818
---------+--------------------------------------------------------------------
    diff |      12   -.1908917    .0665798     .230639   -.3374327   -.0443506
------------------------------------------------------------------------------
Ho: mean(hr1 - hr2) = mean(diff) = 0

 Ha: mean(diff) < 0       Ha: mean(diff) ~= 0       Ha: mean(diff) > 0
     t = -2.8671              t = -2.8671               t = -2.8671
 P < t =  0.0077        P > |t| =  0.0153           P > t =  0.9923
Interpretation
• The mean for the years from 1976 to 1987 was 1.584108
HR/game vs. 1.775 HR/game from 1988 to 1999.
• We can reject our null hypothesis because we found
t = -2.8671, which is beyond the two-sided critical value
of -2.201 for 11 degrees of freedom at the .05 level.
• The two-sided p-value is only .0153, so a difference this
large would be unlikely if the null hypothesis were true.
• Therefore, the mean number of home runs per game from
1988 to 1999 was significantly greater than the mean
number from ‘76 to ‘87.
• So, the number of home runs per game does seem to be
increasing over time.
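
The paired t statistic can be recomputed from the yearly table. This sketch assumes the pairing is by position within each period (1976 with 1988, 1977 with 1989, and so on), which is the pairing that reproduces the reported mean difference. The critical value 2.201 is taken from a standard t table for 11 degrees of freedom rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# HR/game by year, copied from the "Home Runs Per Year" table.
hr_1976_1987 = [1.1497, 1.7303, 1.4036, 1.6301, 1.4658, 1.2685,
                1.6044, 1.5674, 1.547, 1.7104, 1.8105, 2.1216]
hr_1988_1999 = [1.501, 1.463, 1.575, 1.6064, 1.4639, 1.7778,
                1.6399, 1.7994, 2.0688, 2.0459, 2.084, 2.2749]

# Assumption: the i-th year of the first period is paired with the i-th year
# of the second period.
diffs = [early - late for early, late in zip(hr_1976_1987, hr_1988_1999)]

mean_diff = mean(diffs)
std_err = stdev(diffs) / sqrt(len(diffs))
t_stat = mean_diff / std_err

# Two-sided 5% critical value for 11 degrees of freedom (from a t table).
critical_t = 2.201
reject_null = abs(t_stat) > critical_t

print(f"mean diff = {mean_diff:.7f}, t = {t_stat:.4f}, reject H0: {reject_null}")
```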
Home Runs by Position
First we looked at last year’s home runs by position for each team.
The following is a sample of the data we accumulated...
Team       SS HR   1B HR   2B HR   3B HR   C HR   LF HR   CF HR   RF HR   TOT HR
Anaheim        6      36       9      47     14      35      25      34      206
NY Mets        4      22      25      24     13      15      17      18      138
San Fran      20      19      33      10     14      49      12      24      181

Next we calculated the total number of home runs and at bats as well as the average
number of home runs per at bat from each position for the whole league
(in order of performance)...

Position        HR/AB        HRs     ABs
First Base      0.051925     752     14737
Left Field      0.047185     629     13098
Right Field     0.0462273    667     14314
Center Field    0.0391384    627     15623
Third Base      0.038288     523     13154
Catcher         0.0348063    381     10681
Shortstop       0.0243771    354     14050
Second Base     0.0239095    300     14535
Do some positions hit significantly more than the average?
• The league average of home runs per at bat is .0384.
• For each position, we used binomial hypothesis tests to test
whether or not the number of home runs per at bat from
that position differs significantly from the mean.
• For each position,
H0 : HR/AB = .0384
HA : HR/AB ≠ .0384
(Reject if |z| > 1.96)
Results
SIGNIFICANTLY BETTER (reject null)
• First Base: z = 7.978
• Left Field: z = 5.731
• Right Field: z = 5.104
ABOUT AVERAGE (fail to reject null)
• Center Field: z = 1.127
• Third Base: z = 0.812
• Catcher: z = -1.468
BELOW AVERAGE (reject null)
• Shortstop: z = -8.145
• Second Base: z = -11.143
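
The z values can be recomputed from the league totals. This sketch assumes the normal approximation to the binomial with the null rate .0384, applied to each position's total HRs and ABs; that choice reproduces the reported values. `z_score` and the dictionary names are introduced here for illustration.

```python
from math import sqrt

LEAGUE_RATE = 0.0384  # league-wide home runs per at bat, from the slides

# (HRs, ABs) by position, copied from the league totals table.
totals = {
    "First Base": (752, 14737),
    "Left Field": (629, 13098),
    "Right Field": (667, 14314),
    "Center Field": (627, 15623),
    "Third Base": (523, 13154),
    "Catcher": (381, 10681),
    "Shortstop": (354, 14050),
    "Second Base": (300, 14535),
}


def z_score(home_runs, at_bats, p0=LEAGUE_RATE):
    """One-sample proportion z statistic under H0: rate == p0 (hypothetical helper)."""
    p_hat = home_runs / at_bats
    std_err = sqrt(p0 * (1 - p0) / at_bats)
    return (p_hat - p0) / std_err


z_scores = {pos: z_score(hr, ab) for pos, (hr, ab) in totals.items()}

for pos, z in z_scores.items():
    verdict = "reject H0" if abs(z) > 1.96 else "fail to reject H0"
    print(f"{pos:<13} z = {z:8.3f}  {verdict}")
```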
Interpretation
• So, we found that first basemen, left fielders and right
fielders are significantly above the mean in home run hitting.
• Shortstops and second basemen are significantly below the mean
in home run hitting.
• Center fielders, third basemen and catchers are about average.
• This makes sense - the players at positions that require the most
mobility (shortstop, second base) would likely not be as
powerful as those who play positions that require less speed and
agility.
• It is interesting that center fielders, unlike the other
outfielders, are only about average - they do have to have a
lot more flexibility and speed.
Does salary affect performance?
• We looked at team salary vs. number of wins to see if the amount of
money paid to the players has a significant effect on a team’s
performance. Below is some of the data we used.
2000
Team                     Payroll ($)      Wins
New York Yankees         $114,336,616     87
Los Angeles Dodgers      $105,040,202     86
New York Mets            $99,793,463      94
Boston Red Sox           $97,022,789      85
Atlanta Braves           $94,537,875      95
Cleveland Indians        $90,488,555      90
Arizona Diamondbacks     $87,029,013      85
St. Louis Cardinals      $80,749,563      95
Baltimore Orioles        $80,466,320      74
Texas Rangers            $72,683,709      71
Seattle Mariners         $69,861,939      91
Detroit Tigers           $68,586,561      79
Toronto Blue Jays        $66,814,275      83
Chicago Cubs             $65,297,578      65
Tampa Bay Devil Rays     $65,161,683      69
Colorado Rockies         $64,767,786      83
San Diego Padres         $64,144,989      76
San Francisco Giants     $59,566,105      97
Anaheim Angels           $59,198,764      82
Houston Astros           $58,294,429      72
Philadelphia Phillies    $53,894,196      65
Cincinnati Reds          $53,894,196      85
Oakland Athletics        $42,988,297      91
Chicago White Sox        $42,332,755      95
Milwaukee Brewers        $41,478,423      73
Montreal Expos           $39,477,830      67
Pittsburgh Pirates       $36,273,762      69
Kansas City Royals       $31,807,466      77
Florida Marlins          $30,941,620      79
Minnesota Twins          $23,499,966      69
[Scatterplot: Wins vs. Payroll for 2000 (payroll 23.5 to 114.3, wins 65 to 97)]
[Scatterplot: Wins vs. Payroll for 1999]
[Scatterplot: Wins vs. Payroll for 1998]
Results for 2000
Regression Results - 2000
  Source |       SS       df       MS                  Number of obs =      30
---------+------------------------------               F(  1,    28) =    6.79
   Model |  3171.9669     1   3171.9669                Prob > F      =  0.0145
Residual | 13079.9633    28  467.141547                R-squared     =  0.1952
---------+------------------------------               Adj R-squared =  0.1664
   Total | 16251.9302    29  560.411387                Root MSE      =  21.613

------------------------------------------------------------------------------
 payroll |      Coef.   Std. Err.        t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
    wins |   1.046749    .4017006    2.606     0.015      .2239026    1.869595
   _cons |  -19.44844    32.76286   -0.594     0.558     -86.56012    47.66324
------------------------------------------------------------------------------

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 payroll |      30    65.30333    23.67301      23.5      114.3
    wins |      30    80.96667    9.991318        65         97
Results for 1999
Regression Results - 1999
  Source |       SS       df       MS                  Number of obs =      30
---------+------------------------------               F(  1,    28) =   24.74
   Model | 6826.84626     1  6826.84626                Prob > F      =  0.0000
Residual | 7726.42119    28  275.943614                R-squared     =  0.4691
---------+------------------------------               Adj R-squared =  0.4501
   Total | 14553.2675    29  501.836809                Root MSE      =  16.612

------------------------------------------------------------------------------
 payroll |      Coef.   Std. Err.        t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
    wins |   1.225894    .2464638    4.974     0.000      .7210361    1.730753
   _cons |  -50.38485    20.16826   -2.498     0.019     -91.69766   -9.072042
------------------------------------------------------------------------------

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 payroll |      30       48.79    22.40171      14.7         92
    wins |      30        80.9    12.51578        63        103
Results for 1998
Regression Results - 1998
  Source |       SS       df       MS                  Number of obs =      30
---------+------------------------------               F(  1,    28) =   33.43
   Model | 4844.58182     1  4844.58182                Prob > F      =  0.0000
Residual | 4057.20604    28  144.900216                R-squared     =  0.5442
---------+------------------------------               Adj R-squared =  0.5279
   Total | 8901.78786    29  306.958202                Root MSE      =  12.037

------------------------------------------------------------------------------
 payroll |      Coef.   Std. Err.        t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
    wins |   .9553504    .1652225    5.782     0.000      .6169076    1.293793
   _cons |  -36.30338    13.56227   -2.677     0.012     -64.08443    -8.52233
------------------------------------------------------------------------------

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 payroll |      30       41.08    17.52022       8.3       71.9
    wins |      30          81    13.52902        54        114
Interpretation
• The R2 value for the year 2000 (.1952) reflects a
much weaker relationship between total payroll and
number of wins than in 1998 (.5442) and 1999 (.4691),
although the regression is statistically significant
in all three years (Prob > F = 0.0145 for 2000).
• Because the coefficient of the number of
wins is roughly 1 for all three years, with payroll
measured in millions of dollars, an additional win is
associated with about a million dollars more in payroll.
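
The R-squared and F values quoted for each year follow directly from the sums of squares in the regression tables. The sketch below only re-derives those summary numbers from the reported Model and Residual rows; it does not refit the regressions, and `summarize` is a name introduced here.

```python
# (Model SS, Residual SS, residual df) copied from each year's regression table.
anova = {
    2000: (3171.9669, 13079.9633, 28),
    1999: (6826.84626, 7726.42119, 28),
    1998: (4844.58182, 4057.20604, 28),
}


def summarize(model_ss, resid_ss, resid_df):
    """Return (R-squared, F) for a regression with a single predictor."""
    r_squared = model_ss / (model_ss + resid_ss)
    # With one predictor the model has 1 degree of freedom, so MS(model) = SS(model).
    f_stat = model_ss / (resid_ss / resid_df)
    return r_squared, f_stat


results = {year: summarize(*row) for year, row in anova.items()}

for year, (r_squared, f_stat) in sorted(results.items()):
    print(f"{year}: R-squared = {r_squared:.4f}, F(1, 28) = {f_stat:.2f}")
```

With a single predictor, F is the square of the slope's t statistic, which is why the F test and the test on the `wins` coefficient give the same p-value in each table.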
Salary and home run hitting
• Finally, we thought we’d combine these two studies of
salary and home run hitting and analyze how the
changes in average salary have resulted in changes
in the number of home runs hit per person. Exactly how
many more home runs are we getting per $1?
• We looked at data from 1969 to 2000.
• We found average salary but we could not find the average
number of home runs per player. However, we thought the
leader in home run percentage might give some kind of
portrayal of the number of home runs being hit.
Salary vs. Home Runs
Year    Salary (thousands)    HR Pct Leader
1969    24.9                  9.16
1970    29.3                  7.95
1971    31.5                  9.49
1972    34.1                  7.57
1973    36.6                  10.2
1974    40.8                  6.93
1975    44.7                  7.17
1976    51.5                  7.81
1977    76.1                  8.46
1978    99.9                  7.18
1979    113.6                 9.02
1980    143.8                 8.76
1981    185.7                 8.76
1982    241.5                 7.45
1983    289.2                 7.49
1984    329.4                 6.87
1985    371.6                 7.92
1986    412.5                 7.08
1987    412.5                 8.8
1988    438.7                 7.18
1989    497.3                 8.66
1990    597.5                 8.9
1991    851.5                 7.69
1992    1028.7                8.99
1993    1076.1                8.58
1994    1168.3                9.75
1995    1110.8                12.3
1996    1120                  12.29
1997    1336.6                9.29
1998    1398.8                13.75
1999    1611.2                12.48
2000    1895.6                10.21
Regression Results
      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   22.14
       Model |  40.3836608     1  40.3836608           Prob > F      =  0.0000
    Residual |  54.7139269    30  1.82379756           R-squared     =  0.4247
-------------+------------------------------           Adj R-squared =  0.4055
       Total |  95.0975877    31  3.06766412           Root MSE      =  1.3505

------------------------------------------------------------------------------
       hrpct |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sal |    .002094     .000445    4.71    0.000    .0011852    .0030029
       _cons |   7.760353    .3369654   23.03    0.000    7.072178    8.448528
------------------------------------------------------------------------------
Interpretation
• The coefficient of salary is .002094 and the entire confidence interval
for this value is positive. So it seems that an increase in salary may
produce an increase in home run hitting.
• For every additional hundred thousand dollars in average salary, the
home run percentage of the leading home run hitter is about 0.2
percentage points higher.
• We found an R2 value of .4247, which indicates a moderate fit. However,
from 1969 to 1976, salary stayed fairly flat (compared to the
inflation today), so this may have hurt our regression since home runs
were increasing at the time, although not as rapidly as recently.
• This suggests that home runs and salary may be increasing
independently through time. There may not be an actual relationship
between the two. Further study would be needed to determine if they
are related.
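
The salary coefficient and the "per hundred thousand dollars" figure can be checked against the table. This is a rough sketch rather than the original analysis: `ols_fit` is a helper introduced here, salary is in thousands of dollars as in the table, and the fit says nothing about whether salary causes the change.

```python
# Average salary (thousands of dollars) and HR percentage of the league
# leader, 1969-2000, copied from the "Salary vs. Home Runs" table.
salary = [24.9, 29.3, 31.5, 34.1, 36.6, 40.8, 44.7, 51.5,
          76.1, 99.9, 113.6, 143.8, 185.7, 241.5, 289.2, 329.4,
          371.6, 412.5, 412.5, 438.7, 497.3, 597.5, 851.5, 1028.7,
          1076.1, 1168.3, 1110.8, 1120, 1336.6, 1398.8, 1611.2, 1895.6]
hr_pct = [9.16, 7.95, 9.49, 7.57, 10.2, 6.93, 7.17, 7.81,
          8.46, 7.18, 9.02, 8.76, 8.76, 7.45, 7.49, 6.87,
          7.92, 7.08, 8.8, 7.18, 8.66, 8.9, 7.69, 8.99,
          8.58, 9.75, 12.3, 12.29, 9.29, 13.75, 12.48, 10.21]


def ols_fit(x, y):
    """Least-squares slope and intercept for a single predictor (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = ols_fit(salary, hr_pct)

# The slope is per $1,000 of salary, so multiply by 100 for a $100,000 change.
change_per_100k = slope * 100

print(f"slope = {slope:.6f} per $1,000, intercept = {intercept:.4f}")
print(f"change in leader's HR pct per $100,000 = {change_per_100k:.3f} points")
```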