Class 9: Thurs., Oct. 7 - University of Pennsylvania

Class 9: Thurs., Oct. 7
• Inference in regression (Ch 10.1-10.2)
– Confidence intervals for slope
– Hypothesis test for slope
– Confidence intervals for mean response
– Prediction intervals
• Confidence intervals and the polls
• I will e-mail HW 5 to you by tomorrow. It
will be due Tuesday, Oct. 19th.
CPS Wage-Education Data for
March 1988
Bivariate Fit of wage By educ
[Scatterplot of wage (0 to 18,000) versus educ (0 to 18) with the least squares line.]

Linear Fit
wage = -19.06983 + 50.414381 educ
Inference Based on Sample
• The whole Current Population Survey (25,631 men ages
18-70) is a random sample from the U.S. population
(roughly 75 million men ages 18-70).
• In most regression analyses, the data we have is a
sample from some larger (hypothetical) population. We
are interested in the true regression line for the larger
population.
• Inference Questions:
– How accurate is the least squares estimate of the
slope for the true slope in the larger population?
– What is a plausible range of values for the true slope
in the larger population based on the sample?
– Is it plausible that the slope equals a particular value
(e.g., 0) based on the sample?
• Regression Applet:
Link on web site under Fun Links. Link entitled Simple
Linear Regression.
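The regression applet illustrates this sampling variability interactively. As a rough stand-in, here is a minimal Python simulation sketch (not part of the course materials): it builds a synthetic population with a known "true" slope and repeatedly fits the least squares line to samples of size 25. The population parameters and the use of numpy are my assumptions, chosen only to loosely mirror the wage-education numbers.

```python
# Simulation sketch: how much does the least squares slope vary across random samples?
import numpy as np

rng = np.random.default_rng(0)
true_intercept, true_slope, sigma = -19.0, 50.4, 420.0   # hypothetical "true" values
n_population, n_sample, n_reps = 25631, 25, 1000

educ = rng.integers(0, 19, size=n_population)            # hypothetical education levels 0-18
wage = true_intercept + true_slope * educ + rng.normal(0, sigma, n_population)

slopes = []
for _ in range(n_reps):
    idx = rng.choice(n_population, size=n_sample, replace=False)
    slope, intercept = np.polyfit(educ[idx], wage[idx], 1)   # least squares fit on the sample
    slopes.append(slope)

print("true slope:", true_slope)
print("mean of sample slopes:", np.mean(slopes))
print("std dev of sample slopes (sampling variability):", np.std(slopes))
```

The average of the sample slopes lands near the true slope, but any single sample of 25 can miss it by a sizable amount, which is exactly what the standard error of the slope measures.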
Full Data Set
Bivariate Fit of wage By educ
[Scatterplot of wage versus educ for all 25,631 men, with the least squares line.]

Linear Fit
wage = -19.06983 + 50.414381 educ

Summary of Fit
RSquare                        0.108609
RSquare Adj                    0.108575
Root Mean Square Error         419.4715
Mean of Response               640.1625
Observations (or Sum Wgts)     25631

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -19.06983   12.08449    -1.58     0.1146
educ        50.414381   0.902171    55.88     0.0000
Random Sample of Size 25
Bivariate Fit of wage By educ
[Scatterplot of wage versus educ for the sample of 25 men, with the least squares line.]

Linear Fit
wage = 170.98998 + 35.345874 educ

Summary of Fit
RSquare                        0.150327
RSquare Adj                    0.113384
Root Mean Square Error         308.4266
Mean of Response               592.3128
Observations (or Sum Wgts)     25

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   170.98998   217.7806    0.79      0.4404
educ        35.345874   17.52198    2.02      0.0555
Confidence Intervals
• Confidence interval: A range of values that are plausible
for a parameter given the data.
• 95% confidence interval: An interval that 95% of the time
will contain the true parameter.
• Approximate 95% confidence interval: Estimate of parameter ± 2*SE(Estimate of parameter).
• Approximate 95% confidence interval for slope: β̂1 ± 2*SE(β̂1)
• For the wage-education data, β̂1 ≈ 50.41 and SE(β̂1) ≈ 0.90, so the approximate 95% CI is 50.41 ± 2*0.90 = (48.61, 52.21) (a quick numerical check follows this slide).
• Interpretation of 95% confidence interval: It is most
plausible that the true slope is in the 95% confidence
interval. It is possible that the true slope is outside the
95% confidence interval but unlikely; the confidence
interval will fail to contain the true slope only 5% of the
time in repeated samples.
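As a quick check of the arithmetic above, a minimal Python sketch (my addition, using only the estimate and standard error reported in the JMP output):

```python
# Approximate 95% CI for the slope: estimate +/- 2 * SE (numbers from the JMP output above)
beta1_hat = 50.414381
se_beta1 = 0.902171

lower = beta1_hat - 2 * se_beta1
upper = beta1_hat + 2 * se_beta1
print(f"approximate 95% CI for slope: ({lower:.2f}, {upper:.2f})")
# roughly (48.61, 52.22); the slide rounds the inputs and reports (48.61, 52.21)
```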
Conf. Intervals for Slope in JMP
• After Fit Line, right click in the parameter
estimates table, go to Columns and click
on Lower 95% and Upper 95%.
• The exact 95% confidence interval is close to, but not equal to, β̂1 ± 2*SE(β̂1) (a numerical check follows the output below).
Bivariate Fit of wage By educ

Summary of Fit
RSquare                        0.108609
RSquare Adj                    0.108575
Root Mean Square Error         419.4715
Mean of Response               640.1625
Observations (or Sum Wgts)     25631

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept   -19.06983   12.08449    -1.58     0.1146     -42.75612   4.6164574
educ        50.414381   0.902171    55.88     0.0000     48.646075   52.182687
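A minimal Python sketch (my own check, not part of the course materials) that reproduces the exact interval for the educ slope from the estimate, standard error, and sample size above, using the t quantile with n − 2 degrees of freedom:

```python
# Exact 95% CI for the slope: estimate +/- t(0.975, n-2) * SE
# (estimate, SE, and n taken from the JMP output above)
from scipy import stats

beta1_hat, se_beta1, n = 50.414381, 0.902171, 25631
t_crit = stats.t.ppf(0.975, df=n - 2)   # about 1.96 for a sample this large

lower = beta1_hat - t_crit * se_beta1
upper = beta1_hat + t_crit * se_beta1
print(f"exact 95% CI: ({lower:.6f}, {upper:.6f})")  # close to JMP's (48.646075, 52.182687)
```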
Hypothesis Testing
• Simple Linear Regression Model:
E(Y | X )  0  1 X
• Is the slope equal to 0?
• Null hypothesis H 0 : 1  0
• Alternative (research)
hypothesis H a : 1  0
ˆ
 0
• Test statistic: t  SE1 (ˆ )
1
• Rough rule: Reject H 0 if |t|>=2. Accept H 0 if |t|<2.
• P-values : Find the p-value for the test. The p-value is a
measure of the credibility of the null hypothesis. Small
p-values give you evidence against the null hypothesis.
Large p-values suggest there is no evidence in the data
to reject the null hypothesis.
• The generally followed rule is to reject H 0 if the p-value
is less than 0.05 and accept H 0 if the p-value is greater
than 0.05.
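A minimal Python sketch (my addition, using the full-sample wage-educ numbers above) showing how the t ratio and two-sided p-value for H0: β1 = 0 are computed:

```python
# t ratio and two-sided p-value for H0: beta1 = 0
# (estimate, SE, and n from the full-sample wage-educ output)
from scipy import stats

beta1_hat, se_beta1, n = 50.414381, 0.902171, 25631
t_ratio = (beta1_hat - 0) / se_beta1                 # about 55.88
p_value = 2 * stats.t.sf(abs(t_ratio), df=n - 2)     # two-sided p-value, essentially 0
print(t_ratio, p_value)
```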
• Hypothesis Testing in JMP: Under Parameter Estimates, the t Ratio is the test statistic t = (β̂1 − 0) / SE(β̂1). The Prob>|t| is the p-value for the test of H0: β1 = 0 versus Ha: β1 ≠ 0.
• The test statistic is a standard error counter. It is the relationship between β̂1 and SE(β̂1) that matters, not the size of β̂1 itself.
• Testing H0: β1 = β1* vs. Ha: β1 ≠ β1*: Use the test statistic t = (β̂1 − β1*) / SE(β̂1). Reject the null hypothesis if |t| >= 2 (see the sketch below).
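For illustration, a minimal sketch (the null value β1* = 48 is my hypothetical choice, not from the lecture) using the full-sample wage-educ estimates:

```python
# Test H0: beta1 = beta1_star vs. Ha: beta1 != beta1_star using the rough |t| >= 2 rule
beta1_hat, se_beta1 = 50.414381, 0.902171   # from the full-sample output
beta1_star = 48.0                            # hypothetical null value for illustration

t = (beta1_hat - beta1_star) / se_beta1      # about 2.68
print("reject H0" if abs(t) >= 2 else "accept H0")
```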
Bivariate Fit of wage By educ
[Scatterplot of wage (0 to 18,000) versus educ (0 to 18) for the full sample, with the least squares line.]

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -19.06983   12.08449    -1.58     0.1146
educ        50.414381   0.902171    55.88     0.0000
Bivariate Fit of wage By educ
[Scatterplot of wage (0 to 2,500) versus educ for a random sample, with the least squares line.]

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -150.8179   354.4876    -0.43     0.6745
educ        56.926375   25.46911    2.24      0.0354
Logic of Hypothesis Testing:
Hypoth. Testing in the Courtroom
• Null hypothesis: The defendant is innocent
• Alternative hypothesis: The defendant is guilty
• The goal of the procedure is to determine
whether there is enough evidence to conclude
that the alternative hypothesis is true. The
burden of proof is on the alternative hypothesis.
• A small p-value indicates that there is strong
evidence against the null hypothesis. A p-value
> 0.05 does not show that the null hypothesis is
true, only that there is not strong evidence
against the null hypothesis.
Car Price Example
• A used-car dealer wants to understand how
odometer reading affects the selling price of
used cars.
• The dealer randomly selects 100 three-year-old
Ford Tauruses that were sold at auction during
the past month. Each car was in top condition
and equipped with automatic transmission,
AM/FM cassette tape player and air
conditioning.
• carprices.JMP contains the price and number of
miles on the odometer of each car.
Bivariate Fit of Price By Odometer
[Scatterplot of Price (13,500 to 16,000) versus Odometer (15,000 to 45,000), with the least squares line.]

Linear Fit
Price = 17066.766 - 0.0623155 Odometer

Summary of Fit
RSquare                        0.650132
RSquare Adj                    0.646562
Root Mean Square Error         303.1375
Mean of Response               14822.82
Observations (or Sum Wgts)     100

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept   17066.766   169.0246    100.97    <.0001     16731.342   17402.19
Odometer    -0.062315   0.004618    -13.49    <.0001     -0.071479   -0.053152
• The used-car dealer has an opportunity to bid on
a lot of cars offered by a rental company. The
rental company has 250 Ford Tauruses, all
equipped with automatic transmission, air
conditioning and AM/FM cassette tape players.
All of the cars in this lot have about 40,000 miles
on the odometer. The dealer would like an
estimate of the average selling price of all cars
of this type with 40,000 miles on the odometer,
i.e., E(Y|X=40,000).
• The least squares estimate is Ê(Y|X=40000) = 17067 − 0.0623*40000 ≈ $14,575
Confidence Interval for Mean
Response
• Confidence interval for E(Y|X=40,000): A range of plausible values
for E(Y|X=40,000) based on the sample.
• Approximate 95% Confidence interval:
Ê(Y|X=X0) ± 2*SE{Ê(Y|X=X0)}, where
SE{Ê(Y|X=X0)} = RMSE * sqrt(1/n + (X0 − X̄)² / Σᵢ₌₁ⁿ(Xᵢ − X̄)²)
• Notes about the formula for SE: the standard error becomes smaller as the sample size n increases, and the standard error is smaller the closer X0 is to X̄ (a Python sketch of this calculation follows this slide).
• In JMP, after Fit Line, click red triangle next to Linear Fit and click
Confid Curves Fit. Use the crosshair tool by clicking Tools, Crosshair
to find the exact values of the confidence interval endpoints for a
given X0.
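A minimal Python sketch of the mean-response interval formula above (my addition; it assumes the carprices data have been loaded into arrays named odometer and price, which are hypothetical names, not part of the course materials):

```python
# Approximate 95% CI for E(Y | X = x0) using the formula above.
# Assumes `odometer` (X) and `price` (Y) are numpy arrays holding the carprices data.
import numpy as np

def mean_response_ci(x, y, x0):
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)                   # least squares fit
    fitted = intercept + slope * x
    rmse = np.sqrt(np.sum((y - fitted) ** 2) / (n - 2))      # root mean square error
    se = rmse * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    estimate = intercept + slope * x0
    return estimate - 2 * se, estimate + 2 * se

# Example call (hypothetical arrays):
# print(mean_response_ci(odometer, price, 40000))            # roughly ($14,514, $14,653)
```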
Bivariate Fit of Price By Odometer
[Scatterplot of Price versus Odometer with the least squares line and confidence curves for the fitted mean.]

Approximate 95% confidence interval for E(Y|X=40,000): ($14,514, $14,653)
A Prediction Problem
• The used-car dealer is offered a particular three-year-old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car.
• Best prediction based on the least squares estimate: Ê(Y|X=40000) = 17067 − 0.0623*40000 ≈ $14,575
Range of Selling Prices for
Particular Car
• The dealer is interested in the range of selling prices that
this particular car with 40,000 miles on it is likely to have.
• Under the simple linear regression model, Y|X follows a normal distribution with mean β0 + β1*X and standard deviation σ. A car with 40,000 miles on it will be in the interval β0 + β1*40000 ± 2σ about 95% of the time.
• Class 5: We substituted the least squares estimates β̂0, β̂1 for β0, β1 and RMSE for σ, and said a car with 40,000 miles on it will be in the interval β̂0 + β̂1*40000 ± 2*RMSE about 95% of the time. This is a good approximation but it ignores potential error in the least squares estimates. (A worked version of this calculation follows this slide.)
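A quick worked check (my addition), plugging the car-price estimates and RMSE from the JMP output above into the approximation just described:

```python
# Approximate 95% range of selling prices at 40,000 miles: fitted value +/- 2*RMSE
# (estimates and RMSE from the carprices output above)
b0_hat, b1_hat, rmse = 17066.766, -0.0623155, 303.1375

fitted = b0_hat + b1_hat * 40000              # about $14,574
low, high = fitted - 2 * rmse, fitted + 2 * rmse
print(round(low), round(high))                # roughly (13968, 15180), close to the exact
                                              # prediction interval (13972, 15194) shown later
```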
Prediction Interval
• 95% Prediction Interval: An interval that has approximately a 95% chance of containing the value of Y for a particular unit with X=X0, where the particular unit is not in the original sample.
• Approximate 95% prediction interval (a sketch of both intervals follows this slide):
Ê(Y|X=X0) ± 2*RMSE*sqrt(1 + 1/n + (X0 − X̄)² / Σᵢ₌₁ⁿ(Xᵢ − X̄)²)
• In JMP, after Fit Line, click red triangle next to Linear Fit
and click Confid Curves Indiv. Use the crosshair tool by
clicking Tools, Crosshair to find the exact values of the
prediction interval endpoints for a given X0.
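As an alternative to the JMP crosshair approach, a minimal Python sketch (my own, assuming a pandas DataFrame named cars with columns Price and Odometer holding the carprices data; the names and the CSV export are hypothetical) that produces both intervals at X0 = 40,000 using statsmodels:

```python
# Confidence interval for the mean response and prediction interval at Odometer = 40,000.
# Assumes a DataFrame `cars` with columns 'Price' and 'Odometer' (hypothetical names), e.g.:
# cars = pd.read_csv("carprices.csv")   # hypothetical export of carprices.JMP
import pandas as pd
import statsmodels.api as sm

X = sm.add_constant(cars['Odometer'])          # add intercept column
model = sm.OLS(cars['Price'], X).fit()

new_X = pd.DataFrame({'const': [1.0], 'Odometer': [40000]})
pred = model.get_prediction(new_X).summary_frame(alpha=0.05)

print(pred[['mean_ci_lower', 'mean_ci_upper']])   # CI for E(Y|X=40000), about (14514, 14653)
print(pred[['obs_ci_lower', 'obs_ci_upper']])     # prediction interval, about (13972, 15194)
```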
Bivariate Fit of Price By Odometer
[Scatterplot of Price versus Odometer with the least squares line, confidence curves for the fitted mean, and confidence curves for individuals.]

95% Confidence Interval for E(Y|X=40000): (14514, 14653)
95% Prediction Interval for X=40000: (13972, 15194)
Comparison of Confidence
Intervals for Mean Response and
Prediction Intervals
• Confidence Interval for Mean Response:
Ê(Y|X=X0) ± 2*RMSE*sqrt(1/n + (X0 − X̄)² / Σᵢ₌₁ⁿ(Xᵢ − X̄)²)
• Prediction Interval:
Ê(Y|X=X0) ± 2*RMSE*sqrt(1 + 1/n + (X0 − X̄)² / Σᵢ₌₁ⁿ(Xᵢ − X̄)²)
• Prediction interval is wider than confidence interval for
mean response because it is trying to predict the Y for a
particular unit with X=X0 rather than the mean for all
units with X=X0
• As the sample size (n) becomes large, the width of the confidence interval for the mean response goes to zero, whereas the prediction interval approaches Ê(Y|X=X0) ± 2*RMSE (see the simulation sketch below).
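A minimal simulation sketch (my addition, using synthetic data with an assumed true line and σ) illustrating that the mean-response interval shrinks with n while the prediction interval settles at roughly ± 2*RMSE:

```python
# Compare half-widths of the mean-response CI and the prediction interval at x0
# as n grows, using synthetic data (the "true" line and sigma are assumed values).
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma, x0 = 17000.0, -0.06, 300.0, 40000.0   # assumed true values

for n in (25, 100, 10000):
    x = rng.uniform(15000, 45000, size=n)
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    slope, intercept = np.polyfit(x, y, 1)
    rmse = np.sqrt(np.sum((y - (intercept + slope * x)) ** 2) / (n - 2))
    leverage = 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    ci_half = 2 * rmse * np.sqrt(leverage)          # shrinks toward 0 as n grows
    pi_half = 2 * rmse * np.sqrt(1 + leverage)      # settles near 2 * RMSE
    print(n, round(ci_half, 1), round(pi_half, 1))
```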
Confidence Intervals and the Polls
Top Stories - washingtonpost.com
In the aftermath of last week's debate, Bush leads Kerry 51 percent to 46 percent
among those most likely to vote, according to polling conducted Friday through
Sunday...
A total of 1,470 registered voters were interviewed, including 1,169 who were
determined to be likely voters. The margin of sampling error for results based on
either sample is plus or minus three percentage points.
• Margin of Error = 2*SE(Estimate).
• 95% CI for the Bush-Kerry difference: p̂ ± 2*SE(p̂) = p̂ ± Margin of error
• 95% CI for the difference between Bush and Kerry's proportions: 5% ± 3% = (2%, 8%) (a quick check of the quoted three-point margin follows).
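A minimal sketch (my addition) checking that the quoted three-point margin of sampling error is roughly what 2*SE gives for 1,169 likely voters; the simple-random-sample framing and the conservative p = 0.5 choice are my assumptions, not the poll's stated methodology:

```python
# Margin of error = 2 * SE(p_hat) for a proportion, with the conservative choice p = 0.5.
# n = 1,169 likely voters, as quoted in the Washington Post story above.
import math

n = 1169
p = 0.5                                   # conservative assumption, not from the poll
margin = 2 * math.sqrt(p * (1 - p) / n)   # about 0.029, i.e. roughly 3 percentage points
print(round(100 * margin, 1))
```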
Why Do the Polls Sometimes
Disagree So Much?
If the election were held today, would you vote for Bush or Kerry?
Published    Bush%   Kerry%   Error%   Polled   Source
Oct. 3       49      49       4        770      CNN-USA Today Gallup
Oct. 2       46      49       4        1,013    Newsweek
Sept. 29     51      46       3        1,100    Los Angeles Times
Sept. 27     52      44       4        758      CNN-USA Today Gallup
Sept. 20     50      42       3        1,088    CBS News/New York Times
Sept. 16     46      46       4        1,002    Pew
Sept. 16     55      42       4        767      Gallup
Validity of Confidence Interval
• Polls are conducted by attempting to randomly sample
U.S. citizens of voting age.
• Mean Estimated Difference in Vote Proportion: the average Estimated Difference in Vote Proportion from repeated random samples.
• SE(Estimated Difference in Vote Proportion) is the "typical" amount by which the Estimated Difference in Vote Proportion for one random sample differs from the Mean Estimated Difference in Vote Proportion.
• CI for True Difference in Vote Proportion = Estimated Difference in Vote Proportion ± 2*SE(Estimated Difference in Vote Proportion)
• The confidence interval's "95% guarantee" (that 95% of the time it will contain the true difference in vote proportion) holds only if the mean estimated difference in vote proportion = true difference in vote proportion.
• When mean estimated difference in vote proportion does
not equal true difference in vote proportion, there is bias.
Sources of Bias
• See Ch 3.3: pages 252-254
• Undercoverage: some groups in the population are left
out of the process of choosing the sample (for an opinion
poll conducted by telephone, people without a residential
phone are not covered).
• Nonresponse: An individual chosen for the sample can’t
be contacted or does not cooperate. Major problem in
telephone surveys.
• Response bias: Respondent’s or interviewer’s behavior
may cause bias. Respondents may lie, especially if
asked about illegal or unpopular behavior. Race or sex
of interviewer can affect responses.
• Wording of questions: Has very important influence on
survey results. UN Experiment.
Voting Polls
• The polls try to predict what will happen in the election.
Thus, they must address the question, who is likely to
vote?
• In exit polling from the 2000 election, 39% of
respondents identified themselves as Democrats, 34%
as Republicans and 33% as Independents. In 1996, the
composition was 39% Democrat, 34% Republican and
27% Independent.
• Should the polls adjust their results so that they reflect a
voter composition of more Democrats than Republicans?
– Gallup doesn’t do much adjustment.
– LA Times poll, Zogby’s poll make adjustments.