CHAPTER 10


Inference for the Slope
of Regression
Chapter 14
Review

Before we begin the study of inference for
regression, let's review what we already know
about regression. Enter the following data into
your calculator and try to answer the questions
below.
Review











How do we make a scatterplot?
What do we look for in a scatterplot?
How do we find the correlation? What is the
symbol we use for correlation?
What does the correlation mean?
What does R² mean?
How do we create a linear model?
How do we know if the linear model is
appropriate?
How do we create a residual plot?
What do we look for in a residual plot?
What does the slope mean?
What does the y-intercept mean?
Inference for the Slope of an LSRL

Remember, the LSRL (least-squares regression line) is the
line that best fits the data by the criterion that the sum
of the squares of the residuals is minimized:
ŷ = a + bx
The slope b and intercept a of the least-squares line
are statistics. That is, we calculate them from sample
data. These statistics would take somewhat different
values if we repeated the study with a different
sample. To do formal inference, we think of a and b
as estimates of unknown parameters. We test to
determine if there is a linear association between the
two variables. We almost always do this by testing
whether the true slope β, which b estimates, is 0.
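As a minimal sketch of the least-squares criterion described above, the slope and intercept can be computed from the usual closed-form formulas, b = Sxy / Sxx and a = ȳ - b·x̄. The data set and the function names here are hypothetical and serve only to illustrate the idea.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: the line passes through (x-bar, y-bar)
    return a, b

def sum_sq_residuals(xs, ys, a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
sse = sum_sq_residuals(xs, ys, a, b)
print(f"y-hat = {a:.2f} + {b:.2f}x, sum of squared residuals = {sse:.2f}")
```

Nudging either a or b away from the fitted values makes the sum of squared residuals larger, which is what "best fits" means here.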
Assumptions and Conditions

The Linearity Assumption - The relationship between x and y
must be linear to complete this inference procedure (this
is the first and most important assumption).
• The straight enough condition - This condition is
satisfied if the scatterplot looks straight. (It is
generally NOT a good idea to draw a straight line
through the scatterplot when checking: graders see
this as making an absolute statement, and it makes the
pattern look straighter than it really is.) You can verify
by checking the residual plot (remember, we are looking
for no clear pattern).
Assumptions and Conditions

The Independence Assumption - The observations must be
independent of one another.
• Randomization condition - Did the data come from
a random sample?
• 10% condition - Is the sample less than 10% of the
population?
• Independent sample condition - Does the residual
plot show clear evidence of dependence in the data
(a clear pattern)? If you notice patterns, clumps, or
trends, then the data suggest a failure of
independence.
Assumptions and Conditions

Normality Assumption - The residuals (the errors about
the regression line) must be approximately normal.
• The nearly normal condition - Look at the histogram
or normal probability plot of the RESIDUALS.

Equal Variance Assumption - The variability of y
should be the same for all values of x.
• No thickening plot condition - Does the scatterplot
thicken? Be aware of a fan shape or other growing
and shrinking parts within the scatterplot.

All four assumptions and their corresponding conditions
must be checked in order to perform inference for the
LSRL. You should check them in the exact order given
in this lesson: 1) Linearity; 2) Independence; 3)
Normality; and last 4) Equal Variance. The acronym
LINE gives the order.
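Most of these conditions are checked by looking at the residuals, so a rough sketch of how to obtain them may help. The data and the function name below are hypothetical; the printed rows are only a text stand-in for the residual plot and histogram you would normally draw on the calculator.

```python
from statistics import mean

def fit_and_residuals(xs, ys):
    """Fit y-hat = a + b*x by least squares and return (a, b, residuals)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, residuals

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

a, b, res = fit_and_residuals(xs, ys)

# Text version of a residual plot: residual against x, one row per point.
# Look for bends (linearity), clumps or trends (independence),
# and a fan shape (equal variance).
for x, e in zip(xs, res):
    print(f"x = {x}   residual = {e:+.3f}")
```

The same list of residuals is what you would put into a histogram or normal probability plot for the nearly normal condition.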
Inference for the Slope of an LSRL


The values of y that we observe vary about their means
according to a normal distribution.
The first step in inference is to estimate the unknown
parameters α, β, and σ. When the regression model
describes our data and we calculate the least-squares line
ŷ = a + bx, the slope b of the least-squares line is an
unbiased estimator of the true slope β, and the intercept
a of the least-squares line is an unbiased estimator of the
true intercept α.
Slope and Intercept




Back to our previous example:
We can easily find that the LSRL is ŷ = 91.268 + 1.493x for the
IQ score of children at three years against the intensity of
crying soon after birth. The slope is particularly important. A
slope is a rate of change. The true slope β says how much
higher average IQ is for children with one more peak in their
crying measurement.
Because b = 1.493 estimates the unknown β, we estimate that
on average IQ is about 1.5 points higher for each added
crying peak.
We need the intercept a = 91.27 to draw the line, but it has no
statistical meaning in this case. No child had fewer than 9
crying peaks, so we have no data near x = 0. We suspect that
all normal children would cry when snapped with a rubber
band, so that we will usually not observe x = 0.
Standard Error


Remember: residual = observed y - predicted y = y - ŷ
There are n residuals, one for each data point. Because σ is
the standard deviation of responses about the true
regression line, we estimate it with the standard deviation
of the residuals, s_e (many times just labeled s). The
standard error of the slope is SE_b. The spread of the x-values
is s_x. The residuals from an LSRL always have mean zero,
which simplifies the formula for their standard deviation.
$SE_b = \dfrac{s_e}{s_x\sqrt{n-1}}$, where
$s_e = \sqrt{\dfrac{1}{n-2}\sum_{i=1}^{n}\text{residual}_i^2} = \sqrt{\dfrac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
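The two formulas above translate almost line for line into code. This is a sketch only: the data are hypothetical and `slope_standard_error` is a name chosen for illustration, not a calculator or library function.

```python
from math import sqrt
from statistics import mean, stdev

def slope_standard_error(xs, ys):
    """Return (b, s_e, se_b) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = sqrt(ss_resid / (n - 2))      # residual SD: divide by n - 2
    s_x = stdev(xs)                     # sample SD of x: divides by n - 1
    se_b = s_e / (s_x * sqrt(n - 1))    # standard error of the slope
    return b, s_e, se_b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, s_e, se_b = slope_standard_error(xs, ys)
print(f"b = {b:.3f}, s_e = {s_e:.4f}, SE_b = {se_b:.4f}")
```

Note the two different divisors: s_e uses n - 2 because two parameters (slope and intercept) were estimated, while s_x is an ordinary sample standard deviation with n - 1.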
The Test Statistics: Another t-test?

The null hypothesis will be that there is no true linear
relationship between x and y (usually that the slope is 0; make
sure to put this into the context of the problem). The most common
null hypothesis is H0: β = 0. We use the test statistic:
t = b / SE_b
In terms of a random variable T having the t distribution with
df = n - 2, the P-value for a test of H0 against
Ha: β > 0 is P(T ≥ t)
Ha: β < 0 is P(T ≤ t)
Ha: β ≠ 0 is 2P(T ≥ |t|)
This is like all of the other t-tests that we have completed. The
test statistic is just the standardized version of the least-squares
slope b. It is another t statistic.
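The three P-value rules above can be sketched in code. Python's standard library has no t distribution, so this sketch integrates the t density numerically with Simpson's rule; that is a substitute for the calculator's tcdf, not the method the slides use. The function names are hypothetical, and the value SE_b = 0.4871 in the usage lines is an assumption obtained by dividing Example 1's slope by its reported t (it is not printed in the slides).

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_const) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, panels=2000):
    """P(T >= t), using Simpson's rule on [0, |t|] and symmetry about 0."""
    width = abs(t) / panels
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * t_pdf(i * width, df)
    tail = 0.5 - total * width / 3
    return tail if t >= 0 else 1.0 - tail

def slope_test(b, se_b, n, alternative="two-sided"):
    """t statistic, df, and P-value for H0: beta = 0."""
    df = n - 2
    t = b / se_b
    if alternative == "greater":      # Ha: beta > 0
        p = upper_tail(t, df)
    elif alternative == "less":       # Ha: beta < 0
        p = 1.0 - upper_tail(t, df)
    else:                             # Ha: beta != 0
        p = 2 * upper_tail(abs(t), df)
    return t, df, p

# Example 1 values: b = 1.493, n = 38; SE_b = 0.4871 is an assumed value (b / t)
t, df, p = slope_test(1.493, 0.4871, 38)
print(f"t = {t:.3f}, df = {df}, two-sided P-value = {p:.4f}")
```

In practice the calculator's LinRegTTest reports t and the P-value directly; the point of the sketch is only to show where those numbers come from.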
Confidence Intervals (CI)


A confidence interval is more useful than a test alone
because it shows how precisely b estimates β. The
confidence interval for β has the familiar form
estimate ± t* SE_b
Because b is our estimate, the confidence interval
becomes b ± t* SE_b, with df = n - 2, and (once again):
$SE_b = \dfrac{s_e}{s_x\sqrt{n-1}}$, where
$s_e = \sqrt{\dfrac{1}{n-2}\sum_{i=1}^{n}\text{residual}_i^2} = \sqrt{\dfrac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
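A short sketch of the interval b ± t* SE_b follows. The critical value t* is normally read from a table or the calculator; here it is found by bisection on a numerically integrated t distribution, which is an assumption of this sketch rather than the slides' method. The function names are hypothetical. The usage lines borrow b = -0.07606, SE_b = 0.0226, and df = 89 from the computer output in Example 3.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_const) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, panels=2000):
    """P(T >= t) for t >= 0, by Simpson's rule on [0, t]."""
    width = t / panels
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * t_pdf(i * width, df)
    return 0.5 - total * width / 3

def t_critical(df, conf=0.95):
    """t* with central area conf, found by bisection on the upper tail."""
    target = (1 - conf) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > target:
            lo = mid        # tail still too big: t* is further out
        else:
            hi = mid
    return (lo + hi) / 2

def slope_ci(b, se_b, df, conf=0.95):
    """Confidence interval b +/- t* SE_b."""
    margin = t_critical(df, conf) * se_b
    return b - margin, b + margin

low, high = slope_ci(-0.07606, 0.0226, 89)
print(f"t* = {t_critical(89):.3f}, 95% CI = ({low:.4f}, {high:.4f})")
```

A wider interval means b is a less precise estimate of β; an interval that contains 0 is consistent with failing to reject H0: β = 0 at the matching significance level.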
Example 1
A study on child development took a random
sample of 38 children, ranging in age from early infancy
to speech, to determine if crying activity could help
predict a child's intellectual development. Determine if
there is a useful linear relationship between crying
activity and intellectual development.
Example 1

Step 1: Identify population Parameter, state the
null and alternative Hypotheses, determine
what you are trying to do (and determine what
the question is asking).
• The population is babies. The parameter is the slope β
of the regression line. We want to determine if there
is an association between crying and intellectual
development.
H0: β = 0 There is no linear association between crying
and intellectual development.
HA: β ≠ 0 There is a linear association between crying
and intellectual development.
Example 1

Step 2: Verify the Assumptions by checking the
conditions
• Linearity Assumption
  • Straight enough condition: There is no obvious bend
  in the scatterplot. The residual plot shows a random
  scatter about the line.
• Independence Assumption
  • Randomization Condition: We are told that the
  children were chosen randomly.
  • 10% Condition: This sample is less than 10% of the
  population.
  • Independent Sample Condition: The residual plot
  shows no clear evidence of dependence in the data (no
  clear pattern).

Example 1

Step 2: Verify the Assumptions by checking the
conditions
• Normality Assumption
  • Nearly Normal Condition: A histogram of the
  residuals shows a right-skewed distribution. Since
  we have a moderately sized sample, we will
  proceed with caution.
• Equal Variance Assumption
  • No Thickening Scatterplot Condition: The
  residual plot shows no obvious trends in the
  spread.
Example 1

Step 3: If conditions are met, Name the
inference procedure, find the Test statistic, and
Obtain the p-value in carrying out the inference:
We will use a Linear Regression t-Test for Slope.
Use the calculator to determine the estimated
regression equation, the t-score, r, r², df, and the p-value.
ŷ = predicted IQ based on crying activity:
ŷ = 91.268 + 1.493(crying peaks)
Test statistic: t = 3.065, df = 36
P-value: p = .0041
r = 0.455, r² = 0.207, s = 17.499
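The calculator output for Example 1 can be cross-checked for internal consistency. The sketch below relies on a standard identity that the slides do not state: for simple linear regression, the slope's t statistic also equals r·√(n - 2) / √(1 - r²). The variable names are illustrative, and the implied SE_b is derived here (b / t), not reported in the slides.

```python
from math import sqrt

# Values reported for Example 1
n = 38
b = 1.493
r = 0.455
t_reported = 3.065

df = n - 2                                   # degrees of freedom for the slope test
t_from_r = r * sqrt(df) / sqrt(1 - r ** 2)   # identity: t = r*sqrt(n-2)/sqrt(1-r^2)
se_b_implied = b / t_reported                # because t = b / SE_b
r_squared = r ** 2

print(f"df = {df}")
print(f"t from r = {t_from_r:.3f} (reported {t_reported})")
print(f"implied SE_b = {se_b_implied:.4f}")
print(f"r^2 = {r_squared:.3f}")
```

Small differences in the last digit come from the rounding of r to three decimals.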
Example 1

Step 4: Make a decision and State your
conclusion in context of the problem using p-value.
With a low p-value of .0041, we reject the null
hypothesis at the α = .05 level and conclude that there
is strong evidence of a useful linear relationship
between crying activity and intellectual
development; however, r = 0.455 is fairly small, implying
that the association is relatively weak.
Example 2
By conventional wisdom, some people with colds
avoid dairy products because dairy is thought
to produce extra mucus. This claim was
challenged by a group of researchers in Australia
in 1990. They gathered a random sample of
participants throughout the country and infected
them with rhinovirus (a common cold virus). For
10 days, each participant kept track of their milk
consumption and collected all of the tissues used
for any nasal discharge of mucus. Once all of the
tissues were gathered, the researchers extracted and
measured the nasal mucus secretions.
Example 2
Determine if dairy consumption increases mucus
discharge when a person has a cold.
Glasses of Milk per Day    Ounces of Mucus
0                          0.7
0                          0.3
1                          1.1
1                          0.2
1                          0.6
2                          0.7
3                          0.2
3                          0.2
4                          0.9
6                          0.7
7                          0.5
8                          0.9
9                          0.3
11                         1.0
Example 2

Step 1: Identify population Parameter, state the
null and alternative Hypotheses, determine
what you are trying to do (and determine what
the question is asking).
• The population is people with colds in Australia.
The parameter is the slope β of the regression line. We
want to determine if there is a positive association
between milk consumption (or dairy consumption)
and mucus discharge when a person has a cold. We
will perform a Linear Regression t-test on the slope:
H0: β = 0 There is no linear association between milk
consumption and mucus discharge.
HA: β > 0 There is a positive linear association between
milk consumption and mucus discharge.
Example 2

Step 2: Verify the Assumptions by checking the
conditions
• Linearity Assumption
  • Straight enough condition: There is no obvious bend
  in the scatterplot. The residual plot shows a random
  scatter about the line.
• Independence Assumption
  • Randomization Condition: We are told that the
  sample was chosen randomly.
  • 10% Condition: This sample is less than 10% of the
  population.
  • Independent Sample Condition: The residual plot
  shows no clear evidence of dependence in the data (no
  clear pattern).

Example 2

Step 2: Verify the Assumptions by checking the
conditions
• Normality Assumption
  • Nearly Normal Condition: A histogram of the
  residuals shows a bimodal distribution. Since we
  have a small sample size, our assumption of
  normality may be violated. If normality is not
  appropriate, our results may not be valid. We
  proceed with caution.
• Equal Variance Assumption
  • No Thickening Scatterplot Condition: The
  residual plot shows no obvious trends in the
  spread.
Example 2

Step 3: If conditions are met, Name the
inference procedure, find the Test statistic, and
Obtain the p-value in carrying out the inference:
We will use a Linear Regression t-Test for Slope.
Use the calculator to determine the estimated
regression equation, the t-score, r, r², df, and the p-value.
ŷ = predicted mucus discharge:
ŷ = 0.5095 + 0.0208(glasses of milk)
Test statistic: t = 0.8481, df = 12
P-value: p = .2065
r = 0.2378, r² = 0.0566, s = 0.3184
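The calculator output above can be reproduced from the raw data. In this sketch the (milk, mucus) pairs are an assumption: they are the pairing read from the Example 2 data table, chosen because it reproduces the reported slope, intercept, r, s, and t. The function name `slope_summary` is hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_summary(xs, ys):
    """Least-squares fit plus the quantities used in the slope t-test."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(ss_resid / (n - 2))      # residual standard deviation s_e
    se_b = s / sqrt(sxx)              # same as s_e / (s_x * sqrt(n - 1))
    return {"a": a, "b": b, "r": sxy / sqrt(sxx * syy),
            "s": s, "se_b": se_b, "t": b / se_b, "df": n - 2}

# Assumed pairing of the Example 2 data (glasses of milk, ounces of mucus)
milk = [0, 0, 1, 1, 1, 2, 3, 3, 4, 6, 7, 8, 9, 11]
mucus = [0.7, 0.3, 1.1, 0.2, 0.6, 0.7, 0.2, 0.2, 0.9, 0.7, 0.5, 0.9, 0.3, 1.0]

fit = slope_summary(milk, mucus)
for name, value in fit.items():
    print(f"{name} = {value:.4f}")
```

The one-sided P-value of .2065 then comes from the upper tail of the t distribution with 12 degrees of freedom, which the calculator supplies.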
Example 2

Step 4: Make a decision (reject or fail to reject
H0). State your conclusion in context of the
problem using p-value.
With a p-value of 0.2065, we fail to reject the null
hypothesis at the α = .05 level and conclude that there
is little evidence of a positive linear
relationship between dairy consumption and mucus
discharge. A sample slope this large could easily
arise from sampling variation alone. However, the
normality condition was doubtful, so our results
may not be valid.
Example 3
Natives of Nenana, Alaska host a contest to guess the exact minute
that a wooden tripod placed on the frozen Tanana River will fall
through the breaking ice. The closest guess can win up to $300,000.
Determine if there is a linear relationship between the year and the
numbered day of the year on which the tripod falls through the ice.
The following shows the data:
Year after 1900    Days      Year after 1900    Days
17                 119.48    29                 124.65
18                 130.40    30                 127.79
19                 122.61    31                 129.39
20                 131.45    32                 121.43
21                 130.28    33                 127.81
22                 131.63    34                 119.59
23                 128.08    35                 134.56
24                 131.63    36                 120.54
25                 126.77    37                 131.83
26                 115.67    38                 125.84
27                 131.24    39                 118.56
28                 124.65    40                 110.64
Example 3

Step 1: Identify population Parameter, state the
null and alternative Hypotheses, determine
what you are trying to do (and determine what
the question is asking).
• We want to determine if there is an association
between the year and the time it takes for the tripod
to fall through the ice. We will perform a Linear
Regression t-test on the slope:
H0: β = 0 There is no linear association between the
year and the number of days for the tripod
to fall through the ice.
HA: β ≠ 0 There is a linear association between year
and the number of days for the tripod to fall
through the ice.
Example 3

Step 2: Verify the Assumptions by checking the
conditions
• Linearity Assumption
  • Straight enough condition: There is no obvious bend in the
  scatterplot. The residual plot shows a random scatter about
  the line.
• Independence Assumption
  • Randomization Condition: The sample is time bound, which
  raises suspicions about independence. The data are not a
  random sample.
  • 10% Condition: This data is likely to be more than 10% of
  the population.
  • Independent Sample Condition: The residual plot shows
  no clear evidence of dependence in the data (no clear
  pattern).
• Independence has been violated, so our conclusion
may not be valid.
Example 3

Step 2: Verify the Assumptions by checking the
conditions
• Normality Assumption
  • Nearly Normal Condition: A histogram of the
  residuals shows a unimodal and symmetric
  distribution.
• Equal Variance Assumption
  • No Thickening Scatterplot Condition: The
  residual plot shows no obvious trends in the
  spread.
Example 3

Step 3: If conditions are met, Name the
inference procedure, find the Test statistic, and
Obtain the p-value in carrying out the inference:
• We will use a Linear Regression t-Test for Slope.
Use the calculator to determine the estimated
regression equation, the t-score, r, r², df, and the p-value.
ŷ = numbered day of ice break-up:
ŷ = 627.9704 - 0.2605(year)
Test statistic: t = -1.507, df = 22
P-value: p = .1460
r = -0.3059, r² = 0.0936, s = 5.86
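As a check, the regression output above can be recomputed from the data table. One observation worth stating as an assumption of this sketch: the reported intercept of 627.9704 is reproduced when x is the calendar year (1917 to 1940) rather than years after 1900; the slope, s, r, and t are the same either way. The function name `slope_summary` is hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_summary(xs, ys):
    """Least-squares fit plus the quantities used in the slope t-test."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(ss_resid / (n - 2))
    se_b = s / sqrt(sxx)              # same as s_e / (s_x * sqrt(n - 1))
    return {"a": a, "b": b, "r": sxy / sqrt(sxx * syy),
            "s": s, "se_b": se_b, "t": b / se_b, "df": n - 2}

# Break-up day for years 17 through 40 after 1900, in year order
days = [119.48, 130.40, 122.61, 131.45, 130.28, 131.63, 128.08, 131.63,
        126.77, 115.67, 131.24, 124.65, 124.65, 127.79, 129.39, 121.43,
        127.81, 119.59, 134.56, 120.54, 131.83, 125.84, 118.56, 110.64]
calendar_years = [1900 + k for k in range(17, 41)]   # assumed x variable

fit = slope_summary(calendar_years, days)
for name, value in fit.items():
    print(f"{name} = {value:.4f}")
```

The negative slope says the fitted break-up day moves earlier by about a quarter of a day per year, although the test that follows shows this sample alone does not make the trend statistically significant.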
Example 3

Step 4: Make a decision (reject or fail to reject
H0). State your conclusion in context of the
problem using p-value.
With a p-value of 0.1460, we fail to reject the null
hypothesis at the α = .05 level and conclude that there
is little evidence of a linear
relationship between year and the number of days
until the ice breaks. A sample slope this far from 0
could easily arise from sampling variation alone. However,
our independence assumption is suspect, so our results
may not be valid.
Now construct a 95% Confidence Interval for β
Example 3 (part 2)

Step 1: Identify population Parameter that you
wish to estimate and determine what you are
trying to do (and determine what the question is
asking).
• We want to estimate the true slope, β, of the
Linear Regression Line between year and the number
of days until the ice breaks with 95% confidence.
Example 3 (part 2)

Step 2: Verify the Assumptions by checking the
conditions
• Linearity Assumption
  • Straight enough condition: There is no obvious bend in the
  scatterplot. The residual plot shows a random scatter about
  the line.
• Independence Assumption
  • Independent Sample Condition: The residual plot shows
  no clear evidence of dependence in the data (no clear
  pattern).
  • Randomization Condition: The sample is time bound, which
  raises suspicions about independence. The data are not a
  random sample.
  • 10% Condition: This data is likely to be more than 10% of
  the population.
• Independence has been violated, so our conclusion
may not be valid.
Example 3 (part 2)

Step 2: Verify the Assumptions by checking the
conditions
• Normality Assumption
  • Nearly Normal Condition: A histogram of the
  residuals shows a unimodal and symmetric
  distribution.
• Equal Variance Assumption
  • No Thickening Scatterplot Condition: The
  residual plot shows no obvious trends in the
  spread.
Example 3 (part 2)

Step 3: Name the inference, do the work, and
state the Interval:
We will use a 95% Confidence Interval for the slope of the
Linear Regression line.
$s_e = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n-2}} = \sqrt{\dfrac{755.49}{22}} = 5.86$
$b \pm t^*_{n-2}\,\dfrac{s_e}{s_x\sqrt{n-1}} = -0.2605 \pm 2.074 \cdot \dfrac{5.86}{7.07\sqrt{23}}$
$= -0.2605 \pm (2.074)(0.1728) = -0.2605 \pm 0.3584$
$= (-0.618,\ 0.098)$
Example 3 (part 2)

Step 4: State your conclusion in context of the
problem.
I am 95% confident that the ice has been breaking
up, on average, between 0.618 days earlier and 0.098
days later each year. Because this interval contains 0,
it agrees with our failure to reject H0. However,
independence was violated so our results may not be valid.
Example 3 (part 3)

Suppose data continued to be collected over the next 60
years and the results were found through the computer
output below:
ŷ = break-up day
R-Squared = 11.3%
s = 5.673 with 91 - 2 = 89 degrees of freedom
Variable           Coeff       SE(Coeff)   t-ratio   P-Value
Intercept          128.950     1.525       84.6      ≤ 0.0001
Year Since 1900    -0.07606    0.0226      -3.36     0.0012
Predicted Date = 128.95 - 0.076(Year since 1900)
If we performed another hypothesis test, what would be
our conclusion (assuming all the assumptions are
satisfied) with the new output of data?
Example 3 (part 3)

Step 4: Make a decision (reject or fail to reject
H0). State your conclusion in context of the
problem using p-value.
With a p-value of 0.0012, we reject the null
hypothesis at the α = .05 level and conclude
that there is strong evidence of an association
between the year after 1900 and the ice
break-up date: on average, the ice is breaking
up earlier each year.
Example 3 (part 4)
Use the computer output from part 3 (Year Since 1900:
Coeff = -0.07606, SE(Coeff) = 0.0226, with 89 degrees of
freedom).
Now, create a 95% confidence interval for the slope and
interpret your results.
$CI = b \pm t^*_{n-2}\,SE_b$, with $t^*_{89} = 1.987$
$CI = -0.07606 \pm 1.987(0.0226) = -0.07606 \pm 0.0449$
$CI = (-0.12096,\ -0.03116)$
We are 95% confident that the true slope is between
-0.12096 and -0.03116; for every increase of one year, the
ice breaks up, on average, between 0.03116 and 0.12096
days earlier.
Example 3 (part 4)
Use the same computer output (Year Since 1900:
Coeff = -0.07606, SE(Coeff) = 0.0226, t-ratio = -3.36,
P-Value = 0.0012, with 89 degrees of freedom).
How would you determine the t-score and the p-value if
they were not given? Verify that the t-score and p-value
are correct.
$t = \dfrac{b - \beta_0}{SE_b} = \dfrac{-0.07606 - 0}{0.0226} = -3.3655$
To find the p-value, we use the tcdf function of the
calculator:
p-value = tcdf(min, max, df) = tcdf(-E99, -3.3655, 89) = 0.000575
This is a 2-tailed test, so multiply by 2: p-value ≈ 0.00115
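The same verification can be sketched without a calculator. Python's standard library has no tcdf, so this sketch integrates the t density with Simpson's rule as a stand-in; the function names are hypothetical and the approach is an assumption of the sketch, not the slides' method.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_const) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, panels=2000):
    """P(T <= t): plays the role of tcdf(-E99, t, df) on the calculator."""
    width = abs(t) / panels
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * t_pdf(i * width, df)
    area_0_to_t = total * width / 3          # Simpson's rule on [0, |t|]
    return 0.5 + area_0_to_t if t >= 0 else 0.5 - area_0_to_t

# Values from the computer output
b, se_b, df = -0.07606, 0.0226, 89

t = (b - 0) / se_b             # H0: beta = 0
p_one_tail = t_cdf(t, df)      # lower tail, because t is negative
p_two_tail = 2 * p_one_tail

print(f"t = {t:.4f}")
print(f"one-tail p = {p_one_tail:.6f}, two-tail p = {p_two_tail:.5f}")
```

The two-tail value should agree with the P-Value column of the output after rounding.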
Assignment
Lesson: Chapter 27 Inference for Linear Regression
Read: Chapter 27
Problems: 1 - 37 (odd)
WS: Regression and Correlation
