Transcript Slide 1

Least Squares Regression

Fitting a Line to Bivariate Data

Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables.

In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description.

But which line best describes our data?

The regression line

The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. The distances are squared so that they are all positive values and can be properly added (Pythagoras).

Properties

The least-squares regression line can be shown to have this equation:

ŷ = b₀ + b₁x, where b₁ = r (s_y / s_x) and b₀ = ȳ − b₁x̄

ŷ is the predicted y value ("y hat").
b₁ is the slope; it is in units of y per unit of x.
b₀ is the y-intercept; it is in units of y.

How to:

First we calculate the slope of the line, b₁; from statistics we already know:

b₁ = r (s_y / s_x)

r is the correlation.
s_y is the standard deviation of the response variable y.
s_x is the standard deviation of the explanatory variable x.

Once we know b₁, the slope, we can calculate b₀, the y-intercept:

b₀ = ȳ − b₁x̄, where x̄ and ȳ are the sample means of the x and y variables.

This means that we don't have to calculate a lot of squared distances to find the least squares regression line for a data set. We can instead rely on the equation.

But typically, we use a 2-var stats calculator or stats software.
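For instance, here is a minimal Python sketch (not from the slides; the x and y lists are made-up illustrative data) of the slope and intercept formulas above, followed by a prediction with the fitted line:

```python
# Minimal sketch of the least-squares formulas; x and y are made-up data.
import statistics

x = [1, 2, 3, 4, 5, 6]                       # explanatory variable
y = [0.02, 0.03, 0.05, 0.06, 0.07, 0.09]     # response variable

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)
r = statistics.correlation(x, y)             # Pearson correlation (Python 3.10+)

b1 = r * s_y / s_x                           # slope
b0 = y_bar - b1 * x_bar                      # y-intercept

def y_hat(x_new):
    """Predicted y for a given x (use only within the range studied)."""
    return b0 + b1 * x_new

print(b1, b0, y_hat(4.5))
```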

BEWARE!!!

Not all calculators and software use the same convention:

ŷ = a + bx

Some use instead:

ŷ = ax + b

Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.



Software output

[Software output: the intercept and slope are labeled, along with R²; the absolute value of r is the square root of R².]

The equation completely describes the regression line.

NOTE: The regression line always passes through the point with coordinates (x̄, ȳ).

The distinction between explanatory and response variables is crucial in regression: if you exchange y for x when calculating the regression line, you will get the wrong line. Recall that b₁ = r (s_y / s_x); regression examines the distance of all points from the line in the y direction only.

Hubble telescope data about galaxies moving away from Earth: these two lines are the two regression lines calculated either correctly (x = distance, y = velocity, solid line) or incorrectly (x = velocity, y = distance, dotted line).

Correlation versus regression

The correlation is a measure of spread (scatter) in both the x and y directions of the linear relationship.

In regression we examine the variation in the response variable (y) given a change in the explanatory variable (x).



Making predictions: interpolation

The equation of the least-squares regression line allows us to predict y for any x within the range studied. This is called interpolating.

ŷ = 0.0144x + 0.0008

Nobody in the study drank 6.5 beers, but for x = 6.5 we would expect a blood alcohol content of about 0.094 mg/ml:

ŷ = 0.0144 × 6.5 + 0.0008 = 0.0936 + 0.0008 = 0.0944 mg/ml

Year: 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
Powerboats (in 1000s): 447 460 481 498 513 512 526 559 585 614 645 675 711 719
Dead manatees: 13 21 24 16 24 20 15 34 33 33 39 43 50 47

ŷ = 0.125x − 41.4

There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths.

The least-squares regression line has the equation ŷ = 0.125x − 41.4.

Thus, if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?

ŷ = 0.125(500) − 41.4 = 62.5 − 41.4 = 21.1, or roughly 21 manatees.

Extrapolation

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be a very stupid thing to do, as seen here.


The y-intercept

Sometimes the y-intercept is not a realistic possibility. Here we have a negative blood alcohol content, which makes no sense… But the negative value is appropriate for the equation of the regression line. There is a lot of scatter in the data, and the line is just an estimate.

[Figure: the y-intercept of the fitted line corresponds to a negative blood alcohol content.]

R-squared = r², the proportion of y-variation explained by changes in x

r², the coefficient of determination, is the square of the correlation coefficient.

r² represents the proportion of the variation in y (the vertical scatter from the regression line) that can be explained by changes in x.

b₁ = r (s_y / s_x)

r = −1 or r = 1, r² = 1: changes in x explain 100% of the variation in y. y can be entirely predicted for any given value of x.

r = 0, r² = 0: changes in x explain 0% of the variation in y. The value(s) y takes is (are) entirely independent of what value x takes.

r = 0.87, r² = 0.76: here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
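As a quick illustration (made-up numbers, not the SAT data below), the two descriptions of r² agree: squaring the correlation gives the same value as the fraction of y-variation accounted for by the fitted line.

```python
# Sketch: r-squared computed two equivalent ways, on made-up data.
import statistics

x = [10, 20, 30, 40, 50, 60]
y = [900, 882, 868, 851, 843, 830]

r = statistics.correlation(x, y)
b1 = r * statistics.stdev(y) / statistics.stdev(x)
b0 = statistics.mean(y) - b1 * statistics.mean(x)

# 1) square of the correlation coefficient
r2_from_r = r ** 2

# 2) proportion of the variation in y explained by the regression:
#    1 - (sum of squared residuals) / (total sum of squares of y)
y_bar = statistics.mean(y)
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r2_from_ss = 1 - ss_res / ss_tot

print(round(r2_from_r, 4), round(r2_from_ss, 4))   # the two values agree
```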

Example: SAT scores

[Scatterplot: SAT mean per state vs. % of seniors taking the test; fitted line y = −2.2375x + 1023.4, R² = 0.7542.]

SAT scores: calculations

x̄ = 33.882, s_x = 24.103
ȳ = 947.549, s_y = 62.1
r = −0.868

slope: b = r (s_y / s_x) = −0.868 × (62.1 / 24.103) ≈ −2.236

intercept: a = ȳ − b x̄ = 947.549 − (−2.236)(33.882) ≈ 1023.3

least-squares prediction line: ŷ = a + bx ≈ 1023.3 − 2.236x

SAT scores: result

[Scatterplot as above: SAT mean per state vs. % of seniors taking the test; y = −2.2375x + 1023.4, R² = 0.7542.]

If 57% of NC seniors take the SAT, the predicted mean score is ŷ = 1023.4 − 2.2375(57) ≈ 896.

r² = (−0.868)² ≈ 0.7534: about 75% of the variation in state mean SAT scores is explained by differences in the % of seniors that take the test.

r = 0.7, r² = 0.49

r = 0.9, r² = 0.81

There is a great deal of variation in BAC for the same number of beers drunk. A person’s blood volume is a factor in the equation that was overlooked here. We changed number of beers to number of beers per weight of person in lb.

- In the first plot, number of beers only explains 49% of the variation in blood alcohol content.
- But number of beers / weight explains 81% of the variation in blood alcohol content.
- Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).

Grade performance

If class attendance explains 16% of the variation in grades, what is the correlation between percent of classes attended and grade?

1. We need to make an assumption: attendance and grades are positively correlated, so r will be positive too.
2. r² = 0.16, so r = +√0.16 = +0.4.

A weak correlation.

Transforming relationships

A scatterplot might show a clear relationship between two quantitative variables, but issues of influential points or nonlinearity prevent us from using correlation and regression tools. Transforming the data (changing the scale in which one or both of the variables are expressed) can make the shape of the relationship linear in some cases.

Example: Patterns of growth are often exponential, at least in their initial phase. Changing the response variable y into log(y) or ln(y) will transform the pattern from an upward-curved exponential to a straight line.

Exponential bacterial growth

In ideal environments, bacteria multiply through binary fission; the number of bacteria can double every 20 minutes in that way.

[Plot: bacterial count vs. time (min), 0 to 240 minutes; counts 1, 2, 4, 8, 16, 32, 64, … Exponential growth 2^n, not suitable for regression.]

[Plot: log of bacterial count vs. time (min); log(2^n) = n·log(2) ≈ 0.3n.]

Taking the log changes the growth pattern into a straight line.
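A small Python sketch of this idea, using the doubling-every-20-minutes growth described above (noise-free counts, so the log-transformed relationship is exactly linear):

```python
# Sketch: log-transforming exponential growth so a line describes it.
import math
import statistics

time = list(range(0, 241, 20))               # minutes: 0, 20, ..., 240
count = [2 ** (t / 20) for t in time]        # 1, 2, 4, 8, ... doubling every 20 min

log_count = [math.log10(c) for c in count]   # transform the response variable

r_raw = statistics.correlation(time, count)        # curved pattern: r well below 1
r_log = statistics.correlation(time, log_count)    # straight line: r = 1
print(round(r_raw, 3), round(r_log, 3))
```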

Body weight and brain weight in 96 mammal species

r = 0.86, but this is misleading: the elephant is an influential point, and most mammals are very small in comparison. Without this point, r = 0.50 only.

Now we plot the log of brain weight against the log of body weight. The pattern is linear, with r = 0.96. The vertical scatter is homogeneous → good for predictions of brain weight from body weight (in the log scale).

Inference for least squares lines

Inference for simple linear regression

- Simple linear regression model
- Conditions for inference
- Confidence interval for regression parameters
- Significance test for the slope
- Confidence interval for E(y) for a given x
- Prediction interval for y for a given x

ŷ = 0.125x − 41.4

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different sample → different plot.

Now we want to describe the population mean response E(y) as a function of the explanatory variable x: E(y) = β₀ + β₁x.

And we want to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).

Simple linear regression model

In the population, the linear regression equation is E(y) = β₀ + β₁x.

Sample data then fit the model:

Data = fit + residual
yᵢ = (β₀ + β₁xᵢ) + εᵢ

where the εᵢ are independent and Normally distributed N(0, σ).

Linear regression assumes equal standard deviation of y (σ is the same for all values of x).
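A short sketch of what this model says about the data-generating process (β₀, β₁, and σ below are made-up illustrative values, not from the slides):

```python
# Sketch: simulating data from y_i = beta0 + beta1*x_i + eps_i, eps_i ~ N(0, sigma).
import random

beta0, beta1, sigma = 2.0, 0.5, 1.0          # made-up population parameters
random.seed(1)

x = list(range(1, 21))
# Each y_i is its mean response E(y) = beta0 + beta1*x_i plus a Normal residual;
# the same sigma is used at every x (the equal-spread assumption).
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
```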

E(y) = β₀ + β₁x

The intercept β₀, the slope β₁, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.

- The value of ŷ from the least-squares regression line is really a prediction of the mean value of y, E(y), for a given value of x.
- The least-squares regression line (ŷ = b₀ + b₁x) obtained from sample data is the best estimate of the true population regression line (E(y) = β₀ + β₁x).

ŷ: unbiased estimate for the mean response E(y)
b₀: unbiased estimate for the intercept β₀
b₁: unbiased estimate for the slope β₁

The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εᵢ around the mean E(y).

The regression standard error, s_e, for n sample data points is calculated from the residuals (yᵢ − ŷᵢ):

s_e = √( Σ residual² / (n − 2) ) = √( Σ (yᵢ − ŷᵢ)² / (n − 2) )

s_e is an unbiased estimate of the regression standard deviation σ.
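A minimal sketch of that calculation (made-up data; the slope and intercept come from the formulas given earlier):

```python
# Sketch: the regression standard error s_e from the residuals.
import math
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

r = statistics.correlation(x, y)
b1 = r * statistics.stdev(y) / statistics.stdev(x)
b0 = statistics.mean(y) - b1 * statistics.mean(x)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
n = len(x)
s_e = math.sqrt(sum(res ** 2 for res in residuals) / (n - 2))   # note n - 2, not n
print(round(s_e, 4))
```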

Conditions for inference

- The observations are independent.
- The relationship is indeed linear.
- The standard deviation of y, σ, is the same for all values of x.
- The response y varies Normally around its mean.

Using residual plots to check for regression validity

The residuals (y − ŷ) give useful information about the contribution of individual data points to the overall pattern of scatter. We view the residuals in a residual plot.

If the residuals are scattered randomly around 0 with uniform variation, this indicates that the data fit a linear model, have normally distributed residuals for each value of x, and have constant standard deviation σ.

- Residuals are randomly scattered → good!
- Curved pattern → the relationship is not linear.
- Change in variability across the plot → σ is not equal for all values of x.
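A sketch of how such a residual plot could be drawn (matplotlib assumed available; the data are made-up):

```python
# Sketch: a residual plot for checking the regression conditions.
import statistics
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1, 9.2, 9.8]

r = statistics.correlation(x, y)
b1 = r * statistics.stdev(y) / statistics.stdev(x)
b0 = statistics.mean(y) - b1 * statistics.mean(x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)                # residuals vs. explanatory variable
plt.axhline(0, linestyle="--")           # look for random scatter around this line
plt.xlabel("x")
plt.ylabel("residual (y - y_hat)")
plt.show()
```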

What is the relationship between the average speed a car is driven and its fuel efficiency?

We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log transformed (log of miles per hour, LOGMPH), the new scatterplot shows a positive, linear relationship.

Normal quantile plot for residuals:

The plot is fairly straight, supporting the assumption of normally distributed residuals.

→ Data okay for inference.

Residual plot:

The spread of the residuals is reasonably random, with no clear pattern. The relationship is indeed linear. But we see one low residual (3.8, −4) and one potentially influential point (2.5, 0.5).

Standard Error for the Slope

Three aspects of the scatterplot affect the standard error of the regression slope:

- the spread around the line, s_e
- the spread of the x values, s_x
- the sample size, n.

The formula for the standard error (which you will probably never have to calculate by hand) is:

SE(b₁) = s_e / ( √(n − 1) · s_x )


Confidence interval for β₁

Estimating the regression parameters β₀ and β₁ is a case of one-sample inference with unknown population standard deviation, so we rely on the t distribution, with n − 2 degrees of freedom.

A level C confidence interval for the slope, β₁, is proportional to the standard error of the least-squares slope:

b₁ ± t* SE(b₁)

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

We estimate the standard error of b₁ with

SE(b₁) = s_e / ( √(n − 1) · s_x ), where s_e = √( Σ (yᵢ − ŷᵢ)² / (n − 2) ),

n is the sample size, and s_x is the ordinary standard deviation of the x values.
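A sketch putting these pieces together (made-up data; scipy is assumed available for the t critical value):

```python
# Sketch: SE(b1) and a 95% confidence interval for the slope.
import math
import statistics
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.1, 4.6, 6.2, 6.4, 7.9, 8.4, 9.8, 10.1]
n = len(x)

r = statistics.correlation(x, y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)
b1 = r * s_y / s_x
b0 = statistics.mean(y) - b1 * statistics.mean(x)

s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
se_b1 = s_e / (math.sqrt(n - 1) * s_x)           # standard error of the slope

t_star = stats.t.ppf(0.975, df=n - 2)            # t* for C = 95%, t(n - 2)
ci_low, ci_high = b1 - t_star * se_b1, b1 + t_star * se_b1
print(b1, se_b1, (ci_low, ci_high))
```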

Confidence interval for β₀

A level C confidence interval for the intercept, β₀, is proportional to the standard error of the least-squares intercept:

b₀ ± t* SE(b₀)

The intercept usually isn’t interesting. Most hypothesis tests and confidence intervals for regression are about the slope.

Hypothesis test for the slope

We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn. For that, we can test the hypothesis that the regression slope parameter β₁ is equal to zero:

H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0

slope: b₁ = r (s_y / s_x)

Testing H₀: β₁ = 0 also allows us to test the hypothesis of no correlation between x and y in the population.

Note: A test of hypothesis for β₀ is irrelevant (β₀ is often not even achievable).

Hypothesis test for the slope (cont.)

We usually test the hypothesis H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0, but we can also test H₀: β₁ = 0 vs. Hₐ: β₁ < 0, or H₀: β₁ = 0 vs. Hₐ: β₁ > 0.

To do this we calculate the test statistic

t = (b₁ − 0) / SE(b₁)

and use the t distribution with n − 2 df to find the P-value of the test.

Note: Software typically provides two-sided p-values.
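A sketch of this test with scipy (made-up data): linregress reports the slope, its standard error, and the two-sided P-value for H₀: β₁ = 0.

```python
# Sketch: testing H0: beta1 = 0 with scipy.stats.linregress.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.1, 4.6, 6.2, 6.4, 7.9, 8.4, 9.8, 10.1]

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr          # t = (b1 - 0) / SE(b1), df = n - 2
print(t_stat, res.pvalue)                # res.pvalue is the two-sided P-value
# For a one-sided alternative, halve the P-value when the sign of b1 matches Ha.
```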

Using technology

Computer software runs all the computations for regression analysis. Here is some software output for the car speed/gas efficiency example.

[SPSS output: slope, intercept, p-values for tests of significance, and confidence intervals.]

The t-test for the regression slope is highly significant (p < 0.001). There is a significant relationship between average car speed and gas efficiency.

[Excel output: “intercept” is the intercept, “logmph” is the slope. SAS output: p-values for tests of significance and confidence intervals.]

Confidence Intervals and Prediction Intervals for Predicted Values

Once we have a useful regression, how can we indulge our natural desire to predict, without being irresponsible? Now that we have standard errors, we can use them to construct a confidence interval for the predictions and to report our uncertainty honestly.

An Example: Body Fat and Waist Size

Consider an example that involves investigating the relationship in adult males between % body fat and waist size (in inches). Here is a scatterplot of the data for 250 adult males of various ages.

Confidence Intervals and Prediction Intervals for Predicted Values (cont.)

For our % body fat and waist size example, there are two questions we could ask:

1. Do we want to know the mean % body fat for all men with a waist size of, say, 38 inches?
2. Do we want to estimate the % body fat for a particular man with a 38-inch waist?

The predicted % body fat is the same in both questions, but we can predict the mean % body fat for all men whose waist size is 38 inches with a lot more precision than we can predict the % body fat of a particular individual whose waist size happens to be 38 inches.

Confidence Intervals and Prediction Intervals for Predicted Values (cont.)

- We start with the same prediction in both cases.
- We are predicting for a new individual, one who was not in the original data set. Call his x-value x_ν.
- The regression predicts his % body fat as ŷ_ν = b₀ + b₁x_ν.

Confidence Intervals and Prediction Intervals for Predicted Values (cont.)

Both intervals take the form

ŷ_ν ± t* × SE, with t* from the t(n − 2) distribution.

The SEs will be different for the two questions we have posed.

Confidence Intervals and Prediction Intervals for Predicted Values (cont.)

1. The standard error of the mean predicted value is:

SE(μ̂_ν) = √( SE²(b₁) · (x_ν − x̄)² + s_e²/n )

2. Individuals vary more than means, so the standard error for a single predicted value is larger than the standard error for the mean:

SE(ŷ_ν) = √( SE²(b₁) · (x_ν − x̄)² + s_e²/n + s_e² )

Confidence Intervals and Prediction Intervals for Predicted Values (cont.)

Confidence interval for μ_ν:

ŷ_ν ± t* √( SE²(b₁) · (x_ν − x̄)² + s_e²/n )

Prediction interval for y_ν:

ŷ_ν ± t* √( SE²(b₁) · (x_ν − x̄)² + s_e²/n + s_e² )

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
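A sketch computing both intervals at a chosen x_ν (made-up data, 95% level, scipy assumed for t*):

```python
# Sketch: 95% CI for the mean response and 95% PI for an individual y at x_nu.
import math
import statistics
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 2.9, 4.1, 4.6, 6.2, 6.4, 7.9, 8.4, 9.8, 10.1]
n = len(x)
x_bar = statistics.mean(x)

r = statistics.correlation(x, y)
s_x = statistics.stdev(x)
b1 = r * statistics.stdev(y) / s_x
b0 = statistics.mean(y) - b1 * x_bar
s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
se_b1 = s_e / (math.sqrt(n - 1) * s_x)

x_nu = 5.5                                   # value of x at which we predict
y_hat_nu = b0 + b1 * x_nu
t_star = stats.t.ppf(0.975, df=n - 2)

se_mean = math.sqrt(se_b1**2 * (x_nu - x_bar)**2 + s_e**2 / n)            # for mu_nu
se_indiv = math.sqrt(se_b1**2 * (x_nu - x_bar)**2 + s_e**2 / n + s_e**2)  # for y_nu

ci = (y_hat_nu - t_star * se_mean, y_hat_nu + t_star * se_mean)    # mean response
pi = (y_hat_nu - t_star * se_indiv, y_hat_nu + t_star * se_indiv)  # single new y
print(ci, pi)
```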

Confidence Intervals for Predicted Values

Here’s a look at the difference between predicting for a mean and predicting for an individual. The solid green lines near the regression line show the 95% confidence intervals for the mean predicted value, and the dashed red lines show the prediction intervals for individuals. Both sets of lines curve away from the least-squares line as x moves farther away from x̄.

More on confidence intervals for μ_ν

As seen on the preceding slides, we can calculate a confidence interval for the population mean of all responses y when x takes the value x_ν (within the range of data tested): denote this expected value E(y_ν) by μ_ν. This interval is centered on ŷ_ν, the unbiased estimate of μ_ν.

The true value of the population mean μ_ν at a particular value x_ν will indeed be within our confidence interval in C% of all intervals calculated from many different random samples.

The level C confidence interval for the mean response μ_ν at a given value x_ν of x is centered on ŷ_ν (the unbiased estimate of μ_ν):

ŷ_ν ± t* SE(μ̂_ν)

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

A separate confidence interval is calculated for μ_ν along all the values that x_ν takes. Graphically, the series of confidence intervals is shown as a continuous interval on either side of ŷ_ν.

[Figure: 95% confidence bands for μ_ν as x_ν varies over all x values.]

More on prediction intervals for y_ν

One use of regression is for predicting the value of y, ŷ, for any value of x within the range of data tested: ŷ = b₀ + b₁x. But the regression equation depends on the particular sample drawn. More reliable predictions require statistical inference.

To estimate an individual response y_ν for a given value of x, we use a prediction interval. If we randomly sampled many times, there would be many different values of y obtained for a particular x_ν, following N(0, σ) around the mean response μ_ν.

The level C prediction interval for a single observation on y when x takes the value x_ν is:

ŷ_ν ± t* SE(ŷ_ν)

t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

The prediction interval mainly reflects the error from the normal distribution of the residuals εᵢ. Graphically, the series of prediction intervals is shown as a continuous interval on either side of ŷ.

[Figure: 95% prediction interval for ŷ as x varies over all x values.]

- The confidence interval for μ_ν contains, with C% confidence, the population mean μ_ν of all responses at a particular value x_ν.
- The prediction interval contains C% of all the individual values taken by y at a particular value x_ν.

[Figure: 95% prediction interval for y and 95% confidence interval for μ_ν.]

Estimating μ_ν uses a smaller confidence interval than estimating an individual response in the population (the sampling distribution is narrower than the population distribution).

1918 flu epidemics

1918 influenza epidemic data:

Week:     1,   2,    3,    4,    5,    6,   7,   8,  9,  10,   11,   12,   13,   14,   15,   16,   17
# Cases:  36, 531, 4233, 8682, 7164, 2229, 600, 164, 57, 722, 1517, 1828, 1539, 2416, 3148, 3465, 1440
# Deaths:  0,   0,  130,  552,  738,  414, 198,  90, 56,  50,   71,  137,  178,  194,  290,  310,  149

[Plots: weekly counts of new cases and deaths over weeks 1 through 17.]

1918 flu epidemic: Relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

EXCEL regression output:

Regression Statistics: Multiple R = 0.911; R Square = 0.830; Adjusted R Square = 0.82; Standard Error (s) = 85.07; Observations = 16. (r = 0.91)

              Coefficients   St. Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept        49.292        29.845     1.652    0.1209     −14.720     113.304
FluCases0         0.072         0.009     8.263    0.0000       0.053       0.091

The FluCases0 row gives b₁, SE(b₁), and the P-value for H₀: β₁ = 0.

The P-value is very small → reject H₀ → β₁ is significantly different from 0. There is a significant relationship between the number of flu cases and the number of deaths from flu a week later.
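As a check, here is a sketch that reruns this regression from the table above with scipy (deaths in a given week are paired with the cases diagnosed one week earlier, giving 16 observations); it should roughly reproduce the Excel output.

```python
# Sketch: reproducing the flu regression (deaths vs. cases one week earlier).
from scipy import stats

cases = [36, 531, 4233, 8682, 7164, 2229, 600, 164, 57, 722,
         1517, 1828, 1539, 2416, 3148, 3465, 1440]
deaths = [0, 0, 130, 552, 738, 414, 198, 90, 56, 50,
          71, 137, 178, 194, 290, 310, 149]

# Lag: cases from weeks 1-16 paired with deaths from weeks 2-17 (16 observations).
res = stats.linregress(cases[:-1], deaths[1:])
# Expected to be close to the slide's Excel output: slope ~ 0.072, intercept ~ 49.3,
# r ~ 0.91, with a very small P-value for the slope.
print(res.slope, res.intercept, res.rvalue, res.pvalue)
```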

[SPSS output: least-squares regression line with the 95% prediction interval for y and the 95% confidence interval for the mean response.]

Confidence interval for the mean weekly death count one week after x_ν = 4000 flu cases are diagnosed: μ_ν within about 300–380 deaths.

Prediction interval for a weekly death count one week after x_ν = 4000 flu cases are diagnosed: y_ν within about 180–500 deaths.

What is this?

A 90% prediction interval for the height (above) and a 90% prediction interval for the weight (below) of male children, ages 3 to 18.