Linear regression


Objectives

2.3 Least-squares regression

- Regression lines
- Prediction and Extrapolation
- Correlation and r²
- Transforming relationships

Adapted from authors’ slides © 2012 W.H. Freeman and Company

Straight Line Regression

A regression is a formula that describes how a response variable y changes (on average) as an explanatory variable x changes.

We often use a regression line to predict the value of y for a given value of x. The predicted value of y is denoted ŷ (read "y hat"). In regression, the distinction between explanatory and response variables is important.

A straight line regression has the form

ŷ = b0 + b1 x

where y is the observed value, ŷ is the predicted y value (y hat), b1 is the slope, and b0 is the y-intercept.

The least squares regression line

The least-squares regression line is the unique line such that the sum of the squared vertical (y) differences between the data points and the line is as small as possible. (These differences are called residuals.) This is the line that best predicts y from x (not the other way).

Distances between the points and the line are squared so that they are all positive values.

How to compute the slope and intercept:

First we calculate the slope of the line, b1, from statistics we have already computed:

b1 = r (sy / sx)

where r is the correlation, sy is the standard deviation of the response variable y, and sx is the standard deviation of the explanatory variable x.

Interpretation: b1 is the (average) change in the value of y when x is changed by 1 unit.

Once we know b1, we can calculate b0, the y-intercept, from the means of the x and y variables:

b0 = ȳ − b1 x̄

Interpretation: b0 is the predicted value when x = 0 (although this value of x is not always meaningful).
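To see the formulas in action, here is a minimal sketch (not part of the original slides) on a small made-up data set; it computes b1 and b0 from r, the standard deviations, and the means, and checks the result against numpy's own least-squares fit.

```python
# A minimal sketch of the slope/intercept formulas; the numbers are illustrative only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # explanatory variable
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8])   # response variable

r = np.corrcoef(x, y)[0, 1]                # correlation
s_x = x.std(ddof=1)                        # sample standard deviation of x
s_y = y.std(ddof=1)                        # sample standard deviation of y

b1 = r * s_y / s_x                         # slope:     b1 = r * (sy / sx)
b0 = y.mean() - b1 * x.mean()              # intercept: b0 = ybar - b1 * xbar

# np.polyfit minimizes the sum of squared vertical distances,
# so it should agree with the formulas above.
b1_check, b0_check = np.polyfit(x, y, 1)
print(b1, b0)
print(b1_check, b0_check)
```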

Different scale, different intercept and slope

- It is important to note that if we change the scale of either the x or y axis, the slope and intercept will also change. To see this, open the yearly temperature data (which can be downloaded from Dostat) in StatCrunch and observe the effect of changing year to a different scale (note that the choice of year is rather arbitrary: why use 2012 and not 10008?).

- Do not be fooled by the size of the slope. If it is small (whatever that means), that does not mean the relationship is insignificant or that the correlation is small. The size of the slope depends on the scaling we use, as the sketch below illustrates; the significance depends on the standard error (more on this later).
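The following sketch (made-up yearly temperature numbers, not the Dostat data) shows that re-expressing year on a different scale changes the slope and intercept while the correlation stays the same.

```python
# A sketch of how rescaling x changes the fitted slope and intercept but not r.
import numpy as np

year = np.array([2008, 2009, 2010, 2011, 2012], dtype=float)
temp = np.array([10.2, 10.5, 10.1, 10.9, 11.0])

b1_a, b0_a = np.polyfit(year, temp, 1)                   # x in calendar years
b1_b, b0_b = np.polyfit((year - 2000) / 10.0, temp, 1)   # x in decades since 2000

print(b1_a, b0_a)   # small slope per year, large negative intercept
print(b1_b, b0_b)   # slope is 10 times larger (per decade), intercept changes too

print(np.corrcoef(year, temp)[0, 1])                     # correlation is unchanged
print(np.corrcoef((year - 2000) / 10.0, temp)[0, 1])
```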

Efficiency of a biofilter, by temperature

In StatCrunch: Stat-Regression-Simple Linear.

For every degree that temperature goes up, the efficiency can be expected to increase by b1 = 0.0757 units. The predicted efficiency when temperature equals 10 is obtained by substituting x = 10 into the fitted line, ŷ = b0 + b1(10).

Relationship between ozone and carbon pollutants

In StatCrunch: Stat-Regression-Simple Linear.

For each unit that carbon goes up, ozone can be expected to increase by b1 = 0.005708 units. The predicted ozone level when carbon equals 15 is obtained by substituting x = 15 into the fitted line. However, the relationship is not strong, so the prediction may not be all that accurate.
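As a small sketch of how such a prediction is read off the fitted line: the slope below is the one quoted above, but the intercept is a placeholder, since the slides do not reproduce it here; replace it with the value from the StatCrunch output.

```python
# Reading a prediction off the fitted line y-hat = b0 + b1 * x.
def predict(x, b0, b1):
    """Return the predicted value y-hat = b0 + b1 * x."""
    return b0 + b1 * x

b1 = 0.005708   # slope from the ozone-vs-carbon output (as quoted on the slide)
b0 = 0.0        # hypothetical intercept; substitute the real value from StatCrunch

print(predict(15, b0, b1))   # predicted ozone level at carbon = 15
```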

Categorical variables in scatterplots

Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends.

What may look like a positive linear relationship is in fact a series of negative linear associations.

Here, the habitat for each observation is a lurking variable.

Plotting data points from different habitats in different colors allows us to make that important distinction.

Relationship between lean body mass and metabolic rate in men and women.

Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.

Comparison of racing records over time for men and women.

Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.
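Here is a sketch (not from the slides) of the "plot each group in its own color" idea with matplotlib; the numbers and group structure are made up to mimic the lean-body-mass example.

```python
# Plotting groups in different colors so a lurking categorical variable is visible.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mass_w = rng.uniform(35, 55, 30)                      # lean body mass, women
mass_m = rng.uniform(50, 75, 30)                      # lean body mass, men
rate_w = 700 + 20 * mass_w + rng.normal(0, 40, 30)    # metabolic rate, women
rate_m = 700 + 20 * mass_m + rng.normal(0, 80, 30)    # metabolic rate, men

plt.scatter(mass_w, rate_w, color="tab:red", label="women")
plt.scatter(mass_m, rate_m, color="tab:blue", label="men")
plt.xlabel("lean body mass")
plt.ylabel("metabolic rate")
plt.legend()
plt.show()
```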

Correlation versus regression

The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship.

In regression we examine the variation in the response variable (y) given the explanatory variable (x).

Coefficient of determination, R²

R² is called the coefficient of determination. R² represents the percentage of the variation of y that can be explained by the prediction from x. (That is, it is the amount of vertical scatter from the regression line relative to the overall vertical scatter.)

R² is meaningful for any fit of the response variable to one or more explanatory variables. In the case of a straight-line fit only, however, R² = r², where r is the correlation coefficient (positive or negative).
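The identity R² = r² for a straight-line fit can be checked directly. Here is a minimal sketch on made-up numbers (not from the slides), computing R² from the residual and total sums of squares and comparing it with the squared correlation.

```python
# Verifying R-squared = r-squared for a straight-line fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 3.1, 3.9, 5.2, 5.8, 7.1])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)        # vertical scatter about the regression line
ss_tot = np.sum((y - y.mean()) ** 2)     # overall vertical scatter about the mean of y
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)                 # the two numbers should match
```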

The r-squared and the linear model

- The basic idea in statistical modeling is to fit the simplest model that best explains the data. In the case of simple regression this means fitting a line through the points. The only model simpler than that is a flat line (slope of zero) through the points. The R-squared is a way of measuring the gain from fitting a sloped line rather than a flat line.
- If the R-squared is 1, then the residuals are zero and y is completely determined by x (the linear model is best). If the R-squared is zero, then the constant model is best. Usually the R-squared is somewhere in between.
- Note that the value of the slope can be very small (say 0.00003), but the R-squared can still be one. A sketch illustrating this appears below.
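To make the flat-line comparison concrete, here is a minimal sketch on made-up, noise-free data; it also shows that a tiny slope can coexist with an R-squared of 1.

```python
# Comparing the sloped-line fit with the flat (constant) model.
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 5.0 + 0.00003 * x                        # an exact line with a very small slope

b1, b0 = np.polyfit(x, y, 1)
ss_res = np.sum((y - (b0 + b1 * x)) ** 2)    # residuals from the sloped line
ss_flat = np.sum((y - y.mean()) ** 2)        # residuals from the flat line (slope 0)

r_squared = 1 - ss_res / ss_flat
print(b1)         # slope is about 0.00003
print(r_squared)  # R-squared is (numerically) 1: the line explains all the variation
```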

Efficiency of a biofilter, by temperature

R² = 79.4% is the proportion of the variation in Efficiency that is explained by the straight-line regression on Temperature.

Relationship between ozone and carbon pollutants

In StatCrunch: Stat-Regression-Simple Linear.

R² = 44.7% is the proportion of the variation in Ozone Level that is explained by the straight-line regression on Carbon Pollutant Level.

The distinction between explanatory and response variables is crucial in regression. If you exchange y for x in calculating the regression line, you will get a different line, which is a predictor of x for a given value of y. This is because the least-squares regression of y on x is concerned with the distances of all points from the line in the y direction only.

Here is a plot of Hubble telescope data about galaxies moving away from Earth. The solid line is the best prediction of y = velocity from x = distance. The dotted line is the best prediction of x = distance from y = velocity.
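The sketch below (made-up numbers, not the actual Hubble measurements) shows that regressing y on x and regressing x on y really do give two different lines.

```python
# Regression of y on x versus regression of x on y.
import numpy as np

rng = np.random.default_rng(1)
distance = rng.uniform(0, 2, 24)                      # x
velocity = 400 * distance + rng.normal(0, 150, 24)    # y, with scatter

# Regression of y on x (minimizes vertical distances).
b1_yx, b0_yx = np.polyfit(distance, velocity, 1)

# Regression of x on y (minimizes horizontal distances), re-expressed
# as a line on the same axes: y = -(a0/a1) + (1/a1) * x.
a1, a0 = np.polyfit(velocity, distance, 1)
b1_xy, b0_xy = 1.0 / a1, -a0 / a1

print(b1_yx, b0_yx)   # the solid line (predicting velocity from distance)
print(b1_xy, b0_xy)   # the dotted line on the same axes: a different line
```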

Examples

- For example, if you want to predict the girth of an 8-week-old calf given its 8-week weight, then the response variable is the girth (it should be on the y-axis) and the explanatory variable is the weight (on the x-axis).
- Another example is trying to predict your midterm scores. It makes sense to predict your midterm 3 score from your midterm 2 score. Thus midterm 3 is the response variable (on the y-axis) and midterm 2 is the explanatory variable (on the x-axis).

Residuals

The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter.

These distances are called residuals, because they are what is "left over" after fitting the line:

residual = observed y − predicted ŷ

Points above the line have a positive residual; points below the line have a negative residual. The sum of the residuals is always 0.
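A short sketch (made-up numbers) that computes the residuals from a least-squares fit and checks that they sum to essentially zero:

```python
# Residuals from a least-squares line sum to (essentially) zero.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.1, 4.8, 6.2, 7.9, 10.1])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)    # observed y minus predicted y-hat

print(residuals)                 # positive above the line, negative below
print(residuals.sum())           # about 0 (up to floating-point error)
```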

Residual plots

Residuals are the differences between y-observed and y-predicted. We plot them in a residual plot, which plots residuals vs. x.

If the data are best predicted simply by using a straight line, then the residuals will be scattered randomly above and below 0.

Only the y-axis is different: the x-axis in a residual plot is the same as on the scatterplot.

- Constant mean and spread: the residuals are randomly scattered. Good!
- Non-constant mean: a curved pattern means the form of relationship you are fitting (e.g., a straight line) is not appropriate.
- Non-constant spread: a change in variability across the plot indicates that the response variable is less predictable for some values of x than for others. This can affect the accuracy of statistical inference.
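Here is a minimal sketch (not from the slides, made-up data) of building a residual plot with matplotlib: residuals against x, with a reference line at 0.

```python
# A basic residual plot: residuals vs. x.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 0.6, x.size)   # roughly linear data

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0, color="gray")                     # reference line at residual = 0
plt.xlabel("x (same axis as the scatterplot)")
plt.ylabel("residual")
plt.show()
```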

Outliers and influential points

Outlier: an observation that lies outside the overall pattern of observations.

Influential observation: an observation that markedly changes the regression if removed. This is often an outlier on the x-axis.

Child 19 is an outlier in the y direction; child 18 is an outlier in the x direction. Child 19 is an outlier of the relationship. Child 18 is only an outlier in the x direction and thus might be an influential point.

Are these points influential? Compare the regression line fitted to all the data with the lines obtained without child 18 and without child 19. The sketch below illustrates this kind of comparison.
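A small sketch (made-up numbers, not the child data) of checking influence by refitting without a suspect point:

```python
# Checking influence: refit the line without the suspect point and compare slopes.
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0])   # last point is an x-direction outlier
y = np.array([2.0, 2.6, 2.9, 3.3, 3.4, 4.1, 2.0])

slope_all, _ = np.polyfit(x, y, 1)
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)

# A large change in the slope after removing one point marks it as influential.
print(slope_all, slope_without)
```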

Always plot your data

A correlation coefficient and a regression line can be calculated for any relationship between two quantitative variables. However, outliers greatly influence the results, and running a linear regression on a nonlinear association is not only meaningless but misleading. So make sure to always plot your data before you run a correlation or regression analysis.

Anscombe’s examples:

The four data sets below were constructed so that they each have correlation r = 0.816, and the regression lines are all approximately ŷ = 3 + 0.5x. For all four sets, we would predict ŷ = 8 when x = 10.

Anscombe’s examples:

The four scatterplots show that the correlation or regression analysis is not appropriate for just any data set with two numerical variables.

A moderate linear association. A straight-line regression is OK. Statistical inference for simple linear regression (SLR) is OK.

An obviously nonlinear relationship. A straight line regression is not OK. Fit a different curve.

One point deviates from the highly linear pattern. This influential outlier must be examined closely before proceeding.

Just one very influential point; all other points have the same x value. What experiment was conducted here?
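If you want to check Anscombe's numbers yourself, here is a sketch that assumes seaborn is installed and a network connection is available (seaborn's load_dataset fetches its bundled example data online).

```python
# Same correlation and same fitted line for all four Anscombe data sets.
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")            # columns: dataset, x, y
for name, grp in df.groupby("dataset"):
    r = np.corrcoef(grp["x"], grp["y"])[0, 1]
    b1, b0 = np.polyfit(grp["x"], grp["y"], 1)
    # Each set should give r close to 0.816 and a line close to y-hat = 3 + 0.5x,
    # even though the scatterplots look completely different.
    print(name, round(r, 3), round(b0, 2), round(b1, 2))
```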

Vocabulary: lurking vs. confounding

- A lurking variable is a variable that is not among the explanatory or response variables in the analysis and yet, if observed and considered, may influence the interpretation of relationships among those variables.
- Two variables are confounded when their effects on a response variable cannot be distinguished (statistically) from each other. The confounded variables can be explanatory variables or lurking variables.

Association is not causation. Even if a statistical association is very strong, this is not by itself good evidence that a change in x will cause a change in y. The association would be just as strong if we reversed the roles of x and y.

Cautions before rushing into a correlation or a regression analysis

- Do not use a regression on inappropriate data. Warning signs are a clear pattern in the residuals, the presence of large outliers, and clustered data falsely appearing linear. Use residual plots for help in seeing these.
- Beware of lurking variables.
- Avoid extrapolating (predicting beyond the values in the data set); a short sketch of this danger follows the list.
- Recognize when the correlation/regression is being performed on values that are averages of another variable.
- An observed relationship in the data, however strong it is, does not imply causation just on its own.
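The sketch below (made-up data) illustrates the extrapolation caution: the straight-line fit describes the observed x range well, but far outside that range it can be badly wrong about the underlying process.

```python
# Why extrapolation is risky: the line keeps climbing where the process does not.
import numpy as np

x_obs = np.linspace(1, 5, 20)
y_obs = 10 * np.log(x_obs) + 2          # a curve that looks nearly linear over x = 1..5

b1, b0 = np.polyfit(x_obs, y_obs, 1)    # straight-line fit to the observed range

x_new = 25.0                            # far beyond the observed x values
print(b0 + b1 * x_new)                  # the line keeps climbing steadily
print(10 * np.log(x_new) + 2)           # the curve has flattened out well below that
```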