Transcript Slide 1
12
Simple Linear Regression and Correlation
Here, we have two quantitative variables for each of 16 students: 1) how many beers they drank, and 2) their blood alcohol content (BAC). We are interested in the relationship between the two variables: how is one affected by changes in the other?

Student | Beers | BAC
6       | 5     | 0.100
7       | 2     | 0.030
9       | 9     | 0.190
11      | 7     | 0.095
1       | 3     | 0.070
2       | 3     | 0.020
3       | 4     | 0.070
13      | 5     | 0.085
4       | 8     | 0.120
5       | 3     | 0.040
8       | 5     | 0.060
10      | 5     | 0.050
12      | 6     | 0.100
14      | 7     | 0.090
15      | 1     | 0.010
16      | 4     | 0.050
Associations Between Variables
When you examine the relationship between two variables, a new question becomes important:
1. Is your purpose simply to explore the nature of the relationship?
2. Do you wish to show that one of the variables can explain variation in the other?

A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.
Looking at relationships
- Start with a graph
- Look for an overall pattern and deviations from the pattern
- Use numerical descriptions of the data and overall pattern (if appropriate)
Scatterplots
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph. Here, the beers/BAC data for the 16 students from the earlier table are plotted.
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern:
- Form: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the "form"
Form and direction of an association
[Figure: three example scatterplots — linear, no relationship, nonlinear.]

Positive association: High values of one variable tend to occur together with high values of the other variable.

Negative association: High values of one variable tend to occur together with low values of the other variable.

No relationship: X and Y vary independently. Knowing X tells you nothing about Y.
Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.

With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values.
This is a weak relationship: for a particular state median household income, you can't predict the state per capita income very well.

This is a very strong relationship: the daily amount of gas consumed can be predicted quite accurately for a given temperature value.
The correlation coefficient "r"
The correlation coefficient is a measure of the direction and strength of a linear relationship. It is calculated using the mean and the standard deviation of both the x and y variables. Correlation can only be used to describe quantitative variables; categorical variables don't have means and standard deviations.
The correlation coefficient "r"
r = (1/(n − 1)) Σ [(xi − x̄)/sx] · [(yi − ȳ)/sy], summed over i = 1, …, n

where x̄ and ȳ are the sample means and sx and sy the sample standard deviations.

You DON'T want to do this by hand. Make sure you learn how to use your calculator or software.
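As the slide says, software does this for you. Here is a minimal sketch (using only numpy, which is an assumption of this illustration) that applies the formula above to the beers/BAC data from the first table:

```python
import numpy as np

# Beers/BAC data for the 16 students from the slide
beers = np.array([5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4], dtype=float)
bac = np.array([0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
                0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05])

n = len(beers)
# Standardize each variable (ddof=1 gives the sample standard deviation)
zx = (beers - beers.mean()) / beers.std(ddof=1)
zy = (bac - bac.mean()) / bac.std(ddof=1)
# r = 1/(n-1) * sum of the products of standardized values
r = (zx * zy).sum() / (n - 1)

print(round(r, 3))
```

The result agrees with `np.corrcoef(beers, bac)[0, 1]`, which is what you would normally call in practice.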
"r" ranges from -1 to +1
"r" quantifies the strength and direction of a linear relationship between 2 quantitative variables.

Strength: how closely the points follow a straight line.

Direction: positive when individuals with higher X values tend to have higher values of Y.
Correlation only describes linear relationships
No matter how strong the association, r does not describe curved relationships.
Note: You can sometimes transform a non-linear association to a linear form, for instance by taking the logarithm. You can then calculate a correlation using the transformed data.
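The log-transform note above can be sketched numerically. This is a minimal illustration with made-up exponential data (an assumption of the example, not data from the slides):

```python
import numpy as np

# Hypothetical curved relationship: y grows exponentially with x
x = np.linspace(1.0, 5.0, 30)
y = np.exp(0.8 * x)   # perfectly curved, no noise

r_raw = np.corrcoef(x, y)[0, 1]          # strong, but a line is the wrong form
r_log = np.corrcoef(x, np.log(y))[0, 1]  # log(y) = 0.8x is exactly linear

print(round(r_raw, 3), round(r_log, 3))
```

After the transform the correlation is exactly 1, because log(y) is a perfect linear function of x here; with real data it would merely get much closer to ±1.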
Explanatory and response variables
A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable.

Typically, the explanatory or independent variable is plotted on the x axis, and the response or dependent variable is plotted on the y axis.
[Figure: scatterplot "Blood Alcohol as a function of Number of Beers." Explanatory (independent) variable: number of beers, on the x axis (0 to 10). Response (dependent) variable: blood alcohol content, on the y axis (0.00 to 0.20).]
Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables.

In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description.
But which line best describes our data?
The regression line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. In regression, the distinction between explanatory and response variables is important.
The regression line
The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is as small as possible. Distances between the points and the line are squared so all are positive values; this is done so that distances can be properly added (Pythagoras).
Properties
The least-squares regression line can be shown to have this equation:

ŷ = b0 + b1x

where ŷ is the predicted y value (y hat), b1 is the slope, and b0 is the y-intercept.
How to:

First we calculate the slope of the line, b1; from statistics we already know:

b1 = r · (sy / sx)

where r is the correlation, sy is the standard deviation of the response variable y, and sx is the standard deviation of the explanatory variable x.
Once we know b1, the slope, we can calculate b0, the y-intercept:

b0 = ȳ − b1x̄

where x̄ and ȳ are the means of the x and y variables.
The equation completely describes the regression line. To plot the regression line you only need to plug two x values into the equation, get the corresponding y values, and draw the line that goes through those two points.

Hint: The regression line always passes through the mean of x and y.

The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence).
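The two formulas above can be sketched directly on the beers/BAC data (a numpy-based illustration, not the slide's actual software output):

```python
import numpy as np

# Beers/BAC data for the 16 students from the earlier table
beers = np.array([5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4], dtype=float)
bac = np.array([0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
                0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05])

r = np.corrcoef(beers, bac)[0, 1]
b1 = r * bac.std(ddof=1) / beers.std(ddof=1)   # slope: b1 = r * sy / sx
b0 = bac.mean() - b1 * beers.mean()            # intercept: b0 = ybar - b1 * xbar

print(round(b1, 4), round(b0, 4))
```

As the hint says, the fitted line passes through (x̄, ȳ): plugging `beers.mean()` into the equation returns `bac.mean()` exactly.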
Making predictions

The equation of the least-squares regression line allows you to predict y for any x within the range studied:

ŷ = 0.0144x + 0.0008

Nobody in the study drank 6.5 beers, but by plugging in x = 6.5 we would expect a blood alcohol content of about 0.094 mg/ml:

ŷ = 0.0144 × 6.5 + 0.0008 = 0.0936 + 0.0008 = 0.0944 mg/ml
[Figure: scatterplot with fitted line ŷ = 0.125x + 41.4.]

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different sample, different plot.
Now we want to describe the population mean response μy as a function of the explanatory variable x:

μy = β0 + β1x

And we want to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).
Statistical model for linear regression
In the population, the linear regression equation is μy = β0 + β1x.

Sample data then fit the model:

Data = fit + residual
yi = (β0 + β1xi) + (εi)

where the εi are independent and Normally distributed N(0, σ).

Linear regression assumes equal variance of y (σ is the same for all values of x).
Estimating the parameters of μy = β0 + β1x

The intercept β0, the slope β1, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.

The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μy) for a given value of x.
The least-squares regression line (ŷ = b0 + b1x) obtained from sample data is the best estimate of the true population regression line (μy = β0 + β1x):
- ŷ is an unbiased estimate of the mean response μy
- b0 is an unbiased estimate of the intercept β0
- b1 is an unbiased estimate of the slope β1
The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εi around the mean μy.

The regression standard error, s, for n sample data points is calculated from the residuals (yi − ŷi):

s = √( Σ residual² / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) )

s is an unbiased estimate of the regression standard deviation σ.
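The formula for s can be sketched on the beers/BAC data, using numpy's least-squares fit for the residuals (an illustration, not the slide's own output):

```python
import numpy as np

# Beers/BAC data for the 16 students
beers = np.array([5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4], dtype=float)
bac = np.array([0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
                0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05])

b1, b0 = np.polyfit(beers, bac, 1)   # least-squares slope and intercept
resid = bac - (b0 + b1 * beers)      # residuals y_i - yhat_i
n = len(beers)

# Regression standard error: s = sqrt( sum(residual^2) / (n - 2) )
s = np.sqrt((resid ** 2).sum() / (n - 2))
print(round(s, 4))
```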
Conditions for inference
- The observations are independent.
- The relationship is indeed linear.
- The standard deviation of y, σ, is the same for all values of x.
- The response y varies normally around its mean.
Confidence interval for regression parameters
Estimating the regression parameter β1 is a case of one-sample inference with unknown population variance. We rely on the t distribution, with n − 2 degrees of freedom.

A level C confidence interval for the slope, β1, is:

b1 ± t* SEb1

where SEb1 is the standard error of the least-squares slope, and t* is the t critical value for the t(n − 2) distribution with area C between −t* and +t*.
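A sketch of this interval in Python, assuming scipy is available: `scipy.stats.linregress` reports the slope and its standard error, and `scipy.stats.t.ppf` gives the critical value t*.

```python
import numpy as np
from scipy import stats

# Beers/BAC data for the 16 students
beers = np.array([5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4], dtype=float)
bac = np.array([0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
                0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05])

fit = stats.linregress(beers, bac)            # slope, intercept, r, p-value, SE of slope
n = len(beers)
t_star = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # critical value for C = 95%

# 95% confidence interval for the slope: b1 ± t* SEb1
ci = (fit.slope - t_star * fit.stderr, fit.slope + t_star * fit.stderr)
print(ci)
```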
Significance test for the slope
We can test the hypothesis H0: β1 = 0 versus a one- or two-sided alternative.

We calculate t = b1 / SEb1, which has the t(n − 2) distribution, to find the p-value of the test.
Testing the hypothesis of no relationship
We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn. For that, we can test the hypothesis that the regression slope parameter β1 is equal to zero:

H0: β1 = 0 vs. Ha: β1 ≠ 0

Since the slope b1 = r · (sy / sx), testing H0: β1 = 0 also allows us to test the hypothesis of no correlation between x and y in the population.
Calculations for regression inference
To estimate the parameters of the regression, we calculate the standard errors for the estimated regression coefficients. The standard error of the least-squares slope b1 is:

SEb1 = s / √( Σ (xi − x̄)² )

where s is the regression standard error.
What is the relationship between the average speed a car is driven and its fuel efficiency?

We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log transformed (log of miles per hour, LOGMPH) the new scatterplot shows a positive, linear relationship.

Using technology

Computer software runs all the computations for regression analysis. Here is some software output for the car speed/gas efficiency example.

[JMP output: slope, intercept, standard errors, and p-values for the tests of significance.]

The t-test for the regression slope is highly significant (p < 0.0001). There is a significant relationship between average car speed and gas efficiency.
13.4
Multiple Regression Analysis
Copyright © Cengage Learning. All rights reserved.
Population multiple regression equation
Up to this point we have considered, in detail, the linear regression model in one explanatory variable x:

ŷ = b0 + b1x

Usually more complex linear models are needed in practical situations. There are many problems in which knowledge of more than one explanatory variable is necessary in order to obtain a better understanding and better prediction of a particular response.

In multiple regression, the response variable y depends on p explanatory variables x1, x2, …, xp:

ŷ = b0 + b1x1 + b2x2 + … + bp xp
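A minimal sketch of fitting such a model by least squares with numpy (the data here are made up purely for illustration):

```python
import numpy as np

# Illustrative data with two explanatory variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix: a column of ones (for b0) plus one column per explanatory variable
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares fit: b = [b0, b1, b2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
print(b.round(3))
```

Statistical software does exactly this (plus the standard errors and tests discussed below) behind the scenes.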
Data for multiple regression
The data for a simple linear regression problem consist of n pairs (xi, yi). Data for multiple linear regression consist of the value of a response variable y and p explanatory variables (x1, x2, …, xp) on each of n cases. We write the data and enter them in the form:

Case | x1  | x2  | … | xp  | y
1    | x11 | x12 | … | x1p | y1
2    | x21 | x22 | … | x2p | y2
…    | …   | …   | … | …   | …
n    | xn1 | xn2 | … | xnp | yn
We have data on 224 first-year computer science majors at a large university in a given year. The data for each student include:
* Cumulative GPA after 2 semesters at the university (y, response variable)
* SAT math score (SATM, x1, explanatory variable)
* SAT verbal score (SATV, x2, explanatory variable)
* Average high school grade in math (HSM, x3, explanatory variable)
* Average high school grade in science (HSS, x4, explanatory variable)
* Average high school grade in English (HSE, x5, explanatory variable)

Case | SATM | SATV | … | HSE | GPA
1    | 720  | 700  | … | 9   | 3.8
2    | 590  | 350  | … | 6   | 2.6
…    | …    | …    | … | …   | …
224  | 550  | 490  | … | 7   | 3.0
Multiple linear regression model
For "p" explanatory variables, we can express the population mean response (μy) as a linear equation:

μy = β0 + β1x1 + … + βp xp

The statistical model for n sample data points (i = 1, 2, …, n) is then:

Data = fit + residual
yi = (β0 + β1x1i + … + βp xpi) + (εi)

where the εi are independent and normally distributed N(0, σ). Multiple linear regression assumes equal variance σ² of y. The parameters of the model are β0, β1, …, βp.
Estimation of the parameters
We select a random sample of n individuals on which p + 1 variables were measured (x1, …, xp, y). The least-squares regression method minimizes the sum of squared deviations ei (= yi − ŷi) to express y as a linear function of the p explanatory variables:

ŷi = b0 + b1x1i + … + bp xpi

As with simple linear regression, the constant b0 is the y intercept. ŷ is an unbiased estimate of the mean response μy, and b0, …, bp are unbiased estimates of the population parameters β0, …, βp.
Confidence interval for βj

Estimating the regression parameters β0, …, βj, …, βp is a case of one-sample inference with unknown population variance. We rely on the t distribution, with n − p − 1 degrees of freedom.

A level C confidence interval for βj is:

bj ± t* SEbj

- SEbj is the standard error of bj; we rely on software to obtain it.
- t* is the t critical value for the t(n − p − 1) distribution with area C between −t* and +t*.
Significance test for βj

To test the hypothesis H0: βj = 0 versus a one- or two-sided alternative, we calculate the t statistic t = bj / SEbj, which has the t(n − p − 1) distribution, to find the p-value of the test.
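A sketch of where these t statistics come from, using the standard least-squares algebra on made-up data (in practice software reports them directly; the data and seed here are illustrative assumptions):

```python
import numpy as np

# Illustrative data: n = 30 cases, p = 2 explanatory variables,
# where x2 truly has no effect on y
rng = np.random.default_rng(0)
n, p = 30, 2
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x1 + 0.0 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)       # least-squares coefficients
resid = y - X @ b
s2 = (resid ** 2).sum() / (n - p - 1)       # MSE, estimate of sigma^2

# SE of each coefficient comes from s^2 * diag((X'X)^-1)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = b / se                                  # t = b_j / SE_bj, df = n - p - 1
print(t.round(2))
```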
ANOVA F-test for multiple regression
For a multiple linear relationship the ANOVA tests the hypotheses

H0: β1 = β2 = … = βp = 0   versus   Ha: H0 not true

by computing the F statistic: F = MSM / MSE.

When H0 is true, F follows the F(p, n − p − 1) distribution. The p-value is P(F > f).

A significant p-value doesn't mean that all p explanatory variables have a significant influence on y, only that at least one does.
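The F statistic above can be sketched from the sums of squares directly (made-up data; scipy is assumed available for the tail area):

```python
import numpy as np
from scipy import stats

# Illustrative data: n = 40 cases, p = 2 explanatory variables
rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.3, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
ssm = ((y_hat - y.mean()) ** 2).sum()   # model sum of squares
sse = ((y - y_hat) ** 2).sum()          # error sum of squares

F = (ssm / p) / (sse / (n - p - 1))     # F = MSM / MSE
p_value = stats.f.sf(F, p, n - p - 1)   # tail area above F
print(round(F, 1), p_value)
```

Note that SST = SSM + SSE holds exactly for a least-squares fit with an intercept, which is what the ANOVA table below relies on.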
ANOVA table for multiple regression
Source | Sum of squares SS | df        | Mean square MS | F       | P-value
Model  | Σ(ŷi − ȳ)²        | p         | SSM/DFM        | MSM/MSE | tail area above F
Error  | Σ(yi − ŷi)²       | n − p − 1 | SSE/DFE        |         |
Total  | Σ(yi − ȳ)²        | n − 1     |                |         |

SST = SSM + SSE and DFT = DFM + DFE.

The regression standard error, s, for n sample data points is calculated from the residuals ei = yi − ŷi:

s² = Σ ei² / (n − p − 1) = Σ (yi − ŷi)² / (n − p − 1) = SSE/DFE = MSE

s is an unbiased estimate of the regression standard deviation σ.
Recall our data on 224 first-year computer science majors at a large university in a given year (response variable y: cumulative GPA after 2 semesters; explanatory variables: SATM, SATV, HSM, HSS, HSE). Here are the summary statistics for these data:
We finally run a multiple regression model with all the variables together.

[Output: P-value very significant; R² fairly small (22%); HSM significant.]

The overall test is significant, but only the average high school math score (HSM) makes a significant contribution in this model to predicting the cumulative GPA. This conclusion applies to computer science majors at this large university.
The United Nations Development Reports provide data on a large number of human development variables for 182 OECD (Organization for Economic Cooperation and Development) countries. The variables examined here from the 2009 report HDR_2009 are:
* HDI: United Nations human development index rank
* LEB: life expectancy at birth (years) in 2007
* ALR: adult literacy rate (% aged 15 and above)
* GDP: gross domestic product per capita (purchasing power parity in US$)
* URB: % urban population in 2010
* PEH: public expenditure on health (as % of total government expenditure)

Here are the summary statistics for a sample of twenty countries:
Here are the data:

HDI | Country       | LEB  | ALR  | GDP    | URB  | PEH
4   | Canada        | 80.6 | 99.0 | 35,812 | 80.6 | 17.9
9   | Switzerland   | 81.7 | 99.0 | 40,658 | 73.6 | 19.6
13  | United States | 79.1 | 99.0 | 45,592 | 82.3 | 19.1
18  | Italy         | 81.1 | 98.9 | 30,353 | 68.4 | 14.2
38  | Malta         | 79.6 | 92.4 | 2,308  | 94.7 | 14.7
46  | Lithuania     | 71.8 | 99.7 | 17,575 | 67.2 | 13.3
50  | Uruguay       | 76.1 | 97.9 | 11,216 | 89.0 | 9.2
75  | Brazil        | 72.2 | 90.0 | 9,567  | 86.5 | 7.2
92  | China         | 72.9 | 93.3 | 5,383  | 44.9 | 9.9
93  | Belize        | 76.0 | 75.1 | 6,734  | 52.7 | 10.9
98  | Tunisia       | 73.8 | 77.7 | 752    | 67.3 | 6.5
99  | Tonga         | 71.7 | 99.2 | 3,748  | 25.3 | 11.1
105 | Philippines   | 71.6 | 93.4 | 3,406  | 66.4 | 6.4
113 | Bolivia       | 65.4 | 90.7 | 4,206  | 66.5 | 11.6
117 | Moldova       | 68.3 | 99.2 | 2,551  | 43.8 | 11.8
124 | Nicaragua     | 72.7 | 78.0 | 2,570  | 57.3 | 16.0
127 | Tajikistan    | 66.4 | 99.6 | 1,753  | 26.5 | 5.5
150 | Sudan         | 57.9 | 60.9 | 2,086  | 45.2 | 6.3
172 | Mozambique    | 47.8 | 44.4 | 802    | 38.4 | 12.6
178 | Mali          | 48.1 | 26.2 | 1,083  | 33.3 | 12.2
The first step in multiple linear regression is to study all pair-wise relationships between the p + 1 variables. Here is the output for all pair-wise correlations.
Scatterplots for all 10 pair-wise relationships are also necessary to understand the data. Note that the relationship between GDP and the other variables appears to be non-linear.
Let's first run two simple linear regressions to predict LEB: one using ALR alone and another using GDP alone.

Note:
- R² = 1218.410/1809.228 = 0.673 (ALR alone) and R² = 609.535/1809.228 = 0.337 (GDP alone): the proportion of variation in LEB explained by each variable separately.
- When ALR or GDP is used alone, both P-values are very significant.
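The R² arithmetic above (R² = SSM/SST) can be checked directly from the sums of squares on the slide:

```python
# R^2 = SSM / SST for the two simple regressions predicting LEB
sst = 1809.228               # total sum of squares for LEB
r2_alr = 1218.410 / sst      # model sum of squares, ALR alone
r2_gdp = 609.535 / sst       # model sum of squares, GDP alone
print(round(r2_alr, 3), round(r2_gdp, 3))  # 0.673 0.337
```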
Now, let's run a multiple linear regression using ALR and GDP together.

Note:
- R² = 1333.465/1809.228 = 0.737, a slight increase from using ALR alone.
- b1 = 0.335 with SE = 0.066; when ALR was used alone, we had b1 = 0.392 with SE = 0.064.
- b2 = 0.00019 with SE = 0.00009; when GDP was used alone, we had bGDP = 0.00039 with SEGDP = 0.00013.
Now consider a multiple linear regression using all four explanatory variables:

- R² = 0.807; R = 0.899 (R is the correlation between y and ŷ).
- P-value very significant: at least one regression coefficient is different from 0.
- ALR is significant; URB is significant; GDP and PEH are not significant.
We now drop the least significant variable from the previous model: GDP.

- R² is almost the same; P-value very significant; ALR significant; URB significant; PEH not.

The conclusions are about the same. But notice that the actual regression coefficients have changed.
predicted LEB = 32.10 + 0.31 ALR + 0.14 URB + 0.26 PEH + 0.00005 GDP

predicted LEB = 30.31 + 0.32 ALR + 0.14 URB + 0.37 PEH
Let's run a multiple linear regression with ALR and URB only.

- R² is marginally lower; P-value very significant; ALR significant; URB significant.

The ANOVA test for βALR and βURB is very significant: at least one is not zero. The t tests for βALR and βURB are very significant: each one is not zero.

When taken together, only ALR and URB are significant predictors of LEB.