Chapter 3 Examining Relationships “Get the facts first, and then you can distort them as much as you please.” Mark Twain.


3.1 Scatterplots

 Many statistical studies involve MORE THAN ONE variable.

 A SCATTERPLOT is a graphical display that lets us look for a possible relationship between two quantitative variables.

Response Variable vs. Explanatory Variable

 Response variable – Measures an outcome of a study

 Explanatory variable – Attempts to explain the observed outcomes

 When we think changes in a variable x explain, or even cause, changes in a second variable y, we call x an explanatory variable and y a response variable.

[Figure: axes labeled y (response variable) and x (explanatory variable)]

IMPORTANT!

 Even if it appears that y can be “predicted” from x, it does not follow that x causes y.

 ASSOCIATION DOES NOT IMPLY CAUSATION.

When examining a scatterplot, look for an overall PATTERN.

 Consider:
– Direction (positive association, negative association)
– Form
– Strength
– Outliers

Positive vs. Negative Association

 Positive Association (between two variables)
– Above-average values of one tend to accompany above-average values of the other
– Below-average values of one tend to accompany below-average values of the other

 Negative Association (between two variables)
– Above-average values of one tend to accompany below-average values of the other

3.2 Correlation

 Describes the direction and strength of a straight-line relationship between two quantitative variables.

 Usually written as r.

$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
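The formula for r can be sketched in a few lines of Python. This is only an illustration: the function names (`mean`, `sample_sd`, `correlation`) and the data are made up for the sketch and do not come from the chapter. The standard deviations divide by n − 1, matching the formula above.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Sum of products of standardized x and y values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))  # a moderately strong positive correlation
```

Each term in the sum is a product of two standardized values, so points that are above average (or below average) in both variables push r upward, and points that are above average in one and below in the other pull it downward.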

Facts About Correlation

 Positive r indicates positive association between the variables, and negative r indicates negative association.

 The correlation r always falls between –1 and 1 inclusive.

 The correlation between x and y does NOT change when we change the units of measurement of x, y, or both.

 Correlation ignores the distinction between explanatory and response variables.

 Correlation measures the strength of ONLY straight-line association between two variables.

 The correlation is STRONGLY affected by a few outlying observations.
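Several of these facts can be checked numerically. The sketch below uses invented height and weight values (they are assumptions, not data from the chapter) to show that r is unchanged by a change of units, unchanged when x and y are swapped, and pulled down sharply by a single outlying observation.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r computed from standardized values (n - 1 divisor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical measurements: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 160]

# Same observations in different units (centimeters, kilograms).
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_converted = correlation(heights_cm, weights_kg)   # units changed
r_swapped = correlation(weights_lb, heights_in)     # roles of x and y swapped

# One made-up outlying observation: very tall but very light.
r_with_outlier = correlation(heights_in + [75], weights_lb + [100])

print(round(r_original, 3), round(r_converted, 3), round(r_swapped, 3))
print(round(r_with_outlier, 3))
```

The first three values agree up to rounding error, while the value with the outlier is far smaller, which is why a scatterplot should always accompany a reported correlation.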

3.3 Least-Squares Regression

 If a scatterplot shows a linear relationship between two quantitative variables, least-squares regression is a method for finding a line that summarizes the relationship between the two variables, at least within the domain of the explanatory variable x.

 The least-squares regression line (LSRL) is a mathematical model for the data.

Regression Line

 Straight line

 Describes how a response variable y changes as an explanatory variable x changes.

 Sometimes it is used to PREDICT the value of y for a given value of x.

 Makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Residual

 A difference between an OBSERVED y and a PREDICTED y:

$$ \text{residual} = y - \hat{y} $$

Some Important Facts About the LSRL

   It is a mathematical model for the data.

It is the line that makes the sum of the squares of the residuals AS SMALL AS POSSIBLE.

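The LSRL has the equation $\hat{y} = a + bx$, with slope $b$ and intercept $a$, where

$$ b = r \frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x} $$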

(On the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y)

 The slope b is the approximate change in y when x increases by 1.

 The y-intercept a is the predicted value of y when x = 0.
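A minimal sketch of how the slope, intercept, and residuals might be computed, using the relation b = r·s_y/s_x. The intercept is taken as a = ȳ − b·x̄, which is an assumption of this sketch; it follows from the fact, noted later in the chapter, that the regression line contains the point (x̄, ȳ). The function names and the data are illustrative only.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r computed from standardized values (n - 1 divisor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope b = r * s_y / s_x
    a = mean(ys) - b * mean(xs)                      # line passes through (x-bar, y-bar)
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted

print(round(a, 3), round(b, 3))
print([round(e, 3) for e in residuals])

# Nudging the intercept or the slope away from the fitted values
# makes the sum of squared residuals larger.
best = sum_squared_residuals(xs, ys, a, b)
print(best < sum_squared_residuals(xs, ys, a + 0.1, b))
print(best < sum_squared_residuals(xs, ys, a, b + 0.1))
```

The last two comparisons illustrate the least-squares property for two nearby lines; they do not prove it for every possible line.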

Coefficient of Determination

 Symbolism: $r^2$

 It is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

 Measure of HOW SUCCESSFUL the regression is in explaining the response.

Calculation of $r^2$

$$ r^2 = \frac{SSM - SSE}{SSM} $$

where

$$ SSM = \sum (y - \bar{y})^2 \quad \text{(sum of squares about the mean } \bar{y}\text{)} $$

$$ SSE = \sum (y - \hat{y})^2 \quad \text{(sum of squares of residuals)} $$

Example

$\bar{x} = ?$    $\bar{y} = ?$

L1 (x)    L2 (y)    $(y - \bar{y})^2$    $(y - \hat{y})^2$
2         6
4         12
6         15
Sum

Example Solution

$\bar{x} = 4$    $\bar{y} = 11$

L1 (x)    L2 (y)    $(y - \bar{y})^2$    $(y - \hat{y})^2$
2         6         25                   .25
4         12        1                    1
6         15        16                   .25
Sum                 42                   1.50
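The table can be reproduced with a short script. This is a sketch rather than the calculator procedure the slides presumably used. The slope is computed as Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is algebraically the same as r·s_y/s_x, and the variable names are illustrative.

```python
from statistics import mean

xs = [2, 4, 6]     # list L1 from the example
ys = [6, 12, 15]   # list L2 from the example

x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept (equivalent to b = r * s_y / s_x).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Column (y - y_bar)^2 and its total, SSM.
sq_about_mean = [(y - y_bar) ** 2 for y in ys]
ssm = sum(sq_about_mean)

# Column (y - y_hat)^2 and its total, SSE.
sq_residuals = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
sse = sum(sq_residuals)

r_squared = (ssm - sse) / ssm

print(x_bar, y_bar)            # 4 11
print(sq_about_mean, ssm)      # [25, 1, 16] 42
print(sq_residuals, sse)       # [0.25, 1.0, 0.25] 1.5
print(round(r_squared, 4))     # 0.9643
```

With SSM = 42 and SSE = 1.50, about 96% of the variation in y is explained by the regression of y on x.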

Things to Note:

 Sum of deviations from mean = 0.

 Sum of residuals = 0.

 $r^2 > 0$ does not mean $r > 0$. If x and y are negatively associated, then $r < 0$.
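These three notes can be checked on a small negatively associated data set. The data and names below are invented for illustration. Note that taking the square root of r² gives only the size of r, so the sign has to come from the direction of the association (the sign of the slope).

```python
from math import sqrt
from statistics import mean

# Hypothetical data with a negative association.
xs = [1, 2, 3, 4]
ys = [8, 6, 5, 1]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b = s_xy / s_xx               # least-squares slope
a = y_bar - b * x_bar         # least-squares intercept
r = s_xy / sqrt(s_xx * s_yy)  # correlation, equivalent to the standardized-product formula

sum_deviations = sum(y - y_bar for y in ys)
sum_residuals = sum(y - (a + b * x) for x, y in zip(xs, ys))

print(sum_deviations, round(sum_residuals, 10))  # both are 0 up to rounding error
print(round(r, 4), round(r ** 2, 4))             # r is negative, r squared is positive
```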

Outlier

 A point that lies outside the overall pattern of the other points in a scatterplot.

 It can be an outlier in the x direction, in the y direction, or in both directions.

Influential Point

 A point that, if removed, would considerably change the position of the regression line.

 Points that are outliers in the x direction are often influential.
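One way to see influence is to fit the line with and without a suspect point and compare the slopes. The sketch below uses made-up data in which five points lie close to a line with slope near 2, plus one point far out in the x direction; the helper name `fit_line` is an assumption of the sketch.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data: five points near a line with slope about 2.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# A made-up point that is an outlier in the x direction.
outlier_x, outlier_y = 15, 5.0

a_without, b_without = fit_line(xs, ys)
a_with, b_with = fit_line(xs + [outlier_x], ys + [outlier_y])

print(round(b_without, 3))  # slope without the outlying point
print(round(b_with, 3))     # slope with it: much flatter
```

Removing the single point changes the slope from nearly flat to roughly 2, so in this data set the point is influential.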

Words of Caution

 Do NOT CONFUSE the slope b of the LSRL with the correlation r.
– The relation between the two is given by the formula

$$ b = r \frac{s_y}{s_x} $$

 If you are working with normalized data, then b does equal r, since $s_y = s_x = 1$.

 When you normalize a data set, the normalized data has mean = 0 and standard deviation = 1.

More Words of Caution

 If you are working with normalized data, the regression line has the simple form

$$ y_n = r x_n $$

where $x_n$ and $y_n$ are the normalized values of x and y.

 Since the regression line contains the mean of x and the mean of y, and since normalized data has a mean of 0, the regression line for normalized x and y values contains (0, 0).
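A short numerical check of these claims, assuming "normalized" means subtracting the mean and dividing by the sample standard deviation. The data and function names are illustrative, and the slope is computed as Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is equivalent to r·s_y/s_x.

```python
from statistics import mean, stdev

def normalize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_n, y_n = normalize(xs), normalize(ys)

# Correlation: sum of products of normalized values, divided by n - 1.
r = sum(zx * zy for zx, zy in zip(x_n, y_n)) / (len(xs) - 1)

a_raw, b_raw = fit_line(xs, ys)    # original units: slope differs from r
a_n, b_n = fit_line(x_n, y_n)      # normalized data: slope equals r, intercept is 0

print(round(mean(x_n), 10), round(stdev(x_n), 10))  # mean 0, standard deviation 1
print(round(b_raw, 4), round(r, 4))                 # not equal in original units
print(round(b_n, 4), round(a_n, 10))                # slope equals r, line through (0, 0)
```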