Chapter 15 Pearson’s r and Regression


Chapter 15 (1e) or 13 (2/3e)

Association Between Variables Measured at the Interval-Ratio Level: Bivariate Correlation and Regression

Introduction:

Scattergrams / Scatterplots: graphs that display relationships between two interval-ratio variables.

The Regression Line, Slope, and Intercept:

The regression line, Y = a + bX, summarizes the linear relationship between X and Y. It predicts the score of Y from a score of X.

b represents the slope of the line.

a, called the intercept, is the point where the regression line crosses the Y-axis.

Pearson’s r and the Coefficient of Determination (r²):

r is a measure of association for two interval-ratio (I-R) variables.

r² tells you how much of the variation in the dependent variable is explained by the independent variable.

Scattergram / Scatterplot

Has two dimensions:

The X (independent) variable is arrayed along the horizontal axis.

The Y (dependent) variable is arrayed along the vertical axis.

 Each dot on a scattergram is a case in the data set.

 The dot is placed at the intersection of the case’s scores on X and Y.

Example of a Hypothetical Scattergram Showing the Relationship Between X and Y

Shows the relationship between % College Educated (X) and Voter Turnout (Y) on election day for 50 cities.

[Figure: scattergram titled "Turnout By % College"; horizontal axis: % College; vertical axis: Turnout]

Scattergram Example (cont.)

 Horizontal X axis - % of population of a city with a college education.

 Scores range from 15.3% to 34.6% and increase from left to right.


Scattergram Example (cont.)

Vertical (Y) axis is voter turnout.

Scores range from 44.1% to 70.4% and increase from bottom to top.

The Regression Line on a Scattergram

A single straight line that comes as close as possible to all data points (the “least squares regression line”).

Indicates strength and direction of the relationship.


Strength of Regression Line

The greater the extent to which dots are clustered around the regression line, the stronger the relationship.

This relationship is weak to moderate in strength.


Direction of Regression Line

   Positive: regression line rises left to right.

Negative: regression line falls left to right.

This is a positive relationship: As % college educated increases, % turnout increases.


Scattergrams and Linearity

Inspection of the scattergram should always be the first step in assessing the correlation between two interval-ratio variables.

Besides showing strength and direction, the scattergram is used to confirm that the relationship is linear, which it must be for these techniques.


The Regression Line: Formula

This formula defines the regression line:

$$Y = a + bX$$

Where:

Y = score on the dependent variable

a = the Y intercept, or the point where the regression line crosses the Y axis

b = the slope of the regression line, or the amount of change produced in Y by a unit change in X

X = score on the independent variable

Regression and Prediction

We can use the regression line to find the predicted value of Y (symbolized as Y') for values of X.

Once we know the values of the coefficients b and a, we can substitute any value of X into the following prediction formula to predict Y:

$$Y' \ (\text{or } \hat{Y}) = a + bX$$

We can also use the regression formula to plot the regression line accurately on our scattergram.

Regression Analysis: Healey’s definitional formula for calculating the slope of the line (Formula 15.2 in 1e, or 13.2 in 2/3e)

$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$

Note: The numerator is the covariation of X and Y (how X and Y vary together). The denominator is the sum of the squared deviations around the mean of X.

Regression Analysis: computational formula for b (Formula 13.3 in 2/3e)

Below is the computational (working) formula for b. It is a rearrangement of the definitional formula and is much easier to calculate.

$$b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}$$

The slope tells you how much Y changes for every one-unit change in X.

The sign of the slope coefficient (+/- b) tells you whether the relationship is positive or negative.
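As a brief check on why the definitional and computational slope formulas agree, the sketch below expands the sums of deviations using the means written as ΣX/n and ΣY/n. It is a standard algebraic identity and is not taken from Healey's text.

```latex
\begin{aligned}
\sum (X - \bar{X})(Y - \bar{Y})
  &= \sum XY - \bar{Y}\sum X - \bar{X}\sum Y + n\bar{X}\bar{Y}
   = \sum XY - \frac{(\sum X)(\sum Y)}{n} \\
\sum (X - \bar{X})^2
  &= \sum X^2 - \frac{(\sum X)^2}{n} \\
b &= \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}
   = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
\end{aligned}
```

The last step multiplies the numerator and denominator by n, which is why the computational form needs only the column sums and not the individual deviations.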

Regression Analysis

The Y intercept (a) is computed from Healey, Formula 15.3 (1e) or 13.4 (2/3e):

$$a = \bar{Y} - b\bar{X}$$

The intercept (a) is the point where the regression line crosses the Y-axis, that is, the value of Y when X = 0.

Results of a Hypothetical Regression Analysis of the Relationship Shown in the Scattergrams Above:

For the relationship between % college educated and % turnout:

Assume b (slope) = .42

Assume a (Y intercept) = 50.03

A slope of .42 means that % turnout increases by .42 percentage points (less than half a point) for every one-point increase in % college educated.

The Y intercept means that the regression line crosses the Y axis at Y = 50.03.

An example of prediction:

We can use the regression equation Y' = a + bX for prediction.

For instance, we could ask: what % turnout would be expected in a city where only 10% of the population was college educated?

What % turnout would be expected in a city where 70% of the population was college educated?

This is a positive relationship, so the value of Y increases as X increases. Our predictions, using a = 50.03 and b = .42:

For X = 10, Y' = 50.03 + .42(10) = 54.23

For X = 70, Y' = 50.03 + .42(70) = 79.43
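The substitution step can be sketched in a few lines of Python. This is a minimal illustration that assumes the hypothetical coefficients stated above (a = 50.03, b = .42); the function name `predict_turnout` is invented for this example.

```python
def predict_turnout(pct_college, a=50.03, b=0.42):
    """Return the predicted % turnout Y' = a + b*X for a given % college educated.

    The default a and b are the hypothetical values assumed in the slides.
    """
    return a + b * pct_college


# Predicted turnout for the two cities asked about in the example.
predictions = {x: predict_turnout(x) for x in (10, 70)}
for x, y_pred in predictions.items():
    print(f"X = {x}: predicted turnout = {y_pred:.2f}")
```

Substituting directly into Y' = a + bX with these coefficients gives about 54.23 for X = 10 and 79.43 for X = 70. Both X values fall outside the observed range of 15.3% to 34.6%, so these predictions extend the line beyond the data it was fitted to.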

Calculating the Correlation Coefficient: Formula for Pearson’s r

Definitional formula for Pearson’s r:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$

Use the computational formula to calculate r:

$$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n \sum X^2 - (\sum X)^2\right]\left[n \sum Y^2 - (\sum Y)^2\right]}}$$

Pearson’s r

Like Gamma, r varies from -1.00 to +1.00.

Pearson’s r is a measure of association for interval-ratio variables.

For the hypothetical relationship between % college educated and turnout, assume r = .32.

This relationship would be positive and weak to moderate.

As level of education increases, % turnout increases.

The Coefficient of Determination: r²

Total variation in Y = the explained variation + the unexplained variation:

$$\sum (Y - \bar{Y})^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$$

The proportion of the total variation that is explained by X is given by the formula:

$$r^2 = \frac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

Or, alternatively: r² = (r)²

Practical Example using Healey #15.1 (1e) (Problem 13.1 in 2/3e)

The computation and interpretation of a, b, Pearson’s r, and r² will be illustrated using an example similar to Healey Problem 15.1 (% Turnout by Education (Years of Schooling)), but with only 5 cases.

The variables are:

Voter turnout (Y) is the dependent variable.

Average years of school (X) is the independent variable.

The sample is 5 cities. This is only to simplify the calculation; a sample of 5 is actually very small.

Data from Problem:

The scores on each variable are displayed in table format (X = Years of Education, Y = % Turnout):

City    X       Y
A       11.9    55
B       12.1    60
C       12.7    65
D       12.8    68
E       13.0    70

1. Draw and Interpret the Scattergram:  

The relationship between X and Y is linear. Estimate regression line. Relationship is positive and strong.

2. Make a Computational Table:

          X       Y      X²        Y²       XY
          11.9    55     141.61    3025     654.5
          12.1    60     146.41    3600     726.0
          12.7    65     161.29    4225     825.5
          12.8    68     163.84    4624     870.4
          13.0    70     169.00    4900     910.0
Sum (∑)   62.5    318    782.15    20374    3986.4

The sums (∑) are needed to compute b, a, and Pearson’s r.

The means of X and Y are needed as well:

$$\bar{X} = \frac{\sum X}{n} = \frac{62.5}{5} = 12.5$$

$$\bar{Y} = \frac{\sum Y}{n} = \frac{318}{5} = 63.6$$
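The column sums and means for the five cities can be reproduced with a short Python sketch. The helper name `computational_sums` and the dictionary keys are invented for this illustration; only the five (X, Y) pairs come from the problem.

```python
# Five-city data from the worked example (X = years of education, Y = % turnout).
x = [11.9, 12.1, 12.7, 12.8, 13.0]
y = [55, 60, 65, 68, 70]


def computational_sums(xs, ys):
    """Return n and the five column sums used by the computational formulas."""
    return {
        "n": len(xs),
        "sum_x": sum(xs),
        "sum_y": sum(ys),
        "sum_x2": sum(v * v for v in xs),
        "sum_y2": sum(v * v for v in ys),
        "sum_xy": sum(p * q for p, q in zip(xs, ys)),
    }


sums = computational_sums(x, y)
mean_x = sums["sum_x"] / sums["n"]
mean_y = sums["sum_y"] / sums["n"]

for name, value in sums.items():
    print(f"{name}: {value:.2f}")
print(f"mean_x: {mean_x:.2f}, mean_y: {mean_y:.2f}")
```

Values are formatted to two decimals on output because summing decimal fractions in floating point can leave tiny rounding residue.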

3. Next, calculate b and a.

Calculate slope:

$$b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}$$

Calculate y-intercept:

$$a = \bar{Y} - b\bar{X}$$

Interpret the Slope (b) and the Intercept (a)

$$b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2} = \frac{5(3986.4) - (62.5)(318)}{5(782.15) - (62.5)^2} = \frac{57}{4.5} = 12.67$$

For every unit increase in X, Y increases by 12.67. This means that for 1 additional year of schooling, voter turnout goes up by 12.67 percentage points.

$$a = \bar{Y} - b\bar{X} = 63.6 - 12.67(12.5) = -94.78$$

This is the point at which the regression line crosses the Y-axis (when X is equal to 0, Y is equal to -94.78).

Find the Regression Line*:

$$Y = a + bX = -94.78 + 12.67X$$

*Note: you can now substitute two values for X and solve for Y to find points to plot the actual regression line on your scattergram.

For prediction: Suppose years of schooling = 10 years. Then Y' = -94.78 + 12.67(10) = 31.92. We would predict that when average years of education is 10 years, the voter turnout would be 31.92%.
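The slope, intercept, and prediction for the five cities can be checked with the sketch below, which applies the computational formula for b and then a = mean(Y) - b * mean(X). The function name `fit_line` is an assumption made for this example. The code carries b at full precision, so the intercept and prediction differ slightly in the second decimal from the hand calculation, which rounds b to 12.67 first.

```python
# Five-city data from the worked example (X = years of education, Y = % turnout).
x = [11.9, 12.1, 12.7, 12.8, 13.0]
y = [55, 60, 65, 68, 70]


def fit_line(xs, ys):
    """Return (b, a): least-squares slope and intercept via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(p * q for p, q in zip(xs, ys))
    sum_x2 = sum(p * p for p in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return b, a


b, a = fit_line(x, y)
predicted_at_10 = a + b * 10  # predicted % turnout for 10 years of schooling

print(f"b = {b:.2f}")
print(f"a = {a:.2f} (unrounded b; the hand calculation with b = 12.67 gives about -94.78)")
print(f"Y' at X = 10: {predicted_at_10:.2f}")
```

Plotting the fitted line needs only two such predictions, for example at the smallest and largest observed X, joined by a straight line.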

4. Pearson’s r

Calculate the correlation coefficient r:

$$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n \sum X^2 - (\sum X)^2\right]\left[n \sum Y^2 - (\sum Y)^2\right]}}$$

Interpret Pearson’s r

$$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n \sum X^2 - (\sum X)^2\right]\left[n \sum Y^2 - (\sum Y)^2\right]}} = \frac{5(3986.4) - (62.5)(318)}{\sqrt{\left[5(782.15) - (62.5)^2\right]\left[5(20374) - (318)^2\right]}} = .984$$

An r of 0.98 indicates an extremely strong relationship between years of education and voter turnout for these five cities (use the table given for gamma to estimate strength).
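As a check on the hand calculation, the computational formula for r can be written as a small Python function. The name `pearson_r` is invented for this sketch, and only the standard library is used.

```python
import math

# Five-city data from the worked example (X = years of education, Y = % turnout).
x = [11.9, 12.1, 12.7, 12.8, 13.0]
y = [55, 60, 65, 68, 70]


def pearson_r(xs, ys):
    """Pearson's r via the computational (sums-based) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(p * q for p, q in zip(xs, ys))
    sum_x2 = sum(p * p for p in xs)
    sum_y2 = sum(q * q for q in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

The numerator is the same quantity that appears in the slope formula, which is why r and b always share the same sign.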

5. Find the Coefficient of Determination (r²) and Interpret:

$$r^2 = (r)^2 = (.984)^2 = .968$$

The coefficient of determination is r² = .968. Education, by itself, explains 96.8% of the variation in voter turnout.
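One way to see where the 96.8% comes from is to compute the total, explained, and unexplained variation directly from the fitted line and take the ratio. The sketch below does this for the five cities; the function name `variation_components` is an assumption for illustration, and the slope is computed with the definitional formula.

```python
# Five-city data from the worked example (X = years of education, Y = % turnout).
x = [11.9, 12.1, 12.7, 12.8, 13.0]
y = [55, 60, 65, 68, 70]


def variation_components(xs, ys):
    """Return (total, explained, unexplained) variation in Y around the fitted line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Definitional slope: covariation of X and Y over squared deviations of X.
    b = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys)) / sum(
        (p - mean_x) ** 2 for p in xs
    )
    a = mean_y - b * mean_x
    predicted = [a + b * p for p in xs]
    total = sum((q - mean_y) ** 2 for q in ys)
    explained = sum((yp - mean_y) ** 2 for yp in predicted)
    unexplained = sum((q - yp) ** 2 for q, yp in zip(ys, predicted))
    return total, explained, unexplained


total, explained, unexplained = variation_components(x, y)
r_squared = explained / total
print(f"total = {total:.1f}, explained = {explained:.1f}, unexplained = {unexplained:.1f}")
print(f"r squared = {r_squared:.3f}")
```

The explained and unexplained parts add up to the total, so r² can equally be read as 1 minus the unexplained share.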

6. Testing r for significance:

We can test the relationship between % turnout and years of education (represented by Pearson’s r) for significance using the 5 step model and the following formula:

$$t(\text{obtained}) = r\sqrt{\frac{n - 2}{1 - r^2}}$$

Degrees of Freedom = n - 2

Step 1: Assumptions

There are 3 main assumptions:

1. The dependent and independent variables are normally distributed. We can check this by looking at the histograms for the two variables.

2. The relationship between X and Y is linear. We can check this by looking at the scattergram.

3. The relationship is homoscedastic. We can check homoscedasticity by looking at the scattergram and observing that the data points form a “roughly symmetrical, cigar shaped pattern” about the regression line.

If the above 3 assumptions have been met, then we can use linear regression and correlation and test r for significance.

Step 2: Null and Alternate Hypotheses:

H0: ρ = 0.0

H1: ρ ≠ 0.0

(Note that ρ (rho) is the population parameter, while r is the sample statistic.)

Step 3: Sampling Distribution and Critical Region:

Sampling distribution = t distribution

Alpha = .05

DF = n - 2 = 5 - 2 = 3

t (critical) = 3.182

Step 4. Computing the Test Statistic:

Use Formula 15.6 (1e) or 13.6 (2/3e):

$$t(\text{obtained}) = r\sqrt{\frac{n - 2}{1 - r^2}} = .984\sqrt{\frac{5 - 2}{1 - (.984)^2}} = 9.53$$

Step 5. Decision and Interpretation:

t (obtained) = 9.53 > t (critical) = 3.182

Reject H0. The relationship between % turnout and years of schooling is significant.

Always include a brief summary of your results:

There is a very strong, positive relationship between % voter turnout and years of schooling for the five cities. As years of schooling increase, the % of voter turnout goes up. The relationship is significant (t = 9.53, df = 3, α = .05). Years of schooling explain 96.8% of the variation in % voter turnout.
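The test statistic and decision can be sketched in Python as follows. The function name `t_obtained` and its optional `r_squared` argument are assumptions made for this illustration; the critical value 3.182 is taken from the slides and not computed. Passing r² pre-rounded to .968 reproduces the hand calculation; squaring r = .984 without rounding gives a slightly larger t, and the decision is the same either way.

```python
import math


def t_obtained(r, n, r_squared=None):
    """Return t = r * sqrt((n - 2) / (1 - r^2)).

    r_squared may be supplied pre-rounded to mirror a hand calculation;
    otherwise it is computed as r ** 2.
    """
    if r_squared is None:
        r_squared = r ** 2
    return r * math.sqrt((n - 2) / (1 - r_squared))


T_CRITICAL = 3.182  # critical t for alpha = .05, df = 3, as stated in the slides

t_rounded = t_obtained(0.984, 5, r_squared=0.968)  # uses the rounded r squared
t_unrounded = t_obtained(0.984, 5)                 # squares r without rounding
reject_null = abs(t_rounded) > T_CRITICAL

print(f"t (rounded r squared)   = {t_rounded:.2f}")
print(f"t (unrounded r squared) = {t_unrounded:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Because the denominator 1 - r² is very small when r is close to 1, t is sensitive to how r² is rounded, so small differences in the second decimal are expected.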

Practice Problem

 Working with a partner, calculate, interpret and summarize the results for Healey 1e #15.1 (2/3e #13.1) for “% Turnout” and “Unemployment” and for “% Turnout” and “Negative Campaigning”.