Action Research
Correlation and Regression
INFO 515
Glenn Booker

Measures of Association
• Measures of association are used to determine how strong the relationship is between two variables or measures, and how we can predict such a relationship
• Only applies for interval or ratio scale variables
  – Everything this week only applies to interval or ratio scale variables!

Measures of Association
• For example, I have GRE and GPA scores for a random sample of graduate students
  – How strong is the relationship between GRE scores and GPA? Do these variables relate to each other in some way?
  – If there is a strong relationship, how well can we predict the values of one variable when values of the other variable are known?

Strength of Prediction
• Two techniques are used to describe the strength of a relationship, and predict values of one variable when another variable's value is known
  – Correlation: describes the degree (strength) to which the two variables are related
  – Regression: used to predict the values of one variable when values of the other are known

Strength of Prediction
• Correlation and regression are linked: the ability to predict one variable when another variable is known depends on the degree and direction of the variables' relationship in the first place
  – We find correlation before we calculate regression
  – So generating a regression without checking for a correlation first is pointless (though we'll do both at once)

Correlation
• There are different types of statistical measures of correlation
• They give us a measure known as the correlation coefficient
  – The most common procedure used is known as the Pearson Product Moment Correlation, or Pearson's 'r'

Pearson's 'r'
• Can only be calculated for interval or ratio scale data
• Its value is a real number from -1 to +1
• Strength: as the value of 'r' approaches -1 or +1, the relationship is stronger; as the magnitude of 'r' approaches zero, we see little or no relationship

Pearson's 'r'
• For example, 'r' might equal 0.89, -0.9, 0.613, or -0.3
  – Which would be the strongest correlation?
• Direction: positive or negative correlation cannot be distinguished just by looking at 'r'
  – Direction of correlation depends on the type of equation used, and the resulting constants obtained for it

Example of Relationships
• Positive direction: as the independent variable increases, the dependent variable tends to increase:

  Student   GRE (X)   GPA1 (Y)
  1         1500      4.0
  2         1400      3.8
  3         1250      3.5
  4         1050      3.1
  5         950       2.9

Example of Relationships
• Negative direction: as the independent variable increases, the dependent variable tends to decrease:

  Student   GRE (X)   GPA2 (Y)
  1         1500      2.9
  2         1400      3.1
  3         1250      3.4
  4         1050      3.7
  5         950       4.0

Positive and Negative Correlation

[Figure: two scatter plots with fitted Linear lines. Left: GPA1 vs. GRE (the positive-direction data above), labeled "Positive correlation, r = 1.0". Right: GPA2 vs. GRE (the negative-direction data), labeled "Negative correlation, r = 1.0". Both show the Observed points falling along the fitted line.]

• Notice that a high 'r' doesn't tell whether the correlation is positive or negative!

*Important Note*
• An association value provided by a correlation analysis, such as Pearson's 'r', tells us nothing about causation
  – In this case, high GRE scores don't necessarily cause high or low GPA scores, and vice versa

Significance of r
• We can test for the significance of r (to see whether our relationship is statistically significant) by consulting a table of critical values for r (Action Research p. 41/42)
  – Table "VALUES OF THE CORRELATION COEFFICIENT FOR DIFFERENT LEVELS OF SIGNIFICANCE"
  – Where df = (number of data pairs) – 2

Significance of r
• We test the null hypothesis that the correlation between the two variables is equal to zero (there is no relationship between them)
• Reject the null hypothesis (H0) if the absolute value of r is greater than the critical r value
  – Reject H0 if |r| > r_crit
  – This is similar to evaluating actual versus critical 't' values

Significance of r Example
• So if we had 20 pairs of data:
  – For two-tail 95% confidence (P = .05), the critical 'r' value at df = 20 – 2 = 18 is 0.444
  – So reject the null hypothesis (hence the correlation is statistically significant) if r > 0.444 or r < -0.444
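
As a side note, the table values can be reproduced from the t distribution. A minimal sketch (mine, not from the lecture; assumes SciPy is available):

    from scipy import stats
    import math

    def critical_r(n_pairs, alpha=0.05):
        """Two-tailed critical value of Pearson's r for n_pairs data pairs."""
        df = n_pairs - 2                          # df = (number of data pairs) - 2
        t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical t
        return t_crit / math.sqrt(df + t_crit**2)

    print(round(critical_r(20), 3))  # 0.444, matching the table value at df = 18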

Strength of "|r|"
• The absolute value of Pearson's 'r' indicates the strength of a correlation:
  – 1.0 to 0.9: very strong correlation
  – 0.9 to 0.7: strong
  – 0.7 to 0.4: moderate to substantial
  – 0.4 to 0.2: moderate to low
  – 0.2 to 0.0: low to negligible correlation
• Notice that a correlation can be strong, but still not be statistically significant! (especially for small data sets)

*Important Notes*
• The stronger the r, the smaller the standard error of the estimate, and the better the prediction!
• A significant r does not necessarily mean that you have a strong correlation
  – A significant r means that whatever correlation you do have is not due to random chance

Coefficient of Determination
• By squaring r, we can determine the amount of variance the two variables share (called "explained variance")
  – R Square is the coefficient of determination
• So, an "R Square" of 0.94 means that 94% of the variance in the Y variable is explained by the variance of the X variable

What is R Squared?
• The coefficient of determination, R², is a measure of the goodness of fit
• R² ranges from 0 to 1
  – R² = 1 is a perfect fit (all data points fall on the estimated line or curve)
  – R² = 0 means that the variable(s) have no explanatory power

What is R Squared?
• Having R² closer to 1 helps choose which regression model is best suited to a problem
• Having R² actually equal zero is very difficult
  – A sample of ten random numbers from Excel still obtained an R² of 0.006

Scatter Plots
• It's nice to use R² to determine the strength of a relationship, but visual feedback helps verify whether the model fits the data well
  – Also helps look for data fliers (outliers)
• A scatter plot (or scattergram) allows us to compare any two interval or ratio scale variables, and see how data points are related to each other

Scatter Plots
• Scatter plots are two-dimensional graphs with an axis for each variable (independent variable X and dependent variable Y)
• To construct: place an * on the graph for each X and Y value from the data
• Seeing data this way can help choose the correct mathematical model for the data

Scatter Plots

[Figure: a scatter plot with the independent variable X on the horizontal axis and the dependent variable Y on the vertical axis, origin at (0, 0); a single data point (2, 3) is marked with an * at X = 2, Y = 3.]

Models
• Allow us to focus on select elements of the problem at hand, and ignore irrelevant ones
• May show how parts of the problem relate to each other
• May be expressed as equations, mappings, or diagrams
• May be chosen or derived before or after measurement (theory vs. empirical)

Modeling
• Often we look for a linear relationship: one described by fitting a straight line as well to the data as possible
• More generally, any equation could be used as the basis for regression modeling, or describing the relationship between two variables
  – You could have Y = a*X**2 + b*ln(X) + c*sin(d*X - e)

Linear Model

Y = m*X + b   or   Y = b0 + b1*X

[Figure: a straight line on axes of X (independent) vs. Y (dependent); m = slope (the rise in Y per 1 unit of X), and b = the Y axis intercept.]

Linear Model
• Pearson's 'r' for linear regression is calculated per (Action Research p. 29/30)
• Define:
  N   = number of data pairs
  SX  = sum of all X values
  SX2 = sum of all (X values squared)
  SY  = sum of all Y values
  SY2 = sum of all (Y values squared)
  SXY = sum of all (X values times Y values)
• Pearson's r = [N*(SXY) – (SX)*(SY)] / sqrt[(N*(SX2) – (SX)^2) * (N*(SY2) – (SY)^2)]
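
This formula translates directly into code. A minimal sketch (my own, not from the text), checked against the positive-direction GRE/GPA1 data above:

    import math

    def pearson_r(xs, ys):
        n = len(xs)                                 # N = number of data pairs
        sx, sy = sum(xs), sum(ys)                   # SX, SY
        sx2 = sum(x * x for x in xs)                # SX2
        sy2 = sum(y * y for y in ys)                # SY2
        sxy = sum(x * y for x, y in zip(xs, ys))    # SXY
        return (n * sxy - sx * sy) / math.sqrt(
            (n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

    gre  = [1500, 1400, 1250, 1050, 950]
    gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]
    print(pearson_r(gre, gpa1))  # 1.0 (to floating point): this data lies exactly on a line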

Linear Model
• For the linear model, you could find the slope 'm' and Y-intercept 'b' from
  – m = (r) * (standard deviation of Y) / (standard deviation of X)
  – b = (mean of Y) – (m)*(mean of X)
• But it's a lot easier to use SPSS, where slope = b1 and Y intercept = b0
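
For illustration, the same slope and intercept computed by hand (a sketch reusing the sample data and the r value from the previous sketch):

    import statistics as st

    gre  = [1500, 1400, 1250, 1050, 950]
    gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]

    r = 1.0                                    # from the pearson_r sketch above
    m = r * st.stdev(gpa1) / st.stdev(gre)     # m = r * sd(Y) / sd(X)
    b = st.mean(gpa1) - m * st.mean(gre)       # b = mean(Y) - m * mean(X)
    print(m, b)  # ~0.002 and ~1.0, so GPA1 = 0.002*GRE + 1.0 for this data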

Regression Analysis
• Allows us to predict the likely value of one variable from knowledge of another variable
• The two variables should be fairly highly correlated (close to a straight line)
• The regression equation is a mathematical expression of the relationship between two variables on, for example, a straight line

Regression Equation
• Y = mX + b
• In this linear equation, you predict Y values (the dependent variable) from known values of X (the independent variable); this is called the regression of Y on X
  – The regression equation is fundamentally an equation for plotting a straight line, so the stronger our correlation, the closer our variables will fall to a straight line, and the better our prediction will be

Linear Regression

[Figure: data points scattered around a fitted line ŷ = a + b*x; each observed value satisfies y = ŷ + e, where e is the vertical distance (residual) from the point to the line.]

• Choose the "best" line by minimizing the sum of the squares of the vertical distances between the data points and the regression line
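
A quick numerical check (my own sketch; assumes NumPy) that the least squares line really does minimize that sum of squares, using the negative-direction data from earlier:

    import numpy as np

    x = np.array([1500, 1400, 1250, 1050, 950], dtype=float)
    y = np.array([2.9, 3.1, 3.4, 3.7, 4.0])        # the GPA2 sample data

    b1, b0 = np.polyfit(x, y, deg=1)               # least squares slope, intercept
    sse = np.sum((y - (b0 + b1 * x)) ** 2)         # sum of squared residuals
    print(sse)

    # Nudging the slope away from the fitted value can only increase the SSE:
    for slope in (b1 * 0.9, b1 * 1.1):
        print(np.sum((y - (b0 + slope * x)) ** 2))  # both exceed sse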

Standard Error of the Estimate
• Is the standard deviation of data around the regression line
• Tells how much the actual values of Y deviate from the predicted values of Y

Standard Error of the Estimate
• After you calculate the standard error of the estimate, you add and subtract the value from your predicted values of Y to get a % area around the regression line within which you would expect repeated actual values to occur or cluster if you took many samples (sort of like a sampling distribution for the mean…)

Standard Error of Estimate
• The Standard Error of Estimate for Y predicted by X is
  s_y/x = sqrt[ (sum of (Y – predicted Y)^2) / (N – 2) ]
  where 'Y' is each actual Y value, 'predicted Y' is the Y value predicted by the linear regression, and 'N' is the number of data pairs
• For the example on (Action Research p. 33/34), s_y/x = sqrt(2.641/(10 – 2)) = 0.574
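
The formula in code (a sketch continuing the NumPy example above, rather than using the textbook's data):

    import numpy as np

    x = np.array([1500, 1400, 1250, 1050, 950], dtype=float)
    y = np.array([2.9, 3.1, 3.4, 3.7, 4.0])
    b1, b0 = np.polyfit(x, y, deg=1)

    n = len(x)
    resid = y - (b0 + b1 * x)                     # Y - predicted Y
    s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))  # sqrt[ sum(resid^2) / (N - 2) ]
    print(s_yx)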

Standard Error of the Estimate
• So, if the standard error of the estimate is equal to 0.574, and if you have a predicted Y value of 4.560, then 68% of your actual values, with repeated sampling, would fall between 3.986 and 5.134 (predicted Y +/- 1 std error)
  – The smaller the standard error, the closer your actual values are to the regression line, and the more confident you can be in your prediction

SPSS Regression Equations
• Instead of constants called 'm' and 'b', 'b0' and 'b1' are used for most equations
• The meaning of 'b0' and 'b1' varies, depending on the type of equation being modeled
  – You can suppress the use of 'b0' by unchecking "Include constant in equation"

SPSS Regression Models
• Linear model: Y = b0 + b1*X
• Logarithmic model: Y = b0 + b1*ln(X), where 'ln' = natural log
• Inverse model: Y = b0 + b1/X
  – Similar to the form X*Y = constant, which is a hyperbola

SPSS Regression Models
• Power model: Y = b0*(X**b1)
• Compound model: Y = b0*(b1**X)
  – Where "**" indicates "to the power of"
  – A variant of this is the Logistic model, which requires a constant input 'u' that is larger than Y for any actual data point: Y = 1/[ 1/u + b0*(b1**X) ]

SPSS Regression Models
• Exponential model: Y = b0*exp(b1*X)
  – "exp" means "e to the power of"; e = 2.7182818…
• Other exponential functions:
  – S model: Y = exp(b0 + b1/X)
  – Growth model (almost identical to the exponential model): Y = exp(b0 + b1*X)

SPSS Regression Models
• Polynomials beyond the Linear model (linear is a first order polynomial):
  – Quadratic (second order): Y = b0 + b1*X + b2*X**2
  – Cubic (third order): Y = b0 + b1*X + b2*X**2 + b3*X**3
  – These are the only equations which use constants b2 & b3
• Higher order polynomials require the Regression module of SPSS, which can do regression using any equation you enter
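
For reference, here are the two-constant model families above written out as plain Python functions (my own summary of the slides' formulas, not SPSS output):

    import math

    models = {
        "linear":      lambda x, b0, b1: b0 + b1 * x,
        "logarithmic": lambda x, b0, b1: b0 + b1 * math.log(x),
        "inverse":     lambda x, b0, b1: b0 + b1 / x,
        "power":       lambda x, b0, b1: b0 * x ** b1,
        "compound":    lambda x, b0, b1: b0 * b1 ** x,
        "exponential": lambda x, b0, b1: b0 * math.exp(b1 * x),
        "s":           lambda x, b0, b1: math.exp(b0 + b1 / x),
        "growth":      lambda x, b0, b1: math.exp(b0 + b1 * x),
    }
    # The quadratic and cubic models add b2*X**2 and b3*X**3 terms, and the
    # logistic model adds the constant 'u': Y = 1/(1/u + b0 * b1**x).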

Y = whattheflock?
• To help picture these equations (a rough Python equivalent is sketched below):
  – Make an X variable over some typical range (0 to 10 in a small increment, maybe 0.01)
  – Define a Y variable
  – Calculate the Y variable using Transform > Compute… and whatever equation you want to see
  – Pick values for b0 and b1 that aren't 0, 1, or 2
  – Have SPSS plot the results of a regression of Y vs X for that type of equation
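
The same exploration outside SPSS (a sketch, assuming NumPy and matplotlib; the Power model and the constants here are arbitrary choices):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(0.01, 10.0, 0.01)   # start above 0 for models using ln(X) or 1/X
    b0, b1 = 3.0, 0.7                 # constants that aren't 0, 1, or 2

    y = b0 * x ** b1                  # the Power model, Y = b0*(X**b1)
    plt.plot(x, y)
    plt.title("Power model: b0 = 3.0, b1 = 0.7")
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.show()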

How Apply This?
• Given a set of data containing two variables of interest, generate a scatter plot to get some idea of what the data looks like
• Choose which types of models are most likely to be useful
• For only linear models, use Analyze / Regression / Linear...

How Apply This?
• Select the Independent (X) and Dependent (Y) variables
• Rules may be applied to limit the scope of the analysis, e.g. gender=1
• Dozens of other characteristics may also be obtained, which are beyond our scope here

How Apply This?
• Then check for the R Square value in the Model Summary
• Check the Coefficients to make sure they are all significant (e.g. Sig. < 0.050)
• If so, use the 'b0' and 'b1' coefficients from under the 'B' column (see the Statistics for Software Process Improvement handout), plus or minus the standard errors "SE B"
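
The same checks can be made outside SPSS; a sketch with statsmodels (assumed, and using the earlier five-student sample rather than a real study):

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1500, 1400, 1250, 1050, 950], dtype=float)
    y = np.array([2.9, 3.1, 3.4, 3.7, 4.0])

    X = sm.add_constant(x)           # adds the constant (b0) column
    fit = sm.OLS(y, X).fit()

    print(fit.rsquared)              # the "R Square" value
    print(fit.pvalues)               # the "Sig." column for each coefficient
    print(fit.params)                # B: b0 (constant) and b1 (slope)
    print(fit.bse)                   # "SE B": the standard errors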

Regression Example
• For example, go back to the "GSS91 political.sav" data set
• Generate a linear regression (Analyze > Regression > Linear) for 'age' as the Independent variable, and 'partyid' as the Dependent variable
• Notice that R² and the ANOVA summary are given, with F and its significance

Regression Example

Model Summary
  Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
  1       .075a   .006       .005                2.082
  a. Predictors: (Constant), AGE OF RESPONDENT

ANOVA(b)
  Model 1      Sum of Squares   df     Mean Square   F       Sig.
  Regression   36.235           1      36.235        8.361   .004a
  Residual     6457.063         1490   4.334
  Total        6493.298         1491
  a. Predictors: (Constant), AGE OF RESPONDENT
  b. Dependent Variable: POLITICAL PARTY AFFILIATION

Regression Example
• The R Square of 0.006 means there is a very slight correlation (little strength)
• But the ANOVA Significance well under 0.050 confirms there is a statistically significant relationship here; it's just a really weak one

Regression Example

Output from Analyze > Regression > Linear:

Coefficients(a)
  Model 1             B       Std. Error   Beta    t        Sig.
  (Constant)          3.333   .148                 22.462   .000
  AGE OF RESPONDENT   -.009   .003         -.075   -2.892   .004
  a. Dependent Variable: POLITICAL PARTY AFFILIATION

Output from Analyze > Regression > Curve Estimation:

Coefficients
                      B       Std. Error   Beta    t        Sig.
  AGE OF RESPONDENT   -.009   .003         -.075   -2.892   .004
  (Constant)          3.333   .148                 22.462   .000

(B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.)

Regression Example
• The heart of the regression analysis is in the Coefficients section
• We could look up 't' on a critical values table, but it's easier to see if all values of Sig are < 0.050; if they are, reject the null hypothesis, meaning there is a significant relationship
  – If so, use the values under B for b0 and b1
  – If any coefficient has Sig > 0.050, don't use that regression (the coefficient might be zero)

Regression Example
• The answer for "what is the effect of age on political view?" is that there is a very weak but statistically significant linear relationship, with a reduction of 0.009 (b1) political view categories per year
  – From the Variable View of the data, since low values are liberal and large values conservative, this means that people tend to get slightly more liberal as they get older

Curve Estimation Example
• For the other regression options, choose Analyze / Regression / Curve Estimation…
• Define the Dependents (variable) and the Independent variable; note that multiple Dependents may be selected
• Check which math models you want used
• Display the ANOVA table for reference

Curve Estimation Example
• SPSS Tip: up to three regression models can be plotted at once, so don't select more than that if you want a scatter plot to go with the data and the regressions
• For the same example just used, get a summary for the linear and quadratic models (Analyze > Regression > Curve Estimation)
• Find "R Square" for each model
  – Generally pick the model with the largest R Square
  – We already saw the Linear output; now see the Quadratic

Curve Estimation Example
• For the quadratic regression, R Square is slightly higher, and the ANOVA is still significant

Model Summary
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .094   .009       .008                2.079
  The independent variable is AGE OF RESPONDENT.

ANOVA
               Sum of Squares   df     Mean Square   F       Sig.
  Regression   57.801           2      28.901        6.687   .001
  Residual     6435.496         1489   4.322
  Total        6493.298         1491
  The independent variable is AGE OF RESPONDENT.

Curve Estimation Example
• The Quadratic coefficients are all significant at the 0.050 level

Coefficients
                           B       Std. Error   Beta    t        Sig.
  AGE OF RESPONDENT        -.048   .018         -.410   -2.691   .007
  AGE OF RESPONDENT ** 2   .000    .000         .341    2.234    .026
  (Constant)               4.191   .412                 10.175   .000

• Interpret as partyid = (4.191 +/- 0.412) + (-0.048 +/- 0.018)*age + (0.0003918 +/- 0.0001754)*age**2
• Edit the data table, then double-click on the cells to get the values of b2 and its std error (they display as .000 at the default precision)

Curve Estimation Example
• The data set will be plotted as the Observed points, with the regression models shown for comparison
• Look to see which model most closely matches the data
• Look for regions of data which do or don't match the model well (if any)

Curve Estimation Example

[Figure: scatter plot of the Observed partyid vs. age data, with the fitted quadratic and linear regression curves overlaid.]

Curve Estimation Procedure
• See which models are significant (throw out the rest!)
• Compare the R Square values to see which provides the best fit
• Use the graph to verify visually that the correct model was chosen
• Use the model equation's 'B' values and their standard errors to describe and predict the data's behavior
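
The whole procedure, compressed into a small sketch (mine, not SPSS; the significance check is omitted for brevity): fit the candidate polynomial models, compare R Square, and keep the best fit:

    import numpy as np

    def r_squared(y, y_hat):
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    x = np.array([1500, 1400, 1250, 1050, 950], dtype=float)
    y = np.array([2.9, 3.1, 3.4, 3.7, 4.0])

    for degree in (1, 2):                    # 1 = linear, 2 = quadratic
        coeffs = np.polyfit(x, y, degree)    # least squares fit
        y_hat = np.polyval(coeffs, x)
        print(degree, r_squared(y, y_hat))   # pick the model with the larger value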