Transcript: The Basics of Social Research, 2/e
Regression: Single and Multiple
Overview
• Defined: A model for predicting one variable from other variable(s).
• Variables: IV(s) is continuous, DV is continuous
• Relationship: relationship amongst variables
• Example: Can we predict height from weight? (Or weight from height? Or weight from multiple variables? Or height from multiple variables?)
• Assumptions: normality, linearity, no (excessive) multicollinearity
Regression is about finding the best straight line
The best straight line is the one that minimizes S, the sum of the squared vertical distances (residuals) between the data points and the line
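In symbols (a minimal restatement of the criterion above, with a as the intercept and b as the slope, matching the notation used later in these slides):

```latex
% S is the sum of squared vertical distances (residuals) from each point to the line
S(a, b) = \sum_{i=1}^{n} \bigl( Y_i - (a + b X_i) \bigr)^2
```

The "best" line is the pair (a, b) that makes S as small as possible.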
Once we find the best straight line, we know the “intercept” and the “slope”:
Same Intercept, Different slope
[Figure: two regression lines with the same intercept but different slopes; x-axis: Number of Pints]
Same slope, Different Intercept
[Figure: two regression lines with the same slope but different intercepts; x-axis: Number of Pints]
Relationship between correlation and regression
• Correlation expresses the strength and direction of the relationship between two variables.
• Regression is an extension of correlation, and allows you to make predictions about one variable from other variable(s).
• Bivariate regression (1 IV and 1 DV) produces the same result as correlation.
• Multiple regression (2+ IVs and 1 DV) goes a step farther than correlation.
Relationship between correlation and regression
Research question: What is the relationship between gun ownership and murder rate within a city?
Correlation: Imagine you are a researcher interested in the relationship between the number of registered weapons ("weapons") and the murder rate ("murder"), so you collect data on those two variables from many different cities. You find a strong positive relationship between the two variables (r = .885) that is statistically significant (p = .003).
Correlations
                                              Automatic weapons    Murder rate
                                              (in thousands)       (per 100,000)
Automatic weapons      Pearson Correlation          1                 .885**
(in thousands)         Sig. (2-tailed)              .                 .003
                       N                            8                 8
Murder rate            Pearson Correlation          .885**            1
(per 100,000)          Sig. (2-tailed)              .003              .
                       N                            8                 8
**. Correlation is significant at the 0.01 level (2-tailed).
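A quick way to reproduce this kind of correlation output in Python is scipy.stats.pearsonr; the weapons and murders arrays below are hypothetical stand-ins for the eight-city data, not the actual values behind the table.

```python
from scipy.stats import pearsonr

# Hypothetical stand-ins for eight cities (thousands of weapons, murders per 100,000)
weapons = [1.2, 2.5, 3.1, 4.0, 5.2, 6.3, 7.1, 8.4]
murders = [5.0, 6.1, 7.0, 7.4, 8.3, 9.6, 10.2, 11.5]

r, p_value = pearsonr(weapons, murders)   # Pearson correlation and two-tailed p-value
print(f"r = {r:.3f}, p = {p_value:.3f}")  # the slides report r = .885, p = .003
```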
Relationship between correlation and regression
• Regression: Now, imagine you are the Mayor of Los Angeles. You are considering lifting the ban on automatic weapons. You want to predict whether lifting the ban (so increasing the number of automatic weapons on the streets) will impact the murder rate. You are going to use the data (from the above 8 cities) to PREDICT the relationship for a 9th city – Los Angeles. You find a strong positive relationship (.885) between the two variables that is statistically significant (p=.003).
Coefficients(a)
Model 1                              B        Std. Error    Beta     t        Sig.
(Constant)                           4.047    1.089                  3.715    .010
Automatic weapons in thousands       .853     .183          .885     4.656    .003
a. Dependent Variable: Murder rate (in murders per 100,000)
Relationship between correlation and regression
• Regression: We can now use numbers from the output to create a "regression line." The regression line is:
  Y = a + bX
  where:
  Y = the unknown score on the variable you are predicting
  a = the Y-intercept of the regression line
  b = the slope of the regression line
  X = the known score on the other variable you are using to make a prediction
Y = a + b * X
Murders = 4.047 + .853 * Weapons
Relationship between correlation and regression
• Regression:
  Y = a + b * X
  Murders = 4.047 + .853 * Weapons
  If you are the Mayor of Los Angeles, simply insert into the regression equation the number of weapons on the street in Los Angeles (X), and you can predict the number of murders (Y):
  If 1,000 weapons, then predicted murders ≈ 857
  If 2,000 weapons, then predicted murders ≈ 1,710
  If 3,000 weapons, then predicted murders ≈ 2,563
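As a minimal sketch, the slide's plug-in arithmetic can be scripted; the intercept 4.047 and slope .853 come straight from the output above, while the function name and example inputs are just illustrative.

```python
# Bivariate regression prediction: Murders = a + b * Weapons
# Intercept (a) and slope (b) are taken from the coefficients table on the slide.
INTERCEPT = 4.047
SLOPE = 0.853

def predict_murders(weapons: float) -> float:
    """Predicted murder rate for a given number of weapons, per the fitted line."""
    return INTERCEPT + SLOPE * weapons

for weapons in (1000, 2000, 3000):
    print(f"{weapons} weapons -> about {predict_murders(weapons):.0f} murders")
# Matches the slide: roughly 857, 1710, and 2563.
```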
Multiple Regression
• Using several "predictors" simultaneously
• Example: Study about internalizing violence (DV), with three predictors:
  X1 = degree of witnessing violence
  X2 = measure of life stress
  X3 = measure of social support
[Diagram: X1, X2, and X3 each pointing to the DV]
Multiple Regression
• Given this diagram, what would you want to know?
  (1) When all three predictors are entered, the overall prediction (variance explained) of the DV:

Model Summary
R = .37(a)   R Square = .135   Adjusted R Square = .108   Std. Error of the Estimate = 2.2198
a. Predictors: (Constant), Social support, Current stress, Amount of violence witnessed
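One way to get this kind of model summary in Python is with statsmodels; the DataFrame below is a hypothetical stand-in for the study's data, with made-up column names that mirror the slide's variables.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in data mirroring the slide's variables
df = pd.DataFrame({
    "witnessed":   [2, 5, 1, 7, 3, 6, 4, 8, 2, 5],
    "stress":      [1, 4, 2, 6, 3, 5, 2, 7, 1, 4],
    "support":     [8, 3, 7, 2, 6, 4, 5, 1, 9, 3],
    "internalize": [3, 7, 2, 9, 5, 8, 4, 10, 2, 6],
})

X = sm.add_constant(df[["witnessed", "stress", "support"]])  # adds the intercept column
model = sm.OLS(df["internalize"], X).fit()

print(model.rsquared)      # R Square (overall prediction of the DV)
print(model.rsquared_adj)  # Adjusted R Square
print(model.params)        # unstandardized B coefficients, including the constant
print(model.pvalues)       # significance of each predictor
```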
Multiple Regression
(2) The unique prediction of each variable:

Coefficients(a)
                                   B        Std. Error    Beta     t       Sig.
(Constant)                         .477     1.289                  .37     .712
Amount of violence witnessed       .038     .018          .201     2.1     .039
Current stress                     .273     .106          .247     2.6     .012
Social support                    -.074     .043         -.166    -2       .087
a. Dependent Variable: Internalizing symptoms on CBCL
Ŷ (DV) = b0 + b1*X1 + b2*X2 + b3*X3
       = 0.477 + 0.038*(Witnessed) + 0.273*(Stress) - 0.074*(SocSupp)
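As a minimal illustration of plugging into this multiple-regression equation (the coefficients come from the slide's output; the function name and example values are hypothetical):

```python
# Predicted internalizing score from the slide's multiple-regression equation:
# DV-hat = 0.477 + 0.038*witnessed + 0.273*stress - 0.074*support
def predict_internalizing(witnessed: float, stress: float, support: float) -> float:
    return 0.477 + 0.038 * witnessed + 0.273 * stress - 0.074 * support

# Hypothetical child: witnessed = 10, stress = 5, support = 8
print(round(predict_internalizing(10, 5, 8), 2))  # 0.477 + 0.38 + 1.365 - 0.592 = 1.63
```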
Correlations
                                 Current stress   Social support   Internalizing symptoms on CBCL
Amount of violence witnessed          .050             .080               .200*
Current stress                                        -.080               .270**
Social support                                                           -.170
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Multiple Regression
• The three things you typically want to know are:
  - Overall effect (of all variables) = R²
  - Unique effect of each variable, while controlling for the others = Beta
  - Unique effect of each variable, without controlling for the others = correlation matrix (same as separate bivariate regressions)
Multiple Regression
• What we have just talked about is: Entry (all variables entered simultaneously)
• But you have other options as well:
  - Hierarchical (you specify the order)
  - Stepwise (the computer chooses based on criteria):
    • Backward
    • Forward
    • Stepwise
Hierarchical
• You enter the variables in a specified order (called steps or blocks).
• Block 1 tells you the unique effect of the variable(s) entered first.
• Block 2 tells you the unique effect of the new variable(s).
• And so forth.
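A sketch of how hierarchical entry might look in Python with statsmodels, assuming a DataFrame df like the hypothetical one above; which variables go in which block is the researcher's call, and the split below is purely illustrative.

```python
import statsmodels.api as sm

# Block 1: the first predictor; Block 2 adds the remaining predictors.
block1 = sm.add_constant(df[["witnessed"]])
block2 = sm.add_constant(df[["witnessed", "stress", "support"]])

m1 = sm.OLS(df["internalize"], block1).fit()
m2 = sm.OLS(df["internalize"], block2).fit()

print(m1.rsquared)                # variance explained by Block 1 alone
print(m2.rsquared - m1.rsquared)  # R-squared change: unique contribution of Block 2
```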
Forward
• The computer first enters the predictor with the highest correlation with the DV.
• The computer then enters the predictor with the highest semi-partial correlation with the DV.
  - (If V1 explained 40% of the DV, then 60% is unexplained, so which variable is the best explainer of that 60%?)
• The computer then enters the predictor with the highest semi-partial correlation with the DV.
  - (If V1 and V2 explained 80%, then which variable best explains the remaining 20%? And so on.)
• And so forth…
• Stops when no new variable significantly explains the residual variation.
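A rough sketch of this forward-entry logic with statsmodels, using the p-value of each candidate (once added) as the entry criterion rather than the semi-partial correlation itself; df and the column names are the hypothetical ones from earlier.

```python
import statsmodels.api as sm

def forward_select(df, dv, candidates, alpha=0.05):
    """Greedy forward entry: add the candidate with the smallest p-value until none is significant."""
    selected = []
    while candidates:
        best_p, best_var = None, None
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            p = sm.OLS(df[dv], X).fit().pvalues[var]  # p-value of the newly added predictor
            if best_p is None or p < best_p:
                best_p, best_var = p, var
        if best_p >= alpha:   # no remaining variable significantly explains the residual variation
            break
        selected.append(best_var)
        candidates = [v for v in candidates if v != best_var]
    return selected

print(forward_select(df, "internalize", ["witnessed", "stress", "support"]))
```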
Backward
• The computer enters all variables and calculates the unique contribution of each.
• A removal criterion is set, and if variable(s) don't meet the criterion, they are removed from the analysis.
• The new model is then analyzed; if variable(s) don't meet the criterion, they are removed from the analysis.
• Stops when no remaining variable meets the removal criterion.
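And a mirror-image backward sketch under the same assumptions (the hypothetical df and columns, with the p-value standing in as the removal criterion):

```python
import statsmodels.api as sm

def backward_eliminate(df, dv, predictors, alpha=0.05):
    """Start with all predictors; repeatedly drop the least significant one until all remaining meet the criterion."""
    remaining = list(predictors)
    while remaining:
        model = sm.OLS(df[dv], sm.add_constant(df[remaining])).fit()
        p_values = model.pvalues.drop("const")  # ignore the intercept
        worst_var = p_values.idxmax()
        if p_values[worst_var] < alpha:         # every remaining variable meets the criterion: stop
            break
        remaining.remove(worst_var)             # remove the variable that fails the criterion
    return remaining

print(backward_eliminate(df, "internalize", ["witnessed", "stress", "support"]))
```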
Stepwise
• Combination of Forward and Backward.
• Similar to Forward in that:
  - The computer first enters the predictor with the highest correlation with the DV.
  - The computer then enters the predictor with the highest semi-partial correlation with the DV.
• Similar to Backward in that:
  - A removal criterion is set, and if variable(s) don't meet the criterion, they are removed from the analysis.
How to choose which variables, and how to enter them
• Correlation matrix
  - Choose IVs that are somewhat correlated with the DV.
  - Choose IVs that are not too correlated with the other IVs.
• Regression
  - Analyze your hypothesis first.
  - Then start "exploratory" analysis.
    • Statisticians frown upon too much exploratory work as "fishing."
  - Entry and Hierarchical are preferred over Stepwise; if Stepwise, Backward is preferred over the others.
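A quick way to do this screening step in practice is simply to inspect the correlation matrix, e.g. with pandas; df and the column names are the same hypothetical stand-ins used above.

```python
# Screen predictors before regression: look for IVs correlated with the DV
# but not too highly correlated with each other.
corr = df[["witnessed", "stress", "support", "internalize"]].corr()

print(corr["internalize"])  # each IV's correlation with the DV
print(corr.loc[["witnessed", "stress", "support"],
               ["witnessed", "stress", "support"]])  # IV-to-IV correlations
```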