Regression Methods
OLAWALE AWE
LISA SHORT COURSE, DEPARTMENT OF
STATISTICS, VIRGINIA TECH.
NOVEMBER 21, 2013.
About
 What? Laboratory for Interdisciplinary Statistical Analysis
 Why? Mission: to provide statistical advice, analysis, and education to Virginia Tech researchers
 How? Collaboration requests, Walk-in Consulting, Short Courses
 Where? Walk-in Consulting in the GLC and various other locations (www.lisa.stat.vt.edu/?q=walk_in); collaboration meetings typically held in Sandy 312
 Statistical Collaborators? Graduate students and faculty members in the VT Statistics Department
Requesting a LISA Meeting
 Go to www.lisa.stat.vt.edu
 Click link for “Collaboration Request Form”
 Sign into the website using VT PID and password
 Enter your information (email, college, etc.)
 Describe your project (project title, research goals, specific research questions, whether you have already collected data, special requests, etc.)
 Contact assigned LISA collaborators as soon as possible
to schedule a meeting
Laboratory for Interdisciplinary Statistical
Analysis
LISA helps VT researchers benefit from the use of
Statistics
Collaboration:
Visit our website to request personalized statistical advice and assistance with:
Experimental Design • Data Analysis • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, SPSS...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
Educational Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: M-F 1-3 PM, GLC Video Conference Room, for questions requiring <30 minutes
Also 11 AM-1 PM at the Port (Library/Torg Bridge) and 9:30-11:30 AM at ICTAS Café X
All services are FREE for VT researchers. We assist with research—not class projects or homework.
www.lisa.stat.vt.edu
4
Outline
 Introduction to Regression Analysis
 Simple Linear Regression
 Multiple Linear Regression
 Regression Model Assumptions
 Residual Analysis
 Assessing Multicollinearity: Correlation and VIF
 Model Selection Procedures
 Illustrative Example (Brief Demo with SPSS/PASW)
 Model Diagnostic and Interpretation
Introduction
 Regression is a statistical technique for investigating,
describing, and predicting the relationship between
two or more variables.
 Regression has been regarded as the most widely used
technique in statistics.
 As basic to statistics as the Pythagorean theorem is to geometry (Montgomery et al., 2006).
6
Regression: Intro
 Regression Analysis has tremendous applications in
almost every field of human endeavor.
 One of the most popular statistical techniques used by
researchers.
 Widely used in engineering, physical and chemical
sciences, economics, management, social sciences, life
and biological sciences, etc.
 Easy to understand and interpret.
 Simply put, Regression analysis is used to
find equations that fit data.
7
When Do We Use Regression Techniques?

                          Explanatory Variable(s)
Response Variable    Categorical               Continuous    Categorical & Continuous
Categorical          Contingency Table or      Logistic      Logistic Regression
                     Logistic Regression       Regression
Continuous           ANOVA                     Regression    ANCOVA or Regression with
                                                             categorical variables
8
9
SIMPLE LINEAR
REGRESSION
Simple Linear Regression
 Simple Linear Regression (SLR) is a statistical method
for modeling the relationship between ONLY two
continuous variables.
 A researcher may be interested in modeling the
relationship between Life Expectancy and Per Capita
GDP of seven countries as follows.
 Scatterplots are first used to graphically examine the
relationship between the two variables.
10
Types of Relationships Between Two Continuous
Variables
 A scatter plot is a visual representation of the
relationship between two variables.
 Positive and negative linear relationship
11
Other Types of Relationships…
 Curvilinear Relationships
 No Relationship
12
Simple Linear Regression
Can we describe the
behavior between
the two variables
with a linear equation?
 The variable on the x-axis is often called the explanatory
or predictor variable(X).
 The variable on the y-axis is called the response
variable(Y).
13
Simple Linear Regression Model
 The Simple Linear Regression model is given by
yᵢ = β₀ + β₁xᵢ + εᵢ
where yᵢ is the response of the ith observation,
β₀ is the y-intercept,
β₁ is the slope,
xᵢ is the value of the predictor variable for the ith observation,
εᵢ ~ iid Normal(0, σ²) is the random error,
i = 1, …, n.
14
Interpretation of Slope and Intercept Parameter
 β₁ is the difference in the predicted value of Y for a one-unit difference in X.
 β₀ is the mean response when the predictor variable is zero (it often has no practical meaning, but should be included).
 If β₁ > 0, there is a positive relationship: as variable X increases, Y also increases.
 If β₁ < 0, there is a negative relationship between the variables: as variable X increases, Y decreases.
 If β₁ = 0, there is no linear relationship between the two variables (see graphs below).
15
Graphs of Relationships Between Two
Continuous Variables
[Graphs: β₁ > 0, β₁ < 0, β₁ = 0]
16
Line of Best Fit
 A line of best fit is a straight line that best represents your data on a scatter plot.
 Its form is identical to the equation of a straight line from elementary math: y = mx + b, where m = slope and b = y-intercept.
 The residual is r = y − ŷ, where y is the observed response and ŷ is the predicted response.
 E(r) = 0 (more on residuals later).
17
Regression Assumptions
 Linearity between the dependent and independent variable(s).
 Observations are independent.
 Based on how the data were collected.
 Check by plotting residuals vs. the order in which the data were collected.
 Constant variance of the error terms.
 Check using a residual plot (plot residuals vs. ŷ).
 The error terms εᵢ are normally distributed.
 Check by making a histogram or normal quantile plot of the residuals.
18
Example 1
 Consider data on 15 American women collected by a researcher, as shown below.
We can fit a model of the form Weight = β₀ + β₁·Age + ε to the data.
19
Scatter Plot of Weight vs Age
Line of best fit
20
Model Estimation and Result
 β̂₀ = Ȳ − β̂₁·X̄
 β̂₁ = r·(Sy / Sx)
The estimated regression line is
Weight = −87.52 + 3.45·Age
Can you interpret these results?
21
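As an aside (not part of the original slides): the closed-form estimates above can be reproduced in a few lines of Python using statsmodels rather than SPSS. This is a minimal sketch with placeholder arrays, since the 15-observation data table is not reproduced in this transcript; the printed coefficients will match the quoted −87.52 and 3.45 only when the original data are supplied.

    import numpy as np
    import statsmodels.api as sm

    # Placeholder data: substitute the researcher's 15 (age, weight) observations here.
    age    = np.array([56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84], dtype=float)
    weight = np.array([110, 118, 123, 130, 136, 141, 148, 155, 160, 168, 174, 181, 188, 196, 203], dtype=float)

    # Closed-form estimates: b1_hat = r * (Sy / Sx), b0_hat = ybar - b1_hat * xbar
    r      = np.corrcoef(age, weight)[0, 1]
    b1_hat = r * weight.std(ddof=1) / age.std(ddof=1)
    b0_hat = weight.mean() - b1_hat * age.mean()
    print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}")

    # The same fit via ordinary least squares
    X   = sm.add_constant(age)       # adds the intercept column
    fit = sm.OLS(weight, X).fit()
    print(fit.summary())             # coefficients, R-squared, p-values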
Description/Interpretation
 The above results can be interpreted as follows:
- A Sig. (p-value) of 0.000 indicates that the model is a good fit to the data: Age makes a significant contribution to explaining the average variability in the weights of the women.
- The value of β̂₁ (slope = 3.45) indicates a positive relationship between weight and age.
 The slope coefficient indicates that for every additional one-unit increase in age, we can expect weight to increase by an average of 3.45 kilograms.
- R indicates that there is a high association between the dependent variable and the predictor variable.
- The R-squared value of 0.991 means that 99.1% of the variability in the weights of the women is explained by the model.
22
Prediction
 Using the regression model above, we can predict the weight of a woman who is 75 years old:
Weight = −87.52 + 3.45(75)
Weight ≈ 171
 Exercise:
- Using the SLR model above, predict the weight of a woman whose age is 82.
Ans: about 195 kg
23
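A quick check of the arithmetic (a sketch only, reusing the fitted line quoted above):

    # Prediction from the fitted line Weight = -87.52 + 3.45 * Age
    def predict_weight(age):
        return -87.52 + 3.45 * age

    print(predict_weight(75))   # about 171.2, as on the slide
    print(predict_weight(82))   # about 195.4, the exercise answer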
MULTIPLE
REGRESSION
Frequently there are many predictors that we
want to use simultaneously
 Multiple linear regression model:
yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₚxₚᵢ + εᵢ
Similar to simple linear regression, except now there is more than one explanatory variable.
 In this situation each βⱼ represents the partial slope of predictor j, for j = 1, …, p.
 It can be interpreted as "the mean change in the response variable for one unit of change in that predictor variable, while holding the other predictors in the model constant."
25
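A hedged Python sketch of fitting such a model (synthetic placeholder data; the point is only the shape of the design matrix and the partial-slope interpretation):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic placeholder data standing in for a real dataset with two predictors.
    rng = np.random.default_rng(0)
    n   = 50
    x1  = rng.normal(size=n)
    x2  = rng.normal(size=n)
    y   = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

    X   = sm.add_constant(np.column_stack([x1, x2]))   # intercept column + p predictor columns
    fit = sm.OLS(y, X).fit()
    print(fit.params)   # each slope is the mean change in y per unit change in that x,
                        # holding the other predictors constant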
Example 2:
 Suppose the researcher in our example 1 above is
interested in knowing if height also contributes to change
in weight:
26
Step 1: Scatterplots
27
Model Estimation with SPSS
28
Multiple Regression
 The new model is therefore written as
Weight = 𝛽0 + 𝛽1 𝐴𝑔𝑒 + 𝛽2 𝐻𝑒𝑖𝑔ℎ𝑡 + Error
So the fitted model is: Weight = −81.53 + 3.46·Age − 1.11·Height
29
Model Interpretation
 The result of the model estimation above shows that height does not contribute to the average variability in the weight of the women. The high p-value shows that it is not statistically significant (changes in height are not associated with changes in weight).
 No statistically significant linear dependence of the mean of weight on height was detected.
 Note that the values of R-squared and adjusted R-squared did not decrease when we added the additional independent variable.
 For every one-unit increase in age, average weight increases at a rate of 3.46 units, while holding height constant.
30
Model Diagnostic and Residual Analysis
 Residual is a measure of the variability in the response
variable not explained by the regression model.
 Analysis of the residuals is an effective way to discover violations of the model assumptions.
 Plotting residuals is a very effective way to investigate
how well the regression model fits the data.
 A residual plot is used to check the assumption of
constant variance and to check model fit (can the model
be trusted?).
31
Diagnostics: Residual Plot
 The residuals should fall in a symmetrical pattern and have a constant spread throughout their range.
 Good residual plot: no pattern.
32
We Can Plot:
 Residual vs Independent Variable(s)
 Residual vs Predicted values
 Residual vs Order of the data
 Residual Lag Plot
 Histogram of Residual
 Standardized Residual vs Standardized Predicted Value
etc.
33
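A minimal matplotlib sketch of two of these plots (residuals vs. predicted values, and a histogram of residuals), using synthetic data so the snippet is self-contained:

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # Self-contained sketch with synthetic data; swap in your own response and predictors.
    rng = np.random.default_rng(1)
    x   = rng.uniform(0, 10, size=40)
    y   = 3 + 2 * x + rng.normal(scale=1.0, size=40)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    resid, fitted = fit.resid, fit.fittedvalues

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(fitted, resid)
    ax1.axhline(0, linestyle="--")
    ax1.set_xlabel("Predicted values")
    ax1.set_ylabel("Residuals")
    ax1.set_title("Residuals vs predicted (want no pattern)")

    ax2.hist(resid, bins=10)
    ax2.set_title("Histogram of residuals (want roughly bell-shaped)")
    plt.tight_layout()
    plt.show()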
Residuals
 Column 3 in the table below shows the residuals of the regression model Weight = −87.52 + 3.45·Age.
A residual is the deviation between the data and the fit (Actual Y − Predicted Y).
34
Residual Diagnostics: Very Important!
 Left: Residuals show non-constant variance.
 Right: Residuals show non-linear pattern.
35
Look at the Figures Below, What Do You Think?
36
Residual Plot
37
Residual Plots
38
What if the Assumptions Are Not Met?
 Linearity:
 Transform the dependent variable (see next slide).
 Normality:
 Transform the data (also when outliers are present).
 Or use robust regression, where normality is not required.
 Increase the sample size, if possible.
 Homogeneity of variance:
 Try transforming the data.
39
Some Tips on Transformation
 log(Y)
- Used if Y is positively skewed and has positive values.
 √Y
- Used if Y has a Poisson distribution (i.e., is count data).
 1/Y
- Used if the variance of Y is proportional to the 4th power of E(Y).
 sin⁻¹(√Y)
- Used if Y is a proportion or rate.
40
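The transformations above are one-liners in most packages; a hedged sketch in Python (illustrative values only):

    import numpy as np

    # Illustrative positive, right-skewed response values (not real data).
    y = np.array([1.0, 2.0, 4.0, 9.0, 20.0, 45.0])

    log_y   = np.log(y)      # for positively skewed, strictly positive y
    sqrt_y  = np.sqrt(y)     # for Poisson-like count responses
    recip_y = 1.0 / y        # when Var(y) is proportional to E(y)**4

    # The arcsine transform is applied to proportions p in [0, 1].
    p = np.array([0.05, 0.20, 0.50, 0.80, 0.95])
    arcsine_p = np.arcsin(np.sqrt(p))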
Multicollinearity
 A common problem in multiple regression that develops when one or more of the independent variables is highly correlated with one or more of the other independent variables.
 How the explanatory variables relate to each other is fundamental to understanding their relationship with the response variable.
 Usually, when you see estimated beta weights larger than 1 in a regression analysis, consider the possibility of multicollinearity.
 Multicollinearity can be mild or severe (depending on how high the correlations are, or whether VIFs exceed 10).
41
Effects on P-Values
 You will get different p-values for the same
variables in different regressions as you
add/remove other explanatory variables.
 A variable can be significantly related to Y by itself,
but not be significantly related to Y after
accounting for several other variables. In that case,
the variable is viewed as redundant.
 If all the X variables are correlated, it is possible
ALL the variables may be insignificant, even if each
is significantly related to Y by itself.
42
Multicollinearity Effect on Coefficients
 Similarly, coefficients of individual explanatory
variables can change depending on what other
explanatory variables are present.
 May change signs sporadically.
 May be excessively large when there is
multicollinearity.
43
Multicollinearity Isn’t Tragic
 In most practical datasets there will be some degree
of multicollinearity. If the degree of
multicollinearity isn’t too bad (more on its
assessment in the next slides) then it can be safely
ignored.
 If you have serious multicollinearity, then your
goals must be considered and there are various
options.
 In what follows, we first focus on how to assess
multicollinearity, then what to do about it should it
be found to be a problem.
44
Assessing Multicollinearity: Two Methods
 There is typically some degree of multicollinearity in most experiments.
 We discuss two methods for assessing
multicollinearity in this course:
 (1)Correlation matrix
 (2)Variance Inflation Factor(VIF)
45
Correlation Matrices
 A correlation matrix is simply a table indicating the
correlations between each pair of explanatory
variables.
 If you haven’t seen it before, the correlation
between two variables is simply the square root of
R2, combined with a sign indicating a positive or
negative association.
 If you see values close to 1 or -1 that indicates
variables are strongly associated with each other
and you may have multicollinearity problems.
 If you see many correlations all greater in absolute
value than 0.7, you may also have problems with
your model.
46
Correlation Matrix
A cursory look at the correlation matrix of the
independent variables shows if there is
multicollinearity in our experiment.
47
Correlation Matrix Involving the DV
It gives a preliminary idea of the bivariate association of the dependent variable with the independent variables.

        GDP    FER    MS     CE     TR     ER
GDP     1      0.99   0.95   0.61   0.14   0.87
FER     0.99   1      0.92   0.55   0.09   0.85
MS      0.95   0.92   1      0.76   0.09   0.76
CE      0.61   0.55   0.76   1      0.05   0.46
TR      0.14   0.09   0.09   0.05   1      0.36
ER      0.87   0.85   0.76   0.46   0.36   1
48
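For reference, a correlation matrix like the one above is a one-liner outside SPSS as well; a sketch with random placeholder values standing in for the 50 observations:

    import numpy as np
    import pandas as pd

    # Random placeholder values; replace with the actual GDP/FER/MS/CE/TR/ER series.
    rng  = np.random.default_rng(2)
    econ = pd.DataFrame(rng.normal(size=(50, 6)),
                        columns=["GDP", "FER", "MS", "CE", "TR", "ER"])

    print(econ.corr().round(2))   # pairwise Pearson correlations, as in the table above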
Disadvantages of Using Correlation Matrices
 Correlation matrices only work with two variables
at a time. Thus, we can only see pairwise
relationships. If a more complicated relationship
exists, the correlation matrix won’t find it.
 Multicollinearity is not a bivariate problem.
 Use VIFs!
49
Variance Inflation Factors (VIFs)
 Variance inflation factors measure the relationship
of all the variables simultaneously, thus they avoid
the “two at a time” disadvantage of correlation
matrices.
 They are harder to explain.
 There is a VIF for each variable.
 Loosely, the VIF is based on regressing each
variable on the remaining variables. If the
remaining variables can explain the variable of
interest, then that variable has a high VIF.
50
Using VIFs
 The use of variance inflation factors is the most reliable way to examine multicollinearity.
 VIF = 1/Tolerance = 1/(1 − R²), where R² is obtained by regressing that independent variable on the other independent variables.
 Tolerance is the proportion of variance in the independent variable not explained by its relationship with the other independent variables.
 In practice, all VIFs are greater than 1.
 VIFs are considered “bad or severe” if they exceed 10.
51
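A hedged sketch of computing VIFs with statsmodels (placeholder data with some collinearity deliberately induced; the variable names are only illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Placeholder predictors; "MS" is built to be correlated with "FER" so the VIFs rise.
    rng = np.random.default_rng(3)
    X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["FER", "MS", "CE"])
    X["MS"] = 0.9 * X["FER"] + rng.normal(scale=0.3, size=50)

    Xc = sm.add_constant(X)   # VIFs are computed on the full design matrix
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    print(vifs)               # values above 10 are usually called severe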
So Multicollinearity is an Issue – What Do You Do
About It?
 Remember, if multicollinearity is present but not
excessive (no high correlations, no VIFs above 10),
you can ignore it.
 If multicollinearity is a big issue in your dataset,
your goal becomes extremely important.
52
Variance Inflation Factor
VIFs of 16.545 and 17.149 are ‘severe’.
53
If Your Goal is Prediction…
 With severe multicollinearity everything fails,
except if your goal is just prediction.
 If your main goal is prediction (using the available
explanatory variables to predict the response), then
you can safely ignore the multicollinearity.
54
If Interest Centers on the Real Relationships
Between the Variables…
 When you have serious multicollinearity, the
variables are sufficiently redundant that you cannot
reliably distinguish their effects.
 There is no single solution for this problem.
55
Some Tips:
 Drop one of the "offending" variables from the regression equation, but often the variables are so intertwined that you cannot distinguish them.
 Combine the collinear variables. For example, if in a sociological study you find that the variables "father's education level" and "mother's education level" are strongly related, it may be sufficient to use a single variable, "parents' education level", which is some function of the two.
 Sometimes, you may not be able to disentangle your explanatory variables.
56
Dealing with Multicollinearity
 In many situations you get to select some of the
explanatory variables (in engineering studies you often
get to select almost all of them, in medical studies you
can select drug dosage).
 You can use centered independent variables.
 Use Ridge Regression or PCA (see Montgomery et al,
2006).
 Use one of the analytic procedures like LISREL (see
Adelodun and Awe,2013).
57
Note…
 Most importantly, make sure you set up your
experiments in a way that you do not “install”
multicollinearity.
 Since multicollinearity diagnostics are so easy to obtain
(through stat. packages), no researcher should ever
report results of regressions with obvious
multicollinearity problems!
58
SHORT QUIZ:
Consider the Regression Model Below:
 𝑦𝑖 = α + β𝑥1𝑖 + 𝜀𝑖
 Which values are known/unknown?
 Which are data, which are parameters?
 Which term is the slope? Intercept?
 What are the common assumptions about error structure
(fill in the blanks):
 𝜀𝑖 ~___(___,____)
 What is the difference between β₁ and β̂₁?
59
Let’s have a break for
few minutes!
60
A Brief Review…
 You have several explanatory variables and a single
response Y.
 You run the multiple regression first and check the
residuals and collinearity diagnostic measures.
 If the residuals look bad, deal with those first (you
may need a transformation or fit a polynomial ).
 Now suppose you have decent residuals…
61
With Decent Residuals…
 Check the collinearity measures. If these are problematic (any VIF above 10, or high correlations), then you must start removing or combining variables before you can trust the output. This tends to be a substantive, not a statistical, task.
 The variables with the highest VIFs can be targeted for deletion first: they are the "most redundant".
 After you do anything, remember to check the residuals again.
 Now suppose you have decent residuals and collinearity measures…
 Look at the p-values and R². If these are significant, stop. Otherwise continue to the next slide…
62
Model Selection Procedures
 “Model selection” refers to determining which of the
explanatory variables should be placed in a final model.
 Usually, we want a parsimonious model, or a model
which describes the response well but is as free from
multicollinearity as possible.
 "All models are wrong, but some are useful." So said
the statistician George Box.
63
Variable Subset Selection Uses Statistical Criteria
to Identify a Set of Predictors for Our Model
 Variable subset selection: Among a set of potential
predictors, choose a subset to include in the model based
on some statistical criterion, e.g. p-values
 Forward selection: Add variables one at a time starting
with the x most strongly associated with y. Stop when
no other ‘significant’ variables are identified.
 Drawback: Variables added to the model cannot be
taken out again.
64
Variable Subset Selection Continued
 Backwards elimination: Start with every candidate
predictor in the model. Remove variables one at a time
until all remaining variables are “significantly” associated
with response.
 Drawback: Variables taken out of the model cannot be
added back.
 Stepwise selection: As forward selection, but at each
iteration remove variables which are made obsolete by
new additions.
 Combination of forward and backward methods.
65
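A hedged sketch of backward elimination by p-value, purely to illustrate the idea (real packages differ in their exact criteria, and the data here are synthetic placeholders):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def backward_eliminate(y, X, alpha=0.05):
        """Drop the least significant predictor until all remaining p-values < alpha."""
        X = X.copy()
        while True:
            fit = sm.OLS(y, sm.add_constant(X)).fit()
            pvals = fit.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] < alpha or X.shape[1] == 1:
                return fit
            X = X.drop(columns=[worst])   # remove the most "insignificant" variable

    # Synthetic placeholder data: only x1 and x2 truly drive y.
    rng = np.random.default_rng(4)
    X = pd.DataFrame(rng.normal(size=(60, 4)), columns=["x1", "x2", "x3", "x4"])
    y = 1 + 2 * X["x1"] - X["x2"] + rng.normal(scale=1.0, size=60)

    final = backward_eliminate(y, X)
    print(final.model.exog_names)      # variables kept in the final model
    print(final.pvalues.round(3))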
A Recap on Meaning and Interpretation of
Regression Results
 Let us review and familiarize ourselves with the meaning
of each entity that appears in the regression results.
 Note that the regression procedures in SPSS and JMP are similar.
 All the estimates and analyses can be done easily using
statistical packages.
66
Coefficient of Multiple Determination
 The coefficient of determination, R², is the percent of variation in the response y explained by the set of explanatory variables x₁, …, xₚ₋₁:
R² = SSR/SSTO = 1 − SSE/SSTO
0 ≤ R² ≤ 1 (closer to 1 indicates a better-fitting model)
 The adjusted coefficient of determination, R²adj, introduces a penalty for additional explanatory variables (it takes sample size into account, and so is more reliable):
R²adj = 1 − (SSE/(n − p)) / (SSTO/(n − 1))
67
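A sketch verifying these two formulas numerically against statsmodels' own output (synthetic data; p here counts the intercept among the model parameters):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data: n observations, p model parameters (intercept + 2 slopes).
    rng  = np.random.default_rng(5)
    n, p = 40, 3
    X = sm.add_constant(rng.normal(size=(n, p - 1)))
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.0, size=n)

    fit  = sm.OLS(y, X).fit()
    sse  = np.sum(fit.resid ** 2)
    ssto = np.sum((y - y.mean()) ** 2)

    r2     = 1 - sse / ssto
    r2_adj = 1 - (sse / (n - p)) / (ssto / (n - 1))
    print(round(r2, 3), round(fit.rsquared, 3))          # should agree
    print(round(r2_adj, 3), round(fit.rsquared_adj, 3))  # should agree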
Interpretation of Terms…
 P-value: The p-value in a regression provides a test of
whether that variable is significantly related to Y, after
accounting for everything else.
 ANOVA is used to evaluate the overall model
significance.
 Standard error: a measure of the uncertainty of an estimate; it measures the variability of the actual Y values around the predicted Y.
 R is the correlation, which measures how the variables move in association with each other.
68
ANOVA Table for Simple Linear Regression
Source        SS                      df       MS                  F          P-value
Regression    SSR = Σ(ŷᵢ − ȳ)²        1        MSR = SSR/1         MSR/MSE    P(F > F₁₋α; 1, n−2)
Error         SSE = Σ(yᵢ − ŷᵢ)²       n − 2    MSE = SSE/(n − 2)
Total         SSTO = Σ(yᵢ − ȳ)²       n − 1
(sums taken over i = 1, …, n)

The F-test tests whether there is a linear relationship between the two variables (it is used to determine whether the model is significant).
Null Hypothesis H₀: β₁ = 0
Alternative Hypothesis Hₐ: β₁ ≠ 0
69
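A sketch of the F-test above computed by hand (synthetic data; scipy is assumed available for the F-distribution tail probability):

    import numpy as np
    from scipy import stats

    # Synthetic simple-linear-regression data.
    rng = np.random.default_rng(6)
    n = 25
    x = rng.uniform(0, 10, size=n)
    y = 4 + 1.2 * x + rng.normal(scale=2.0, size=n)

    # Least-squares fit
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    yhat = b0 + b1 * x

    # ANOVA decomposition
    ssr = np.sum((yhat - y.mean()) ** 2)
    sse = np.sum((y - yhat) ** 2)
    msr, mse = ssr / 1, sse / (n - 2)

    F = msr / mse
    p_value = stats.f.sf(F, 1, n - 2)   # P(F_{1, n-2} > F), the table's P-value
    print(round(F, 2), round(p_value, 4))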
Illustrative Example: Brief SPSS Demo
 Please wait patiently for a brief SPSS Demo involving
example 3.
70
Example 3:Practical
 Suppose a researcher is interested in measuring the effect of several economic indicators on the GDP of a particular country in Africa (say, Nigeria).
 The researcher may specify the Multiple Linear Regression model as follows:
GDPᵢ = β₀ + β₁FERᵢ + β₂MSᵢ + β₃CEᵢ + β₄TRᵢ + β₅ERᵢ + εᵢ,
where i = 1, …, 50.
See the data and estimation of this model in the demo section.
71
Where…
 GDP= Gross Domestic Product(Y)
 FER=Foreign Exchange Reserve(X1)
 MS=Money Supply(X2)
 CE=Capital Expenditure(X3)
 TR=Treasury Bill Rate(X4)
 ER=Exchange Rate(X5)
 𝜀𝑖 =Stochastic Error
 After inputting the data into SPSS/PASW, click Analyze → Regression → Linear…
72
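The same model can also be fit outside SPSS; a hedged Python sketch with placeholder data, since the Nigerian series itself is not included in this transcript:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Placeholder values standing in for the 50 annual observations of the indicators above.
    rng  = np.random.default_rng(7)
    data = pd.DataFrame(rng.normal(size=(50, 6)),
                        columns=["GDP", "FER", "MS", "CE", "TR", "ER"])

    X   = sm.add_constant(data[["FER", "MS", "CE", "TR", "ER"]])
    fit = sm.OLS(data["GDP"], X).fit()
    print(fit.summary())   # coefficients, p-values, R-squared, and the ANOVA F statistic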
MODEL DIAGNOSTIC
AND
INTERPRETATION
Look at the Following (SPSS) Regression Output:
Can You Diagnose and Interpret these Results?
74
Residual Plots for the Final Model
Histogram of Residuals
Some Lessons…
 High R2 value does not always indicate a good model!
 Always check your residuals after each analysis.
 If you notice non-random patterns in your residuals, it
means that your model is missing something.
Possibilities include:
-A missing variable.
-A missing higher-order term of a variable in the model to
explain the curvature.
-A missing interaction between terms already in the model.
-etc.
 While trying to fit a parsimonious model, these
possibilities can be explored further and figured out by
the researcher.
78
Some References
 Michael Sullivan III (2004). Statistics: Informed Decisions Using Data. Upper Saddle River, New Jersey: Pearson Education.
 Michael H. Kutner, Christopher J. Nachtsheim, John Neter and William Li (2005). Applied Linear Statistical Models. New York: McGraw-Hill Irwin.
 Gordon, Robert A. (1968). Issues in Multiple Regression. American Journal of Sociology, Vol. 73, pp. 592-616.
 Schroeder, Mary Ann (1990). Diagnosing and Dealing with Multicollinearity. Western Journal of Nursing Research, 12(2), 175-187.
 "Multicollinearity". Dr. Bunty Ethington, EDPR 7/8542, University of Memphis.
 Montgomery et al. (2006). Introduction to Linear Regression Analysis, 3rd Ed. Wiley Series.
 Awe et al. (2013). Regression Model Diagnostic, Test and Robustification in the Presence of Multicollinear Covariates. International Journal of Electronic and Computer Research (India), Vol. 2(2).
 Adelodun, A. A. and Awe, O. O. (2013). Using LISREL for Empirical Research. Transnational Journal of Science and Technology (Macedonia), Vol. 3(8), pp. 1-14.
 www.lisa.stat.vt.edu/
79
Acknowledgement
 Thanks to the following:
 Dr. Eric Vance
 Dr. Chris Franck
 Tonya Pruitt
80