Regression Analysis: Outline
• Review of Regression Analysis
• Regression with Categorical Explanatory Variables
• Pooled Regression: Fixed Effect and Random Effect Models
Regression Analysis in the overall context of Research
• Research Purpose
– Research questions, objectives, hypotheses
• Methodology
– Type of study
– Sampling plan and sample size determination
– Data collection methods
– Data analysis plan
• Execution
– Data collection and analysis
– Discussion and conclusion
– Research evaluation
Regression Analysis: Review
• What is regression?
– A dependence measure ~ estimates the overall relationship between the dependent and independent variables
– Examples of dependent and independent variables?
• Regression and causality (~ experiment, theory)
• Regression (~ predicts the dependent variable) and correlation (~ linear association)
• Uses of regression
– Descriptive ~ describes the relationship and how strong it is
– Inference ~ which variables are most important/significant?
– Predictive ~ forecasting
• Hypothesis testing
• Sample size
Type of Variables in Regression Analysis
• Independent
• Dependent
• Moderating
• Mediating
• Moderation-mediation
Moderating Variables
• Testing moderation
– $Y = b_0 + b_1 X + b_2 Z + b_3 XZ + e$
– Rearranged: $Y = [b_1 + b_3 Z]\,X + [b_0 + b_2 Z] + e$, so the slope of X depends on the moderator Z
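A minimal sketch of the moderation test, assuming NumPy is available. The data are hypothetical and noise-free, and the helper names `fit_moderation` and `simple_slope` are illustrative only. It fits the interaction model by ordinary least squares and evaluates the slope of X at chosen values of Z.

```python
import numpy as np

def fit_moderation(x, z, y):
    """OLS fit of y = b0 + b1*x + b2*z + b3*x*z + e; returns [b0, b1, b2, b3]."""
    x, z, y = (np.asarray(v, dtype=float) for v in (x, z, y))
    design = np.column_stack([np.ones_like(x), x, z, x * z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def simple_slope(coef, z_value):
    """Slope of y on x at a given moderator value: b1 + b3*z."""
    return coef[1] + coef[3] * z_value

# Hypothetical noise-free data generated from y = 1 + 2x + 0.5z + 1.5xz
x = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
z = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = 1 + 2 * x + 0.5 * z + 1.5 * x * z

coef = fit_moderation(x, z, y)
print("b0..b3:", coef)
print("slope of X at Z=0:", simple_slope(coef, 0.0))
print("slope of X at Z=1:", simple_slope(coef, 1.0))
```

With real data, moderation is usually judged by whether b3 is significantly different from zero; this sketch only recovers the point estimates.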
Mediator Variables
[Figure: path diagram. Attitude → BI (path a), BI → B (path b), Attitude → B (path c).]
Multivariate Research Methods: Regression Analysis: Review
• How does it work?
• Formalization of the regression model: systematic part
– $y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \text{error}$
– Intercept, slopes, error
– Examples?
• What do we observe? Y and the X's; we estimate the b's
• Which variables to include?
– Theory, prior research, common sense
– If you don't have any idea?
» Statistical criteria: stepwise, forward and backward selection (in cases of only metric data?)
• Moderator effects ~ interaction variables
• Unsystematic part ~ error term
• How to obtain estimates?
– Least squares method of regression
– Any straight line you fit will have some error
– The objective is to minimize those errors, e.g. the sum of squared differences between Y and Y-predicted
– That is, minimize the sum of squared errors
– $Y = a + bX + e$ leads to $e = Y - a - bX$
– $e^2 = (Y - a - bX)^2$ ~ minimize $\sum e^2$
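The least squares idea for Y = a + b*X + e can be sketched in plain Python as follows. The data are hypothetical and the function names are illustrative. The closed-form slope and intercept are the values that minimize the sum of squared errors, which the last lines check by nudging the fitted line.

```python
def fit_line(xs, ys):
    """Closed-form least squares estimates (a, b) for y = a + b*x + e."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of e^2 = (y - a - b*x)^2 over all observations."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print("intercept:", a, "slope:", b)
print("SSE at the estimates:", sum_squared_errors(xs, ys, a, b))
print("SSE with a shifted intercept:", sum_squared_errors(xs, ys, a + 0.1, b))
```

Any other choice of intercept or slope gives a larger sum of squared errors than the fitted pair.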
Multivariate Research Methods: Regression Analysis: Review
• Interpretation of parameter estimates?
• Intercept
– Mean of the dependent variable when the values of all independent variables are zero
– Mean of the dependent variable when all slopes are zero
– Not always meaningful
• Slopes
– Change in Y for a one-unit change in X
– Zero slope: X does not affect Y
– $b_1, b_2, \dots, b_k$: partial regression coefficients
– e.g. $b_1$ = change in the value of Y if $X_1$ is changed by one unit while all other explanatory variables ($X_2 \dots X_k$) are kept constant
Multivariate Research Methods: Regression Analysis: Review
• Interpretation of parameter estimates?
• Size of the regression coefficient
– Depends on the scale of the explanatory variable
– So the size of the coefficient is not a good indicator of which variable is a good explanatory variable
– Scale of the independent variables ~ within 10 times of each other
• Beta coefficients (standardized coefficients)
– Provide relative importance
• Elasticity: the percentage change in the dependent variable for a 1% change in the independent variable
– $\text{elasticity} = b \cdot \dfrac{X}{Y}$, commonly evaluated at the means of X and Y
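A short sketch of why the raw coefficient size depends on scale while the standardized (beta) coefficient and the elasticity do not. The data are hypothetical, the function names are illustrative, and the elasticity is evaluated at the sample means, which is one common convention rather than the only one.

```python
from statistics import mean, stdev

def ols_slope(xs, ys):
    """Least squares slope b for y = a + b*x + e."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def standardized_beta(xs, ys):
    """Slope in standard-deviation units: b * sd(x) / sd(y)."""
    return ols_slope(xs, ys) * stdev(xs) / stdev(ys)

def elasticity_at_means(xs, ys):
    """Percentage change in y per 1% change in x, evaluated at the means."""
    return ols_slope(xs, ys) * mean(xs) / mean(ys)

# Hypothetical data; xs_rescaled is the same variable in units 100 times smaller
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
xs_rescaled = [100 * x for x in xs]

for label, data in (("original", xs), ("rescaled", xs_rescaled)):
    print(label, ols_slope(data, ys), standardized_beta(data, ys),
          elasticity_at_means(data, ys))
```

Rescaling X changes the raw slope by the same factor, but the beta coefficient and the elasticity stay the same.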
Multivariate Research Methods: Regression Analysis: Review
• Is the regression coefficient significant?
• Is the regression significant?
• Overall goodness of fit?
– $r^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$, with $0 \le r^2 \le 1$
– r ~ coefficient of multiple correlation
– Adjusted $r^2$
[Figure: scatter of Y against X with the fitted line $Y = b_0 + bX$, showing TSS split into ESS and RSS (error).]
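The goodness-of-fit quantities can be computed directly from the observed and fitted values. This is a sketch on hypothetical data, with the fitted line taken as already estimated (intercept 2.2 and slope 0.6 are the least squares values for these numbers). ESS is the explained sum of squares and RSS the residual (error) sum of squares, as on this slide; the adjusted value uses the standard degrees-of-freedom correction with k explanatory variables.

```python
def decompose(ys, fitted):
    """Return (TSS, ESS, RSS) for observed values ys and fitted values."""
    y_bar = sum(ys) / len(ys)
    tss = sum((y - y_bar) ** 2 for y in ys)
    ess = sum((f - y_bar) ** 2 for f in fitted)
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return tss, ess, rss

def adjusted_r_squared(r2, n, k):
    """Adjusted r^2 for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical data and its least squares line y = 2.2 + 0.6*x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * x for x in xs]

tss, ess, rss = decompose(ys, fitted)
r2 = 1 - rss / tss
print("TSS, ESS, RSS:", tss, ess, rss)
print("r^2:", r2, "ESS/TSS:", ess / tss)
print("adjusted r^2:", adjusted_r_squared(r2, n=len(ys), k=1))
```

The identity TSS = ESS + RSS holds only when the fitted values come from a least squares fit that includes an intercept.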
Multivariate Research Methods: Regression Analysis: Review
Major assumptions
• The variance of the error term is constant (related problem: heteroscedasticity)
• There is no autocorrelation in the error term (related problem: autocorrelation)
• There is no exact linear relationship in the independent variables (related problem: multicollinearity)
• There must be variability in the independent variables
• The regression model is correctly specified
• The regression model is linear in parameters
• The mean value of the error term is zero
• No covariation between errors and independent variables
• The error term is normally distributed
Multivariate Research Methods: Regression Analysis: Review
• Detecting problems with the assumptions?
• Heteroscedasticity
– Error variances are not the same
– Occurs when errors are related to either the dependent or the independent variables
– e.g. more stable saving (or consumption) among lower-income families; larger variances among brand switchers than among brand-loyal customers
– Remedy? If we know the nature of the heteroscedasticity, we can use weighted least squares (WLS)
– Volatility ~ finance?
[Figure: scatter of Saving against Income.]
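A hedged sketch of the WLS remedy, assuming NumPy. The income and saving numbers are hypothetical, and the weights rest on an assumed form of heteroscedasticity (error standard deviation proportional to income), which in practice would have to be justified from the data. The function name is illustrative.

```python
import numpy as np

def weighted_least_squares(x, y, weights):
    """Fit y = a + b*x by minimizing sum(w_i * e_i**2); returns [a, b]."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, weights))
    design = np.column_stack([np.ones_like(x), x])
    root_w = np.sqrt(w)
    # Scaling each row by sqrt(w_i) turns the weighted problem into plain OLS
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef

# Hypothetical data whose scatter widens with income
income = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
saving = np.array([1.1, 1.8, 3.4, 3.5, 5.8, 5.0])

# Assumption: error sd is proportional to income, so w_i = 1 / income_i^2
wls_coef = weighted_least_squares(income, saving, 1.0 / income**2)
ols_coef = weighted_least_squares(income, saving, np.ones_like(income))
print("WLS [a, b]:", wls_coef)
print("OLS [a, b]:", ols_coef)
```

With equal weights the function reduces to ordinary least squares; the down-weighting of high-income observations is what distinguishes the WLS fit.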
Regression Analysis: Review
• Detecting problems with the assumptions?
• Autocorrelation ~ more a time-series problem
– Occurs when errors are correlated across consecutive observations
• Reasons?
– Omitted variables
– Model mis-specification
• Detection
– Graphical methods
– Durbin-Watson: $DW \approx 2(1 - r)$, where r is the correlation between consecutive residuals; DW varies between 0 and 4, and the ideal value is 2
• Problem?
– Overestimates the coefficient of determination and underestimates the standard errors
[Figure: plots of $e_t$ against $e_{t-1}$ showing positive and negative autocorrelation; Y against X.]
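The Durbin-Watson statistic itself is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch with two hypothetical residual series, one that keeps its sign for runs of observations and one that alternates:

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series
positive_pattern = [1, 1, 1, -1, -1, -1]   # neighbours tend to share a sign
negative_pattern = [1, -1, 1, -1, 1, -1]   # neighbours alternate in sign

print("positively autocorrelated:", durbin_watson(positive_pattern))
print("negatively autocorrelated:", durbin_watson(negative_pattern))
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.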
Multivariate Research Methods: Regression Analysis: Review
• Detecting problems with the assumptions?
• Multicollinearity
– Presence of very high interrelations among explanatory variables (does not violate any assumption)
– Symptoms: the standard errors are likely to be high; estimates are not reliable?
• Detection
– Bivariate correlation
– Variance Inflation Factor (VIF) ~ 10: $VIF_i = \dfrac{1}{1 - r_i^2}$, where $r_i^2$ comes from regressing $X_i$ on the other explanatory variables
– Tolerance = 1/VIF
• Remedies
– Drop variables
– Composite variables, e.g. family life cycle, social status
– Factor analysis
[Figure: two diagrams relating X1, X2 and Y.]
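A sketch of the VIF calculation, assuming NumPy. Each explanatory variable is regressed on the others, and the resulting r-squared is plugged into 1 / (1 - r^2). The two predictors are hypothetical and deliberately almost identical; the function name `vif` is illustrative.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for i in range(k):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Hypothetical predictors: x2 is x1 plus a small wiggle, so they are highly collinear
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])

vifs = vif(np.column_stack([x1, x2]))
print("VIF:", vifs)
print("Tolerance:", 1 / vifs)
```

Both values come out far above the rule-of-thumb threshold of about 10, which is the pattern the slide describes as a symptom of multicollinearity.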
Multivariate Research Methods: Regression Analysis: Review
• Detecting problems with the assumptions?
• Linear in parameters
– $Y = a + bX^2 + e$ ~ linear in parameters but non-linear in variables
– $Y = a + b^2 X_1 + bX_2 + e$ ~ non-linear in parameters: non-linear regression
• The regression model is correctly specified
– Functional form, e.g. new consumer durable sales
• Influential observations
– Outliers
– Whether one or a few observations?
Regression Analysis: Review
• Outliers: in linear regression, an outlier is an observation with a large residual. Problem with the dependent variable?
• Leverage: an observation with an extreme value on an independent variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an unusually large effect on the estimates of the regression coefficients.
• Influence: an observation is said to be influential if removing it substantially changes the estimates of the coefficients.
• Detection
– Residual check
– Standardized residual: $e_i^* = \dfrac{e_i}{s\sqrt{1 - h_i}}$
– Studentized residual: $e_i^* = \dfrac{e_i}{s_{(i)}\sqrt{1 - h_i}}$, where $s_{(i)}$ is estimated without observation i and $h_i$ is the leverage
– Rule of thumb for a problem: absolute value > 2
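A sketch of the residual check, assuming NumPy and a single explanatory variable. It uses the usual textbook definitions: leverage from the diagonal of the hat matrix, standardized residuals using the overall residual standard error, and studentized residuals using the standard error re-estimated with the observation left out. The data are hypothetical, with one response value deliberately pushed off the line.

```python
import numpy as np

def residual_diagnostics(x, y):
    """Return (leverage, standardized residuals, studentized residuals)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(hat)                      # leverage of each observation
    e = y - hat @ y                       # ordinary residuals
    s2 = (e @ e) / (n - p)                # residual variance, all observations
    standardized = e / np.sqrt(s2 * (1 - h))
    # Residual variance with observation i deleted (closed-form update)
    s2_deleted = ((e @ e) - e**2 / (1 - h)) / (n - p - 1)
    studentized = e / np.sqrt(s2_deleted * (1 - h))
    return h, standardized, studentized

# Hypothetical data: roughly y = 2x, except the fourth response (13.0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.1, 13.0, 9.9, 12.1, 13.9, 16.1])

h, standardized, studentized = residual_diagnostics(x, y)
flagged = np.where(np.abs(studentized) > 2)[0]
print("leverage:", np.round(h, 3))
print("studentized residuals:", np.round(studentized, 2))
print("flagged observations (0-based):", flagged)
```

Leverage is highest at the extremes of x, while the large studentized residual picks out the observation with the unusual response, which illustrates the difference between a high-leverage point and an outlier.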
Regression Analysis: Review
• Transformation of variables
– Dependent variable should be normally distributed, have constant variance, etc.
– e.g. GNP per capita, Log(Price), etc.
– Retransformation?
• Forecasting
– Model fit versus forecasting
– Forecasting independent variables
• Model selection / comparing models
– Adjusted R-sq
• Model validation
– Cross-validation
– Jackknife validation
Multivariate Research Methods: Regression Analysis: Limitations
• Nominal independent variables ~ dummy variable regression
– Gender, income groups, ethnicity, region, race, etc.
• Measurement error ~ structural equation models
– $X_{True} = X_{obs} + e_x$
– $Y = b_0 + b_1 X_{True} + e_Y$
– $Y = b_0 + b_1 (X_{obs} + e_x) + e_Y$
– $Y = b_0 + b_1 X_{obs} + b_1 e_x + e_Y$
– The error term $b_1 e_x + e_Y$ is correlated with the x-variable ~ this violates the regression assumption
Regression Analysis: Limitations
• Limited dependent variable
– Censored dependent variable ~ lots of zeros: Tobit regression
» Expenditures in home buying
» Demand in a supply-restricted situation
» Vacation expenditures
– Truncated dependent variable ~ duration analysis, available in LIMDEP
» Interpurchase times
» Duration of unemployment
[Figure: censored dependent variable plotted against X (e.g. income).]
Regression with Categorical Explanatory Variables
• Some modeling problems
– Is gender important in determining the level of expenditure on medical expenses?
– Do Nescafe's supermarket coffee sales vary by state?
– How would you model the impact of local crime on housing prices if the crime rate were rated none, moderate or high?
– How do I include income as a determinant of cigarette demand when data have only been collected by income class?
• Examples
– Medical expenditure = intercept + b1*Gender + b2*Age group + error
– Sales = intercept + b1*Provinces + error
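Interpretation of regression coefficients: Binary Coding
• Midterm exam scores by sex
– $Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i$, where $Y_i$ = score, $D_i = 1$ if male and $D_i = 0$ if female
– $\beta_0 = \bar{Y}_{avg,\,fem}$ (average score of female students)
– $\beta_0 + \beta_1 = \bar{Y}_{avg,\,male}$ (average score of male students)
[Figure: score against $D_i$ (0 = female, 1 = male), with levels $\beta_0$ and $\beta_0 + \beta_1$.]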
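A small sketch of binary (0/1) coding, assuming NumPy and hypothetical exam scores for three female and two male students. Regressing the score on the dummy returns the female average as the intercept and the male-minus-female difference as the slope.

```python
import numpy as np

# Hypothetical midterm scores: first three students female, last two male
scores = np.array([70.0, 80.0, 90.0, 60.0, 70.0])
is_male = np.array([0.0, 0.0, 0.0, 1.0, 1.0])   # D = 1 if male, 0 if female

design = np.column_stack([np.ones_like(scores), is_male])
(b0, b1), *_ = np.linalg.lstsq(design, scores, rcond=None)

female_mean = scores[is_male == 0].mean()
male_mean = scores[is_male == 1].mean()
print("intercept:", b0, "female average:", female_mean)
print("intercept + slope:", b0 + b1, "male average:", male_mean)
```

The slope on the dummy is therefore a direct test of whether the two group averages differ.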
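Interpretation of regression coefficients: Effect Coding
• Midterm exam scores by sex
– $Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i$, where $Y_i$ = score, $D_i = 1$ if male and $D_i = -1$ if female
– $\beta_0 = \bar{Y}_{overall\ mean}$; $\bar{Y}_{avg,\,male} = \beta_0 + \beta_1$; $\bar{Y}_{avg,\,female} = \beta_0 - \beta_1$
• Note: we are not estimating the average scores of female and male students directly; they are recovered as $\beta_0 - \beta_1$ (female) and $\beta_0 + \beta_1$ (male)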
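The same kind of data under effect (+1/-1) coding, again as a NumPy sketch on hypothetical scores. The intercept becomes the midpoint of the two group averages and the slope is each group's deviation from that midpoint. With unequal group sizes, as here, that midpoint differs from the plain average of all scores.

```python
import numpy as np

# Hypothetical midterm scores: first three students female, last two male
scores = np.array([70.0, 80.0, 90.0, 60.0, 70.0])
effect = np.array([-1.0, -1.0, -1.0, 1.0, 1.0])  # D = 1 if male, -1 if female

design = np.column_stack([np.ones_like(scores), effect])
(b0, b1), *_ = np.linalg.lstsq(design, scores, rcond=None)

female_mean = scores[effect == -1].mean()
male_mean = scores[effect == 1].mean()
print("intercept:", b0, "midpoint of group averages:", (female_mean + male_mean) / 2)
print("male average:", b0 + b1, "female average:", b0 - b1)
print("plain average of all scores:", scores.mean())
```

The group averages are not estimated directly; they are recovered by adding or subtracting the slope from the intercept.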
Regression Analysis: Non-Linear Regression
• Example: sales and price dynamics of a new product
[Figure: two panels plotted against Time: Sales (with First Purchase Sales) and Price.]
Pooled Regression: Fixed Effect and Random Effect models
• Panel data: cross-sectional time-series data
– Observations on "n" individuals (or countries, firms, etc.), each measured at T points in time (T can be different for each measuring unit)
– Observations are not independent
– Use the panel structure to get better parameter estimates
– Control for fixed or random individual differences
• Example of data setup
• Software: LIMDEP (also SAS…)
• Example: cross-sectional survey, 50% female participation in labor force?
Pooled Regression: Fixed Effect and Random Effect models
• Fixed Effect
– Individual intercepts are different, shifted by a "fixed" amount
– $y_{it} = \alpha_i + X'_{it}\beta + e_{it}$
• Random Effect
– Individual differences are random rather than fixed: random intercept terms. The intercept is a function of a mean intercept value plus a random error
– $y_{it} = \alpha + X'_{it}\beta + (e_{it} + u_i)$
– $u_i$: unobserved heterogeneity that is stable over time
– This $u_i$ is uncorrelated with the X's
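A sketch of one common way to estimate the fixed effect model with a single explanatory variable, the within (demeaning) estimator, assuming NumPy. The panel is hypothetical and noise-free, with three units whose individual effects rise together with x, so a pooled regression that ignores the units overstates the slope. The function name is illustrative.

```python
import numpy as np

def within_estimator(unit, x, y):
    """Fixed effect slope via unit-demeaning, plus the implied unit intercepts."""
    unit = np.asarray(unit)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_dm, y_dm = x.copy(), y.copy()
    for u in np.unique(unit):
        mask = unit == u
        x_dm[mask] -= x[mask].mean()   # remove each unit's own mean
        y_dm[mask] -= y[mask].mean()
    beta = (x_dm @ y_dm) / (x_dm @ x_dm)
    alphas = {
        u.item(): y[unit == u].mean() - beta * x[unit == u].mean()
        for u in np.unique(unit)
    }
    return beta, alphas

# Hypothetical panel: 3 units observed at 3 time points, y = alpha_i + 2*x
unit = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
x = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0])
alpha = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 10.0, 10.0, 10.0])
y = alpha + 2.0 * x

beta, alphas = within_estimator(unit, x, y)
pooled_slope = np.polyfit(x, y, 1)[0]   # ignores the unit structure
print("within (fixed effect) slope:", beta)
print("unit intercepts:", alphas)
print("pooled slope:", pooled_slope)
```

The within estimator recovers the common slope because demeaning removes anything that is constant within a unit, including the individual intercepts.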
Pooled Regression: Fixed Effect and Random Effect models
• The Hausman Test
– Model selection: Fixed Effect vs Random Effect
– H0: random effects are consistent and efficient, versus
– H1: random effects are inconsistent
– Chi-square test statistic
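For reference, the chi-square statistic is usually written in the following textbook form. The notation is an assumption, not taken from the slide: $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$ are the fixed effect and random effect estimates of the same k coefficients, and $\widehat{\mathrm{Var}}(\cdot)$ their estimated covariance matrices.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \widehat{\mathrm{Var}}(\hat{\beta}_{FE})
         - \widehat{\mathrm{Var}}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
    \;\sim\; \chi^2_k \quad \text{under } H_0
```

A large value of H, relative to the chi-square critical value, is evidence against H0 and therefore favours the fixed effect model.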