Contingency Tables - Stony Brook University

Review of Building Multiple
Regression Models
• Generalization of univariate linear
regression models.
• Each unit of data has one value of the dependent
variable and values of p independent variables.
Multiple Regression Model
• Yi is value of dependent variable for i-th
unit.
• The values xi1, xi2, …, xip are values of the
independent variables.
• Zi is an unobservable error:
Yi = β0 + β1 xi1 + β2 xi2 + … + βp xip + Zi.
Objectives
• Estimate the regression coefficients β0, β1,
…, βp.
• Estimate σ (crucial for tests).
• Test whether the regression coefficients β1,
…, βp are all simultaneously zero (note that
the intercept was left out).
• Test whether some of the regression
coefficients βq, …, βp are zero.
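The first three objectives can be sketched in a few lines of code. The example below is a minimal illustration, not the course's procedure: it uses only the Python standard library, the helper names solve and ols are hypothetical, and the six data points are made up for illustration (they are not the project data). It returns the OLS estimates, the estimate of σ (square root of the mean squared error on n − p − 1 degrees of freedom), and the F statistic for testing that β1, …, βp are all zero.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(xs, ys):
    """Fit Y = b0 + b1*x1 + ... + bp*xp by least squares.

    Returns (coefficients, sigma_hat, overall F statistic).
    """
    n, p = len(ys), len(xs[0])
    design = [[1.0] + list(row) for row in xs]  # leading 1 gives the intercept
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual sum of squares
    ybar = sum(ys) / n
    sst = sum((y - ybar) ** 2 for y in ys)               # total sum of squares
    mse = sse / (n - p - 1)
    f_stat = ((sst - sse) / p) / mse                     # tests beta1 = ... = betap = 0
    return beta, mse ** 0.5, f_stat


# Made-up illustrative data with p = 2 independent variables.
xs = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
ys = [2, 1, 6, -1, 4, 3]
beta, sigma_hat, f_stat = ols(xs, ys)
print(beta, sigma_hat, f_stat)
```

In practice a statistics package reports the same quantities; the sketch only shows where they come from.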
Assumptions for Multiple
Regression
• Regression function is linear.
• Error terms are independent.
• Constant error variance.
• Distribution of errors is normal.
Context of your second project
• Artificial data set, available on web site.
• Each student's data set is individual.
– If you analyze the wrong data set, no credit!
• Three dependent variables.
– Three separate sections of your report!
• Six independent variables.
• 500 data points with replicated
observations.
Check Scatterplots
• Use scatterplot matrix to get a brief
summary look.
– Menu path: Graphs, Scatterplot, Matrix.
• If Y vs xi is flat and patternless, then your
interpretation is that the regression
coefficient of xi is zero.
• Two of the dependent variables are random
samples.
Table of regression coefficients
• Contains the OLS estimates.
• The line (constant) refers to β0, the
intercept.
• There is a line for each variable in the
model that refers to βq, the partial regression
coefficient (slope) of the q-th independent
variable.
Table of regression coefficients
• Five columns of numbers
• Two are labeled “unstandardized
coefficients”
– B column contains the OLS estimates.
– Std. Error contains the estimated standard
deviation of each coefficient estimate.
Table of regression coefficients
• One is the standardized coefficient.
– Scale free coefficient often used in social
science studies for comparison across studies.
• There is a column for t.
– As usual, t = (B − 0)/SE(B).
• There is a column for sig.
– Interpret as a p-value.
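The columns of the coefficient table can be written out as formulas. This is a sketch under the usual model assumptions; the symbols n (number of units), s_{x_q} (sample standard deviation of the q-th independent variable), and s_Y (sample standard deviation of the dependent variable) do not appear in the slides and are introduced here for illustration. The sig. column is the two-sided p-value from comparing t_q with a t distribution on the stated degrees of freedom.

```latex
t_q = \frac{B_q - 0}{\mathrm{SE}(B_q)}, \qquad
\mathrm{df} = n - p - 1, \qquad
B_q^{\text{std}} = B_q \,\frac{s_{x_q}}{s_Y}
```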
Interpretation
• There appears to be an association between
an independent variable and the dependent
variable if the observed significance level is
small for that coefficient.
• For each dependent variable, state whether it
shows any association and which independent
variables are significant.
Refinement of Model
• Rerun regression using only those variables
that appear to be significant.
• Usually, the database of a study has many
variables that have no association with the
dependent variable.
• Most clients prefer that these variables not
be used.
– There are some technical problems with this
approach that are widely ignored.
Strategy of Stepwise Regression
• Let the computer do the work.
• In regression box, specify stepwise.
• At each step, the computer checks whether
another variable can be added or a previously
added variable deleted.
• There are three basic strategies: forward
selection, backward selection, and stepwise.
Using Stepwise Regression
• Examine final model selected.
• Note which variables are included.
• Examine information for excluded
variables.
– Check whether there is any possibility that one
of the variables left out might matter.
Checking the Model
• Residual plots.
• Diagnostics.
• Lack of Fit test.
Residual Plots
• Always plot unstandardized residuals
against unstandardized predicted.
• Plot unstandardized residuals against each
independent variable in model.
• If there is a time order to data, plot residuals
in time order.
Diagnostics
• Check for outliers.
• Check for influential points.
– Cook’s distance is useful. Deleting the point with
the largest Cook’s distance causes the greatest
change in the fitted coefficients.
• Box plot of residuals.
• Q-Q plot of residuals.
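For reference, one common way of writing Cook’s distance for the i-th unit is given below. The symbols are assumptions not found in the slides: e_i is the i-th residual, s² is the mean squared error of the fitted model, and h_ii is the leverage of the i-th unit (the i-th diagonal entry of the hat matrix X(X'X)⁻¹X'). The factor p + 1 counts the intercept plus the p slopes.

```latex
D_i = \frac{e_i^2}{(p+1)\,s^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}
```

A point has a large D_i when it combines a large residual with high leverage, which is why it is checked alongside the outlier plots.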
Lack of Fit Test
• Need replicated points (same settings of
independent variables with different runs
determining dependent variable).
• Your data has replicated points.
• Design your studies so that you can do a
lack of fit test.
Approximate Lack of Fit Test
• Statistics, Compare Means, One-way anova.
• Dependent variable is residuals from
regression model that you think is correct.
• Independent variable is the second column
of your data set.
• Click OK.
Interpretation of Approximate
Lack of Fit Test
• If the F statistic is near one (observed significance
level large), then the model that generated
the residuals “appears to be adequate.”
• That is, there is no empirical reason to go
on.
• If the F statistic is much larger than one (small observed
significance level), the model should be
improved.
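The approximate test can be sketched as a one-way analysis of variance of the residuals, grouped by replicate setting. The code below is a minimal illustration using only the Python standard library; the function name one_way_f, the residual values, and the group labels are made up for illustration and are not the project data. The group labels stand for whatever identifies units with the same settings of the independent variables.

```python
from collections import defaultdict


def one_way_f(values, groups):
    """One-way ANOVA F statistic for `values` split by the labels in `groups`."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    n, k = len(values), len(by_group)
    grand = sum(values) / n
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in by_group.values()
    )
    # Within-group sum of squares: spread of replicates around their own group mean.
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in by_group.values() for v in vs
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Made-up residuals from a fitted model, two replicates at each of three settings.
residuals = [1, 2, -1, -2, 1, -1]
groups = ["a", "a", "b", "b", "c", "c"]
f = one_way_f(residuals, groups)
print(f)  # 4.5 for this toy input
```

The within-group mean square estimates pure error from the replicates, while the between-group mean square picks up systematic structure the model missed. The test is only approximate because the one-way ANOVA numerator uses k − 1 degrees of freedom and does not account for the coefficients already estimated in the regression.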
Theory behind Lack of Fit Test
• One way analysis of variance.
• Covered next class.
• Happy Thanksgiving.