Strategies for identifying outliers and managing missing data.

Strategies for Identifying Outliers
and Managing Missing Data
R. Michael Haynes, PhD
Tarleton State University
A PRIORI MARCH 1, 2012
Assistant Vice President for Student Life Studies
POST HOC FEBRUARY 29, 2012
Executive Director of Institutional Research
Assistant Professor
Educational Leadership and Policy Studies
A little background...
Outlier analysis in multiple regression class
Data inspection (missing data) was a key aspect of
dissertation
Try to incorporate at the very least “a nod” to data
inspection in any assessment/project completed
Why is it important to evaluate your data set?
Can help in...
Identifying input errors
Identifying spurious data points
(an answer of “6” on a 1-5 Likert scale)
Makes your findings more sound
Good practice as recommended by
the American Psychological Association
(Wilkinson & APA Task Force on Statistical
Inference, 1999)
Desired Outcomes
Knowledge of various data inspection methods
visual
range of data set
Methods for managing missing data
listwise deletion
pairwise deletion
mean replacement
linear trend at point
Criteria for identifying outliers/spurious data points
standardized residuals/predicted values
standard deviation diagnostics
Cook’s D values
Data inspection methods
Visual
Can alert you to missing cases
Most beneficial with smaller datasets where review of individual
cases is possible
Data inspection methods
SPSS minimum/maximum values function
Quick method of inspecting range of larger data sets
Descriptive Statistics
Variable                                                           N     Minimum   Maximum   Mean   Std. Deviation
Learning Community                                                 884   0         1         .14    .343
You are taking this survey:                                        874   1         3         2.01   .139
Recoded response to high school graduation year variable HGRADYR   883   1         4         3.94   .367
Valid N (listwise)                                                 873
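Outside SPSS, the same range check can be sketched in a few lines. The following is a minimal illustration in standard-library Python, not a reproduction of the SPSS procedure; the helper names `describe` and `out_of_range` and the sample `likert` responses are hypothetical, and missing values are assumed to be coded as `None`.

```python
from statistics import mean, stdev

def describe(values):
    """Return N, minimum, maximum, mean and SD, skipping missing (None) entries."""
    valid = [v for v in values if v is not None]
    return {
        "n": len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mean(valid),
        "sd": stdev(valid),
    }

def out_of_range(values, low, high):
    """Return the positions of non-missing values that fall outside [low, high]."""
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

# Hypothetical responses on a 1-5 Likert item; None marks a missing answer.
likert = [4, 5, None, 3, 6, 2, 5, 1]
summary = describe(likert)
suspect = out_of_range(likert, 1, 5)
print(summary)
print("Positions to review:", suspect)
```

A maximum of 6 on a 1-5 item shows up immediately in the summary, and `suspect` points to the case to review.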
What to do about missing values?
SPSS options
Exclude cases listwise: Only cases with valid values for all variables are
included in the analyses.
Exclude cases pairwise: Cases with complete data for the pair of variables
being correlated are used to compute the correlation coefficient on which
the regression analysis is based. Degrees of freedom are based on the
minimum pairwise N.
Replace with mean: All cases are used for computations, with the mean
of the variable substituted for missing observations
(SPSS Inc., 233 S.Wacker Drive, Chicago, IL, 60606)
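The three options can be compared on a toy dataset. This is a rough sketch in standard-library Python rather than the SPSS implementation; the `rows` data and the function names `listwise`, `pairwise` and `mean_replace` are assumptions made for illustration, with `None` standing in for a missing value.

```python
from statistics import mean

# Hypothetical cases; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 5.0},
    {"x": 3.0, "y": 4.0, "z": None},
    {"x": 4.0, "y": 6.0, "z": 9.0},
]

def listwise(rows):
    """Keep only the cases with valid values on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep the (a, b) value pairs from cases where both variables are present."""
    return [(r[a], r[b]) for r in rows
            if r[a] is not None and r[b] is not None]

def mean_replace(rows, var):
    """Substitute the mean of the observed values for missing values of var."""
    m = mean(r[var] for r in rows if r[var] is not None)
    return [{**r, var: m if r[var] is None else r[var]} for r in rows]

print(len(listwise(rows)))            # cases surviving listwise deletion
print(len(pairwise(rows, "x", "z")))  # cases available for the x-z pair
print(mean_replace(rows, "y")[1])     # second case with y filled in
```

In this toy example listwise deletion keeps two of the four cases, while each pair of variables still has three usable cases.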
Problems with these options…
Listwise excludes all values for a case that is missing even one
variable value: it throws the baby out with the bath water!
Pairwise uses only the cases in which both values of a variable
pair are present
Can lead to distortion of findings through selection bias
(King, Honaker, Joseph, & Scheve, 1998)
More preferred options…
Choose “Transform” -> “Replace Missing Values”
Enter variables with missing values into the “New Variable(s)” box
Under “Name and Method”, select one of the following:
Series Mean
Mean of Nearby Points
Median of Nearby Points
Linear Interpolation
Linear Trend at Point
I prefer the last option, Linear Trend at Point
Linear Trend at Point
Uses the theory of regression to calculate coefficients
based upon existing values
Generates a predicted replacement value for each missing
observation on each variable
More robust than simply replacing with the mean
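To make the idea concrete, here is a minimal sketch in standard-library Python. It assumes the method regresses the observed values on a case index running from 1 to n and replaces each missing value with the value predicted at its index; the function name `linear_trend_at_point` and the sample series are hypothetical, and this is not SPSS's own code.

```python
def linear_trend_at_point(series):
    """Fill None entries with predictions from a line fitted to the observed values.

    The predictor is the position of each case in the series (1..n).
    """
    points = [(t, v) for t, v in enumerate(series, start=1) if v is not None]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return [intercept + slope * t if v is None else v
            for t, v in enumerate(series, start=1)]

# Hypothetical series with the last two observations missing.
filled = linear_trend_at_point([10.0, 12.0, 14.0, None, None])
print(filled)
```

Mean replacement would fill both gaps with 12, the mean of the observed values, whereas the fitted trend continues the upward pattern in the series.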
Identifying outliers… what is an outlier?
An unusual score in a distribution that is considered
extreme and may warrant special consideration
(Hinkle, Wiersma, & Jurs, 2003)
...a data point distinct or deviant from the rest of the data
(Pedhazur, 1997)
Why is it important to identify
potential outliers?
Can skew findings which in turn can skew
conclusions/decisions/programming
Can help identify a case in dire need of additional
programming/resources: finding that lost raft at sea!
As mentioned earlier, can assist in identifying data entry
errors
Strategies for identifying outliers
in your dataset
Standardized predicted and residual scores
Strategies for identifying outliers
in your dataset
Residuals more than 3 standard deviations away from the mean
Rule of thumb: “99% of your dataset should fall within
plus or minus 3 standard deviations from the mean” (about 99.7% if the data are normally distributed)
Casewise Diagnostics (a)
Case Number   Std. Residual   Percent Hispanic Enrollment   Predicted Value   Residual
75            -4.091          .180                          .54883            -.368829
88            -3.195          .020                          .30811            -.288109
175           -4.068          .060                          .42682            -.366818
a. Dependent Variable: Percent Hispanic Enrollment
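In the table above, each residual divided by its standardized residual gives roughly .090, which is consistent with standardizing by a single standard error of the estimate. The sketch below applies that approach to a one-predictor regression in standard-library Python and flags cases beyond 3 standard deviations. The function names and the simulated data are hypothetical, and a real analysis would typically involve several predictors.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def standardized_residuals(x, y):
    """Residuals divided by the standard error of the estimate."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Two parameters (intercept and slope) are estimated, hence n - 2.
    se = sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
    return [e / se for e in residuals]

# Hypothetical data: a clean linear pattern with one case pushed far off the line.
x = list(range(30))
y = [2 * xi + 1 + (0.5 if xi % 2 == 0 else -0.5) for xi in x]
y[10] += 10

z = standardized_residuals(x, y)
flagged = [i for i, zi in enumerate(z) if abs(zi) > 3]
print("Cases beyond 3 SD:", flagged)
```

Note that with very small samples a standardized residual cannot reach 3 at all, so this screen is only informative when there are enough cases.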
Strategies for identifying outliers
in your dataset
Cook’s D values
Measures how much each case influences the fitted regression,
reflecting both its residual and its leverage (Pedhazur, 1997)
Cook’s D values greater than 1 could be suspect
Saves values to dataset
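For readers who want to see what goes into the statistic, the sketch below computes Cook's D for a one-predictor regression using the standard formula D_i = (e_i^2 / (p * MSE)) * h_i / (1 - h_i)^2, where e_i is the residual, h_i the leverage, and p the number of estimated parameters. It is an illustration in standard-library Python under those assumptions, not the SPSS routine; the function name `cooks_distance` and the data are hypothetical.

```python
def cooks_distance(x, y):
    """Cook's D for each case in a simple (one-predictor) linear regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each case in simple regression.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverage)]

# Hypothetical data: the last case sits far from the others on x and off the trend.
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 0]
d = cooks_distance(x, y)
suspect = [i for i, di in enumerate(d) if di > 1]
print("Cases with Cook's D > 1:", suspect)
```

Only the last case exceeds the cutoff of 1 here, because it combines an extreme predictor value with a poor fit to the pattern of the other cases.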
OK, so what if some of your cases don’t pass this
three-pronged approach and it’s not a data entry error?
Discard the case?
Rejects the notion that the data “is what it is…”
“Tightens-up” the model to be more representative
of the norm
Keep it in?
Distorts the whole for a special circumstance
Depending upon your research question, could bring
attention to a group needing special consideration
Either way, can be addressed in
limitations/conclusions/need for further research
References
Hinkle, D. E., Wiersma, W., & Jurs, S. G. (2003). Applied statistics for the
behavioral sciences (5th ed.). Boston, MA: Houghton Mifflin Company.
King, G., Honaker, J., Joseph, A., & Scheve, K. (1998). Listwise deletion is
evil: What to do about missing data in political science [Electronic version].
Society for Political Methodology: American Political Science Association,
Washington University in St. Louis, St. Louis, MO. Retrieved February 2,
2009, from
http://polmeth.wustl.edu/workingpapers.php?order=dateasc&title=1998&startdate=1998-01-01&enddate=1998-12-31
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.).
South Melbourne, Australia: Wadsworth.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical
methods in psychology journals: Guidelines and explanation. American
Psychologist, 54, 594-604.