Analyzing and Interpreting Data

Outliers Detection
Wahyu Wibowo

Outliers are observations with a unique
combination of characteristics identifiable
as distinctly different from the other
observations.
Why do outliers occur?

• Procedural error, such as a data entry error or a mistake in coding.
• An extraordinary event, which accounts for the uniqueness of the observation.
• Extraordinary observations for which the researcher has no explanation.
• Observations that fall within the ordinary range of values on each variable but are unique in their combination of values across the variables.
METHODS OF DETECTING OUTLIERS
Outliers can be identified from a univariate, bivariate, or multivariate perspective.

Univariate Detection

• For small samples (80 or fewer observations), outliers are typically defined as cases with standard scores of 2.5 or greater in absolute value.
• For larger samples, increase the threshold value of standard scores up to 4.
• If standard scores are not used, identify cases falling more than 2.5 or 4 standard deviations from the mean, depending on the sample size.
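
The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the function name `zscore_outliers`, its parameter names, and the sample data are hypothetical. The sketch assumes the sample standard deviation (divisor n - 1), a two-sided test on the absolute standard score, and the upper end of the "up to 4" range for samples larger than 80.

```python
from statistics import mean, stdev

def zscore_outliers(values, small_cutoff=2.5, large_cutoff=4.0):
    """Return (index, value, z) for cases whose |z| reaches the cutoff."""
    n = len(values)
    # Rule of thumb from the slides: 2.5 for 80 or fewer cases, up to 4 otherwise.
    cutoff = small_cutoff if n <= 80 else large_cutoff
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumption: divisor n - 1)
    flagged = []
    for i, x in enumerate(values):
        z = (x - m) / s
        if abs(z) >= cutoff:
            flagged.append((i, x, z))
    return flagged

# Hypothetical data: eleven ordinary values and one extreme value.
data = [12, 14, 13, 15, 14, 13, 12, 16, 15, 14, 13, 48]
flagged = zscore_outliers(data)
print(flagged)
```

With this made-up sample only the value 48 is flagged. Note that an extreme case inflates the standard deviation it is measured against, so in very small samples standard scores cannot grow large no matter how extreme the case is.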

Graphical Method: Boxplot

Boxplots provide a schematic graphical summary of important features of a distribution, including:
o the center
o the spread of the middle of the data (interquartile range)
o the behavior of the tails
o outliers
Elements of a boxplot
• Several criteria are available for identifying an outlier.
• A useful criterion is that of Tukey (1977), under which an observation is an outlier if it is
o larger than Q3 + d, or
o smaller than Q1 - d,
where d = 1.5(Q3 - Q1) and Q1 and Q3 are the first and third quartiles.
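
As a rough sketch of the Tukey criterion, the fences Q1 - d and Q3 + d can be computed with the Python standard library. The function names `tukey_fences` and `tukey_outliers` and the sample data are hypothetical. Quartile definitions differ between software packages; the sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, so other tools may give slightly different fences for the same data.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the lower and upper fences (Q1 - d, Q3 + d) with d = k * (Q3 - Q1)."""
    # Assumption: quartiles by linear interpolation over the sorted data.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    d = k * (q3 - q1)
    return q1 - d, q3 + d

def tukey_outliers(values, k=1.5):
    """Return the values lying strictly outside the fences."""
    low, high = tukey_fences(values, k)
    return [x for x in values if x < low or x > high]

# Hypothetical data: Q1 = 13 and Q3 = 15, so d = 1.5 * 2 = 3.
data = [12, 14, 13, 15, 14, 13, 12, 16, 15, 14, 13, 48]
print(tukey_fences(data))    # (10.0, 18.0)
print(tukey_outliers(data))  # [48]
```

These fences are the points beyond which a boxplot conventionally draws individual cases rather than whiskers.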

Bivariate Detection
• Pairs of variables can be assessed jointly through a scatterplot.
• Cases that fall markedly outside the range of the other observations appear as isolated points in the scatterplot.
• To help judge the expected range of observations, an ellipse representing a bivariate normal distribution's confidence region (typically set at the 90% or 95% level) is superimposed on the scatterplot.
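
The ellipse check can also be done numerically instead of visually. The sketch below is one possible way to do it, not the procedure of any particular package: a point lies outside the ellipse at a given level when its squared distance from the centroid, scaled by the inverse of the 2x2 sample covariance matrix, exceeds the chi-square quantile with 2 degrees of freedom, which has the closed form -2 ln(1 - level). The function name `outside_ellipse` and the data are hypothetical, and the sketch assumes the two variables are approximately bivariate normal.

```python
from math import log
from statistics import mean

def outside_ellipse(xs, ys, level=0.95):
    """Return indices of points outside the bivariate normal ellipse."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Chi-square quantile with 2 degrees of freedom (closed form).
    cutoff = -2.0 * log(1.0 - level)
    flagged = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        dx, dy = x - mx, y - my
        # Squared distance using the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > cutoff:
            flagged.append(i)
    return flagged

# Hypothetical data: nine cases along a rising trend, plus the case (2, 9),
# which is ordinary on each variable alone but unusual in combination.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2]
ys = [1, 3, 2, 5, 4, 7, 6, 9, 8, 9]
print(outside_ellipse(xs, ys))  # [9]
```

The flagged case has x and y values that both fall inside the observed range of each variable, so a univariate screen would not catch it.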

Multivariate Detection
• When more than two variables are involved, examining every pair of variables in scatterplots becomes impractical.
• This issue is addressed by the Mahalanobis D² measure, a multivariate assessment of each observation across a set of variables.
• Higher D² values represent observations farther removed from the general distribution of observations in this multidimensional space.
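
A minimal sketch of the D² computation with NumPy follows, assuming the data are held in an array with one row per observation and one column per metric variable. The function name `mahalanobis_d2` and the simulated data are hypothetical; the slides do not state a cutoff for D², so the sketch only ranks observations and leaves the choice of threshold open.

```python
import numpy as np

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each row of X from the column means."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # sample covariance matrix (divisor n - 1)
    # Solve cov @ s = diff.T rather than inverting cov explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    # Row-wise dot product gives diff_i' * cov^-1 * diff_i for each case i.
    return np.einsum("ij,ij->i", diff, solved)

# Simulated data (assumption): 50 cases on 3 variables with one planted outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[7] = [6.0, -6.0, 6.0]

d2 = mahalanobis_d2(X)
order = np.argsort(d2)[::-1]  # indices from the largest D² to the smallest
print(order[:5])
print(d2[order[:5]])
```

The planted case at row 7 comes out with the largest D². A useful sanity check is that the D² values sum to (n - 1) times the number of variables when the sample covariance matrix is used.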

RETENTION OR DELETION OF THE OUTLIER
• If outliers portray a representative element or segment of the population, they should be retained to ensure generalizability to the entire population.
• As outliers are deleted, the researcher runs the risk of improving the multivariate analysis but limiting its generalizability.
Exercise
• Open the SPSS sample data file “customer_dbase.sav”.
• Identify the outliers in the metric variables.

Reference:
Hair, J. F., Jr., Black, W. C., Babin, B. J., & Anderson, R. E. Multivariate Data Analysis (7th ed.). Pearson Education Limited.