WINSORIZING
Kyle Allen & Matthew Whitledge
May 7, 2013
What is it and why could it be inappropriate?
WHAT IS WINSORIZING?
• What it isn’t…
  • Trimming
  • Truncating
  • Any other method that completely removes observations from the data
• Term first used in 1960
  • John W. Tukey; W. J. Dixon
• “Numerical value of a wild observation is untrustworthy”
  • However, its direction of deviation is important
• Winsorizing decreases the magnitude of the deviation while retaining its direction
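The contrast with trimming can be written as a formula. This is a sketch: the cutoffs a and b are symbols introduced here for illustration (for example, the sample values at the chosen lower and upper percentiles). Winsorizing maps each observation x to

```latex
w(x) =
\begin{cases}
a, & x < a \\
x, & a \le x \le b \\
b, & x > b
\end{cases}
```

Every observation stays in the sample, and a value beyond a cutoff stays on the same side of the data. Trimming would drop it instead.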
AN EXAMPLE
• Order the observations by value
  • Xi1, Xi2, …, Xi100, where i denotes the ith regressor
• If Winsorizing at 1% and 99%, then
  • The value for Xi1 will be replaced by the value for Xi2
  • The value for Xi100 will be replaced by the value for Xi99
Another example:
• Xi1, Xi2, …, Xi100
• Winsorize at 10% (5% from the bottom and 5% from the top)
• Beginning sample:
  • Xi1, Xi2, Xi3, Xi4, Xi5, Xi6, …, Xi95, Xi96, Xi97, Xi98, Xi99, Xi100
• Winsorized sample:
  • Xi5, Xi5, Xi5, Xi5, Xi5, Xi6, …, Xi95, Xi96, Xi96, Xi96, Xi96, Xi96
Winsorized at 5% and 95%

Obs.       Original   Winsorized
Xi1        0.2        6.3
Xi2        0.9        6.3
Xi3        3.5        6.3
Xi4        4.8        6.3
Xi5        6.3        6.3
Xi6        7          7
Xi7        7.1        7.1
Xi8        7.2        7.2
Xi9-Xi92   …          …
Xi93       82         82
Xi94       83.2       83.2
Xi95       83.5       83.5
Xi96       98         98
Xi97       112        98
Xi98       114        98
Xi99       3150       98
Xi100      6572       98
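The replacement shown in the table can be sketched in SAS. This is a minimal illustration, not the authors' code: the data set work.sample, the variable x, and the output names work.cutoffs, work.sample_win, and x_win are all hypothetical. The cutoffs come from PROC UNIVARIATE, whose percentile definition (the PCTLDEF= option) determines the exact values, so they need not coincide with the Xi5 and Xi96 values used above.

```sas
/* Hypothetical input: work.sample with one numeric variable x. */

/* Step 1: compute the 5th and 95th percentiles of x.
   PCTLPRE=p names the output variables p5 and p95. */
proc univariate data=work.sample noprint;
  var x;
  output out=work.cutoffs pctlpts=5 95 pctlpre=p;
run;

/* Step 2: pull values outside the cutoffs back to the cutoffs.
   Reading work.cutoffs once keeps p5 and p95 on every row. */
data work.sample_win;
  if _n_ = 1 then set work.cutoffs;
  set work.sample;
  x_win = x;                                  /* leave the original untouched */
  if not missing(x) then x_win = min(max(x, p5), p95);
  drop p5 p95;
run;
```

Keeping x alongside x_win makes it easy to plot or model both versions and see what the Winsorizing changed.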
ALTERNATIVES
• Are the observations really outliers?
  • Look at Cook’s D measure
• Transform the variables
  • Take the log or square root of the variable
  • This shouldn’t be done only to increase significance
• Median-based estimations
  • Quantile regression
  • Median absolute deviation
• Nonparametric methods
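Two of these alternatives can be sketched in SAS using the data set and variables from the SAS example in this deck (isqsdata.liles, dayslost, alr). The output names work.diag, work.influential, and cooksd are hypothetical, and the 4/n cutoff is only a common rule of thumb, not something the slides prescribe. The sketch also fits ordinary least squares for the diagnostic, which is a simplification of the censored model used later.

```sas
/* Cook's D for each observation from a least squares fit. */
proc reg data=isqsdata.liles noprint;
  model dayslost = alr;
  output out=work.diag cookd=cooksd;
run;
quit;

/* List observations whose Cook's D exceeds 4/n
   (rule-of-thumb threshold, assumed here for illustration). */
data work.influential;
  set work.diag nobs=n;
  if cooksd > 4 / n;
run;

/* Median regression: models the 0.5 quantile of dayslost,
   which is less sensitive to extreme outcomes than the mean. */
proc quantreg data=isqsdata.liles;
  model dayslost = alr / quantile=0.5;
run;
```

Inspecting work.influential first answers the question of whether the extreme points actually drive the fit before any value is altered.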
A SAS EXAMPLE
Lift Index Data
• Workers perform lifting tasks
• Each lift has an amount of stress associated with it
• Measuring the number of days an employee missed based on the lift they were performing
• 206 observations
SAS CODE
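/* Overlay the original (dayslost) and Winsorized (dayslost1) outcomes against alr. */
proc sgplot data=isqsdata.lilesmerge;
  scatter y=dayslost x=alr;
  scatter y=dayslost1 x=alr;
run;

/* Winsorize by hand: cap the two extreme subjects at 27 days lost.
   Assumes lileswin already carries dayslost1 as a copy of dayslost,
   since the second model below reads dayslost1 from this data set. */
data isqsdata.lileswin;
  set isqsdata.lileswin;
  if subject in (6, 35) then dayslost1 = 27;
run;

/* Censored model on the original data, lower bound 0. */
proc qlim data=isqsdata.liles;
  model dayslost = alr;
  endogenous dayslost ~ censored(lb=0);
run;

/* Same model on the Winsorized outcome. */
proc qlim data=isqsdata.lileswin;
  model dayslost1 = alr;
  endogenous dayslost1 ~ censored(lb=0);
run;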
LOOK AT YOUR DATA
PROC QLIM (NON-WINSORIZED)
PROC QLIM (WINSORIZED)
IMPLICATIONS
• May impact significance
  • The standard errors will typically decrease
  • Depending on how symmetrical the data is, the mean may increase or decrease
    • For example, if there is an extreme positive outlier, Winsorizing it will decrease the mean
  • The significance will be determined by the proportionate change in the estimated coefficient, relative to the change in the standard error
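The last point can be written out. This is a sketch that assumes significance is judged by the usual ratio of a coefficient to its standard error and that the original estimate is nonzero; the subscript w (for the Winsorized fit) is notation introduced here.

```latex
t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})},
\qquad
t_w = \frac{\hat{\beta}_w}{\mathrm{SE}(\hat{\beta}_w)}
    = t \cdot \frac{\hat{\beta}_w / \hat{\beta}}
                   {\mathrm{SE}(\hat{\beta}_w) / \mathrm{SE}(\hat{\beta})}
```

So the test statistic grows in magnitude only when the standard error shrinks by a larger proportion than the coefficient does. If both shrink by the same proportion, the significance is unchanged.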
WHY COULD IT BE INAPPROPRIATE?
• May be appropriate for
  • Ratios
    • Book to Market
    • Other measures in which the denominator can be extremely small
• Never Winsorize valid observations
  • Investment returns
  • R&D expenditures
• Truly exceptional observations
  • Large number of biological elements
  • Extremely low stress tolerances for mechanical implements
• Model should produce data we could actually see
BIBLIOGRAPHY
• Brillinger, David R. “John W. Tukey: His Life and Professional Contributions.” The Annals of Statistics. 30(2002): 1535-75.
• Dixon, W. J. “Simplified Estimation from Censored Normal Samples.” The Annals of Mathematical Statistics. 31(1960): 385-91.
• Kafadar, Karen. “John Tukey and Robustness.” Proceedings of the Annual Meeting of the American Statistical Association. 2001.
• Kruskal, William, Thomas Ferguson, John W. Tukey, E. J. Gumbel, and F. J. Anscombe. “Discussion of the Papers of Messrs. Anscombe and Daniel.” Technometrics. 2(1960): 157-66.
• Tukey, John W. and Donald H. McLaughlin. “Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1.” Sankhyā: The Indian Journal of Statistics, Series A. 25(1963): 331-52.
• Westfall, Peter H. and Kevin S. S. Henning. Understanding Advanced Statistical Methods. Boca Raton, FL: CRC Press, 2013.