WINSORIZING
Kyle Allen & Matthew Whitledge
May 7, 2013
What is it and why could it be inappropriate?
WHAT IS WINSORIZING?
What it isn’t…
Trimming
Truncating
Any other method that completely removes observations from the data
Term first used in 1960
John W. Tukey; W. J. Dixon
“Numerical value of a wild observation is untrustworthy”
However, its direction of deviation is important
Winsorizing decreases the magnitude of the deviation while retaining its direction
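The contrast between trimming and Winsorizing can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function names `trim` and `winsorize`, the count `k`, and the sample values are all hypothetical.

```python
def trim(values, k):
    """Drop the k smallest and k largest observations."""
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]


def winsorize(values, k):
    """Replace the k smallest and k largest observations with the
    nearest retained values, keeping the sample size unchanged."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in ordered]


# Hypothetical sample with one wild value in each tail.
sample = [-40, 2, 3, 5, 7, 8, 9, 11, 12, 95]

trimmed = trim(sample, 1)
winsorized = winsorize(sample, 1)

print(trimmed)      # [2, 3, 5, 7, 8, 9, 11, 12]
print(winsorized)   # [2, 2, 3, 5, 7, 8, 9, 11, 12, 12]
```

Trimming leaves 8 observations. Winsorizing keeps all 10: the wild values are pulled in toward the rest of the sample but stay on the same side of it.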
AN EXAMPLE
Order the observations by value
Xi1, Xi2, …, Xi100, where i denotes the ith regressor
If Winsorizing at 1% and 99%, then
The value for Xi1 will be replaced by the value for Xi2
The value for Xi100 will be replaced by the value for Xi99
Another example:
Xi1, Xi2, …, Xi100
Winsorize at 10% (5% from the bottom and 5% from the top)
Beginning Sample:
Xi1, Xi2, Xi3, Xi4, Xi5, Xi6, …, Xi95, Xi96, Xi97, Xi98, Xi99, Xi100
Winsorized Sample:
Xi5, Xi5, Xi5, Xi5, Xi5, Xi6, …, Xi95, Xi96, Xi96, Xi96, Xi96, Xi96
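Both examples above amount to clamping the ordered sample at two chosen order statistics. The Python sketch below assumes that reading; the function name `winsorize_by_rank` and the stand-in data (the integers 1 to 100, so each value equals its rank) are hypothetical. Percentile conventions differ between software packages, so the ranks are passed explicitly here rather than derived from a percentage.

```python
def winsorize_by_rank(values, lower_rank, upper_rank):
    """Clamp values to the order statistics at the given 1-based ranks.

    Observations below the lower_rank-th smallest value are raised to it;
    observations above the upper_rank-th smallest value are lowered to it.
    """
    ordered = sorted(values)
    low = ordered[lower_rank - 1]
    high = ordered[upper_rank - 1]
    return [min(max(v, low), high) for v in ordered]


# Hypothetical stand-in data: the values 1..100, so each value equals its rank.
x = list(range(1, 101))

first = winsorize_by_rank(x, 2, 99)    # first example: Xi1 -> Xi2, Xi100 -> Xi99
second = winsorize_by_rank(x, 5, 96)   # second example: bounds at Xi5 and Xi96

print(first[:3], first[-3:])     # [2, 2, 3] [98, 99, 99]
print(second[:6], second[-6:])   # [5, 5, 5, 5, 5, 6] [95, 96, 96, 96, 96, 96]
```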
Winsorized at 5% and 95%
Obs.        Original    Winsorized
Xi1         0.2         6.3
Xi2         0.9         6.3
Xi3         3.5         6.3
Xi4         4.8         6.3
Xi5         6.3         6.3
Xi6         7           7
Xi7         7.1         7.1
Xi8         7.2         7.2
Xi9-Xi92    …           …
Xi93        82          82
Xi94        83.2        83.2
Xi95        83.5        83.5
Xi96        98          98
Xi97        112         98
Xi98        114         98
Xi99        3150        98
Xi100       6572        98
ALTERNATIVES
Are the observations really outliers?
Look at Cook’s D measure
Transform the variables
Take the log or square root of the variable
This shouldn’t be done only to increase significance
Median-based estimation
Quantile regression
Median absolute deviation
Nonparametric methods
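As a rough illustration of the median-based alternative, the Python sketch below computes the median absolute deviation (MAD) and uses it to flag observations far from the median. The function names, the sample values, and the 3.5 cutoff are assumptions made for the example; the 1.4826 factor is the usual scaling that makes MAD comparable to a standard deviation under normality.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the sample median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def flag_outliers(values, threshold=3.5):
    """Return the values whose MAD-scaled distance from the median
    exceeds the threshold (3.5 is an assumed rule-of-thumb cutoff)."""
    center = statistics.median(values)
    mad = median_absolute_deviation(values)
    if mad == 0:
        return []  # no spread around the median, nothing to scale by
    scale = 1.4826 * mad
    return [v for v in values if abs(v - center) / scale > threshold]


# Hypothetical sample with one extreme value.
sample = [4, 5, 5, 6, 7, 7, 8, 9, 120]

print(median_absolute_deviation(sample))  # 2
print(flag_outliers(sample))              # [120]
```

Because the median and MAD barely move when an extreme value is added, the flag does not depend on the outlier inflating its own yardstick, which is the weakness of mean and standard deviation rules.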
A SAS EXAMPLE
Lift Index Data
Workers perform lifting tasks
Each lift has an amount of stress associated with it
Measuring the number of days an employee missed based on the lift they were performing
206 observations
SAS CODE
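/* Plot original (dayslost) and Winsorized (dayslost1) days lost against alr */
proc sgplot data=isqsdata.lilesmerge;
    scatter y=dayslost x=alr;
    scatter y=dayslost1 x=alr;
run;

/* Winsorize the two extreme observations by capping days lost at 27.
   Assumption: lileswin is built from liles, and dayslost1 holds the
   Winsorized values used in the plot and in the second model. */
data isqsdata.lileswin;
    set isqsdata.liles;
    dayslost1 = dayslost;
    if subject = 6 then dayslost1 = 27;
    if subject = 35 then dayslost1 = 27;
run;

/* Censored regression on the original data, lower bound 0 */
proc qlim data=isqsdata.liles;
    model dayslost = alr;
    endogenous dayslost ~ censored(lb=0);
run;

/* Same model on the Winsorized data */
proc qlim data=isqsdata.lileswin;
    model dayslost1 = alr;
    endogenous dayslost1 ~ censored(lb=0);
run;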
LOOK AT YOUR DATA
PROC QLIM (NON-WINSORIZED)
PROC QLIM (WINSORIZED)
IMPLICATIONS
May impact significance
The standard errors will generally decrease
Depending on how symmetrical the data are, the mean may increase or decrease
For example, if there is an extreme positive outlier, Winsorizing will decrease the mean
Significance will be determined by the proportionate change in the estimated coefficient relative to the change in the standard error
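A small numerical sketch of these points, in Python. The data are hypothetical, and the ratio of the sample mean to its standard error is used as a simple stand-in for a coefficient's t-ratio; it is not the censored regression from the SAS example.

```python
import math
import statistics


def mean_and_se(values):
    """Return the sample mean and its standard error (s / sqrt(n))."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Hypothetical sample with one extreme positive outlier.
original = [2, 3, 5, 7, 8, 9, 11, 12, 13, 230]

# Winsorize one observation in each tail: 2 -> 3 and 230 -> 13.
ordered = sorted(original)
low, high = ordered[1], ordered[-2]
winsorized = [min(max(v, low), high) for v in ordered]

o_mean, o_se = mean_and_se(original)
w_mean, w_se = mean_and_se(winsorized)

print(f"original:   mean={o_mean:.2f}  se={o_se:.2f}  mean/se={o_mean / o_se:.2f}")
print(f"winsorized: mean={w_mean:.2f}  se={w_se:.2f}  mean/se={w_mean / w_se:.2f}")
```

In this sample the mean falls from 30 to 8.4, but the standard error falls proportionately more, so the ratio rises. With other data the balance can go the other way.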
WHY COULD IT BE INAPPROPRIATE?
May be appropriate for
Ratios
Book to Market
Other measures in which the denominator can be extremely small
Never Winsorize valid observations
Investment Returns
R&D expenditures
Truly exceptional observations
Large number of biological elements
Extremely low stress tolerances for mechanical implements
Model should produce data we could actually see
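The ratio case listed above can be illustrated with a short Python sketch. The book and market values are hypothetical; the point is only that a denominator near zero produces an extreme ratio even when the underlying quantities are unremarkable.

```python
import statistics

# Hypothetical book and market values; the third firm's market value is near zero.
book = [50.0, 48.0, 52.0, 51.0, 49.0]
market = [100.0, 96.0, 0.01, 102.0, 98.0]

ratios = [b / m for b, m in zip(book, market)]

print(ratios)
print("median:", statistics.median(ratios))
print("mean:  ", statistics.mean(ratios))
```

Four of the five ratios equal 0.5, yet the mean is dominated by the single ratio with the tiny denominator. That value is an artifact of how the ratio is constructed, unlike an investment return or R&D figure that is extreme because the firm really is.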
BIBLIOGRAPHY
Brillinger, David R. “John W. Tukey: His Life and Professional
Contributions.” The Annals of Statistics. 30(2002): 1535-75.
Dixon, W. J. “Simplified Estimation from Censored Normal Samples.”
The Annals of Mathematical Statistics. 31(1960): 385-91.
Kafadar, Karen. “John Tukey and Robustness.” Proceedings of the
Annual Meeting of the American Statistical Association. 2001.
Kruskal, William, Thomas Ferguson, John W. Tukey, E. J. Gumbel, and
F. J. Anscombe. “Discussion of the Papers of Messrs. Anscombe and
Daniel.” Technometrics. 2(1960): 157-66.
Tukey, John W. and Donald H. McLaughlin. “Less Vulnerable
Confidence and Significance Procedures for Location Based on a
Single Sample: Trimming/Winsorization 1.” Sankhyā: The Indian Journal of
Statistics. 25(1963): 331-52.
Westfall, Peter H. and Kevin S. S. Henning. Understanding Advanced
Statistical Methods. Boca Raton, FL: CRC Publishing, 2013.