Standardization of variables

Maarten Buis
5-12-2005
1
Recap
• Central tendency
• Dispersion
• SPSS
2
Standardization
• Is used to improve interpretability of
variables.
• Some variables have a natural interpretable
metric: e.g. income, age, gender, country.
• Others, primarily ordinal variables, do not:
e.g. education, attitude items, intelligence.
• Standardizing these variables makes them
more interpretable.
3
Standardization
• Transforming the variable to a comparable metric
– known unit
– known mean
– known standard deviation
– known range
• Three ways of standardizing:
– P-standardization (percentile scores)
– Z-standardization (z-scores)
– D-standardization (dichotomize a variable)
4
When you should always
standardize
• When averaging multiple variables, e.g.
when creating a socioeconomic status
variable out of income and education.
• When comparing the effects of variables
with unequal units, e.g. does age or
education have a larger effect on income?
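The averaging case can be sketched in a few lines of Python. This is only an illustration: the data are hypothetical, the helper name z_scores is made up for this example, and it uses z-scores (introduced later in these slides) with the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (x - M) / SD for every value, using the population SD."""
    m = mean(values)
    sd = pstdev(values)  # assumption: divide by n, not n - 1
    return [(x - m) / sd for x in values]

# Hypothetical data for five respondents (not from the course dataset).
income = [1200, 1800, 2500, 3100, 5200]   # guilders per month
education = [8, 10, 12, 12, 16]           # years of schooling

# Averaging the raw values would let income dominate, because its
# numbers are hundreds of times larger than years of education.
z_income = z_scores(income)
z_education = z_scores(education)

# Hypothetical SES score: the mean of the two standardized variables.
ses = [(zi + ze) / 2 for zi, ze in zip(z_income, z_education)]
```

After standardizing, both variables contribute on the same scale (standard deviations), so neither dominates the average.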
5
P-Standardization
• Every observation is assigned a number
between 0 and 100, indicating the
percentage of observations at or below it.
• Can be read from the cumulative
distribution
• In case of ties: assign midpoints
• The median, quartiles, quintiles, and deciles
are special cases of P-scores.
6
room   rent   cum %    percentile
   1    175    5.3%      5.3%
   2    180   10.5%     10.5%
   3    185   15.8%     15.8%
   4    190   21.1%     21.1%
   5    200   26.3%     26.3%
   6    210   31.6%     36.8%
   7    210   36.8%     36.8%
   8    210   42.1%     36.8%
   9    230   47.4%     47.4%
  10    240   52.6%     55.3%
  11    240   57.9%     55.3%
  12    250   63.2%     65.8%
  13    250   68.4%     65.8%
  14    280   73.7%     73.7%
  15    300   78.9%     81.6%
  16    300   84.2%     81.6%
  17    310   89.5%     89.5%
  18    325   94.7%     94.7%
  19    620  100.0%    100.0%
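The percentile column of the room-rent example can be reproduced with a short Python sketch. The function name percentile_scores is made up for this illustration. It takes the cumulative percentage (rank divided by the number of observations) and gives tied observations the average of their cumulative percentages, which is how the midpoints in the example appear to be formed.

```python
from collections import defaultdict

rents = [175, 180, 185, 190, 200, 210, 210, 210, 230, 240,
         240, 250, 250, 280, 300, 300, 310, 325, 620]

def percentile_scores(values):
    """Cumulative percentage per value, with ties set to their midpoint."""
    n = len(values)
    ordered = sorted(values)
    # Collect the cumulative percentages (rank / n * 100) of each distinct value.
    cum_by_value = defaultdict(list)
    for rank, value in enumerate(ordered, start=1):
        cum_by_value[value].append(100 * rank / n)
    # Tied observations all receive the average of their cumulative percentages.
    midpoint = {v: sum(c) / len(c) for v, c in cum_by_value.items()}
    return [midpoint[v] for v in values]

for rent, p in zip(rents, percentile_scores(rents)):
    print(f"{rent:>4}  {p:5.1f}%")
```

The three rooms at 210 have cumulative percentages 31.6, 36.8 and 42.1, so all three receive the midpoint 36.8.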
7
P-standardization
• Turns the variable into a ranking, i.e. it turns the
variable into an ordinal variable.
• It is a non-linear transformation: relative
distances change
• Results in a fixed mean, range, and standard
deviation: M=50, SD=28.6. These can change
slightly due to ties
• A histogram of a P-standardized variable
approximates a uniform distribution
8
Linear transformation
• Say you want income in thousands of
guilders instead of guilders.
• You divide INCMID by ƒ1000,-

              M         SD
Incmid        ƒ2543,-   ƒ1481,-
Incmid/1000   kƒ2.543   kƒ1.481
9
Linear transformation
• Say you want to know the deviation from
the mean
• Subtract the mean (ƒ2543,-) from INCMID

           M         SD
Incmid     ƒ2543,-   ƒ1481,-
Incmid-M   ƒ0,-      ƒ1481,-
10
Recap: multiplication and
addition and the number line
11
Linear transformation
• Adding a constant (X’ = X+c)
– M(X’) = M(X)+c
– SD(X’) = SD(X)
• Multiplying by a constant (X’ = X*c)
– M(X’) = M(X)*c
– SD(X’) = SD(X) * |c|
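These rules are easy to check numerically. The sketch below uses hypothetical incomes (not the INCMID data) and the population standard deviation, which is an assumption; the helper name describe is made up for this example.

```python
from statistics import mean, pstdev

# Hypothetical monthly incomes in guilders (not the INCMID data).
x = [1000, 2000, 2500, 3000, 4500]

def describe(values):
    """Return (mean, population standard deviation)."""
    return mean(values), pstdev(values)

m, sd = describe(x)

# Adding a constant c shifts the mean by c and leaves the SD alone.
c = -500
m_add, sd_add = describe([v + c for v in x])

# Multiplying by a constant k scales the mean by k and the SD by |k|.
k = -0.001
m_mul, sd_mul = describe([v * k for v in x])

print(m, sd)
print(m_add, sd_add)
print(m_mul, sd_mul)
```

The negative multiplier is chosen deliberately: the mean changes sign, but the standard deviation stays positive, which is why the rule uses |c|.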
12
Z-standardization
• Z = (X-M)/SD
• two steps:
– center the variable (mean becomes zero)
– divide by the standard deviation (the unit becomes
standard deviation)
• Results in fixed mean and standard deviation:
M=0, SD=1
• Not in a fixed range!
• Z-standardization is a linear transformation:
relative distances remain intact.
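The two steps can be sketched in Python using the room rents from the P-standardization example. The function name z_standardize is made up for this illustration, and the population standard deviation (dividing by n) is an assumption; a textbook that divides by n-1 will give slightly different z-scores.

```python
from statistics import mean, pstdev

rents = [175, 180, 185, 190, 200, 210, 210, 210, 230, 240,
         240, 250, 250, 280, 300, 300, 310, 325, 620]

def z_standardize(values):
    """Return z-scores: Z = (X - M) / SD."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population SD (divide by n)
    centered = [x - m for x in values]   # step 1: mean becomes zero
    return [d / sd for d in centered]    # step 2: unit becomes one SD

z = z_standardize(rents)
print(round(mean(z), 6), round(pstdev(z), 6))  # mean and SD of the z-scores
print(round(min(z), 2), round(max(z), 2))      # the range is not fixed
```

The mean and standard deviation come out as 0 and 1 by construction, but the minimum and maximum depend on the data: the room at 620 ends up more than three standard deviations above the mean.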
13
Z-standardization
• Step 1: subtract the mean
• c = -M(X)
• M(X’) = M(X)+c
• M(X’) = M(X)-M(X) = 0
• SD(X’) = SD(X)
14
Z-standardization
• Step 2: divide by the standard deviation
• c = 1/SD(X)
• M(Z) = M(X’) * c
• M(Z) = 0 * 1/SD(X) = 0
• SD(Z) = SD(X’) * c
• SD(Z) = SD(X) * 1/SD(X) = 1
15
Normal distribution
• Normal distribution = Gauss curve = Bell
curve
• Formula (McCall p. 120)
– Note the (x-m)² part
– apart from that all you have to remember is that
the formula is complicated
• Normal distribution occurs when a large
number of small random events cause the
outcome: e.g. measurement error
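For reference, the usual form of the normal density is given below, with m for the mean as on the slide. The symbol s for the standard deviation is an assumption of this sketch; McCall's notation may differ.

```latex
f(x) = \frac{1}{s\sqrt{2\pi}} \exp\!\left(-\frac{(x-m)^2}{2s^2}\right)
```

The (x-m)² term is what makes the curve symmetric around the mean: observations equally far above and below m get the same density.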
16
Normal distribution
• Other examples: the height of individuals,
intelligence, attitudes
• But: the variables Education, Income and
age in Eenzaam98 are not normally
distributed
17
Z-scores and the normal
distribution
• Z-standardization will not result in a normally
distributed variable
• Standardization is NOT the same as normalization
• We will not discuss normalization (but it does
exist)
• But: If the original distribution is normally
distributed, then the z-standardized variable will
have a standard normal distribution.
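A small Python sketch illustrates that z-standardization keeps the shape of the distribution. It uses the room rents from the P-standardization example, which are skewed by the room at 620. The helper names and the simple skewness measure (the average cubed z-score, with population SD) are choices made for this illustration.

```python
from statistics import mean, pstdev

rents = [175, 180, 185, 190, 200, 210, 210, 210, 230, 240,
         240, 250, 250, 280, 300, 300, 310, 325, 620]

def z_standardize(values):
    """Return z-scores using the population SD (an assumption)."""
    m, sd = mean(values), pstdev(values)
    return [(x - m) / sd for x in values]

def skewness(values):
    """Average cubed z-score: 0 for a perfectly symmetric distribution."""
    return mean([z ** 3 for z in z_standardize(values)])

z = z_standardize(rents)
skew_before = skewness(rents)
skew_after = skewness(z)
print(round(skew_before, 3), round(skew_after, 3))
```

Both numbers are the same and clearly positive: the z-scores are exactly as skewed as the original rents, so they are not normally distributed.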
18
Standard normal distribution
• Normal distribution with M=0 and SD=1.
• Table A in Appendix 2 of McCall
• Important numbers (to be remembered):
– 68% of the observations lie between ±1 SD
– 90% of the observations lie between ±1.64 SD
– 95% of the observations lie between ±1.96 SD
– 99% of the observations lie between ±2.58 SD
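These four numbers can be checked without the table, using NormalDist from Python's standard library (Python 3.8 or later). The helper name share_within is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within ±k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.64, 1.96, 2.58):
    print(f"±{k} SD: {100 * share_within(k):.1f}%")
```

The printed percentages round to the 68%, 90%, 95% and 99% listed above.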
19
Why bother?
• If you know:
– That a variable is normally distributed
– the mean and standard deviation
• Then you know the percentage of
observations above or below any given
observation
• These numbers are a good approximation,
even if the variable is not exactly normally
distributed
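As an illustration, take a hypothetical test score that is normally distributed with mean 100 and standard deviation 15 (these values are made up for the example). Knowing only those two numbers, the share of observations below a score of 115 follows directly:

```python
from statistics import NormalDist

# Hypothetical test score: mean 100, standard deviation 15.
scores = NormalDist(mu=100, sigma=15)

observation = 115
z = (observation - scores.mean) / scores.stdev  # distance from the mean in SDs
pct_below = 100 * scores.cdf(observation)       # percentage scoring lower
pct_above = 100 - pct_below                     # percentage scoring higher

print(z, round(pct_below, 1), round(pct_above, 1))
```

A score of 115 is one standard deviation above the mean, so about 84% of observations lie below it and about 16% above it.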
20
P & Z standardization
• Both give a distribution with fixed mean,
standard deviation, and unit
• P-standardization also gives a fixed range
• Both are relative to the sample: if you take
observations out, then you have to recompute the standardized variables
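The dependence on the sample can be seen by dropping one observation from the room-rent example and recomputing. This is a sketch with a made-up helper name, using the population SD (an assumption).

```python
from statistics import mean, pstdev

rents = [175, 180, 185, 190, 200, 210, 210, 210, 230, 240,
         240, 250, 250, 280, 300, 300, 310, 325, 620]

def z_standardize(values):
    """Return z-scores using the population SD (an assumption)."""
    m, sd = mean(values), pstdev(values)
    return [(x - m) / sd for x in values]

z_all = z_standardize(rents)
z_without_outlier = z_standardize(rents[:-1])  # drop the room at 620

# Same room (rent 175), two different z-scores.
print(round(z_all[0], 2), round(z_without_outlier[0], 2))
```

Removing the most expensive room lowers both the mean and the standard deviation, so the cheapest room moves further below the mean in standard deviation units even though its rent did not change.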
21
P & Z-standardization
• When interpreting Z-standardized variables
one uses percentiles
• With P-standardization one decreases the
scale of measurement to ordinal, BUT this
improves interpretability.
22
Student recap
23
Do before Wednesday
• Read McCall chapter 5
• Understand Appendix 2, table A
• Do exercises 5.7-5.28
24