Transforming Data

Psych 5500/6500
Data Transformations
Fall, 2008
1
Data Transformations
We are now going to examine an option we
have if we find that our data appear to violate
the assumption that the populations are
normally distributed or the assumption that
the populations have the same variance.
This option is to transform our data to better
fit the assumptions.
First we will look at some options on how to do
this, and then we will turn to general issues
and concerns about transforming data.
2
You Might Not Need to Worry
About the Assumptions
Remember that violation of the assumption
of normality grows less serious as the N of
our samples increases, and that violation of
the assumption of homogeneity of
variances is not important when the N’s of
our groups are roughly equal. Thus you
may not need to turn to transformations if
the N’s of your groups are largish and
approximately the same size.
3
Transformations to Take Care of
Anticipated Problems
Certain types of measurements routinely produce
samples that violate one or both of the assumptions
of normality and equal variances. The solutions are
fairly well established and it is always better to
anticipate a priori what transformation might be
appropriate. Post hoc decisions to transform the
data face two criticisms: 1) are you changing your
data just to get the results you want? and 2) is your
population actually ok, so that you are changing your
data (and thus the population it represents) to fix a
problem that appeared in your sample just by
chance?
4
Reaction Times
Reaction times are often positively skewed.
Transformations that are recommended for
reaction times (or any positively skewed
data) are:
1) $Y_{i,\mathrm{transform}} = \log_{10} Y_i$
2a) $Y_{i,\mathrm{transform}} = \dfrac{1}{Y_i}$
or, if Y might equal 0 on some scores, then
2b) $Y_{i,\mathrm{transform}} = \dfrac{1}{Y_i + 1}$
Try both 1 and 2 and see which works best.
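As a rough illustration of "try both and see which works best," here is a minimal Python sketch. The reaction times are made up for the example, and using skewness (the mean cubed z-score) to judge "works best" is an assumption of this sketch, not something the slides prescribe.

```python
import math

def log10_transform(scores):
    """Y_transform = log10(Y); requires every score to be > 0."""
    return [math.log10(y) for y in scores]

def reciprocal_transform(scores, allow_zero=False):
    """Y_transform = 1/Y, or 1/(Y + 1) when some scores may equal 0."""
    shift = 1 if allow_zero else 0
    return [1 / (y + shift) for y in scores]

def skewness(scores):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in scores) / n)
    return sum(((y - mean) / sd) ** 3 for y in scores) / n

# Hypothetical reaction times in seconds, with a long positive tail.
reaction_times = [0.31, 0.35, 0.38, 0.40, 0.44, 0.52, 0.75, 1.40]

for name, values in [
    ("raw", reaction_times),
    ("log10", log10_transform(reaction_times)),
    ("reciprocal", reciprocal_transform(reaction_times)),
]:
    print(f"{name:>10}: skewness = {skewness(values):+.2f}")
```

A skewness closer to 0 suggests a more symmetric distribution. Note that the reciprocal reverses the order of the scores, so the sign of the skew can flip.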
5
Counts
If the variable involves counting something
(e.g. occurrences of some behavior) then
there could be a problem, particularly if low
counts (around zero) occur. The floor effect
of not being able to have a score below zero
will affect the normality of the data, and if
one of the groups has more of a floor effect
than the other then the groups will have
different variances. A recommended
transformation in this case is:
$Y_{i,\mathrm{transform}} = \sqrt{Y_i}$
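To see what the square root does to a floor effect, here is a small Python sketch. The two groups of counts are hypothetical; group A is bunched up against zero and group B is not.

```python
import math
import statistics

def sqrt_transform(counts):
    """Y_transform = sqrt(Y); defined for counts of 0 and above."""
    return [math.sqrt(y) for y in counts]

# Hypothetical counts of some behavior in two groups.
# Group A sits near the floor of zero; group B does not.
group_a = [0, 0, 1, 1, 2, 3]
group_b = [4, 6, 7, 9, 12, 16]

for label, a, b in [
    ("raw", group_a, group_b),
    ("sqrt", sqrt_transform(group_a), sqrt_transform(group_b)),
]:
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    print(f"{label:>4}: var A = {var_a:.2f}, var B = {var_b:.2f}, "
          f"ratio B/A = {var_b / var_a:.2f}")
```

With these made-up numbers the ratio of the two variances moves much closer to 1 after the transformation. Real data will not always cooperate this neatly.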
6
Proportions
Proportions as measures suffer from several
problems; two of them are:
1. Proportions around .5 have greater variance than
proportions close to 0 or 1 (due to floor and ceiling
effects). Thus the variance of two groups will
differ if one has proportions closer to .5 than the
other.
2. Many people consider a difference in the mean
proportion of two groups of .02 and .08 (a
difference of .06 but also a quadrupling of the
proportion) to be greater than a difference of .48
and .54 (also a difference of .06 but a much
smaller change in terms of ratios).
7
Proportions (cont.)
Stretching out both tails of the distribution can help
with both of those problems. Transforms that
can accomplish this are:
arcsine transform:
$Y_{i,\mathrm{transform}} = \sin^{-1}\sqrt{Y_i}$
logit transform:
$Y_{i,\mathrm{transform}} = \log\dfrac{Y_i}{1 - Y_i}$
Try them both and see which works best.
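The following Python sketch applies both transforms to the two pairs of proportions mentioned earlier (.02 vs. .08 and .48 vs. .54). It assumes the natural log for the logit, which the slide does not specify, and returns the arcsine in radians.

```python
import math

def arcsine_transform(proportions):
    """Y_transform = arcsin(sqrt(Y)), in radians; valid for 0 <= Y <= 1."""
    return [math.asin(math.sqrt(p)) for p in proportions]

def logit_transform(proportions):
    """Y_transform = ln(Y / (1 - Y)); undefined at exactly 0 or 1."""
    return [math.log(p / (1 - p)) for p in proportions]

# The two pairs from the text: both differ by .06 on the raw scale.
low_pair = [0.02, 0.08]
mid_pair = [0.48, 0.54]

for name, transform in [("arcsine", arcsine_transform),
                        ("logit", logit_transform)]:
    low = transform(low_pair)
    mid = transform(mid_pair)
    print(f"{name:>7}: low-pair gap = {low[1] - low[0]:.3f}, "
          f"mid-pair gap = {mid[1] - mid[0]:.3f}")
```

On both transformed scales the gap between .02 and .08 comes out larger than the gap between .48 and .54, which is the "stretching of the tails" at work.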
8
Power Transformations
The general formula of raising Y to some
power (‘pow’), as in $Y_{\mathrm{transform}} = Y^{\mathrm{pow}}$, can be
used to reshape the distribution in a variety
of ways. We have already seen one use of
this, the square root of Y, which is the same
as $Y^{0.5}$ (in SPSS that would be Y**.5).
Here we are venturing into the territory of post
hoc transformations, where you try out
various transformations until the distribution
is the shape that you want.
9
Power Transformation Strategies
Try various values for the power; you might try the
following to see which works best:
pow = 3, 2, 0.5, ‘0’, -0.5, -1, -2.
Obviously, $Y^1$ isn’t on the list, as $Y^1 = Y$.
The ‘0’ is in quote marks because raising Y to
the power of 0 will result in changing all of
your scores to ‘1’. In place of using pow=0,
substitute $Y_{\mathrm{transform}} = \log_{10} Y$.
10
Power Transformation Strategies
Instead of pure trial and error, you can use the
general strategy of trying a value of pow>1 if
you want to bring in a long negative tail, and
pow<1 if you want to bring in a long positive
tail. The further from ‘1’ in either direction
the more the tail will be pulled in.
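A minimal Python sketch of this strategy follows. The scores are hypothetical, and skewness (the mean cubed z-score) is used here as one possible yardstick for how long a tail is; that choice is an assumption of the sketch.

```python
import math

def power_transform(scores, power):
    """Y_transform = Y**power, with log10(Y) standing in for power = 0.

    Assumes every score is > 0 (needed for the log and negative powers).
    """
    if power == 0:
        return [math.log10(y) for y in scores]
    return [y ** power for y in scores]

def skewness(scores):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in scores) / n)
    return sum(((y - mean) / sd) ** 3 for y in scores) / n

# Hypothetical scores with a long positive tail.
scores = [1.0, 1.2, 1.5, 1.9, 2.4, 3.5, 6.0, 12.0]

# pow = 1 is included only as the untransformed baseline.
for power in [3, 2, 1, 0.5, 0, -0.5, -1, -2]:
    skew = skewness(power_transform(scores, power))
    print(f"pow = {power:>4}: skewness = {skew:+.2f}")
```

For these positively skewed scores, powers above 1 make the skew worse and powers below 1 pull the positive tail in. Negative powers also reverse the order of the scores, so keep that in mind when interpreting the transformed values.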
11
Power Transformation Strategies
Some statistical programs will run through
possible values for the power of Y and then
report which one best turns the data into a
normal curve (SPSS apparently doesn’t
offer that).
It is important to note that whether you figure
out a good value for pow or the computer does,
such a transformation would be purely post
hoc.
12
Rank Transformation
If you cannot transform your data in such a way that it
becomes more normally distributed then an
interesting option of last resort is to transform your
data into rank scores. Each score is transformed
into a rank score which indicates where it falls in a
list of all the scores in the study for that variable. If
observations are tied, then each observation
receives the mean of the ranks they would have
received if they weren’t tied.
13
• For example, the (cardinal) data
– Group 1: 3.4, 7.2, 5.1, 6.9
– Group 2: 7.2, 5.1, 7.2, 8.4
• Would first be put into one list and ordered:
– Y= 3.4, 5.1, 5.1, 6.9, 7.2, 7.2, 7.2, 8.4
• Scores of 5.1 come in 2nd and 3rd on the list, so
they each get a rank of (2+3)/2=2.5. Scores of
7.2 come in 5th, 6th, and 7th, so they each get a
rank of (5+6+7)/3=6.
• Ytransform=1, 2.5, 2.5, 4, 6, 6, 6, 8
• Transformed (to ranked) data:
– Group 1: 1, 6, 2.5, 4
– Group 2: 6, 2.5, 6, 8
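The ranking steps above can be sketched in Python as follows. The function name rank_transform is just a label chosen for this example.

```python
def rank_transform(scores):
    """Return the rank of each score, giving tied scores their mean rank."""
    ordered = sorted(scores)
    mean_rank = {}
    position = 0
    while position < len(ordered):
        value = ordered[position]
        tie_count = ordered.count(value)
        # Ranks position+1 .. position+tie_count are shared by the ties.
        first, last = position + 1, position + tie_count
        mean_rank[value] = (first + last) / 2
        position += tie_count
    return [mean_rank[y] for y in scores]

group_1 = [3.4, 7.2, 5.1, 6.9]
group_2 = [7.2, 5.1, 7.2, 8.4]

# Rank across both groups combined, then split back into groups.
ranks = rank_transform(group_1 + group_2)
ranks_1, ranks_2 = ranks[:len(group_1)], ranks[len(group_1):]

print(ranks_1)  # [1.0, 6.0, 2.5, 4.0]
print(ranks_2)  # [6.0, 2.5, 6.0, 8.0]
```

The key point is that the ranking is done over all the scores in the study at once, not separately within each group.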
14
Upside of rank transformations:
1. Control the effect of outliers (because
distance to extreme score is reduced).
2. While they don’t create normal data they
generally reduce problems of thick tails.
3. While they don’t ensure homogeneity of
variance they generally prevent very large
differences in variance.
Downside of rank scores: you are losing
information when you move from cardinal to
rank scores.
15
Parametric and Nonparametric Tests
Statistical procedures that are based upon certain
assumptions about the population being true are
called parametric tests. The t tests--in fact
every test we will look at this semester--are
parametric tests.
Statistical procedures that are not based upon
certain assumptions about the population being
true are called nonparametric tests. These
tests are useful in that they can be applied to
situations where assumptions certainly are not
met, but nonparametric tests tend to have low
power.
16
Parametric Tests of Rank Data
Most nonparametric tests require rank data. Applying a
parametric test to rank data is an interesting halfway
step between parametric and nonparametric
tests. By ranking the data you are not meeting the
assumptions of the parametric tests, but you are
violating the assumptions in a way that tends to be
ok (e.g. creating thin tails rather than fat tails).
While it is still better not to have to transform cardinal
data to rank data (because you lose so much
information and probably lose power), if you do, it
may be better to use a parametric test than a
nonparametric test on the rank data.
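As one possible illustration of a parametric test on rank data, the Python sketch below computes Welch's unequal-variance t statistic on the rank scores from the earlier worked example. Choosing Welch's t for this illustration is an assumption of the sketch. It stops at the statistic and its approximate degrees of freedom; getting a p value would require a t table or a statistics package.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's unequal-variance t statistic and its approximate df."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = mean_diff / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df

# Rank scores from the worked ranking example (ranked over all 8 scores).
ranks_1 = [1.0, 6.0, 2.5, 4.0]
ranks_2 = [6.0, 2.5, 6.0, 8.0]

t_stat, df = welch_t(ranks_1, ranks_2)
print(f"t' = {t_stat:.2f}, df = {df:.2f}")
```

Nothing about the test itself changes; the only difference from an ordinary analysis is that the scores going in are ranks rather than the original cardinal data.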
17
Advantages
The advantages of using a parametric test on rank
data vs. using a nonparametric test are:
1. You don’t have to learn another whole class of
statistical tests (nonparametric tests).
2. In some cases the parametric test of rank data
may be just as good or even better than the
nonparametric test (see Judd & McClelland,
1989, and Ruxton, 2006).
Judd, C. M., & McClelland, G. H. (1989). Data analysis: A model-comparison
approach. New York: Harcourt College Publishers.
Ruxton, G. D. (2006). The unequal variance t-test is an underused
alternative to Student's t-test and the Mann–Whitney U test.
Behavioral Ecology, 17, 688–690.
18
The Mann–Whitney U test is the nonparametric equivalent to a t test. The
following table compares type 1 error rate of the U test to that of Welch’s
t’ in a Monte Carlo study (Zimmerman and Zumbo, 1993, as adapted
and cited in Ruxton, 2006) when both tests were applied to rank data.
N1    N2    σ1/σ2    U       t’
6     18    1        .052    .049
6     18    2        .085    .051
6     18    4        .104    .054
18    6     1        .049    .052
18    6     2        .030    .056
18    6     4        .023    .064
19
Issues and Concerns Regarding
Transformations
1) You are moving the data further away
from reflecting reality in an attempt to
better make it fit your tool.
2) If your original measure is more relevant
to your theory than the transform, then
can you generalize the analysis of the
transformed data to the theory?
20
Issues and Concerns (cont.)
3) Transforming your data to fit the tool may
or may not be justified, but transforming
your data to make the results fit your
theory is definitely not justified. Comment:
this is why transforms that are selected a
priori due to known characteristics of the
measure have an advantage over post
hoc transforms.
21
Issues and Concerns (cont.)
4) Post hoc transforms might end up fixing
problems that are a fluke, that only appear
in this particular sample. This shows one
of the advantages of replicating studies. If
the data in the first study suggests a
particular transform would be useful, you
might want to use it, but then withhold final
judgment on its appropriateness until a
second study confirms its utility.
22