Data Transformation
• Objectives:
– Understand why we often need to transform our data
– The three commonly used data transformation techniques
– Additive effects and multiplicative effects
– Application of data transformation in ANOVA and regression
Xuhua Xia
Why Data Transformation?
• The assumptions of most parametric methods:
– Homoscedasticity
– Normality
– Additivity
– Linearity
• Data transformation is used to make your data conform to the assumptions of the statistical methods
• Illustrative examples
Homoscedasticity and Normality
The data deviate from both homoscedasticity and normality.
Wouldn't it be nice if we could make the data look this way?
Types of Data Transformation
• The logarithmic transformation
• The square-root transformation
• The arcsine transformation
• Data transformation can be done conveniently in Excel.
• Alternatives: Ranks and nonparametric methods.
Homoscedasticity
ID              Group 1      Group 2
1                  2.72        20.09
2                  7.39        54.60
3                  7.39        54.60
4                 20.09       148.41
5                 20.09       148.41
6                 20.09       148.41
7                 54.60       403.43
8                 54.60       403.43
9                148.41      1096.63
n                     9            9
Mean              37.26       275.33
Var             2102.35    114784.50
t                 -2.09
df                   16
P                0.0530
Equal Var.?  P = 0.0000
Kurtosis           4.89         4.89
Skewness           2.12         2.12
P(Zg1)           0.0046       0.0046
P(Zg2)           0.0144       0.0144
• The two groups of data seem to differ
greatly in means, but a t-test shows that
the means do not differ significantly from
each other - a surprising result.
• The two groups of data differ greatly in
variance, and both deviate significantly
from normality. These results invalidate
the t-test.
• We calculate two ratios: var/mean ratio
and Std/mean ratio (i.e., coefficient of
variation).
             Group 1    Group 2
Var/mean      56.420    416.891
C.V.           1.230      1.230
• The C.V. is the same in both groups, which suggests a log-transformation.
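The two ratios can be checked with a few lines of Python. This is only a sketch using the standard library and the Group 1 and Group 2 values from the table; the function name `spread_ratios` is made up for illustration.

```python
from statistics import mean, variance, stdev

# Values copied from the Homoscedasticity table.
group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]
group2 = [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63]

def spread_ratios(values):
    """Return (variance/mean, coefficient of variation) for one group."""
    m = mean(values)
    return variance(values) / m, stdev(values) / m

for name, values in (("Group 1", group1), ("Group 2", group2)):
    var_mean, cv = spread_ratios(values)
    print(f"{name}: var/mean = {var_mean:.3f}, C.V. = {cv:.3f}")
```

The var/mean ratios differ widely between the groups while the C.V. values agree, which is the pattern the slide points to.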
Log-Transformed Data
NewX = ln(X+1)
                  Original X             NewX
ID             Group 1   Group 2   Group 1    Group 2
1                 2.72     20.09      1.31       3.05
2                 7.39     54.60      2.13       4.02
3                 7.39     54.60      2.13       4.02
4                20.09    148.41      3.05       5.01
5                20.09    148.41      3.05       5.01
6                20.09    148.41      3.05       5.01
7                54.60    403.43      4.02       6.00
8                54.60    403.43      4.02       6.00
9               148.41   1096.63      5.01       7.00
n                                        9          9
Mean                                  3.08       5.01
Var                                   1.30       1.47
t                                    -3.48
df                                      16
P                                   0.0031
Equal Var.?                     P = 0.8687
Kurtosis                             -0.35      -0.30
Skewness                              0.16       0.02
P(Zg1)                                0.82       0.97
P(Zg2)                            0.659393   0.684475
• The transformation is successful
because:
– The variance is now similar
– Deviation from normality is now
nonsignificant
– The t-test revealed a highly significant
difference in means between the two
groups
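As a rough check of the comparison above, the sketch below applies NewX = ln(X+1) and computes the pooled-variance t statistic before and after the transformation. It uses only the Python standard library; the helper names `log_transform` and `pooled_t` are illustrative, and the input values are the rounded ones printed on the slide, so the results can differ slightly in the last digit.

```python
import math
from statistics import mean, variance

group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]
group2 = [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63]

def log_transform(values):
    """NewX = ln(X + 1), as on the slide."""
    return [math.log(x + 1) for x in values]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, df

t_raw, df = pooled_t(group1, group2)
t_log, _ = pooled_t(log_transform(group1), log_transform(group2))
print(f"original scale: t = {t_raw:.2f}, df = {df}")
print(f"log scale:      t = {t_log:.2f}, df = {df}")
```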
Log-Transformed Data
NewX = ln(X+1)
Transform back:
$X = e^{\mathrm{NewX}} - 1$
Compare this back-transformed mean with the original mean. Which one is preferable?
Calculate the standard error, the degrees of freedom, and the 95% CL
(t0.025,16 = 2.12).
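One possible way to work through this exercise is sketched below. It rests on assumptions not stated on the slide: the standard error of each mean is taken from the pooled variance of the two log-transformed groups (which is what 16 degrees of freedom implies), and the critical value 2.120 is the standard two-sided 95% table value of t for 16 df. The names `to_log`, `from_log` and `summary` are illustrative.

```python
import math
from statistics import mean, variance

group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]
group2 = [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63]

# Two-sided 95% critical value of t for 16 df, from a standard t table.
T_CRIT = 2.120

def to_log(x):
    return math.log(x + 1)      # NewX = ln(X + 1)

def from_log(y):
    return math.exp(y) - 1      # X = e^NewX - 1

log1 = [to_log(x) for x in group1]
log2 = [to_log(x) for x in group2]

# Assumption: equal group sizes, so the pooled variance is the simple average.
n = len(log1)
pooled_var = (variance(log1) + variance(log2)) / 2
se = math.sqrt(pooled_var / n)

summary = {}
for name, raw, logged in (("Group 1", group1, log1), ("Group 2", group2, log2)):
    m = mean(logged)
    summary[name] = {
        "raw_mean": mean(raw),
        "back_mean": from_log(m),
        "lower": from_log(m - T_CRIT * se),
        "upper": from_log(m + T_CRIT * se),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```

The back-transformed mean is smaller than the arithmetic mean of the raw data, and the back-transformed limits are asymmetric around it.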
Normal but Heteroscedastic
Any transformation that you use is likely to
change normality. Fortunately, t-test and ANOVA
are quite robust for this kind of data. Of course,
you can also use nonparametric tests.
ID              Group 1    Group 2
1                    11          1
2                    12          4
3                    12          4
4                    13          8
5                    13          8
6                    13          8
7                    14         12
8                    14         12
9                    15         15
n                     9          9
Mean                 13          8
Var                 1.5      20.25
t              3.216338
df                   16
P                0.0054
Equal Var.?  P = 0.0013
Kurtosis       -0.28571   -0.76582
Skewness              0          0
The two variances are significantly different.
The t-test, however, detects a significant difference in means. You can use nonparametric methods to analyse the data for comparison, and you are likely to find the t-test to be more powerful.
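The variances and the t statistic in this example are easy to reproduce. The sketch below uses only the Python standard library and the two data columns from the table; the variable names are illustrative, and no P values are computed because that would need a statistics package.

```python
import math
from statistics import mean, variance

group1 = [11, 12, 12, 13, 13, 13, 14, 14, 15]
group2 = [1, 4, 4, 8, 8, 8, 12, 12, 15]

v1, v2 = variance(group1), variance(group2)
f_ratio = max(v1, v2) / min(v1, v2)       # larger variance over smaller

n1, n2 = len(group1), len(group2)
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_stat = (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"variances: {v1} and {v2}, ratio = {f_ratio:.1f}")
print(f"t = {t_stat:.4f} with {n1 + n2 - 2} df")
```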
Additivity
                     Factor A
Factor B        Level 1    Level 2
Level 1           1.313      2.450
                  2.127      3.240
                  2.127      3.343
                  3.049      3.976
                  3.049      3.578
                  3.049      4.507
                  4.018      4.666
                  4.018      5.324
                  5.007      6.060
Mean              3.084      4.127
Level 2           2.030      2.927
                  2.805      3.599
                  2.751      3.837
                  3.968      4.933
                  3.766      4.576
                  3.589      4.781
                  4.570      5.719
                  4.562      5.983
                  5.956      6.868
Mean              3.778      4.803
• What experimental design is
this?
• Compare the group means.
Is there an interaction effect?
Additivity means that the difference
between levels of one factor is
consistent for different levels of
another factor.
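Additivity can be checked directly from the four cell means. The sketch below compares the effect of Factor A at each level of Factor B; the dictionary layout and the variable names are only illustrative.

```python
# Cell means from the Additivity table, keyed by (Factor A level, Factor B level).
cell_means = {
    ("A1", "B1"): 3.084, ("A2", "B1"): 4.127,
    ("A1", "B2"): 3.778, ("A2", "B2"): 4.803,
}

# Effect of Factor A (Level 2 minus Level 1) at each level of Factor B.
effect_a_at_b1 = cell_means[("A2", "B1")] - cell_means[("A1", "B1")]
effect_a_at_b2 = cell_means[("A2", "B2")] - cell_means[("A1", "B2")]

# Under perfect additivity this contrast would be zero.
interaction = effect_a_at_b2 - effect_a_at_b1

print(f"A effect at B1: {effect_a_at_b1:.3f}")
print(f"A effect at B2: {effect_a_at_b2:.3f}")
print(f"interaction contrast: {interaction:.3f}")
```

The two differences are nearly equal, so the interaction contrast is close to zero.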
Multiplicative Effects
                     Factor A
Factor B        Level 1    Level 2
Level 1           2.718     10.589
                  7.389     24.530
                  7.389     27.316
                 20.086     52.304
                 20.086     34.795
                 20.086     89.660
                 54.598    105.306
                 54.598    204.262
                148.413    427.365
Mean             37.262    108.458
Level 2           6.616     17.678
                 15.524     35.570
                 14.665     45.408
                 51.884    137.739
                 42.222     96.110
                 35.185    118.225
                 95.498    303.596
                 94.819    395.790
                385.215    960.457
Mean             82.403    234.508
• Compare the group means. Is
there an interaction effect?
• Does this data set meet the
assumption of additivity?
• When the assumption of
additivity is not met, we have
difficulty in interpreting main
effects.
• Now calculate the ratio of group
means. What did you find?
Multiplicative Effects
For Factor A, Level 2 has a mean about 2.88 times as large as that for Level 1. For Factor B, Level 2 has a mean about 2.18 times as large as that for Level 1.
If you know the value for Level 1 of Factor A, you can obtain the value for Level 2 of Factor A by multiplying the known value by 2.88. Similarly, you can do the same for Factor B.
We say that the effects of Factors A and B are multiplicative, not additive.
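The contrast between differences and ratios can be seen from the four cell means. The following is a small sketch with illustrative variable names, using the means from the Multiplicative Effects table.

```python
# Cell means on the original scale, keyed by (Factor A level, Factor B level).
cell_means = {
    ("A1", "B1"): 37.262, ("A2", "B1"): 108.458,
    ("A1", "B2"): 82.403, ("A2", "B2"): 234.508,
}

# Differences between levels of A are far from constant across levels of B...
diff_a_at_b1 = cell_means[("A2", "B1")] - cell_means[("A1", "B1")]
diff_a_at_b2 = cell_means[("A2", "B2")] - cell_means[("A1", "B2")]

# ...but the ratios are nearly constant.
ratio_a_at_b1 = cell_means[("A2", "B1")] / cell_means[("A1", "B1")]
ratio_a_at_b2 = cell_means[("A2", "B2")] / cell_means[("A1", "B2")]
ratio_b_at_a1 = cell_means[("A1", "B2")] / cell_means[("A1", "B1")]
ratio_b_at_a2 = cell_means[("A2", "B2")] / cell_means[("A2", "B1")]

print(f"A differences: {diff_a_at_b1:.3f} vs {diff_a_at_b2:.3f}")
print(f"A ratios:      {ratio_a_at_b1:.2f} vs {ratio_a_at_b2:.2f}")
print(f"B ratios:      {ratio_b_at_a1:.2f} vs {ratio_b_at_a2:.2f}")
```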
Log-transformation
Now log-transform the data. Compare
the means. Is the assumption of
additivity met now?
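                                     Mean     Variance
Original data
  Factor A Level 1, B Level 1      37.262     2102.351
  Factor A Level 2, B Level 1     108.458    17878.648
  Factor A Level 1, B Level 2      82.403    12400.091
  Factor A Level 2, B Level 2     234.508    80241.944
Transformed data
  Factor A Level 1, B Level 1       3.084        1.302
  Factor A Level 2, B Level 1       4.127        1.268
  Factor A Level 1, B Level 2       3.778        1.235
  Factor A Level 2, B Level 2       4.803        1.385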
Why can the log-transformation change multiplicative effects into additive effects?
$Z = XY$
$\ln(Z) = \ln(X) + \ln(Y)$
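The identity can be illustrated numerically. The sketch below builds a hypothetical 2 x 2 layout with purely multiplicative effects; the base value 10.0 is invented for illustration, and the multipliers 2.88 and 2.18 are borrowed from the ratios quoted for Factors A and B. It then shows that the effect of Factor A becomes the same additive constant at both levels of Factor B after taking logs.

```python
import math

# Hypothetical values: a base cell mean and one multiplier per factor.
base, a_effect, b_effect = 10.0, 2.88, 2.18

# Cell value = base * (A multiplier if A is at Level 2) * (B multiplier if B is at Level 2).
cells = {
    (a, b): base * (a_effect if a == 2 else 1.0) * (b_effect if b == 2 else 1.0)
    for a in (1, 2)
    for b in (1, 2)
}

# On the log scale, the A effect is a difference, identical at both B levels.
log_diff_a_at_b1 = math.log(cells[(2, 1)]) - math.log(cells[(1, 1)])
log_diff_a_at_b2 = math.log(cells[(2, 2)]) - math.log(cells[(1, 2)])

print(f"ln scale, A effect at B1: {log_diff_a_at_b1:.4f}")
print(f"ln scale, A effect at B2: {log_diff_a_at_b2:.4f}")
print(f"ln(2.88) = {math.log(a_effect):.4f}")
```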
Square-Root Transformation
ID            Group 1    Group 2
1                   1          9
2                   4         16
3                   4         16
4                   9         25
5                   9         25
6                   9         25
7                  16         36
8                  16         36
9                  25         49
Mean           10.333     26.333
Var            56.500    152.500
Var/Mean        5.468      5.791
Std/Mean        0.727      0.469
• The two groups of data differ greatly in variance.
• Calculate two ratios: the var/mean ratio and the Std/mean ratio (i.e., the coefficient of variation).
• Does your calculation suggest log-transformation? When is log-transformation appropriate?
• Use the square-root transformation when different groups have similar Variance/Mean ratios.
Notice the means, which do not coincide with the most frequent observations.
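The rule of thumb above can be turned into a small helper. This is a hypothetical heuristic, not a method given on the slides: it picks the square-root transformation when the var/mean ratio is the more nearly constant quantity across groups, and the log-transformation when the C.V. is. The function names are made up for illustration.

```python
from statistics import mean, variance, stdev

def relative_spread(xs):
    """Range of xs relative to their mean: 0 means perfectly constant."""
    return (max(xs) - min(xs)) / mean(xs)

def suggest_transformation(groups):
    """Hypothetical helper: compare constancy of var/mean and C.V. across groups."""
    var_mean = [variance(g) / mean(g) for g in groups]
    cv = [stdev(g) / mean(g) for g in groups]
    return "square-root" if relative_spread(var_mean) < relative_spread(cv) else "log"

# Data from the square-root example.
sqrt_groups = [
    [1, 4, 4, 9, 9, 9, 16, 16, 25],
    [9, 16, 16, 25, 25, 25, 36, 36, 49],
]
# Data from the earlier log-transformation example.
log_groups = [
    [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41],
    [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63],
]

print(suggest_transformation(sqrt_groups))
print(suggest_transformation(log_groups))
```

With more than two groups, a plot of variance or standard deviation against the mean is usually more informative than a single number.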
Square-Root Transformation
Square-root transformation:
$X' = \sqrt{X + 3/8}$
               Group 1              Group 2
ID            X       X'           X       X'
1             1     1.17           9     3.06
2             4     2.09          16     4.05
3             4     2.09          16     4.05
4             9     3.06          25     5.04
5             9     3.06          25     5.04
6             9     3.06          25     5.04
7            16     4.05          36     6.03
8            16     4.05          36     6.03
9            25     5.04          49     7.03
Mean     10.333     3.07      26.333     5.04
Var      56.500    1.412     152.500    1.475
The variance is now almost identical between the two groups.
Transform the means back to the original scale and compare these means with the original means:
$X = (X')^2 - 3/8$
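A short sketch of the square-root transformation and its back-transformation, using the Python standard library and the two groups from the table. The function names are illustrative.

```python
import math
from statistics import mean, variance

group1 = [1, 4, 4, 9, 9, 9, 16, 16, 25]
group2 = [9, 16, 16, 25, 25, 25, 36, 36, 49]

def sqrt_transform(values):
    """X' = sqrt(X + 3/8)."""
    return [math.sqrt(x + 3 / 8) for x in values]

def back_transform(y):
    """X = (X')^2 - 3/8."""
    return y ** 2 - 3 / 8

t1, t2 = sqrt_transform(group1), sqrt_transform(group2)
back_mean1, back_mean2 = back_transform(mean(t1)), back_transform(mean(t2))

for name, raw, tr, bm in (("Group 1", group1, t1, back_mean1),
                          ("Group 2", group2, t2, back_mean2)):
    print(f"{name}: mean X' = {mean(tr):.2f}, var X' = {variance(tr):.3f}, "
          f"back-transformed mean = {bm:.2f}, original mean = {mean(raw):.3f}")
```

The back-transformed means come out below the arithmetic means of the raw data, closer to the most frequent observations.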
Quiz on Data Transformation
            Group 1   Group 2   Group 3   Group 4
                  2         6         9         2
                  0         4         5         4
                  2         8         6         1
                  3         2         5         0
                  0         4        11         2
n                 5         5         5         5
Mean            1.4       4.8       7.2       1.8
Var             1.8       5.2       7.2       2.2
SE            0.600     1.020     1.200     0.663
T             2.776     2.776     2.776     2.776
LowerL       -0.266     1.969     3.868    -0.042
UpperL        3.066     7.631    10.532     3.642
The data set is right skewed for each group.
Calculate the variance/mean ratio and C.V. for each group, and decide what transformation you should use.
Do the transformation and convert the means back to the original scale.
With Multiple Groups
[Figure: Variance (y-axis, 0 to 8) plotted against Mean (x-axis, 0 to 8)]
When you have multiple groups, a “Variance vs Mean” or a “Std vs Mean” plot can help you decide which data transformation to use. The graph shows that the Var/Mean ratio is almost constant. What transformation should you use?
Confidence Limits
[Figure: two panels, "Before transformation" (left) and "After transformation" (right), each plotting Mean, Lower, Upper (y-axis, -2 to 12) against Mean (x-axis, 0 to 8)]
Given the skewness in our data, do the confidence limits on the right make more sense? Why?
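The effect of transformation on confidence limits can be sketched for one of the quiz groups. The code below assumes the square-root transformation X' = sqrt(X + 3/8) is the one being applied, uses the Group 1 quiz data (2, 0, 2, 3, 0) and the critical value 2.776 from the quiz table; the function name `conf_limits` is illustrative.

```python
import math
from statistics import mean, variance

T_CRIT = 2.776            # t value for 4 df, as used in the quiz table
group = [2, 0, 2, 3, 0]   # Group 1 from the quiz

def conf_limits(values, t_crit):
    """Return (lower, mean, upper) for the mean of values."""
    m = mean(values)
    se = math.sqrt(variance(values) / len(values))
    return m - t_crit * se, m, m + t_crit * se

# Limits computed directly on the original scale (symmetric around the mean).
raw_lower, raw_mean, raw_upper = conf_limits(group, T_CRIT)

# Limits computed on the square-root scale, then squared back.
# Squaring preserves order here because the lower limit on the X' scale is positive.
transformed = [math.sqrt(x + 3 / 8) for x in group]
back_lower, back_mean, back_upper = (y ** 2 - 3 / 8
                                     for y in conf_limits(transformed, T_CRIT))

print(f"before: {raw_lower:.3f}  {raw_mean:.3f}  {raw_upper:.3f}")
print(f"after:  {back_lower:.3f}  {back_mean:.3f}  {back_upper:.3f}")
```

The back-transformed interval is asymmetric, with the longer arm on the side of the right tail.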
Arcsine Transformation
$X' = \arcsin(\sqrt{X})$, with X as a proportion (the table lists X in percent and X' in degrees)
               Original X            Transformed X'
            Group1    Group2       Group1    Group2
             84.20     92.30       66.579    73.890
             88.90     95.10       70.539    77.211
             89.20     90.30       70.814    71.854
             83.40     88.60       65.957    70.267
             80.10     92.60       63.507    74.215
             81.30     96.00       64.378    78.463
             85.80     93.70       67.863    75.463
Mean         84.70     92.66       67.091    74.480
Var          12.29      6.73        8.017     8.226
SE           1.325     0.980        1.070     1.084
LowerL      81.457    90.258       64.472    71.828
UpperL      87.943    95.056       69.709    77.133
Transform back: $X = (\sin X')^2$
NewMean     84.847    92.841
LowerL      81.428    90.273
UpperL      87.974    95.041
• Used for proportions
• Compare the variances before and after transformation
• Do you know how to transform the means and C.L. back to the original scale?
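The arcsine transformation and its inverse can be sketched as follows. The code assumes, based on the numbers in the table, that the original values are percentages and the transformed values are in degrees; the function names are illustrative.

```python
import math
from statistics import mean

group1 = [84.20, 88.90, 89.20, 83.40, 80.10, 81.30, 85.80]
group2 = [92.30, 95.10, 90.30, 88.60, 92.60, 96.00, 93.70]

def arcsine_transform(percent):
    """X' = arcsin(sqrt(X)), X given in percent, X' returned in degrees."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))

def back_transform(degrees):
    """X = (sin X')^2, returned in percent."""
    return 100 * math.sin(math.radians(degrees)) ** 2

t1 = [arcsine_transform(x) for x in group1]
t2 = [arcsine_transform(x) for x in group2]

new_mean1 = back_transform(mean(t1))
new_mean2 = back_transform(mean(t2))
print(f"Group1: mean X' = {mean(t1):.3f}, back-transformed mean = {new_mean1:.3f}")
print(f"Group2: mean X' = {mean(t2):.3f}, back-transformed mean = {new_mean2:.3f}")
```

The confidence limits are transformed back in the same way, by applying `back_transform` to each limit on the X' scale.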
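Data Transformation Using SAS
Data Mydata;
  input x;
  /* Alternative transformations: keep only the statement you need,
     since each one overwrites newx. */
  newx = log(x);           /* Natural logarithm transformation */
  newx = sqrt(x + 3/8);    /* Square-root transformation */
  newx = arsin(sqrt(x));   /* Arcsine transformation */
cards;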