Preparing Data for Analysis

One-Way ANOVA
By
Walden University Statsupport Team
March 2011
One Way Analysis of Variance
(One Way ANOVA)
• Introduction
• Assumptions
• Parametric ANOVA
• Post hoc Comparison
• Non-parametric ANOVA
Introduction




An important technique for analyzing the effect of
categorical factors on a response is to perform an
Analysis of Variance (ANOVA).
An ANOVA decomposes the variability in the response
variable amongst the different factors.
The most commonly used form of ANOVA is the One-Way ANOVA. It is used when there is only a single
categorical factor.
The typical questions asked in One-Way ANOVA are
(a) Is there a significant difference between the
groups?, and (b) If so, which groups are significantly
different from which others?
ANOVA Assumptions



ANOVA requires distributional assumptions of:
• Independence
• Normality
• Equal variance
There is no formal statistical test for testing the
independence assumption. The independence
assumption refers to the way data are collected.
The independence assumption is closely tied to how
subjects are selected, and ANOVA is fairly robust to
departures from normality, especially when samples
are large. We will therefore address assessment of
the equal variance assumption only.
Assessing Equal Variance



We can explore the validity of the equal variance
assumption using graphical methods such as side-by-side
boxplots, or by using formal statistical tests such as
the F-ratio test, Bartlett’s test, and Levene’s test.
The F-ratio test and Bartlett’s test require the
populations being compared to be Normal.
Levene’s test is much less dependent on conditions
of Normality in the population and hence it is the
most practical test for heteroscedasticity (unequal
variability).
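For readers working outside SPSS, the tests named above can be sketched in Python with SciPy. The three groups below are hypothetical values made up for illustration; they are not data from these slides. The slides do not say which variant of Levene's test SPSS reports, so both the mean-centered and median-centered forms are shown.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustration only).
group_a = [12.1, 14.3, 13.8, 15.0, 12.9]
group_b = [11.5, 16.2, 13.1, 14.8, 12.4]
group_c = [13.0, 13.9, 14.4, 12.7, 15.3]

# center="mean" is the original Levene test. SciPy's default, center="median",
# is the Brown-Forsythe variant, which is more robust for skewed data.
levene_mean = stats.levene(group_a, group_b, group_c, center="mean")
levene_median = stats.levene(group_a, group_b, group_c, center="median")

# Bartlett's test assumes the populations are Normal.
bartlett = stats.bartlett(group_a, group_b, group_c)

print(f"Levene (mean):   W = {levene_mean.statistic:.3f}, p = {levene_mean.pvalue:.3f}")
print(f"Levene (median): W = {levene_median.statistic:.3f}, p = {levene_median.pvalue:.3f}")
print(f"Bartlett:        T = {bartlett.statistic:.3f}, p = {bartlett.pvalue:.3f}")
```

In each case a p-value above the chosen significance level (for example 0.05) means the test found no evidence of unequal variances.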
When ANOVA Assumptions are Violated

When the ANOVA assumptions are violated, one of the
following is recommended:
1. Avoid hypothesis testing entirely and rely on
exploratory and descriptive methods.
2. Mathematically transform the data to meet the distributional
assumptions. Logarithmic and power transformations
are usually used for this purpose.
3. Use a distribution-free (non-parametric) test. The
Kruskal-Wallis test is the non-parametric analogue to
one-way ANOVA. It is usually described as an ANOVA on
rank-transformed data, and its comparisons are commonly
interpreted in terms of medians rather than means.
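As a small illustration of the second option, the sketch below shows how a logarithmic transformation can equalize variances when the spread of a group grows with its mean. The two groups are hypothetical and deliberately constructed so that one is a scaled copy of the other; real data will rarely behave this cleanly.

```python
import numpy as np

# Hypothetical data (illustration only): group_b is group_a scaled by 10,
# so its spread grows in proportion to its mean.
group_a = np.array([1.0, 2.0, 4.0, 8.0])
group_b = 10.0 * group_a

# Ratio of sample variances before and after taking logs.
raw_ratio = np.var(group_b, ddof=1) / np.var(group_a, ddof=1)
log_ratio = np.var(np.log(group_b), ddof=1) / np.var(np.log(group_a), ddof=1)

print(f"variance ratio, raw scale: {raw_ratio:.1f}")
print(f"variance ratio, log scale: {log_ratio:.1f}")
```

On the raw scale the variances differ by a factor of 100, while on the log scale they are equal, because multiplying by a constant only shifts the logged values.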
Parametric ANOVA (One-Way)
• ANOVA is used to test hypotheses about
differences between two or more means.
• The t-test, based on the standard error of the
difference between two means, can only be used
to test differences between two means.
• When there are more than two means, it is
possible to compare each mean with each other
mean using t-tests.
• However, conducting multiple t-tests can
severely inflate the Type I error rate, leading
to a high experiment-wise error rate (EER).
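The inflation described above can be made concrete with a short calculation. This is a sketch under the simplifying assumption that the pairwise tests are independent, which is only approximately true for t-tests that share groups; the function name is chosen here for illustration.

```python
from math import comb

def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests,
    assuming the tests are independent (an approximation)."""
    n_comparisons = comb(n_groups, 2)  # number of distinct pairs of groups
    return 1.0 - (1.0 - alpha) ** n_comparisons

for k in (2, 3, 4, 5):
    print(f"{k} groups, {comb(k, 2)} comparisons: "
          f"error rate = {familywise_error_rate(k):.3f}")
```

With four groups there are six pairwise comparisons, and the chance of at least one false positive rises from 0.05 to about 0.265.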

Parametric ANOVA continued…

Many methods are used to keep the family-wise error
rate in check. These include the least significant difference
(LSD), Duncan, Dunnett, Tukey’s honestly significant
difference (HSD), Bonferroni, and Scheffé methods.
These methods are called post hoc comparisons or post hoc tests.

A demonstration of the LSD method and Bonferroni’s
method will be given next.
Post hoc Comparison



When a significant F-value is obtained in an Analysis of
Variance, the next task is to determine which group means
differ significantly from one another. Post hoc tests are
typically used to do that.
We will use data on skin pigmentation. Skin pigmentation
levels were measured in four families that are
from the same racial group. It is important to note here that
the dependent variable is skin pigmentation and the
independent variable is family (four groups). It is also
important to note how the data are formatted for ANOVA.
Please see the next screenshot from SPSS regarding data
formatting.
Post hoc Comparison Continued…

To run the ANOVA and the subsequent post hoc tests:
• Analyze > Compare Means > One-Way ANOVA, then move skin
pigmentation into Dependent List and family into Factor, and
click OK.
You then get the following ANOVA table in the output
window.
ANOVA
Skin Pigmentation
                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   478.950          3    159.650       12.927   .000
Within Groups    197.600          16   12.350
Total            676.550          19
You observe that the F-test for “Between Groups” is highly statistically significant.
That means at least one family group differs significantly from at least one other
in mean skin pigmentation level. To identify which families differ, we need to
undertake the post hoc comparison.
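The entries of the ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The sketch below reproduces these from the sums of squares reported in the table, using SciPy's F distribution for the p-value. The variable names are chosen here for illustration.

```python
from scipy import stats

# Sums of squares and degrees of freedom taken from the ANOVA table.
ss_between, df_between = 478.950, 3
ss_within, df_within = 197.600, 16

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_stat = ms_between / ms_within        # F statistic

# Upper-tail probability of the F distribution gives the p-value.
p_value = stats.f.sf(f_stat, df_between, df_within)

print(f"MS between = {ms_between:.3f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_stat:.3f}")
print(f"p-value    = {p_value:.5f}")
```

The p-value is well below 0.001, which SPSS displays as ".000" when rounding to three decimals.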
To do so, go back to your data and do the following:
Analyze > Compare Means > One-Way ANOVA, then click on Post Hoc and
select the LSD and the Bonferroni methods. Then click Continue and click OK. You
get something like the following table.
Post Hoc Tests
Multiple Comparisons
Dependent Variable: Skin Pigmentation
                                     Mean Difference                        95% Confidence Interval
Method       (I) Family  (J) Family  (I-J)            Std. Error   Sig.     Lower Bound   Upper Bound
LSD          1           2           -7.400*          2.223        .004     -12.11        -2.69
LSD          1           3           -7.800*          2.223        .003     -12.51        -3.09
LSD          1           4           -13.800*         2.223        .000     -18.51        -9.09
LSD          2           1           7.400*           2.223        .004     2.69          12.11
LSD          2           3           -.400            2.223        .859     -5.11         4.31
LSD          2           4           -6.400*          2.223        .011     -11.11        -1.69
LSD          3           1           7.800*           2.223        .003     3.09          12.51
LSD          3           2           .400             2.223        .859     -4.31         5.11
LSD          3           4           -6.000*          2.223        .016     -10.71        -1.29
LSD          4           1           13.800*          2.223        .000     9.09          18.51
LSD          4           2           6.400*           2.223        .011     1.69          11.11
LSD          4           3           6.000*           2.223        .016     1.29          10.71
Bonferroni   1           2           -7.400*          2.223        .025     -14.09        -.71
Bonferroni   1           3           -7.800*          2.223        .017     -14.49        -1.11
Bonferroni   1           4           -13.800*         2.223        .000     -20.49        -7.11
Bonferroni   2           1           7.400*           2.223        .025     .71           14.09
Bonferroni   2           3           -.400            2.223        1.000    -7.09         6.29
Bonferroni   2           4           -6.400           2.223        .065     -13.09        .29
Bonferroni   3           1           7.800*           2.223        .017     1.11          14.49
Bonferroni   3           2           .400             2.223        1.000    -6.29         7.09
Bonferroni   3           4           -6.000           2.223        .095     -12.69        .69
Bonferroni   4           1           13.800*          2.223        .000     7.11          20.49
Bonferroni   4           2           6.400            2.223        .065     -.29          13.09
Bonferroni   4           3           6.000            2.223        .095     -.69          12.69
*. The mean difference is significant at the 0.05 level.
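The difference between the LSD and Bonferroni rows can be traced back to the ANOVA table. The sketch below assumes a balanced design with five observations per family, which is inferred here from the 20 total observations and the identical standard errors, not stated in the slides. LSD uses an ordinary t-test with the pooled within-group variance, while Bonferroni multiplies each p-value by the number of comparisons (six for four groups) and widens the interval accordingly.

```python
from math import comb, sqrt
from scipy import stats

ms_within, df_within = 12.350, 16   # from the ANOVA table
n_groups = 4
n_per_group = 5                     # assumption: 20 observations split equally

# Standard error of a difference between two group means.
std_error = sqrt(ms_within * (1 / n_per_group + 1 / n_per_group))
n_comparisons = comb(n_groups, 2)   # 6 pairwise comparisons

def compare(mean_diff, alpha=0.05):
    """LSD and Bonferroni p-values and CI half-widths for one comparison."""
    t_stat = abs(mean_diff) / std_error
    p_lsd = 2 * stats.t.sf(t_stat, df_within)
    p_bonf = min(1.0, n_comparisons * p_lsd)
    half_lsd = stats.t.ppf(1 - alpha / 2, df_within) * std_error
    half_bonf = stats.t.ppf(1 - alpha / (2 * n_comparisons), df_within) * std_error
    return {"p_lsd": p_lsd, "p_bonf": p_bonf,
            "half_lsd": half_lsd, "half_bonf": half_bonf}

# Family 2 versus family 4, mean difference taken from the table.
result = compare(-6.400)
print(f"Std. error      = {std_error:.3f}")
print(f"LSD p-value     = {result['p_lsd']:.3f}")
print(f"Bonferroni p    = {result['p_bonf']:.3f}")
print(f"LSD CI          = -6.400 +/- {result['half_lsd']:.2f}")
print(f"Bonferroni CI   = -6.400 +/- {result['half_bonf']:.2f}")
```

This comparison shows why the two methods can disagree: families 2 and 4 differ significantly under LSD but not under the more conservative Bonferroni correction.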
Remarks about parametric One Way ANOVA
It is strongly suggested that exploratory data analysis and checks of the
ANOVA assumptions are made before doing the data analysis.
• A quick way of checking the normality assumption is to use side-by-side
boxplots. To obtain side-by-side boxplots in SPSS, do the following:
Analyze > Descriptive Statistics > Explore, then move skin
pigmentation to Dependent List and Family to Factor List. Click OK.
This will give you a boxplot of pigmentation level by family
group, with the median marked in each box.
Boxplot of Skin pigmentation Levels by Family Group
To check the equal variance assumption, in SPSS we need to do the following:
Analyze > Compare Means > One-Way ANOVA > Options >
Homogeneity of variance test > Continue, and then click OK.
Test of Homogeneity of Variances
Skin Pigmentation
Levene Statistic   df1   df2   Sig.
1.494              3     16    .254
The p-value of the Levene test for homogeneity of variance is greater than 0.05,
and hence the four family groups are not significantly different
in their variability in skin pigmentation levels.
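The "Sig." value in the Levene table can be recovered from the other three entries, because the Levene statistic is referred to an F distribution with df1 and df2 degrees of freedom. A minimal sketch, with a helper name chosen here for illustration:

```python
from scipy import stats

def levene_p_value(statistic: float, df1: int, df2: int) -> float:
    """Upper-tail F probability for a reported Levene statistic."""
    return stats.f.sf(statistic, df1, df2)

# Values from the skin pigmentation table.
p_skin = levene_p_value(1.494, 3, 16)
print(f"Levene p-value = {p_skin:.3f}")
```

This is a check on reported output only; computing the Levene statistic itself requires the raw data, which the slides do not list.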
Since data collection from the four family groups can be done independently, we can
assume that the independence assumption is also fulfilled. This allows us to go
ahead with the routine analysis of variance and, if the result is
statistically significant, to undertake the post hoc comparison as well.
As we will see in the next scenario, it is not unusual to encounter situations in which
the ANOVA distributional assumptions are violated. In those cases, we need to resort
to using the non-parametric ANOVA.
Non-Parametric ANOVA
• Non-parametric tests make fewer assumptions about the distribution of the data
than parametric tests. They use the ranks of the data rather than the raw values to
calculate the test statistic.
• The Kruskal-Wallis test is the non-parametric version of one-way ANOVA. It
compares the medians of two or more samples to determine whether the
samples come from different populations.
• The Kruskal-Wallis test can be viewed as an ANOVA on rank-transformed data,
since the data are converted to ranks before the analysis is carried out.
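The rank-based nature of the Kruskal-Wallis test can be sketched in Python. The two samples below are hypothetical values invented for illustration, not the air quality data from the slides, and they contain no ties so that the textbook formula applies without a tie correction. The statistic computed by hand from the ranks is compared with SciPy's `kruskal` function.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (illustration only, no tied values).
site_1 = [38.2, 41.5, 44.9, 47.3, 52.8, 36.4, 49.1, 43.0]
site_2 = [40.7, 55.6, 33.9, 61.2, 45.8, 58.4, 35.1, 50.3]

# Rank all observations together, ignoring group membership.
values = np.concatenate([site_1, site_2])
ranks = stats.rankdata(values)
n_total = len(values)

# Sum of ranks within each group.
rank_sum_1 = ranks[:len(site_1)].sum()
rank_sum_2 = ranks[len(site_1):].sum()

# Kruskal-Wallis H statistic (valid as written when there are no ties).
h_manual = (12.0 / (n_total * (n_total + 1))
            * (rank_sum_1 ** 2 / len(site_1) + rank_sum_2 ** 2 / len(site_2))
            - 3 * (n_total + 1))

result = stats.kruskal(site_1, site_2)
print(f"H (by hand) = {h_manual:.4f}")
print(f"H (SciPy)   = {result.statistic:.4f}, p = {result.pvalue:.3f}")
```

The two H values agree, which shows that only the ranks, not the raw measurements, enter the test.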
• Data on air quality from two sites will be used to illustrate this method. The
formatted data in SPSS are shown in the screenshot on the next slide.
In this dataset the equal variance assumption is questionable. To check this assumption,
we do the following in SPSS:
Analyze > Compare Means > One-Way ANOVA > Options > Homogeneity of variance test
> Continue, and then click OK.
Test of Homogeneity of Variances
Particulate matter (mcg/m^3)
Levene Statistic   df1   df2   Sig.
4.383              1     14    .055
The test of homogeneity of variances gives a p-value of 0.055, which is only marginally
above 0.05, so the two sites may well differ in the variability of the measured level of
particulate matter. To be cautious, we do not rely on parametric ANOVA to compare the
average level of suspended particulate matter at the two sites. To compare the median
level of suspended particulate matter at the two sites, we run the Kruskal-Wallis test
in SPSS as follows:
Analyze > Non-Parametric Tests > Independent Samples, then click on Fields and
move particulate matter to Test Fields and site to Groups. Make sure that
“Compare medians across groups” is selected from the options listed
under Objective. Finally, click Run.
The p-value for the Kruskal-Wallis test is 0.527, which is greater than 0.05. That
means the two sites are not statistically significantly different in the distribution of
suspended particulate matter.
Final Remarks
1. Analysis of Variance is one of the most commonly used statistical methods for
comparing means among several groups.
2. It is crucial to check the distributional assumptions before using ANOVA.
3. If the distributional assumptions are violated, we either transform the data so as to
meet them or use non-parametric tests.
