TTests, ANOVA and Rank Based Tests PPT

T tests, ANOVA and Rank Based Tests Using SPSS
Presented by Benedicto Kazuzuru
Presentation outline
1. A very brief Introduction to SPSS/Optional
2. An Overview of t tests, ANOVA and Rank based tests
2.1 One sample t test
2.2 Two samples independent t test
2.3 Paired t test
2.4 One way ANOVA
2.5 Assumptions underlying T-Tests and ANOVA
3 How to run t tests, ANOVA and Rank based tests
3.1 One sample t test
3.2 Two samples independent t tests
3.3 Mann Whitney U Test
3.4 Paired samples t test
3.5 Wilcoxon Signed Rank Test
3.6 One way ANOVA
3.6.1 One way ANOVA with unequal variance
3.7 Kruskal Wallis Test
3.8 One way ANOVA with repeated measurements
3.9 Friedman Test
4 Two way ANOVA/Optional
1. A very brief Introduction of SPSS
• How to start the software
• How to enter the variables and data
• How to import data into SPSS from spreadsheets such as Microsoft Excel
• Examples of data (in the SPSS file “T test, ANOVA and Rank Based Tests”)
2. An Overview of t tests, ANOVA and Rank based tests
▪ We use the t test and ANOVA when comparing population means
▪ For example, one could examine the following:
• Whether the mean weight of a particular population is equal to a specified value
• Whether female students perform better than male students in a particular subject
• Whether fertilizers A, B, C and D lead to different mean maize yields per hectare
2.1 One sample t test
▪ It can be shown that if $x \sim N(\mu, \sigma^2)$, then $\bar{x} \sim N(\mu, \sigma^2/n)$
• Inferences about populations could be made using this theory. Unfortunately, $\sigma^2$ is rarely known in practice
▪ W.S. Gosset (1908) provided a relief by stating the following:
• If $x \sim N(\mu, \sigma^2)$, then $\dfrac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}$
• Therefore the Student t distribution could be used to make inferences on population means with unknown variances, as long as the populations are normally distributed.
• The normal approximation of the Student t distribution could be used, but only for large samples. This makes the t distribution the only option in small samples.
• But even in large samples the problem remains: how large should the sample be?
• In a one sample t test a researcher is interested in whether the population mean of the given items is equal to a specified value, say $C$.
• This could be achieved by finding a confidence interval for $\mu$, given as
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$
where
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
• Alternatively you could test the null hypothesis $\mu = C$ using
$$t = \frac{\bar{x} - C}{s/\sqrt{n}}$$
• Both the confidence interval and the hypothesis test use the Student t distribution, which requires normality of the parent population as a prerequisite
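The one sample formulas can be sketched in a few lines of Python, as a rough illustration outside SPSS. The function names and the weights are hypothetical, and the critical value 2.776 is the tabulated $t_{0.025,\,4}$ for this small made-up sample; it would change with the sample size and confidence level.

```python
import math
import statistics


def one_sample_t(sample, c):
    """Return (t, df) for the null hypothesis that the population mean equals c."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, n - 1 divisor
    t = (mean - c) / (s / math.sqrt(n))
    return t, n - 1


def confidence_interval(sample, t_crit):
    """Return (lower, upper) for the mean, given the critical value t_{alpha/2, n-1}."""
    n = len(sample)
    mean = statistics.mean(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical weights; not data from the SPSS file.
weights = [4.0, 5.0, 6.0, 5.0, 5.0]
t, df = one_sample_t(weights, c=4.0)
# 2.776 is the tabulated t value for alpha/2 = 0.025 with 4 degrees of freedom.
lower, upper = confidence_interval(weights, t_crit=2.776)
print(t, df, lower, upper)
```

If the specified value C falls outside the interval, the two sided test at the same level rejects the null hypothesis, which is why the slides treat the two approaches as interchangeable.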
2.2 Two samples independent t test
▪ In a two samples independent t test a researcher would like to compare the means of two different populations, for example the performance of male and female students in a particular subject
• As an extension of W.S. Gosset's theory it can be shown that
$$\frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} \sim t_{n_1 + n_2 - 2}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
• $s_p^2$ is called the pooled variance. The theory assumes that the two parent populations have the same variance, which is estimated by $s_p^2$
▪ You could use this result to compare the means of two populations by finding a confidence interval for $\mu_1 - \mu_2$ or by testing the hypothesis $\mu_1 - \mu_2 = C$.
• SPSS uses the same result, but first tests the assumption of equal variances and provides results for both cases (equal and unequal variances)
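As a minimal sketch of the pooled variance result, assuming equal variances and testing $\mu_1 - \mu_2 = 0$, the statistic can be computed as follows. The function name and the two small samples are hypothetical and serve only to make the arithmetic visible.

```python
import math
import statistics


def pooled_t(sample1, sample2):
    """Return (t, df, pooled variance) for H0: mu1 - mu2 = 0, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1)  # n - 1 divisor
    v2 = statistics.variance(sample2)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    se = math.sqrt(sp2 / n1 + sp2 / n2)
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / se
    return t, df, sp2


# Hypothetical scores for two independent groups.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 4.0, 6.0, 8.0]
t, df, sp2 = pooled_t(group_a, group_b)
print(t, df, sp2)
```

This corresponds only to the equal variances case; the unequal variances case uses a different standard error and degrees of freedom and is not shown here.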
2.3 Paired samples t-test
▪ In a paired samples t test, you have paired observations on the same individuals. For example:
• Compare students’ performance in chemistry versus physics
• Compare HIV-AIDS patients’ CD4 counts before and after receiving treatment
▪ To carry out the test we remove the dependence by taking the difference within each pair and then applying the formula for the one sample t test to those differences.
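The reduction of the paired test to a one sample test on the differences can be sketched as below. The function name and the before/after values are hypothetical, and the null hypothesis is assumed to be a mean difference of zero.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for H0: mean difference = 0, using the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical measurements on the same four individuals.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 15.0, 10.0, 13.0]
t, df = paired_t(before, after)
print(t, df)
```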
2.4 ANOVA
▪ The word “ANOVA” is an acronym for Analysis of Variance
▪ In ANOVA the focus is on comparing the means of more than two populations
▪ Consider a set of students’ scores from at least 3 different schools.
• One source of variation in students' scores could be differences between schools (SSB) and the other could be the students themselves/chance (SSE)
▪ We find out whether the schools matter through an F test, where $F = \dfrac{SSB/(k-1)}{SSE/(N-k)}$, with $k$ the number of groups and $N$ the total number of observations
• This analysis is referred to as one way analysis of variance.
• The F test requires normality of the data in all the groups as well as equality of variances across the groups
▪ Suppose we also consider parents’ income as a factor; then we would refer to the analysis as two way analysis of variance.
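The decomposition into between-group and within-group variation can be illustrated with a short sketch. The function name and the three groups of scores are hypothetical; the code only computes SSB, SSE and the F ratio and does not look up a p-value.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, SSB, SSE) for a one way layout given as a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group means around the grand mean.
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares: observations around their group mean.
    sse = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    f = (ssb / (k - 1)) / (sse / (n_total - k))
    return f, ssb, sse


# Hypothetical scores from three schools.
schools = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f, ssb, sse = one_way_anova_f(schools)
print(f, ssb, sse)
```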
2.5 Assumptions underlying t tests and ANOVA
▪ From the previous discussions:
1. All the tests (one sample, two samples, paired samples and ANOVA) require the variables to be normally distributed
2. The two samples independent t test and ANOVA require the variables to have equal variances
3. The two samples independent t test and ANOVA require the variables to be independent across the samples
4. All the tests require the samples to be random observations from the populations
▪ Assumptions 1 and 2 could be checked before and after estimation
▪ Assumptions 3 and 4 could be guaranteed at the design stage.
▪ Assumption 1 could imply further requirements (such as no outliers and interval scale measurements)
3.1 One sample t test
▪ Example 1.
An MA Rural student at Sokoine University of Agriculture (SUA) in Tanzania did a study in the Morogoro rural area in 2009 to uncover the role of the Tanzania Social Action Fund (TASAF) in women's economic empowerment. The study was household based, targeting households headed by a woman. To achieve this objective the student intended to compare women's annual incomes between those who were supported by TASAF and those who were not. At the same time the student was wondering whether the rural women are really poor based on their incomes and the World Bank definition of poverty. It was noted in the study that the average family size per household was five members. The World Bank regards a person as poor if he/she lives on less than 1 USD per day.
QS: How do we go about finding out whether those women are really poor?
3.1 One sample t test
▪ We need to test the normality assumption. How?
▪ Go to Analyze-Descriptive Statistics-Explore-enter the variable “income” in the dependent list-Plots-Normality plots with tests-Histogram-Continue-OK
▪ We can clearly see that the data is not normally distributed
▪ Therefore a transformation is needed. How?
▪ Go to Transform-Compute-fill in the target variable, say “newinco”-Functions-Ln(numexpr)-push the function to the top right screen with the title “numerical expression”-then go to the bottom left window and select the variable “income”-push it to the top right screen with the title “numerical expression”-then click OK
▪ A new variable named “newinco” will appear as a variable in the SPSS data
▪ Repeat step two to confirm whether it is now normal
3.1 One sample t test
▪ Clearly the variable is now normally distributed
▪ Go to Analyze-Compare Means-One-Sample T Test-select the variable “newinco”, which will be at the bottom of the list on the left screen-push it to the right screen
▪ Go to Test Value in the smallest screen and type the value to test against, then click OK. Notice that in this case we are using the natural log of income, so our test value is the natural log of (5*365) dollars = 7.509
▪ We can now see that there is no significant difference between the mean of the women's natural log incomes and 7.509, based on both the p-value and the confidence interval
▪ It is worth noting that SPSS only provides a two tailed p-value; for a one tailed test you could halve it, provided the sample mean lies on the side stated in the alternative hypothesis
▪ We can try with 2 dollars per day and see what happens. Natural log of (5*2*365) = 8.202
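The two test values quoted above can be reproduced with a short calculation. This is only a sketch: the function name is hypothetical, and it assumes, as the slides do, a household of five members, 365 days per year and incomes expressed in dollars.

```python
import math


def log_test_value(dollars_per_person_per_day, household_size=5, days=365):
    """Natural log of the annual household income implied by a per-person poverty line."""
    annual_income = dollars_per_person_per_day * household_size * days
    return math.log(annual_income)


one_dollar = round(log_test_value(1), 3)   # natural log of 5*365
two_dollars = round(log_test_value(2), 3)  # natural log of 5*2*365
print(one_dollar, two_dollars)

# The same transformation applied to a few hypothetical incomes,
# mirroring what the Compute step does to create "newinco".
incomes = [900.0, 1500.0, 2400.0]
newinco = [math.log(x) for x in incomes]
```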
3.2 Two samples independent t test
▪ From Example 1, how do we know whether TASAF supported women have higher incomes than non-TASAF women?
▪ We use the two samples independent t test
▪ We need to check the assumptions:
• Normality of the observations
• Homogeneity of variance
▪ Assumption one has already been checked. Assumption two will be checked automatically and results provided for both cases (with equal variances and with unequal variances)
▪ Go to Analyze-Compare Means-Independent-Samples T Test-select the variable “newinco”, which will be at the bottom of the list on the left screen-push it to the right screen-enter the grouping variable in the smallest screen-Define Groups-Continue-OK
▪ We can clearly see that there is a significant difference
3.3 Mann Whitney U Test
▪ In the case just ended we assumed that the data is normally distributed, and we had to transform the data to achieve normality
▪ Sometimes transformation is very hard or impossible
▪ Some types of data, such as counts, are obviously not normally distributed
▪ The alternative test is the Mann Whitney U Test
▪ This test does not depend on any of the stated assumptions except independence between the two samples
▪ It can be applied to both types of data (continuous and non continuous)
▪ Let us try this test with the original income data. How?
3.3 Mann Whitney U Test
▪ For an old SPSS version do the following:
• Go to Analyze-Nonparametric Tests-2 Independent Samples-enter the variable “income” in the right screen with the title “Test Variable List”-enter the grouping variable in the smallest screen-Define Groups-Continue-Options-Descriptive-Quartiles-Continue-OK
▪ For the latest version of SPSS:
• Go to Analyze-Nonparametric Tests-Independent Samples-Objective-Automatically-Fields-enter the variable “income”-enter the grouping variable-Run
▪ Again we see that there is a significant difference
▪ Even though we have used Example 1, the most typical scenario for applying the test is when the data is not measured on an interval scale. Try this with the data on “Package and non package tourist”, which compares length of stay (days) between tourists who are on a package tour and tourists who are not (Exercise 1)
▪ What are the results?
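To see why the Mann Whitney U test is called rank based, the statistic can be computed by hand from the pooled ranks. The sketch below is an illustration under simple assumptions: the helper names and the lengths of stay are hypothetical, ties receive average ranks, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) computed from the rank sum of the first sample."""
    ranks = average_ranks(list(sample1) + list(sample2))
    n1, n2 = len(sample1), len(sample2)
    r1 = sum(ranks[:n1])
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2


# Hypothetical lengths of stay (days) for two groups of tourists.
package = [1, 3, 5]
non_package = [2, 4, 6, 8]
u1, u2 = mann_whitney_u(package, non_package)
print(u1, u2)
```

Only the ordering of the observations enters the calculation, which is why the test can be used when the data are not measured on an interval scale.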
3.4 Paired t test
▪ An NGO in Tanzania known as TUNAJALI operates a clinic to boost HIV patients’ health by providing them with, among other things, drugs and nutritional supplements to improve their CD4 counts. A postgraduate student at SUA was wondering whether by so doing the NGO was also improving the patients’ economic well being. To that effect she took a random sample of 30 HIV-AIDS patients who are peasants in the rural area of Morogoro region, where the clinic also operates, and observed their incomes in Tshs before joining the clinic and two years after joining the clinic for comparison purposes. The data is provided in the SPSS file.
• QS: How would we get to know whether the patients’ incomes differ between the two periods?
3.4 Paired t test
▪ Go to Analyze-Descriptive Statistics-Explore-enter the two variables “bclinic” and “aclinic” in the dependent list-Plots-Normality plots with tests-Histogram-Continue-OK
• We can clearly see that the data is normally distributed (strictly, the paired t test requires the differences between the pairs to be normally distributed)
• Now we can apply the paired t test. How?
▪ Go to Analyze-Compare Means-Paired-Samples T Test-select the variables “bclinic” and “aclinic” simultaneously and push them to the top right screen-OK
• We can clearly see that there is a significant difference based on either the confidence interval or the p-value
3.5 Wilcoxon Signed Rank Test
▪ In the paired t test we assumed the data is normally distributed
▪ As said before, this assumption can hardly be attained in most real data, and transformation may not be feasible
▪ The alternative test is the “Wilcoxon Signed Rank Test”
• Try the test with the clinic data. How?
▪ For the old versions of SPSS:
• Go to Analyze-Nonparametric Tests-2 Related Samples-enter the two variables simultaneously in the right screen with the title “Test Pairs List”-then click “Wilcoxon” in one of the smallest screens below-Options-Descriptive-Quartiles-Continue-OK
▪ For the latest version of SPSS:
• Go to Analyze-Nonparametric Tests-Related Samples-Objective-Automatically-Fields-enter the two variables-Run
▪ We can clearly see that there is a difference
3.5 Wilcoxon Signed Rank Test
▪ Even though we have applied the test in the given example, the most typical situation is when the data is not measured on an interval scale
▪ Let us apply it to the data on the number of eggs laid by chickens before and after being fed a special diet (Exercise 2)
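The signed rank idea can be sketched as follows: take the within-pair differences, drop the zeros, rank the absolute differences and sum the ranks separately for positive and negative differences. The function names and egg counts are hypothetical, ties receive average ranks, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical egg counts for five chickens before and after a special diet.
eggs_before = [10, 12, 9, 11, 8]
eggs_after = [12, 15, 10, 10, 8]
w_plus, w_minus = wilcoxon_signed_rank(eggs_before, eggs_after)
print(w_plus, w_minus)
```

A large imbalance between the two rank sums suggests a systematic shift between the two periods.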
3.6 One way ANOVA
▪ Example 3.
• An MSc student at SUA did research on altitudinal differences in economic well being among the inhabitants surrounding Mount Kilimanjaro (the highest mountain in Africa) in Tanzania. One of the aspects she looked at was to compare households’ home asset values (livestock, houses, bicycles, motorcycles, radio, TV, etc.) across the three altitudes of the mountain (lower, middle, higher). In a pilot study she took random samples of 15 households in each of the three altitudes and recorded their asset values in hundreds of thousands of Tanzanian shillings. The data is given in the SPSS file.
• QS: How do we compare the households’ asset values across the three altitudes?
3.6 One way ANOVA
▪ We need to check the normality assumption. How?
▪ Go to Analyze-Descriptive Statistics-Explore-enter “asset” in the dependent list-enter “altitude” in the factor list-Plots-Normality plots with tests-Histogram-Continue-OK
• The data is normally distributed
▪ Now we need to check the homogeneity of variance. How?
▪ Go to Analyze-Compare Means-One-Way ANOVA-enter “asset” in the dependent list-enter “altitude” in the factor list-Options-Descriptive-Homogeneity of variance test-Brown-Forsythe-Welch-Continue-OK
• Based on the second table of the results (Test of Homogeneity of Variances), it is clear that the groups have the same variance. Based on the third table (ANOVA table), there is a significant difference in asset values across the three altitudes. For the moment you can ignore the fourth table
▪ Now you can do pairwise comparisons. How?
• Go to Analyze-Compare Means-One-Way ANOVA-enter “asset” in the dependent list-enter “altitude” in the factor list-Post Hoc-Tukey (or any other)-Continue-OK
3.6.1 One way ANOVA with unequal variance
▪ Example 4
• An M.A. rural student at Sokoine University of Agriculture intended to find factors influencing the tomato business at various nodes of its value chain. The student had three main nodes of the tomato value chain. The first was the primary node, which involved the peasant producers of tomato; the second node involved the middlemen who buy tomatoes from the peasants and sell them to retailers in town centers; and the third node involved the retailers. Apart from finding factors influencing the tomato business, there was one interesting question: “at which node do the participants acquire the highest profit margin?” The study involved 50 peasants, 20 middlemen and 50 retailers.
• How do we identify the node with the highest profit margin?
▪ Go to Analyze-Descriptive Statistics-Explore-enter “pmargin” in the dependent list-enter “actors” in the factor list-Plots-Normality plots with tests-Histogram-Continue-OK
• Clearly the data is normally distributed.
▪ We need to check the variances. How?
3.6.1 One way ANOVA with unequal variance
▪ We need to check the homogeneity of variance. How?
▪ Go to Analyze-Compare Means-One-Way ANOVA-enter “pmargin” in the dependent list-enter “actors” in the factor list-Options-Descriptive-Homogeneity of variance test-Brown-Forsythe-Welch-Continue-OK
• Based on the second table of the results (Test of Homogeneity of Variances), it is clear that the groups (actors’ profit margins) do not have the same variance. Based on the fourth table of the results (Robust Tests of Equality of Means), there is a significant difference in profit margins across the three nodes. You may now do pairwise comparisons among the actors. How?
▪ Go to Analyze-Compare Means-One-Way ANOVA-enter “pmargin” in the dependent list-enter “actors” in the factor list-Post Hoc-Games-Howell-Continue-OK
▪ We have used the Welch test and the Brown-Forsythe test because the variances were not homogeneous. These two tests provide an adjustment to the original F test. However, there is a non parametric alternative which does not depend on the ANOVA assumptions of normality and homogeneity of variance: the Kruskal Wallis (H-Test)
3.7 Kruskal Wallis (H-Test)
▪ We could use the test on the same data. How?
▪ For the older versions of SPSS:
• Go to Analyze-Nonparametric Tests-K Independent Samples-enter the variable “pmargin” in the right screen with the title “Test Variable List”-tick Kruskal-Wallis H-enter the grouping variable-Define groups-Continue-Options-define range-Quartiles-OK
▪ For the latest version:
• Go to Analyze-Nonparametric Tests-Independent Samples-Objective-Automatically-Fields-enter the variable “pmargin”-enter the grouping variable-Run
▪ You can see from the results that there is a significant difference in profit margins across the three nodes.
▪ However, the most typical situation for applying this test would be when the data is not measured on an interval scale
▪ Try this with the data on students’ grades in three different localities, where the grades were measured as letter grades (A, B, C, D, E, F) and later transformed to a numerical scale through ranks (A=6, B=5, C=4, D=3, E=2, F=1). The aim is to compare performance across the three localities (Exercise 3).
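The Kruskal Wallis H statistic replaces the observations with their pooled ranks and compares the rank sums of the groups. The sketch below uses the textbook form $H = \frac{12}{N(N+1)}\sum_j R_j^2/n_j - 3(N+1)$ without a tie correction, so with heavily tied data such as letter grades it may differ from software output that applies one. The function names and scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return H (no tie correction) for a list of independent groups."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical scores from three localities, chosen without ties.
localities = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(localities)
print(h)
```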
3.8 One way ANOVA with repeated measurements
▪ Example 4
• It is a key requirement for a first year undergraduate student to pass an examination in communication skills (English) at Sokoine University of Agriculture in Tanzania before his/her admission. Normally an English qualifying examination is given to the students upon their arrival, and those who fail to score more than 50% are required to take the subject as part of the core courses in their curriculum for two consecutive semesters. A postgraduate student in Education intended to examine the contribution of the English teaching to improving the students' communication skills. To that effect a sample of 20 first year students was examined by comparing their English scores upon arrival and in the next two semesters. The data are given in the SPSS file.
• QS: How do we assess the contribution of English teaching to students' communication skills?
▪ The repeated nature of the data violates the key assumption of independence. SPSS tests this assumption first and provides alternative estimates in case it is violated. This assumption, together with the assumption of homogeneity of variances, is now referred to as the “sphericity assumption”
▪ How do we proceed?
3.8 One way ANOVA with repeated measurements
I. Go to Analyze-General Linear Model-Repeated Measures-enter the name of your variable in the box labeled “Within-Subject Factor Name”.
II. Now move the cursor down to the box that says "Number of Levels". You need to tell SPSS how many "levels" there are of your repeated-measures variable. In this case we have three different measurements, so type 3 in this box and then click on "Add".
III. Now click on the button labeled "Define". A dialog box will appear with five screens. Push the three variables under comparison one after another from the left screen to the topmost right screen.
IV. Click the button labeled Options-Descriptive-click the variable “test” in the topmost left screen-push it to the adjacent topmost right screen-Compare main effects-choose the confidence interval adjustment-Continue-OK
▪ The fourth table, labeled “Mauchly's Test of Sphericity”, is of key interest as it tests the sphericity assumption. In this case the null hypothesis of sphericity is rejected.
▪ If sphericity is not violated we read the row labeled “Sphericity Assumed” in the table labeled “Tests of Within-Subjects Effects”; otherwise we use the row labeled “Huynh-Feldt”, which shows that there is a significant difference in students’ performance across the three examined tests. The pairwise comparison is also provided.
3.9 Friedman Test
▪ As in all previous cases, there is also an alternative test to one way repeated measurements analysis, called the “Friedman test”
▪ This test does not depend on the sphericity assumption. Try it with this data. How?
▪ For old versions of SPSS:
• Go to Analyze-Nonparametric Tests-K Related Samples-enter the three variables simultaneously in the right screen with the title “Test Variables”-then click “Friedman” in one of the smallest screens below-Statistics-Descriptive-Quartiles-Continue-OK
▪ For the latest version:
• Go to Analyze-Nonparametric Tests-Related Samples-Objective-Automatically-Fields-enter the three variables-Run
▪ We can clearly see that there is a significant difference. You can also make pairwise comparisons of groups by using the Wilcoxon Signed Rank Test in old versions of SPSS, while the latest version does it automatically.
▪ As before, this method is not limited by the assumptions of repeated measurements ANOVA, so it can be applied even when one is dealing with non continuous data.
▪ Try it with the data on HIV-AIDS patients' CD4 counts taken over four successive periods of clinic attendance, as well as the data on students' GPA in four successive semesters of study at Sokoine University of Agriculture (Exercise 4).
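The Friedman statistic ranks the repeated measurements within each subject and then compares the rank sums of the occasions. The sketch below uses the textbook form $\chi^2 = \frac{12}{nk(k+1)}\sum_j R_j^2 - 3n(k+1)$ without a tie correction. The function names and the scores of four students on three occasions are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_chi_square(rows):
    """Return the Friedman statistic; each row holds one subject's k repeated measurements."""
    n = len(rows)
    k = len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        # Ranks are assigned within each subject, not across subjects.
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)


# Hypothetical English scores: on arrival, after semester 1, after semester 2.
scores = [
    [50, 60, 70],
    [40, 55, 65],
    [45, 50, 48],
    [60, 62, 70],
]
chi2 = friedman_chi_square(scores)
print(chi2)
```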
4. Two way Analysis of Variance
▪ Example 5
• In research sponsored by USAID under the IAGRI Project at SUA, an MSc Agricultural Economics student was examining factors influencing maize commercialization by farmers in Kilosa district. Though there are many factors, the student for some reasons intended to examine the influence of a farmer's district of stay and the type of maize variety cultivated on the level of commercialization (= % of sold harvests/total harvests). The data is provided in the SPSS file.
• QS: How do we assess the influence of the two factors (district and maize variety)?
▪ Go to Analyze-General Linear Model-Univariate-enter “commerc” in the dependent variable-enter “variety” and “district” in the fixed factors-click Plots-enter one of the factors in the horizontal axis and the other in the separate lines-click Add-Continue-Options-click Descriptive-OK
Thank you!
• Please do not forget to fill the sign in sheet
and to complete the survey that will be sent
to you by email