Mann Whitney U worked example


Mann Whitney U
For comparison data
Using Mann Whitney U
Non-parametric, i.e. no assumption is made that the data fit a normal distribution
Is used to compare the medians of two sets of data
It measures the overlap between the two data sets
You must have between 6 and 20 replicates of data
The data sets can have unequal numbers of replicates
Example of normally distributed data
[Figure: Frequency of silver birch tree height at Rushey Plain. x-axis: Tree height, in classes from 10 to 14 up to 60 to 64. y-axis: Frequency, 0 to 12.]
When to use Mann-Whitney U-test
[Figure: Frequency of silver birch and beech trees at Broom Hill. Series: Silver Birch, Beech. x-axis: Tree height, in classes from 10 to 14 up to 60 to 64. y-axis: Frequency, 0 to 12.]
Curve not normally distributed, i.e. non-parametric
Compares overlap between two data sets
The Equation
U1 = n1 x n2 + ½ n2 (n2 + 1) - ΣR2
U2 = n1 x n2 + ½ n1 (n1 + 1) - ΣR1
Where:
U1 = Mann-Whitney U for data set 1
n1 = Sample size of data set 1
ΣR1 = Sum of the ranks of data set 1
U2 = Mann-Whitney U for data set 2
n2 = Sample size of data set 2
ΣR2 = Sum of the ranks of data set 2
Method
1. Establish the null hypothesis H0 (this is always the negative form, i.e. there is no significant difference between the two data sets) and the alternative hypothesis (H1).
H0 - There is no significant difference between the variable at Site 1 and Site 2
H1 - There is a significant difference between the variable at Site 1 and Site 2
2. Copy your data into the table below as variable x and variable y and label the data sets
Data Set 1        Rank 1 (R1)    Data Set 2          Rank 2 (R2)
Beech Hill (m)                   Rushey Plain (m)
23                               16
21                               18
23                               19
20                               17
24                               20
25                               21
22
3. Treat both sets of data as one data set and rank them in
increasing order (the lowest data value gets the lowest rank)
Start from the lowest and put the numbers in order:
16, 17, 18, 19, 20, 20, 21, 21, 22, 23, 23, 24, 25
The lowest data value gets a rank of 1
When data values are the same, they must have the same rank.
Take the ranks you would normally assign (5 and 6), add them together (11) and divide the total equally between the tied data values (5.5 each)
The same thing is done for all data values that are the same
Value:  16   17   18   19   20    20    21    21    22   23     23     24   25
Rank:   1    2    3    4    5.5   5.5   7.5   7.5   9    10.5   10.5   12   13
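The ranking rule above can be sketched in Python. This is only an illustration and not part of the original slides: `average_ranks` is a hypothetical helper name, and the code uses nothing beyond the standard library. It sorts the combined data and gives each run of tied values the mean of the ranks they would otherwise occupy.

```python
def average_ranks(values):
    """Map each distinct value to its rank; tied values share the mean rank."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the last position holding the same value as position i.
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[i]:
            j += 1
        # Positions i..j (0-based) would normally get ranks i+1..j+1.
        ranks[ordered[i]] = ((i + 1) + (j + 1)) / 2
        i = j + 1
    return ranks


# Tree heights (m) from the worked example.
beech_hill = [23, 21, 23, 20, 24, 25, 22]
rushey_plain = [16, 18, 19, 17, 20, 21]

ranks = average_ranks(beech_hill + rushey_plain)
for value in sorted(ranks):
    print(value, ranks[value])
```

The two values of 20 share rank 5.5, the two values of 21 share 7.5, and the two values of 23 share 10.5, matching the hand calculation.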
The assigned ranks can then be put into the table
Data Set 1        Rank 1 (R1)    Data Set 2          Rank 2 (R2)
Beech Hill (m)                   Rushey Plain (m)
23                10.5           16                  1
21                7.5            18                  3
23                10.5           19                  4
20                5.5            17                  2
24                12             20                  5.5
25                13             21                  7.5
22                9
4. Sum the ranks for each set of data (ΣR)
ΣR1 = 10.5 + 7.5 + 10.5 + 5.5 + 12 + 13 + 9 = 68
ΣR2 = 1 + 3 + 4 + 2 + 5.5 + 7.5 = 23
5. Count the number of samples in each data set (n)
n1 = 7
n2 = 6
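Steps 4 and 5 can be checked with a few lines of Python. This is a sketch rather than part of the original slides, and the variable names are illustrative. It also applies one extra check that the slides do not mention: the ranks 1 to N always add up to N(N + 1)/2, so the two rank sums together must equal that total.

```python
# Ranks copied from the table in the worked example.
rank_1 = [10.5, 7.5, 10.5, 5.5, 12, 13, 9]   # Beech Hill
rank_2 = [1, 3, 4, 2, 5.5, 7.5]              # Rushey Plain

sum_r1 = sum(rank_1)
sum_r2 = sum(rank_2)
n1 = len(rank_1)
n2 = len(rank_2)

# Sanity check: all ranks together must sum to N(N + 1) / 2.
n_total = n1 + n2
expected_total = n_total * (n_total + 1) / 2

print("Sum R1 =", sum_r1, "n1 =", n1)
print("Sum R2 =", sum_r2, "n2 =", n2)
print("Rank sums agree with N(N+1)/2:", sum_r1 + sum_r2 == expected_total)
```

If the final line prints False, a rank has been mis-assigned or mis-copied somewhere in the table.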
6. Calculate the values for U1 and U2 using the equations
U1 = n1 x n2 + ½ n2 (n2 + 1) - ΣR2
U2 = n1 x n2 + ½ n1 (n1 + 1) - ΣR1
It is a good idea to break each equation down into three bite-size chunks, which then give you a very easy three-term sum
U1 = (7 x 6) + 3(6 + 1) - 23
U1 = (7 x 6) + 3(7) - 23
U1 = 42 + 21 - 23 = 40
U2 = (7 x 6) + 3.5(7 + 1) - 68
U2 = (7 x 6) + 3.5(8) - 68
U2 = 42 + 28 - 68 = 2
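The same arithmetic can be written as a short Python function. This is a sketch using the equations exactly as given in this worked example; `mann_whitney_u` is a hypothetical name, not a library call. The final check relies on a known property of these formulas: U1 and U2 always add up to n1 x n2.

```python
def mann_whitney_u(n1, n2, sum_r1, sum_r2):
    """Return (U1, U2) using the equations from the worked example."""
    u1 = n1 * n2 + n2 * (n2 + 1) / 2 - sum_r2
    u2 = n1 * n2 + n1 * (n1 + 1) / 2 - sum_r1
    return u1, u2


# Sample sizes and rank sums from steps 4 and 5.
u1, u2 = mann_whitney_u(n1=7, n2=6, sum_r1=68, sum_r2=23)
print("U1 =", u1)
print("U2 =", u2)

# Quick check on the arithmetic: U1 + U2 should equal n1 x n2.
print("U1 + U2 == n1 x n2:", u1 + u2 == 7 * 6)
```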
7. Compare the smallest U value against the table of critical values
The smallest U value is U2 = 2
If the smallest U value is less than (or equal to) the critical value then there is a significant difference between the data sets and the null hypothesis can be rejected
Critical Values for the Mann Whitney U test
(at the 0.05 significance level, i.e. 95% confidence: we are 95% confident our result was not due to chance)
We use the values of n1 and n2 to find our critical value
Rows are values of n1, columns are values of n2. A dash marks a combination for which the table gives no value.
      Value of n2
n1 |   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
 1 |   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
 2 |   -   -   -   -   -   -   -   0   0   0   0   1   1   1   1
 3 |   -   -   -   -   0   1   1   2   2   3   3   4   4   5   5
 4 |   -   -   -   0   1   2   3   4   4   5   6   7   8   9  10
 5 |   -   -   0   1   2   3   5   6   7   8   9  11  12  13  14
 6 |   -   -   1   2   3   5   6   8  10  11  13  14  16  17  19
 7 |   -   -   1   3   5   6   8  10  12  14  16  18  20  22  24
 8 |   -   0   2   4   6   8  10  13  15  17  19  22  24  26  29
 9 |   -   0   2   4   7  10  12  15  17  20  23  26  28  31  34
10 |   -   0   3   5   8  11  14  17  20  23  26  29  33  36  39
11 |   -   0   3   6   9  13  16  19  23  26  30  33  37  40  44
12 |   -   1   4   7  11  14  18  22  26  29  33  37  41  45  49
13 |   -   1   4   8  12  16  20  24  28  33  37  41  45  50  54
14 |   -   1   5   9  13  17  22  26  31  36  40  45  50  55  59
15 |   -   1   5  10  14  19  24  29  34  39  44  49  54  59  64
Is there a significant difference?
Is 2 (our smallest U value) smaller or larger than 6 (our critical value from the Mann Whitney table for n1 = 7 and n2 = 6)?
Smaller
If the U value is less than (or equal to) the critical value then there is a significant difference between the data sets and the null hypothesis can be rejected
The smallest U value is less than the critical value; therefore the null hypothesis is rejected
The alternative hypothesis can be accepted: there is a significant difference between the tree heights of Beech Hill and Rushey Plain
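The decision rule can be expressed as a one-line comparison. The sketch below is illustrative only: `is_significant` is a hypothetical helper, and the critical value is assumed to have been read by hand from the table of critical values (6 for n1 = 7 and n2 = 6).

```python
def is_significant(u1, u2, critical_value):
    """True when the smaller U is less than or equal to the critical value."""
    return min(u1, u2) <= critical_value


# U values from the worked example, critical value read from the table.
reject_null = is_significant(u1=40, u2=2, critical_value=6)
print("Reject the null hypothesis:", reject_null)
```

Note that the comparison runs in the opposite direction to many other tests: here a smaller test statistic means a more significant result.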
Use the following data to calculate U values independently
Abundance of Gammarus pulex in pools and riffles of an Exmoor stream
Rank 1 R1
Velocity cm.s-1
Pools
12
12
5
6
14
9
20
8
Velocity cm.s-1
Riffles
54
55
61
56
31
47
68
54
27
39
43
2
0
1
3
9
85
16
80
18
3
5
63
150
Rank 2 R2
Rank 1 R1
Abundance of
Gammarus pulex
Pools
Abundance of
Gammarus pulex
Riffles
Rank 2 R2
Key questions
Is there a significant relationship?
Which data value/s would you consider to be anomalous and
why?
What graph would you use to present this data?