
Dr.S.Nishan Silva
(MBBS)
Statistics
The collection, evaluation, and interpretation
of data
Statistics
• Descriptive Statistics – Describe collected data
• Inferential Statistics – Generalize and evaluate a population based on sample data
Graphic Data Representation
Histogram
Frequency distribution graph
Frequency Polygons
Frequency distribution graph
Bar Chart
Categorical data graph
Pie Chart
Categorical data graph %
Levels of Measurement
• Qualitative data
– Nominal Measurement
• Ex – Number codes are only labels for the categories; the number value is not considered
– Ordinal Measurement
• Ex – Number codes where the order (rank) of the values matters
• Quantitative data
– Interval Measurement
• Equal intervals between values, but no absolute zero (e.g., temperature in °C)
– Ratio Measurement
• Equal intervals with an absolute zero, so ratios are meaningful (e.g., weight)
Discussion of
Examples
Example Research
• The effect of food from IIHS canteen on
weight gain
• Population – IIHS (students and staff)
• Further divisions – Students – Nursing and
Physiotherapy
• Data collection – Questionnaire
• Food from home/ outside Vs from canteen
• Weight change over one month
Master Data Sheet
Question   Gender   Job                      Food                   Weight
           M / F    Nurse / Physio / Other   From Canteen / Other   Gain / Lost or Same
Sheet 1
Sheet 2
Sheet 3
Sheet 4
Sheet 5
Sheet 6
Sheet 7
Sheet 8
Master Table
Gender   Job      Food      Weight    Value   %
Male     Nurses   Canteen   Gain
                            No Gain
                  Other     Gain
                            No Gain
         Physio   Canteen   Gain
                            No Gain
                  Other     Gain
                            No Gain
         Other    Canteen   Gain
                            No Gain
                  Other     Gain
                            No Gain
Female   Nurses   Canteen   Gain
                            No Gain
                  Other     Gain
                            No Gain
         Physio   Canteen   Gain
                            No Gain
                  Other     Gain
                            No Gain
         Other    Canteen   Gain
                            No Gain
                  Other     Gain
                            No Gain
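One way the Value and % columns of the master table could be filled in from the data sheets is sketched below. The eight records are made up purely for illustration (they are not results from the study), and the tuple layout is an assumption.

```python
from collections import Counter

# Hypothetical questionnaire sheets (NOT real study data):
# (gender, job, food source, weight change)
sheets = [
    ("M", "Nurse", "Canteen", "Gain"),
    ("M", "Physio", "Other", "No Gain"),
    ("F", "Nurse", "Canteen", "Gain"),
    ("F", "Nurse", "Canteen", "No Gain"),
    ("F", "Physio", "Other", "No Gain"),
    ("M", "Other", "Canteen", "Gain"),
    ("F", "Other", "Other", "Gain"),
    ("M", "Nurse", "Canteen", "No Gain"),
]

# Value = number of sheets in each gender/job/food/weight cell
counts = Counter(sheets)
total = len(sheets)

# % = share of all sheets that fall in that cell
for cell, value in sorted(counts.items()):
    print(*cell, value, f"{100 * value / total:.1f}%")

# Example of a figure a bar chart could use: weight gain among canteen eaters
canteen = [s for s in sheets if s[2] == "Canteen"]
canteen_gain = sum(1 for s in canteen if s[3] == "Gain")
print("Canteen gain:", canteen_gain, "of", len(canteen))
```

The same counts can feed the pie charts and bar charts listed on the next slide.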
Graphs - Draw
• Pie charts
– Weight gain from canteen in males
– Weight gain from home in females
• Bar charts / Graphs
– Weight gain from Canteen
Discussion of
YOUR
Examples
Describing the Data
with Numbers
Measures of Central Tendency
• MEAN -- average
• MEDIAN -- middle value
• MODE -- most frequently observed value(s)
Measures of Central Tendency
Mean $\bar{x}$
Arithmetic average
Sum of all data values divided by the number of data values within the array
$\bar{x} = \frac{\sum x}{n}$
Most frequently used measure of central tendency
Strongly influenced by outliers (very large or very small values)
Measures of Central Tendency
Determine the mean value of
48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
$\bar{x} = \frac{\sum x}{n}$
$\bar{x} = \frac{48 + 63 + 62 + 49 + 58 + 2 + 63 + 5 + 60 + 59 + 55}{11}$
$\bar{x} = \frac{524}{11}$
$\bar{x} = 47.64$
Mean of a Group of Data
Page 78
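The mean calculation can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean

data = [48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55]

total = sum(data)   # sum of all data values
n = len(data)       # number of data values
x_bar = total / n   # arithmetic average

print(total, n, round(x_bar, 2))   # 524 11 47.64

# The standard library gives the same value
print(round(mean(data), 2))        # 47.64
```

Note how the two very small values (2 and 5) pull the mean well below most of the data.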
Measures of Central Tendency
Median
Data value that divides a data array into
two equal groups
Data values must be ordered from lowest
to highest
Useful in situations with skewed data
and outliers (e.g., wealth management)
Measures of Central Tendency
Determine the median value of
48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
Organize the data array from lowest to
highest value.
2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
Select the data value that splits the data set
evenly.
Median = 58
What if the data array had an even number of
values?
5, 48, 49, 55, 58, 59, 60, 62, 63, 63
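A short sketch of the median for both cases, assuming the usual convention that an even-sized array takes the average of its two middle values. The function name `middle` is illustrative.

```python
from statistics import median

odd = [48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55]   # 11 values
even = [5, 48, 49, 55, 58, 59, 60, 62, 63, 63]     # 10 values

def middle(values):
    """Median: order the data, then take the middle value (or the
    average of the two middle values when the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(middle(odd))    # 58
print(middle(even))   # 58.5

# statistics.median follows the same convention
print(median(odd), median(even))   # 58 58.5
```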
Measures of central tendency
Mode
Most frequently occurring response within a
data array
• Usually the highest point of curve
May not be typical
May not exist at all
Mode, bimodal, and multimodal
Measures of Central Tendency
Determine the mode of
48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
Mode = 63
Determine the mode of
48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55
Mode = 63 & 59 Bimodal
Determine the mode of
48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55
Mode = 63, 59, & 48
Multimodal
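The three mode examples can be reproduced with `statistics.multimode` (available in Python 3.8 and later), which returns every value that ties for the highest count. A minimal sketch:

```python
from statistics import multimode

one_mode = [48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55]
two_modes = [48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55]
three_modes = [48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55]

# multimode lists the modes in the order they first appear in the data
print(multimode(one_mode))             # [63]
print(sorted(multimode(two_modes)))    # [59, 63]      -> bimodal
print(sorted(multimode(three_modes)))  # [48, 59, 63]  -> multimodal
```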
Measures of Dispersion
• RANGE
– highest to lowest values
• STANDARD DEVIATION
– how closely do values cluster around the mean value
• SKEWNESS
– refers to symmetry of curve
Range
Calculate by subtracting the lowest value from the highest value.
$R = h - l$
Calculate the range for the data array.
2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
$R = h - l$
$R = 63 - 2$
$R = 61$
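The same range calculation as a minimal Python sketch (variable names are illustrative):

```python
data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]

highest = max(data)             # h
lowest = min(data)              # l
data_range = highest - lowest   # R = h - l

print(highest, lowest, data_range)   # 63 2 61
```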
Standard Deviation
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$
1. Calculate the mean $\bar{x}$.
2. Subtract the mean from each value.
3. Square each difference.
4. Sum all squared differences.
5. Divide the summation by the number of values in the array minus 1.
6. Calculate the square root of the quotient.
Standard Deviation
Calculate the standard deviation for the data array.
2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
1. $\bar{x} = \frac{\sum x}{n} = \frac{524}{11} = 47.64$
2. $(x - \bar{x})$
2 - 47.64 = -45.64      59 - 47.64 = 11.36
5 - 47.64 = -42.64      60 - 47.64 = 12.36
48 - 47.64 = 0.36       62 - 47.64 = 14.36
49 - 47.64 = 1.36       63 - 47.64 = 15.36
55 - 47.64 = 7.36       63 - 47.64 = 15.36
58 - 47.64 = 10.36
3. $(x - \bar{x})^2$
(-45.64)^2 = 2083.01    11.36^2 = 129.05
(-42.64)^2 = 1818.17    12.36^2 = 152.77
0.36^2 = 0.13           14.36^2 = 206.21
1.36^2 = 1.85           15.36^2 = 235.93
7.36^2 = 54.17          15.36^2 = 235.93
10.36^2 = 107.33
4. $\sum (x - \bar{x})^2$
2083.01 + 1818.17 + 0.13 + 1.85 + 54.17 + 107.33
+ 129.05 + 152.77 + 206.21 + 235.93 + 235.93
= 5,024.55
5. $(N - 1)$
11 - 1 = 10
6. $\frac{\sum (x - \bar{x})^2}{N - 1} = \frac{5{,}024.55}{10} = 502.46$
7. $s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} = \sqrt{502.46}$
s = 22.42
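The steps of the standard deviation calculation can be followed in code. This is a sketch with illustrative variable names. It does not round the intermediate values, so step 6 comes out as 502.45 rather than the 502.46 shown on the slide; the final answer is the same to two decimals.

```python
import math
from statistics import stdev

data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]

n = len(data)
x_bar = sum(data) / n                    # 1. mean
deviations = [x - x_bar for x in data]   # 2. subtract the mean from each value
squared = [d ** 2 for d in deviations]   # 3. square each difference
sum_sq = sum(squared)                    # 4. sum the squared differences
variance = sum_sq / (n - 1)              # 5. divide by N - 1
s = math.sqrt(variance)                  # 6. square root of the quotient

print(round(sum_sq, 2))     # 5024.55
print(round(variance, 2))   # 502.45
print(round(s, 2))          # 22.42

# statistics.stdev uses the same N - 1 (sample) formula
print(round(stdev(data), 2))   # 22.42
```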
Variance
$s^2 = \frac{\sum (x - \bar{x})^2}{N - 1}$
Average of the square of the deviations
1. Calculate the mean.
2. Subtract the mean from each value.
3. Square each difference.
4. Sum all squared differences.
5. Divide the summation by the number of values in the array minus 1.
Variance
Calculate the variance for the data array.
2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
$s^2 = \frac{5024.55}{10} = 502.46$
Standard Deviation
[Figure: two distribution curves, Curve A and Curve B]
Skewness
[Figure: Curve A and Curve B with the mean, median, and mode marked; negative skew is labeled]
A Simple Method for estimating standard error
Standard error is the calculated standard deviation divided by the square root of the sample size (the number of observations in the sample)
Standard error of the mean is used to judge the reliability of the sample mean
Example… If there are 10 corn plants with a standard deviation of 0.2
$SE_{\bar{x}} = 0.2 / \sqrt{10} = 0.2 / 3.16 = 0.063$
0.063 represents one standard deviation of the sample means for samples of 10 plants
If there were 100 plants the standard error would drop to 0.02
Why?
Because when we take larger samples, our sample means get closer
to the true mean value of the population. Thus, the distribution of the
sample means would be less spread out and would have a lower
standard deviation.
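A quick sketch of the standard error calculation for the corn plant example, showing how the value shrinks as the sample grows. The function name is illustrative.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return sd / math.sqrt(n)

se_10 = standard_error(0.2, 10)     # 10 corn plants
se_100 = standard_error(0.2, 100)   # 100 corn plants

print(round(se_10, 3))    # 0.063
print(round(se_100, 3))   # 0.02
```

A tenfold increase in sample size cuts the standard error by a factor of sqrt(10), about 3.16, not by a factor of 10.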
Coefficient of Variation
• Percentage CV is –
• $CV\% = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$
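As an illustration, the percentage CV can be computed for the same data array used in the earlier slides. Applying it to that array is an assumption made here for the example; the slide itself gives only the formula.

```python
from statistics import mean, stdev

data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]

sd = stdev(data)               # sample standard deviation (N - 1)
x_bar = mean(data)
cv_percent = sd / x_bar * 100  # CV% = SD / mean x 100

print(round(cv_percent, 1))    # roughly 47 percent
```

A CV this large reflects the two outliers (2 and 5) in the array.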
Discussion of
Examples
Probability
• It is the numerical measure of the
likelihood that a specific event would
occur.
• (Page 92)
• Sum of the probabilities of all possible outcomes = 1 (the event happening plus the event not happening)
• Probability is always between 0 and 1
Probability
• Probability of independent events
– Chance of one single event happening (against not happening), unaffected by the outcome of other events
• Marginal and conditional probabilities
– (Page 92-94)
The Normal Distribution
[Figure: normal curve with the mean, median, and mode at the center]
• Mean = median = mode
• Skew is zero
• 68% of values fall within ±1 SD of the mean
• 95% of values fall within ±2 SDs of the mean
The Normal Curve and Standard Deviation
A normal curve:
Each vertical line is a unit of standard deviation
68% of values fall within +1 or -1 of the mean
95% of values fall within +2 & -2 units
Nearly all members (>99%) fall within 3 std dev units
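The 68% / 95% / >99% figures can be checked numerically. For a normal distribution, the fraction of values within k standard deviations of the mean equals erf(k / sqrt(2)); the sketch below uses only the standard library, and the function name is illustrative.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within +/- k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```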
Example
(Theory)
My weight
day   weight      day   weight      day   weight
1     140         31    143.9       61    144
2     140.1       32    144         62    144.2
3     139.8       33    142.5       63    144.5
4     140.6       34    142.9       64    144.2
5     140         35    142.8       65    143.9
6     139.8       36    143.9       66    144.2
7     139.6       37    144         67    144.5
8     140         38    144.8       68    144.3
9     140.8       39    143.9       69    144.2
10    139.7       40    144.5       70    144.9
11    140.2       41    143.9       71    144
12    141.7       42    144         72    143.8
13    141.9       43    144.2       73    144
14    141.4       44    143.8       74    143.8
15    142.3       45    143.5       75    144
16    142.3       46    143.8       76    144.5
17    141.9       47    143.2       77    143.7
18    142.1       48    143.5       78    143.9
19    142.5       49    143.6       79    144
20    142.3       50    143.4       80    144.2
21    142.1       51    143.9       81    144
22    142.5       52    143.6       82    144.4
23    143.5       53    144         83    143.8
24    143         54    143.8       84    144.1
25    143.2       55    143.6
26    143         56    143.8
27    143.4       57    144
28    143.5       58    144.2
29    142.7       59    144
30    143.7       60    143.9
Plot as a function of time data was acquired:
[Figure: weight (lbs) plotted against Day, days 0 to 60, vertical axis 139 to 146]
Comments:
• background is white (less ink);
• Font size is larger than Excel default (use 14 or 16)
• Do not use curved lines to connect data points
– that assumes you know more about the relationship of the data than you really do
Assume my weight is a single, random, set of similar data
Make a frequency chart (histogram) of the data
[Figure: left, weight (lbs) vs Day; right, histogram of # of Observations vs Weight (lbs)]
Create a “model” of my weight and determine average weight and how consistent my weight is
[Figure: histogram of # of Observations vs Weight (lbs) with a bell-shaped curve drawn over it; the inflection point is marked]
average = 143.11
s = 1.4 lbs
s = standard deviation
= measure of the consistency, or similarity, of weights
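The average and standard deviation quoted for the weight data can be reproduced from the 84 daily values in the table. The sketch below also builds a simple frequency count; the 1 lb bin width is an assumption, since the slide does not state the bins used for its histogram.

```python
import math
from collections import Counter
from statistics import mean, stdev

# Daily weights (lbs), days 1 to 84, as listed in the table
weights = [
    140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7,
    143.9, 144, 142.5, 142.9, 142.8, 143.9, 144, 144.8, 143.9, 144.5,
    143.9, 144, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144, 143.8, 143.6, 143.8, 144, 144.2, 144, 143.9,
    144, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144, 143.8, 144, 143.8, 144, 144.5, 143.7, 143.9, 144, 144.2,
    144, 144.4, 143.8, 144.1,
]

average = mean(weights)
s = stdev(weights)   # sample standard deviation (N - 1)

print(len(weights), round(average, 2), round(s, 1))   # 84 143.11 1.4

# Frequency chart: number of observations in each 1 lb bin (assumed bin width)
histogram = Counter(math.floor(w) for w in weights)
for lower_edge in sorted(histogram):
    print(f"{lower_edge}-{lower_edge + 1} lbs: {'#' * histogram[lower_edge]}")
```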
[Figure: normal curve, Amplitude vs s, horizontal axis from -5 to 5; W1/2 and s are marked on the curve]
Width is measured at inflection point = s
Triangulated peak: Base width is 2s < W < 4s
[Figure: normal curve, Amplitude vs s, horizontal axis from -5 to 5, with shaded areas]
Pp = peak to peak, or largest separation of measurements
pp ~ 6s
Area +/- 1s = 68.3%
Area +/- 2s = 95.4%
Area +/- 3s = 99.73%
Peak to peak is sometimes easier to “see” on the data vs time plot
pp ~ 6s
(Calculated s = 1.4)
[Figure: weight (lbs) vs Day with peak to peak marked from 139.5 to 144.9, beside the histogram of # of Observations vs Weight (lbs)]
s ~ pp/6 = (144.9 - 139.5)/6 ~ 0.9
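The peak-to-peak shortcut is simple enough to write out. The sketch uses the two values read off the plot on the slide (144.9 and 139.5); the rule pp ~ 6s assumes the data roughly span +/- 3s around the mean.

```python
highest = 144.9   # top of the peak-to-peak span, as read from the plot
lowest = 139.5    # bottom of the peak-to-peak span, as read from the plot

pp = highest - lowest   # peak to peak
s_estimate = pp / 6     # pp ~ 6s, so s ~ pp / 6

print(round(pp, 1), round(s_estimate, 1))   # 5.4 0.9
```

The estimate (about 0.9) is only a rough check against the calculated s of 1.4; these weights drift upward over time, so they are not a clean bell-shaped sample.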
Read
Correlation between variables –
Page 99 and beyond
Inferential Statistics
Used to determine the likelihood that a
conclusion based on data from a sample is
true
Terms
p value: the probability that a difference at least as large as the one observed
could have occurred by chance alone (that is, if the null hypothesis were true)
Terms
confidence interval:
The range of values we can be reasonably
certain includes the true value.
The Use of the Null Hypothesis
• Is the difference in two sample populations
due to chance or a real statistical
difference?
• The null hypothesis assumes that there
will be no “difference” or no “change” or no
“effect” of the experimental treatment.
• If treatment A is no better than treatment B
then the null hypothesis is supported.
• If there is a significant difference between
A and B then the null hypothesis is
rejected...
T-test or Chi Square? Testing the
validity of the null hypothesis
• Use the T-test (also called Student’s T-test) if using continuous variables from
normally distributed sample populations
(ex. Height)
• Use the Chi Square ($\chi^2$) if using discrete
variables (if you are evaluating the
differences between experimental data
and expected or hypothetical data)…
Example: genetics experiments, expected
distribution of organisms.
T-test
• T-test determines the probability that the
null hypothesis concerning the means of
two small samples is correct
• The probability that two samples are
representative of a single population
(supporting null hypothesis) OR two
different populations (rejecting null
hypothesis)
Use t-test to determine whether or not samples A and B came
from the same or different populations
$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$
$\bar{x}_1$ = mean of A; $\bar{x}_2$ = mean of B
$s_{\bar{x}_1 - \bar{x}_2}$ = std error of the difference between the means, built from $s_{\bar{x}_1}$ (std error of A) and $s_{\bar{x}_2}$ (std error of B)
Example:
Sample A mean = 8
Sample B mean = 12
Std error of difference of populations = 1
(12 - 8)/1 = 4 std error units
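A sketch of the t calculation for this example. The slide gives the standard error of the difference directly (1). The second part shows one common way to obtain it from the two separate standard errors, assuming independent samples; the values 0.6 and 0.8 are hypothetical and chosen only so that the result is 1.

```python
import math

def t_statistic(mean_a, mean_b, se_difference):
    """Difference between the sample means, in units of the
    standard error of that difference."""
    return (mean_b - mean_a) / se_difference

# Values from the slide
t = t_statistic(mean_a=8, mean_b=12, se_difference=1)
print(t)   # 4.0

# Assumption (not on the slide): for independent samples the standard error
# of the difference is sqrt(se_a^2 + se_b^2). Hypothetical standard errors:
se_a, se_b = 0.6, 0.8
se_difference = math.sqrt(se_a ** 2 + se_b ** 2)
print(round(se_difference, 6))   # 1.0
```

The resulting t is then compared against a t table at the chosen significance level and degrees of freedom.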
The “z” test
- used if your samples are larger than 30
- also used for normally distributed populations with continuous variables
- formula (note: “σ” (sigma) is used instead of the letter “s”):
$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Also note that if you only had the standard deviation you can square that value and
substitute for variance
Example z-test
• You are looking at two methods of learning
geometry proofs, one teacher uses method 1,
the other teacher uses method 2, they use a test
to compare success.
• Teacher 1: has 75 students; mean = 85; stdev = 3
• Teacher 2: has 60 students; mean = 83; stdev = 2
$z = \frac{85 - 83}{\sqrt{\frac{3^2}{75} + \frac{2^2}{60}}} = \frac{2}{0.4321} = 4.629$
Example continued
z = 4.629
Ho = null hypothesis would be Method 1 is not better than method 2
HA = alternative hypothesis would be that Method 1 is better than method 2
This is a one tailed z test (since the alternative hypothesis predicts a difference in one direction only:
Method 1 better than method 2)
For a probability of 0.05 (5% significance or 95% confidence), the chart value is Zα = 1.645
4.629 is greater than 1.645. To keep the null hypothesis (Method 1 not better), z had to be less than 1.645;
it is not, therefore reject the null hypothesis and conclude that Method 1 is better
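The two-teacher example can be run end to end as a sketch. The critical value is taken from `statistics.NormalDist` (Python 3.8 and later) instead of a printed chart; the function and variable names are illustrative.

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z = difference in means / sqrt(variance1/n1 + variance2/n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Teacher 1: 75 students, mean 85, stdev 3
# Teacher 2: 60 students, mean 83, stdev 2
z = two_sample_z(85, 3, 75, 83, 2, 60)

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha)   # one-tailed chart value

reject_null = z > z_critical
print(round(z, 3), round(z_critical, 3), reject_null)   # 4.629 1.645 True
```

For a two-tailed test the cutoff would be `NormalDist().inv_cdf(1 - alpha / 2)`, about 1.96 at the 0.05 level.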
Z table (sample table with 3 probabilities)
α       Zα (one tail)   Zα/2 (two tails)
0.1     1.28            1.64
0.05    1.645           1.96
0.01    2.33            2.576
Chi square
• Used with discrete values
• Phenotypes, choice chambers, etc.
• Not used with continuous variables (like
height… use t-test for samples less than
30 and z-test for samples greater than 30)
• O= observed values
• E= expected values
http://course1.winona.edu/sberg/Equation/chi-squ2.gif
Interpreting a chi square
• Calculate degrees of freedom
• # of events, trials, phenotypes - 1
• Example: 2 phenotypes - 1 = 1
• Generally use the column labeled 0.05 (which
means there is a 95% chance that any
difference between what you expected and what
you observed is within accepted random
chance).
• Any calculated value that is larger than the chart value means you
reject your null hypothesis and there is a
difference between observed and expected values.
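A sketch of a chi square calculation and its interpretation. It uses the standard form of the statistic, the sum of (O - E)^2 / E over all categories; the slide's own formula is only available as the linked image. The observed counts are hypothetical (a made-up genetics cross with an expected 3:1 phenotype ratio), and the critical values are the usual 0.05-column entries of a chi square chart.

```python
# Hypothetical counts (NOT from the slides): 400 offspring, expected 3:1 ratio
observed = [290, 110]   # O = observed values
expected = [300, 100]   # E = expected values

# Chi square statistic: sum of (O - E)^2 / E over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom = number of phenotypes - 1
degrees_of_freedom = len(observed) - 1

# 0.05 column of a standard chi square chart
critical_0_05 = {1: 3.841, 2: 5.991, 3: 7.815}

reject_null = chi_square > critical_0_05[degrees_of_freedom]
print(round(chi_square, 3), degrees_of_freedom, reject_null)   # 1.333 1 False
```

Here the calculated value is smaller than the chart value, so the difference between observed and expected counts is within random chance and the null hypothesis is not rejected.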
How to use a chi square chart
http://faculty.southwest.tn.edu/jiwilliams/probab2.gif