Introduction to Inferential Statistics
Inferential Statistics
Standard Error of the Mean
Significance
Inferential tests you can use
1
Note the symbols in this formula:

$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\left[\dfrac{\left(\sum X_A^2 - \dfrac{(\sum X_A)^2}{n_1}\right) + \left(\sum X_B^2 - \dfrac{(\sum X_B)^2}{n_2}\right)}{(n_1 - 1) + (n_2 - 1)}\right]\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The numerator, $\bar{X}_A - \bar{X}_B$, is the difference between the means.

Don't Panic!
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\left[\dfrac{\left(\sum X_A^2 - \dfrac{(\sum X_A)^2}{n_1}\right) + \left(\sum X_B^2 - \dfrac{(\sum X_B)^2}{n_2}\right)}{(n_1 - 1) + (n_2 - 1)}\right]\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Compare the bracketed term with the SD formula.
3
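For anyone who wants to see the formula in action, here is a short Python sketch (not part of the original slides) that applies the raw-score version above to two made-up samples; the sample data and the function name are purely illustrative.

```python
import math

def pooled_t(xs_a, xs_b):
    """Independent-samples t using the raw-score (pooled variance) formula above."""
    n1, n2 = len(xs_a), len(xs_b)
    sum_a, sum_b = sum(xs_a), sum(xs_b)
    sumsq_a = sum(x * x for x in xs_a)
    sumsq_b = sum(x * x for x in xs_b)
    mean_a, mean_b = sum_a / n1, sum_b / n2
    # Sums of squared deviations for each group (compare with the SD formula)
    ss_a = sumsq_a - sum_a ** 2 / n1
    ss_b = sumsq_b - sum_b ** 2 / n2
    pooled_var = (ss_a + ss_b) / ((n1 - 1) + (n2 - 1))
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_a - mean_b) / se_diff

# Hypothetical error counts for two groups of users
group_a = [8, 6, 7, 9, 5, 8]
group_b = [10, 12, 9, 11, 13, 10]
print(pooled_t(group_a, group_b))   # negative t: group A made fewer errors
```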
Basic types of statistical treatment
o Descriptive statistics which summarize
the characteristics of a sample of data
o Inferential statistics which attempt to say
something about a population on the basis
of a sample of data - infer to all on the
basis of some
Statistical tests are inferential
4
Two kinds of descriptive statistic:
o Measures of central tendency: whereabouts on the measurement scale most of the data fall
– mean
– median
– mode
o Measures of dispersion (variation): how spread out the data are
– range
– inter-quartile range
– variance/standard deviation
5
The different measures have different sensitivity and
should be used at the appropriate times…
Mean
Sum of all observations divided by the number of observations.
In notation:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

(Refer to the handout on notation.)
The mean uses every item of data but is sensitive to extreme 'outliers'.
See example on the next slide.
6
To overcome problems with range etc.
we need a better measure of spread
Variance and standard deviation
 A deviation is a measure of how far a score in our data is from the mean
 Sample: 6,4,7,5
mean =5.5
 Each score can be expressed in terms of distance from 5.5
 6,4,7,5, => 0.5, -1.5, 1.5, -0.5 (these are distances from
mean)
 Since these are measures of distance, some are positive (greater
than mean) and some are negative (less than the mean)
 TIP: Sum of these distances ALWAYS = 0
7
Symbol check
• $\bar{x}$ : called 'x bar'; refers to the 'mean'
• $(x - \bar{x})$ : called 'x minus x-bar'; implies subtracting the mean from a data point x; also known as a deviation from the mean
8
Two ways to get SD

$$sd = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
• Sum the squared deviations from the mean
• Divide by the number of observations
• Take the square root of the result

$$sd = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$
• Sum the squared raw scores
• Divide by N
• Subtract the squared mean
• Take the square root of the result
9
Worked example:

x : 2, 2, 2, 2, 2, 3, 3, 4, 4, 5        Σx = 29
x²: 4, 4, 4, 4, 4, 9, 9, 16, 16, 25     Σx² = 95

$$s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{95}{10} - 2.9^2} = \sqrt{9.5 - 8.41} = \sqrt{1.09} = 1.044$$
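As a quick check (a sketch added here, not from the slides), both routes to the SD can be computed in Python on the data above, using only the standard library:

```python
import math

data = [2, 2, 2, 2, 2, 3, 3, 4, 4, 5]
n = len(data)
mean = sum(data) / n

# Method 1: sum of squared deviations from the mean, divided by n
sd1 = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# Method 2: mean of the squared raw scores minus the squared mean
sd2 = math.sqrt(sum(x * x for x in data) / n - mean ** 2)

print(mean, sd1, sd2)   # 2.9, 1.044..., 1.044... both routes agree
```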
If we recalculate the variance with 60 instead of the 5 in the data, i.e., if we include a large outlier:

x : 2, 2, 2, 2, 2, 3, 3, 4, 4, 60          Σx = 84
x²: 4, 4, 4, 4, 4, 9, 9, 16, 16, 3600      Σx² = 3670

$$s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{3670}{10} - 8.4^2} = \sqrt{367 - 70.56} = \sqrt{296.44} = 17.22$$

Like the mean, the standard deviation uses every piece of data and is therefore sensitive to extreme values. Note the increase in SD.
Mean
Two sets of data can have the same mean but different standard
deviations.
The bigger the SD, the more s-p-r-e-a-d out are the data.
On the use of N or N-1

$$sd = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
• When your observations are the complete set of people that could be measured (parameter)

$$sd = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
• When you are observing only a sample of potential users (statistic); the use of N-1 increases the size of the sd slightly
13
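A short illustrative sketch (added here), assuming NumPy is available: the `ddof` argument selects between dividing by N and by N-1.

```python
import numpy as np

data = np.array([6, 4, 7, 5], dtype=float)

# ddof=0: divide by N, treating the data as the whole population (a parameter)
sd_population = data.std(ddof=0)

# ddof=1: divide by N-1, treating the data as a sample (a statistic); slightly larger
sd_sample = data.std(ddof=1)

print(sd_population, sd_sample)   # 1.118..., 1.290...
```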
Summary

Measures of Central Tendency
• Mode: most frequent observation. Use with nominal data.
• Median: 'middle' of data. Use with ordinal data or when data contain outliers.
• Mean: 'average'. Use with interval and ratio data if no outliers.

Measures of Dispersion
• Range: dependent on two extreme values.
• Interquartile Range: more useful than range. Often used with median.
• Variance / Standard Deviation: same conditions as mean. With mean, provides an excellent summary of data.
Deviation units: Z scores

Any data point can be expressed in terms of its distance from the mean in SD units:

$$z = \frac{x - \bar{x}}{sd}$$

A positive z score implies a value above the mean.
A negative z score implies a value below the mean.
15
Interpreting Z scores
• Mean = 70, SD = 6
• Then a score of 82 is 2 sd above the mean [(82 - 70)/6], or 82 = a Z score of 2
• Similarly, a score of 64 = a Z score of -1
• By using Z scores, we can standardize a set of scores to a scale that is more intuitive
• Many IQ tests and aptitude tests do this, setting a mean of 100 and an SD of 10, etc.
16
Comparing data with Z scores
You score 49 in class A but 58 in class B
How can you compare your performance in both?
Class A:
Mean =45
SD=4
49 is a Z=1.0
17
Class B:
Mean =55
SD = 6
58 is a Z=0.5
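A tiny Python illustration of this comparison (added here; the numbers are the ones from the slide):

```python
def z_score(x, mean, sd):
    """Distance of a score from the mean, in SD units."""
    return (x - mean) / sd

print(z_score(49, 45, 4))   # Class A: z = 1.0
print(z_score(58, 55, 6))   # Class B: z = 0.5, so the class A score is relatively better
```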
With normal distributions, the mean, SD and Z tables in combination provide a powerful means of estimating what your data indicate.
18
Graphing data - the histogram

The y-axis shows the frequency of occurrence for the measure of interest, e.g., errors, time, scores on a test, etc. The x-axis shows the categories of data we are studying, e.g., task or interface, or user group, etc.

A graph gives an instant summary of the data: check spread, similarity, outliers, etc.

[Figure: histogram with 'Number of errors' (0-100) on the y-axis and ten categories (1-10) on the x-axis]
19
Very large data sets tend to have a distinct shape:

[Figure: histogram of a very large data set, frequencies from 0 to 80]
20
Normal distribution
 Bell shaped, symmetrical, measures of central tendency
converge
 mean, median, mode are equal in normal distribution
 Mean lies at the peak of the curve
 Many events in nature follow this curve
 IQ test scores, height, tosses of a fair coin, user performance in
tests,
21
The Normal Curve

[Figure: bell-shaped frequency (f) curve. NB the position of the measures of central tendency: Mean, Median and Mode coincide at the peak; 50% of scores fall below the mean.]
22
Positively skewed distribution

Note how the various measures of central tendency now separate, and note the direction of the change: the mode moves to the left of the other two, and the mean stays highest, indicating a greater frequency of scores less than the mean.

[Figure: positively skewed frequency curve; from left to right: Mode, Median, Mean]
23
Negatively skewed distribution

Here the tendency for higher values to be more common serves to increase the value of the mode.

[Figure: negatively skewed frequency curve; from left to right: Mean, Median, Mode]
24
Other distributions
 Bimodal
 Data shows 2 peaks separated by trough
 Multimodal
 More than 2 peaks
 The shape of the underlying distribution determines your choice
of inferential test
25
Bimodal

Will occur in situations where there might be distinct groups being tested, e.g., novices and experts.

Note how each mode is itself part of a normal distribution (more later).

[Figure: bimodal frequency curve with two peaks (modes); the Mean and Median lie between them]
26
Standard deviations and the normal curve

68% of observations fall within ± 1 s.d.
95% of observations fall within ± 2 s.d. (approx)

[Figure: normal frequency curve marked off in 1 s.d. units either side of the mean]
27
Z scores and tables
Knowing a Z score allows you to determine
where under the normal distribution it occurs
Z score between:
0 and 1 = 34% of observations
1 and -1 = 68% of observations etc.
Or 16% of scores are >1 Z score above mean
Check out Z tables in any basic stats book
28
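If you prefer software to printed Z tables, the same proportions can be read from the normal distribution in Python; this sketch (added here) assumes SciPy is installed:

```python
from scipy.stats import norm

# Proportion of observations between z = 0 and z = 1
print(norm.cdf(1) - norm.cdf(0))    # ~0.3413, about 34%

# Proportion between z = -1 and z = +1
print(norm.cdf(1) - norm.cdf(-1))   # ~0.6827, about 68%

# Proportion of scores more than 1 SD above the mean
print(1 - norm.cdf(1))              # ~0.1587, about 16%
```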
Remember:
 A Z score reflects position in a normal distribution
 The Normal Distribution has been plotted out such that we
know what proportion of the distribution occurs above or
below any point
29
Importance of distribution
 Given the mean, the standard deviation, and some reasonable
expectation of normal distribution, we can establish the
confidence level of our findings
 With a distribution, we can go beyond descriptive statistics
to inferential statistics (tests of significance)
30
So - for your research:
 Always summarize the data by graphing it - look for general
pattern of distribution
 Then, determine the mean, median, mode and standard
deviation
 From these we know a LOT about what we have observed
31
Inference is built on Probability
 Inferential statistics rely on the laws of probability to
determine the ‘significance’ of the data we observe.
 Statistical significance is NOT the same as practical
significance
 In statistics, we generally consider ‘significant’ those
differences that occur less than 1:20 by chance alone
32
Calculating probability

(At this point I ask people to take out a coin and toss it 10 times, noting the exact sequence of outcomes, e.g., h,h,t,h,t,t,h,t,t,h. Then I have people compare outcomes….)
 Probability refers to the likelihood of any given event
occurring out of all possible events e.g.:
 Tossing a coin - outcome is either head or tail
 Therefore probability of head is 1/2
 Probability of two heads on two tosses is 1/4 since the other possible
outcomes are two tails, and two possible sequences of head and tail.
 The probability of any event is expressed as a value between 0
(no chance) and 1 (certain)
33
Sampling distribution for 3 coin tosses

[Figure: bar chart of the number of ways each outcome can occur]
• 0 heads: 1 way
• 1 head: 3 ways
• 2 heads: 3 ways
• 3 heads: 1 way
34
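That distribution can be checked by brute force; a small Python sketch (added for illustration) enumerates all eight equally likely sequences:

```python
from itertools import product
from collections import Counter

# Enumerate all 2**3 = 8 equally likely sequences of 3 tosses and count the heads
counts = Counter(seq.count('H') for seq in product('HT', repeat=3))
print(sorted(counts.items()))   # [(0, 1), (1, 3), (2, 3), (3, 1)]
```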
Probability and normal curves
• Q? When is the probability of getting 10 heads in 10 coin tosses the same as getting 6 heads and 4 tails?
  • HHHHHHHHHH
  • HHTHTHHTHT
• Answer: when you specify the precise order of the 6H/4T sequence:
  • (1/2)^10 = 1/1024 (specific order)
  • But to get 6 heads in any order it is: 210/1024 (or about 1:5)
35
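The same arithmetic in Python (an added sketch), using only the standard library; `math.comb` needs Python 3.8 or later:

```python
from math import comb

p_specific_order = (1 / 2) ** 10                 # any one exact sequence of 10 tosses
p_six_heads_any_order = comb(10, 6) / 2 ** 10    # 210 orderings give exactly 6 heads

print(p_specific_order)        # 1/1024, about 0.001
print(p_six_heads_any_order)   # 210/1024, about 0.205, roughly 1 in 5
```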
What use is probability to us?
 It tells us how likely any given event is to occur by chance
 This enables us to determine if the behavior of our users in a
test is just chance or is being affected by our interfaces
36
Determining probability
 Your statistical test result is plotted against the distribution of
all scores on such a test.
 It can be looked up in stats tables or is calculated for you in
EXCEL or SPSS etc
 This tells you its probability of occurrence
 The distributions have been determined by statisticians.
(Introduce simple stats tables here.)
37
What is a significance level?
 In research, we estimate the probability level of finding what
we found by chance alone.
 Convention dictates that this level is 1:20 or a probability of
.05, usually expressed as : p<.05.
 However, this level is negotiable
 But the higher it is (e.g., p<.30 etc) the more likely you are to
claim a difference that is really just occurring by chance (known
as a Type 1 error)
38
What levels might we choose?
 In research there are two types of errors we can make when
considering probability:
 Claiming a significant difference when there is none (type 1
error)
 Failing to claim a difference where there is one (type 2 error)
 The p<.05 convention is the ‘balanced’ case but tends to
minimize type 1 errors
39
Using other levels
 Type 1 and 2 errors are interwoven: if we lessen the probability of one occurring, we increase the chance of the other.
 If we think that we really want to find any differences that
exist, we might accept a probability level of .10 or higher
40
Thinking about p levels
 The p<.x level means we accept that our results could occur by chance alone (not because of our manipulation) up to x times in 100
 P<.10 => our results should occur by chance 1 in 10 times
 P<.20=> our results should occur by chance 2 in 10 times
 Depending on your context, you can take your chances :)
 In research, the consensus is 1:20 is high enough…..
41
Putting probability to work
 Understanding the probability of gaining the data you have
can guide your decisions
 Determine how precise you need to be IN ADVANCE, not
after you see the result
 It is like making a bet….you cannot play the odds after the
event!
42
I find that this is the hardest part of stats for
novices to grasp, since it is the bridge
between descriptive and inferential
stats…..needs to be explained slowly!!
Sampling error and the mean
 Usually, our data forms only a small part of all the
possible data we could collect
 All possible users do not participate in a usability test
 Every possible respondent did not answer our questions
 The mean we observe therefore is unlikely to be the
exact mean for the whole population
 The scores of our users in a test are not going to be an exact
index of how all users would perform
43
How can we relate our sample to
everyone else?
 Central limit theorem
 If we repeatedly sample and calculate means from a population,
our list of means will itself be normally distributed
 Holds true even for samples taken from a skewed population
distribution
 This implies that our observed mean follows the same
rules as all data under the normal curve
44
The distribution of the means forms a smaller normal distribution about the true mean:

[Figure: three histograms of sample means (scale 0-20) for samples of size n = 2, n = 5 and n = 15 drawn from the same population. In each case the mean of the sample means = 10; the SD of the sample means = 4.16, 2.41 and 0.87 respectively.]
45
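A rough simulation of this idea in Python (an added sketch, assuming NumPy; the skewed population here is hypothetical, not the data behind the slide's figure):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=100_000)   # a skewed population

for n in (2, 5, 15):
    # Draw many samples of size n and record each sample's mean
    samples = population[rng.integers(0, population.size, size=(10_000, n))]
    means = samples.mean(axis=1)
    print(n, round(means.mean(), 2), round(means.std(), 2))

# The mean of the sample means stays near the population mean,
# while the SD of the sample means shrinks roughly as sigma / sqrt(n).
```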
True for skewed distributions too

[Figure: a skewed population distribution (as on the earlier slide) with the narrower plot of means from samples centered on the mean]
47
How means behave..
 A mean of any sample belongs to a normal distribution of
possible means of samples
 Any normal distribution behaves lawfully
 If we calculate the SD of all these means, we can determine
what proportion (%) of means fall within specific distances of
the ‘true’ or population mean
48
But...
 We only have a sample, not the population…
 We use an estimate of this SD of means known as the
Standard Error of the Mean
$$SE = \frac{SD}{\sqrt{N}}$$
49
Implications
 Given a sample of data, we can estimate how confident we
are in it being a true reflection of the ‘world’ or…
 If we test 10 users on an interface or service, we can estimate
how much variability about our mean score we will find
within the intended full population of users
50
Example
 We test 20 users on a new iPad:
 Mean error score: 10, sd: 4
 What can we infer about the broader user population?
 According to the central limit theorem, our observed mean (10 errors) is 95% likely to lie within 2 standard errors (2 s.d. of the distribution of sample means) of the 'true' (but unknown to us) mean of the population
51
The Standard Error of the Mean

$$SE = \frac{s.d.(\text{sample})}{\sqrt{N}} = \frac{4}{\sqrt{20}} = \frac{4}{4.47} = 0.89$$
52
If the standard error of the mean = 0.89, then the observed (sample) mean lies within a normal distribution about the 'true' or population mean. So we can be:
• 68% confident that the true mean = 10 ± 0.89
• 95% confident that the population mean = 10 ± 1.78
• 99% confident it is within 10 ± 2.67
This offers a strong method of interpreting our data.
53
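A short Python sketch of that calculation (added here), using the 1/2/3-standard-error approximation the slide uses for the 68/95/99% levels:

```python
import math

mean, sd, n = 10, 4, 20            # iPad test: 20 users, mean of 10 errors, sd of 4
se = sd / math.sqrt(n)             # standard error of the mean, ~0.89

# 1, 2 and 3 standard errors correspond roughly to 68%, 95% and 99% confidence
for k, level in [(1, 68), (2, 95), (3, 99)]:
    print(f"{level}% CI: {mean - k * se:.2f} to {mean + k * se:.2f}")
```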
Issues to note
 If s.d. is large and/or sample size is small, the estimated
deviation of the population means will appear large.
 e.g., in the last example, if n = 9, SE of the mean = 1.33
 So the confidence interval becomes 10 ± 2.66 (i.e., we are now 95% confident that the true mean is somewhere between 7.34 and 12.66)
 Hence confidence improves as sample increases and variability
lessens
 Or in other words: the more users you study, the more sure you can
be….!
54
Exercise:
• If the mean = 10 and the s.d. = 4, what is the 68% confidence interval when we have:
  • 16 users?
  • 9 users?
• If the s.d. = 12, and the mean is still 10, what is the 95% confidence interval for those N?
Answers: 9-11; 8.66-11.33; 4-16; 2-18
55
Exercise answers:
 If the mean = 10 and the s.d.=4, what is the 68% confidence
interval when we have:
16 users? = 9-11 (hint: SE = sd/√n = 4/√16 = 1)
9 users? = 8.66-11.33
 If the s.d. = 12, and mean is still 10, what is the 95% confidence
interval for those N?
16 users: 4-16 (hint: 95% CI implies 2 SE either side of mean)
9 users: 2-18
56
Recap
 Summarizing data effectively informs us of central tendencies
 We can estimate how our data deviates from the population
we are trying to estimate
 We can establish confidence intervals to enable us to make
reliable ‘bets’ on the effects of our designs on users
57
This is the
beginning of
significance
testing
Comparing 2 means
 The differences between means of samples drawn from the
same population are also normally distributed
 Thus, if we compare means from two samples, we can
estimate if they belong to the same parent population
58
SE of difference between means
 [x 1x 2]   x 1 x 2
2
2
SEdiff.m eans SE(sample1)  SE(sample2)
2
2
This lets us set up confidence limits for the differences
between the two means
59
Regardless of population mean:
 The difference between 2 true measures of the mean of a
population is 0
 The differences between pairs of sample means from this
population is normally distributed about 0
60
Consider two interfaces:
We capture 10 users’ times per task on
each.
The results are:
Interface A = mean 8, sd =3
Interface B = mean 10, sd=3.5
Q? - is Interface A really different?
61
How do we tackle this question?

Calculate the SE of the difference between the means:
SEa = 3/√10 = 0.95
SEb = 3.5/√10 = 1.11
SEa-b = √(0.95² + 1.11²) = √(0.90 + 1.23) = 1.46

Observed difference between means = 2.0
The 95% confidence interval of the difference between means is 2 × 1.46 = 2.92 (i.e., we expect to find a difference between 0 and 2.92 by chance alone). This suggests there is no significant difference at the p<.05 level.
62
But what else?
We can calculate the exact probability of finding this
difference by chance:
Divide observed difference between the means by the SE(diff
between means): 2.0/1.46 = 1.37
Gives us the number of standard deviation units between two
means (Z scores)
Check Z table: 82% of observations are within 1.37 sd, 18% are
greater; thus the precise sig level of our findings is p<.18.
Thus - Interface A is different, with rough odds of 5:1
63
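The same working in Python (an added sketch; the exact probability step assumes SciPy for the normal CDF):

```python
import math
from scipy.stats import norm

mean_a, sd_a, n_a = 8, 3.0, 10
mean_b, sd_b, n_b = 10, 3.5, 10

se_a = sd_a / math.sqrt(n_a)                # ~0.95
se_b = sd_b / math.sqrt(n_b)                # ~1.11
se_diff = math.sqrt(se_a**2 + se_b**2)      # ~1.46

z = (mean_b - mean_a) / se_diff             # ~1.37 SE units between the means
p_two_tailed = 2 * (1 - norm.cdf(z))        # ~0.17, i.e., roughly p < .18
print(z, p_two_tailed)
```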
Hold it!
 Didn’t we first conclude there was no significant difference?
 Yes, no significant difference at p<.05
 But the probability of getting the differences we observed by
chance was approximately 0.18
 Not good enough for science (must avoid type 1 error), but very useful
for making a judgment on design
 But you MUST specify levels you will accept BEFORE not after….
 Note - for small samples (n<20) the t-distribution is better than the z distribution when looking up probability
64
Why t?
 Similar to the normal distribution
 t distribution is flatter than Z for small degrees of
freedom (n-1), but virtually identical to Z when N>30
 Exact shape of t-distribution depends on sample size
65
Simple t-test:
 You want all users of a new interface to score at least 70% on an effectiveness test. You test 6 users on a new interface and gain the following scores:
62
92
75
68
83
95
66
Mean = 79.17
Sd=13.17
T-test:

$$t = \frac{79.17 - 70}{13.17 / \sqrt{6}} = \frac{9.17}{5.38} = 1.71$$

From t-tables, we can see that this value of t exceeds the critical t value (with 5 d.f.) for the p<.10 level.
So we are confident at the 90% level that our new interface leads to improvement.
67
T-test:

$$t = \frac{79.17 - 70}{13.17 / \sqrt{6}} = \frac{9.17}{5.38} = 1.71$$

where 79.17 is the sample mean and 13.17/√6 = 5.38 is the SE of the mean.

Thus we can still talk in confidence intervals, e.g., we are 68% confident that the mean of the population = 79.17 ± 5.38.
68
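For comparison, the whole test can be run in one call; this sketch (added here) assumes a reasonably recent SciPy, where `ttest_1samp` accepts `alternative='greater'` for a one-sided test:

```python
from scipy.stats import ttest_1samp

scores = [62, 92, 75, 68, 83, 95]

# One-sample t-test against the 70% target, one-sided (better than 70)
result = ttest_1samp(scores, popmean=70, alternative='greater')
print(result.statistic)   # ~1.71, with 5 degrees of freedom
print(result.pvalue)      # ~0.07, i.e., significant at the p < .10 level
```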
Predicting the direction of the
difference
 Since you stated that you wanted to see if new Interface was
BETTER (>70), not just DIFFERENT (< or > 70%), this is
asking for a one-sided test….
 For a two-sided test, I just want to see if there is ANY
difference (better or worse) between A and B.
69
One tail (directional) test
 Tester narrows the odds by half by testing for a specific
difference
 One sided predictions specify which part of the normal
curve the difference observed must reside in (left or
right)
 Testing for ANY difference is known as ‘two-tail’ testing,
 Testing for a directional difference (A>B) is known as
‘one-tail’ testing
70
So to recap
 If you are interested only in certain differences, you are being
‘directional’ or ‘one-sided’
 Under the normal curve, random or chance differences occur
equally on both sides
 You MUST state directional expectations (hypothesis) in
advance
71
Why would you predict the direction?
 Theoretical grounds
 Experience or previous findings suggested the difference
 Practical grounds
 You redesigned the interface to make it better, so you EXPECT
users will perform better….
72
Alternative and Null Hypotheses
 Inferential statistics test the likelihood that the alternative
(research) hypothesis (H1) is true and the null hypothesis
(H0) is not
 in testing differences, the H1 would predict that differences
would be found, while the H0 would predict no differences
 by setting the significance level (generally at .05), the
researcher has a criterion for making this decision
Alternative and Null Hypotheses
 If the .05 level is achieved (p is equal to or less than .05),
then a researcher rejects the H0 and accepts the H1
 If the .05 significance level is not achieved, then the H0 is retained
Degrees of Freedom
 Degrees of freedom (df) are the way in which the
scientific tradition accounts for variation due to error
 it specifies how many values vary within a statistical test
 scientists recognize that collecting data can never be error-free
 each piece of data collected can vary, or carry error that we
cannot account for
 by including df in statistical computations, scientists help
account for this error
 there are clear rules for how to calculate df for each statistical
test
Inferential Statistics: 5 Steps
 To determine if SAMPLE means come from same population,
use 5 steps with inferential statistics
1. State Hypothesis
 Ho: no difference between 2 means; any difference found is due to sampling error
 any significant difference found is not a TRUE difference, but CHANCE due to sampling error
 results stated in terms of probability that Ho is false
 findings are stronger if we can reject Ho
 therefore, need to specify Ho and H1
Steps in Inferential Statistics
2. Level of Significance
 Probability that sample means are different enough to reject Ho
(.05 or .01)
 level of probability or level of confidence
Steps in Inferential Statistics
3. Computing Calculated Value
 Use statistical test to derive some calculated value
(e.g., t value or F value)
4. Obtain Critical Value
 a criterion used based on df and alpha level (.05 or
.01) is compared to the calculated value to determine
if findings are significant and therefore reject Ho
Steps in Inferential Statistics
5. Reject or Fail to Reject Ho
 CALCULATED value is compared to the CRITICAL
value to determine if the difference is significant
enough to reject Ho at the predetermined level of
significance
 If CRITICAL value > CALCULATED value --> fail
to reject Ho
 If CRITICAL value < CALCULATED value -->
reject Ho
 If reject Ho, only supports H1; it does not prove H1
Testing Hypothesis
 If reject Ho and conclude groups are really different, it
doesn’t mean they’re different for the reason you
hypothesized
 may be other reason
 Since Ho testing is based on sample means, not
population means, there is a possibility of making an
error or wrong decision in rejecting or failing to reject
Ho
 Type I error
 Type II error
Testing Hypothesis
 Type I error -- rejecting Ho when it was true (it should
have been accepted)
 equal to alpha
 if α = .05, then there's a 5% chance of Type I error
 Type II error -- accepting Ho when it should have been
rejected
 If you increase α, you will decrease the chance of Type II error
Identifying the Appropriate Statistical Test of Difference
• One variable: one-way chi-square
• Two variables (1 IV with 2 levels; 1 DV): t-test
• Two variables (1 IV with 2+ levels; 1 DV): ANOVA
• Three or more variables: ANOVA
THANK YOU
84