Transcript Stats 244.3
Stats 245.3(02)
Review
Summarizing Data
Graphical Methods
Histogram
[Histogram: frequency counts (0 to 8) over the IQ classes 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120, 120 to 130]
Stem-Leaf Diagram
Stem | Leaf
  8  | 0 2 4 6 6 9
  9  | 0 4 4 5 5 6 9 9
 10  | 2 2 4 5 5 9
 11  | 1 8 9
 12  |
Grouped Freq Table
Class       | Verbal IQ | Math IQ
70 to 80    |     1     |    1
80 to 90    |     6     |    2
90 to 100   |     7     |   11
100 to 110  |     6     |    4
110 to 120  |     3     |    4
120 to 130  |     0     |    1
Box-whisker Plot
Summary
Numerical Measures
Measure of Central Location
1. Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• Center of gravity
2. Median
• "middle" observation
Measure of Non-Central Location
1. Percentiles
2. Quartiles
   1. Lower quartile (Q1): the 25th percentile (lower mid-hinge)
   2. Median (Q2): the 50th percentile (hinge)
   3. Upper quartile (Q3): the 75th percentile (upper mid-hinge)
Measure of Variability (Dispersion, Spread)
1. Range 2. Inter-Quartile Range 3. Variance, standard deviation 4. Pseudo-standard deviation
1. Range
R
= Range = max - min
2. Inter-Quartile Range (IQR)
Inter-Quartile Range = IQR = Q 3 - Q 1
The Sample Variance
Is defined as the quantity:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} d_i^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2$$
where $d_i = x_i - \bar{x}$, and is denoted by the symbol $s^2$.
The Sample Standard Deviation s
Definition:
The Sample Standard Deviation is defined by:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} d_i^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}$$
Hence the Sample Standard Deviation, s, is the square root of the sample variance.
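As a quick check of the definitions above, here is a minimal Python sketch of the sample mean, variance, and standard deviation (the data values are made up for illustration):

```python
import math

def sample_stats(xs):
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations d_i^2
    var = ss / (n - 1)                     # sample variance s^2
    return mean, var, math.sqrt(var)       # s = square root of s^2

mean, var, s = sample_stats([2, 4, 6, 8])
```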
Interpretations of s
• In Normal distributions – Approximately 2/3 of the observations will lie within one standard deviation of the mean – Approximately 95% of the observations lie within two standard deviations of the mean – In a histogram of the Normal distribution, the standard deviation is approximately the distance from the mode to the inflection point
[Figure: Normal density showing the mode, the inflection point at distance s from the mode, the central 2/3 of observations within ±s of the mean, and 95% within ±2s]
Computing formulae for s and s 2
The sum of squares of deviations from the mean can also be computed using the following identity:
$$\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$
Then:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 = \frac{\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2\!/\,n}{n-1}$$
and
$$s = \sqrt{\frac{\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2\!/\,n}{n-1}}$$
A quick (rough) calculation of s
$$s \approx \frac{\text{Range}}{4}$$
The reason for this is that approximately all (95%) of the observations lie between $\bar{x}-2s$ and $\bar{x}+2s$. Thus $\max \approx \bar{x}+2s$ and $\min \approx \bar{x}-2s$, so
$$\text{Range} = \max - \min \approx (\bar{x}+2s) - (\bar{x}-2s) = 4s$$
Hence $s \approx \text{Range}/4$.
The Pseudo Standard Deviation (PSD)
Definition:
The
Pseudo Standard Deviation (PSD)
is defined by:
$$\text{PSD} = \frac{\text{IQR}}{1.35} = \frac{\text{Inter-Quartile Range}}{1.35}$$
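A small Python sketch of the IQR and the PSD; note that quartile conventions differ between texts, and the "inclusive" method used here is an assumption, not something fixed by the notes:

```python
import statistics

def iqr_and_psd(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    return iqr, iqr / 1.35   # PSD = IQR / 1.35

iqr, psd = iqr_and_psd([1, 2, 3, 4, 5, 6, 7, 8])
```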
Properties
• For Normal distributions the pseudo standard deviation (PSD) and the standard deviation (s) will be approximately the same value.
• For leptokurtic distributions the standard deviation (s) will be larger than the pseudo standard deviation (PSD).
• For platykurtic distributions the standard deviation (s) will be smaller than the pseudo standard deviation (PSD).
Measures of Shape
• Skewness
• Kurtosis
[Figures: density curves contrasting skewed vs. symmetric shapes, and flatter vs. more peaked shapes relative to the Normal]
• Skewness – based on the sum of cubes: $\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^3$
• Kurtosis – based on the sum of 4th powers: $\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^4$
The Measure of Skewness
$$g_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^3 = \frac{1}{n\,s^3}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^3$$

The Measure of Kurtosis

$$g_2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^4 - 3 = \frac{1}{n\,s^4}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^4 - 3$$
Interpretations of Measures of Shape
• Skewness
  – g1 > 0: skewed to the right
  – g1 = 0: symmetric
  – g1 < 0: skewed to the left
• Kurtosis
  – g2 < 0: platykurtic (flatter than the Normal)
  – g2 = 0: same kurtosis as the Normal distribution
  – g2 > 0: leptokurtic (more peaked than the Normal)
[Figures: density curves illustrating each case]
Inferential Statistics
Making decisions regarding the population based on a sample
Estimation by Confidence Intervals
• Definition – A (100P)% confidence interval for an unknown parameter is a pair of sample statistics (t1 and t2) having the following properties:
  1. P[t1 < t2] = 1. That is, t1 is always smaller than t2.
  2. P[the unknown parameter lies between t1 and t2] = P.
• The statistics t1 and t2 are random variables.
• Property 2 states that the probability that the unknown parameter is bounded by the two statistics t1 and t2 is P.
Confidence Interval for a Proportion
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}$$
where $z_{\alpha/2}$ = upper $\alpha/2$ critical point of the standard normal distribution, and
$$B = z_{\alpha/2}\sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}} = \text{Error Bound}$$
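A hedged Python sketch of this interval (x = 40 successes in n = 100 trials and the 95% value z = 1.96 are illustrative assumptions):

```python
import math

def proportion_ci(x, n, z=1.96):   # z = z_{alpha/2}; 1.96 gives ~95%
    p_hat = x / n
    b = z * math.sqrt(p_hat * (1 - p_hat) / n)   # error bound B
    return p_hat - b, p_hat + b

lo, hi = proportion_ci(40, 100)
```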
Determination of Sample Size
The sample size that will estimate p with an Error Bound B and level of confidence P = 1 – α is:
$$n = \frac{z_{\alpha/2}^2\; p^*\left(1-p^*\right)}{B^2}$$
where:
• B is the desired Error Bound
• $z_{\alpha/2}$ is the $\alpha/2$ critical value for the standard normal distribution
• p* is some preliminary estimate of p.
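For example, the formula can be evaluated directly; the values p* = 0.5 (the conservative choice) and B = 0.05 below are illustrative:

```python
import math

def n_for_proportion(B, p_star=0.5, z=1.96):
    # n = z^2 p*(1 - p*) / B^2, rounded up to a whole subject
    return math.ceil(z**2 * p_star * (1 - p_star) / B**2)

n = n_for_proportion(0.05)
```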
Confidence Intervals for the mean of a Normal Population, μ

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \qquad\text{or, for large } n,\qquad \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$
where $\bar{x}$ = sample mean, $z_{\alpha/2}$ = upper $\alpha/2$ critical point of the standard normal distribution, and s = sample standard deviation.
Determination of Sample Size
The sample size that will estimate μ with an Error Bound B and level of confidence P = 1 – α is:
$$n = \frac{z_{\alpha/2}^2\; \sigma^{*2}}{B^2}$$
where:
• B is the desired Error Bound
• $z_{\alpha/2}$ is the $\alpha/2$ critical value for the standard normal distribution
• σ* is some preliminary estimate of σ.
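The same calculation for the mean, sketched in Python (σ* = 10 and B = 2 are illustrative):

```python
import math

def n_for_mean(B, sigma_star, z=1.96):
    # n = z^2 sigma*^2 / B^2, rounded up to a whole subject
    return math.ceil((z * sigma_star / B) ** 2)

n = n_for_mean(2, 10)
```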
Hypothesis Testing
An important area of statistical inference
Definition
Hypothesis (H) – Statement about the parameters of the population • In hypothesis testing there are two hypotheses of interest.
– The null hypothesis (H 0 ) – The alternative hypothesis (H A )
Type I, Type II Errors 1. Rejecting the null hypothesis when it is true (Type I error). 2. Accepting the null hypothesis when it is false (Type II error).
Decision Table showing types of Error
              Accept H0           Reject H0
H0 is True    Correct Decision    Type I Error
H0 is False   Type II Error       Correct Decision
To define a statistical Test we 1. Choose a statistic (called the
test statistic
) 2. Divide the range of possible values for the test statistic into two parts • The Acceptance Region • The Critical Region
To perform a statistical Test we
1. Collect the data.
2. Compute the value of the test statistic.
3. Make the Decision: • If the value of the test statistic is in the Acceptance Region we decide to
accept
H 0 .
• If the value of the test statistic is in the Critical Region we decide to
reject
H 0 .
Probability of the two types of error
Definitions: For any statistical testing procedure define
1. α = P[Rejecting the null hypothesis when it is true] = P[Type I error]
2. β = P[Accepting the null hypothesis when it is false] = P[Type II error]
Determining the Critical Region
1. The Critical Region should consist of values of the test statistic that indicate that H_A is true (hence that H0 should be rejected).
2. The size of the Critical Region is determined so that the probability of making a Type I error, α, is at some pre-determined level (usually 0.05 or 0.01). This value is called the significance level of the test.
Significance level = α = P[test makes Type I error]
To find the Critical Region
1. Find the sampling distribution of the test statistic when H0 is true.
2. Locate the Critical Region in the tails (left, right, or both) of the sampling distribution of the test statistic when H0 is true. Whether you locate the critical region in the left tail, the right tail, or both tails depends on which values indicate H_A is true. The tails chosen = values indicating H_A.
3. The size of the Critical Region is chosen so that the area over the critical region, under the sampling distribution of the test statistic when H0 is true, is the desired level of α = P[Type I error].
[Figure: sampling distribution of the test statistic when H0 is true, with the Critical Region shaded (area = α)]
The
z-
tests
Testing the probability of success
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0\left(1-p_0\right)}{n}}}$$
Testing the mean of a Normal Population
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \approx \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$$
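A sketch of the z-test for a mean, with a two-sided p-value computed from the standard normal tail via math.erfc (the data summary below is illustrative):

```python
import math

def z_test_mean(xbar, mu0, s, n):
    z = (xbar - mu0) / (s / math.sqrt(n))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # P[|Z| >= |z|]
    return z, p_two_sided

z, p = z_test_mean(52, 50, 10, 100)
```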
Critical Regions for testing the probability of success,
p
.
The Alternative Hypothesis     The Critical Region
H_A: p ≠ p0                    z < -z_{α/2} or z > z_{α/2}
H_A: p > p0                    z > z_α
H_A: p < p0                    z < -z_α
Critical Regions for testing the mean, μ, of a normal population

The Alternative Hypothesis     The Critical Region
H_A: μ ≠ μ0                    z < -z_{α/2} or z > z_{α/2}
H_A: μ > μ0                    z > z_α
H_A: μ < μ0                    z < -z_α
• You can compare a statistical test to a meter: the Critical Region is the red zone of the meter.
  – If the value of the test statistic falls in the Acceptance Region: Accept H0.
  – If the value of the test statistic falls in the Critical Region: Reject H0.
Sometimes the critical region is located on one side. These tests are called
one tailed
tests.
Whether you use a one tailed test or a two tailed test depends on: 1. The hypotheses being tested (
H
0 and
H A
).
2. The test statistic.
If only large positive values of the test statistic indicate H_A, then the critical region should be located in the positive tail (1-tailed test).
If only large negative values of the test statistic indicate H_A, then the critical region should be located in the negative tail (1-tailed test).
If both large positive and large negative values of the test statistic indicate H_A, then the critical region should be located in both the positive and negative tails (2-tailed test).
Usually 1-tailed tests are appropriate if H_A is one-sided. Two-tailed tests are appropriate if H_A is two-sided. But not always.
The p-value approach to Hypothesis Testing
Definition – Once the test statistic has been computed from the data, the p-value is defined to be:
p-value = P[the test statistic is as or more extreme than the observed value of the test statistic when H0 is true]
"more extreme" means giving stronger evidence for rejecting H0.
Properties of the p-value
1. If the p-value is small (< 0.05 or 0.01), H0 should be rejected.
2. The p-value measures the plausibility of H0.
3. If the test is two-tailed, the p-value should be two-tailed.
4. If the test is one-tailed, the p-value should be one-tailed.
5. It is customary to report p-values when reporting the results. This gives the reader some idea of the strength of the evidence for rejecting H0.
Summary
• A common way to report statistical tests is to compute the
p-value
.
• If the p-value is small (< 0.05 or < 0.01) then H0 is rejected.
• If the p-value is extremely small this gives a strong indication that H_A is true.
• If the p-value is marginally above the threshold 0.05 then we cannot reject H0, but there would be a suspicion that H0 is false.
“Student’s” t-test
The Situation
• Let x1, x2, x3, …, xn denote a sample from a normal population with mean μ and standard deviation σ. Both μ and σ are unknown.
• Let
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \text{the sample mean}$$
$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{n-1}} = \text{the sample standard deviation}$$
• We want to test if the mean, μ, is equal to some given value μ0.
The Test Statistic
$$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$$
The sampling distribution of the test statistic is the t distribution with n − 1 degrees of freedom.
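The statistic and its degrees of freedom can be computed directly (illustrative data; the critical value t_{α/2} would still come from a t table):

```python
import math, statistics

def one_sample_t(xs, mu0):
    n = len(xs)
    t = (statistics.mean(xs) - mu0) / (statistics.stdev(xs) / math.sqrt(n))
    return t, n - 1          # df = n - 1

t, df = one_sample_t([48, 52, 50, 54, 46], 48)
```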
The Alternative Hypothesis
The Alternative Hypothesis     The Critical Region
H_A: μ ≠ μ0                    t < -t_{α/2} or t > t_{α/2}
H_A: μ > μ0                    t > t_α
H_A: μ < μ0                    t < -t_α

t_α and t_{α/2} are critical values under the t distribution with n − 1 degrees of freedom.
[Figure: critical values of the t-distribution]
Confidence Intervals
using the t distribution
Confidence Intervals for the mean of a Normal Population, μ, using the Standard Normal distribution:
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Confidence Intervals for the mean of a Normal Population, μ, using the t distribution:
$$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$$
Testing and Estimation of Variances
Sampling Theory
The statistic
$$U = \frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{\sigma^2} = \frac{(n-1)\,s^2}{\sigma^2}$$
has a $\chi^2$ distribution with n − 1 degrees of freedom.
[Figure: critical points of the $\chi^2$ distribution]
Confidence intervals for σ² and σ
Hence (1 – α)100% confidence limits for σ² are:
$$\frac{(n-1)\,s^2}{\chi^2_{\alpha/2}} \quad\text{to}\quad \frac{(n-1)\,s^2}{\chi^2_{1-\alpha/2}}$$
and (1 – α)100% confidence limits for σ are:
$$\sqrt{\frac{n-1}{\chi^2_{\alpha/2}}}\;s \quad\text{to}\quad \sqrt{\frac{n-1}{\chi^2_{1-\alpha/2}}}\;s$$
Testing Hypotheses for σ² and σ
Suppose we want to test:
H0: σ² = σ0² against H_A: σ² ≠ σ0²
The test statistic:
$$U = \frac{(n-1)\,s^2}{\sigma_0^2}$$
If H0 is true the test statistic, U, has a $\chi^2$ distribution with n − 1 degrees of freedom. Thus we reject H0 if
$$\frac{(n-1)\,s^2}{\sigma_0^2} < \chi^2_{1-\alpha/2} \quad\text{or}\quad \frac{(n-1)\,s^2}{\sigma_0^2} > \chi^2_{\alpha/2}$$
[Figure: $\chi^2$ density with rejection regions of area α/2 in each tail and the acceptance region between]
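The test statistic is simple to evaluate; the summary values below (n = 10, s² = 12, σ0² = 8) are illustrative, and the χ² critical points would come from a table:

```python
def chi2_variance_stat(n, s2, sigma0_sq):
    # U = (n - 1) s^2 / sigma_0^2, compared to chi-square(n - 1)
    return (n - 1) * s2 / sigma0_sq

U = chi2_variance_stat(10, 12.0, 8.0)
```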
One-tailed Tests for σ² and σ
Suppose we want to test:
H0: σ² = σ0² against H_A: σ² > σ0²
The test statistic:
$$U = \frac{(n-1)\,s^2}{\sigma_0^2}$$
We reject H0 if
$$\frac{(n-1)\,s^2}{\sigma_0^2} > \chi^2_{\alpha}$$
Or suppose we want to test:
H0: σ² = σ0² against H_A: σ² < σ0²
With the same test statistic, we reject H0 if
$$\frac{(n-1)\,s^2}{\sigma_0^2} < \chi^2_{1-\alpha}$$
[Figures: $\chi^2$ density with a one-tailed rejection region of area α]
Comparing Populations
Proportions and means
Comparing proportions
Comparing two binomial probabilities p 1 and p 2
The test statistic
$$z = \frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}\left(1-\hat{p}\right)\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}$$
where
$$\hat{p}_1 = \frac{x_1}{n_1},\qquad \hat{p}_2 = \frac{x_2}{n_2} \qquad\text{and}\qquad \hat{p} = \frac{x_1+x_2}{n_1+n_2}$$
The Alternative Hypothesis     The Critical Region
H_A: p1 ≠ p2                   z < -z_{α/2} or z > z_{α/2}
H_A: p1 > p2                   z > z_α
H_A: p1 < p2                   z < -z_α
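A sketch of the two-proportion z statistic with its pooled estimate (the counts are illustrative):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled p-hat under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(30, 100, 20, 100)
```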
100(1 – α)% Confidence Interval for δ = p1 – p2:
$$\hat{p}_1-\hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\left(1-\hat{p}_1\right)}{n_1}+\frac{\hat{p}_2\left(1-\hat{p}_2\right)}{n_2}} \quad\text{or}\quad \hat{p}_1-\hat{p}_2 \pm B$$
where
$$B = z_{\alpha/2}\sqrt{\frac{\hat{p}_1\left(1-\hat{p}_1\right)}{n_1}+\frac{\hat{p}_2\left(1-\hat{p}_2\right)}{n_2}}$$
Sample size determination
Confidence Interval for δ = p1 – p2: $\hat{p}_1-\hat{p}_2 \pm B$, where
$$B = z_{\alpha/2}\sqrt{\frac{p_1\left(1-p_1\right)}{n_1}+\frac{p_2\left(1-p_2\right)}{n_2}}$$
Again we want to choose n1 and n2 to set B at some predetermined level with a fixed level of confidence 1 – α.
Special solutions - case 1: n1 = n2 = n. Then
$$n = n_1 = n_2 = \frac{z_{\alpha/2}^2\left[p_1\left(1-p_1\right)+p_2\left(1-p_2\right)\right]}{B^2}$$
Special solutions - case 2: Choose n1 and n2 to minimize N = n1 + n2 = total sample size. Then
$$n_1 = \frac{z_{\alpha/2}^2}{B^2}\left[p_1\left(1-p_1\right)+\sqrt{p_1\left(1-p_1\right)p_2\left(1-p_2\right)}\right]$$
$$n_2 = \frac{z_{\alpha/2}^2}{B^2}\left[p_2\left(1-p_2\right)+\sqrt{p_1\left(1-p_1\right)p_2\left(1-p_2\right)}\right]$$
Special solutions - case 3: Choose n1 and n2 to minimize C = C0 + c1 n1 + c2 n2 = total cost of the study.
Note: C0 = fixed (set-up) costs, c1 = cost per unit in population 1, c2 = cost per unit in population 2. Then
$$n_1 = \frac{z_{\alpha/2}^2}{B^2}\left[p_1\left(1-p_1\right)+\sqrt{\frac{c_2}{c_1}}\sqrt{p_1\left(1-p_1\right)p_2\left(1-p_2\right)}\right]$$
$$n_2 = \frac{z_{\alpha/2}^2}{B^2}\left[p_2\left(1-p_2\right)+\sqrt{\frac{c_1}{c_2}}\sqrt{p_1\left(1-p_1\right)p_2\left(1-p_2\right)}\right]$$
Comparing Means
The z-test
$$z = \frac{\bar{x}-\bar{y}}{\sqrt{\dfrac{s_x^2}{n}+\dfrac{s_y^2}{m}}}$$
(for testing H0: μx = μy, with n and m large)
Confidence Interval for δ = μ1 – μ2:
$$\bar{x}_1-\bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \quad\text{or}\quad \bar{x}_1-\bar{x}_2 \pm B,\qquad\text{where } B = z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$
Sample size determination
The sample sizes required, n1 and n2, to estimate μ1 – μ2 within an error bound B with level of confidence 1 – α are:
Equal sample sizes:
$$n_1 = n_2 = \frac{z_{\alpha/2}^2\left(\sigma_1^2+\sigma_2^2\right)}{B^2}$$
Minimizing the total sample size N = n1 + n2:
$$n_1 = \frac{z_{\alpha/2}^2\left(\sigma_1^2+\sigma_1\sigma_2\right)}{B^2},\qquad n_2 = \frac{z_{\alpha/2}^2\left(\sigma_2^2+\sigma_1\sigma_2\right)}{B^2}$$
Minimizing the total cost C = C0 + c1 n1 + c2 n2:
$$n_1 = \frac{z_{\alpha/2}^2}{B^2}\left(\sigma_1^2+\sqrt{\frac{c_2}{c_1}}\,\sigma_1\sigma_2\right),\qquad n_2 = \frac{z_{\alpha/2}^2}{B^2}\left(\sigma_2^2+\sqrt{\frac{c_1}{c_2}}\,\sigma_1\sigma_2\right)$$
The t test – for comparing means – small samples (equal variances)
Situation
• We have two normal populations (1 and 2).
• Let μ1 and σ denote the mean and standard deviation of population 1.
• Let μ2 and σ denote the mean and standard deviation of population 2.
• Note: we assume that the standard deviation for each population is the same: σ1 = σ2 = σ.
The t test for comparing means – small samples (equal variances):
$$t = \frac{\bar{x}-\bar{y}}{s_{\text{Pooled}}\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}},\qquad s_{\text{Pooled}} = \sqrt{\frac{(n-1)\,s_x^2+(m-1)\,s_y^2}{n+m-2}}$$
The Alternative Hypothesis
The Alternative Hypothesis     The Critical Region
H_A: μ1 ≠ μ2                   t < -t_{α/2} or t > t_{α/2}
H_A: μ1 > μ2                   t > t_α
H_A: μ1 < μ2                   t < -t_α

t_{α/2} and t_α are critical points under the t distribution with degrees of freedom n + m – 2.
Confidence intervals for the difference in two means of normal populations (small sample sizes, equal variances): (1 – α)100% confidence limits for μ1 – μ2 are
$$\bar{x}-\bar{y} \pm t_{\alpha/2}\; s_{\text{Pooled}}\sqrt{\frac{1}{n}+\frac{1}{m}}$$
where
$$s_{\text{Pooled}} = \sqrt{\frac{(n-1)\,s_x^2+(m-1)\,s_y^2}{n+m-2}} \qquad\text{and}\qquad df = n+m-2$$
Tests, Confidence intervals for the difference in two means of normal populations (small sample sizes, unequal variances)
The approximate test for comparing two means of Normal Populations (unequal variances)
Test statistic:
$$t = \frac{\bar{x}-\bar{y}}{\sqrt{\dfrac{s_x^2}{n}+\dfrac{s_y^2}{m}}}$$
with approximate degrees of freedom
$$df = \frac{\left(\dfrac{s_x^2}{n}+\dfrac{s_y^2}{m}\right)^2}{\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2+\dfrac{1}{m-1}\left(\dfrac{s_y^2}{m}\right)^2}$$
Null Hypothesis H0: μ1 = μ2.

Alt. Hypothesis        Critical Region
H_A: μ1 ≠ μ2           t < -t_{α/2} or t > t_{α/2}
H_A: μ1 > μ2           t > t_α
H_A: μ1 < μ2           t < -t_α
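The approximate degrees-of-freedom formula can be evaluated directly (the variance summaries below are illustrative):

```python
def welch_df(s2x, n, s2y, m):
    # df = (s_x^2/n + s_y^2/m)^2 /
    #      [ (s_x^2/n)^2/(n-1) + (s_y^2/m)^2/(m-1) ]
    a, b = s2x / n, s2y / m
    return (a + b) ** 2 / (a**2 / (n - 1) + b**2 / (m - 1))

df = welch_df(4.0, 10, 9.0, 20)
```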
Confidence intervals for the difference in two means of normal populations (small samples, unequal variances): (1 – α)100% confidence limits for μ1 – μ2 are
$$\bar{x}-\bar{y} \pm t_{\alpha/2}\sqrt{\frac{s_x^2}{n}+\frac{s_y^2}{m}}$$
with
$$df = \frac{\left(\dfrac{s_x^2}{n}+\dfrac{s_y^2}{m}\right)^2}{\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2+\dfrac{1}{m-1}\left(\dfrac{s_y^2}{m}\right)^2}$$
The paired
t
-test
An example of improved experimental design
The matched pair experimental design (the paired sample experiment): prior to assigning the treatments, the subjects are grouped into pairs of similar subjects. Suppose that there are n such pairs (a total of 2n = n + n subjects or cases). The two treatments are then randomly assigned within each pair: one member of a pair receives treatment 1, while the other receives treatment 2. The data collected are as follows:
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
where xi = the response for the case in pair i that receives treatment 1, and yi = the response for the case in pair i that receives treatment 2.
To test H0: μ1 = μ2 is equivalent to testing H0: μd = 0 (we have converted the two-sample problem into a single-sample problem). The test statistic is the single-sample t-test on the differences d1, d2, d3, …, dn, where di = xi – yi:
$$t = \frac{\bar{d}-0}{s_d/\sqrt{n}},\qquad df = n-1$$
with $\bar{d}$ the mean and $s_d$ the standard deviation of the $d_i$'s.
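The paired t statistic as a sketch (three illustrative pairs):

```python
import math, statistics

def paired_t(xs, ys):
    ds = [x - y for x, y in zip(xs, ys)]   # within-pair differences
    n = len(ds)
    t = statistics.mean(ds) / (statistics.stdev(ds) / math.sqrt(n))
    return t, n - 1                        # df = n - 1

t, df = paired_t([10, 12, 14], [9, 10, 13])
```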
Testing for the equality of variances
The
F
test
The test statistic (F):
$$F = \frac{s_x^2}{s_y^2} \qquad\text{or}\qquad F = \frac{s_y^2}{s_x^2}$$
The sampling distribution of the test statistic: if the Null Hypothesis (H0: σx² = σy²) is true, then the sampling distribution of F is called the F-distribution with ν1 = n − 1 degrees of freedom in the numerator and ν2 = m − 1 degrees of freedom in the denominator.
[Figure: the F distribution, F(ν1, ν2)]
Critical region for the test:
H0: σx² = σy² against H_A: σx² ≠ σy² (two-sided alternative)
Reject H0 if
$$F = \frac{s_x^2}{s_y^2} > F_{\alpha/2}(n-1,\,m-1) \qquad\text{or}\qquad \frac{1}{F} = \frac{s_y^2}{s_x^2} > F_{\alpha/2}(m-1,\,n-1)$$
Critical region for the test (one-tailed):
H0: σx² = σy² against H_A: σx² > σy² (one-sided alternative)
Reject H0 if
$$F = \frac{s_x^2}{s_y^2} > F_{\alpha}(n-1,\,m-1)$$
Summary of Tests
One Sample Tests
1. Sample from the Normal distribution with unknown mean and known variance (testing μ). Test statistic: $z = \sqrt{n}\,(\bar{x}-\mu_0)/\sigma$.
   H_A: μ ≠ μ0 → z < -z_{α/2} or z > z_{α/2};  H_A: μ > μ0 → z > z_α;  H_A: μ < μ0 → z < -z_α.
2. Sample from the Normal distribution with unknown mean and unknown variance (testing μ). Test statistic: $t = \sqrt{n}\,(\bar{x}-\mu_0)/s$.
   H_A: μ ≠ μ0 → t < -t_{α/2} or t > t_{α/2};  H_A: μ > μ0 → t > t_α;  H_A: μ < μ0 → t < -t_α.
3. Testing a binomial probability (H0: p = p0). Test statistic: $z = (\hat{p}-p_0)/\sqrt{p_0(1-p_0)/n}$.
   H_A: p ≠ p0 → z < -z_{α/2} or z > z_{α/2};  H_A: p > p0 → z > z_α;  H_A: p < p0 → z < -z_α.
4. Sample from the Normal distribution with unknown mean and unknown variance (testing σ). Test statistic: $U = (n-1)s^2/\sigma_0^2$.
   H_A: σ ≠ σ0 → U < χ²_{1-α/2}(n-1) or U > χ²_{α/2}(n-1);  H_A: σ > σ0 → U > χ²_α(n-1);  H_A: σ < σ0 → U < χ²_{1-α}(n-1).
Two Sample Tests
1. Two independent samples from Normal distributions with unknown means and known variances (testing μ1 – μ2). Test statistic:
   $z = (\bar{x}_1-\bar{x}_2)\big/\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}$.
   H_A: μ1 ≠ μ2 → z < -z_{α/2} or z > z_{α/2};  μ1 > μ2 → z > z_α;  μ1 < μ2 → z < -z_α.
2. Two independent samples from Normal distributions with unknown means and unknown but equal variances (testing μ1 – μ2). Test statistic:
   $t = (\bar{x}_1-\bar{x}_2)\big/\left(s_p\sqrt{1/n_1+1/n_2}\right)$ with $s_p^2 = \dfrac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}$, df = n1 + n2 – 2.
   H_A: μ1 ≠ μ2 → t < -t_{α/2} or t > t_{α/2};  μ1 > μ2 → t > t_α;  μ1 < μ2 → t < -t_α.
3. Comparing two binomial probabilities p1, p2 (H0: p1 = p2). Test statistic:
   $z = (\hat{p}_1-\hat{p}_2)\big/\sqrt{\hat{p}(1-\hat{p})\left(1/n_1+1/n_2\right)}$.
   H_A: p1 ≠ p2 → z < -z_{α/2} or z > z_{α/2};  p1 > p2 → z > z_α;  p1 < p2 → z < -z_α.
Two Sample Tests - continued
4. Two independent Normal samples with unknown means and unequal variances (testing μ1 – μ2). Test statistic:
   $t = (\bar{x}_1-\bar{x}_2)\big/\sqrt{s_1^2/n_1+s_2^2/n_2}$ with approximate degrees of freedom
   $\nu^* = \left(s_1^2/n_1+s_2^2/n_2\right)^2\Big/\left[\dfrac{(s_1^2/n_1)^2}{n_1-1}+\dfrac{(s_2^2/n_2)^2}{n_2-1}\right]$.
   H_A: μ1 ≠ μ2 → t < -t_{α/2} or t > t_{α/2};  μ1 > μ2 → t > t_α;  μ1 < μ2 → t < -t_α (all with df = ν*).
5. Two independent Normal samples with unknown variances (testing σ1 = σ2). Test statistic: $F = s_1^2/s_2^2$ or $F = s_2^2/s_1^2$.
   H_A: σ1 ≠ σ2 → F > F_{α/2}(n-1, m-1) or 1/F > F_{α/2}(m-1, n-1);  σ1 > σ2 → F > F_α(n-1, m-1);  σ1 < σ2 → 1/F > F_α(m-1, n-1).
The paired t test
Situation: n matched pairs of subjects are treated with two treatments. The difference di = xi – yi has mean μd = μ1 – μ2. Test statistic:
$$t = \frac{\bar{d}}{s_d/\sqrt{n}},\qquad df = n-1$$
H_A: μ1 ≠ μ2 → t < -t_{α/2} or t > t_{α/2};  μ1 > μ2 → t > t_α;  μ1 < μ2 → t < -t_α (all with df = n – 1).
[Diagram: independent samples (possibly equal numbers per treatment) vs. matched pairs (Pair 1, Pair 2, …, Pair n, one member per treatment)]
Comparing k Populations
Means – One way Analysis of Variance (ANOVA)
The
F
test
The F test – for comparing
k
means
• Situation
• We have k normal populations.
• Let μi and σ denote the mean and standard deviation of population i, i = 1, 2, 3, …, k.
• Note: we assume that the standard deviation for each population is the same: σ1 = σ2 = … = σk = σ.
We want to test
H0: μ1 = μ2 = μ3 = … = μk
against
H_A: μi ≠ μj for at least one pair i, j.
To do so, use the test statistic
$$F = \frac{MS_{\text{Between}}}{MS_{\text{Within}}} = \frac{\left[\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2\right]\big/(k-1)}{\left[\sum_{i=1}^{k}\left(n_i-1\right)s_i^2\right]\big/(N-k)}$$
where
• $\bar{x}_i$ = mean for the i-th sample, $s_i$ = standard deviation for the i-th sample
• $\bar{x} = \dfrac{n_1\bar{x}_1+\cdots+n_k\bar{x}_k}{n_1+\cdots+n_k}$ = overall mean
The statistic $\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2$ is called the Between Sum of Squares and is denoted by SS_Between. It measures the variability between samples. k – 1 is known as the Between degrees of freedom, and $SS_{\text{Between}}/(k-1)$ is called the Between Mean Square and is denoted by MS_Between.
The statistic $\sum_{i=1}^{k}\left(n_i-1\right)s_i^2$ is called the Within Sum of Squares and is denoted by SS_Within. $N-k = \sum_{i=1}^{k} n_i - k$ is known as the Within degrees of freedom, and $SS_{\text{Within}}/(N-k)$ is called the Within Mean Square and is denoted by MS_Within.
Then
$$F = \frac{MS_{\text{Between}}}{MS_{\text{Within}}}$$
The Computing formula for F: Compute
1) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = total for sample i
2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total
3) $N = \sum_{i=1}^{k} n_i$ = total sample size
4) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$
5) $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$
Then
1) $SS_{\text{Between}} = \sum_{i=1}^{k}\dfrac{T_i^2}{n_i} - \dfrac{G^2}{N}$
2) $SS_{\text{Within}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k}\dfrac{T_i^2}{n_i}$
3) $F = \dfrac{SS_{\text{Between}}/(k-1)}{SS_{\text{Within}}/(N-k)}$
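The computing formula translates almost line for line into Python (three tiny illustrative samples):

```python
def one_way_anova_F(samples):
    k = len(samples)
    N = sum(len(s) for s in samples)
    T = [sum(s) for s in samples]                          # sample totals
    G = sum(T)                                             # grand total
    part = sum(t**2 / len(s) for t, s in zip(T, samples))  # sum T_i^2 / n_i
    ss_between = part - G**2 / N
    ss_within = sum(x**2 for s in samples for x in s) - part
    F = (ss_between / (k - 1)) / (ss_within / (N - k))
    return F, k - 1, N - k

F, df1, df2 = one_way_anova_F([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
```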
The critical region for the F test
We reject H0: μ1 = μ2 = … = μk if F > F_α, where F_α is the critical point under the F distribution with ν1 = k − 1 degrees of freedom in the numerator and ν2 = N − k degrees of freedom in the denominator.
The ANOVA Table
A convenient method for displaying the calculations for the
F
-test
Anova Table
Source    d.f.    Sum of Squares   Mean Square   F-ratio
Between   k - 1   SS_Between       MS_Between    MS_B / MS_W
Within    N - k   SS_Within        MS_Within
Total     N - 1   SS_Total
Fisher's LSD (least significant difference) procedure:
1. Test H0: μ1 = μ2 = μ3 = … = μk against H_A: at least one pair of means are different, using the ANOVA F-test.
2. If H0 is accepted we conclude that all means are equal (not significantly different). Stop in this case.
3. If H0 is rejected we conclude that at least one pair of means is significantly different; then follow this by using two-sample t tests to determine which pairs of means are significantly different.
Comparing k Populations
Proportions: the χ² test for independence
1. The number of populations (columns): k (or c)
2. The number of categories (rows): from 2 to r

         1      2     …    c     Total
1        x11    x12   …          R1
2        x21    x22   …          R2
…
r                                Rr
Total    C1     C2    …    Cc    N
The χ² test for independence
Situation
• We have two categorical variables R and C.
• The number of categories of R is r; the number of categories of C is c.
• We observe n subjects from the population and count x_ij = the number of subjects for which R = i and C = j.
• R = rows, C = columns.
Define
$$R_i = \sum_{j=1}^{c} x_{ij} = i\text{-th row total},\qquad C_j = \sum_{i=1}^{r} x_{ij} = j\text{-th column total}$$
$$E_{ij} = \frac{R_i\,C_j}{n} = \text{expected frequency in the }(i,j)\text{-th cell in the case of independence}$$
Then to test
H0: R and C are independent against H_A: R and C are not independent,
use the test statistic
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{\left(x_{ij}-E_{ij}\right)^2}{E_{ij}}$$
where x_ij = observed frequency in the (i, j)-th cell.
Sampling distribution of the test statistic when H0 is true:
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{\left(x_{ij}-E_{ij}\right)^2}{E_{ij}}$$
has a $\chi^2$ distribution with degrees of freedom ν = (r − 1)(c − 1).
Critical and Acceptance Region: Reject H0 if $\chi^2 > \chi^2_{\alpha}$; Accept H0 if $\chi^2 \le \chi^2_{\alpha}$.
Linear Regression
Hypothesis testing and Estimation
Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
The Statistical Model
Each y_i is assumed to be randomly generated from a normal distribution with mean μ_i = α + βx_i and standard deviation σ (α, β and σ are unknown).
[Figure: the line Y = α + βX with slope β, and the point (x_i, α + βx_i)]
The Linear Regression Model
• The data fall roughly about a straight line: Y = α + βX (unseen).
[Figure: scatterplot of the data about the unseen line]
The Least Squares Line
Fitting the best straight line to “linear” data
Let Y = a + bX denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if X = x_i (as for the i-th case) then the predicted value of Y is:
$$\hat{y}_i = a + b x_i$$
The residual
$$r_i = y_i - \hat{y}_i = y_i - \left(a + b x_i\right)$$
can be computed for each case in the sample: $r_1 = y_1-\hat{y}_1,\ r_2 = y_2-\hat{y}_2,\ \ldots,\ r_n = y_n-\hat{y}_n$.
The residual sum of squares (RSS) is
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i-a-bx_i\right)^2$$
a measure of the "goodness of fit" of the line Y = a + bX to the data.
The optimal choice of a and b will result in the residual sum of squares attaining a minimum. If this is the case then the line Y = a + bX is called the Least Squares Line.
Comments
• β and α are the slope and intercept of the regression line (unseen).
• b and a are the slope and intercept of the least squares line (calculated from the data): $b = \hat{\beta}$, $a = \hat{\alpha}$.
• They represent the same quantities.
The equation for the least squares line
Let
$$S_{xx} = \sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2,\qquad S_{yy} = \sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2,\qquad S_{xy} = \sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)$$
Computing formulae:
$$S_{xx} = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum x_i\right)^2}{n},\qquad S_{yy} = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum y_i\right)^2}{n},\qquad S_{xy} = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$$
Then the slope of the least squares line can be shown to be:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}$$
and the intercept of the least squares line can be shown to be:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
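These formulae in Python (three illustrative points):

```python
def least_squares(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope b = S_xy / S_xx
    a = ybar - b * xbar      # intercept a = ybar - b * xbar
    return a, b

a, b = least_squares([1, 2, 3], [2, 4, 5])
```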
The residual sum of squares:
$$RSS = \sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \quad\text{(computing formula)}$$
Estimating σ, the standard deviation in the regression model:
$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i-a-bx_i\right)^2}{n-2}} = \sqrt{\frac{S_{yy}-S_{xy}^2/S_{xx}}{n-2}} \quad\text{(computing formula)}$$
This estimate of σ is said to be based on n – 2 degrees of freedom.
Sampling distributions of the estimators
The sampling distribution of the slope of the least squares line:
$$b = \frac{S_{xy}}{S_{xx}}$$
It can be shown that b has a normal distribution with mean and standard deviation
$$\mu_b = \beta \qquad\text{and}\qquad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}}$$
Thus
$$z = \frac{b-\mu_b}{\sigma_b} = \frac{b-\beta}{\sigma/\sqrt{S_{xx}}}$$
has a standard normal distribution, and
$$t = \frac{b-\mu_b}{s_b} = \frac{b-\beta}{s/\sqrt{S_{xx}}}$$
has a t distribution with df = n − 2.
(1 – α)100% Confidence Limits for slope β:
$$b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
Testing the slope
H0: β = β0 vs H_A: β ≠ β0. The test statistic is:
$$t = \frac{b-\beta_0}{s/\sqrt{S_{xx}}}$$
which has a t distribution with df = n – 2 if H0 is true.
The Critical Region: reject H0 if t < -t_{α/2} or t > t_{α/2} (df = n – 2). This is a two-tailed test; one-tailed tests are also possible.
The sampling distribution of the intercept of the least squares line:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
It can be shown that a has a normal distribution with mean and standard deviation
$$\mu_a = \alpha \qquad\text{and}\qquad \sigma_a = \sigma\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}}$$
Thus
$$z = \frac{a-\mu_a}{\sigma_a} = \frac{a-\alpha}{\sigma\sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{S_{xx}}}}$$
has a standard normal distribution, and
$$t = \frac{a-\alpha}{s\sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{S_{xx}}}}$$
has a t distribution with df = n − 2.
(1 – α)100% Confidence Limits for intercept α:
$$a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n – 2 degrees of freedom.
Testing the intercept
H0: α = α0 vs H_A: α ≠ α0. The test statistic is:
$$t = \frac{a-\alpha_0}{s\sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{S_{xx}}}}$$
which has a t distribution with df = n – 2 if H0 is true.
The Critical Region: reject H0 if t < -t_{α/2} or t > t_{α/2} (df = n – 2).
Confidence Limits for Points on the Regression Line
• The intercept α is a specific point on the regression line: it is the y-coordinate of the point on the regression line when x = 0, i.e. the predicted value of y when x = 0.
• We may also be interested in other points on the regression line, e.g. when x = x0. In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.
(1 − α)100% Confidence Limits for α + βx0:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Prediction Limits for new values of the Dependent variable y
• An important application of the regression line is prediction: knowing the value of x (x0), what is the value of y?
• The predicted value of y when x = x0 is α + βx0, estimated by $\hat{\alpha}+\hat{\beta}x_0 = a + bx_0$.
• The predictor a + bx0 gives only a single value for y. A more appropriate piece of information would be a range of values that has a fixed probability of capturing the value of y: a (1 − α)100% prediction interval for y.
(1 − α)100% Prediction Limits for y when x = x0:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1+\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{S_{xx}}}$$
where $t_{\alpha/2}$ is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Correlation
Definition
The statistic:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2\;\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}}$$
is called Pearson's correlation coefficient.
Properties
1. -1 ≤ r ≤ 1, |r| ≤ 1, r² ≤ 1.
2. |r| = 1 (r = +1 or -1) if the points (x1, y1), (x2, y2), …, (xn, yn) lie along a straight line (positive slope for +1, negative slope for -1).
The test for independence (zero correlation)
H0: X and Y are independent vs H_A: X and Y are correlated. The test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
The Critical Region: reject H0 if |t| > t_{α/2} (df = n – 2). This is a two-tailed critical region; the critical region could also be one-tailed.
Spearman’s rank correlation coefficient ρ (rho)
Spearman’s rank correlation coefficient is computed as follows:
• Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.
• Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.
• For any case i let (x_i, y_i) denote the observations on X and Y and let (r_i, s_i) denote the ranks on X and Y.
• For each case let d_i = r_i – s_i = the difference in the two ranks. Then Spearman's rank correlation coefficient (ρ) is defined as follows:
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2-1\right)}$$
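Given the two rank vectors, the coefficient is a short calculation (illustrative ranks):

```python
def spearman_rho(r_ranks, s_ranks):
    n = len(r_ranks)
    d2 = sum((r - s) ** 2 for r, s in zip(r_ranks, s_ranks))
    return 1 - 6 * d2 / (n * (n**2 - 1))

rho = spearman_rho([1, 2, 3, 4], [2, 1, 4, 3])
```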
Properties of Spearman’s rank correlation coefficient ρ
1. The value of ρ is always between –1 and +1.
2. If the relationship between X and Y is positive, then ρ will be positive.
3. If the relationship between X and Y is negative, then ρ will be negative.
4. If there is no relationship between X and Y, then ρ will be zero.
5. The value of ρ will be +1 if the ranks of X completely agree with the ranks of Y.
6. The value of ρ will be -1 if the ranks of X are in reverse order to the ranks of Y.
Relationship between Regression and Correlation
Recall
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$
Also
$$\hat{\beta} = b = \frac{S_{xy}}{S_{xx}} = \frac{s_y}{s_x}\, r \qquad\text{since}\qquad s_x = \sqrt{\frac{S_{xx}}{n-1}} \text{ and } s_y = \sqrt{\frac{S_{yy}}{n-1}}$$
Thus the slope of the least squares line is simply the ratio of the standard deviations × the correlation coefficient.
The coefficient of Determination
Sums of Squares associated with Linear Regresssion
RSS
i n
1
r i
2
i n
1
y i
=
SS
unexplained
y
ˆ
i
2
i n
1
y i
a
bx i
2
SS Total
i n
1
y i
y
2
SS Explained
i n
1 ˆ
i
y
2
It can be shown:
i n
1
y i
y
2
i n
1 ˆ
i
y
2
i n
1
y i
y
ˆ
i
2
SS Total
SS Explained
SS Un
exp
lained
(Total variability in
Y
) = (variability in
Y
explained by
X
) + (variability in
Y
unexplained by
X
)
It can also be shown:
r² = Σ_{i=1}^{n} (ŷ_i − ȳ)² / Σ_{i=1}^{n} (y_i − ȳ)² = proportion of variability in Y explained by X = the coefficient of determination.
Further:
1 − r² = Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)² = proportion of variability in Y that is unexplained by X.
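A minimal sketch of this decomposition for a one-predictor least-squares fit, on hypothetical data; `simple_regression_r2` is our own helper, not part of any library.

```python
def simple_regression_r2(x, y):
    """Fit the least-squares line y = a + b*x and return the three
    sums of squares: SS_Total, SS_Explained, SS_Unexplained."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = Sxy / Sxx
    a = ybar - b * xbar
    yhat = [a + b * xi for xi in x]
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_expl = sum((yh - ybar) ** 2 for yh in yhat)
    ss_unexpl = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    return a, b, ss_total, ss_expl, ss_unexpl

# hypothetical, nearly linear data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, ss_t, ss_e, ss_u = simple_regression_r2(x, y)
r2 = ss_e / ss_t   # close to 1 here, since the points lie near a line
```

The identity SS_Total = SS_Explained + SS_Unexplained holds up to floating-point rounding, whatever the data.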
Regression (in general)
In many experiments we will have collected data on a single variable Y (the dependent variable) and on p (say) other variables X_1, X_2, X_3, …, X_p (the independent variables). One is interested in determining a model that describes the relationship between Y (the response (dependent) variable) and X_1, X_2, …, X_p (the predictor (independent) variables).
This model can be used for:
– Prediction
– Controlling Y by manipulating X_1, X_2, …, X_p
The Model:
is an equation of the form
Y = f(X_1, X_2, …, X_p | θ_1, θ_2, …, θ_q) + ε
where θ_1, θ_2, …, θ_q are unknown parameters of the function f and ε is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation σ).
The Multiple Linear Regression Model
In Multiple Linear Regression we assume the following model:
Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p + ε
This model is called the Multiple Linear Regression Model. Here β_0, β_1, β_2, …, β_p are unknown parameters of the model and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.
Summary of the Statistics used in Multiple Regression
The Least Squares Estimates: b_0, b_1, b_2, …, b_p, the values that minimize
RSS = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − b_0 − b_1 x_{1i} − b_2 x_{2i} − … − b_p x_{pi})²
The Analysis of Variance Table Entries
a) Adjusted Total Sum of Squares (SS_Total): SS_Total = Σ_{i=1}^{n} (y_i − ȳ)², d.f. = n − 1
b) Residual Sum of Squares (SS_Error): RSS = SS_Error = Σ_{i=1}^{n} (y_i − ŷ_i)², d.f. = n − p − 1
c) Regression Sum of Squares (SS_Reg): SS_Reg = SS(b_1, b_2, …, b_p) = Σ_{i=1}^{n} (ŷ_i − ȳ)², d.f. = p
Note: Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)², i.e. SS_Total = SS_Reg + SS_Error
The Analysis of Variance Table
Source       Sum of Squares   d.f.        Mean Square                          F
Regression   SS_Reg           p           MS_Reg = SS_Reg / p                  MS_Reg / s²
Error        SS_Error         n − p − 1   MS_Error = SS_Error / (n − p − 1) = s²
Total        SS_Total         n − 1
Uses:
1. To estimate σ² (the error variance): use s² = MS_Error to estimate σ².
2. To test the hypothesis H_0: β_1 = β_2 = … = β_p = 0. Use the test statistic
F = MS_Reg / MS_Error = [SS_Reg / p] / [SS_Error / (n − p − 1)] = MS_Reg / s²
Reject H_0 if F > F_α(p, n − p − 1).
3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X_1, X_2, …, X_p (the independent variables).
a) R² = the coefficient of determination = SS_Reg / SS_Total = Σ_{i=1}^{n} (ŷ_i − ȳ)² / Σ_{i=1}^{n} (y_i − ȳ)² = the proportion of variance in Y explained by X_1, X_2, …, X_p.
1 − R² = SS_Error / SS_Total = the proportion of variance in Y that is left unexplained by X_1, X_2, …, X_p.
b) R_a² = "R² adjusted" for degrees of freedom = 1 − [the proportion of variance in Y that is left unexplained by X_1, X_2, …, X_p, adjusted for d.f.]
R_a² = 1 − MS_Error / MS_Total = 1 − [(n − 1) / (n − p − 1)] (SS_Error / SS_Total) = 1 − [(n − 1) / (n − p − 1)] (1 − R²)
c) R = √R² = the multiple correlation coefficient of Y with X_1, X_2, …, X_p = √(SS_Reg / SS_Total) = the maximum correlation between Y and a linear combination of X_1, X_2, …, X_p.
Comment: the statistics F, R², R_a² and R are equivalent statistics.
Logistic regression
The dependent variable
y
is
binary.
It takes on two values: “Success” (1) or “Failure” (0). We are interested in predicting y from a continuous independent variable x. This is the situation in which Logistic Regression is used.
The Logistic Regression Model
Let p denote P[y = 1] = P[Success]. This quantity will increase with the value of x.
The ratio p / (1 − p) is called the odds ratio. This quantity will also increase with the value of x, ranging from zero to infinity.
The quantity ln[p / (1 − p)] is called the log odds ratio.
The Logistic Regression Model
Assumes the log odds ratio is linearly related to x, i.e.:
ln[p / (1 − p)] = β_0 + β_1 x
In terms of the odds ratio:
p / (1 − p) = e^(β_0 + β_1 x)
The Logistic Regression Model
Solving for p in terms of x:
p = e^(β_0 + β_1 x) / [1 + e^(β_0 + β_1 x)]
Interpretation of the parameter β_0 (determines the intercept): when x = 0 the odds ratio is p / (1 − p) = e^(β_0).
Interpretation of the parameter β_1 (along with β_0, determines where p is 0.50): p = 1/2 when β_0 + β_1 x = 0, i.e. when x = −β_0 / β_1.
Interpretation of the parameter β_1 (determines the slope where p is 0.50): at that point the slope of the curve is β_1 / 4.
[Figures: logistic curves of p against x illustrating each interpretation.]
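These interpretations are easy to verify numerically. A small sketch with hypothetical parameter values; `logistic_p` is our own helper, and the slope is checked by a central difference.

```python
import math

def logistic_p(x, b0, b1):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), the logistic model."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

b0, b1 = -2.0, 0.8          # hypothetical parameter values
x_half = -b0 / b1           # the x at which p = 0.50

# numerical slope of the curve at x_half; the slides state it equals b1 / 4
h = 1e-6
slope = (logistic_p(x_half + h, b0, b1) - logistic_p(x_half - h, b0, b1)) / (2 * h)
```

At x = 0 the computed odds ratio p / (1 − p) equals e^(β_0), at x = −β_0/β_1 the probability is exactly 0.5, and the numerical slope there matches β_1 / 4.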
The Multiple Logistic Regression model
Here we attempt to predict the outcome of a binary response variable Y from several independent variables X_1, X_2, … etc.
ln[p / (1 − p)] = β_0 + β_1 X_1 + … + β_p X_p
or
p = e^(β_0 + β_1 X_1 + … + β_p X_p) / [1 + e^(β_0 + β_1 X_1 + … + β_p X_p)]
Nonparametric Statistical Methods
Definition
When the data is generated from a process (model) that is known except for a finite number of unknown parameters, the model is called a parametric model. Otherwise, the model is called a non-parametric model. Statistical techniques that assume a non-parametric model are called non-parametric.
Nonparametric Statistical Methods
The sign test
A nonparametric test for the central location of a distribution
To carry out the Sign test:
1. Compute the test statistic: S = the number of observations that exceed m_0; let s_observed be the observed value of S.
2. Compute the p-value of the test statistic s_observed: p-value = P[S ≥ s_observed] (= 2 P[S ≥ s_observed] for a two-tailed test), where S is binomial with n = sample size and p = 0.50.
3. Reject H_0 if the p-value is low (< 0.05).
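The three steps can be sketched in a few lines of Python on hypothetical data. `sign_test_p_value` is our own helper; observations exactly equal to m_0 are dropped, a common convention the slides do not spell out.

```python
from math import comb

def sign_test_p_value(data, m0, two_tailed=True):
    """One-sample sign test for the median m0.
    Under H0, S = #(observations > m0) ~ Binomial(n, 0.5)."""
    kept = [x for x in data if x != m0]      # drop ties with m0
    n = len(kept)
    s_obs = sum(1 for x in kept if x > m0)
    upper = sum(comb(n, s) for s in range(s_obs, n + 1)) / 2 ** n  # P[S >= s_obs]
    return min(1.0, 2 * upper) if two_tailed else upper

# hypothetical data; H0: median = 8.0
data = [8.3, 9.1, 7.7, 10.2, 9.9, 8.8, 10.5, 9.4, 10.1, 9.7]
p_val = sign_test_p_value(data, m0=8.0)   # 9 of 10 exceed 8.0 -> small p-value
```

Here s_observed = 9 out of n = 10, so the one-tailed p-value is P[S ≥ 9] = 11/1024 ≈ 0.011, and the two-tailed p-value is about 0.021: reject H_0 at the 0.05 level.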
Sign Test for Large Samples
If n is large we can use the Normal approximation to the Binomial. Namely, S has a Binomial distribution with p = ½ and n = sample size. Hence for large n, S has approximately a Normal distribution with mean
μ_S = np = n/2
and standard deviation
σ_S = √(npq) = √(n · ½ · ½) = √n / 2
Hence for large n, use as the test statistic (in place of S)
z = (S − μ_S) / σ_S = (S − n/2) / (√n / 2)
Choose the critical region for z from the Standard Normal distribution, i.e. reject H_0 if z < −z_(α/2) or z > z_(α/2) (two-tailed; a one-tailed test can also be set up).
Nonparametric
Confidence Intervals
Assume that the data, x_1, x_2, x_3, …, x_n, is a sample from an unknown distribution. Now arrange the data in increasing order:
x_(1) < x_(2) < x_(3) < … < x_(n)
Hence
x_(1) = the smallest observation
x_(2) = the 2nd smallest observation
x_(n) = the largest observation
Consider the kth smallest observation and the kth largest observation in the data: x_(k) and x_(n − k + 1).
Hence
P[x_(k) < median < x_(n − k + 1)]
= P[k ≤ the number of observations greater than the median ≤ n − k]
= p(k) + p(k + 1) + … + p(n − k) = P
where the p(i)’s are binomial probabilities with n = the sample size and p = 1/2.
This means that x_(k) to x_(n − k + 1) is a 100P% confidence interval for the median. Choose k so that P = p(k) + p(k + 1) + … + p(n − k) is close to 0.95 (or 0.99).
Summarizing: x_(k) to x_(n − k + 1) is a 100P% confidence interval for the median, where P = p(k) + p(k + 1) + … + p(n − k) and the p(i)’s are binomial probabilities with n = the sample size and p = 1/2.
For large values of n one can use the normal approximation to the Binomial to find the value of k so that x_(k) to x_(n − k + 1) is a 95% confidence interval for the median:
k ≈ n/2 − 1.96 √n / 2
Using k = n/2 − 1.96 √n / 2:
n    20    40    100   200
k    5.6   13.8  40.2  86.1
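A short sketch that searches for k exactly and compares it with the large-sample formula, on a hypothetical sample of n = 20; `median_ci` is our own helper.

```python
from math import comb, sqrt

def median_ci(data, target=0.95):
    """Largest k with coverage P = sum of Binomial(n, 1/2) probabilities
    from k to n-k at least `target`; returns the interval endpoints too."""
    xs = sorted(data)
    n = len(xs)
    best = None
    for k in range(1, n // 2 + 1):
        P = sum(comb(n, i) for i in range(k, n - k + 1)) / 2 ** n
        if P >= target:
            best = (k, P)          # keep the largest (shortest-interval) k
    k, P = best
    return xs[k - 1], xs[n - k], k, P

data = list(range(1, 21))                   # hypothetical sample, n = 20
lo, hi, k, P = median_ci(data)              # k = 6, coverage about 0.959
approx_k = 20 / 2 - 1.96 * sqrt(20) / 2     # the slide's formula gives about 5.6
```

The exact search gives k = 6 with P ≈ 0.959, in agreement with the table value 5.6 rounded up.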
The Wilcoxon Signed Rank Test
An Alternative to the sign test
• For Wilcoxon’s signed-rank test we assign ranks to the absolute values of (x_1 − m_0, x_2 − m_0, …, x_n − m_0):
• a rank of 1 to the value of x_i − m_0 which is smallest in absolute value;
• a rank of n to the value of x_i − m_0 which is largest in absolute value.
W+ = the sum of the ranks associated with positive values of x_i − m_0.
W− = the sum of the ranks associated with negative values of x_i − m_0.
To carry out Wilcoxon’s signed rank test we:
1. Compute T = W+ or W− (usually it would be the smaller of the two).
2. Let t_observed = the observed value of T.
3. Compute the p-value = P[T ≤ t_observed] (2 P[T ≤ t_observed] for a two-tailed test).
   i. For n ≤ 12 use the table.
   ii. For n > 12 use the Normal approximation.
4. Conclude H_A (reject H_0) if the p-value is less than 0.05 (or 0.01).
For sample sizes n > 12 we can use the fact that T (W+ or W−) has approximately a normal distribution with mean
μ_T = n(n + 1) / 4
and standard deviation
σ_T = √[n(n + 1)(2n + 1) / 24]
and
P[T ≤ t] = P[(T − μ_T)/σ_T ≤ (t − μ_T)/σ_T] = P[Z ≤ (t − μ_T)/σ_T]
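The whole procedure, including averaged ranks for tied |x_i − m_0| and the large-sample normal approximation, can be sketched as follows on hypothetical data; `wilcoxon_signed_rank` is our own helper.

```python
from math import sqrt, erf

def wilcoxon_signed_rank(data, m0):
    """Compute W+, W-, T = min(W+, W-) and the one-sided p-value
    P[T <= t_observed] via the normal approximation (n > 12)."""
    diffs = [x - m0 for x in data if x != m0]   # drop values equal to m0
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                # average ranks over tied |d|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    T = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (T - mu) / sigma
    p_one_sided = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return w_plus, w_minus, T, p_one_sided

# hypothetical data (n = 13 > 12); H0: m0 = 10.0
data = [12.1, 9.3, 10.8, 11.5, 13.0, 12.7, 10.2,
        11.9, 12.4, 13.6, 11.1, 12.9, 10.9]
wp, wm, T, p = wilcoxon_signed_rank(data, m0=10.0)
```

As a sanity check, W+ + W− always equals n(n + 1)/2, the sum of all ranks; here only one difference is negative, so T = W− is tiny and the p-value is well below 0.05.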
Comments
1. The t-test
   i. This test requires the assumption of normality.
   ii. If the data is not normally distributed the test is invalid: the probability of a type I error may not be equal to its desired value (0.05 or 0.01).
   iii. If the data is normally distributed, the t-test commits type II errors with a smaller probability than any other test (in particular Wilcoxon’s signed rank test or the sign test).
2. The sign test
   i. This test does not require the assumption of normality (true also for Wilcoxon’s signed rank test).
   ii. This test ignores the magnitude of the observations completely. Wilcoxon’s test takes the magnitude into account by ranking them.
Two-sample Non-parametric Tests
Mann-Whitney Test
A non-parametric two sample test for comparison of central location
The Mann-Whitney Test
• This is a non-parametric alternative to the two sample t test (or z test) for independent samples.
• These tests (t and z) assume the data is normal.
• The Mann-Whitney test does not make this assumption.
• Sample of n from population 1: x_1, x_2, x_3, …, x_n
• Sample of m from population 2: y_1, y_2, y_3, …, y_m
The Mann-Whitney test statistics
U_1 and U_2
Arrange the observations from the two samples combined in increasing order (retaining sample membership) and assign ranks to the observations.
Let W_1 = the sum of the ranks for sample 1.
Let W_2 = the sum of the ranks for sample 2.
Then
U_1 = nm + n(n + 1)/2 − W_1
and
U_2 = nm + m(m + 1)/2 − W_2
• The distribution function of U (U_1 or U_2) has been tabled for various values of n and m (< n) for when the two samples come from the same distribution.
• These tables can be used to set up critical regions for the Mann-Whitney U test.
The Mann-Whitney test for large samples
For large samples (n > 10 and m > 10) the statistics U_1 and U_2 have approximately a Normal distribution with mean
μ_{U_i} = nm / 2
and standard deviation
σ_{U_i} = √[nm(n + m + 1) / 12]
Thus we can convert U_i to a standard normal statistic
z = (U_i − μ_{U_i}) / σ_{U_i} = (U_i − nm/2) / √[nm(n + m + 1)/12]
and reject H_0 if z < −z_(α/2) or z > z_(α/2) (for a two-tailed test).
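A sketch of the rank-sum bookkeeping and the large-sample z statistic, on hypothetical samples with n = m = 11; `mann_whitney` is our own helper and ties across the combined sample are assumed absent.

```python
from math import sqrt

def mann_whitney(x, y):
    """Rank sums W1, W2 and the Mann-Whitney statistics U1, U2,
    plus the large-sample z based on U1 (no ties assumed)."""
    n, m = len(x), len(y)
    labeled = sorted([(v, 1) for v in x] + [(v, 2) for v in y])
    w1 = sum(rank for rank, (_, g) in enumerate(labeled, start=1) if g == 1)
    w2 = (n + m) * (n + m + 1) // 2 - w1         # ranks 1..n+m sum to this total
    u1 = n * m + n * (n + 1) // 2 - w1
    u2 = n * m + m * (m + 1) // 2 - w2
    mu = n * m / 2
    sigma = sqrt(n * m * (n + m + 1) / 12)
    return u1, u2, (u1 - mu) / sigma

# hypothetical samples; sample 1 tends to be larger
x = [11, 14, 15, 18, 21, 23, 24, 26, 29, 31, 33]
y = [8, 9, 10, 12, 13, 16, 17, 19, 20, 22, 25]
u1, u2, z = mann_whitney(x, y)
```

A useful check: U_1 + U_2 = nm always. Here z ≈ −2.13, beyond −z_{0.025} = −1.96, so the two-tailed test rejects H_0 at the 0.05 level.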
The Kruskal Wallis Test
• Comparing the central location for k populations
• A nonparametric alternative to the one-way ANOVA F-test
Situation: data is collected from k populations. The sample size from population i is n_i. The data from population i is:
x_{i1}, x_{i2}, …, x_{i n_i},   i = 1, 2, …, k
The computation of the Kruskal-Wallis statistic: we group the N = n_1 + n_2 + … + n_k observations from the k populations together and rank these observations from 1 to N. Let r_{ij} be the rank associated with the observation x_{ij}.
Handling of “tied” observations: if a group of observations are equal, the ranks that would have been assigned to those observations are averaged.
The Kruskal-Wallis statistic
K = [12 / (N(N + 1))] Σ_{i=1}^{k} U_i² / n_i − 3(N + 1)
where
U_i = Σ_{j=1}^{n_i} r_{ij} = the sum of the ranks for the ith sample
The Kruskal-Wallis test: reject
H_0: the k populations have the same central location
if K > χ²_α with k − 1 degrees of freedom.
Probability Theory
Probability – Models for random phenomena
Definitions
The sample space, S: the sample space, S, for a random phenomenon is the set of all possible outcomes.
An event, E: the event, E, is any subset of the sample space, S, i.e. any set of outcomes (not necessarily all outcomes) of the random phenomenon.
[Venn diagram: the sample space S drawn as a rectangle with the event E as a region inside it.]
The event, E, is said to have occurred if, after the outcome has been observed, the outcome lies in E.
Set operations on Events
Union
Let A and B be two events; then the union of A and B is the event (denoted by A ∪ B) defined by:
A ∪ B = {e | e belongs to A or e belongs to B}
The event A ∪ B occurs if the event A occurs or the event B occurs.
Intersection
Let A and B be two events; then the intersection of A and B is the event (denoted by A ∩ B) defined by:
A ∩ B = {e | e belongs to A and e belongs to B}
The event A ∩ B occurs if the event A occurs and the event B occurs.
Complement
Let A be any event; then the complement of A (denoted by Ā) is defined by:
Ā = {e | e does not belong to A}
The event Ā occurs if the event A does not occur.
In problems you will recognize that you are working with:
1. Union if you see the word or,
2. Intersection if you see the word and,
3. Complement if you see the word not.
Definition: mutually exclusive. Two events A and B are called mutually exclusive if:
A ∩ B = ∅
If two events A and B are mutually exclusive then:
1. They have no outcomes in common.
2. They can’t occur at the same time: the outcome of the random experiment cannot belong to both A and B.
Rules of Probability
The additive rule
P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
and
P[A ∪ B] = P[A] + P[B] if A ∩ B = ∅
The rule for complements: for any event E,
P[Ē] = 1 − P[E]
Conditional probability
P[A | B] = P[A ∩ B] / P[B] if P[B] ≠ 0
The multiplicative rule of probability
P[A ∩ B] = P[A] P[B | A] = P[B] P[A | B]
and
P[A ∩ B] = P[A] P[B] if A and B are independent.
This is the definition of independence.
Counting techniques
Summary of counting rules
Rule 1
n(A_1 ∪ A_2 ∪ A_3 ∪ …) = n(A_1) + n(A_2) + n(A_3) + …
if the sets A_1, A_2, A_3, … are pairwise mutually exclusive (i.e. A_i ∩ A_j = ∅)
Rule 2
N = n_1 n_2 = the number of ways that two operations can be performed in sequence if
n_1 = the number of ways the first operation can be performed
n_2 = the number of ways the second operation can be performed once the first operation has been completed.
Rule 3
N = n_1 n_2 … n_k = the number of ways that k operations can be performed in sequence if
n_1 = the number of ways the first operation can be performed
n_i = the number of ways the ith operation can be performed once the first (i − 1) operations have been completed, i = 2, 3, …, k
Basic counting formulae
1. Orderings: n! = the number of ways you can order n objects
2. Permutations: nPk = n! / (n − k)! = the number of ways that you can choose k objects from n in a specific order
3. Combinations: nCk = n! / [k!(n − k)!] = the number of ways that you can choose k objects from n (order of selection irrelevant)
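Python’s standard library exposes all three formulas directly (`math.perm` and `math.comb` require Python 3.8+), shown here for the hypothetical choice of k = 3 objects from n = 10.

```python
from math import factorial, perm, comb

n, k = 10, 3
orderings = factorial(n)      # n! ways to order all n objects
permutations = perm(n, k)     # nPk = n!/(n-k)!: choose k in a specific order
combinations = comb(n, k)     # nCk = n!/(k!(n-k)!): order irrelevant
```

Note the relation nPk = nCk · k!: every unordered selection of k objects can be ordered in k! ways.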
Random Variables
Numerical quantities whose values are determined by the outcome of a random experiment.
Random variables are either:
• Discrete: integer valued; the set of possible values for X are integers
• Continuous: the set of possible values for X are all real numbers; they range over a continuum
The Probability distribution of A random variable
A Mathematical description of the possible values of the random variable together with the probabilities of those values
The probability distribution of a discrete random variable is described by its probability function p(x):
p(x) = the probability that X takes on the value x.
This can be given in either a tabular form or in the form of an equation. It can also be displayed in a graph.
Comments:
Every probability function must satisfy:
1. The probability assigned to each value of the random variable must be between 0 and 1, inclusive: 0 ≤ p(x) ≤ 1
2. The sum of the probabilities assigned to all the values of the random variable must equal 1: Σ_x p(x) = 1
3. P[a ≤ X ≤ b] = Σ_{x=a}^{b} p(x) = p(a) + p(a + 1) + … + p(b)
Probability Distributions of Continuous Random Variables
Probability Density Function: the probability distribution of a continuous random variable is described by its probability density curve f(x).
Notes:
• The total area under the probability density curve is 1.
• The area under the probability density curve from a to b is P[a < X < b].
[Figure: a Normal probability distribution (bell shaped curve) with mean μ, the area between a and b shaded.]
Mean, Variance and standard deviation of Random Variables
Numerical descriptors of the distribution of a Random Variable
Mean of a Discrete Random Variable
• The mean, μ, of a discrete random variable x is found by multiplying each possible value of x by its own probability and then adding all the products together:
μ = Σ x p(x) = x_1 p_1 + x_2 p_2 + … + x_k p_k
Notes:
• The mean is a weighted average of the values of X.
• The mean is the long-run average value of the random variable.
• The mean is the centre of gravity of the probability distribution of the random variable.
Variance of a Discrete Random Variable:
The variance, σ², of a discrete random variable x is found by multiplying each possible value of the squared deviation from the mean, (x − μ)², by its own probability and then adding all the products together:
σ² = Σ (x − μ)² p(x) = Σ x² p(x) − μ²
Standard Deviation of a Discrete Random Variable: the positive square root of the variance:
σ = √σ²
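A quick sketch on a hypothetical probability function, checking the defining formula for σ² against the computational shortcut Σ x² p(x) − μ².

```python
# hypothetical probability function p(x) for a discrete random variable
dist = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

assert abs(sum(dist.values()) - 1) < 1e-12      # probabilities must sum to 1

mu = sum(x * p for x, p in dist.items())                      # mean: 1.7
var = sum((x - mu) ** 2 * p for x, p in dist.items())         # defining formula
var_shortcut = sum(x * x * p for x, p in dist.items()) - mu ** 2
sd = var ** 0.5                                               # 0.9
```

Both routes give σ² = 0.81, so σ = 0.9 for this distribution.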
The Binomial distribution
An important discrete distribution
X
is said to have the
Binomial distribution
with parameters
n
and
p.
1. X is the number of successes occurring in the n repetitions of a Success-Failure experiment.
2. The probability of success is p.
3. The probability function is
p(x) = (n choose x) p^x (1 − p)^(n − x)
Mean, Variance & Standard Deviation of the Binomial Distribution
• The mean, variance and standard deviation of the binomial distribution can be found by using the following three formulas:
1. μ = np
2. σ² = npq = np(1 − p)
3. σ = √(npq) = √[np(1 − p)]
Mean of a Continuous Random Variable (uses calculus)
• The mean, μ, of a continuous random variable x is
μ = ∫ x f(x) dx
Notes:
• The mean is a weighted average of the values of X.
• The mean is the long-run average value of the random variable.
• The mean is the centre of gravity of the probability distribution of the random variable.
Variance of a Continuous Random Variable
σ² = ∫ (x − μ)² f(x) dx
Standard Deviation of a Continuous Random Variable: the positive square root of the variance:
σ = √σ²
The Normal Probability Distribution
[Figure: bell curve with points of inflection marked, and the axis labelled μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, μ + 3σ.]
Main characteristics of the Normal Distribution:
• Bell shaped, symmetric
• Points of inflection on the bell shaped curve are at μ − σ and μ + σ, that is, one standard deviation from the mean
• Area under the bell shaped curve between μ − σ and μ + σ is approximately 2/3
• Area under the bell shaped curve between μ − 2σ and μ + 2σ is approximately 95%
Normal approximation to the Binomial distribution
Using the Normal distribution to calculate Binomial probabilities
• Normal approximation to the Binomial distribution:
P[a ≤ X ≤ b] = p(a) + p(a + 1) + … + p(b) ≈ P[a − ½ ≤ Y ≤ b + ½]
where X has a Binomial distribution with parameters n and p, and Y has a Normal distribution with
μ = np, σ = √(npq)
The ± ½ is the continuity correction.
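A sketch comparing the exact binomial probability with the continuity-corrected normal approximation, for hypothetical n, p, a and b; `Phi` is our own standard-normal CDF built from `math.erf`.

```python
from math import comb, erf, sqrt

n, p = 25, 0.5        # hypothetical binomial parameters
a, b = 10, 15         # P[10 <= X <= 15]

# exact binomial probability: p(a) + p(a+1) + ... + p(b)
exact = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(a, b + 1))

# normal approximation with continuity correction
mu, sigma = n * p, sqrt(n * p * (1 - p))

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

approx = Phi((b + 0.5 - mu) / sigma) - Phi((a - 0.5 - mu) / sigma)
```

For these values the exact probability is about 0.7705 and the approximation about 0.7699, agreeing to roughly three decimal places.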
Sampling Theory
Determining the distribution of Sample statistics
The distribution of the sample mean
Thus if x_1, x_2, …, x_n denote n independent random variables each coming from the same Normal distribution with mean μ and standard deviation σ, then
x̄ = (1/n) Σ_{i=1}^{n} x_i
has a Normal distribution with mean μ_x̄ = μ, variance σ²_x̄ = σ²/n, and standard deviation σ_x̄ = σ/√n.
The Central Limit Theorem
The Central Limit Theorem (C.L.T.) states that if n is sufficiently large, the sample means of random samples from any population with mean μ and finite standard deviation σ are approximately normally distributed with mean μ and standard deviation σ/√n.
Technical note: the mean and standard deviation given in the CLT hold for any sample size; it is only the “approximately normal” shape that requires n to be sufficiently large.
Graphical Illustration of the Central Limit Theorem
[Figures: the original population distribution, and the distributions of x̄ for n = 2, n = 10 and n = 30, showing the sampling distribution becoming more normal and less spread out as n increases.]
Implications of the Central Limit Theorem
• The conclusion that the sampling distribution of the sample mean is Normal will be true if the sample size is large (> 30), even though the population may be non-normal.
• When the population can be assumed to be normal, the sampling distribution of the sample mean is Normal for any sample size.
• Knowing the sampling distribution of the sample mean allows us to answer probability questions related to the sample mean.
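The theorem is easy to illustrate by simulation. The sketch below draws repeated samples from a decidedly non-normal (uniform) population; the population, sample size and repetition count are hypothetical choices, and the seed is fixed for reproducibility.

```python
import random
from math import sqrt

random.seed(0)

# non-normal population: uniform on [0, 30)
mu, sigma = 15.0, 30 / sqrt(12)   # population mean and standard deviation

n, reps = 30, 20000               # sample size and number of sample means
means = [sum(random.uniform(0, 30) for _ in range(n)) / n for _ in range(reps)]

sample_mu = sum(means) / reps
sample_sd = (sum((m - sample_mu) ** 2 for m in means) / reps) ** 0.5
# CLT predicts: mean of x-bar ~ mu, standard deviation of x-bar ~ sigma / sqrt(n)
```

The simulated mean of x̄ lands near μ = 15 and its standard deviation near σ/√n ≈ 1.58, as the CLT predicts.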
Sampling Distribution of a Sample Proportion
Sampling Distribution for Sample Proportions
Let p = population proportion of interest, or binomial probability of success.
Let p̂ = X/n = (no. of successes)/(no. of binomial trials) = sample proportion, or proportion of successes.
Then the sampling distribution of p̂ is approximately a normal distribution with mean μ_p̂ = p and standard deviation σ_p̂ = √[p(1 − p)/n].
Sampling distribution of differences
Note: if X, Y are independent normal random variables, then X − Y is normal with mean μ_X − μ_Y and standard deviation √(σ_X² + σ_Y²).
Sampling distribution of a
difference
in two Sample means
Situation
• We have two normal populations (1 and 2)
• Let μ_1 and σ_1 denote the mean and standard deviation of population 1.
• Let μ_2 and σ_2 denote the mean and standard deviation of population 2.
• Let x_1, x_2, x_3, …, x_n denote a sample from normal population 1.
• Let y_1, y_2, y_3, …, y_m denote a sample from normal population 2.
• Objective is to compare the two population means.
Then D = x̄ − ȳ is Normal with mean
μ_D = μ_x̄ − μ_ȳ = μ_1 − μ_2
and standard deviation
σ_D = √(σ_x̄² + σ_ȳ²) = √(σ_1²/n + σ_2²/m)
Sampling distribution of a
difference
in two Sample proportions
Situation
• Suppose we have two Success-Failure experiments.
• Let p_1 = the probability of success for experiment 1.
• Let p_2 = the probability of success for experiment 2.
• Suppose that experiment 1 is repeated n_1 times and experiment 2 is repeated n_2 times.
• Let x_1 = the no. of successes in the n_1 repetitions of experiment 1, and x_2 = the no. of successes in the n_2 repetitions of experiment 2. Then p̂_1 = x_1/n_1 and p̂_2 = x_2/n_2.
Then D = p̂_1 − p̂_2 is approximately Normal with mean
μ_D = μ_p̂_1 − μ_p̂_2 = p_1 − p_2
and standard deviation
σ_D = √(σ_p̂_1² + σ_p̂_2²) = √[p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2]
The Chi-square (χ²) distribution
The Chi-squared distribution with n degrees of freedom
Comment: if z_1, z_2, …, z_n are independent random variables each having a standard normal distribution, then
U = z_1² + z_2² + … + z_n²
has a chi-squared distribution with n degrees of freedom.
[Figures: chi-squared density curves for various degrees of freedom, e.g. 2, 3 and 4 d.f.]
Statistics that have the Chi-squared distribution:
1. χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (x_{ij} − E_{ij})² / E_{ij}
This statistic is used to detect independence between two categorical variables; d.f. = (r − 1)(c − 1).
Let x_1, x_2, …, x_n denote a sample from the normal distribution with mean μ and standard deviation σ; then
2. U = Σ_{i=1}^{n} (x_i − x̄)² / σ² = (n − 1)s² / σ²
has a chi-square distribution with d.f. = n − 1.
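A sketch of statistic 1 on a hypothetical 2 × 2 table of counts; `chi2_statistic` is our own helper, with E_{ij} = (row total)(column total)/N.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table
    of observed counts x_ij, with expected counts under independence."""
    r, c = len(table), len(table[0])
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    N = sum(row)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            E = row[i] * col[j] / N          # expected count under independence
            chi2 += (table[i][j] - E) ** 2 / E
    return chi2, (r - 1) * (c - 1)

table = [[30, 20],
         [10, 40]]                            # hypothetical observed counts
chi2, df = chi2_statistic(table)              # chi2 about 16.67, df = 1
```

Here χ² ≈ 16.67 on 1 degree of freedom, far beyond the 5% critical value 3.841, so independence of the two categorical variables would be rejected.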