Statistics 111 - Lecture 13
Inference for a
Population Mean
Confidence Intervals and Tests
with unknown variance and Two-Sample Tests
June 23, 2008
Stat 111 - Lecture 13 - One Mean
Administrative Notes
• Homework 4 due Wednesday
• Homework 5 assigned tomorrow
• The final is ridiculously close (next Thursday)
Outline
• Review:
• Confidence Intervals and Hypothesis Tests which
assume known variance
• Population variance unknown:
• t-distribution
• Confidence intervals and Tests using the t-distribution
• Small sample situation
• Two-Sample datasets: comparing two means
• Testing the difference between two samples when
variances are known
• Moore, McCabe and Craig: 7.1-7.2
Chapter 5: Sampling Distribution of $\bar{x}$
• Distribution of values taken by sample mean in all
possible samples of size n from the same population
• Standard deviation of sampling distribution: $\sigma/\sqrt{n}$
• Central Limit Theorem:
Sample mean $\bar{x}$ has approximately a Normal distribution, $N(\mu, \sigma/\sqrt{n})$
• These results all assume that the sample size is
large and that the population variance is known
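The two results above can be checked by simulation. The sketch below is only an illustration: the exponential population (mean 1, SD 1), the sample size, and the number of repetitions are assumptions chosen for the demonstration, not values from the lecture.

```python
import random
import statistics
from math import sqrt

def simulate_sample_means(draw, n, reps, seed=0):
    """Draw `reps` samples of size n and return the list of their sample means."""
    rng = random.Random(seed)
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]

def draw_exponential(rng):
    """Hypothetical skewed population: exponential with mean 1 and SD 1."""
    return rng.expovariate(1.0)

n = 50
means = simulate_sample_means(draw_exponential, n=n, reps=5000)

center = statistics.fmean(means)      # should be close to mu = 1
observed_sd = statistics.stdev(means) # spread of the simulated sample means
theoretical_sd = 1.0 / sqrt(n)        # sigma / sqrt(n) with sigma = 1

print(round(center, 3), round(observed_sd, 3), round(theoretical_sd, 3))
```

Even though the population is strongly skewed, the simulated sample means cluster around the population mean with a spread close to $\sigma/\sqrt{n}$.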
Chapter 6: Confidence Intervals
• We used sampling distribution results to create two
different tools for inference
• Confidence Intervals: Use sample mean as the
center of an interval of likely values for pop. mean $\mu$
• Width of interval is a multiple Z* of standard deviation
of sample mean
• Z* calculated from N(0,1) table for specific confidence
level (e.g. 95% confidence means Z*=1.96)
• We assume large sample size to use N(0,1)
distribution, and we assume that $\sigma$ is known (usually
just use sample SD s)
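A minimal sketch of the known-variance interval described above, using the standard library's `statistics.NormalDist` in place of the N(0,1) table. The function name `z_interval` and the summary values in the example call are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is treated as known."""
    # Z* leaves (1 - confidence) / 2 in each tail of N(0,1).
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin

z_star_95 = NormalDist().inv_cdf(0.975)  # about 1.96

# Hypothetical summary statistics, for illustration only.
low, high = z_interval(xbar=100.0, sigma=15.0, n=36)
print(round(z_star_95, 2), round(low, 2), round(high, 2))
```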
Chapter 6: Hypothesis Testing
• Compare sample mean to a hypothesized population
mean $\mu_0$
• Test statistic is the difference $\bar{x} - \mu_0$ expressed as a multiple
of the standard deviation of the sample mean
• p-value calculated from N(0,1) table and compared to
$\alpha$-level in order to reject or fail to reject null hypothesis
• E.g. p-value < 0.05 means we reject null hypothesis
• We again assume large sample size to use N(0,1)
distribution, and we assume that $\sigma$ is known
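The test above can be sketched as follows. The function names are hypothetical, and the first example call uses made-up summary values. The statistic -2.32 is the one quoted for the calcium example later in this lecture.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(xbar, mu0, sigma, n):
    """Distance of the sample mean from mu0 in units of sigma / sqrt(n)."""
    return (xbar - mu0) / (sigma / sqrt(n))

def two_sided_p_value(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical summary statistics, for illustration only.
z_example = z_statistic(xbar=105.0, mu0=100.0, sigma=15.0, n=36)

# Statistic reported for the calcium example (known sigma = 188).
p = two_sided_p_value(-2.32)
reject = p < 0.05
print(round(z_example, 2), round(p, 3), reject)
```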
Unknown Population Variance
• What if we don’t want to assume that population SD $\sigma$
is known?
• If $\sigma$ is unknown, we can’t use our formula for the
standard deviation of the sample mean: $SD(\bar{x}) = \sigma/\sqrt{n}$
• Instead, we use the standard error of the sample
mean: $SE(\bar{x}) = s/\sqrt{n}$
• Standard error involves sample SD s as estimate of $\sigma$
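As a small illustration of the standard error, the sketch below computes $s/\sqrt{n}$ for the eight weekend birth counts that appear in the blackout table later in this lecture (the lecture reports their sample SD as 24.5). The function name `standard_error` is hypothetical.

```python
from math import sqrt
import statistics

def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample SD, with n - 1 in the denominator
    return s / sqrt(len(sample))

# Weekend (Sat/Sun) births per day, August 1966, from the later blackout table.
weekend_births = [377, 344, 405, 377, 415, 356, 394, 399]

s = statistics.stdev(weekend_births)
se = standard_error(weekend_births)
print(round(s, 1), round(se, 2))
```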
t distribution
• If we have small sample size n and we need to use
the standard error formula because the population
SD $\sigma$ is unknown, then:
The sample mean does not have a
normal distribution!
• Instead, the sample mean has a
t distribution with n - 1 degrees of freedom
(strictly, this is the distribution of the standardized value $(\bar{x} - \mu)/(s/\sqrt{n})$)
• What the heck does that mean?!?
t distribution
• t distribution looks like a normal distribution, but has
“thicker” tails. The tail thickness is controlled by the
degrees of freedom
[Figure: density curves for the Normal distribution, t with df = 5, and t with df = 1]
• The smaller the degrees of freedom, the thicker the tails of
the t distribution
• If the degrees of freedom is large (if we have a large sample
size), then the t distribution is pretty much identical to the
normal distribution
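The tail behaviour described above can be checked numerically. This sketch writes out the t density directly (the helper `t_pdf` is not a library function) and compares densities at x = 3, a point chosen arbitrarily in the tail.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

x = 3.0  # a point in the tail
tail_density = {
    "normal": NormalDist().pdf(x),
    "t, df=1000": t_pdf(x, 1000),
    "t, df=5": t_pdf(x, 5),
    "t, df=1": t_pdf(x, 1),
}
for name, value in tail_density.items():
    print(f"{name:12s} {value:.5f}")
```

The density at x = 3 grows as the degrees of freedom shrink, and with df = 1000 it is nearly indistinguishable from the normal density.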
Known vs. Unknown Variance
• Before: Known population SD $\sigma$
• Sample mean is centered at $\mu$ and has standard deviation: $\sigma/\sqrt{n}$
• Sample mean has Normal distribution
• Now: Unknown population SD $\sigma$
• Sample mean is centered at $\mu$ and has standard error: $s/\sqrt{n}$
• Sample mean has t distribution with n-1 degrees of freedom
New Confidence Intervals
• If the population SD is unknown, we need a new
formula for our confidence interval
• Standard error used instead of standard deviation
• t distribution used instead of normal distribution
• If we have a sample of size n from a population with
unknown $\sigma$, then our 100·C% confidence interval
for the unknown population mean $\mu$ is: $\bar{x} \pm t^* \cdot s/\sqrt{n}$
• The critical value $t^*$ is calculated using a table for
the t distribution (back of textbook)
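A minimal sketch of this interval, with the critical value passed in as it would be read from the table. The function name `t_interval` and the summary statistics are hypothetical. The value t* = 2.16 is the one the lecture quotes for 95% confidence with n = 14.

```python
from math import sqrt

def t_interval(xbar, s, n, t_star):
    """Confidence interval xbar +/- t* * s / sqrt(n), with t* taken from a t table."""
    margin = t_star * s / sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample summary, for illustration only: xbar = 50, s = 8, n = 14.
low, high = t_interval(xbar=50.0, s=8.0, n=14, t_star=2.16)
t_margin = (high - low) / 2

# For comparison: the margin if the same SD were treated as a known sigma.
z_margin = 1.96 * 8.0 / sqrt(14)
print(round(low, 2), round(high, 2), round(t_margin, 3), round(z_margin, 3))
```

With the same SD, the t-based margin is larger than the normal-based one because t* exceeds Z*.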
Tables for the t distribution
• If we want a 100·C% confidence
interval, we need to find the value $t^*$
so that we have a probability of C
between $-t^*$ and $t^*$ in a t distribution
with n-1 degrees of freedom
• Example: 95% confidence interval
when n = 14 means that we need a
tail probability of 0.025, so $t^*$ = 2.16
[Figure: t distribution with df = 13; area 0.95 between $-t^*$ and $t^*$, area 0.025 in each tail]
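If no table is at hand, the critical value can be approximated numerically. The sketch below is one possible approach, not the textbook's method: it integrates the t density with Simpson's rule and searches for t* by bisection. All helper names are hypothetical, and the search assumes t* lies below 100.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df):
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    steps = 2 * max(50, int(50 * a) + 1)  # even number of panels, about 100 per unit
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    """Find t* with P(-t* < T < t*) = confidence by bisection on [0, 100]."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_star = t_critical(0.95, df=13)  # the lecture's example: n = 14
print(round(t_star, 2))
```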
Example: NYC blackout baby boom
• Births/day from August 1966:
• Before: we assumed that $\sigma$ was known, and used
the normal distribution for a 95% confidence interval: $\bar{x} \pm 1.96 \cdot \sigma/\sqrt{n}$
• Now: let $\sigma$ be unknown, and use the t distribution
with n-1 = 13 degrees of freedom to calculate our
95% confidence interval: $\bar{x} \pm 2.16 \cdot s/\sqrt{n}$
• Interval is now wider because we are now less
certain about our population SD $\sigma$
Another Example: Calcium in the Diet
• Daily calcium intake from 18 people below poverty
line (RDA is 850 mg/day)
• Before: used known $\sigma$ = 188 from previous study,
used normal distribution for 95% confidence interval: $\bar{x} \pm 1.96 \cdot 188/\sqrt{18}$
• Now: let $\sigma$ be unknown, and use the t distribution
with n-1 = 17 degrees of freedom to calculate our
95% confidence interval: $\bar{x} \pm t^* \cdot s/\sqrt{18}$
Again, wider interval because
we have an unknown $\sigma$
New Hypothesis Tests
• If the population SD is unknown, we need to modify our
test statistics and p-value calculations as well
• Standard error used in test statistic instead of
standard deviation
• t distribution used to calculate the p-value instead of
standard normal distribution
Example: Calcium in Diet
• Daily calcium intake from 18 people below poverty line
• Test our data against the null hypothesis that $\mu_0$ = 850
mg (recommended daily allowance)
• Before: we assumed known $\sigma$ = 188 and calculated
test statistic T= -2.32
• Now: $\sigma$ is actually unknown, and we use test statistic
with standard error instead of standard deviation: $T = (\bar{x} - \mu_0)/(s/\sqrt{n}) = -2.26$
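A sketch of the one-sample t statistic. The transcript does not preserve the sample mean and sample SD for the calcium data, so the values `xbar=747.0` and `s=193.0` below are illustrative assumptions, chosen only so that the result lands near the T = -2.26 quoted on the next slide. Only $\mu_0$ = 850 and n = 18 come from the lecture.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic: (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

# mu0 and n are from the lecture; xbar and s are ASSUMED illustrative values.
t = t_statistic(xbar=747.0, mu0=850.0, s=193.0, n=18)
print(round(t, 2))
```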
Example: Calcium in Diet
• Before: used normal distribution to get p-value = 0.02
[Figure: Normal distribution; area 0.01 in each tail beyond T = -2.32 and T = 2.32]
• Now: $\sigma$ is actually unknown, and we use t distribution
with n-1 = 17 degrees of freedom to get p-value ≈ 0.04
[Figure: t distribution with 17 df; area ≈ 0.02 in each tail beyond T = -2.26 and T = 2.26]
• With unknown $\sigma$, we have a p-value that is closer to
the usual threshold of $\alpha$ = 0.05 than before
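Both p-values can be reproduced approximately without tables. In this sketch the t tail area is obtained by Simpson's rule on the t density, which is one possible numerical approach. The helper names are hypothetical. The statistics -2.32 and -2.26 and df = 17 are from the lecture.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and |t|, by symmetry
    return 1 - central_area

p_normal = 2 * (1 - NormalDist().cdf(2.32))  # known-sigma test, T = -2.32
p_t = t_two_sided_p(-2.26, df=17)            # unknown-sigma test, T = -2.26
print(round(p_normal, 3), round(p_t, 3))
```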
Review
• Known population SD $\sigma$
• Use standard deviation of sample mean: $\sigma/\sqrt{n}$
• Use standard normal distribution
• Unknown population SD $\sigma$
• Use standard error of sample mean: $s/\sqrt{n}$
• Use t distribution with n-1 d.f.
Small Samples
• We have used the standard error and t distribution to
correct our assumption of known population SD
• However, even t distribution intervals/tests are not as
accurate if data are skewed or have influential outliers
• Rough guidelines from your textbook:
• Large samples (n > 40): t distribution can be used even for
strongly skewed data or with outliers
• Intermediate samples (15 ≤ n ≤ 40): t distribution can be used
except for strongly skewed data or presence of outliers
• Small samples (n < 15): t distribution can only be used if data
do not have skewness or outliers
• What can we do for small samples of skewed data?
Techniques for Small Samples
• One option: use log transformation on data
• Taking logarithm of data can often make it look more normal
• Another option: non-parametric tests like the sign test
• Not required for this course, but mentioned in textbook if
you’re interested
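To illustrate the log-transformation option, the sketch below compares a simple skewness measure before and after taking logs. The data are hypothetical right-skewed values invented for the example, and `skewness` is a hand-written helper (the mean cubed z-score), not a library function.

```python
import math
import statistics

def skewness(data):
    """Mean cubed z-score; near 0 for symmetric data, positive for right skew."""
    m = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return statistics.fmean([((x - m) / sd) ** 3 for x in data])

# Hypothetical right-skewed sample, for illustration only.
sample = [1.2, 1.5, 1.9, 2.4, 3.1, 4.0, 5.6, 8.3, 14.0, 31.0]
logged = [math.log(x) for x in sample]

skew_raw = skewness(sample)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

The logged values are noticeably less skewed, which is the sense in which the transformation makes the data "look more normal". Inference on the log scale then concerns the mean of the logged values rather than the original mean.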
Comparing Two Samples
• Up to now, we have looked at inference for one
sample of continuous data
• Our next focus in this course is comparing the data
from two different samples
• For now, we will assume that these two different
samples are independent of each other and come
from two distinct populations
Population 1: $\mu_1$, $\sigma_1$
Population 2: $\mu_2$, $\sigma_2$
Sample 1: $\bar{x}_1$, $s_1$
Sample 2: $\bar{x}_2$, $s_2$
Blackout Baby Boom Revisited
• Nine months (Monday, August 8th) after Nov 1965
blackout, NY Times claimed an increased birth rate
• Already looked at single two-week sample: found no
significant difference from usual rate (430 births/day)
• What if we instead look at difference between
weekends and weekdays?
Births per day, August 1966 (calendar layout):
Sun   Mon   Tue   Wed   Thu   Fri   Sat
      452   470   431   448   467   377
344   449   440   457   471   463   405
377   453   499   461   442   444   415
356   470   519   443   449   418   394
399   451   468   432
(Weekdays: Mon-Fri; Weekends: Sat and Sun)
Two-Sample Z test
• We want to test the null hypothesis that the two
populations have the same mean
• H0: $\mu_1 = \mu_2$ or equivalently, $\mu_1 - \mu_2 = 0$
• Two-sided alternative hypothesis: $\mu_1 - \mu_2 \neq 0$
• If we assume our population SDs $\sigma_1$ and $\sigma_2$ are
known, we can calculate a two-sample Z statistic: $Z = (\bar{x}_1 - \bar{x}_2) / \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$
• We can then calculate a p-value from this Z statistic
using the standard normal distribution
• Next class, we will look at tests that do not assume
known $\sigma_1$ and $\sigma_2$
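A minimal sketch of the two-sample Z test from summary statistics. The function names and the numbers in the example call are hypothetical, chosen so the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(xbar1, xbar2, sigma1, sigma2, n1, n2):
    """Z statistic for H0: mu1 = mu2 when both population SDs are known."""
    se_diff = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return (xbar1 - xbar2) / se_diff

def two_sided_p_value(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical summary statistics, for illustration only.
z = two_sample_z(xbar1=52.0, xbar2=50.0, sigma1=4.0, sigma2=3.0, n1=64, n2=36)
p = two_sided_p_value(z)
print(round(z, 3), round(p, 4))
```

Here the standard deviation of the difference is $\sqrt{16/64 + 9/36} = \sqrt{0.5}$, so Z is 2 divided by about 0.707.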
Two-Sample Z test for Blackout Data
• To use Z test, we need to assume that our pop. SDs
are known: $\sigma_1 = s_1 = 21.7$ and $\sigma_2 = s_2 = 24.5$
• We can then calculate a two-sided p-value for Z=7.5
using the standard normal distribution
• From normal table, P(Z > 7.5) is less than 0.0002, so our
p-value = 2 × P(Z > 7.5) is less than 0.0004
• We reject the null hypothesis at $\alpha$-level of 0.05 and
conclude there is a significant difference between
birth rates on weekends and weekdays
• Next class: get rid of assumption of known $\sigma_1$ and $\sigma_2$
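The numbers on this slide can be reproduced from the daily birth counts. The sketch below assumes the table is read in calendar order with 1 August 1966 falling on a Monday, so that Saturdays and Sundays form the weekend group. This reading matches the reported SDs of 21.7 and 24.5.

```python
import statistics
from math import sqrt
from statistics import NormalDist

# Births per day, 1-31 August 1966, in calendar order (1 August assumed a Monday).
births = [
    452, 470, 431, 448, 467, 377, 344,
    449, 440, 457, 471, 463, 405, 377,
    453, 499, 461, 442, 444, 415, 356,
    470, 519, 443, 449, 418, 394, 399,
    451, 468, 432,
]

# Position within the week: 0-4 = Mon-Fri, 5 = Sat, 6 = Sun.
weekdays = [b for i, b in enumerate(births) if i % 7 < 5]
weekends = [b for i, b in enumerate(births) if i % 7 >= 5]

s1 = statistics.stdev(weekdays)  # treated as the known sigma_1
s2 = statistics.stdev(weekends)  # treated as the known sigma_2

diff = statistics.fmean(weekdays) - statistics.fmean(weekends)
z = diff / sqrt(s1 ** 2 / len(weekdays) + s2 ** 2 / len(weekends))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(len(weekdays), len(weekends), round(s1, 1), round(s2, 1), round(z, 1))
```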
Next Class – Lecture 14
• More on Comparing Means between
Two Samples
• Moore, McCabe and Craig: 7.1-7.2