Lecture Note 6


Biostatistics
Unit 6 – Confidence Intervals
Statistical inference
Statistical inference is the procedure by which
we reach a conclusion about a population on
the basis of the information contained in a
sample drawn from that
population. Estimation involves the use of the
data in the sample to calculate the
corresponding parameter in the population
from which the sample was drawn.
Point estimate
A point estimate is a single numerical
value used to estimate the
corresponding population parameter.
Interval estimate
An interval estimate consists of two
numerical values defining a range that, with a
specified degree of confidence, we believe
includes the parameter being estimated.
Estimator
An estimator is a rule or formula that tells
how to compute the estimate.
An estimator is unbiased if its expected value
equals the population parameter it estimates.
Table of unbiased estimators
Sampled and target populations
The sampled population is the population
from which we actually draw the
sample. The target population is the
population about which we wish to make
an inference.
These two populations may or may not be the
same. When they are the same, it is possible
to use statistical inference procedures to
draw conclusions about the target
population. If the sampled and target
populations are different, conclusions can be
made about the target population only on the
basis of nonstatistical considerations.
Random and nonrandom samples
The strict validity of statistical procedures
depends on the assumption of random
samples.
Confidence intervals to be studied
A) Confidence Interval for a Population mean
B) Confidence Interval for the Difference of Two
Population Means
C) Confidence Interval for a Population Proportion
D) Confidence Interval for the Difference of Two
Population Proportions
E) Confidence Interval for the Variance of a Normally
Distributed Population
F) Confidence Interval for the Ratio of Variances of
Two Normally Distributed Populations
A) Confidence interval for a population mean
Estimating the mean
Estimating the mean of a normally distributed
population entails drawing a sample of size $n$
and computing $\bar{x}$, which is used as a point
estimate of $\mu$.
It is more meaningful to estimate $\mu$ by an
interval that communicates information
regarding the probable magnitude of $\mu$.
Sampling distributions and estimation
Interval estimates are based on sampling
distributions. When the sample mean is
being used as an estimator of a
population mean, and the population is
normally distributed, the sample mean
will be normally distributed with mean
$\mu_{\bar{x}}$ equal to the population mean, $\mu$, and
variance $\sigma_{\bar{x}}^2 = \sigma^2/n$.
The 95% confidence interval
Approximately 95% of the values of $\bar{x}$
making up the distribution will lie within 2
standard deviations of the mean. The
interval is marked by the two points
$\mu - 2\sigma_{\bar{x}}$ and $\mu + 2\sigma_{\bar{x}}$, so that 95%
of the values are in the interval $\mu \pm 2\sigma_{\bar{x}}$.
Since $\mu$ and $\mu_{\bar{x}}$ are unknown, the location
of the distribution is uncertain. We can
use $\bar{x}$ as a point estimate of $\mu$. If we
construct intervals of the form $\bar{x} \pm 2\sigma_{\bar{x}}$, about
95% of these intervals will contain $\mu$.
Example
Suppose a researcher, interested in obtaining an
estimate of the average level of some enzyme in a
certain human population, takes a sample of 10
individuals, determines the level of the enzyme in
each, and computes a sample mean of $\bar{x} = 22$.
Suppose further it is known that the variable of
interest is approximately normally distributed with a
variance of 45. We wish to estimate $\mu$.
Solution
An approximate 95% confidence interval for $\mu$
is given by:
$\bar{x} \pm 2\sigma_{\bar{x}} = 22 \pm 2\sqrt{45/10} = 22 \pm 2(2.1213)$
which gives the interval (17.76, 26.24).
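The arithmetic for the enzyme example can be checked with a short sketch. It assumes the approximate reliability coefficient of 2 used on this slide; the variable names are illustrative only.

```python
import math

x_bar = 22.0      # sample mean from the example
variance = 45.0   # known population variance
n = 10            # sample size

# Standard error of the mean: sigma / sqrt(n)
std_error = math.sqrt(variance / n)

# Approximate 95% interval: point estimate +/- 2 standard errors
margin = 2 * std_error
lower, upper = x_bar - margin, x_bar + margin

print(f"standard error = {std_error:.4f}")
print(f"approximate 95% CI = ({lower:.2f}, {upper:.2f})")
```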
Components of an interval estimate
The interval estimate of $\mu$ is centered on
the point estimate of $\mu$. Approximately
95% of the values of the standard normal
curve lie within 2 standard deviations of
the mean. The z score in this case is
called the reliability coefficient. The exact
value to use for 95% confidence is 1.96.
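General expression for an interval estimate
estimator ± (reliability coefficient) × (standard error)
Table of confidence coefficients
Confidence level    z
90%                 1.645
95%                 1.96
99%                 2.575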
Interpretation of confidence intervals
The interval estimate for $\mu$ is expressed as:
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$
If $\alpha = .05$, we can say that, in repeated
sampling, 95% of the intervals constructed
this way will include $\mu$. This is based on the
probability of occurrence of different values of
$\bar{x}$.
The area under the curve of $\bar{x}$ that is outside
the interval is called $\alpha$, and
the area inside the interval is called $1 - \alpha$.
Probabilistic interpretation of the interval
In repeated sampling from a normally
distributed population with a known
standard deviation, $100(1-\alpha)$ percent of
all intervals of the form
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$
will, in the long run, include the
population mean, $\mu$.
The quantity $1-\alpha$ is called the confidence
coefficient or confidence level, and the
interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$ is called the
confidence interval for $\mu$.
Practical interpretation
When sampling is from a normally
distributed population with known
standard deviation, we are $100(1-\alpha)$
percent confident that the single
computed interval, $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$,
contains the population mean, $\mu$.
Precision
Precision describes how tightly the interval
brackets the point estimate. It is
found by multiplying the reliability factor
by the standard error of the mean. This
quantity is also called the margin of error.
Exercise 6.2.2
We wish to estimate the mean serum indirect
bilirubin level of 4-day-old infants. The mean for a
sample of 16 infants was found to be 5.98
mg/dl. Assuming bilirubin levels in 4-day-old infants
are approximately normally distributed with a
standard deviation of 3.5 mg/dl, find:
A) The 90% confidence interval for $\mu$
B) The 95% confidence interval for $\mu$
C) The 99% confidence interval for $\mu$
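Solution
(1) Given
$\bar{x} = 5.98$
$\sigma = 3.5$
$n = 16$
(2) Sketch
(3) Calculations
Standard error: $\sigma/\sqrt{n} = 3.5/\sqrt{16} = .875$
A) 90% interval (z = 1.645)
5.98 ± 1.645 (.875)
5.98 - 1.439375, 5.98 + 1.439375
(4.5408, 7.4192)
B) 95% interval (z = 1.96)
5.98 ± 1.96 (.875)
(4.265, 7.695)
C) 99% interval (z = 2.575)
5.98 ± 2.575 (.875)
(3.7261, 8.2339)
The limits shown are calculator values, which use the unrounded z;
the rounded table values of z give limits that differ slightly.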
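The three intervals of Exercise 6.2.2 can be reproduced with a minimal sketch. It assumes Python's standard-library `statistics.NormalDist` for the unrounded z value (comparable to a calculator, not to the printed table); the helper name `z_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Return (lower, upper) for a mean when sigma is known."""
    alpha = 1 - confidence
    # Unrounded reliability coefficient z_{1 - alpha/2}
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)   # reliability coefficient x standard error
    return x_bar - margin, x_bar + margin

for level in (0.90, 0.95, 0.99):
    low, high = z_interval(5.98, 3.5, 16, level)
    print(f"{level:.0%}: ({low:.4f}, {high:.4f})")
```

Because the z values are unrounded, the printed limits correspond to the calculator answers rather than to the table-based arithmetic.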
Solution
(4) Results
A higher confidence level gives a
wider interval. There is less chance that the
interval misses $\mu$, but the estimate is less precise.
Calculator answers are more accurate than
table-based answers because the calculator uses
unrounded values of z rather than rounded table values.
The t distribution
In most real life situations the variance of the
population is unknown. We know that the z
score,
$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
is normally distributed if the population is
normally distributed and is approximately
normally distributed when the sample is
large. However, it cannot be used when $\sigma$ is
unknown.
Estimation of the standard deviation
The sample standard deviation,
$s = \sqrt{\sum (x_i - \bar{x})^2 / (n-1)}$
can be used to replace $\sigma$. If $n \ge 30$, then
$s$ is a good approximation of $\sigma$. An
alternate procedure is used when the
samples are small. It is known as
Student's t distribution.
Student's t distribution
Student's t distribution is used as an
alternative for z with small samples. It
uses the following formula:
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$
Student's t distribution
Student's t distribution was developed in
1908 by W. S. Gosset (1876-1937) who
worked for the Guinness Brewery.
Properties of the t distribution
1. Mean = 0
2. It is symmetrical about the mean.
3. Variance is greater than 1 but approaches
1 as the sample gets large. For df > 2, the
variance = df/(df-2) or $(n-1)/(n-3)$.
4. The range is $-\infty$ to $+\infty$.
5. t is really a family of distributions, one for
each value of the divisor $n-1$ (the degrees of freedom).
6. Compared with the normal distribution, t is
less peaked and has higher tails.
7. The t distribution approaches the normal
distribution as $n-1$ approaches infinity.
Confidence interval for a mean using t
General relationship
The reliability coefficient is obtained from
the t distribution.
Confidence interval
When sampling is from a normal distribution
whose standard deviation, $\sigma$, is unknown, the
$100(1-\alpha)$ percent confidence interval for the
population mean, $\mu$, is given by:
$\bar{x} \pm t_{1-\alpha/2}\, s/\sqrt{n}$
with $n-1$ degrees of freedom.
Deciding between z and t
When constructing a confidence interval for a
population mean, we must decide whether to use z
or t. Which one to use depends on the size of the
sample, whether the population is normally distributed or not, and
whether or not the population variance is known. There are
various flowcharts and decision keys that can be
used to help decide. Mine appears below.
Key for deciding between z and t in
confidence interval construction
1. Population normally distributed................2
1. Not as above (not normally distributed)........5
2. Sample size is large (30 or higher)............3
2. Sample size is small (less than 30)............4
3. Population variance is known.............use z
3. Population variance not known.... use t (or z)
4. Population variance is known.............use z
4. Population variance is not known.......use t
5. Sample size is large..................................6
5. Sample size is small..................................7
6. Population variance is known.............use z
6. Population variance not known
   (central limit theorem applies)............use z
7. Must use a non-parametric method
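The key above can be sketched as a small function. This is only an illustration of the decision logic; the function name and argument names are hypothetical, and the cutoff of 30 is the one stated in the key.

```python
def choose_statistic(population_normal: bool, n: int, variance_known: bool) -> str:
    """Follow the key and return which reliability coefficient to use."""
    large_sample = n >= 30            # the key's cutoff for "large"
    if population_normal:             # couplets 1 to 4
        if variance_known:
            return "z"
        return "t (or z)" if large_sample else "t"
    if large_sample:                  # couplets 5 and 6: z whether or not
        return "z"                    # the variance is known (CLT applies)
    return "non-parametric method"    # couplet 7

# Small sample from a normal population, variance unknown
print(choose_statistic(True, 10, False))
```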
Example
In a study of preeclampsia, Kaminski and
Rechberger found the mean systolic
blood pressure of 10 healthy,
nonpregnant women to be 119 with a
standard deviation of 2.1.
(Preeclampsia: Development of hypertension,
albuminuria, or edema between the 20th week of
pregnancy and the first week postpartum.
Eclampsia: Coma and/or convulsive seizures in the
same time period, without other etiology.)
Example
a. What is the estimated standard error of the
mean?
b. Construct the 99% confidence interval for the
mean of the population from which the 10 subjects
may be presumed to be a random sample.
c. What is the precision of the estimate?
d. What assumptions are necessary for the validity
of the confidence interval you constructed?
Solution
(1) Given
$n = 10$
$\bar{x} = 119$
$s = 2.1$
(2) Sketch of t distribution
Reading the t table
With $n - 1 = 9$ degrees of freedom, $t_{.995} = 3.2498$.
(3) Calculations
a. Estimated standard error: $s/\sqrt{n} = 2.1/\sqrt{10} = .6640783086$
b. 99% confidence interval: 119 ± 3.2498 (.66407...)
(116.84, 121.16)
c. Precision = 3.2498 (.66407...)
= 2.158121687
d. Assumptions
The population is normally distributed.
The 10 subjects represent a random sample
from this population.
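Parts a to c of the blood pressure example can be checked with a brief sketch. It assumes the table value t = 3.2498 quoted in the solution (9 degrees of freedom, 99% confidence), since the Python standard library has no t distribution; variable names are illustrative.

```python
import math

n = 10
x_bar = 119.0
s = 2.1
t_table = 3.2498   # t_{.995} with n - 1 = 9 df, taken from the t table

std_error = s / math.sqrt(n)          # a. estimated standard error
precision = t_table * std_error       # c. margin of error
lower, upper = x_bar - precision, x_bar + precision   # b. 99% interval

print(f"standard error = {std_error:.10f}")
print(f"precision      = {precision:.6f}")
print(f"99% CI         = ({lower:.2f}, {upper:.2f})")
```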
B) Confidence interval for the difference of two
population means
Introduction
From each of two populations an independent
random sample is drawn. Sample
means, $\bar{x}_1$ and $\bar{x}_2$, are calculated.
The difference is $\bar{x}_1 - \bar{x}_2$, which is an
unbiased estimator of the difference
between the two population
means, $\mu_1 - \mu_2$. The variance of the
estimator is $\sigma_1^2/n_1 + \sigma_2^2/n_2$.
Conditions for use
Assuming the populations are normally
distributed, there are three situations
where we would determine the $100(1-\alpha)$
percent confidence interval for $\mu_1 - \mu_2$:
a) where the population variances are known (use z)
b) where the population variances are unknown but
equal (use t)
c) where the population variances are unknown but
unequal (use t').
Population variances are known
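When the population variances are
known, the $100(1-\alpha)$ percent confidence
interval for $\mu_1 - \mu_2$ is given by
$(\bar{x}_1 - \bar{x}_2) \pm z_{1-\alpha/2}\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$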
Example 6.4.1
A research team is interested in the difference
between serum uric acid levels in patients with and
without Down's syndrome. In a large hospital for the
treatment of the mentally retarded, a sample of 12
individuals with Down's syndrome yielded a mean
of $\bar{x}_1 = 4.5$ mg/100 ml. In a general hospital a
sample of 15 normal individuals of the same age and
sex were found to have a mean value of $\bar{x}_2 = 3.4$
mg/100 ml. If it is reasonable to assume that the
two populations of values are normally distributed
with variances equal to 1 and 1.5, find the 95%
confidence interval for $\mu_1 - \mu_2$.
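Solution
(1) Given
$n_1 = 12$, $\bar{x}_1 = 4.5$, $\sigma_1^2 = 1$
$n_2 = 15$, $\bar{x}_2 = 3.4$, $\sigma_2^2 = 1.5$
(2) Calculations
The point estimate for $\mu_1 - \mu_2$ is
$\bar{x}_1 - \bar{x}_2 = 4.5 - 3.4 = 1.1$
The standard error is
$\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} = \sqrt{1/12 + 1.5/15} = .4282$
The 95% confidence interval is
1.1 ± 1.96 (.4282)
(.26, 1.94)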
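A short sketch of the Example 6.4.1 computation follows. It assumes the table value z = 1.96 used in the solution; the variable names are illustrative.

```python
import math

n1, x_bar1, var1 = 12, 4.5, 1.0    # Down's syndrome sample, known variance
n2, x_bar2, var2 = 15, 3.4, 1.5    # comparison sample, known variance
z = 1.96                           # reliability coefficient for 95%

point_estimate = x_bar1 - x_bar2
std_error = math.sqrt(var1 / n1 + var2 / n2)
margin = z * std_error
lower, upper = point_estimate - margin, point_estimate + margin

print(f"standard error = {std_error:.4f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```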
Population variances unknown but
equal
If it can be assumed that the population
variances are equal then each sample
variance is actually a point estimate of the
same quantity. Therefore, we can combine
the sample variances to form a pooled
estimate.
Weighted averages
The pooled estimate of the common
variance is made using weighted
averages. This means that each sample
variance is weighted by its degrees of
freedom.
Pooled estimate of the
variance
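The pooled estimate of the variance
comes from the formula:
$s_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
Standard error of the estimate
The standard error of the estimate is
$\sqrt{s_p^2/n_1 + s_p^2/n_2}$
Confidence interval
The $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$
is:
$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2}\sqrt{s_p^2/n_1 + s_p^2/n_2}$
with $n_1 + n_2 - 2$ degrees of freedom.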
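Example
(1) Given
$n_1 = 13$, $\bar{x}_1 = 21.0$, $s_1 = 4.9$
$n_2 = 17$, $\bar{x}_2 = 12.1$, $s_2 = 5.6$
$\alpha = .05$
(2) Calculations
The point estimate for $\mu_1 - \mu_2$ is
$\bar{x}_1 - \bar{x}_2 = 21.0 - 12.1 = 8.9$
The pooled estimate of the variance is
$s_p^2 = \dfrac{12(4.9)^2 + 16(5.6)^2}{28} = 28.21$
The standard error is
$\sqrt{28.21/13 + 28.21/17} = 1.9569$
The 95% confidence interval (t = 2.0484 with 28 degrees of freedom) is
8.9 ± 2.0484 (1.9569)
8.9 ± 4.0085
(4.9, 12.9)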
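The pooled-variance example can be retraced with a minimal sketch. It assumes the table value t = 2.0484 that appears in the calculation (28 degrees of freedom); the variable names are illustrative.

```python
import math

n1, x_bar1, s1 = 13, 21.0, 4.9
n2, x_bar2, s2 = 17, 12.1, 5.6
t_table = 2.0484   # t_{.975} with n1 + n2 - 2 = 28 df, from the t table

# Each sample variance is weighted by its degrees of freedom
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
std_error = math.sqrt(pooled_var / n1 + pooled_var / n2)

point_estimate = x_bar1 - x_bar2
margin = t_table * std_error
lower, upper = point_estimate - margin, point_estimate + margin

print(f"pooled variance = {pooled_var:.2f}")
print(f"standard error  = {std_error:.4f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f})")
```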
Population variances unknown
and not equal
With unequal variances, the quantity
used to calculate the test statistic does
not follow the t distribution. A substitute
reliability factor called t' has been
proposed.
C) Confidence interval for a population
proportion
To begin, a sample is drawn from the population of
interest and the sample proportion, $\hat{p}$, is
calculated. This sample proportion is used as the
point estimator of the population proportion, $p$. The
confidence interval is defined by the general formula:
estimator ± (reliability coefficient) × (standard error)
Distribution
When n is large, the reliability coefficient will be z
from the standard normal distribution. Since $p$, the
population proportion, is unknown, we use $\hat{p}$ as an
estimate. The estimate of $\sigma_{\hat{p}}$, the
standard error, is given by:
$\sqrt{\hat{p}(1-\hat{p})/n}$
Confidence interval
The $100(1-\alpha)$ percent confidence interval for $p$ is
given by:
$\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$
Probabilistic interpretation.
We say that we are 95% confident that
the population proportion, p,
lies between the calculated limits since,
in repeated sampling, about 95% of the
intervals constructed this way would
contain p.
Practical interpretation.
In a specific example, we would expect,
with 95% confidence, to find the
population proportion between the two
boundaries.
Example 6.5.2
A research study obtained data regarding sexual
behavior from a sample of unmarried men and
women between the ages of 20 and 44 residing in
geographic areas characterized by high rates of
sexually transmitted diseases and admission to drug
programs. Fifty percent of 1229 respondents
reported that they never used a condom. Construct
a 95 percent confidence interval for the population
proportion never using a condom.
Solution
(1) Given
$n = 1229$
$\hat{p} = .50$
(for the TI-83, x = 615)
Solution
(2) Calculation
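The calculation for Example 6.5.2 can be sketched as follows, assuming the large-sample interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ with the table value z = 1.96. Variable names are illustrative.

```python
import math

n = 1229
p_hat = 0.50
z = 1.96   # reliability coefficient for 95% confidence

# Estimated standard error of the sample proportion
std_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * std_error
lower, upper = p_hat - margin, p_hat + margin

print(f"standard error = {std_error:.5f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
```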
D) Confidence interval for the difference
of two population proportions
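When studying the difference between two
population proportions, the difference between the
two sample proportions, $\hat{p}_1 - \hat{p}_2$, can be used as an
unbiased point estimator for the difference between
the two population proportions, $p_1 - p_2$. This is used
with the general formula:
estimator ± (reliability coefficient) × (standard error)
Distribution
When the central limit theorem applies,
the normal distribution is used to obtain
confidence intervals. The standard error
is estimated by the formula:
$\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$
Confidence interval
The $100(1-\alpha)$ percent confidence
interval for $p_1 - p_2$ is given by:
$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$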
Probabilistic interpretation.
We say that we are 95% confident that
the difference between the two
population proportions, p1 – p2,
lies between the calculated limits since,
in repeated sampling, about 95% of the
intervals constructed this way would
contain p1 – p2.
Practical interpretation.
In a specific example, we would expect,
with 95% confidence, to find the
difference between the two population
proportions between the two limits.
Example 6.6.1
A study of teenage suicide included a sample of 96
boys and 123 girls between ages of 12 and 16 years
selected scientifically from admissions records to a
private psychiatric hospital. Suicide attempts were
reported by 18 of the boys and 60 of the girls. We
assume that the girls constitute a simple random
sample from a population of similar girls and likewise
for the boys. Construct a 99 percent confidence
interval for the difference between the two
proportions.
Solution
(1) Given
$n_1 = 123$, $\hat{p}_1 = 60/123 = .4878$ (girls)
$n_2 = 96$, $\hat{p}_2 = 18/96 = .1875$ (boys)
Solution
(2) Calculation
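A minimal sketch of the Example 6.6.1 calculation follows. It assumes the large-sample interval for $p_1 - p_2$ and takes the unrounded 99% z value from the standard-library `statistics.NormalDist`, so the limits may differ in the last digit from a table-based answer.

```python
import math
from statistics import NormalDist

n1, x1 = 123, 60   # girls: sample size, number reporting an attempt
n2, x2 = 96, 18    # boys:  sample size, number reporting an attempt

p1, p2 = x1 / n1, x2 / n2
z = NormalDist().inv_cdf(0.995)   # z_{1 - alpha/2} for alpha = .01

std_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
margin = z * std_error
diff = p1 - p2
lower, upper = diff - margin, diff + margin

print(f"p1 - p2 = {diff:.4f}, standard error = {std_error:.4f}")
print(f"99% CI = ({lower:.4f}, {upper:.4f})")
```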
Determining the sample size for
estimating means
It is important to have a sample that is the correct
size. It is also important to have a method that will
allow prediction of the correct sample size for
estimating a population mean or a population
proportion. This is important especially in business
or commercial situations where money is
involved. Selecting a sample size that is too big
wastes money. One that is too small may give
inaccurate results.
Objectives
The width of the confidence interval is determined by
the magnitude of the margin of error which is given
by:
d = (reliability coefficient) (standard error)
The total width of the interval is twice this amount.
Reducing the margin of error
In the standard error, $\sigma/\sqrt{n}$, the value of $\sigma$ is a
constant. If the reliability coefficient is fixed, the only
way to reduce the margin of error is to have a larger
sample. The size of the sample depends on the size
of $\sigma$, the degree of reliability and the desired interval
width.
Margin of error
Sample size for a large population
d = (reliability coefficient) × (standard error)
$d = z\,\sigma/\sqrt{n}$
Solving for n gives
$n = z^2\sigma^2/d^2$
Estimating $\sigma^2$
Generally the variance of the population under study
is unknown. As a result $\sigma$ has to be estimated. The
most common sources of estimates for $\sigma$ are:
1. A pilot sample which is drawn from the population
and used to compute an estimate of $\sigma$.
2. Estimates of $\sigma$ from previous or similar studies.
3. In a normally distributed population, the range is
usually about 6 standard deviations, so $\sigma$ is estimated
by R/6.
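The sample size formula $n = z^2\sigma^2/d^2$ can be sketched in a few lines. The numbers below (a range of 120, a desired margin of 5) are hypothetical and chosen only for illustration; the function name is also an assumption. The result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_for_mean(z, sigma, d):
    """Smallest n whose margin of error z * sigma / sqrt(n) is at most d."""
    return math.ceil((z * sigma / d) ** 2)

# Hypothetical inputs: sigma estimated from the range as R/6
sigma_guess = 120 / 6
print(sample_size_for_mean(1.96, sigma_guess, 5))
```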
Determination of the sample size for estimating
proportions
The manner of finding sample sizes for estimating a
population proportion is basically the same as for
estimating a mean.
The general formula is:
$d = z\sqrt{pq/n}$, where $q = 1 - p$
Sample size
Assuming proper random sampling and
an approximately normal distribution, the
sample size is
$n = z^2 pq/d^2$
Estimating the population proportion
It is necessary to estimate the population proportion,
p, to use in the determination of the sample size.
1. If an upper limit is suspected or presumed, it
could be used to represent p.
2. A pilot sample could be drawn and used to obtain
an estimate for p.
3. With no better estimate, one may use p = .5
which gives the maximum value of n.
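A brief sketch of $n = z^2 pq/d^2$ illustrates why p = .5 is the conservative choice. The margin d = .05 and the alternative value p = .3 are hypothetical inputs, and the function name is an assumption.

```python
import math

def sample_size_for_proportion(z, p, d):
    """Smallest n giving a margin of error of at most d for proportion p."""
    q = 1 - p
    return math.ceil(z**2 * p * q / d**2)

# p = .5 maximizes p * q, so it yields the largest required n
print(sample_size_for_proportion(1.96, 0.5, 0.05))
print(sample_size_for_proportion(1.96, 0.3, 0.05))
```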
E) Confidence interval for the variance of a
normally distributed population
Measures of dispersion
For a finite population of size $N$, two measures of dispersion are
$\sigma^2 = \sum (x_i - \mu)^2 / N$
$S^2 = \sum (x_i - \mu)^2 / (N-1)$
$E(s^2) = \sigma^2$ when sampling is with replacement.
$E(s^2) = S^2$ when sampling is without replacement.
Large population size
When N is large, N and N-1 are
approximately equal so $\sigma^2$ and $S^2$ will be
approximately equal. These results
justify why $s^2$ can be used to estimate
the population variance.
Interval estimate of a population variance
The value of $s^2$ is used as a point estimator of the
population variance, $\sigma^2$. Confidence intervals for $\sigma^2$
are based on the sampling distribution of $(n-1)s^2/\sigma^2$.
If samples of size n are drawn from a normally
distributed population, this quantity has a distribution
known as the chi-square distribution with n-1
degrees of freedom. The assumption that the
sample is drawn from a normally distributed
population is crucial.
The chi-square distribution
The chi-square distribution is not symmetrical. For
low values of n, its shape is variable. The
distribution does not have negative values.
Microsoft Excel Demonstration
Note how the shape of the curve
changes depending on the degrees of
freedom. With 1 degree of freedom, the
curve is hyperbolic.
Reading the $\chi^2$ table
Finding $\chi^2$ values
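Confidence interval on the $\chi^2$ distribution
The $100(1-\alpha)$ percent confidence interval for the distribution
of $(n-1)s^2/\sigma^2$ lies between the two tail values
$\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ of the $\chi^2$ distribution.
This interval is given by
$\chi^2_{\alpha/2} < (n-1)s^2/\sigma^2 < \chi^2_{1-\alpha/2}$
Confidence interval for $\sigma^2$
From the sampling distribution of $(n-1)s^2/\sigma^2$ the
confidence interval for $\sigma^2$ is derived. The formula
is:
$\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}} < \sigma^2 < \dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}$
Confidence interval for $\sigma$
To get the $100(1-\alpha)$ percent confidence interval for $\sigma$, the
population standard deviation, the square root of
each term is taken. The result is the formula below.
$\sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} < \sigma < \sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}}$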
Example 6.9.1
In a study on cholesterol levels a sample of 12 men
and women was chosen. The plasma cholesterol
levels (mmol/L) of the subjects were as follows: 6.0,
6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, and
5.8. We assume that these 12 subjects constitute a
simple random sample of a population of similar
subjects. We wish to estimate the variance of the
plasma cholesterol levels with a 95 percent
confidence interval.
Solution
(1) Given
6.0 6.4 7.0 5.8 6.0 5.8
5.9 6.7 6.1 6.5 6.3 5.8
Estimate the variance with a 95%
confidence interval.
Solution
(2) Calculations
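Value of s = .3918680978
Values of $\chi^2$ from table (11 degrees of freedom)
$\chi^2_{1-\alpha/2} = \chi^2_{.975} = 21.920$
$\chi^2_{\alpha/2} = \chi^2_{.025} = 3.816$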
Calculations
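The interval for Example 6.9.1 can be computed with a minimal sketch. It assumes the two $\chi^2$ table values quoted in the solution (11 degrees of freedom) and uses the standard-library `statistics.variance` for $s^2$; variable names are illustrative.

```python
import math
import statistics

cholesterol = [6.0, 6.4, 7.0, 5.8, 6.0, 5.8,
               5.9, 6.7, 6.1, 6.5, 6.3, 5.8]
n = len(cholesterol)

s2 = statistics.variance(cholesterol)   # sample variance, divisor n - 1
s = math.sqrt(s2)

chi2_upper = 21.920   # chi-square .975 quantile, 11 df, from the table
chi2_lower = 3.816    # chi-square .025 quantile, 11 df, from the table

# (n-1)s^2 / chi2_{1-alpha/2} < sigma^2 < (n-1)s^2 / chi2_{alpha/2}
var_low = (n - 1) * s2 / chi2_upper
var_high = (n - 1) * s2 / chi2_lower

print(f"s = {s:.10f}")
print(f"95% CI for variance = ({var_low:.4f}, {var_high:.4f})")
print(f"95% CI for std dev  = ({math.sqrt(var_low):.4f}, {math.sqrt(var_high):.4f})")
```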
F) Confidence interval for the ratio of variances
of two normally distributed populations
A way to compare the variances of two normally
distributed populations is to use the variance ratio,
$\sigma_1^2/\sigma_2^2$. The variance ratio is used, among other
things, as the test statistic for analysis of variance
(ANOVA). If the two variances are equal, then
V. R. = 1.
Sampling distribution
The sampling distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is
used. Since the population variances are usually not
known, the sample variances are used. The
assumptions are that $s_1^2$ and $s_2^2$ are computed from
independent samples of size $n_1$ and $n_2$, respectively,
drawn from two normally distributed populations.
If the assumptions are met, $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$
follows a distribution known as the F distribution with
two values used for degrees of freedom.
Degrees of freedom
The F distribution uses two values for degrees of
freedom. The numerator degrees of freedom is the
value of $n_1 - 1$, which is used in calculating $s_1^2$. The
denominator degrees of freedom is the value of $n_2 - 1$,
which is used in calculating $s_2^2$.
The F distribution
The F distribution is not symmetrical.
The distribution does not have negative
values. Because it uses two values of
degrees of freedom, there are separate
tables for different confidence levels.
F distribution tables
Reading F tables
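F tables come in denominations based on the cumulative
probability in one tail; the tables shown here are
$F_{.95}$, $F_{.975}$ and $F_{.995}$. For two-tail intervals, the lower boundary,
$F_{\alpha/2}$, must be calculated as the reciprocal of the tabulated
upper value with the degrees of freedom interchanged:
$F_{\alpha/2,\,df_1,\,df_2} = 1/F_{1-\alpha/2,\,df_2,\,df_1}$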
Two-tail F distribution boundaries
The F.95 table
The F.975 table
The F.995 table
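Confidence interval for $\sigma_1^2/\sigma_2^2$
The distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is used to
establish the $100(1-\alpha)$ percent confidence interval
for $\sigma_1^2/\sigma_2^2$. The starting point is
$F_{\alpha/2} < \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} < F_{1-\alpha/2}$
From this relation, it can be shown that the $100(1-\alpha)$
percent confidence interval for $\sigma_1^2/\sigma_2^2$ is
$\dfrac{s_1^2/s_2^2}{F_{1-\alpha/2}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2/s_2^2}{F_{\alpha/2}}$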
Example 6.10.1
Among 11 patients in a certain study, the
standard deviation of the property of
interest was 5.8. In another group of 4
patients, the standard deviation was
3.4. We wish to construct a 95 percent
confidence interval for the ratio of the
variances of these two populations.
Solution
(1) Given
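$n_1 = 11$, $s_1^2 = (5.8)^2 = 33.64$
$n_2 = 4$, $s_2^2 = (3.4)^2 = 11.56$
$\alpha = .05$
$F_{.975,\,10,\,3} = 14.42$
$F_{.025,\,10,\,3} = 1/F_{.975,\,3,\,10} = 1/4.83 = .20704$
Solution
(2) Calculations
Calculation of the 95% confidence interval for $\sigma_1^2/\sigma_2^2$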
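The interval for Example 6.10.1 can be sketched as below. It assumes the two F table values quoted in the solution (14.42 and 4.83) and the reciprocal rule for the lower boundary; variable names are illustrative.

```python
n1, s1 = 11, 5.8
n2, s2 = 4, 3.4

f_upper = 14.42          # F_{.975} with (10, 3) df, from the table
f_lower = 1 / 4.83       # F_{.025} with (10, 3) df = 1 / F_{.975} with (3, 10) df

ratio = s1**2 / s2**2    # sample variance ratio s1^2 / s2^2

# (s1^2/s2^2) / F_{1-alpha/2} < sigma1^2/sigma2^2 < (s1^2/s2^2) / F_{alpha/2}
lower = ratio / f_upper
upper = ratio / f_lower

print(f"variance ratio = {ratio:.4f}")
print(f"95% CI for the ratio of variances = ({lower:.4f}, {upper:.4f})")
```

The interval is wide because the second sample has only 3 degrees of freedom.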