The Normal Approximation for Data

Review
Statistical inference and test of significance
Basic concepts
• Suppose we want to study a population, and the parameter (average
or percentage) is unknown. We use ____ to estimate the parameter.
• With a simple random sample, the sample average/percentage can be
used to estimate the population average/percentage.
• But the sample estimate will be off by some amount, due to chance
error. The standard error measures the likely size of it. When the
composition of the population is unknown, we have to use the
bootstrap method to estimate the SD of the population. What is the
bootstrap method?
• The SD of the population can be estimated by the SD of the sample.
This bootstrap estimate is good when the sample is large.
Basic concepts
• What is a confidence interval for the population parameter (average
or percentage)?
• A confidence interval for the population parameter is obtained by
going the right number of SEs either way from the sample estimate.
The confidence level is read off the normal curve. This should only be
used with large samples, because the normal approximation relies on
the central limit theorem (CLT).
• How do you interpret the confidence level in terms of frequency
theory of probability?
• It is not the probability that the parameter lies in the interval,
because parameters are not subject to chance variation. Instead, it
describes the frequency, over many samples, with which the
corresponding confidence intervals cover the true value (parameter).
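The frequency interpretation can be checked by simulation. The following is a minimal sketch, not part of the original slides: the box of tickets, the sample size, the number of trials, and the function name `coverage_rate` are all hypothetical choices made for illustration.

```python
import random
import statistics


def coverage_rate(box, sample_size, trials, seed=0):
    """Fraction of 'sample average ± 2 SE' intervals that cover the box average."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    true_avg = statistics.fmean(box)   # the parameter: a fixed number, not random
    covered = 0
    for _ in range(trials):
        sample = rng.sample(box, sample_size)   # simple random sample
        avg = statistics.fmean(sample)
        sd = statistics.pstdev(sample)          # sample SD stands in for the box SD
        se = sd / sample_size ** 0.5            # SE for average
        if avg - 2 * se <= true_avg <= avg + 2 * se:
            covered += 1
    return covered / trials


box = [1, 2, 3, 4, 5, 6] * 2000   # hypothetical population of 12,000 tickets
rate = coverage_rate(box, sample_size=400, trials=1000)
print(f"{rate:.1%} of the intervals cover the true average")
```

The parameter never moves; the intervals do. Roughly 95% of them should cover it.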
Basic concepts
• The formulas for simple random samples may not apply to other
kinds of samples.
• For instance: with samples of convenience, standard errors usually do
not make sense.
• Even if the sample is drawn by a probability method, the SE formulas
still do not apply unless that method is simple random sampling.
Basic concepts
• What is a test of significance? What is the null hypothesis, and what is
the alternative hypothesis?
• A test of significance gets at the question of whether an observed
difference is real (the alternative hypothesis) or just a chance
variation (the null hypothesis).
• The null must be based on the chance process (assuming no other
factors or bias), and the alternative is based on the
question/argument we suggest.
• We can use a test of significance to check a statement (the null), or
to support a statement (the alternative).
Basic concepts
• What is a test statistic?
• A test statistic measures the difference between the data and what is
expected based on the null hypothesis. This means the calculation is
based on the null.
• What is a z-statistic?
• The z-statistic says how many SEs away an observed value is from its
expected value, where the expected value is calculated using the null
hypothesis.
Basic concepts
• What is the observed significance level or P-value? How do you
interpret it?
• The P-value is not the chance of the null being correct. It is the
chance of getting a test statistic as extreme as or more extreme than
the observed one. (The calculation is based on the null.)
• Small P-values are evidence against the null:
• Less than 5%: statistically significant or significant.
• Less than 1%: highly significant.
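To make the definition concrete, here is a small sketch that computes a P-value exactly for a coin-tossing null, without the normal curve. The numbers (60 heads in 100 tosses) and the function name `p_value_at_least` are hypothetical, and the alternative is assumed to be one-sided (the coin favors heads).

```python
from math import comb


def p_value_at_least(heads, tosses):
    """Chance of `heads` or more heads in `tosses` tosses, computed under the null (fair coin)."""
    favorable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favorable / 2 ** tosses


p = p_value_at_least(60, 100)
print(f"P-value = {p:.2%}")
```

The result is the chance of an outcome as extreme as or more extreme than the observed one, assuming the null is true. It is not the chance that the null is true.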
Basic concepts
• Suppose we only have a small sample, say the sample size is 5. If the
observed values (or the errors) follow the normal curve, and the SD of
the population is unknown, do we still use the z-test?
• No. We use the t-test instead.
• Suppose we have a randomized control experiment. We want to
compare the data from the treatment group and the control group. In
order to prove the treatment indeed has effect, what kind of test shall
we use? How do we set up the null and alternative?
• We use the two-sample z-test. The null is based on chance variation,
so it says the treatment has no effect. The alternative is based
on what we want to prove: the treatment has an effect.
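The slides do not write out the two-sample calculation. The sketch below assumes the usual formula for the SE of a difference between two independent averages, the square root of the sum of the squared SEs. The group sizes, averages, SDs, and the function name `two_sample_z` are hypothetical.

```python
from math import sqrt


def two_sample_z(n_t, avg_t, sd_t, n_c, avg_c, sd_c):
    """z-statistic for treatment minus control; the null says the difference is 0."""
    se_t = sd_t / sqrt(n_t)                  # SE for the treatment average
    se_c = sd_c / sqrt(n_c)                  # SE for the control average
    se_diff = sqrt(se_t ** 2 + se_c ** 2)    # SE for the difference (assumed formula)
    expected_diff = 0.0                      # expected difference under the null
    return (avg_t - avg_c - expected_diff) / se_diff


# Hypothetical data: 100 subjects per group.
z = two_sample_z(n_t=100, avg_t=50.0, sd_t=10.0, n_c=100, avg_c=47.0, sd_c=10.0)
print(f"z = {z:.2f}")
```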
Basic concepts
• Suppose we want to detect whether a coin is fair or not. What kind of
test shall we use?
• The one-sample z-test (with two-sided P-value) or the χ²-test.
• But what if we want to test whether a die is fair? (More than 2
categories.)
• We use the χ²-test.
• The χ²-statistic is never negative. (Compare to the z-statistic,
which can be positive or negative.)
• The χ²-test can also be used to test for independence.
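As an illustration, the χ²-statistic for a die can be sketched as the sum of (observed − expected)² / expected over the six faces. The observed counts below are made up for the example, and the function name `chi_square_statistic` is hypothetical. The P-value would then be read from a χ² table with (number of categories − 1) degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [4, 6, 17, 16, 8, 9]        # hypothetical counts for faces 1..6
rolls = sum(observed)                  # 60 rolls in total
expected = [rolls / 6] * 6             # a fair die: 10 of each face expected
chi2 = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
print(f"chi-square = {chi2:.1f} with {degrees_of_freedom} degrees of freedom")
```

Every term is a square divided by a positive number, which is why the statistic cannot be negative.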
Calculation and formula
• The four formulas for SE:
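• SE for sum = $\sqrt{\text{number of draws}} \times (\text{SD of box})$
• SE for average = $\dfrac{\text{SE for sum}}{\text{sample size}}$
• SE for counting (number) = $\sqrt{\text{number of draws}} \times (\text{SD of 0-1 box})$
• SE for percentage = $\dfrac{\text{SE for number}}{\text{sample size}} \times 100\%$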
• A Confidence interval:
• “sample estimate ± number of SEs”
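The four SE formulas and the confidence interval can be written as small helper functions. This is a sketch only: the function names are hypothetical, the SD of a 0-1 box is taken as the square root of (fraction of 1's) × (fraction of 0's), and the numbers in the usage lines are made up.

```python
from math import sqrt


def se_for_sum(draws, sd_of_box):
    return sqrt(draws) * sd_of_box               # square root law


def se_for_average(draws, sd_of_box):
    return se_for_sum(draws, sd_of_box) / draws


def sd_of_zero_one_box(fraction_of_ones):
    return sqrt(fraction_of_ones * (1 - fraction_of_ones))


def se_for_number(draws, fraction_of_ones):
    return sqrt(draws) * sd_of_zero_one_box(fraction_of_ones)


def se_for_percentage(draws, fraction_of_ones):
    return se_for_number(draws, fraction_of_ones) / draws * 100   # in percent


def confidence_interval(estimate, se, number_of_ses=2):
    """'sample estimate ± number of SEs'; 2 SEs gives about 95% confidence."""
    return estimate - number_of_ses * se, estimate + number_of_ses * se


# Hypothetical usage: 100 draws from a box with SD 2 and sample average 10.
se = se_for_average(100, 2.0)
low, high = confidence_interval(10.0, se)
print(f"SE for average = {se:.2f}, interval = {low:.1f} to {high:.1f}")
```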
Calculation and formula
• The equation for the z-statistic:
• In a z-test, the z-statistic can be computed by
$z = \dfrac{\text{observed} - \text{expected}}{\text{SE}}$.
• If the null determines the SD of the box, use this information.
Otherwise, you have to estimate the SD from the data of the sample.
• The one-sided P-value:
• It is the area under the normal curve on the more extreme side of z
(to the right of a positive z, to the left of a negative z).
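The z-statistic and the one-sided P-value can be sketched in a few lines. In place of a normal table, this sketch uses the complementary error function `math.erfc` from the Python standard library to get the tail area. The function names and the example numbers are hypothetical.

```python
from math import erfc, sqrt


def z_statistic(observed, expected, se):
    """How many SEs the observed value is from its expected value under the null."""
    return (observed - expected) / se


def one_sided_p_value(z):
    """Area under the normal curve beyond z, on the more extreme side."""
    return 0.5 * erfc(abs(z) / sqrt(2))


# Hypothetical numbers: observed 104, expected 100 under the null, SE 2.
z = z_statistic(observed=104, expected=100, se=2)
p = one_sided_p_value(z)
print(f"z = {z:.1f}, one-sided P-value = {p:.1%}")
```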
Example 1
• A survey organization takes a simple random sample of 625
households from a city of 80,000 households.
• On the average, there are 2.30 persons per sample household, and
the SD is 1.75.
• Find a 95%-confidence interval for the average household size in the
city.
Solution
• This is an inference problem. We use the sample average to
estimate the population average.
• So the average household size in the city is estimated as 2.30.
• Since we don’t know the SD of the population, we use the bootstrap
method to estimate the SD as 1.75 by the data of the sample.
• Then by the square root law, the SE for sum = $\sqrt{625} \times 1.75 = 43.75$.
• So the SE for average is $\dfrac{43.75}{625} = 0.07$.
• By “sample average ± 2 SEs”, we find that a 95%-confidence interval is
2.30 ± 0.14, or equivalently 2.16 ~ 2.44.
Remark
• A variant of this problem could be:
• Suppose 30% of the sample households have 3 or more persons.
• Find a 95%-confidence interval for the percentage of households
in the city that have 3 or more persons.
• In this case, you are doing a 0-1 box problem.
• You may also look at the statement (true or false): 95% of the
households in the city contain between 2.16 and 2.44 persons.
• This is false. It confuses the SD with the SE. The SE measures the
likely size of the chance error in the sample average, from sample to
sample; the SD measures the spread of the data within one sample.
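The slides leave the 0-1 box variant as an exercise. A possible calculation, assuming the same sample of 625 households and using the sample percentage (30%) to estimate the SD of the box, could look like this. The variable names are hypothetical.

```python
from math import sqrt

sample_size = 625
fraction_of_ones = 0.30    # households with 3 or more persons are marked 1

# Bootstrap: estimate the SD of the 0-1 box from the sample percentage.
sd_of_box = sqrt(fraction_of_ones * (1 - fraction_of_ones))
se_number = sqrt(sample_size) * sd_of_box
se_percentage = se_number / sample_size * 100       # in percent

low = fraction_of_ones * 100 - 2 * se_percentage    # estimate minus 2 SEs
high = fraction_of_ones * 100 + 2 * se_percentage   # estimate plus 2 SEs
print(f"SE for percentage = {se_percentage:.2f}%")
print(f"95%-confidence interval: {low:.1f}% to {high:.1f}%")
```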
Example 2
• According to the census, the median household income in Atlanta
(1.5 million households) was $52,000 in 1999.
• In June 2003, a market research organization takes a simple random
sample of 750 households in Atlanta; 56% of the sample households
had incomes over $52,000.
• Did median household income in Atlanta increase over the period
1999 to 2003?
• Formulate the null and alternative hypotheses, and use a test of
significance to decide between them.
Solution
• This problem asks about whether the median increased or not.
• But we don’t have enough information about the incomes overall.
Even if we know the observed median in 2003, we still don’t know
how to compute the corresponding SE.
• So instead of looking at the incomes (quantitative variable), we look
at a qualitative variable: whether a household had income over
$52,000 or not.
• The idea is that, since the median (50%) income in 1999 was $52,000,
if the percentage of households having income over $52,000 was
really greater than 50% in 2003 (not due to chance), then the median
must have increased.
Solution
• So a 0-1 box is needed (to classify the qualitative data):
• The box has one ticket for each household in 2003. If the income is
over $52,000, the ticket is marked 1; otherwise, 0.
• The null says: the median did not increase, or equivalently, the
percentage of the households having incomes over $52,000 is 50%.
(The percentage of 1’s in the box is 50%.)
• The alternative says, this percentage is bigger than 50%. (The median
did increase.)
• The sample is just like 750 draws from the box.
Solution
• We use the z-test: $z = \dfrac{\text{observed} - \text{expected}}{\text{SE}}$.
• The observed percentage is 56%, and the expected percentage is 50% which
is based on the null.
• Again, based on the null, we know that the SD of the box can be computed by
$(1 - 0) \times \sqrt{0.50 \times 0.50} = 0.50$. (We don’t use 56%, the sample
estimate, because of the null.)
• The SE for number is about $\sqrt{750} \times 0.50 \approx 13.7$, then the SE for
percentage is about $\dfrac{13.7}{750} \times 100\% \approx 1.83\%$.
• So the z-statistic is about $z = \dfrac{56\% - 50\%}{1.83\%} \approx 3.3$. Then the
P-value is about 0.05% (< 1%), which is highly significant.
• Therefore, we reject the null and conclude that the median went up.
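The arithmetic in this solution can be reproduced with a short script. This is a sketch under the same assumptions as the slides (a 0-1 box with 50% 1's under the null, 750 draws). The tail area is computed with `math.erfc` instead of a normal table, and the variable names are hypothetical.

```python
from math import erfc, sqrt

draws = 750
observed_pct = 56.0      # sample percentage with income over $52,000
expected_pct = 50.0      # percentage of 1's in the box under the null

# The SD of the box comes from the null (50% 1's), not from the sample.
sd_of_box = (1 - 0) * sqrt(0.50 * 0.50)
se_number = sqrt(draws) * sd_of_box
se_pct = se_number / draws * 100

z = (observed_pct - expected_pct) / se_pct
p_value = 0.5 * erfc(z / sqrt(2))    # one-sided: area to the right of z

print(f"SE for number = {se_number:.1f}, SE for percentage = {se_pct:.2f}%")
print(f"z = {z:.1f}, P-value = {p_value:.2%}")
```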
Good Luck!
