The Normal Approximation for Data


Review
Law of averages, expected value and standard error, normal
approximation, surveys and sampling
Basic concepts
• What is the law of averages?
• When there is a chance process (in general, it is about the sum of
draws), the chance error enters according to the equation:
• Observed value = expected value + chance error.
• If we increase the number of the draws, then the chance error is likely
to be large in absolute terms, but small relative to the number of
draws.
• Or equivalently, in terms of percentage, chance error in percent will
be close to 0 if we increase the number of draws. This is the law of
averages.
• The square root law is the quantitative version of the law of averages:
• SE = √(number of draws) × (SD of box).
Example
• According to genetic theory, there is very close to an even chance that
both children in a two-child family will be of the same sex.
• Which is more likely?
• (i) 15 couples have two children each. In 10 or more of the families,
both children are of the same sex.
• (ii) 30 couples have two children each. In 20 or more of the families,
both children are of the same sex.
• Answer: (i).
• With more families, the percentage of same-sex families is likely to
be closer to the even chance (50%), so reaching 2/3 (10/15 = 20/30) is
less likely with 30 families than with 15.
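The answer can be checked with an exact binomial calculation. The sketch below assumes each family is an independent trial with chance exactly 1/2 of having two children of the same sex; the helper name `prob_at_least` is made up for this illustration.

```python
from math import comb

def prob_at_least(k_min, n, p=0.5):
    """Exact binomial chance of k_min or more successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

p_15 = prob_at_least(10, 15)  # option (i): 10 or more out of 15 families
p_30 = prob_at_least(20, 30)  # option (ii): 20 or more out of 30 families

print(f"(i)  15 families, 10 or more: {p_15:.3f}")
print(f"(ii) 30 families, 20 or more: {p_30:.3f}")
```

Under these assumptions option (i) comes out at roughly 15% and option (ii) at roughly 5%, in line with the answer above.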
Example
• A die will be thrown some number of times, and the object is to guess
the total number of spots. There is a one-dollar penalty for each spot
that the guess is off.
• For instance, if you guess 200 and the total is 215, you lose $15.
• Which do you prefer: 50 throws, or 100?
• Answer: 50.
• The best guess is the expected value. With more throws, the chance
error in the total is likely to be larger in absolute terms, so you are
likely to lose more money.
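To put rough numbers on this, the sketch below computes the SE for the total number of spots with 50 and with 100 throws, assuming a fair six-sided die. The function name `se_of_sum` is an illustrative choice.

```python
from math import sqrt

faces = [1, 2, 3, 4, 5, 6]
avg = sum(faces) / len(faces)                               # 3.5
sd = sqrt(sum((x - avg) ** 2 for x in faces) / len(faces))  # about 1.71

def se_of_sum(n_throws):
    """Square root law: SE for the sum of n_throws draws from the box."""
    return sqrt(n_throws) * sd

for n in (50, 100):
    print(f"{n} throws: best guess {n * avg:.0f}, "
          f"likely off by about {se_of_sum(n):.1f} spots")
```

The likely size of the miss is about 12 spots with 50 throws and about 17 with 100, so the dollar penalty tends to be larger with more throws.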
Remark
• The SE for number and SE for percentage behave quite differently:
• The SE for number will go up like the square root of the number of
draws.
• The SE for percentage will go down like the square root of the
number of draws.
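The contrast can be seen in a small table. The sketch below uses a 0-1 box with half 1's (SD = 0.5), which is an illustrative choice and not a box from the slides; `se_number` and `se_percent` are made-up helper names.

```python
from math import sqrt

sd_box = 0.5  # SD of a 0-1 box in which half the tickets are 1's (assumed box)

def se_number(n_draws):
    """SE for the number of 1's: grows like the square root of n_draws."""
    return sqrt(n_draws) * sd_box

def se_percent(n_draws):
    """SE for the percentage of 1's: shrinks like the square root of n_draws."""
    return se_number(n_draws) / n_draws * 100

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} draws: SE for number = {se_number(n):7.1f}, "
          f"SE for percentage = {se_percent(n):.2f}%")
```

Multiplying the number of draws by 100 multiplies the SE for the number by 10 and divides the SE for the percentage by 10.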
Basic concepts
• What is a probability histogram?
• A graph that represents chance, not data.
• What is the relationship between the empirical histogram for the
observed data and the ideal probability histogram?
• If the chance process is repeated many times, the empirical histogram
converges to the probability histogram.
• In general, the process is about the sum of draws.
• What if the process is about the product of draws?
• The convergence still applies.
Basic concepts
• What is the central limit theorem?
• When drawing at random with replacement from a box, the
probability histogram for the sum will follow the normal curve,
provided that the histogram is put into standard units and the
number of draws is reasonably large.
• What if the process is about the product of draws?
• The convergence fails to apply: the probability histogram for a
product of draws need not follow the normal curve.
Basic concepts
• What is a population? What is a sample?
• In a survey, a population is the group of subjects that we want to
study.
• A sample is part of the population. It will represent some properties
of the population. We study samples, when it is impractical to study
the whole population.
• What is a parameter?
• A parameter is a numerical fact about a population. Usually a
parameter cannot be determined exactly, but can only be estimated.
Basic concepts
• What is a statistic?
• A statistic is an estimate of the parameter, and it can be computed
from a sample. A statistic is what we know. The parameter is what we
want to know.
• What are the two main kinds of bias we studied in class?
• The selection bias and the non-response bias.
Basic concepts
• How do we determine if there is selection bias in a survey?
• Warning signs: the interviewers have discretion in choosing subjects,
the investigator or survey designer has discretion, the selection does
not use a chance procedure so that individuals do not have an even
chance of being chosen, and so on.
• How do we determine if there is non-response bias in a survey?
• The lifestyle of the non-respondents can be very different from that
of the respondents. We may also calculate the non-response rate:
personal interviews 65% and mailed questionnaires 25%. (threshold)
Basic concepts
• What is the best method to draw a sample in a survey?
• The probability methods.
• What is the simplest probability method?
• The simple random sampling.
• Is it practical?
• No. The list of names is too long, it is not easy to send out
interviewers to find the selected individuals, and so on.
• What other probability methods have we studied?
• Multistage cluster sampling and random digit dialing (RDD) for
telephone surveys.
Basic concepts
• According to the equation: statistic = parameter + chance error, what
is sample percentage and what is population percentage in a sampling
process? Do they have to be equal?
• Population percentage is the parameter or the expected value.
Sample percentage is the statistic or the estimate, and it is often off
by a chance error which is measured by SE for percentage.
• According to the square root law, what determines the accuracy of
the sampling process?
• When the sample is only a small part of the population, it is the
sample size which mainly determines the accuracy. The population
size has almost no influence on it.
Calculation and formula
• What is the formula for the expected value?
• EV = (number of draws) × (average of box)
• EV = (sample size) × average
• EV for percentage = composition percentage of population
• What is the formula for SD?
• SD = r.m.s. of the list of deviations from the average
• SD = (big – small) × √((fraction of big) × (fraction of small))
• The 2nd formula applies only to lists involving just 2 different
numbers.
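The two SD formulas should agree whenever the list has only two different numbers. The sketch below checks this on two small lists; the function names `sd_rms` and `sd_two_values` and the second list are illustrative assumptions.

```python
from math import sqrt

def sd_rms(values):
    """SD as the r.m.s. of the deviations from the average."""
    avg = sum(values) / len(values)
    return sqrt(sum((v - avg) ** 2 for v in values) / len(values))

def sd_two_values(big, small, frac_big):
    """Shortcut SD, valid only for a list with two different numbers."""
    return (big - small) * sqrt(frac_big * (1 - frac_big))

# A 0-1 box with one 1 and three 0's.
print(sd_rms([1, 0, 0, 0]), sd_two_values(1, 0, 1 / 4))

# A list with two 5's and one 1 (made-up example).
print(sd_rms([5, 5, 1]), sd_two_values(5, 1, 2 / 3))
```

Each printed pair should match up to floating-point rounding.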
Calculation and formula
• What is the formula for SE?
• SE for sum = √(number of draws) × (SD of box)
• SE for counting (number) = √(number of draws) × (SD of 0-1 box)
• SE for percentage = (SE for number / sample size) × 100%
• How do we convert scale into standard units?
• For data, the average converts to 0, and each deviation (difference
from the average) is divided by the SD.
• For probability, the expected value converts to 0, and each difference
from the expected value is divided by the SE.
Models and Examples
• The fundamental model: box with tickets, draw at random with
replacement.
• For instance, 100 draws from the box with tickets: 1,2,3,4.
• What is the expected sum?
• 100 × 2.5 = 250.
• What is the SD?
• r.m.s. of -1.5, -0.5, 0.5, 1.5 ≈ 1.12
• What is the SE?
• √100 × 1.12 = 11.2
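The same three numbers can be reproduced with Python's standard library. This is a minimal sketch; `pstdev` is the population SD, which matches the r.m.s. definition used here.

```python
from math import sqrt
from statistics import mean, pstdev

box = [1, 2, 3, 4]
n_draws = 100

avg = mean(box)                 # 2.5
sd = pstdev(box)                # r.m.s. deviation, about 1.12
ev_sum = n_draws * avg          # expected sum
se_sum = sqrt(n_draws) * sd     # square root law

print(f"EV = {ev_sum}, SD = {sd:.2f}, SE = {se_sum:.1f}")
```

Without rounding the SD first, the SE comes out near 11.18, which rounds to the 11.2 used above.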
Models and Examples
• Again, 100 draws from the box with tickets: 1,2,3,4.
• How many 1’s may we expect to get?
• We need a new box: 1,0,0,0.
• EV = 100 × ¼ = 25.
• What is the SD?
• SD = (1 - 0) × √(1/4 × 3/4) ≈ 0.43
• What is the SE?
• SE = √100 × 0.43 = 4.3
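Switching to the 0-1 box amounts to relabeling each ticket as 1 if it counts and 0 if it does not. The sketch below does that relabeling in code and then applies the same EV and SE formulas; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

box = [1, 2, 3, 4]
n_draws = 100

# Relabel: a ticket becomes 1 if it is a "1", otherwise 0.
zero_one_box = [1 if ticket == 1 else 0 for ticket in box]  # [1, 0, 0, 0]

ev_count = n_draws * mean(zero_one_box)           # expected number of 1's
se_count = sqrt(n_draws) * pstdev(zero_one_box)   # SE for the number of 1's

print(f"Expect {ev_count:.0f} ones, give or take {se_count:.1f} or so")
```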
Normal approximation
• Again, 100 draws from the box with tickets: 1,2,3,4.
• What is the probability that the average is between 2 and 3?
• The average is between 2 and 3 ↔ the sum is between 200 and 300.
• From the previous calculation: EV = 250, SE = 11.2.
• Convert to standard units: 200 → -4.5, 300 → 4.5.
• From the normal table, it is about 99.99%.
• What is the probability that the number of 1’s is between 20 and 30?
• Now we have to use the 0-1 box.
• From the previous calculation: EV = 25, SE = 4.3.
• Convert to standard units: 20 → -1.16, 30 → 1.16.
• From the normal table, it is about 75%.
• What about the chance that the number of 1’s is more than 30?
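The table lookups above can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch that reuses the rounded EV and SE values from the slides and, like the slide, applies no continuity correction; `normal_chance` is a made-up helper name. The last line is one way to approach the final question, which the slide leaves open.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve

def normal_chance(lo, hi, ev, se):
    """Area under the normal curve between lo and hi, after converting
    both endpoints to standard units."""
    return Z.cdf((hi - ev) / se) - Z.cdf((lo - ev) / se)

p_sum = normal_chance(200, 300, ev=250, se=11.2)   # average between 2 and 3
p_ones = normal_chance(20, 30, ev=25, se=4.3)      # 20 to 30 ones
p_more_than_30 = 1 - Z.cdf((30 - 25) / 4.3)        # right-hand tail beyond 30

print(f"sum between 200 and 300: {p_sum:.4%}")
print(f"20 to 30 ones:           {p_ones:.1%}")
print(f"more than 30 ones:       {p_more_than_30:.1%}")
```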
Normal approximation
• What about the chance that the number of 1’s is exactly 25?
• We have to use the “continuity correction”.
• In the probability histogram, the block for 25 1’s goes from 24.5 to
25.5.
• So in standard units, it goes from 24.5 → -0.12 to 25.5 → 0.12.
• (EV = 25, SE = 4.3)
• From the normal table, it is about 10%.
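As a check on the continuity correction, the sketch below compares the normal-curve area over the block from 24.5 to 25.5 with the exact binomial chance of getting exactly 25 ones in 100 draws. It assumes draws with replacement and chance 1/4 per draw, and reuses the rounded EV = 25 and SE = 4.3.

```python
from math import comb
from statistics import NormalDist

n_draws, p_one = 100, 0.25
curve = NormalDist(mu=25, sigma=4.3)  # normal curve with the slide's EV and SE

# Area of the normal curve over the block for "exactly 25".
approx = curve.cdf(25.5) - curve.cdf(24.5)

# Exact binomial chance of exactly 25 ones.
exact = comb(n_draws, 25) * p_one**25 * (1 - p_one) ** (n_draws - 25)

print(f"normal approximation: {approx:.3f}")
print(f"exact binomial:       {exact:.3f}")
```

Both values land a little above 9%, consistent with the "about 10%" read from the table.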
Models and Examples
• Sampling process:
• A group of 50,000 tax forms has an average gross income of $37,000,
with an SD of $20,000. About 20% of the forms have a gross income
over $50,000. A group of 900 forms is chosen at random for audit.
• Q1: estimate the probability that between 19% and 21% of the forms
chosen for audit have gross incomes over $50,000.
• Q2: estimate the probability that the total gross income of the
audited forms is over $33,000,000. (This question can be also
translated into average version: the average gross income is over
$33,000,000/900.)
Solutions
• The original box model is: 50,000 tickets, average = 37,000, SD =
20,000, and 900 draws.
• But Q1 asks about forms having a gross income over $50,000. Each
form either qualifies or does not, so the process is about counting.
• We need a new box model, the 0-1 box:
• 50,000 tickets, 20% are 1’s, others are 0’s.
• EV = composition percentage = 20%
• SD = √(0.2 × 0.8) = 0.4
• SE = √900 × 0.4 = 12, SE for percentage = 12/900 × 100% ≈ 1.33%
Solutions
• The last step is to convert the scale into standard units:
• 19%→-0.75, and 21%→0.75. (EV = 20%, SE = 1.33%)
• So the probability is about 55% from the normal table.
• Q2 asks about the total income being over $33,000,000. This is a
process about the sum of draws being greater than some quantity.
• So we use the original box: 50,000 tickets, average = 37,000, SD =
20,000, and 900 draws.
• EV = number of draws × average = 900 × 37,000 = 33,300,000.
• SE = √900 × 20,000 = 600,000.
Solutions
• Again, the last step is to convert the scale into standard units:
• 33,000,000 → -0.5 (This is because
(33,000,000 − 33,300,000) / 600,000 = -0.5.)
• From the normal table, to the right of -0.5, it is about 69%.
• An interesting question: can we estimate the probability that
between 9% and 11% of the forms chosen for audit have gross
incomes over $75,000?
• Answer: no. We need to know the percentage of forms with gross
incomes over $75,000.
• But sometimes, it can be done with the assumption that the data
follow the normal distribution (normal curve).
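Both tax-form answers (Q1 and Q2) can be reproduced in a few lines. The sketch below treats the 900 draws as if they were made with replacement, which is reasonable here because the sample is a small part of the 50,000 forms; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal curve
n_draws = 900

# Q1: percentage of audited forms over $50,000 (0-1 box, 20% are 1's).
frac_ones = 0.20
sd_01 = sqrt(frac_ones * (1 - frac_ones))          # 0.4
se_pct = sqrt(n_draws) * sd_01 / n_draws * 100     # about 1.33 (percent)
z1 = (21 - 20) / se_pct                            # 0.75 in standard units
p_q1 = Z.cdf(z1) - Z.cdf(-z1)

# Q2: total gross income over $33,000,000 (original box).
ev_sum = n_draws * 37_000                          # 33,300,000
se_sum = sqrt(n_draws) * 20_000                    # 600,000
z2 = (33_000_000 - ev_sum) / se_sum                # -0.5 in standard units
p_q2 = 1 - Z.cdf(z2)

print(f"Q1: {p_q1:.1%}")
print(f"Q2: {p_q2:.1%}")
```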
Example
• Suppose in a calculus test, a group of 10,000 students has an average
score of 70, with an SD of 10. A group of 400 students is chosen for
sampling.
• Q: Assume the scores follow the normal distribution, can we estimate
the probability that between 14% and 18% of the students chosen for
sampling have scores above 80?
• Answer: Yes, we can!
Solution
• In the box model, we have 10,000 tickets, average = 70, SD = 10, 400
draws.
• This problem is about determining whether each student has a score
above 80 or not. So it is a counting process.
• We need a new box model: 0-1 box. There are 10,000 tickets.
• But we don’t know the composition percentage of 1’s and 0’s.
• We first have to use the normal curve to estimate it, because we
assume that the data follow the normal curve.
• In the original box, with average = 70, SD = 10, the score 80 is converted to
1 in standard units.
• From the normal table, the area to the right of 1 is about 16%. That
is, in the population of 10,000 students, about 16% of the students
have scores above 80.
• So in the new 0-1 box, 16% are 1’s and the rest are 0’s.
Solution
• Then in the new 0-1 box:
• EV = composition percentage = 16%
• SD = √(0.16 × 0.84) ≈ 0.37
• SE = √400 × 0.37 = 7.4, SE for percentage = 7.4/400 × 100% = 1.85%.
• Convert to standard units: 14% → -1.1 and 18% → 1.1
• From the normal table, it is about 73%.
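The two-stage use of the normal curve (first for the data, then for the chance process) can be sketched as follows. The code follows the slide in rounding the fraction of 1's to 16%; it assumes the scores follow the normal curve, and the variable names are illustrative. Because it does not round the SE or the standard units, the result may differ from the table answer by about a percentage point.

```python
from math import sqrt
from statistics import NormalDist

# Stage 1: normal curve for the data, to fill in the 0-1 box.
scores = NormalDist(mu=70, sigma=10)
frac_exact = 1 - scores.cdf(80)        # area to the right of 80
frac_ones = round(frac_exact, 2)       # 0.16, as on the slide

# Stage 2: normal approximation for the sample percentage.
n_draws = 400
sd_01 = sqrt(frac_ones * (1 - frac_ones))
se_pct = sqrt(n_draws) * sd_01 / n_draws * 100   # SE for percentage
ev_pct = frac_ones * 100

Z = NormalDist()
p = Z.cdf((18 - ev_pct) / se_pct) - Z.cdf((14 - ev_pct) / se_pct)

print(f"fraction above 80: {frac_exact:.4f}")
print(f"SE for percentage: {se_pct:.2f}%")
print(f"chance of 14% to 18%: {p:.1%}")
```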
Good Luck!