The Normal Approximation for Data


Sample Surveys

Terminologies

• Investigators usually want to generalize about a class of individuals. This class is called the population.

• For example, in forecasting the results of a presidential election in the U.S., one relevant population consists of all eligible voters.

• When studying the whole population is impractical (this is the usual case), we can examine only part of it. This part is called the sample.

• Investigators make inferences from the sample to the population. That is, they make generalizations from the part to the whole.

Terminologies

• Usually, there are some numerical facts about the population which the investigators want to know. Such numerical facts are called parameters.

• For example, the mean of the distribution of the heights of men in some large population.

• In general, parameters cannot be determined exactly, but can only be estimated from a sample. Then a major issue is accuracy.

• Parameters are estimated by statistics (some numbers) which can be computed from a sample.

• In the previous example, the average of the heights in a sample will be a good statistic to estimate the mean of the whole population.

Comments

• Statistics are what investigators know; parameters are what they want to know.

• Estimating parameters from the sample is justified when the sample represents the population.

• But it is impossible to check this just by looking at the sample.

• Instead, one has to look at how the sample was chosen.

• Reason: to see whether the sample is like the population in the ways that matter, investigators would have to know the facts about the population that they are trying to estimate.

Choosing a sample

We will see the method of choosing the sample matters a lot by looking at an example.

Later we will introduce the best methods in choosing a sample, which involve probability.

Example

• Background: in 1936, Franklin Delano Roosevelt was completing his first term of office as president of the U.S.

• It was an election year, and the Republican candidate was Governor Alfred Landon of Kansas.

• The country was struggling to recover from the Great Depression.

• There were still nine million unemployed; real income had dropped by one-third in the period 1929-1933 and was just beginning to turn upward.

• Landon was campaigning on a program of economy in government, and Roosevelt was defensive about his deficit financing.

Example

• Most observers thought Roosevelt would be an easy winner.

• But the Literary Digest magazine predicted an overwhelming victory for Landon, with Roosevelt getting only 43% of the popular vote.

• This prediction was based on the largest number of people ever replying to a poll: about 2.4 million individuals.

• It was backed by the enormous prestige of the Digest.

• However, Roosevelt won the 1936 election by a landslide, 62% to 38%.

• The Digest went bankrupt soon after.

Bias 1

• A sampling procedure should be fair, selecting people for inclusion in the sample in an impartial way, so as to get a representative cross section of the public.

• A systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample is called selection bias.

Example

• To find out where the Digest went wrong, we have to look at how they picked their sample.

• The Digest mailed questionnaires to 10 million people.

• The names and addresses of these 10 million people came from sources like telephone books and club membership lists.

• That tended to screen out the poor, who were unlikely to belong to clubs or have telephones. (At the time, for example, only one household in four had a telephone.)

• So there was a very strong bias against the poor in the Digest’s sampling procedure.

Comments

• Prior to 1936, this bias may not have affected the predictions very much, because the rich and poor voted along similar lines.

• But in 1936, the political split followed economic lines more closely. The poor voted overwhelmingly for Roosevelt, while the rich were for Landon.

• So the first reason for the Digest’s error was selection bias.

• When a selection procedure is biased, taking a large sample does not help. This just repeats the basic mistake on a larger scale.
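
The point that a larger sample does not cure selection bias can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population size, the 25% share of "rich" people reached by the sampling frame, and the support rates are all hypothetical and are not the Digest's figures.

```python
import random

def biased_poll(population, in_frame, n, rng):
    """Draw n people at random, but only from those the sampling frame reaches."""
    frame = [person for person in population if in_frame(person)]
    return rng.sample(frame, n)

rng = random.Random(0)

# Hypothetical population: 25% "rich" (on phone/club lists), 75% "poor".
# Assumed support for candidate A: 30% among the rich, 70% among the poor.
population = []
for _ in range(200_000):
    rich = rng.random() < 0.25
    supports_a = rng.random() < (0.30 if rich else 0.70)
    population.append((rich, supports_a))

true_pct = 100 * sum(s for _, s in population) / len(population)

# The frame only contains the rich, so every sample size misses in the same way.
for n in (100, 1_000, 10_000):
    sample = biased_poll(population, lambda person: person[0], n, rng)
    est = 100 * sum(s for _, s in sample) / n
    print(f"n={n:>6}: estimate {est:.1f}% vs. population {true_pct:.1f}%")
```

As n grows, the estimate settles down, but it settles near the percentage among the people in the frame rather than the percentage in the whole population.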

Bias 2

• After deciding which people ought to be in the sample, a survey organization still has to get their opinions.

• If a large number of those selected for the sample do not in fact respond to the questionnaire or the interview, non-response bias is likely to happen.

• The non-respondents differ from the respondents in one obvious way: they did not respond. Experience shows they tend to differ in other important ways as well.

Example

• The Digest did very badly at the first step in sampling. But there is also a second step.

• Only 2.4 million people bothered to reply, out of the 10 million who got the questionnaire.

• These 2.4 million respondents do not even represent the 10 million people who were polled, let alone the population of all voters.

• For example, the Digest made a special survey, with questionnaires mailed to every third registered voter in Chicago.

• About 20% responded, and of those who responded, over half favored Landon.

• But in the election Chicago went for Roosevelt, by a two-to-one margin.
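
A rough simulation shows how non-response alone can distort a poll, even when the mailing list itself is fair. The numbers are assumptions chosen for illustration (an evenly split mailing list, and reply rates of 15% and 35% for the two groups); they are not taken from the Digest poll.

```python
import random

rng = random.Random(1)

# Hypothetical mailing list: half of the recipients favor candidate A.
mailed = [rng.random() < 0.50 for _ in range(100_000)]

# Assumed reply rates: supporters of A reply 15% of the time, others 35%.
replies = [favors_a for favors_a in mailed
           if rng.random() < (0.15 if favors_a else 0.35)]

response_rate = 100 * len(replies) / len(mailed)
pct_mailed = 100 * sum(mailed) / len(mailed)
pct_replies = 100 * sum(replies) / len(replies)

print(f"response rate:           {response_rate:.1f}%")
print(f"favor A among mailed:    {pct_mailed:.1f}%")
print(f"favor A among responses: {pct_replies:.1f}%")
```

Under these assumptions the respondents understate support for A by a wide margin, because the two groups reply at different rates.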

Comments

• The Digest poll was spoiled both by selection bias and non-response bias.

• Non-respondents can be very different from respondents. When there is a high non-response rate, look out for non-response bias.

Remarks

• Special surveys have been carried out to measure the difference between respondents and non-respondents.

• It turns out that lower-income and upper-income people tend not to respond to questionnaires, so the middle class is over-represented among respondents.

• For these reasons, modern survey organizations prefer to use personal interviews rather than mailed questionnaires.

• A typical response rate for personal interviews is 65%, compared to 25% for mailed questionnaires.

Remarks

• But the problem of non-response bias still remains, even with personal interviews.

• Those who are not at home when the interviewer calls may be quite different from those who are at home, with respect to working hours, family ties, social background, and therefore with respect to attitudes.

• Good survey organizations keep this problem in mind, and have ingenious methods for dealing with it. (See reading materials for the Gallup poll.)

Summary for choosing a sample

• Some samples are really bad. To find out whether a sample is any good, ask how it was chosen.

• Was there selection bias?

• Was there non-response bias?

• You may not be able to answer these questions just by looking at the data.

Probability methods

Probability methods use objective and impartial chance mechanisms to select the sample, in contrast to quota sampling, which is not a probability method. (See reading materials.)

What is a probability method?

• A probability method for drawing a sample can be described with an example.

• Suppose we carry out a survey of 100 voters in a small town with a population of 1,000 eligible voters.

• We list all the eligible voters, write the name of each one on a ticket, put all 1,000 tickets in a box, and draw 100 tickets at random.

• Since there is no point in interviewing the same person twice, the draws are made without replacement.

• The people whose tickets have been drawn form the sample.

Simple random sampling

• The process is called simple random sampling: tickets have simply been drawn at random without replacement.

• At each draw, every ticket in the box has an equal chance to be chosen.

• The interviewers have no discretion at all in whom they interview, and the procedure is impartial: everybody has the same chance to get into the sample.

• The law of averages guarantees that the percentage of any given group (e.g. Democrats) in the sample is likely to be close to the percentage in the population.
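
The town example can be sketched in a few lines of Python. The 1,000 tickets and the sample of 100 come from the example above; the number of Democrats in the town (550) is an assumption made only for illustration.

```python
import random

rng = random.Random(2)

# One ticket per eligible voter, identified here by a number 0..999.
tickets = list(range(1_000))

# Assumption for illustration: voters 0..549 are Democrats (55% of the town).
democrats = set(range(550))

# random.sample draws without replacement, so no voter is picked twice.
sample = rng.sample(tickets, 100)

pct_population = 100 * len(democrats) / len(tickets)
pct_sample = 100 * sum(1 for voter in sample if voter in democrats) / len(sample)

print(f"Democrats in town:   {pct_population:.1f}%")
print(f"Democrats in sample: {pct_sample:.1f}%")
```

Rerunning with a different seed gives a different sample percentage, but it will usually land within a few percentage points of the town's figure.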

Comments

• For a large population, it is often not practical to take a simple random sample.

• For example, to predict a presidential election, we first need a list of all the eligible voters, which is over 200 million names. There is no such list.

• Even if there were, drawing a few thousand names at random from 200 million is not an easy job, since every name in the box must have an equal chance of being selected.

• Even if we could draw a simple random sample, the people would be scattered all over the map. It would be prohibitively expensive to send interviewers around to find them all.

• So in practice, we use another probability method instead.

Multistage cluster sampling

• We describe the idea of multistage cluster sampling using the following example.

• From 1952 through 1984, the Gallup pre-election surveys were all done using just about the same procedure.

• The Gallup Poll makes a separate study in each of the four geographic regions of the U.S.: Northeast, South, Midwest, and West.

• Within each region, they group together all the population centers of similar sizes. One such grouping might be all towns in the Northeast with a population between 50 and 250 thousand.

• Then, a random sample of these towns is selected.

Multistage cluster sampling

• Interviewers are stationed in the selected towns, and no interviews are conducted in the other towns of that group.

• Other groupings are handled the same way. This completes the 1st stage of sampling.

• For election purposes, each town is divided up into wards, and the wards are subdivided into precincts.

• At the 2nd stage of sampling, some wards are selected at random from each town chosen in the stage before.

Multistage cluster sampling

• At the 3rd stage, some precincts are drawn at random from each of the previously selected wards.

• At the 4th stage, households are drawn at random from each selected precinct.

• Finally, some members of the selected households are interviewed.

• Remember, no discretion is allowed. (e.g. Gallup Poll interviewers are instructed to “speak to the youngest man 18 or older at home, or if no man is at home, the oldest woman 18 or older”.)
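
The four stages can be sketched as nested random draws. This is a simplified illustration, not Gallup's actual procedure: the town, ward, precinct, and household counts are hypothetical, and the grouping of towns by region and size that precedes stage 1 is left out.

```python
import random

rng = random.Random(3)

def make_towns(n_towns=20, n_wards=6, n_precincts=5, n_households=40):
    """Build a made-up hierarchy: towns -> wards -> precincts -> households."""
    return {
        f"t{t}": {
            f"w{w}": {
                f"p{p}": [f"t{t}-w{w}-p{p}-h{h}" for h in range(n_households)]
                for p in range(n_precincts)
            }
            for w in range(n_wards)
        }
        for t in range(n_towns)
    }

def multistage_sample(towns, k_towns, k_wards, k_precincts, k_households, rng):
    """Draw at random at every stage; nobody chooses units by judgment."""
    households = []
    for town in rng.sample(sorted(towns), k_towns):                      # stage 1
        wards = towns[town]
        for ward in rng.sample(sorted(wards), k_wards):                  # stage 2
            precincts = wards[ward]
            for precinct in rng.sample(sorted(precincts), k_precincts):  # stage 3
                households += rng.sample(precincts[precinct], k_households)  # stage 4
    return households

towns = make_towns()
chosen = multistage_sample(towns, k_towns=4, k_wards=2, k_precincts=2,
                           k_households=5, rng=rng)
print(len(chosen), "households, for example:", chosen[:3])
```

Because interviews are confined to a few towns, wards, and precincts, the selected households are geographically clustered, which is what keeps the interviewing costs down.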

The figure for Multistage cluster sampling

The advantage

• The method is set up so the distribution of the sample by residence is the same as the distribution for the nation.

• Each stage in the selection procedure uses an objective and impartial chance mechanism to select the sample units.

• So there is no selection bias on the part of the interviewer.

• Note: there could still be selection bias in other parts of the procedure. (e.g. the separate study in each of the four geographical regions, the division of towns into wards and of wards into precincts, etc.)

The Gallup Poll record

• There are three points to notice:

• 1st. The sample size has gone down sharply. They used a sample of about 50,000 in 1948, but they now use samples less than a tenth of that size.

• 2nd. There is no longer any consistent trend favoring either Republicans or Democrats.

• 3rd. The accuracy has gone up appreciably.

• Using probability methods to select the sample, the Gallup Poll has been able to predict the elections with startling accuracy while sampling fewer than 5 persons in 100,000, which demonstrates the value of probability methods in sampling.

Remarks

• Simple random sampling is the basic probability method.

• Other methods can be quite complicated.

• But all probability methods for sampling have two important features:

• The interviewers have no discretion at all as to whom they interview;

• There is a definite procedure for selecting the sample, and it involves the planned use of chance.

• As a result, with a probability method it is possible to compute the chance that any particular individual in the population will get into the sample.

• Note: to minimize bias, an impartial and objective probability method should be used to choose the sample.

Telephone Surveys

Many surveys are now conducted by telephone. The savings in costs are dramatic.

How to pick a sample?

• Here are two examples:

• In 1988, the Gallup Poll used a multistage cluster sample based on area codes, “exchanges”, and “banks”.

• For example: Area code-Exchange-Bank-Digits: 415-767-26-76.

• In 1992, they switched to a simpler design.

• There are 4 time zones in the U.S. The Gallup Poll divided each zone into 3 types of areas, according to population density (heavy, medium, light). That gives 12 strata.

• Within each stratum, they drew a simple random sample of telephone numbers, using the computer to exclude businesses by checking the yellow pages.

• Choosing telephone numbers at random is called RDD: random digit dialing.
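
The 1992 design, 12 strata with random telephone numbers drawn inside each, might be sketched as follows. Everything beyond the 4 x 3 stratification is an assumption: the zone labels, the placeholder area codes (one made-up code per stratum), the sample size of 10 per stratum, and the way digits are generated. The step that excludes business numbers is omitted.

```python
import itertools
import random

rng = random.Random(4)

time_zones = ["Eastern", "Central", "Mountain", "Pacific"]
densities = ["heavy", "medium", "light"]

# 4 time zones x 3 density types = 12 strata.
strata = list(itertools.product(time_zones, densities))

# Placeholder area codes "000".."011", not tied to any real geography.
area_codes = {stratum: f"{i:03d}" for i, stratum in enumerate(strata)}

def rdd_sample(area_code, n, rng):
    """Random digit dialing sketch: n distinct numbers with random last 7 digits."""
    numbers = set()
    while len(numbers) < n:
        numbers.add(f"{area_code}-{rng.randrange(10**7):07d}")
    return sorted(numbers)

# A separate random sample of numbers is drawn within every stratum.
samples = {stratum: rdd_sample(area_codes[stratum], 10, rng) for stratum in strata}

for stratum in strata[:3]:
    print(stratum, samples[stratum][:2])
```

Generating the digits at random, rather than reading them from a directory, means that unlisted numbers have the same chance of being drawn as listed ones.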

Comments

• Non-respondents create problems, as usual.

• The Gallup Poll does most of its interviewing on evenings and weekends, when people are more likely to be at home.

• If there is no answer, the interviewer will call back up to 3 times. (Some designs have up to 15 call-backs. That is better, but more expensive.)

• For many purposes, results are comparable to those from face-to-face interviews, and the cost is about 1/3 as much. This is why survey organizations are using the telephone.

Remarks

• People who do not have phones must be different from the rest of us, and that does cause a bias in telephone surveys.

• But the effect is small, because these days nearly everybody has a phone.

• On the other hand, about 1/3 of residential telephones are unlisted. Rich people and poor people are more likely to have unlisted numbers, so the telephone book tilts toward the middle class.

• Sampling from directories would create a real bias, but random digit dialing gets around this difficulty.

Chance error and bias

We have introduced the practical difficulties faced by real survey organizations.

However, even if all these difficulties are assumed away, the sample is still likely to be off, due to chance error.

Chance error

• Suppose we have a box with a very large number of tickets, some marked 1 and the others marked 0. That is the population.

• We want to estimate the percentage of 1’s in the box. That is the parameter.

• We draw 1,000 tickets at random without replacement. That is the sample.

• In this case, there is no problem about the response. Also, drawing tickets at random eliminates selection bias.

• As a result, the percentage of 1’s in the sample is going to be a good estimate for the percentage of 1’s in the box.

Chance error

• But the estimate is still likely to be a bit off, because the sample is only part of the population.

• Since the sample is chosen at random, the amount off is governed by chance:

• Percentage of 1’s in sample = percentage of 1’s in box + chance error.

• In other situations, if we take bias into account:

• Estimate = parameter + bias + chance error.

• Chance error is often called “sampling error”; this error comes from the fact that the sample is only part of the whole.

• Bias is called “non-sampling error”; this error comes from other sources, like non-response. Bias is often a more serious problem than chance error.
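
The box model can be simulated directly to see what chance error looks like. The sample size of 1,000 drawn without replacement follows the description above; the box size (100,000 tickets) and the percentage of 1's (40%) are assumptions made for illustration.

```python
import random

rng = random.Random(5)

# Hypothetical box: 100,000 tickets, 40% of them marked 1.
box = [1] * 40_000 + [0] * 60_000
parameter = 100 * sum(box) / len(box)   # percentage of 1's in the box

def chance_error(n, rng):
    """Percentage of 1's in a random sample of n tickets, minus the parameter."""
    sample = rng.sample(box, n)          # drawn without replacement
    return 100 * sum(sample) / n - parameter

# Repeat the survey 200 times to see how the chance error behaves.
errors = [chance_error(1_000, rng) for _ in range(200)]
mean_error = sum(errors) / len(errors)
typical_size = (sum(e * e for e in errors) / len(errors)) ** 0.5

print(f"parameter:                    {parameter:.1f}%")
print(f"average chance error:         {mean_error:+.2f} percentage points")
print(f"typical size of chance error: {typical_size:.2f} percentage points")
```

With no bias in the draws, the chance errors scatter on both sides of zero, so their average is close to zero even though any single sample is a bit off.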

Some natural questions

• As in previous chapters, when there is a chance error in an equation, we usually ask:

• How big are the chance errors likely to be?

• How do they depend on the size of the sample?

• Or do they depend on the size of the population?

• How big does the sample have to be in order to keep the chance errors under control?

• We will study these topics next in class.

Summary

• A sample is part of a population.

• A parameter is a numerical fact about a population. Usually a parameter cannot be determined exactly, but can only be estimated.

• A statistic can be computed from a sample, and used to estimate a parameter. The statistic is what the investigator knows. A parameter is what the investigator wants to know. The major issue is accuracy.

• When evaluating a sample survey, ask yourself: What is the population? What is the parameter? How was the sample chosen? What was the response rate? Watch for selection bias and non-response bias.

• Large samples offer no protection against bias.

Summary

• Probability methods for sampling use an objective and impartial process to pick the sample, and leave no discretion to the interviewer.

• The investigator can compute the probability that any particular individual in the population will be selected for the sample. Probability methods guard against bias, because blind chance is impartial.

• Even when using probability methods, bias may come in. Then the estimate differs from the parameter, due to bias and chance error:

• Estimate = parameter + bias + chance error.