Lecture Unit 1 - NC State Department of Statistics

Lecture Unit 3
Sample Surveys
Producing Valid Data
“If you don’t believe in random sampling,
the next time you have a blood test tell the
doctor to take it all.”
The election of 1948
                The Predictions                The Results
The Candidates  Crossley   Gallup   Roper
Truman          45         44       38         50
Dewey           50         50       53         45
Lecture Unit 3 Objectives
1. Given a survey sample, determine whether the
sample is a simple random sample, a stratified
sample, a cluster sample, or a systematic sample.
2. Choose a simple random sample, stratified random
sample, cluster sample, and systematic random
sample in a variety of situations.
3. Explain the effect of sample size when determining
whether a sample is representative of the population.
Beyond the Data at Hand to the
World at Large



We have learned ways to display, describe, and
summarize data, but have been limited to
examining the particular batch of data we have.
We’d like (and often need) to stretch beyond the
data at hand to the world at large.
Let’s investigate three major ideas that will allow
us to make this stretch…
3 Key Ideas That Enable Us to
Make the Stretch
Idea 1: Examine a Part of the
Whole

The first idea is to draw a sample.
– We’d like to know about an entire population of
individuals, but examining all of them is
usually impractical, if not impossible.
– We settle for examining a smaller group of
individuals—a sample—selected from the
population.
Examples
1. Think about sampling something you are
cooking—you taste (examine) a small part of what
you’re cooking to get an idea about the dish as a
whole.
2. Opinion polls are examples of sample surveys,
designed to ask questions of a small group of
people in the hope of learning something about the
entire population.
Sampling methods
Convenience sampling: Just ask whoever is around.
– Example: “Man on the street” survey (cheap, convenient, often quite
opinionated or emotional => now very popular with TV “journalism”)

Which men, and on which street?
– Ask about gun control or legalizing marijuana “on the street” in
Berkeley or in some small town in Idaho and you would probably get
totally different answers.
– Even within an area, answers would probably differ if you did the
survey outside a high school or a country western bar.

Bias: Opinions limited to individuals present.
Voluntary Response Sampling:

Individuals choose to be involved. These samples are very
susceptible to being biased because different people are motivated
to respond or not. Often called “public opinion polls.” These are not
considered valid or scientific.

Bias: Sample design systematically favors a particular outcome.
Ann Landers summarizing responses of readers
70% of (10,000) parents wrote in to say that having kids was not
worth it—if they had to do it over again, they wouldn’t.
Bias: Most letters to newspapers are written by disgruntled people. A
random sample showed that 91% of parents WOULD have kids again.
CNN on-line surveys:
Bias: People have to care enough about an issue to bother replying. This sample
is probably a combination of people who hate “wasting the taxpayers’ money” and
“animal lovers.”
Example: hospital employee drug
use
Administrators at a hospital are concerned about the
possibility of drug abuse by people who work there. They
decide to check on the extent of the problem by having a
random sample of the employees undergo a drug test. The
administrators randomly select a department (say, radiology)
and test all the people who work in that department –
doctors, nurses, technicians, clerks, custodians, etc.
• Why might this result in a biased sample?
• Dept. might not represent full range of
employee types, experiences, stress levels,
or the hospital’s drug supply
Example (cont.)
Name the kind of bias that might be present if the
administration decides that instead of subjecting
people to random testing they’ll just…
a. interview employees about possible drug abuse.
– Response bias: people will feel threatened, won’t answer
truthfully
b. ask people to volunteer to be tested.
– Voluntary response bias; only those who are “clean”
would volunteer
Bias
• Bias is the bane of sampling: the one thing
above all to avoid.
• There is usually no way to fix a biased
sample and no way to salvage useful
information from it.
• The best way to avoid bias is to select
individuals for the sample at random.
• The value of deliberately introducing
randomness is one of the great insights of
Statistics – Idea 2
Idea 2: Randomize
• Randomization can protect you against factors that
you know are in the data.
– It can also help protect against factors you are not even
aware of.
• Randomizing protects us from the influences of
all the features of our population, even ones that
we may not have thought about.
– Randomizing makes sure that on the average the
sample looks like the rest of the population
Idea 2: Randomize (cont.)
Individuals are randomly selected. No one group should be overrepresented.
Sampling randomly gets rid of bias.
Random samples rely on the absolute objectivity of
random numbers. There are tables and books of
random digits available for random sampling.
Statistical software can
generate random digits
(e.g., Excel “=RAND()”,
Ran# button on
calculator).
Idea 2: Randomize (cont.)

Not only does randomizing protect us from
bias, it actually makes it possible for us to
draw inferences about the population when
we see only a sample.
Hospital example (cont.)


Listed in the table are the
names of the 20
pharmacists on the
hospital staff. Use the
random numbers listed
below to select three of
them to be in the sample.
04905 83852 29350
91397 19994 65142
05087 11232
01 NCSU 02 UNC 03 Duke 04 Wake F 05 BC 06 UM 07 Maryl.
08 Clem 09 UVA 10 VaTech 11 GaTech 12 FSU 13 OSU 14 ILL
15 IN 16 PUR 17 IOWA 18 MSU 19 Mich 20 PennS 21 NorthW
22 MN 23 WISC
96927 19931 36089 74192 77567 88741 48409 41903
The first 3 schools in a random sample selected from the ACC and Big
Ten using the above random numbers are:
1. UVA, UM, UNC
2. UVA, NCSU, Duke
3. UVA, UM, UVA
4. Clem, Mich, Duke
5. Mich, OSU, Maryl
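Reading labels off a line of random digits can be sketched in Python. This is a minimal sketch, not part of the original slides; the function name `pick_from_digits` is made up for illustration. It reads the digits two at a time, skips pairs that are not a label from 01 to 23, skips labels already chosen, and stops after three picks.

```python
def pick_from_digits(digits, labels, k):
    """Read 2-digit pairs left to right; keep pairs that are valid, unused labels."""
    digits = digits.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        pair = digits[i:i + 2]
        if pair in labels and pair not in chosen:
            chosen.append(pair)
        if len(chosen) == k:
            break
    return [labels[p] for p in chosen]

schools = ["NCSU", "UNC", "Duke", "Wake F", "BC", "UM", "Maryl.", "Clem",
           "UVA", "VaTech", "GaTech", "FSU", "OSU", "ILL", "IN", "PUR",
           "IOWA", "MSU", "Mich", "PennS", "NorthW", "MN", "WISC"]
# Labels "01" .. "23", in the same order as the list on the slide
labels = {f"{i:02d}": name for i, name in enumerate(schools, start=1)}

line = "96927 19931 36089 74192 77567 88741 48409 41903"
first_three = pick_from_digits(line, labels, 3)
print(first_three)  # ['Clem', 'Mich', 'Duke']
```

The only usable pairs in this line are 08, 19, and 03, which matches the answer choice Clem, Mich, Duke.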
Idea 3: It’s the Sample Size!!
• How large a random sample do we need for the
sample to be reasonably representative of the
population?
• It’s the size of the sample, not the size of the
population, that makes the difference in sampling.
– Exception: If the population is small enough and the
sample is more than 10% of the whole population, the
population size can matter.
• The fraction of the population that you’ve
sampled doesn’t matter. It’s the sample size itself
that’s important.
Example
• i) In the city of Chicago, Illinois, 1,000 likely
voters are randomly selected and asked who they
are going to vote for in the Chicago mayoral race.
• ii) In the state of Illinois, 1,000 likely voters are
randomly selected and asked who they are going
to vote for in the Illinois governor's race.
• iii) In the United States, 1,000 likely voters are
randomly selected and asked who they are going
to vote for in the presidential election.
• Which survey has more accuracy?
• All the surveys have the same accuracy
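One way to see why the three surveys are about equally accurate is the standard error of a sample proportion from a simple random sample: sqrt(p(1-p)/n), multiplied by the finite population correction sqrt((N-n)/(N-1)). The sketch below is an illustration, not part of the original slides. The population sizes are round, made-up numbers rather than real voter counts, and p = 0.5 is an assumed value.

```python
import math

def se_proportion(p, n, N):
    """Standard error of a sample proportion for an SRS of n from a population of N."""
    fpc = math.sqrt((N - n) / (N - 1))  # finite population correction
    return math.sqrt(p * (1 - p) / n) * fpc

n = 1000
p = 0.5  # assumed true proportion (gives the largest standard error)
# Round, illustrative population sizes (assumptions, not real voter counts)
populations = {"city": 1_000_000, "state": 5_000_000, "country": 150_000_000}

ses = {name: se_proportion(p, n, N) for name, N in populations.items()}
for name, se in ses.items():
    print(f"{name:8s} N={populations[name]:>11,d}  SE={se:.5f}")
```

With n fixed at 1,000 the correction factor is almost exactly 1 for all three populations, so the standard errors agree to about four decimal places.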
Idea 3: It’s the Sample Size!!

Chicken soup

Blood samples
Does a Census Make Sense?
• Why bother worrying about the sample
size?
• Wouldn’t it be better to just include
everyone and “sample” the entire
population?
– Such a special sample is called a census.
Does a Census Make Sense? (cont.)
•
There are problems with taking a census:
– Practicality: It can be difficult to complete a census—
there always seem to be some individuals who are hard to
locate or hard to measure.
– Timeliness: populations rarely stand still. Even if you
could take a census, the population changes while you
work, so it’s never possible to get a perfect measure.
– Expense: taking a census may be more complex than
sampling.
– Accuracy: a census may not be as accurate as a good
sample due to data entry error, inaccurate (made-up?)
data, tedium.
Population versus sample
• Population: The entire
group of individuals in which
we are interested but can’t
usually assess directly.
• Example: All humans, all
working-age people in
California, all crickets
• Sample: The part of the
population we actually
examine and for which we
do have data.
How well the sample
represents the population
depends on the sample
design.
• A parameter is a number
describing a characteristic of
the population.
• A statistic is a number
describing a characteristic of
a sample.
Sample Statistics Estimate Parameters
• Values of population parameters are unknown; in
addition, they are unknowable.
• Example: The distribution of heights of adult females
(at least 18 yrs of age) in the United States is
approximately symmetric and mound-shaped with
mean µ. µ is a population parameter whose value is
unknown and unknowable.
• The heights of 1500 females are obtained from a
sample of government records. The sample mean x̄ of
the 1500 heights is calculated to be 64.5 inches.
• The sample mean x̄ is a sample statistic that we use to
estimate the unknown population parameter µ.
We typically use Greek letters to
denote parameters and Latin
letters to denote statistics.
Various claims are often made for
surveys. Why is each of the following
claims not correct?
• It is always better to take a census than a sample
– Timeliness, expense, complexity, accuracy
• Stopping students on their way out of the cafeteria is a
good way to sample if we want to know the quality of the
food in the cafeteria.
– Bias; they chose to eat at the cafeteria
• We drew a sample of 100 from the 3,000 students at a
small college. To get the same level of precision for a town
of 30,000 residents, we'll need a sample of 1,000
residents.
– It’s the sample size, not the size of the population or the
fraction of the population that we sample, that is important.
Survey claims (cont.)
• An internet poll taken at the web site
www.statsisfun.org garnered 12,357 responses. The
majority said they enjoy doing statistics homework.
With a sample size that large, we can be pretty sure
that most Statistics students feel this way, too.
– Voluntary response bias; size of sample does not
remove the bias.
• The true percentage of all Statistics students who enjoy
the homework is called a “population statistic.”
– The true percentage is a population parameter
3.2 Simple Random Samples

Desire the sample to be representative of the
population from which the sample is selected

Each individual in the population should have
an equal chance to be selected

Is this good enough?
Example
Select a sample of high school students as follows:
1. Flip a fair coin
2. If heads, select all female students in the school as the
sample
3. If tails, select all male students in the school as the
sample
• Each student has an equal chance to be in the sample
• Every sample a single gender, not representative
• Each individual in the population has an equal chance
to be selected. Is this good enough?
• NO!!
Simple Random Samples

A simple random sample (SRS) of size n
consists of n units from the population
chosen in such a way that every set of n
units has an equal chance to be the sample
actually selected.
Simple Random Samples (cont.)
• Suppose a large History class of 500 students has 250 male and 250
female students.
• To select a random sample of 250 students from the class, I flip a fair
coin one time.
• If the coin shows heads, I select the 250 males as my sample; if the
coin shows tails I select the 250 females as my sample.
• What is the chance any individual student from the class is included in
the sample? ½
• This is a random sample. Is it a simple random sample? NO!
Not every possible group of 250 students has
an equal chance to be selected.
Every sample consists of only 1 gender,
hardly representative.
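The difference between "every individual has an equal chance" and "every set of n has an equal chance" can be checked by listing all possible samples. The sketch below shrinks the class to a toy population of 2 males and 2 females with a sample size of 2; the names `M1`, `F1`, and so on are made up for illustration.

```python
from itertools import combinations
from fractions import Fraction

# Toy population (hypothetical names): 2 males, 2 females; sample size 2
males, females = ("M1", "M2"), ("F1", "F2")
population = males + females

# Coin-flip design: only two possible samples, each with chance 1/2
coin_design = {males: Fraction(1, 2), females: Fraction(1, 2)}

# SRS design: every set of 2 students is equally likely
all_pairs = list(combinations(population, 2))
srs_design = {pair: Fraction(1, len(all_pairs)) for pair in all_pairs}

def inclusion_chance(design, student):
    """Chance that a given student ends up in the sample."""
    return sum(prob for sample, prob in design.items() if student in sample)

for student in population:
    print(student, inclusion_chance(coin_design, student),
          inclusion_chance(srs_design, student))
print("possible samples:", len(coin_design), "vs", len(srs_design))
```

Each student has chance 1/2 under both designs, but the coin flip can produce only 2 of the 6 possible pairs, so it is not a simple random sample.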
Sampling Frame
•
To select a sample at random, we first need to
define where the sample will come from.
– The sampling frame is a list of individuals from which
the sample is drawn.
– E.g., To select a random sample of students from
NCSU, we might obtain a list of all registered full-time
students from Registration & Records.
– When defining sampling frame, must deal with details
defining the population; are part-time students
included? How about current study-abroad students?
•
Once we have our sampling frame, the easiest way
to choose an SRS is with random numbers.
Warning!
• If some members of the population are not
included in the sampling frame, they cannot
be part of the sample!! (e.g., using a
telephone book as the sampling frame)
• Population: Wal Mart shoppers
• Sampling frame?
Example: simple random sample

Academic dept wishes to randomly choose
a 3-member committee from the 28
members of the dept
01 Abbott      08 Goodwin      15 Pillotte    22 Theobald
02 Cicirelli   09 Haglund      16 Raman       23 Vader
03 Crane       10 Johnson      17 Reimann     24 Wang
04 Dunsmore    11 Keegan       18 Rodriguez   25 Wieczoreck
05 Engle       12 Lechtenb’g   19 Rowe        26 Williams
06 Fitzpat’k   13 Martinez     20 Sommers     27 Wilson
07 Garcia      14 Nguyen       21 Stone       28 Zink
Solution
• Use a random number table; read 2-digit pairs
until you have chosen 3 committee members,
skipping pairs that are not 01-28 and pairs already chosen
• For example, start in row 121:
71487 09984 29077 14863 61683 47052 62224 51025
• Garcia (07) Theobald (22) Johnson (10)
• Your calculator generates random numbers; you
can also generate random numbers using Excel
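Software can replace the random number table. The sketch below uses `random.sample` from Python's standard library, which draws without replacement so that every set of 3 members is equally likely. The seed value is arbitrary and is there only to make the run repeatable; it is not part of the original example.

```python
import random

members = ["Abbott", "Cicirelli", "Crane", "Dunsmore", "Engle", "Fitzpat'k",
           "Garcia", "Goodwin", "Haglund", "Johnson", "Keegan", "Lechtenb'g",
           "Martinez", "Nguyen", "Pillotte", "Raman", "Reimann", "Rodriguez",
           "Rowe", "Sommers", "Stone", "Theobald", "Vader", "Wang",
           "Wieczoreck", "Williams", "Wilson", "Zink"]

rng = random.Random(2024)             # arbitrary seed, used only for repeatability
committee = rng.sample(members, k=3)  # 3 distinct members; every trio equally likely
print(committee)
```

Changing or removing the seed gives a different committee, just as starting in a different row of the table would.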
Sampling Variability
• Suppose we had started in line 145?
19687 12633 57857 95806 09931 02150 43163 58636
• Our sample would have been
19 Rowe, 26 Williams, 06 Fitzpatrick
Sampling Variability
• Samples drawn at random generally differ from
one another.
• Each draw of random numbers selects different
people for our sample.
• These differences lead to different values for the
variables we measure.
• We call these sample-to-sample differences
sampling variability.
• Variability is OK; bias is bad!!
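Sampling variability is easy to see in a small simulation. The sketch below is an illustration under assumptions: the population is a made-up list of 10,000 values generated with mean near 64.5 and standard deviation near 2.5, and the seed is arbitrary.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed for repeatability
# Hypothetical population: 10,000 values with mean near 64.5 and SD near 2.5
population = [rng.gauss(64.5, 2.5) for _ in range(10_000)]

sample_means = []
for draw in range(5):
    sample = rng.sample(population, k=100)  # a fresh SRS each time
    sample_means.append(statistics.mean(sample))
    print(f"sample {draw + 1}: mean = {sample_means[-1]:.2f}")

print(f"population mean = {statistics.mean(population):.2f}")
```

The five sample means differ from one another, yet they all land close to the population mean: the samples vary, but they are not systematically off in one direction.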
Stratified Random Sampling

This sampling procedure separates the
population into mutually exclusive sets
(strata), and then selects simple random
samples from each stratum.
Occupation
• professional
• clerical
• blue-collar
Age
• under 20
• 20-30
• 31-40
• 41-50
Sex
• Male
• Female
Stratified Random Sampling

With this procedure we can acquire
information about
– the whole population
– each stratum
– the relationships among strata.
Stratified Random Sampling
• There are several ways to build the stratified
sample. For example, keep the proportion of
each stratum in the population.
A sample of size 1,000 is to be drawn
Stratum   Income            Population proportion   Stratum size
1         under $15,000     25%                     250
2         $15,000-29,999    40%                     400
3         $30,000-50,000    30%                     300
4         over $50,000      5%                      50
Total                                               1,000
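The proportional allocation in the table, followed by a separate simple random sample inside each stratum, can be sketched as follows. The sampling frame here is hypothetical: 20,000 made-up ID numbers split across the strata in the stated proportions, with an arbitrary seed.

```python
import random

proportions = {
    "under $15,000": 0.25,
    "$15,000-29,999": 0.40,
    "$30,000-50,000": 0.30,
    "over $50,000": 0.05,
}
total_sample = 1000

# Proportional allocation: each stratum's share of the sample
# matches its share of the population
allocation = {name: round(share * total_sample)
              for name, share in proportions.items()}
print(allocation)

# Hypothetical sampling frame: 20,000 ID numbers split into strata
rng = random.Random(7)  # arbitrary seed
frame, next_id = {}, 0
for name, share in proportions.items():
    size = round(share * 20_000)
    frame[name] = list(range(next_id, next_id + size))
    next_id += size

# Draw a separate SRS inside every stratum
stratified_sample = {name: rng.sample(frame[name], k=allocation[name])
                     for name in frame}
print({name: len(ids) for name, ids in stratified_sample.items()})
```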
Cluster Sampling
• Sometimes stratifying isn’t practical and
simple random sampling is difficult.
• Splitting the population into similar parts or
clusters can make sampling more practical.
• Then we could select one or a few clusters at
random and select an SRS or perform a
census within each selected cluster.
• This sampling design is called cluster
sampling.
• If each cluster fairly represents the full
population, cluster sampling will give us an
unbiased sample.
Cluster Sampling Useful When…
• it is difficult and costly to
develop a complete list of the
population members (making
it difficult to develop a simple
random sampling procedure).
– e.g., all items sold in a grocery store
• the population members are widely
dispersed geographically.
– e.g., all Toyota dealerships in North
Carolina
Mean length of sentences
in our course text
• We would like to assess the
reading level of our course text
based on the length of the sentences.
• Simple random sampling would be awkward:
– number each sentence in the book?
• Better way:
– choose a few pages at random (the pages are the
clusters, and it's reasonable to assume that each
page is representative of the entire text).
– count the length of all sentences on the selected
pages or select an SRS of sentences from each of the
selected pages.
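The page-as-cluster idea can be sketched in a few lines. Everything about the book below is hypothetical: 300 pages, each with a randomly generated list of sentence lengths in words, and an arbitrary seed. The sketch picks 5 pages at random and takes a census of the sentences on those pages.

```python
import random
import statistics

rng = random.Random(3)  # arbitrary seed
# Hypothetical book: 300 pages, each holding a list of sentence lengths (in words)
book = {page: [rng.randint(5, 35) for _ in range(rng.randint(10, 20))]
        for page in range(1, 301)}

# Stage 1: choose a few pages (clusters) at random
chosen_pages = rng.sample(sorted(book), k=5)

# Stage 2: take a census of the sentences on each chosen page
lengths = [n for page in chosen_pages for n in book[page]]
estimate = statistics.mean(lengths)

print("pages:", chosen_pages)
print(f"estimated mean sentence length: {estimate:.1f} words")
```

No list of every sentence in the book is needed, only a list of pages, which is what makes the cluster design practical here.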
Cluster sampling - not the same
as stratified sampling!!
• We stratify to ensure that our sample represents
different groups in the population, and sample
randomly within each stratum.
– Strata are homogeneous (e.g., male, female) but
differ from one another
• Clusters are more or less alike, each
heterogeneous and resembling the overall
population.
– We select clusters to make sampling more practical or
affordable.
– We select an SRS or conduct a census on each
selected cluster.
Multistage Sampling
• Sometimes we use a variety of sampling
methods together.
• Sampling schemes that combine several
methods are called multistage samples.
Most surveys conducted by
professional polling
organizations and
government agencies use
some combination of
stratified and cluster
sampling as well as simple
random sampling.
Mean length of sentences
in our course text, cont.
• In attempting to assess the
reading level of our course text:
– we might worry that it starts out easy and
gets harder as the concepts become more
difficult
– we want to avoid samples that select too
heavily from early or from late chapters
• Suppose our course text has 5 sections,
with several chapters in each section.
Mean length of sentences in our
course text, cont.
• We could:
– i) randomly select 1 chapter from each section
– ii) randomly select a few pages from each of the
selected chapters
– iii) if altogether this makes too many sentences, we
could randomly select a few sentences from each
page.
• So what is our sampling strategy?
– i) we stratify by section of the book
– ii) we randomly choose a chapter to represent each
stratum (section)
– iii) within each chapter we randomly choose pages as
clusters
– iv) finally, we choose an SRS of sentences from the
randomly chosen pages.
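The four-step strategy can be sketched as nested random draws. The book dimensions below (5 sections, 4 chapters per section, 20 pages per chapter, 15 sentences per page) and the numbers drawn at each stage are assumptions chosen for illustration; only the 5 sections come from the example.

```python
import random

rng = random.Random(11)  # arbitrary seed
# Hypothetical text: 5 sections x 4 chapters x 20 pages x 15 sentences,
# each sentence identified by (section, chapter, page, sentence number)
SECTIONS, CHAPTERS, PAGES, SENTENCES = 5, 4, 20, 15

sample = []
for section in range(1, SECTIONS + 1):                    # i) stratify by section
    chapter = rng.randint(1, CHAPTERS)                    # ii) one random chapter per stratum
    pages = rng.sample(range(1, PAGES + 1), k=3)          # iii) random pages as clusters
    for page in pages:
        picks = rng.sample(range(1, SENTENCES + 1), k=4)  # iv) SRS of sentences on the page
        sample.extend((section, chapter, page, s) for s in picks)

print(len(sample), "sentences sampled")  # 5 sections x 3 pages x 4 sentences = 60
print(sample[:3])
```

Because every section contributes the same number of sentences, the sample cannot lean too heavily on early or late chapters.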
Systematic Sampling
• Sometimes we draw a sample by selecting
individuals systematically.
– For example, you might survey every 10th person on an
alphabetical list of students.
• To make it random, you must still start the systematic
selection from a randomly selected individual.
• When there is no reason to believe that the order of
the list could be associated in any way with the
responses sought, systematic sampling can give a
representative sample.
• Systematic sampling can be much less expensive
than true random sampling.
• When you use a systematic sample, you need to
justify the assumption that the systematic method is
not associated with any of the measured variables.
Systematic Sampling-example
• You want to select a sample of 50 students from a
college dormitory that houses 500 students.
• On a list of all students living in the dorm, number the
students from 001 to 500.
• Generate a random number between 001 and 010,
and start with that student.
• Every 10th student in the list becomes part of your
sample. For example: 3, 13, 23, 33, 43, 53, …, 493.
• Questions:
• 1) does each student have an equal chance to be in
the sample? Yes
• 2) what is the chance that a student is included in the
sample? 1/10
• 3) is this an SRS? No: only 10 different samples are
possible, so not every set of 50 students can be chosen.
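The dormitory example can be sketched directly. The seed is arbitrary; the rest follows the steps above: number the students 1 to 500, pick a random start between 1 and 10, then take every 10th student.

```python
import random

rng = random.Random(5)  # arbitrary seed
students = list(range(1, 501))   # students numbered 001 to 500
step = len(students) // 50       # 500 / 50 = every 10th student

start = rng.randint(1, step)     # random start between 1 and 10
sample = list(range(start, len(students) + 1, step))
print(sample[:5], "...", sample[-1])

# Only 10 different samples are possible, one per starting point,
# so most sets of 50 students can never be chosen: not an SRS.
possible_samples = [list(range(s, 501, step)) for s in range(1, step + 1)]
print(len(possible_samples), "possible samples")
```

Each student belongs to exactly one of the 10 possible samples, which is why the inclusion chance is 1/10 for everyone even though the design is not a simple random sample.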
Summary: What have we learned?


A representative sample can offer us important
insights about populations.
– It’s the size of the sample, not its fraction of the
larger population, that determines the precision of
the statistics it yields.
There are several ways to draw samples, all based on
the power of randomness to make them representative
of the population of interest:
– Simple Random Sample, Stratified Sample, Cluster
Sample, Systematic Sample, Multistage Sample
Summary: What have we learned?
(cont.)

Bias can destroy our ability to gain insights
from our sample:
– Nonresponse bias can arise when sampled
individuals will not or cannot respond.
– Response bias arises when respondents’
answers might be affected by external
influences, such as question wording or
interviewer behavior.
Summary: What have we learned?
(cont.)

Bias can also arise from poor sampling methods:
– Voluntary response samples are almost always biased
and should be avoided and distrusted.
– Convenience samples are likely to be flawed for
similar reasons.
– Even with a reasonable design, nonrepresentative
sampling frames create bias.
– Undercoverage occurs when individuals from a subgroup
of the population are selected less often than they should
be.
End of Lecture Unit 3