Chapter 12: Sampling Design


Chapter 4-1
Samples
AP Statistics
Studies Show…

What is the point of a study…?
 We study things in order to gather information about something.
 Most of the time, it is too difficult to get information about everyone or everything.
 So we look at pieces of a group to get a better understanding of the whole.
2
Population and Sample


The entire group of individuals that we want
information about is called the population.
A sample is a part of the population that we
actually examine in order to gather information.
Samples are useful for several reasons:
 They help us to make inferences about the population as a whole.
 We usually can’t afford to (and usually don’t want to) talk to everyone.
 Although different samples may give us different results, results from well-chosen samples are reasonable estimates of the population as a whole.
3
Populations and Parameters




Models use mathematics to represent reality.
 Parameters are the key numbers in those models.
Parameters tell us information about the population.
We use data to estimate population parameters.
 Any summary computed from a sample is a statistic.
A good way to remember: Statistics come from Samples and Parameters come from Populations.
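
A small Python sketch of the distinction, assuming a made-up population of 1,000 values (the data and the names `population`, `parameter`, and `statistic` are illustrative, not from the slides). The parameter is computed from everyone; the statistic is computed from a sample only.

```python
import random
import statistics

rng = random.Random(12)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 made-up measurements.
population = [rng.randint(1, 10) for _ in range(1000)]

# Parameter: a summary of the WHOLE population (usually unknown in practice).
parameter = statistics.mean(population)

# Statistic: the same summary computed from a sample of 25 individuals.
sample = rng.sample(population, 25)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two numbers are usually close but not equal; the statistic changes from sample to sample while the parameter stays fixed.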
4
Notation

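We typically use Greek letters to denote parameters and Latin letters to denote statistics.
 For example, μ and σ are the population mean and standard deviation, while x̄ and s are the corresponding sample statistics.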
5
Defining the “Who”

You must specify the sampling frame.
 The sampling frame is a list of possible participants in the study
 For example, if you wanted to study the opinions of next year’s Freshman class, you would need a list of possible incoming freshmen (this list is the sampling frame)
 Usually, the sampling frame does not exactly match the group you really want to know about.
 Your list of Freshmen may not be complete, some students on the list may go somewhere else and other students may transfer in.
6
Sampling vs. a Census
Sampling involves studying a part of the
population in order to gain information
about the whole.
 A census attempts to contact every
individual in the entire population.

7
How NOT To Sample (Poor Sampling)


Recall that bias is when we systematically favor certain outcomes.
There are three main types of sampling bias:
1. Selection Bias (aka undercoverage bias) – This is introduced when some part of the population is systematically under-represented in the sample.
2. Response Bias – This occurs when the method of collecting the data distorts the responses, so that they tend to differ systematically from the true values in the population.
3. Non-Response Bias – This happens when responses are not actually obtained from subjects chosen for the sample.
8
How NOT To Sample (Poor Sampling)

Selection Bias
 There are several ways that selection bias can occur.
 Ex: a survey that misses those without phones, the homeless, etc.
 Ex: Gettysburg Address – small words were excluded
 Selection Bias also occurs when we have volunteers
 Ex: Call-in opinion polls, magazine surveys, etc.
 Called Voluntary Response Samples
 The problem with volunteers is that they tend to have strong opinions, especially strong negative opinions.
 Ex: “Call in your opinion about Cal-Trans”
 An online study surveys its respondents and asks, “Do you think that online surveys are easy to complete?”
 How do you think that the volunteer will answer?
 Do you think it will represent the whole population?

9
How NOT To Sample (Poor Sampling)

Selection Bias
 Another type of selection bias is convenience sampling
 Ex: Mall intercept interviews
 Convenience sampling may not get you access to all the different groups of people in the population.
 Interviewers often avoid people who may make them feel uncomfortable.
 These types of samples are usually biased; they often do not represent the entire population.
10
How NOT To Sample (Poor Sampling)

Selection Bias
 Usually, selection bias is a form of undercoverage, which occurs when some groups in the population are left out of the process of choosing the sample or have a smaller representation in the sample than they have in the population.

Ex: A researcher conducted a survey by calling mothers at 11
am and asked, “Does daycare for young children negatively
affect their developmental growth?” The response was an
overwhelming yes. Why?
11
How NOT To Sample (Poor Sampling)

Response Bias
 Sometimes, this occurs when respondents try to answer questions in the way they think the questioner wants them to answer rather than according to their true beliefs.
 “It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard wastes are estimated to account for about 21% of the trash in landfills. Given this, in your opinion, would it be fair to ban disposable diapers?”
 “Given the fact that those who understand Statistics are much more attractive and intelligent than those who don’t, don’t you think that it’s important to take a course in Statistics?”

 Always consider biased question wording and interviewer bias.
12
How NOT To Sample (Poor Sampling)

Response Bias
 Ex: Ann Landers asked her readers who were parents, “If you had it to do over again, would you have children?”
 The overwhelming majority – 70% of the more than 100,000 people who wrote in – said no, kids weren’t worth it!
 Really? Do you think that 70% of the population actually
dislikes their children?
 A more carefully designed study showed that about 90% of
parents are actually happy with their decision to have
children. What accounts for the striking differences in these
two results? What kind of parents do you think are most
likely to respond to the original question?


This is called Voluntary Response Bias.
13
How NOT To Sample (Poor Sampling)

Response Bias
 Ex: In 1976 researcher Shere Hite shocked conservative America with her famous “Hite Report” on the permissive sexual attitudes of American men and women. Roughly a decade later, Hite was surrounded by controversy again with her book, Women and Love: A Cultural Revolution in Progress (Knopf Press, 1988).
14
How NOT To Sample (Poor Sampling)

Response Bias
 In this new Hite report, she reveals some startling statistics describing how women feel about contemporary relationships:
 84% of women are not emotionally satisfied with their relationship
 95% of women report “emotional and psychological harassment” from their men
 70% of women married 5 years or more are having extramarital affairs
 Only 13% of women married more than 2 years are “in love”

15
How NOT To Sample (Poor Sampling)

Response Bias
 Looks good from the outside…
Hite conducted the survey by mailing out 100,000
questionnaires to women across the country over a 7 year
period.
 She sent questionnaires to a wide variety of organizations and
asked them to circulate the questionnaires to their members.
 She mentions that they included church groups, women’s
voting and political groups, women’s rights organizations and
counseling and walk-in centers for women.
 Hite also relied on respondents who wrote in for copies of the
questionnaire (apparently, readers of her past books and those
who saw interviews on television and in the press).

16
How NOT To Sample (Poor Sampling)

Response Bias
 Unfortunately, the problem is in the details…
Each questionnaire consisted of 127 open-ended questions,
many with numerous sub-questions and follow-up. Hite’s
instructions read: “It is not necessary to answer every question.
Feel free to skip around and answer those questions you
choose.” Approximately 4,500 completed questionnaires were
returned for a response rate of 4.5% and they form the data set
from which all conclusions were determined.
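
The response rate quoted above is simple arithmetic. As a quick Python check using only the two counts given on the slide (the variable names are illustrative):

```python
mailed = 100_000    # questionnaires sent out (from the slide)
returned = 4_500    # approximate number completed and returned

response_rate = returned / mailed
non_response_rate = 1 - response_rate

print(f"response rate:     {response_rate:.1%}")
print(f"non-response rate: {non_response_rate:.1%}")
```

Put the other way around, the great majority of the women who were sent a questionnaire are not represented in the data at all.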
17
How NOT To Sample (Poor Sampling)

Response Bias
 Hite claims that the survey results imply that vast numbers of women are “suffering a lot of pain in their love relationships with men.”
What is the population of interest to Shere Hite?
 What inferences did Hite make about the population?
 What are the problems with Hite’s inferences?

18
How NOT To Sample (Poor Sampling)

Non-Response Bias
 This occurs when individuals chosen for the sample can’t be contacted or do not cooperate with the survey.

 Ex: To better understand students at VCHS, a study left surveys on a table by the school entrance… very few surveys were returned.
 Very few surveys, if any, have a 100% response rate, but every effort should be made to make this rate as high as possible.
 Personal interviews have a better response rate, but are more costly than other methods (telephone, mail, etc.)
 One should always follow up with subjects if they don’t respond the first time
19
How to Capture a Good Sample

Getting a portion of the population is not difficult.
 Anyone can pick ten people from the crowd
However, getting a good sample is difficult.
 Creating a plan to get the sample is called sampling design.
 If you wanted to select 100 students who represent the school, how would you do it? The process that you use to get your sample is the Sampling Design.
20
Gathering a Good Sample

Suppose you have been enlisted to answer the following question:

“How readable is the Gettysburg Address?”
How would you go about trying to answer this question?
21
Gathering a Good Sample

A simple method would be to measure the average word length.
 Find 5 words that YOU think are “representative” of the speech
 Find the average word length
 Post your results on the board
 Draw a dotplot of all of the average word lengths

Do our results seem reasonable?
22
Gathering a Good Sample

A simple method would be to measure the average word length.
 Now let’s choose the words another way
 Rather than choosing the words yourself, let’s choose 5 words using a Random Number Generator – in this case, your calculator
 Number the words of the speech from 1 to 271, then randomly choose 5 distinct (no repeats) numbers from 1 to 271
 Find the average, post it, and graph it
 The true average is 4.287; which method was more accurate? Why?
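
For anyone without a calculator handy, the random selection could be sketched in Python as follows. The count of 271 words comes from the slide; the helper names `draw_labels` and `mean_length` and the short stand-in word list are assumptions made for illustration (the full numbered speech is not reproduced here).

```python
import random
import statistics

N_WORDS = 271      # number of words in the speech, as given on the slide
SAMPLE_SIZE = 5

def draw_labels(rng, n_words=N_WORDS, k=SAMPLE_SIZE):
    """Return k distinct word positions between 1 and n_words, inclusive."""
    return sorted(rng.sample(range(1, n_words + 1), k))

def mean_length(words, labels):
    """Average length of the words at the given 1-based positions."""
    return statistics.mean(len(words[i - 1]) for i in labels)

rng = random.Random()
labels = draw_labels(rng)
print("word positions to look up:", labels)

# Stand-in text: only the opening words, to show how the average is computed.
opening = "Four score and seven years ago our fathers brought forth".split()
demo_labels = draw_labels(rng, n_words=len(opening), k=SAMPLE_SIZE)
print("average length in the demo:", mean_length(opening, demo_labels))
```

Because `random.sample` never returns the same position twice, the "no repeats" requirement is met automatically.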
23
Bias

Sampling methods that, by their nature, tend to over- or under-emphasize some characteristics of the population are said to be biased.
 Bias is the bane of sampling: the one thing above all to avoid.
 There is usually NO way to fix a biased sample and no way to salvage useful information from it.

The best way to avoid bias is to select individuals for the sample at random.
 The value of deliberately introducing randomness is one of the great insights of Statistics.

The design of a study is biased if it systematically
favors certain outcomes.
24
Bias

When we chose words from the Gettysburg Address, our eyes were naturally drawn to larger words and our samples ended up being biased
 Even if we thought we were doing a good job of “randomly” picking words

Bias is a major problem in conducting a study, and the best way to eliminate as much bias as possible is to use chance!
25
Variability

The variability of an estimate refers to the
range of values that the estimate can take
in repeated sampling.
 When there is a great deal of variability, it’s difficult to be precise about our estimate

As long as the sample is a small fraction of the population, the size of the population has NO effect on the variability of an estimate. What matters is the sample size: larger samples give less variable estimates.
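
This claim can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: two hypothetical populations of different sizes drawn from the same range of values, and 500 repeated samples per setting. The function name `spread_of_sample_means` is not from the slides.

```python
import random
import statistics

def spread_of_sample_means(population, n, reps, rng):
    """Standard deviation of the sample mean over `reps` repeated SRSs of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Two hypothetical populations with the same kind of values but different sizes.
small_pop = [rng.uniform(0, 100) for _ in range(10_000)]
large_pop = [rng.uniform(0, 100) for _ in range(200_000)]

results = {}
for name, pop in (("N=10,000", small_pop), ("N=200,000", large_pop)):
    for n in (25, 400):
        results[(name, n)] = spread_of_sample_means(pop, n, 500, rng)
        print(f"{name:>10}  n={n:>3}  spread of sample means = {results[(name, n)]:.2f}")
```

In runs like this, the spread is about the same for both populations at a given sample size, and it shrinks noticeably when the sample size grows from 25 to 400.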
26
How to Sample




The best way to sample is to use a “simple
random sample”
A simple random sample (SRS) of size n
consists of n individuals from the population
chosen in such a way that every set of n
individuals has an equal chance to be the
sample actually selected.
This means not only that each individual has an equal chance of being selected, but also that each group of n people has an equal chance of being selected.
Basically, each person and each combination of n people has an equal chance of being selected.
27
How to Sample

Before we continue, let’s examine the idea of an
SRS a little more:
Consider, for example, a school that has an equal number of
boys and girls. We want to take a sample so we decide to
either randomly choose 100 boys or 100 girls (possibly by
flipping a coin). If it comes up heads, select 100 random
girls; if it comes up tails, select 100 random boys. Is this an
SRS?
 No. Although every person has an equal chance of being selected, every sample is of only a single sex, so not every possible group of 100 students has a chance of being selected – hardly representative of the population.
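
To see the difference exactly, the same situation can be scaled down to a hypothetical school of 4 boys and 4 girls with samples of size 2 (the names and sizes are made up for illustration). The Python sketch below lists every possible sample under each design and its probability.

```python
from fractions import Fraction
from itertools import combinations

boys = ["B1", "B2", "B3", "B4"]
girls = ["G1", "G2", "G3", "G4"]
school = boys + girls
n = 2

# Design A: a true SRS. Every group of n students is equally likely.
all_samples = [frozenset(s) for s in combinations(school, n)]
srs = {s: Fraction(1, len(all_samples)) for s in all_samples}

# Design B: flip a coin, then take n random boys OR n random girls.
coin = {}
for group in (boys, girls):
    same_sex_samples = [frozenset(s) for s in combinations(group, n)]
    for s in same_sex_samples:
        coin[s] = Fraction(1, 2) * Fraction(1, len(same_sex_samples))

def inclusion_probability(design, person):
    """Chance that `person` ends up in the sample under the given design."""
    return sum(p for s, p in design.items() if person in s)

mixed = frozenset({"B1", "G1"})
print("P(B1 selected):  SRS", inclusion_probability(srs, "B1"),
      " coin flip", inclusion_probability(coin, "B1"))
print("P(sample is {B1, G1}):  SRS", srs.get(mixed, 0),
      " coin flip", coin.get(mixed, 0))
```

Each student has the same chance of selection under both designs, but a mixed group can never be chosen under the coin-flip design, so it is not an SRS.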

28
How to Sample

Before we continue, let’s examine the idea of an
SRS a little more:
In order for our sample to be better, we insist that every
possible sample of the desired size we plan to draw has an
equal chance to be selected. This ensures that we are likely to
obtain a representative sample and, at the same time, still
guarantee that each person has an equal chance of being
selected. If by chance, we obtain a sample with only girls,
does that mean our sample is biased?
 No. As long as we randomly selected a group where every person and every group had an equal chance of being chosen, our sampling method is not considered biased. Although this sample may be very different from other good samples, the sampling variability that occurs by chance is OK.

29
How to Create a SRS

Choose an SRS in two steps:
 Step 1: Label
Assign a numerical label to every individual in the population.
 Step 2: Randomize
Use a random number table (Table B) or a random number generator (randInt on the TI-83) to select labels.

Each person and group of people will have an equal chance of being selected.
 Let’s give it a try. From this class select a sample of 8 students.
 First, give each person in the class a number
 Second, select 8 people (Use line 131 of the table; then use the calculator)
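
The two steps can also be sketched in Python. This is a minimal illustration, assuming a hypothetical class of 30 students; the function name `simple_random_sample` and the roster are made up.

```python
import random

def simple_random_sample(individuals, n, rng=None):
    """Label everyone 1..N, then pick n distinct labels at random."""
    if rng is None:
        rng = random.Random()
    labels = range(1, len(individuals) + 1)      # Step 1: Label
    chosen = sorted(rng.sample(labels, n))       # Step 2: Randomize
    return [individuals[label - 1] for label in chosen]

roster = [f"Student {i}" for i in range(1, 31)]  # hypothetical class of 30
sample = simple_random_sample(roster, 8)
print(sample)
```

`random.sample` draws without replacement, so no student can be picked twice and every group of 8 is equally likely.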

30
Using the List of Random Numbers
31
Three Ideas to Good Sampling Design

Idea 1: Examine Part of the Whole
 Gathering a sample is much simpler than looking at the whole group. A good sample will allow us to make inferences about the population.

Idea 2: Randomize
 The key to every good sample is randomization!

Idea 3: Consider the Sample Size
 A sample that is too small may not represent the population.
32
Idea 1: Examine Part of the Whole

Opinion polls are examples of sample surveys,
designed to ask questions of a small group of
people in the hope of learning something about
the entire population.
 Professional pollsters work quite hard to
ensure that the sample they take is
representative of the population.
 If not, the sample can give misleading
information about the population.
33
Idea 2: Randomize

Randomization is one of the most
important aspects of designing a study
 Randomization can protect you against factors that you know are in the data.
 Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.
 Randomizing makes sure that on the average the sample looks like the rest of the population.
34
Idea 3: Consider the Sample Size


How large a random sample do we need for the
sample to be reasonably representative of the
population?
It’s the size of the sample, not the size of the
population, that makes the difference in
sampling.
 Exception: If the population is small enough that the sample is more than 10% of it, the population size can matter.
35
Stratified Random Sample
Simple random sampling (SRS) is not the
only fair way to sample.
 More complicated designs may save time
or money or help avoid sampling
problems.
 All statistical sampling designs have in
common the idea that chance, rather than
human choice, is used to select the sample.

36
Stratified Random Sample

To select a stratified random sample, first divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample. This type of sampling is more complicated and is often not appropriate for our studies. However, there are times when it is very useful.

For example, let’s say a certain school wishes to determine the amount of funding it should dedicate to its female intramural program. The campus is made up of 67% females and 33% males. If we take an SRS, we may end up with 80% females and 20% males or 40% females and 60% males – the variability would be very large. To reduce the variability, we can force our sample to be representative of the gender balance on campus: we take a separate SRS of females and of males, sized so that the sample is 67% females and 33% males.
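
A minimal Python sketch of this idea, assuming a hypothetical campus of 1,000 students (670 females, 330 males) and a total sample of 100. The function name `stratified_sample` and the student labels are made up; the rounding used to size each stratum is a simplification that happens to work out exactly for these numbers.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Take a separate SRS in each stratum, sized in proportion to the stratum."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; rounding may need adjusting for other sizes.
        share = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, share)
    return sample

campus = {
    "female": [f"F{i}" for i in range(670)],
    "male": [f"M{i}" for i in range(330)],
}
sample = stratified_sample(campus, total_n=100, rng=random.Random())
print({name: len(chosen) for name, chosen in sample.items()})
```

Who gets chosen within each stratum is still left to chance; only the gender split is fixed in advance.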
37
Multistage Sampling Design


Sampling Schemes that combine several
methods are called multistage samples.
For example, one could use the following:
 Stratify the group into subgroups (ex: break the group into subgroups by ethnicity)
 Use an SRS to gather participants from each subgroup (ex: gather 500 people from each ethnicity)
 Use systematic sampling to gather a smaller subset (ex: pick every 10th person in the list of possible participants)
 and so on until you get down to a specified sample size.
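
The three stages listed above could be chained together as in the following Python sketch. Everything concrete here is an assumption for illustration: a made-up population of 6,000 people in three subgroups labeled A, B, and C, the field name `group`, and the function name `multistage_sample`.

```python
import random

def multistage_sample(people, per_group, step, rng):
    # Stage 1: stratify by the (hypothetical) "group" field.
    strata = {}
    for person in people:
        strata.setdefault(person["group"], []).append(person)

    # Stage 2: take an SRS of `per_group` people within each stratum.
    pooled = []
    for members in strata.values():
        pooled.extend(rng.sample(members, per_group))

    # Stage 3: systematic sample, every `step`-th person from a random start.
    start = rng.randrange(step)
    return pooled[start::step]

rng = random.Random()
people = [{"id": i, "group": "ABC"[i % 3]} for i in range(6000)]
final = multistage_sample(people, per_group=500, step=10, rng=rng)
print(len(final), "people in the final sample")
```

With 500 people from each of 3 subgroups and every 10th person kept, the final sample has 150 people.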
38
Cluster Sampling

Splitting the population into similar parts, or clusters, can make sampling more practical. We then select one or a few clusters at random and perform a census within each of them. This process is called cluster sampling.
39
Stratified Random Samples vs. Cluster Samples

Stratified random samples and cluster samples
sound similar but there is a difference.
 A good way to distinguish between the two is to look at an analogy.
 Imagine a 7-layer dip. If you wanted to test and see if the dip was good, you could taste-test it several different ways.
 If you took a chip and dipped it straight down into the dip (making sure you got all 7 layers) and taste-tested it, that would be like clustering: each scoop is a small copy of the whole dip.
 If you took a chip and taste-tested every layer within the dip separately, that would be like stratification. Each layer in the dip would be considered a stratum.

40
2010 AP Question (# 3)

An apartment building has nine floors and each floor has
four apartments. The building owner wants to install
new carpeting in eight apartments to see how well it
wears before she decides whether to replace the carpet in
the entire building. The figure below shows the floors of
the apartments in the building with the apartment
numbers. Only the nine apartments with an asterisk have
children in the apartment.
41
2010 AP Question (# 3)

For convenience, the apartment building owner wants to
use a cluster sampling method, in which the floors are
clusters, to select eight apartments. Describe a process
for randomly selecting eight different apartments using
this method.
Many students had a difficult time answering this question.
Many didn’t understand clustering while others didn’t know
how to randomly select the floors.
Solution: Use digits 1 – 9. Let 1 represent the 1st floor, 2
represent the 2nd floor, and so on. Using 9 slips of paper, number
each one from 1 to 9. Put all slips of paper in a hat, mix the
numbers, and randomly select two pieces of paper representing
the two floors. Carpet ALL four apartments on BOTH floors for
a total of eight apartments.
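
The "slips of paper in a hat" process can be mimicked in Python. This is a sketch only: the real apartment numbers are in the figure, which is not reproduced here, so apartments are labeled as hypothetical (floor, unit) pairs.

```python
import random

FLOORS = range(1, 10)          # nine floors, labeled 1 to 9
APARTMENTS_PER_FLOOR = 4

def cluster_sample(rng, floors_to_pick=2):
    """Pick whole floors at random, then take EVERY apartment on those floors."""
    chosen_floors = sorted(rng.sample(FLOORS, floors_to_pick))
    return [(floor, unit)
            for floor in chosen_floors
            for unit in range(1, APARTMENTS_PER_FLOOR + 1)]

apartments = cluster_sample(random.Random())
print(apartments)
```

The randomness is applied to the floors (the clusters), not to individual apartments; once a floor is chosen, all four of its apartments are carpeted.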


42
2010 AP Question (# 3)

An alternative sampling method would be to select a
stratified random sample of eight apartments, where the
strata are apartments with children and apartments with
no children. A stratified random sample of size eight
might include two apartments with children and six
without. In context of this situation, give one statistical
advantage of selecting such a stratified sample as
opposed to a cluster sample of eight apartments using
the floors as clusters.

Solution: When using the cluster sampling method, it is possible that NO apartments with kids would be selected; for example, if both selected floors came from floors 3, 4, and 6 (such as the 3rd and 4th floors), none of the apartments would have children. This would be a bad situation since children can have an effect on carpet wear. A stratified sampling method would be better since it would guarantee that apartments both with and without children are included, giving a better representation of the carpet’s durability.
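
How likely is that bad outcome? A rough Python calculation is sketched below, under the assumption (read from the solution, since the figure is not shown) that floors 3, 4, and 6 are the only floors with no children.

```python
from fractions import Fraction
from math import comb

total_floors = 9
floors_without_children = 3    # ASSUMPTION: only floors 3, 4, and 6
floors_chosen = 2

# Number of childless floor pairs divided by the number of all floor pairs.
p_no_children = Fraction(comb(floors_without_children, floors_chosen),
                         comb(total_floors, floors_chosen))
print("P(cluster sample contains no apartments with children) =", p_no_children)
```

Under that assumption the chance is 3 out of 36 possible pairs of floors, or 1/12; with the stratified design it is zero.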
43
Disadvantages to Stratifying



We need a sampling frame which includes the
entire population as well as characteristics
about each member to use when stratifying.
This could be difficult when the population is
large.
Also, the statistical analysis is more difficult
with a stratified random sample.
Note: The reason why we stratify is to get a
representative sample and reduce the
variability that is possible in a SRS. The
purpose is NOT to compare the results between
strata, although this is a secondary benefit.
Systematic Sampling

Often, using a systematic method will help us draw a proper sample.
 For example, you might survey every 10th person on an alphabetical list of students. To make it random, you still must start the systematic selection from a randomly selected individual. When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample.
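
A short Python sketch of systematic sampling with a random start, assuming a hypothetical alphabetical list of 95 students (the names and the function `systematic_sample` are made up for illustration):

```python
import random

def systematic_sample(ordered_list, step, rng):
    """Take every `step`-th item, beginning at a randomly chosen position."""
    start = rng.randrange(step)          # random start among the first `step` items
    return ordered_list[start::step]

students = sorted(f"Student {i:03d}" for i in range(1, 96))  # hypothetical list
chosen = systematic_sample(students, 10, random.Random())
print(chosen)
```

Only the starting point is random; after that, the selection follows the list order, which is why the order of the list must be unrelated to the responses.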
45
Yum, Good Soup


Good sampling technique always uses random
selection to reduce the possibility of bias.
Let’s consider the following analogy:
 Let’s say you have a pot of soup. How much soup do you need to “taste” before you get an idea of what the soup tastes like? Do you need to eat the whole pot?

Obviously not. This is the idea of sampling. A “taste” is all we need to get an understanding of the big pot.
 Now let’s say you put some salt in the soup. What happens if you “taste” it before stirring in the salt? It will either be very salty or taste the same as before.

Stirring the pot is analogous to randomizing. Randomization mixes up those who will participate in the study, giving us a better “taste” of the whole.
46
Exit Slip: The River Problem:

Suppose we wanted to estimate the yield
of a corn field. The field is square and
divided into 16 equally sized plots (4 rows
x 4 columns). A river runs along the
eastern edge of the field. We want to take
a sample of 4 plots.
1. Randomly choose four of the plots (SRS)
2. Stratify the plots into rows and choose one plot from each stratum
3. Now, stratify the plots into columns and choose one plot from each stratum
4. Use a cluster sample to choose four plots. How should we determine a cluster?

The plots are numbered 1 to 16, row by row, with the river along the right (eastern) edge:

  1    2    3    4
  5    6    7    8
  9   10   11   12
 13   14   15   16

Which do you think will give us the most accurate estimate?
Here are the actual yields, shown in the same positions as the plot numbers:

  4   29   94  150
  7   31   98  153
  6   27   92  148
  5   32   97  147

Calculate the average yield using each method


The true average is 70
Which one worked the best?
What happened?

When we stratified by columns, the variability of
our estimate was greatly reduced.

Why does this work?
 With an SRS, it is possible that I randomly choose 4 plots near the river (giving an estimate that is way too high) or that I choose 4 plots far from the river (giving an estimate that is way too low). However, when I use each column as a stratum, I am guaranteed to get one plot close to the river (high yield), one plot far from the river (low yield), etc. This guarantees that we will have a representative sample.
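
With only 16 plots, every possible sample can be listed, so the three designs can be compared exactly rather than by trial and error. The Python sketch below assumes the yields as read from the exit-slip grid (rows top to bottom, columns west to east, river on the east); the function names are illustrative.

```python
from itertools import combinations, product
from statistics import fmean, pstdev

# Yields as read from the grid: one inner list per row, river on the right.
yields = [
    [4, 29, 94, 150],
    [7, 31, 98, 153],
    [6, 27, 92, 148],
    [5, 32, 97, 147],
]
rows = yields
columns = [list(col) for col in zip(*yields)]
all_plots = [y for row in yields for y in row]

def estimates_srs(plots, n=4):
    """Sample mean for every possible SRS of n plots."""
    return [fmean(s) for s in combinations(plots, n)]

def estimates_stratified(strata):
    """Sample mean for every way of taking one plot from each stratum."""
    return [fmean(s) for s in product(*strata)]

designs = {
    "SRS": estimates_srs(all_plots),
    "Stratified by row": estimates_stratified(rows),
    "Stratified by column": estimates_stratified(columns),
}

print(f"true average = {fmean(all_plots):.2f}")
for name, est in designs.items():
    print(f"{name:22s} min={min(est):7.2f}  max={max(est):7.2f}  spread={pstdev(est):6.2f}")
```

All three designs are centered on the true average, but the estimates from stratifying by column stay within a few units of it, while the other two designs can miss badly.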
When should we stratify?


If you think there are groups within the
population who may be different with regard to
the question of interest, you should take an
appropriately sized simple random sample from
each group.
In our example, we should anticipate that the
river will have an effect on the yield of the plots.
Thus, since the plots near the river are similar to
each other (but different than the rest of the
plots) stratifying by columns is a good method.
Examples

Population: United States adults
 question of interest: affirmative action
 possible strata: race, gender
 non-effective strata: height

Population: SDHS
 question of interest: AP program
 possible strata: GPA, grade
 non-effective strata: shoe-size, English period
 Let’s try this one more time, but this time, let’s make it more realistic.
 A farmer has a large field which will be used for a particular crop of wheat.
 The farmer has had trouble with this field and wants to determine which parts of the field he should use.
 He decided to choose 10 plots to determine the best place to plant his crop. Use the following methods to determine the best place to plant his crop:
1. Convenience Sample
2. SRS
3. Stratified Sample by Row
4. Stratified Sample by Column
5. Multi-Stage using the picture above