Collecting data - University of Toronto

Producing Data - Introduction
• Statistics is a tool that helps data produce knowledge rather than
confusion. As such, it must be concerned with producing data
as well as interpreting already available data.
• Exploratory data analysis helps reveal information in data.
However, alone it can rarely provide convincing evidence for
its conclusions.
• We may also use data to provide clear answers to specific
questions, such as: what is the average lifetime of humans?
• This lecture is devoted to developing the skills needed to
produce trustworthy data and to judge the quality of data
produced by others.
• The techniques for producing data are among the most
important ideas in statistics; they are the basis for formal
statistical inference.
week5
1
Collecting data
• Available data are data that were produced in the past for
some other purpose but that may help answer a present question.
• Statistical designs for producing data rely on either sampling or
experiments.
• A sample survey collects information about a population by
selecting and measuring a sample from the population.
• Example: The General Social Survey (GSS) interviews about 3000 adult
residents of the US every second year. That is, the GSS selects a sample of
adults to represent the larger population of all adults living in the US.
• A census is an attempt to contact every individual in the population.
Observation versus Experiment
• An observational study observes individuals and measures
variables of interest but does not attempt to influence the
response.
• An experiment imposes a treatment on individuals in order to
observe their response.
• An observational study, even one based on a statistical sample,
is a poor way to study the effect of a treatment. To see the
effect of a treatment we must actually impose the treatment.
• When our goal is to understand cause and effect,
experiments are the only source of fully convincing data.
Design of experiments
• The individuals on which the experiment is done are the
experimental units.
• A specific experimental condition applied to the units is called
a treatment.
• A placebo is a dummy treatment. The response to a dummy
treatment is the placebo effect.
• The explanatory variables in an experiment are called factors.
• The values of a factor are called levels.
• Many experiments study the joint effect of several factors. In
such an experiment, each treatment is formed by combining a
specific value of each of the factors.
• In principle, experiments can give good evidence of causation.
Example
We want to study the effects of aspirin and beta carotene on
heart attacks and cancer.
Factors: Aspirin (levels: yes, no), Beta carotene (levels: yes, no).
Response variables: occurrence of heart attacks and cancer.
Treatments are the factor-level combinations (4 treatments).
The example above is a factorial (two factor) experiment.
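The four treatments can be listed mechanically by combining one level of each factor. The following is a minimal sketch; the variable names are illustrative and not part of the lecture.

```python
from itertools import product

# Factors and their levels, taken from the example above.
factors = {
    "aspirin": ["yes", "no"],
    "beta_carotene": ["yes", "no"],
}

# Each treatment combines one level of every factor.
treatments = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]

for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)
```

With two factors at two levels each, the list has 2 x 2 = 4 entries, matching the count on the slide.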
week5
5
Bias
• The design of a study is biased if it systematically favors
certain outcomes.
• An uncontrolled study of a new medical therapy, for example,
is biased in favor of finding the treatment effective because of
the placebo effect.
• The group of patients who receive a dummy treatment is
called a control group, because it enables us to control the
effects of outside variables on the outcome.
• Control is the first basic principle of statistical design of
experiments. Comparison of several treatments in the same
environment is the simplest form of control.
• Example 3.9 page 180 in IPS.
Randomization
• The design of an experiment first describes the response
variable or variables, factors (explanatory variables), and the
layout of the treatments, with comparison as the leading
principle.
• The second aspect of design is the rule used to assign
experimental units to the treatments. Comparison of the effects
of treatments is valid only when all treatments are applied to
similar groups of experimental units.
• Systematic differences among the groups of experimental units
in a comparative experiment cause bias.
• The use of chance to divide experimental units into groups is
called randomization.
• Randomization can be done by the hat method, random
number tables, or software.
Example
• A food company assesses the nutritional quality of a new
“instant breakfast” product by feeding it to newly weaned
male white rats and measuring their weight gain over a 28-day
period. A control group of rats receives a standard diet for
comparison. This experiment has a single factor (diet) with
two levels. 30 rats were used for this experiment.
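• The outline of the design: random assignment splits the 30 rats
into two groups of 15; one group receives the new product, the
other receives the standard diet, and the weight gains of the
two groups are compared.
• This design combines comparison and
randomization to arrive at the simplest randomized
comparative design.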
week5
8
Principles of experimental design
• Control the effects of lurking variables on the response, simply
by comparing two or more treatments.
• Randomize - use impersonal chance to assign experimental
units to treatments.
• Repeat each treatment on many units to reduce chance
variation in the results.
Statistical Significance
• An observed effect so large that it would rarely occur by
chance is called statistically significant.
How to randomize
• The idea of randomization is to assign subjects to treatments
by drawing names from a hat. In practice, experimenters use
software to carry out randomization. We can randomize
without software by using a table of random digits.
• A table of random digits is a list of the digits 0, 1, 2, 3, 4, 5,
6, 7, 8, 9 that has the following properties:
- The digit in any position in the list has the same chance of
being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
- The digits in different positions are independent in the
sense that the value of one has no influence on the value of
any other.
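Randomization by software can be sketched as follows. This is a minimal illustration using Python's standard random module; the function name, the group size of 15 and the seed are assumptions chosen for the example, not something prescribed by the lecture.

```python
import random

def randomize_two_groups(labels, group_size, seed=None):
    """Split labels at random into a treatment group and a control group."""
    rng = random.Random(seed)  # the seed only makes the example repeatable
    treatment = set(rng.sample(labels, group_size))
    control = [label for label in labels if label not in treatment]
    return sorted(treatment), control

# Hypothetical labels 01..30 for 30 experimental units.
unit_labels = [f"{i:02d}" for i in range(1, 31)]
experimental, control = randomize_two_groups(unit_labels, 15, seed=2024)
print("experimental:", experimental)
print("control:     ", control)
```

Every unit has the same chance of landing in either group, which is the software counterpart of drawing names from a hat.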
Completely randomized design (CRD)
• When all experimental units are allocated at random among all
treatments, the experimental design is completely randomized.
• Example (rats example on slide 8)
- Label each rat with a numerical label from 01 to 30.
- Start at line 164 in Table B and read two-digit groups.
The first 10 two-digit groups in this line are
11 02 27 91 24 49 52 56 30 78
Groups outside the range 01 to 30 are skipped, so the rats labeled
11, 02, 27, 24, 30 go into the experimental
group. Run your finger across line 164 (and continue to line
165 if needed) until you have chosen 15 rats. They are the
rats labeled
11, 02, 27, 24, 30, 17, 22, 21, 01, 13, 23, 16, 28, 20, 08.
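The table-reading rule can be written as a short procedure: read two-digit groups in order, skip groups that are not valid labels, skip repeats, and stop once enough labels are chosen. The sketch below uses only the ten groups quoted above, so it stops after five labels; the function name and parameters are illustrative.

```python
def choose_from_digits(digits, n_labels, n_choose, width=2):
    """Pick labels by reading fixed-width groups from a string of digits."""
    stream = "".join(ch for ch in digits if ch.isdigit())
    chosen = []
    for start in range(0, len(stream) - width + 1, width):
        label = int(stream[start:start + width])
        # Skip groups that are not labels (00, 31..99) and repeated labels.
        if 1 <= label <= n_labels and label not in chosen:
            chosen.append(label)
        if len(chosen) == n_choose:
            break
    return chosen

first_groups = "11 02 27 91 24 49 52 56 30 78"
print(choose_from_digits(first_groups, n_labels=30, n_choose=15))
# prints [11, 2, 27, 24, 30]
```

Feeding in more digits from the table would continue the selection until 15 labels are chosen.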
Cautions about experimentation
• The study of the effects of aspirin and beta carotene on heart
attacks and cancer (the example on slide 5) was double-blind:
neither the subjects nor the medical personnel who worked
with them knew which treatment any subject had received. The
double-blind method avoids unconscious bias, e.g. from a doctor who
doesn’t think that “just a placebo” can benefit a patient.
• Lack of realism
The subjects or treatment or setting of an experiment may not
realistically duplicate the conditions we really want to study.
• Example 3.16 page 188 in IPS.
Matched pairs designs
• Matched pairs designs compare just two treatments. We choose
blocks of two units that are as closely matched as possible.
Alternatively, each block in a matched pairs design may
consist of just one subject, who gets both treatments one after
the other and serves as his or her own control.
• The idea is that matched subjects are more similar than
unmatched ones, so that comparing responses within a number
of pairs is more efficient than comparing the responses of
groups of randomly assigned subjects.
• Randomization remains important: chance decides which member of a
matched pair receives the first treatment.
• Example 3.17 page 189.
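The randomization step in a matched pairs design amounts to a separate coin flip for each pair. A minimal sketch, with hypothetical unit names and treatment labels "A" and "B":

```python
import random

def randomize_matched_pairs(pairs, treatments=("A", "B"), seed=None):
    """Within each pair, decide by chance which unit gets which treatment."""
    rng = random.Random(seed)
    assignment = {}
    for first_unit, second_unit in pairs:
        order = list(treatments)
        rng.shuffle(order)  # equivalent to a fair coin flip for two treatments
        assignment[first_unit] = order[0]
        assignment[second_unit] = order[1]
    return assignment

# Hypothetical matched pairs of units.
pairs = [("unit1", "unit2"), ("unit3", "unit4"), ("unit5", "unit6")]
pair_assignment = randomize_matched_pairs(pairs, seed=7)
print(pair_assignment)
```

When each block is a single subject who receives both treatments, the same flip would decide the order of the two treatments instead.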
Block design
• A block is a group of experimental units or subjects that are
known before the experiment to be similar in some way that is
expected to affect the response to the treatments. In a
randomized block design (RBD), the random assignment of
units to treatments is carried out separately within each block.
• Example 3.18 page 190 in IPS
Progress of a type of cancer differs in women and men. We
want to compare 3 therapies.
- gender is a blocking variable
- two randomizations done, one assigning female subjects to
treatments, and the other assigning male subjects.
As described in the following diagram
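The two separate randomizations can be sketched in code. In the sketch below the subject names, the group sizes and the therapy labels are hypothetical; only the structure (one random assignment per block) comes from the example.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Carry out a separate random assignment inside every block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # a fresh shuffle for each block
        for position, unit in enumerate(shuffled):
            # Deal the shuffled units out to the treatments in turn.
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# Hypothetical subjects, blocked by gender.
blocks = {
    "women": ["w1", "w2", "w3", "w4", "w5", "w6"],
    "men": ["m1", "m2", "m3", "m4", "m5", "m6"],
}
therapies = ["therapy 1", "therapy 2", "therapy 3"]
block_assignment = randomize_within_blocks(blocks, therapies, seed=3)
print(block_assignment)
```

Because the shuffle is done block by block, every therapy is tried on the same number of women and the same number of men, so gender cannot be confounded with therapy.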
Sampling design
• A political scientist wants to know what percent of the voting-age
population consider themselves conservatives. He needs to gather
information about a large group of individuals.
• Time, cost and inconvenience forbid contacting every individual.
• We gather information about only part of the group in order to
draw conclusions about the whole population.
• We will not, as in an experiment, impose a treatment in order to
observe the response.
Population and sample
• The entire group of individuals that we want information
about is called the population.
• A sample is a part of the population that we actually examine
in order to gather information.
• Sample design
- The design of a sample refers to the method used to
choose the sample from the population.
- Poor sample design can produce misleading conclusions.
Example
• The ABC network program Nightline asked (in a call-in poll)
whether the UN should continue to have its headquarters in
the United States. More than 186000 callers responded (telephone
companies charge for these calls) and 67% said “No”.
• People who spend time and money to respond to call-in polls
are not representative of the entire adult population. In fact
they tend to be the same people who call radio talk shows.
• People who feel strongly, especially those with strong negative
opinions, are more likely to call.
• It is not surprising that a properly designed sample showed
that 72% of adults wanted the UN to stay.
Voluntary response sample
• A voluntary response sample consists of people who choose
themselves by responding to a general appeal.
• Voluntary response samples are biased because people with
strong opinions, especially negative opinions, are most likely to
respond.
• Random selection of a sample eliminates bias in the choice of
the sample by giving all individuals an equal chance to be chosen.
Simple Random Sample
• A simple random sample (SRS) of size n consists of n
individuals from the population chosen in such a way that
every set of n individuals has an equal chance to be the sample
actually selected.
• How to select an SRS?
Hat method, Random number tables or software.
• Example 3.24 page 200 in IPS.
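The definition of an SRS can be made concrete for a tiny population by listing every possible sample of size n and picking one of them with equal chance. This is an illustrative sketch with a made-up population of five individuals; for real populations the list of possible samples is far too long to write out, and a routine such as random.sample gives the same equal-chance property directly.

```python
import random
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical individuals
n = 2

# Every set of n individuals that could be the sample.
possible_samples = list(combinations(population, n))
print(len(possible_samples), "possible samples")

# An SRS gives each of these sets the same chance of being selected.
rng = random.Random(0)  # seeded only so the example is repeatable
srs = rng.choice(possible_samples)
print("selected:", srs)
```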
Stratified Random Sampling
• To select a stratified random sample, first divide the
population into groups of similar individuals, called strata.
• Then choose a separate SRS in each stratum and combine
these SRSs to form the full sample.
• Example 3.26 page 203 in IPS.
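The two steps (split into strata, then take a separate SRS in each stratum) can be sketched as below. The strata, their members and the per-stratum sample sizes are hypothetical values chosen for illustration.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Take a separate SRS from each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(list(members), sizes[name]))
    return sample

# Hypothetical strata: 60 undergraduates and 20 graduate students.
strata = {
    "undergraduate": [f"u{i}" for i in range(1, 61)],
    "graduate": [f"g{i}" for i in range(1, 21)],
}
sizes = {"undergraduate": 6, "graduate": 2}
stratified = stratified_sample(strata, sizes, seed=11)
print(stratified)
```

Unlike a single SRS of 8 from all 80 students, this design guarantees that both strata are represented in the sample.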
Multistage sampling design - Example
• Data on employment and unemployment are gathered by the government's
Current Population Survey, which conducts interviews in about
55000 households each month.
• It is not practical to maintain a list of all US households from which
to select an SRS, and the cost of sending interviewers to the widely
scattered households in an SRS would be too high. So a
multistage design is used.
• The Current Population Survey sampling design is:
Stage 1. Divide the US into 2007 geographical areas called primary
sampling units (PSUs). Select a sample of 754 PSUs.
Stage 2. Divide each PSU selected into smaller areas called
“blocks”. Stratify blocks using ethnic and other
information and take a stratified sample of the blocks
in each PSU.
Stage 3. Sort the housing units in each block into clusters of 4
nearby units. Interview the households in a random
sample of these clusters.
Systematic random samples - Example
We want to choose 4 addresses from a list of 100.
- Divide the list into 4 smaller lists, each of 100/4 = 25 addresses.
- Choose one of the first 25 at random (using random number
tables) and then choose every 25th address.
- E.g. if 13 is the random number selected, the sample consists
of the addresses numbered 13, 38, 63, 88.
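The same rule in code: the step is the list size divided by the sample size, the start is a random position within the first step, and every later pick is one step further on. The function name and parameters are illustrative.

```python
import random

def systematic_sample(list_size, sample_size, start=None, seed=None):
    """Pick a random start in the first segment, then every step-th item."""
    step = list_size // sample_size
    if start is None:
        # randint includes both endpoints, so the start is in 1..step.
        start = random.Random(seed).randint(1, step)
    return [start + k * step for k in range(sample_size)]

print(systematic_sample(100, 4, start=13))  # prints [13, 38, 63, 88]
```

Note that only 25 different samples are possible here, one per starting position, so a systematic sample is not an SRS even though each address has the same chance of being chosen.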
Cautions about sample surveys
• Undercoverage
Sample surveys require an accurate and complete list of the
population (sampling frame). Because such lists are rarely
available, most samples suffer from some degree of
undercoverage, which occurs when some groups in the
population are left out of the process of choosing the sample.
• Examples:
(i) A sample survey of households will miss homeless people,
prison inmates, students in dormitories.
(ii) An opinion poll conducted by telephone will miss the 6%
of American households without residential phones.
• Nonresponse occurs when an individual chosen for the sample
can’t be contacted or doesn’t cooperate.
Response bias
• The behavior of the respondent or the interviewer can cause
response bias in sample results.
• Respondents may lie, especially if asked about illegal or
unpopular behavior. The sample then underestimates the
occurrences of such behavior in the population.
• Answers to questions that ask the respondent to recall past
events are often inaccurate because of faulty memory.
• Wording of questions
Confusing or leading questions can introduce a strong bias in a
sample survey and even minor changes in wording can change
a survey’s outcome.
Statistical inference - Parameters and statistics
• A parameter is a number that describes the population. It is a
fixed number, but in practice we do not know its value.
• A statistic is a number that describes a sample.
The value of a statistic is known when we have taken a sample,
but it can change from sample to sample.
• We often use a statistic to estimate an unknown parameter.
Sampling distribution
• The sampling distribution of a statistic is the distribution of
values taken by the statistic in all possible samples of the
same size from the same population.
• Example 3.33 page 214 in IPS
We simulate drawing SRSs of size 100 from the population
of all adult US residents. Suppose that in fact 60% of the
population find shopping frustrating. Then the true value of
the parameter we want to estimate is p = 0.6.
The following diagrams describe the sampling distribution of
the statistic p̂ for different sample sizes.
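A sampling distribution of this kind can be approximated by simulation. The sketch below is not the IPS simulation itself: it assumes the population is large enough that each sampled adult independently has probability p = 0.6 of finding shopping frustrating, and the second sample size (1000), the number of repetitions and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def simulate_p_hat(p, n, repetitions, seed=0):
    """Return the sample proportion from many simulated samples of size n."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(repetitions)
    ]

for n in (100, 1000):
    p_hats = simulate_p_hat(p=0.6, n=n, repetitions=500)
    center = statistics.mean(p_hats)
    spread = statistics.pstdev(p_hats)
    print(f"n={n}: mean of p-hat = {center:.3f}, spread = {spread:.3f}")
```

In runs like this the simulated values center close to 0.6 for both sample sizes, while the spread is noticeably smaller for the larger samples.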
Bias and Variability
• A statistic used to estimate a parameter is unbiased if the
mean of its sampling distribution is equal to the true value of
the parameter being estimated.
• The variability of a statistic is described by the spread of its
sampling distribution.
• The spread is determined by the sampling design and the
sample size n. Statistics from larger probability samples have
smaller spreads.
• Managing Bias and Variability.
- To reduce bias, use an SRS.
- To reduce the variability of a statistic from an SRS, use
larger samples.
Question - Final Dec 2001
• Two drugs, A and B, used in the treatment of glaucoma, were tested for
effectiveness on 10 diseased dogs. Drug A was administered to one eye of
each dog and drug B to the other eye. Pressure measurements were taken 1
hour later on both eyeballs of each dog. Which of the following statements
are true?
(a) This is an example of a matched pairs design.
(b) This is an example of a CRD.
(c) This is an example of a RBD.
• Regarding the above study, which of the following is the most important?
(a) We need to randomize the assignment of dogs to drugs.
(b) We need to randomize the assignment of drugs to eyes.
(c) We need to select the dogs randomly from a bigger population.
(d) We need to stratify the dogs before assigning the drugs.
(e) We need to pair the dogs based on some relevant criteria related to the
response.
Question - Dec 2001
• A list of enumeration areas in Ontario is made. From this list we
pick every 10th one after a random start. For the selected
areas, we obtain maps. For each map we number the blocks,
from 1 to N (N = number of blocks in that area). Using a RN
table, we select two distinct numbers between 1 and N and
include the corresponding blocks in our sample. On each
selected block, we start at the northeast corner, and walk
around the block, selecting every 5th household into our
sample (from a random start). The types of sampling methods
used here (in no particular order) are
(a) stratified, SRS, systematic
(b) systematic, multistage, stratified
(c) multistage, SRS, systematic
(d) multistage, SRS, stratified
(e) SRS, systematic
Question - Summer 2001 test-2
a) In order to study various aspects of child abuse, 15 ‘Child Welfare Service
Areas’ (CWSA) are randomly selected, from all those across Canada. From
each selected CWSA, 10% of cases are chosen, by taking every 10th file
from a cabinet.
i) Is this an observational study or an experiment?
ii) Describe the design (in statistical terminology)
b) You want to determine the best colour for attracting cereal leaf beetles to
boards on which they will be trapped. You will compare three colours:
blue, green, yellow. The response variable is the count of beetles trapped.
You will mount one board on each of 9 poles evenly spaced in a square
field, with 3 poles in each row as shown below. You will proceed with a
completely randomized experiment in order to compare the colours.
Randomly assign colours to poles, and mark on the field sketch, the colours
assigned to each pole. Indicate exactly how you assigned the colours to the
poles.
c) In the cigarette smoking and cancer video, there was one study
in which smokers and non-smokers were matched up with respect to 30
different variables, making them ‘as like as possible’ in the
words of the speaker. Cancer rates differed substantially
between the smokers and non-smokers.
i) Is this an observational study or a randomized block
design?
ii) Why does this or why does this not prove smoking causes
cancer?
d) Increasing the sample size is one method for reducing bias.
True or false?
Question - Term test Summer 99
Suppose that we want to select a sample of students from the STA220 class (150
students in total).
a) If we assign each student in the class a number from 001-150, and then use a
RN table to pick 2 distinct RNs from 001-150, and then take the corresponding
students, what do we call this type of sample?
b) If we select every 5th student, after ordering the students in some fashion, what do
we call this type of sampling design?
c) If we select randomly 4 students from the centre section, and then 2 at random
from the section on the left side and finally 2 randomly from the section on the
right side, what type of sampling design is this?
d) If we select randomly 5 rows in the classroom, then 2 students randomly from
each selected row, what do we call this type of sampling design?
Question - Term Test summer 2000
For each of the following studies,
i) Indicate whether it is an observational study or a controlled
experiment.
ii) If an observational study:
(a) Describe precisely the sampling design utilized. Use
appropriate statistical terminology.
(b) Indicate the sources of bias, if any are present.
iii) If an experiment, identify:
(a) the experimental unit(s) and the response variable(s).
(b) the factors, treatments and the number of treatments.
(A) A city has 2000 city-blocks in each of 4 geographical areas
(NE, NW, SE, SW). Five blocks will be selected at random
from each geographical area. For each selected block, 20% of
households will be selected, by having the interviewer walk
around the block, and take every 5th household, starting with
the house at the Northwest corner. When the interviewer
arrives at a household, one of the adults present is randomly
selected to be interviewed.
(B) In order to investigate the effect of repeated exposure to an
advertising message, a number of undergraduate students
viewed a 40-minute TV program that included ads for a
digital camera. Some of the students saw a 30-second
commercial; others saw a 90-second version. The same commercial
was repeated either 1, 3 or 5 times during the program. After
viewing, all of the subjects answered questions about their
recall of the ad, their attitude toward the camera, and their
intention to purchase it.