Transcript Example
Chapter 2
Turning Data Into Information
Copyright ©2006 Brooks/Cole, a division of Thomson Learning, Inc.
2.1 Raw Data
• Raw data are numbers and category labels that have been collected but have not yet been processed in any way.
• When measurements are taken from a subset of a population, they represent sample data.
• When all individuals in a population are measured, the measurements represent population data.
• Descriptive statistics: summary numbers for either a population or a sample.
2.2 Types of Data
• Raw data from categorical variables consist of group or category names that don’t necessarily have a logical ordering. Examples: eye color, country of residence.
• Categorical variables for which the categories have a logical ordering are called ordinal variables. Examples: highest educational degree earned, tee shirt size (S, M, L, XL).
• Raw data from quantitative variables consist of numerical values taken on each individual. Examples: height, number of siblings.
Asking the Right Questions
One Categorical Variable
Question 1a:
How many and what percentage of individuals fall into each category?
Example:
What percentage of college students favor the legalization of marijuana, and what percentage of college students oppose legalization of marijuana?
Question 1b:
Are individuals equally divided across categories, or do the percentages across categories follow some other interesting pattern?
Example:
When individuals are asked to choose a number from 1 to 10, are all numbers equally likely to be chosen?
Asking the Right Questions
Two Categorical Variables
Question 2a:
Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on which category they are in for the other variable?
Example:
In Case Study 1.6, we asked if the risk of having a heart attack was different for the physicians who took aspirin than for those who took a placebo.
Question 2b:
Do some combinations of categories stand out because they provide information that is not found by examining the categories separately?
Example:
The relationship between smoking and lung cancer was detected, in part, because someone noticed that the combination of being a nonsmoker and having lung cancer is unusual.
Asking the Right Questions
One Quantitative Variable
Question 3a:
What are the interesting summary measures, like the average or the range of values, that help us understand the collection of individuals who were measured?
Example:
What is the average handspan measurement, and how much variability is there in handspan measurements?
Question 3b:
Are there individual data values that provide interesting information because they are unique or stand out in some way?
Example:
What is the oldest recorded age of death for a human? Are there many people who have lived nearly that long, or is the oldest recorded age a unique case?
Asking the Right Questions
One Categorical and One Quantitative Variable
Question 4a:
Are the measurements similar across categories?
Example:
Do men and women drive at the same “fastest speeds” on average?
Question 4b:
When the categories have a natural ordering (an ordinal variable), does the measurement variable increase or decrease, on average, in that same order?
Example:
Do high school dropouts, high school graduates, college dropouts, and college graduates have increasingly higher average incomes?
Asking the Right Questions
Two Quantitative Variables
Question 5a:
If the measurement on one variable is high (or low), does the other one also tend to be high (or low)?
Example:
Do taller people also tend to have larger handspans?
Question 5b:
Are there individuals whose combination of data values provides interesting information because that combination is unusual?
Example:
An individual who has a very low IQ score but can perform complicated arithmetic operations very quickly may shed light on how the brain works. Neither the IQ nor the arithmetic ability may stand out as uniquely low or high, but it is the combination that is interesting.
Explanatory and Response Variables
Many questions are about the relationship between two variables. It is useful to identify one variable as the explanatory variable and the other variable as the response variable.
In general, the value of the explanatory variable for an individual is thought to partially explain the value of the response variable for that individual.
2.3 Summarizing One or Two Categorical Variables
Numerical Summaries
• Count how many fall into each category.
• Calculate the percent in each category.
• If two variables, have the categories of the explanatory variable define the rows and compute row percentages.
Example 2.1
Importance of Order
Survey of n = 190 college students.
About half (92) given the question: “Randomly pick a letter - S or Q.” Note: 66% picked the first choice of S.
Other half (98) given the question: “Randomly pick a letter - Q or S.” Note: 54% picked the first choice of Q.
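The row-percentage summary from Section 2.3 can be sketched for this example. The counts below are an assumption: they are back-computed from the rounded percentages on the slide (66% of 92 and 54% of 98), so they approximate the raw survey table rather than reproduce it. The names `table` and `row_percentages` are illustrative only.

```python
# Rows are the explanatory variable (which letter was listed first);
# columns are the response (which letter the student picked).
# Counts are back-computed from the slide's rounded percentages, so they
# are approximations and not the original raw data.
table = {
    "S listed first": {"S": 61, "Q": 31},   # 92 students
    "Q listed first": {"S": 45, "Q": 53},   # 98 students
}


def row_percentages(counts):
    """Return each category's share of its row total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


row_pct = {order: row_percentages(counts) for order, counts in table.items()}

for order, pct in row_pct.items():
    rounded = {category: round(value) for category, value in pct.items()}
    print(order, rounded)
```

Computing percentages within each row makes the two question orders directly comparable even though the group sizes (92 and 98) differ.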
Example 2.2
Lighting the Way to Nearsightedness
Survey of n = 479 children. Those who slept with nightlight or in fully lit room before age 2 had higher incidence of nearsightedness (myopia) later in childhood.
Note: Study does not prove sleeping with light actually caused myopia in more children.
Visual Summaries for Categorical Variables
• Pie Charts: useful for summarizing a single categorical variable if not too many categories.
• Bar Graphs: useful for summarizing one or two categorical variables and particularly useful for making comparisons when there are two categorical variables.
Example 2.3
Humans Are Not Good Randomizers
Survey of n = 190 college students. “Randomly pick a number between 1 and 10.”
Results: Most chose 7, very few chose 1 or 10.
Example 2.4
Revisiting Nightlights and Nearsightedness
Survey of n = 479 children.
Response: Degree of Myopia
Explanatory: Amount of Sleeptime Lighting
2.4 Finding Information in Quantitative Data
Long list of numbers – needs to be organized to obtain answers to questions of interest.
Five-Number Summaries
• Find extremes (high, low), the median, and the quartiles (medians of lower and upper halves of the values).
• Quick overview of the data values.
• Information about the center, spread, and shape of data.
Example 2.5
Right Handspans
About 25% of handspans of females are between 12.5 and 19.0 centimeters, about 25% are between 19 and 20 cm, about 25% are between 20 and 21 cm, and about 25% are between 21 and 23.25 cm.
Interesting Features of Quantitative Variables
• Location: center or average, e.g. median.
• Spread: variability, e.g. difference between two extremes or two quartiles.
• Shape: (later in Section 2.5)
Outliers and How to Handle Them
Outlier: a data point that is not consistent with the bulk of the data.
• Look for them via graphs.
• Can have big influence on conclusions.
• Can cause complications in some statistical analyses.
• Cannot discard without justification.
Example 2.6
Ages of Death of U.S. First Ladies
Partial Data Listing and five-number summary:
Extremes are more interesting here:
Who died at 34? Martha Jefferson
Who lived to be 97? Bess Truman
Possible Reasons for Outliers and Reasonable Actions
• Mistake made while taking measurement or entering it into computer. If verified, should be discarded/corrected.
• Individual in question belongs to a different group than bulk of individuals measured. Values may be discarded if summary is desired and reported for the majority group only.
• Outlier is legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded; they provide important information about location and spread.
Example 2.7
Tiny Boatsmen
Weights (in pounds) of 18 men on crew team:
Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0
Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5
Note: last weight in each list is unusually small. They are the coxswains for their teams, while others are rowers.
2.5 Pictures for Quantitative Data
• Histograms: similar to bar graphs, used for any number of data values.
• Stem-and-leaf plots and dotplots: present all individual values, useful for small to moderate sized data sets.
• Boxplot or box-and-whisker plot: useful summary for comparing two or more groups.
Interpreting Histograms, Stemplots, and Dotplots
• Values are centered around 20 cm.
• Two possible low outliers.
• Apart from outliers, spans range from about 16 to 23 cm.
Describing Shape
• Symmetric, bell-shaped
• Symmetric, not bell-shaped
• Skewed Right: values trail off to the right
• Skewed Left: values trail off to the left
Creating a Histogram
Step 1: Decide how many equally spaced (same width) intervals to use for the horizontal axis. Between 6 and 15 intervals is a good number.
Step 2: Decide to use frequencies (count) or relative frequencies (proportion) on the vertical axis.
Step 3: Draw equally spaced intervals on horizontal axis covering entire range of the data. Determine frequency or relative frequency of data values in each interval and draw a bar with corresponding height. Decide rule to use for values that fall on the border between two intervals.
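The three steps can be sketched in code using the 18 crew weights from Example 2.7. The choices here are assumptions made for illustration, not part of the slides: six intervals of width 20 pounds starting at 100, and a border rule that puts a value sitting exactly on a border into the higher interval. The function name `histogram_counts` is hypothetical.

```python
# Crew weights (pounds) from Example 2.7: Cambridge, then Oxford.
weights = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0,
           186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5]


def histogram_counts(values, low, width, n_intervals):
    """Count values in equal-width intervals [low + i*width, low + (i+1)*width).

    Border rule (an assumption): a value on a border goes to the higher interval.
    """
    counts = [0] * n_intervals
    for value in values:
        index = int((value - low) // width)
        if not 0 <= index < n_intervals:
            raise ValueError(f"{value} falls outside the chosen intervals")
        counts[index] += 1
    return counts


# Step 1: choose the intervals (assumed: 100 to 220 in steps of 20).
low, width, n_intervals = 100, 20, 6

# Step 2: frequencies (counts) and relative frequencies (proportions).
frequencies = histogram_counts(weights, low, width, n_intervals)
relative = [count / len(weights) for count in frequencies]

# Step 3: draw one bar per interval; here a row of asterisks stands in for the bar.
for i, count in enumerate(frequencies):
    left = low + i * width
    print(f"{left}-{left + width}: {'*' * count} ({count})")
```

The two coxswains show up as a separate bar far to the left of the rowers, which is the kind of feature a histogram is meant to reveal.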
Creating a Dotplot
“A dotplot displays a dot for each observation along a number line. If there are multiple occurrences of an observation, or if observations are too close together, then dots will be stacked vertically. If there are too many points to fit vertically in the graph, then each dot may represent more than one point.” (Minitab, Release 12.1, 1998)
Creating a Stem-and-Leaf Plot
Step 1: Determine stem values. The “stem” contains all but the last of the displayed digits of a number. Stems should define equally spaced intervals.
Step 2: For each individual, attach a “leaf” to the appropriate stem. A “leaf” is the last of the displayed digits of a number. Often leaves are ordered on each stem.
Note: More than one way to define stems. Can use split-stems or truncate/round values first.
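A minimal sketch of the two steps, again using the crew weights from Example 2.7. As an assumption for illustration, the decimal part is truncated first, the stem is the number of tens, and the leaf is the units digit. The function name `stem_and_leaf` is hypothetical.

```python
# Crew weights (pounds) from Example 2.7: Cambridge, then Oxford.
weights = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0,
           186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5]


def stem_and_leaf(values):
    """Return {stem: ordered leaves}, with stem = tens and leaf = units digit.

    Values are truncated to whole numbers first (assumes positive values).
    """
    truncated = sorted(int(value) for value in values)
    # Step 1: equally spaced stems covering the whole range, including empty ones.
    first_stem, last_stem = truncated[0] // 10, truncated[-1] // 10
    plot = {stem: [] for stem in range(first_stem, last_stem + 1)}
    # Step 2: attach each leaf to its stem; sorted input keeps leaves ordered.
    for value in truncated:
        plot[value // 10].append(value % 10)
    return plot


plot = stem_and_leaf(weights)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Keeping the empty stems (11 through 16) preserves the gap between the coxswains and the rowers, so the plot still shows the shape of the data.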
Example 2.8
Big Music Collection
About how many CDs do you own?
Stem is ‘100s’ and leaf unit is ‘10s’. Final digit is truncated.
Numbers ranged from 0 to about 450, with 450 being a clear outlier and most values ranging from 0 to 99. The shape is skewed right.
2.6 Numerical Summaries of Quantitative Data
Notation for Raw Data:
$n$ = number of individuals in a data set
$x_1, x_2, x_3, \ldots, x_n$ represent individual raw data values
Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19. Then,
$n = 6$
$x_1 = 21$, $x_2 = 19$, $x_3 = 20$, $x_4 = 20$, $x_5 = 22$, and $x_6 = 19$
Describing the Location of a Data Set
• Mean: the numerical average
• Median: the middle value (if n odd) or the average of the middle two values (n even)
Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median
Determining the Mean and Median
The Mean
$\bar{x} = \frac{\sum x_i}{n}$
The Median
If n is odd: M = middle of ordered values. Count (n + 1)/2 down from top of ordered list.
If n is even: M = average of middle two ordered values. Average values that are (n/2) and (n/2) + 1 down from top of ordered list.
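The counting rules above can be sketched directly, using the six handspan values from the notation example (21, 19, 20, 20, 22, 19). The function names are illustrative; positions on the slide are counted from 1, so the code subtracts 1 to get list indexes.

```python
def mean(values):
    """Numerical average: the sum of the values divided by n."""
    return sum(values) / len(values)


def median(values):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Count (n + 1)/2 down from the top of the ordered list.
        return ordered[(n + 1) // 2 - 1]
    # Average the values that are n/2 and (n/2) + 1 down from the top.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


handspans = [21, 19, 20, 20, 22, 19]   # centimeters, six females
handspan_mean = mean(handspans)
handspan_median = median(handspans)
print(f"mean = {handspan_mean:.2f} cm, median = {handspan_median} cm")
```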
Example 2.9
Will “Normal” Rainfall Get Rid of Those Odors?
Data: Average rainfall (inches) for Davis, California for 47 years
Mean = 18.69 inches
Median = 16.72 inches
In 1997-98, a company with odor problem blamed it on excessive rain. That year rainfall was 29.69 inches. More rain occurred in 4 other years.
The Influence of Outliers on the Mean and Median
Larger influence on mean than median.
High outliers will increase the mean. Low outliers will decrease the mean.
If ages at death are: 70, 72, 74, 76, and 78 then mean = median = 74 years.
If ages at death are: 35, 72, 74, 76, and 78 then median = 74 but mean = 67 years.
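The two sets of ages can be checked with Python's standard `statistics` module. This is only a quick sketch of the comparison on the slide; the variable names are illustrative.

```python
import statistics

ages_without_outlier = [70, 72, 74, 76, 78]
ages_with_outlier = [35, 72, 74, 76, 78]   # 70 replaced by the low outlier 35

for label, ages in [("no outlier", ages_without_outlier),
                    ("low outlier", ages_with_outlier)]:
    print(label,
          "mean =", statistics.mean(ages),
          "median =", statistics.median(ages))
```

Replacing one value with a low outlier pulls the mean down by 7 years while the median stays at 74.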
Describing Spread: Range and Interquartile Range
• Range = high value – low value
• Interquartile Range (IQR) = upper quartile – lower quartile
• Standard Deviation (covered later in Section 2.7)
Example 2.10
Fastest Speeds Ever Driven
Five-Number Summary for 87 males
• Median = 110 mph measures the center of the data
• Two extremes describe spread over 100% of data: Range = 150 – 55 = 95 mph
• Two quartiles describe spread over middle 50% of data: Interquartile Range = 120 – 95 = 25 mph
Notation and Finding the Quartiles
Split the ordered values into the half that is below the median and the half that is above the median.
$Q_1$ = lower quartile = median of data values that are below the median
$Q_3$ = upper quartile = median of data values that are above the median
Example 2.10
Fastest Speeds (cont)
Ordered Data (in rows of 10 values) for the 87 males:
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
• Median = (87+1)/2 = 44th value in the list = 110 mph
• $Q_1$ = median of the 43 values below the median = (43+1)/2 = 22nd value from the start of the list = 95 mph
• $Q_3$ = median of the 43 values above the median = (43+1)/2 = 22nd value from the end of the list = 120 mph
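The same five-number summary can be reproduced in code. The sketch below follows the slides' rule (quartiles are medians of the values below and above the median); other software may use slightly different quartile conventions. The tally `speed_counts` was made by counting the ordered list above, and the function names are illustrative.

```python
# (speed in mph, how many of the 87 males reported it), tallied from the slide.
speed_counts = [
    (55, 1), (60, 1), (80, 4), (85, 4), (90, 5), (92, 1), (94, 1),
    (95, 6), (100, 9), (101, 1), (102, 1), (105, 8), (109, 1),
    (110, 12), (112, 1), (115, 6), (120, 10), (124, 1), (125, 6),
    (130, 2), (140, 4), (145, 1), (150, 1),
]
speeds = [speed for speed, count in speed_counts for _ in range(count)]


def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # values below the median position
    upper = ordered[(n + 1) // 2:]     # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


low, q1, med, q3, high = five_number_summary(speeds)
print("five-number summary:", low, q1, med, q3, high)
print("range =", high - low, "mph; IQR =", q3 - q1, "mph")
```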
Percentiles
The kth percentile is a number that has k% of the data values at or below it and (100 – k)% of the data values at or above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile
Picturing Location and Spread with Boxplots
Boxplots for right handspans of males and females.
• Box covers the middle 50% of the data
• Line within box marks the median value
• Possible outliers are marked with asterisk
• Apart from outliers, lines extending from box reach to min and max values.
How to Draw a Boxplot of a Quantitative Variable
Step 1: Label either a vertical axis or a horizontal axis with numbers from min to max of the data.
Step 2: Draw box with lower end at $Q_1$ and upper end at $Q_3$.
Step 3: Draw a line through the box at the median M.
Step 4: Draw a line from $Q_1$ end of box to smallest data value that is not further than 1.5 IQR from $Q_1$. Draw a line from $Q_3$ end of box to largest data value that is not further than 1.5 IQR from $Q_3$.
Step 5: Mark data points further than 1.5 IQR from either edge of the box with an asterisk.
Points represented with asterisks are considered to be outliers.
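The numbers needed for Steps 2 through 5 can be computed before anything is drawn. The sketch below applies the 1.5 IQR rule to the 18 crew weights from Example 2.7, treating both teams as one group (an assumption made for illustration). The function name `boxplot_stats` and its dictionary keys are hypothetical.

```python
# Crew weights (pounds) from Example 2.7: Cambridge, then Oxford.
weights = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0,
           186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5]


def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def boxplot_stats(values):
    """Box edges, median, whisker ends, and outliers under the 1.5 IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # Step 2: lower end of the box
    q3 = median(ordered[(n + 1) // 2:])   # Step 2: upper end of the box
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_limit <= v <= high_limit]
    return {
        "q1": q1,
        "median": median(ordered),        # Step 3: line through the box
        "q3": q3,
        "whisker_low": inside[0],         # Step 4: smallest value within 1.5 IQR of Q1
        "whisker_high": inside[-1],       # Step 4: largest value within 1.5 IQR of Q3
        "outliers": [v for v in ordered if v < low_limit or v > high_limit],  # Step 5
    }


stats = boxplot_stats(weights)
for name, value in stats.items():
    print(name, value)
```

With these data the two coxswains are the only points flagged, and the whiskers stop at the lightest and heaviest rowers.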
2.7 Bell-Shaped Distributions of Numbers
• Many measurements follow a predictable pattern:
• Most individuals are clumped around the center.
• The greater the distance a value is from the center, the fewer individuals have that value.
Variables that follow such a pattern are said to be “bell-shaped”. A special case is called a normal distribution or normal curve.
Example 2.11
Bell-Shaped British Women’s Heights
Data: representative sample of 199 married British couples.
Figure: histogram of the wives’ heights with a normal curve superimposed. The mean height = 1602 millimeters.
Describing Spread with Standard Deviation
Standard deviation measures variability by summarizing how far individual data values are from the mean.
Think of the standard deviation as roughly the average distance values fall from the mean.
Describing Spread with Standard Deviation
Both sets have same mean of 100.
Set 1: all values are equal to the mean so there is no variability at all.
Set 2: one value equals the mean and other four values are 10 points away from the mean, so the average distance away from the mean is about 10.
Calculating the Standard Deviation
Formula for the (sample) standard deviation:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The value of $s^2$ is called the (sample) variance.
An equivalent formula, easier to compute, is:
$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$
Calculating the Standard Deviation
Step 1: Calculate $\bar{x}$, the sample mean.
Step 2: For each observation, calculate the difference between the data value and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3, and then divide this sum by n – 1.
Step 5: Take the square root of the value in step 4.
Calculating the Standard Deviation
Consider four pulse rates: 62, 68, 74, 76
Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = \frac{280}{4} = 70$
Steps 2 and 3:
Value   Value – mean   (Value – mean) squared
62      –8             64
68      –2             4
74      4              16
76      6              36
Sum     0              120
Step 4: $s^2 = \frac{120}{4 - 1} = 40$
Step 5: $s = \sqrt{40} \approx 6.3$
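The five steps translate almost line for line into code. The sketch below uses the four pulse rates from this example and also checks the equivalent shortcut formula for the variance; the variable names are illustrative.

```python
import math

pulse_rates = [62, 68, 74, 76]
n = len(pulse_rates)

mean = sum(pulse_rates) / n                          # Step 1: sample mean
deviations = [x - mean for x in pulse_rates]         # Step 2: value minus mean
squared = [d ** 2 for d in deviations]               # Step 3: squared differences
variance = sum(squared) / (n - 1)                    # Step 4: divide the sum by n - 1
s = math.sqrt(variance)                              # Step 5: square root

# Equivalent formula: (sum of squared values - n * mean^2) / (n - 1).
shortcut_variance = (sum(x ** 2 for x in pulse_rates) - n * mean ** 2) / (n - 1)

print("mean =", mean)
print("squared differences =", squared)
print("variance =", variance, "shortcut =", shortcut_variance)
print("standard deviation =", round(s, 1))
```

Dividing by n – 1 rather than n is what makes this the sample standard deviation; the population version on the next slide divides by n.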
Population Standard Deviation
Data sets usually represent a sample from a larger population. If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different. A population mean is represented by the symbol $\mu$ (“mu”), and the population standard deviation is
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
Interpreting the Standard Deviation for Bell-Shaped Curves: The Empirical Rule
For any bell-shaped curve, approximately
• 68% of the values fall within 1 standard deviation of the mean in either direction
• 95% of the values fall within 2 standard deviations of the mean in either direction
• 99.7% of the values fall within 3 standard deviations of the mean in either direction
The Empirical Rule, the Standard Deviation, and the Range
• Empirical Rule => the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape.
• You can get a rough idea of the value of the standard deviation by dividing the range by 6.
$s \approx \frac{\text{Range}}{6}$
Example 2.11
Women’s Heights (cont)
Mean height for the 199 British women is 1602 mm and standard deviation is 62.4 mm.
• 68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm
• 95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm
• 99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm
Example 2.11
Women’s Heights (cont)
Summary of the actual results:
Note: The minimum height = 1410 mm and the maximum height = 1760 mm, for a range of 1760 – 1410 = 350 mm.
So an estimate of the standard deviation is:
$s \approx \frac{\text{Range}}{6} = \frac{350}{6} \approx 58.3$ mm
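Both calculations for the women's heights (the Empirical Rule intervals and the range-based estimate of the standard deviation) are simple enough to sketch in a few lines. The inputs are the values quoted on the slides; the variable names are illustrative.

```python
mean_height = 1602.0   # mm, from Example 2.11
sd = 62.4              # mm, from Example 2.11

# Empirical Rule: mean plus or minus k standard deviations, for k = 1, 2, 3.
coverage = {1: "68%", 2: "95%", 3: "99.7%"}
intervals = {k: (mean_height - k * sd, mean_height + k * sd) for k in coverage}

for k, (low, high) in intervals.items():
    print(f"about {coverage[k]} of heights between {low:.1f} and {high:.1f} mm")

# Rough estimate of the standard deviation: range divided by 6.
height_range = 1760 - 1410
sd_estimate = height_range / 6
print(f"range = {height_range} mm, estimated s = {sd_estimate:.1f} mm")
```

The estimate of about 58.3 mm is close to the actual standard deviation of 62.4 mm, which is all the range-over-6 rule is meant to provide.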
Standardized z-Scores
Standardized score or z-score:
$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$
Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80:
$z = \frac{80 - 70}{8} = 1.25$
A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.
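As a small sketch, the z-score formula is a one-line function; `z_score` is an illustrative name, and the inputs are the pulse-rate values from the example.

```python
def z_score(observed, mean, standard_deviation):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / standard_deviation


pulse_z = z_score(80, 70, 8)
direction = "above" if pulse_z > 0 else "below"
print(f"z = {pulse_z}: {abs(pulse_z)} standard deviations {direction} the mean")
```

A negative result would indicate a value below the mean, for example a resting pulse rate of 62 bpm gives z = (62 – 70)/8 = –1.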
The Empirical Rule Restated
For bell-shaped data,
• About 68% of the values have z-scores between –1 and +1.
• About 95% of the values have z-scores between –2 and +2.
• About 99.7% of the values have z-scores between –3 and +3.