Chapter 5 Describing Distributions Numerically

Chapter 5
Understanding and Comparing
Distributions
Math2200
Example: The Hopkins Memorial Forest
• A 2500-acre reserve in Massachusetts, New York, and Vermont
• Managed by the Williams College Center for Environmental Studies (CES)
• http://www.williams.edu/CES/hopkins.htm
• Average wind speed for every day in
1989
– Important for monitoring storms
Avg Wind   Day of Year   Month
1.88       1             1
2.57       2             1
4.04       3             1
4.73       4             1
2.49       5             1
2.17       6             1
3.51       7             1
4.59       8             1
4.4        9             1
1.85       10            1
3.17       11            1
3.44       12            1
6.33       13            1
2.39       14            1
3.14       15            1
2.56       16            1
3.11       17            1
1.64       18            1
2.05       19            1
2.98       20            1
4.66       21            1
0.81       22            1
0.72       23            1
Five-number summary
Max      8.670
Q3       2.930
Median   1.900
Q1       1.150
Min      0.200
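As a rough illustration of how such a summary is computed, here is a minimal Python sketch applied to the 23 January values listed in the data table. The summary shown on the slide has a maximum of 8.670, which is not among these 23 values, so it evidently describes a larger data set and these numbers are not expected to match it. The function name `five_number_summary` is an assumption made for this example, and the quartiles follow the default "exclusive" method of `statistics.quantiles`; other quartile conventions can give slightly different values.

```python
import statistics

# Average wind speeds for the 23 January days listed in the data table.
january_wind = [
    1.88, 2.57, 4.04, 4.73, 2.49, 2.17, 3.51, 4.59, 4.4, 1.85, 3.17, 3.44,
    6.33, 2.39, 3.14, 2.56, 3.11, 1.64, 2.05, 2.98, 4.66, 0.81, 0.72,
]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Quartiles come from statistics.quantiles with its default
    "exclusive" method, which is only one of several conventions.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


summary = five_number_summary(january_wind)
for label, value in zip(("Min", "Q1", "Median", "Q3", "Max"), summary):
    print(f"{label:<8}{value:.3f}")
```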
Boxplot
• Invented by John W. Tukey
Constructing Boxplots
1. Draw a single vertical axis spanning the range of the data.
Draw short horizontal lines at the lower and upper quartiles
and at the median. Then connect them with vertical lines to
form a box.
Constructing Boxplots (cont.)
2. Erect “fences” around the main part of the data.
– The upper fence is 1.5 IQRs above the upper quartile.
– The lower fence is 1.5 IQRs below the lower quartile.
– Note: the fences only help with constructing the boxplot
and should not appear in the final display.
Constructing Boxplots (cont.)
3. Use the fences to grow “whiskers.”
– Draw lines from the ends of the box up and down to the
most extreme data values found within the fences.
– If a data value falls outside one of the fences, we do not
connect it with a whisker.
Constructing Boxplots (cont.)
4. Add the outliers by displaying any data values beyond the
fences with special symbols.
– We often (not always) use a different symbol for “far
outliers” that are farther than 3 IQRs from the quartiles.
How to make a boxplot?
• Draw a single vertical axis spanning
the extent of the data
• Draw short horizontal lines at Q1, the
median, and Q3. Then connect them to
make a box.
• Draw ‘fences’
– Upper fence = Q3 + 1.5 * IQR
– Lower fence = Q1 - 1.5 * IQR
• Grow ‘whiskers’
• Add outliers
• TI-83 can make boxplots
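The construction steps can also be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used to draw the slide's plot: the helper names `fences` and `boxplot_parts` are invented for this example, and the quartiles use the default "exclusive" method of `statistics.quantiles`. The first call plugs in the quartiles quoted in the five-number summary (Q1 = 1.150, Q3 = 2.930); the second works on the 23 January values from the data table.

```python
import statistics


def fences(q1, q3, k=1.5):
    """Return (lower, upper) fences placed k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def boxplot_parts(values):
    """Compute box, fences, whisker ends, and outliers for one boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # one quartile convention
    lower, upper = fences(q1, q3)
    far_lower, far_upper = fences(q1, q3, k=3.0)
    inside = [v for v in values if lower <= v <= upper]
    outside = [v for v in values if v < lower or v > upper]
    return {
        "box": (q1, median, q3),
        "fences": (lower, upper),
        # Whiskers end at the most extreme values that lie within the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in outside if far_lower <= v <= far_upper],
        "far_outliers": [v for v in outside if v < far_lower or v > far_upper],
    }


# Fences implied by the quartiles in the five-number summary.
summary_lower, summary_upper = fences(1.150, 2.930)
print(f"fences: {summary_lower:.2f} to {summary_upper:.2f}")
print("maximum 8.670 lies beyond the upper fence:", 8.670 > summary_upper)

# The 23 January values from the data table.
january_wind = [
    1.88, 2.57, 4.04, 4.73, 2.49, 2.17, 3.51, 4.59, 4.4, 1.85, 3.17, 3.44,
    6.33, 2.39, 3.14, 2.56, 3.11, 1.64, 2.05, 2.98, 4.66, 0.81, 0.72,
]
parts = boxplot_parts(january_wind)
for name, value in parts.items():
    print(name, value)
```

With the summary quartiles the IQR is 1.78, so the fences sit at about -1.52 and 5.60, and the reported maximum of 8.670 would be drawn as an outlier rather than joined to a whisker.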
Comparing groups
• Relationship between a quantitative
variable and a categorical variable
– A categorical variable defines groups
• Is it windier in the winter or summer?
– A binary categorical variable
• Spring/Summer: April to September
• Fall/Winter: October to March
– A quantitative variable: average wind speed
Comparison
                Spring/Summer         Fall/Winter
Shape
  mode          unimodal              unimodal
  symmetry      skewed to the right   less skewed
  outlier       no                    yes
Center
  mean          1.556                 2.712
  median        1.340                 2.470
Spread
  StdDev        1.005                 1.359
  IQR           1.315                 1.865
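A comparison like this can be produced by splitting the quantitative variable by the categorical one and summarizing each group separately. The sketch below is only an illustration: the helper names `season` and `summarize` are assumptions, the January rows are the first six from the data table, and the July rows are made up purely so that both groups have values, so the output will not reproduce the numbers in the comparison.

```python
import statistics
from collections import defaultdict


def season(month):
    """Map a month number (1-12) to the two groups defined above."""
    return "Spring/Summer" if 4 <= month <= 9 else "Fall/Winter"


def summarize(values):
    """Center and spread measures of the kind shown in the comparison."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # one quartile convention
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,
    }


# (month, average wind speed). January rows come from the data table;
# the July rows are MADE UP for illustration only.
records = [
    (1, 1.88), (1, 2.57), (1, 4.04), (1, 4.73), (1, 2.49), (1, 2.17),
    (7, 0.9), (7, 1.2), (7, 1.0), (7, 1.6), (7, 0.7), (7, 1.3),
]

groups = defaultdict(list)
for month, wind in records:
    groups[season(month)].append(wind)

by_season = {name: summarize(winds) for name, winds in groups.items()}
for name, stats in by_season.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```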
Are some months windier than others?
Summary
• Average wind speed is lower and less
variable in the summer, especially July
• Average wind speed is higher and more
variable in the winter
• The highest wind speed occurs in
November
• The monthly boxplots show more outliers
than the boxplot for the entire year
Outliers
• Some outliers are obviously errors
– Misplacing the decimal point
– Digit transposed
– Digits repeated or omitted
– Units may be wrong
– Incorrectly copied
• What to do with outliers?
– If there are any clear outliers and you are reporting
the mean and standard deviation, report them with
the outliers present and with the outliers removed.
The differences may be quite revealing.
– Note: The median and IQR are not likely to be
affected by the outliers.
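To see why reporting both versions can be revealing, here is a small Python sketch. It takes the 23 January values from the data table and deliberately mistypes day 3's 4.04 as 40.4, a hypothetical misplaced decimal point that is not in the original data. The helper name `describe` is an assumption, and the 1.5 IQR fence rule is used to decide which values count as outliers.

```python
import statistics

# January values from the data table, with day 3's 4.04 replaced by a
# HYPOTHETICAL typo of 40.4 (misplaced decimal point).
recorded = [
    1.88, 2.57, 40.4, 4.73, 2.49, 2.17, 3.51, 4.59, 4.4, 1.85, 3.17, 3.44,
    6.33, 2.39, 3.14, 2.56, 3.11, 1.64, 2.05, 2.98, 4.66, 0.81, 0.72,
]


def describe(values):
    """Return mean, standard deviation, median, and IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # one quartile convention
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": median,
        "iqr": q3 - q1,
    }


# Flag outliers with the 1.5 IQR fences, then drop them.
q1, _, q3 = statistics.quantiles(recorded, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = [v for v in recorded if lower <= v <= upper]

with_outliers = describe(recorded)
without_outliers = describe(kept)
for key in ("mean", "stdev", "median", "iqr"):
    print(f"{key:<7} with: {with_outliers[key]:.3f}   "
          f"without: {without_outliers[key]:.3f}")
```

In this sketch the single bad value pulls the mean and standard deviation substantially, while the median and IQR shift much less.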
Timeplots
• For some data sets, we are interested in how the
data behave over time. In these cases, we
construct timeplots of the data.
Re-expressing Skewed Data to
Improve Symmetry
When data are skewed, it is hard to summarize them simply with
a center and a spread.
Can we transform the data to be more symmetric?
Histogram of the annual compensation to
CEOs of the Fortune 500 companies in
2005
Re-expressing Skewed Data to
Improve Symmetry (cont.)
• One way to make a skewed distribution more symmetric is to
re-express or transform the data by applying a simple
function (e.g., logarithmic function or square root).
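The effect of re-expression can be checked numerically. The following Python sketch uses hypothetical right-skewed values (they are not the CEO compensation figures from the histogram) and compares the gap between mean and median before and after a square-root and a base-10 logarithm transform. The helper `mean_minus_median` is an assumption for this example and only a crude symmetry check: it is near zero for a symmetric distribution and positive when the right tail is long.

```python
import math
import statistics

# HYPOTHETICAL right-skewed values (say, pay in millions of dollars);
# these are NOT the Fortune 500 figures from the histogram.
pay = [0.5, 1, 2, 4, 8, 16, 32, 64, 128]

root_pay = [math.sqrt(x) for x in pay]
log_pay = [math.log10(x) for x in pay]


def mean_minus_median(values):
    """Crude symmetry check: close to zero for symmetric data."""
    return statistics.mean(values) - statistics.median(values)


for name, values in (("raw", pay), ("sqrt", root_pay), ("log10", log_pay)):
    print(f"{name:<6}{mean_minus_median(values):.3f}")
```

For these made-up values the square root reduces the gap and the logarithm removes it, because each value is double the previous one. Real data will rarely behave this neatly, so it is worth plotting a histogram of the re-expressed values as well.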
What Can Go Wrong?
• Avoid inconsistent scales, either within the
display or when comparing two displays.
• Label clearly so a reader knows what the plot
displays.
• Beware of outliers
• Be careful when comparing groups that have
very different spreads
What Can Go Wrong? (cont.)