Transcript Slide 1
Looking at data: distributions
- Describing distributions with numbers
IPS chapter 1.2
© 2006 W.H. Freeman and Company
Objectives (IPS chapter 1.2)
Describing distributions with numbers
Measure of center: the mean
Measure of center: the median
Measure of spread: the quartiles
Five-number summary and boxplot
Measure of spread: the standard deviation
Choosing among summary statistics
Changing the unit of measurement
Measure of center: the mean
The mean or arithmetic average
To calculate the average, or mean, add
all values, then divide by the number of
individuals. It is the “center of mass.”
Sum of heights is 1598.3
divided by 25 women = 63.9 inches
woman (i)   height (x)     woman (i)   height (x)
i = 1       x1 = 58.2      i = 14      x14 = 64.0
i = 2       x2 = 59.5      i = 15      x15 = 64.5
i = 3       x3 = 60.7      i = 16      x16 = 64.1
i = 4       x4 = 60.9      i = 17      x17 = 64.8
i = 5       x5 = 61.9      i = 18      x18 = 65.2
i = 6       x6 = 61.9      i = 19      x19 = 65.7
i = 7       x7 = 62.2      i = 20      x20 = 66.2
i = 8       x8 = 62.2      i = 21      x21 = 66.7
i = 9       x9 = 62.4      i = 22      x22 = 67.1
i = 10      x10 = 62.9     i = 23      x23 = 67.8
i = 11      x11 = 63.9     i = 24      x24 = 68.9
i = 12      x12 = 63.1     i = 25      x25 = 69.6
i = 13      x13 = 63.9     n = 25      Sum = 1598.3
Mathematical notation:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\bar{x} = \frac{1598.3}{25} = 63.9$
Learn right away how to get the mean using your calculators.
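For readers who prefer software to a calculator, here is a minimal Python sketch of the same arithmetic. Python is our choice for illustration, not something the slides prescribe, and the variable names are our own. The list holds the 25 heights from this slide.

```python
from statistics import mean

# Heights (inches) of the 25 women, copied from this slide.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]

n = len(heights)
total = sum(heights)   # add all values
x_bar = total / n      # divide by the number of individuals

print(n, round(total, 1), round(x_bar, 1))

# The standard-library helper agrees up to floating-point rounding.
print(round(mean(heights), 1))
```

Rounded to one decimal place, the result matches the 63.9 inches shown above.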
Your numerical summary must be meaningful.
The distribution of women’s heights appears coherent and symmetrical. The mean is a good numerical summary.
[Figure: histogram, Height of 25 women in a class]
Here the shape of the distribution is wildly irregular. Why? Could we have more than one plant species or phenotype?
[Figure: histogram, Height of Plants by Color (red, pink, blue); x-axis: Height in centimeters; y-axis: Number of Plants. Values of x̄ marked on the two figures: 69.3, 69.6, 63.9, 70.5, 78.3]
A single numerical summary here would not make sense.
Measure of center: the median
The median is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
[Figure: the 25 observations sorted by size and numbered 1 to 25; the median is observation 13]
Data (n = 25): 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
1. Sort observations by size.
n = number of observations
2.a. If n is odd, the median is observation (n+1)/2 down the list.
n = 25
(n+1)/2 = 26/2 = 13
Median = 3.4
2.b. If n is even, the median is the mean of the two middle observations.
n = 24
n/2 = 12
Median = (3.3+3.4)/2 = 3.35
[Figure: the same observations without the largest value (6.1), sorted and numbered 1 to 24; the median lies between observations 12 and 13]
Data (n = 24): 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6
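The two-case rule for the median can be sketched in a few lines of Python. This is an illustration under our own naming (`median_by_position` is a hypothetical helper, not part of the slides); it simply follows steps 1, 2.a and 2.b.

```python
from statistics import median

def median_by_position(values):
    """Median via the sort-then-pick-the-middle rule from this slide."""
    ordered = sorted(values)               # step 1: sort by size
    n = len(ordered)
    if n % 2 == 1:                         # step 2.a: n is odd
        return ordered[(n + 1) // 2 - 1]   # observation (n+1)/2, counted from 1
    # step 2.b: n is even, so average the two middle observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

data_25 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
           3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
data_24 = sorted(data_25)[:-1]             # same data without the largest value

m_25 = median_by_position(data_25)
m_24 = median_by_position(data_24)
print(f"{m_25:.2f} {m_24:.2f}")

# The standard-library median follows the same rule.
print(f"{median(data_25):.2f} {median(data_24):.2f}")
```

Up to floating-point rounding, the results are the 3.4 and 3.35 worked out above.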
Comparing the mean and the median
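The mean and the median of a symmetric distribution are close together; if the distribution is exactly symmetric, they are equal. In a skewed distribution, the mean is pulled farther toward the long tail than the median. The median is a measure of center that is resistant to skew and outliers. The mean is not.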
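[Figure: mean and median for a symmetric distribution; the mean and median coincide]
[Figure: mean and median for skewed distributions. Left skew: the mean lies to the left of the median. Right skew: the mean lies to the right of the median.]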
Mean and median of a distribution with outliers
[Figure: two histograms of percent of people dying. Without the outliers: x̄ = 3.4. With the outliers: x̄ = 4.2]
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
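A small Python experiment makes the same point. Note the assumptions: the base list reuses the 25 observations from the median slide, and the two extreme values (12.0 and 15.0) are invented purely for illustration. They are not the data behind the histograms, so the numbers printed will not match 3.4, 4.2 and 3.6.

```python
from statistics import mean, median

# The 25 observations from the median slide.
base = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

# Hypothetical outliers, made up for this sketch (not from the slides).
outliers = [12.0, 15.0]
with_outliers = base + outliers

mean_shift = mean(with_outliers) - mean(base)
median_shift = median(with_outliers) - median(base)

print(f"mean:   {mean(base):.2f} -> {mean(with_outliers):.2f}")
print(f"median: {median(base):.2f} -> {median(with_outliers):.2f}")
print(f"shift in mean {mean_shift:.2f} vs shift in median {median_shift:.2f}")
```

With these made-up values the mean moves several times farther than the median, which is what "resistant" means in practice.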
Impact of skewed data
Mean and median of a symmetric distribution
Disease X: x̄ = 3.4, M = 3.4
Mean and median are the same.
… and a right-skewed distribution
Multiple myeloma: x̄ = 3.4, M = 2.5
The mean is pulled toward the skew.
Measure of spread: the quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M).
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M).
M = median = 3.4
[Figure: the 25 sorted observations (0.6 to 6.1) split at the median M into a lower half and an upper half of 12 observations each]
Q1= first quartile = 2.2
Q3= third quartile = 4.35
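The quartile rule can be sketched in Python as "median of each half". The helper name `quartiles` is our own. How to split the data when n is even is our assumption (the halves are simply the first and last n/2 observations); the slide only shows the odd case, where M is excluded from both halves.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                      # observations before M
    # For odd n, skip the middle observation (M itself); for even n, M is
    # not an observation, so the upper half starts right at position half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(data)
print(f"Q1 = {q1:.2f}, M = {m:.2f}, Q3 = {q3:.2f}")
```

Software packages use several slightly different quartile conventions, so a library routine may return values that differ a little from this hand rule.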
Five-number summary and boxplot
[Figure: the 25 sorted observations beside a boxplot for Disease X; y-axis: Years until death]
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6
Five-number summary: min, Q1, M, Q3, max
Boxplots for skewed data
Comparing box plots for a normal and a right-skewed distribution
[Figure: side-by-side boxplots for Disease X and Multiple Myeloma; y-axis: Years until death]
Boxplots remain true to the data and clearly depict symmetry or skew.
Suspected outliers
Outliers are troublesome data points, and it is important to be able to
identify them.
One way to flag a suspected outlier is to measure the distance from the suspicious data point to the nearest quartile (Q1 or Q3) and compare this distance to the interquartile range (the distance between Q1 and Q3).
We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the “1.5 * IQR rule for outliers.”
[Figure: the 25 sorted observations, now with a largest value of 7.9, beside a boxplot for Disease X; y-axis: Years until death]
Q3 = 4.35
Q1 = 2.2
Interquartile range: Q3 − Q1 = 4.35 − 2.2 = 2.15
Distance to Q3: 7.9 − 4.35 = 3.55
Individual #25 has a value of 7.9 years, which is 3.55 years above
the third quartile. This is more than 3.225 years, 1.5 * IQR. Thus,
individual #25 is a suspected outlier.
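The check for individual #25 can be written as a short Python sketch. The function name `suspected_outliers` and the parameter `k` are our own; the quartiles are taken as given from the slide rather than recomputed. The rule flags any value more than 1.5 × IQR above Q3 or below Q1.

```python
def suspected_outliers(values, q1, q3, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr     # anything below this is suspect
    high_fence = q3 + k * iqr    # anything above this is suspect
    return [x for x in values if x < low_fence or x > high_fence]

# Disease X observations from this slide (largest value 7.9).
data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9]

q1, q3 = 2.2, 4.35               # quartiles as stated on the slide
flagged = suspected_outliers(data, q1, q3)
print(flagged)
```

Only the 7.9 observation is flagged: the upper fence is about 4.35 + 3.225 = 7.575, and the second-largest value, 6.1, sits well inside it.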
Measure of spread: the standard deviation
The standard deviation “s” is used to describe the variation around the
mean. Like the mean, it is not resistant to skew or outliers.
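1. First calculate the variance s².
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation s.
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
[Figure: distribution with the mean x̄ and the interval mean ± 1 s.d. marked]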
Calculations …
Women height (inches)

 i    xi    x̄       (xi − x̄)   (xi − x̄)²
 1    59    63.4    -4.4        19.0
 2    60    63.4    -3.4        11.3
 3    61    63.4    -2.4         5.6
 4    62    63.4    -1.4         1.8
 5    62    63.4    -1.4         1.8
 6    63    63.4    -0.4         0.1
 7    63    63.4    -0.4         0.1
 8    63    63.4    -0.4         0.1
 9    64    63.4     0.6         0.4
10    64    63.4     0.6         0.4
11    65    63.4     1.6         2.7
12    66    63.4     2.6         7.0
13    67    63.4     3.6        13.3
14    68    63.4     4.6        21.6
Sum                  0.0        85.2

Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
s² = variance = 85.2/13 = 6.55 inches squared
$s = \sqrt{\frac{1}{df} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
s = standard deviation = √6.55 = 2.56 inches
We’ll never calculate these by hand, so make sure to know how to
get the standard deviation using your calculator.
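If a computer is handier than a calculator, the same steps can be sketched in Python. This is only an illustration with our own variable names; it uses the 14 heights from the calculation table and divides by n − 1, as in the formula for s².

```python
from math import sqrt
from statistics import stdev, variance

# Heights (inches) of the 14 women from the calculation table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n
squared_devs = [(x - x_bar) ** 2 for x in heights]   # (xi - mean)^2
sum_sq = sum(squared_devs)

s2 = sum_sq / (n - 1)    # variance: divide by the degrees of freedom
s = sqrt(s2)             # standard deviation

print(round(x_bar, 1), round(sum_sq, 1), round(s2, 2), round(s, 2))

# The standard-library functions use the same n - 1 divisor.
print(round(variance(heights), 2), round(stdev(heights), 2))
```

Rounded as on the slide, this reproduces the mean 63.4, the sum of squares 85.2, the variance 6.55 and the standard deviation 2.56.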
Software output for summary statistics:
Excel - From Menu: Tools/Data Analysis/Descriptive Statistics
Gives common statistics of your sample data.
Minitab
Choosing among summary statistics
Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers.
Plot the mean and use the standard deviation for error bars.
Otherwise use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: Height of 30 Women; y-axis: Height in Inches; panels: Boxplot and Mean ± SD]
What should you use, when, and why?
Arithmetic mean or median?
Middletown is considering imposing an income tax on citizens. City hall wants a numerical summary of its citizens’ income to estimate the total tax base.
Mean: Although income is likely to be right-skewed, the city government wants to know about the total tax base, and the total is the mean multiplied by the number of citizens.
In a study of the standard of living of typical families in Middletown, a sociologist makes a numerical summary of family income in that city.
Median: The sociologist is interested in a “typical” family and wants to lessen the impact of extreme incomes.
Changing the unit of measurement
Variables can be recorded in different units of measurement. Most
often, one measurement unit is a linear transformation of another
measurement unit: xnew = a + bx.
Temperatures can be expressed in degrees Fahrenheit or degrees Celsius.
TemperatureFahrenheit = 32 + (9/5) * TemperatureCelsius, which has the form a + bx with a = 32 and b = 9/5.
Linear transformations do not change the basic shape of a distribution
(skew, symmetry, multimodal). But they do change the measures of
center and spread:
Multiplying each observation by a positive number b multiplies both
measures of center (mean, median) and spread (IQR, s) by b.
Adding the same number a (positive or negative) to each observation
adds a to measures of center and to quartiles but it does not change
measures of spread (IQR, s).
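These two rules can be checked numerically with a short Python sketch. The Celsius readings below are hypothetical values made up for the illustration; only the conversion formula comes from the slide.

```python
from statistics import mean, median, stdev

# Hypothetical temperatures in degrees Celsius (invented for this sketch).
celsius = [12.0, 15.5, 18.0, 21.5, 30.0]

a, b = 32.0, 9 / 5                       # xnew = a + b * x
fahrenheit = [a + b * c for c in celsius]

# Center: both the shift a and the scale b carry through.
mean_ok = abs(mean(fahrenheit) - (a + b * mean(celsius))) < 1e-9
median_ok = abs(median(fahrenheit) - (a + b * median(celsius))) < 1e-9

# Spread: multiplied by b, unaffected by a.
spread_ok = abs(stdev(fahrenheit) - b * stdev(celsius)) < 1e-9

# Adding a constant alone leaves the standard deviation unchanged.
shifted = [a + c for c in celsius]
shift_ok = abs(stdev(shifted) - stdev(celsius)) < 1e-9

print(mean_ok, median_ok, spread_ok, shift_ok)
```

All four checks should come out true: the center follows the full transformation a + bx, while the spread only picks up the factor b.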