Transcript Chapter 3

Measures of Center and
Variation
Sections 3.1 and 3.3
Prof. Felix Apfaltrer
[email protected]
Office:N518
Phone: 212-220 8000X 7421
Office hours:
Mon-Thu 1:30-2:15 pm
Measures of center - mean
• A measure of center is a value that
represents the center of the data set
• The mean is the most important
measure of center (also called
arithmetic mean)
• sample mean: $\bar{x} = \frac{\sum x}{n}$
• population mean: $\mu = \frac{\sum x}{N}$
– $\sum$: addition of values
– $x$: variable (indiv. data vals)
– $n$: sample size
– $N$: population size
Example. Lead (Pb) in air at BMCC (μg/m³); 1.5 is considered high:
5.4, 1.1, 0.42, 0.73, 0.48, 1.1
mean = 9.23/6 ≈ 1.54
Outlier has strong effect on mean! (Without 5.4 the mean would be 3.83/5 ≈ 0.77.)
Measures of center - median
• Mean is good but sensitive to outliers!
• Large values can have dramatic
effect!
The median is the middle value of the
original data arranged in increasing
order
– If n odd: exact middle value
– If n even: average of the 2 middle values
Previous example:
– reorder data:
0.42, 0.48, 0.73, 1.1, 1.1, 5.4
– n = 6 is even: median = (0.73 + 1.1)/2 = 0.915
If we had an extra data point:
5.4, 1.1, 0.42, 0.73, 0.48, 1.1, 0.66
After reordering we have
0.42, 0.48, 0.66, 0.73, 1.1, 1.1, 5.4
– n = 7 is odd: median = 0.73
Outlier has strong effect on mean, not so on median!
Used for example in median household income:
$ 36,078
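The effect of the outlier can be checked numerically. The sketch below is only an illustration using the lead data from the example; the variable names are made up here, and dropping the value 5.4 is an assumption made purely to show the contrast.

```python
from statistics import mean, median

# Lead measurements from the example (5.4 is the outlier)
lead = [5.4, 1.1, 0.42, 0.73, 0.48, 1.1]

# Same data with the outlier removed (for comparison only)
no_outlier = [x for x in lead if x != 5.4]

mean_all, median_all = mean(lead), median(lead)
mean_trim, median_trim = mean(no_outlier), median(no_outlier)

print(f"with outlier:    mean={mean_all:.3f}  median={median_all:.3f}")
print(f"without outlier: mean={mean_trim:.3f}  median={median_trim:.3f}")
```

The mean roughly halves when the outlier is dropped, while the median moves much less.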
Measures of Center - mode and midrange
• Mode M: value that occurs
most frequently
– if 2 values most frequent:
bimodal
– if more than 2: multimodal
– if no value repeated: no
mode
• Needs no numerical values
Examples:
a. 5.4, 1.1, 0.42, 0.73, 0.48, 1.1
b. 27, 27, 27, 55, 55, 55, 88, 88, 99
c. 1, 2, 3, 6, 7, 8, 9, 10
Solutions:
a. unimodal: 1.1
b. bimodal: 27 and 55
c. no mode
• Midrange
= (highest value + lowest value)/2
• Outliers have very strong
weight
Midrange for the examples:
a. (0.42+5.4)/2 = 2.91
b. (27+99)/2 = 63
c. (1+10)/2 = 5.5

Mean    Median    Mode    Midrange
172     170       1       245.5
61      2         1       276
16      0         0       154.5
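As a rough check of examples a, b and c, the mode and midrange can be computed as follows. The helper names `modes` and `midrange` are hypothetical, introduced here only for illustration; "no mode" is represented by an empty list.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # nothing repeated: no mode
        return []
    return [v for v, c in counts.items() if c == top]

def midrange(values):
    """Midrange = (highest value + lowest value) / 2."""
    return (max(values) + min(values)) / 2

a = [5.4, 1.1, 0.42, 0.73, 0.48, 1.1]
b = [27, 27, 27, 55, 55, 55, 88, 88, 99]
c = [1, 2, 3, 6, 7, 8, 9, 10]

for name, data in (("a", a), ("b", b), ("c", c)):
    print(name, "modes:", modes(data), "midrange:", midrange(data))
```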
Mode and more …
• Mode: not much used
with numerical data
Example:
Survey shows students own:
• 84% TV
• 76% VCR
• 69% CD player
• 39% video game player
• 35% DVD
TV is the mode!
No mean, median or midrange!
Round-off: carry one more decimal
than in data!
• Mean from frequency distribution: $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$
• Weighted mean: $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
• (Dis)advantages of different
measures of center
Measures of variation
• Variation measures
consistency
• Range = highest value - lowest
value
• Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$
[Figure: precision arrows vs. jungle arrows]
Same mean length, but different variation!
Standard deviation
• Measure of variation of all
values from mean
• Positive or zero (zero only when all data values are equal)
• Larger deviations, larger s
• Can increase dramatically with
outliers
• Same units as original data
values
Calculating standard deviation
Bank Unpredictable
x        x - mean    (x - mean)^2
0        -5          25
15       10          100
5        0           0
0        -5          25
0        -5          25
10       5           25
Total:
30                   200
mean = 30/6 = 5 min
s = sqrt(200/(6-1)) = sqrt(40) = 6.3 min
Recipe:
1. Compute the mean x̄
2. Subtract the mean from the
individual values
3. Square the differences
4. Add the squared differences
5. Divide by n-1.
6. Take the square root.
Example: waiting times (min)
Bank Consistency:   6 5 4 4 6 5
Bank Unpredictable: 0 15 5 0 0 10
For Bank Consistency:
1. Mean: (6+5+4+4+6+5)/6 = 5
2. (6-5)=1, (5-5)=0, (4-5)=-1, (4-5)=-1, (6-5)=1, (5-5)=0
3. 1²=1, 0²=0, (-1)²=1, (-1)²=1, 1²=1, 0²=0
4. ∑ = 1+0+1+1+1+0 = 4
5. n-1 = 6-1 = 5; 4/5 = 0.8
6. √0.8 = 0.9 min (vs 6.3 min for Bank Unpredictable)
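The six-step recipe can be written out directly. This is a minimal sketch with a hypothetical helper `sample_std`; it follows the recipe step by step rather than calling a library routine.

```python
import math

def sample_std(values):
    """Sample standard deviation following the six-step recipe."""
    n = len(values)
    mean = sum(values) / n                       # 1. compute the mean
    diffs = [x - mean for x in values]           # 2. subtract the mean
    squares = [d ** 2 for d in diffs]            # 3. square the differences
    total = sum(squares)                         # 4. add the squares
    variance = total / (n - 1)                   # 5. divide by n-1
    return math.sqrt(variance)                   # 6. take the square root

consistent = [6, 5, 4, 4, 6, 5]        # Bank Consistency waiting times (min)
unpredictable = [0, 15, 5, 0, 0, 10]   # Bank Unpredictable waiting times (min)

print(round(sample_std(consistent), 1), "min")
print(round(sample_std(unpredictable), 1), "min")
```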
Standard deviation of sample and population
Example using fast formula:
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}$
• Find values of n, $\sum x$, $\sum x^2$
n = 6 (6 values in sample)
$\sum x$ = 30 (adding the values)
$\sum x^2$ = 6²+5²+4²+4²+6²+5² = 154
• s = sqrt((6·154 - 30²)/(6·5)) = sqrt(24/30) = sqrt(0.8) = 0.9 min
Standard deviation of a
population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
• divide by N
• μ (population mean)
• σ (st. dev. of
population)
• Different notations in
calculators
– Excel: STDEVP instead of
STDEV
Estimating s and σ:
(highest value - lowest value)/4
Example: class grades
A statistics class of 20 students obtains the
following grades:
Student Name   Grade     Name      Grade
Peter          83        Albert    69
Kathy          98        John      71
Pat            57        John B.   64
Nina           73        Hughes    85
Nancy          78        Zak       89
Victor         86        Zoe       84
Vikki          82        Lena      83
Jen            95        Mary      92
Jay            92        Joe       74
Fred           66        Betty     78
Grades: 83 98 57 73 78 86 82 95 92 66 69 71 64 85 89 84 83 92 74 78
To rapidly approximate the mean, we take a
random sample of 5 students. At random,
we pick Nancy, Mary, John B., Lena, Betty.
x̄ = (78+92+64+83+78)/5 = 395/5 = 79
s = √(((78-79)² + (92-79)² + (64-79)² + (83-79)² + (78-79)²)/4)
= √(((-1)² + 13² + (-15)² + 4² + (-1)²)/4)
= √((1 + 169 + 225 + 16 + 1)/4)
= √(412/4) = √103 = 10.15
The population mean is obtained
by adding all grades
and dividing by 20, which is 79.95.
The population standard deviation is 10.71,
which we can obtain using Excel:
Name       Grade x   x - mu    squared
Peter      83        3.1       9.3
Kathy      98        18.1      325.8
Pat        57        -23.0     526.7
Nina       73        -7.0      48.3
Nancy      78        -2.0      3.8
Victor     86        6.1       36.6
Vikki      82        2.1       4.2
Jen        95        15.1      226.5
Jay        92        12.1      145.2
Fred       66        -14.0     194.6
Albert     69        -11.0     119.9
John       71        -9.0      80.1
John B.    64        -16.0     254.4
Hughes     85        5.1       25.5
Zak        89        9.1       81.9
Zoe        84        4.1       16.4
Lena       83        3.1       9.3
Mary       92        12.1      145.2
Joe        74        -6.0      35.4
Betty      78        -2.0      3.8
sum                            2293.0
/ N = 20                       114.6
root                           10.71
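The same numbers can be reproduced outside Excel. The sketch below uses Python's standard `statistics` module, where `pstdev` divides by N (population) and `stdev` divides by n-1 (sample); the variable names are illustrative only.

```python
from statistics import mean, pstdev, stdev

# All 20 grades: the whole class is the population
grades = [83, 98, 57, 73, 78, 86, 82, 95, 92, 66,
          69, 71, 64, 85, 89, 84, 83, 92, 74, 78]

# Random sample of 5 students: Nancy, Mary, John B., Lena, Betty
sample = [78, 92, 64, 83, 78]

mu = mean(grades)         # population mean
sigma = pstdev(grades)    # population st. dev. (divide by N)
x_bar = mean(sample)      # sample mean
s = stdev(sample)         # sample st. dev. (divide by n-1)

print(f"population: mu={mu:.2f}, sigma={sigma:.2f}")
print(f"sample:     x_bar={x_bar:.2f}, s={s:.2f}")
```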
Variance and coefficient of variation
Variance
Variance = square of
standard deviation
– sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$
– population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
General terms referring to
variation: dispersion,
spread, variation
Variance: specific
definition
Ex: finding a variance: 0.8 and
40 (the two bank data sets)
Examples:
In class grade case,
sample standard
deviation was 10.15.
Therefore, s² = 103.
The population standard
deviation was 10.71,
therefore,
σ² = 10.71² = 114.7.
Coefficient of variation CV
[p.155 ex. 49]
Describes the standard
deviation relative to the
mean:
– sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
– population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
• Coefficient of variation
allows comparing
dispersion of completely
different data sets
– ex:
In previous example,
CVsample = 10.15/79 = 12.8%
CVpopulation = 10.71/79.95 = 13.4%
• Consistent bank data set
6,5,4,4,6,5; x̄=5, s=0.9
CV = 0.9/5 = 0.18 = 18%
• Class sample: x̄=79, s=10.15
CV = 10.15/79 = 0.13 = 13%
– Variation of consistent
bank is larger than that of
the class in relative terms!
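The comparison between the bank and the class can be sketched in a few lines. The function name `coefficient_of_variation` is an assumption made for this illustration; it uses the sample standard deviation.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample CV: standard deviation relative to the mean (as a fraction)."""
    return stdev(values) / mean(values)

bank_consistent = [6, 5, 4, 4, 6, 5]     # waiting times (min)
class_sample = [78, 92, 64, 83, 78]      # sampled grades

cv_bank = coefficient_of_variation(bank_consistent)
cv_class = coefficient_of_variation(class_sample)

print(f"bank:  CV = {cv_bank:.0%}")
print(f"class: CV = {cv_class:.0%}")
```

Because CV has no units, minutes and grade points can be compared directly.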
More on variance and standard deviation
• Why use variance, standard
deviation is more intuitive?
– (Independent) variances
have additive properties
– Probabilistic properties
– Standard deviation is more
intuitive
• Why divide sample st. dev by n-1?
– Only n-1 free parameters
Empirical rule for data with normal
distribution
[Figure: bell-shaped curve over -3 to 3 standard deviations from the mean]
– 68% of data within 1 standard deviation of the mean
– 95% of data within 2 standard deviations
– 99.7% of data within 3 standard deviations
Example: Adult IQ scores have a bell-shaped
distribution with mean of 100 and a standard
deviation of 15. What percentage of adults
have IQ in 55:145 range?
s=15, 3s=45, x̄-3s=55, x̄+3s=145. Hence, 99.7%
of adults have IQs in that range.
Chebyshev’s theorem: At least 1-1/k² of the data (as a fraction, for k > 1) lie within k standard
deviations of the mean. Ex: At least 1-1/3² = 8/9 = 89% of the data lie within 3
st. dev. of the mean.
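To put the two rules side by side, the sketch below tabulates the empirical-rule percentages against the Chebyshev lower bound and recomputes the IQ interval. The names `EMPIRICAL` and `chebyshev_bound` are hypothetical, chosen for this illustration.

```python
# Empirical rule (bell-shaped data only): fraction within k st. dev.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def chebyshev_bound(k):
    """Minimum fraction of ANY data set within k st. dev. of the mean (k > 1)."""
    return 1 - 1 / k ** 2

# IQ example: mean 100, standard deviation 15, k = 3
mean_iq, sd_iq, k = 100, 15, 3
low, high = mean_iq - k * sd_iq, mean_iq + k * sd_iq

print(f"range {low}:{high}")
print(f"empirical rule: about {EMPIRICAL[k]:.1%}")
print(f"Chebyshev:      at least {chebyshev_bound(k):.0%}")
```

Chebyshev's bound is weaker because it makes no assumption about the shape of the distribution.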
• The mean and the median are often
different
• This difference gives us clues about the
shape of the distribution
– Is it symmetric?
– Is it skewed left?
– Is it skewed right?
– Are there any extreme values?
• Symmetric – the mean will usually be close to
the median
• Skewed left – the mean will usually be smaller
than the median
• Skewed right – the mean will usually be larger
than the median
• Skewness: Pearson’s index
I=3( mean-median )/s
•If I < -1 or I > 1: significantly skewed
• For a mostly symmetric distribution, the
mean and the median will be roughly
equal
• Many variables, such as birth weights
below, are approximately symmetric
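Pearson's index I = 3(mean - median)/s can be evaluated for the class grades used earlier. This is only a sketch: the function name `pearson_index` is made up here, and treating all 20 grades as a sample (so that s divides by n-1) is an assumption.

```python
from statistics import mean, median, stdev

def pearson_index(values):
    """Pearson's index of skewness: I = 3 * (mean - median) / s."""
    return 3 * (mean(values) - median(values)) / stdev(values)

grades = [83, 98, 57, 73, 78, 86, 82, 95, 92, 66,
          69, 71, 64, 85, 89, 84, 83, 92, 74, 78]

index = pearson_index(grades)
significantly_skewed = index < -1 or index > 1

print(f"I = {index:.2f}, significantly skewed: {significantly_skewed}")
```

The mean (79.95) is below the median (82.5), so the index is negative, but it stays inside the -1 to 1 band.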
Summary: Chapter 3 – Sections 1 and 2
• Mean
– The center of gravity
– Useful for roughly symmetric quantitative data
• Median
– Splits the data into halves
– Useful for highly skewed quantitative data
• Mode
– The most frequent value
– Useful for qualitative data
• Range
– The maximum minus the minimum
– Not a resistant measurement
• Variance and standard deviation
– Measures deviations from the mean
– Not a resistant measurement
• Empirical rule
– About 68% of the data is within 1 standard deviation
– About 95% of the data is within 2 standard deviations
Summary: Chapter 3 – Section 3
(Grouped Data)
• As an example, for the following frequency table,
Class       Midpoint   Frequency
0 – 1.9     1          3
2 – 3.9     3          7
4 – 5.9     5          6
6 – 7.9     7          1
we calculate the mean as if
– The value 1 occurred 3 times
– The value 3 occurred 7 times
– The value 5 occurred 6 times
– The value 7 occurred 1 time
• Evaluating this formula
$\frac{(1 \cdot 3) + (3 \cdot 7) + (5 \cdot 6) + (7 \cdot 1)}{3 + 7 + 6 + 1} = \frac{61}{17} \approx 3.6$
• The mean is about 3.6
• In mathematical notation
$\frac{\sum x_i f_i}{\sum f_i}$
• This would be μ for the population mean
and x̄ for the sample mean
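The grouped-mean formula translates almost literally into code. A minimal sketch, with the hypothetical helper name `grouped_mean` and the midpoints and frequencies taken from the table above:

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(x_i * f_i) / sum(f_i)."""
    weighted_sum = sum(x * f for x, f in zip(midpoints, frequencies))
    return weighted_sum / sum(frequencies)

midpoints = [1, 3, 5, 7]      # class midpoints
frequencies = [3, 7, 6, 1]    # class frequencies

numerator = sum(x * f for x, f in zip(midpoints, frequencies))
result = grouped_mean(midpoints, frequencies)

print(f"{numerator}/{sum(frequencies)} = {result:.1f}")
```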
Variance and Standard deviation (grouped data)
• Finding s from a frequency distribution:
$s = \sqrt{\frac{n \sum (f \cdot x^2) - (\sum f \cdot x)^2}{n(n-1)}}$
Example: cotinine levels of smokers
using Excel we obtain
Range      Frequency f   Midpoint x   f·x      f·(x^2)
0-99       11            49.5         544.5    26952.75
100-199    12            149.5        1794     268203
200-299    14            249.5        3493     871503.5
300-399    1             349.5        349.5    122150.25
400-499    2             449.5        899      404100.5
500-599    0             549.5        0        0
Totals:    40                         7080     1692910
with which we calculate:
s = sqrt((40·1692910 - 7080²)/(40·39)) = sqrt(17590000/1560) ≈ 106.2
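The column totals and the resulting s can be recomputed from the frequency table. The sketch below assumes the usual shortcut formula for grouped data, s = sqrt((n·Σ(f·x²) - (Σ f·x)²)/(n(n-1))); the variable names are illustrative.

```python
import math

# Cotinine levels of smokers: class midpoints and frequencies
midpoints = [49.5, 149.5, 249.5, 349.5, 449.5, 549.5]
frequencies = [11, 12, 14, 1, 2, 0]

n = sum(frequencies)                                              # 40
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))       # total of f.x
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints)) # total of f.(x^2)

# Shortcut formula for the sample standard deviation of grouped data
s_grouped = math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

print(n, sum_fx, sum_fx2, round(s_grouped, 1))
```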
Interpreting a known value of the standard
deviation s: If the standard deviation s is
known, use it to find rough estimates of
the minimum and maximum “usual”
sample values by using
max “usual” value ≈ mean + 2(st. dev)
min “usual” value ≈ mean - 2(st. dev)
Why n-1? Data: 3, 6, 9
μ = 6, σ² = 6
All samples of size n = 2 (with replacement):
Sample   x̄      ∑(x-x̄)²   s² (divide by n-1 = 1)   s² (divide by n = 2)
3,3      3      0          0                        0
3,6      4.5    4.5        4.5                      2.25
3,9      6      18         18                       9
6,3      4.5    4.5        4.5                      2.25
6,6      6      0          0                        0
6,9      7.5    4.5        4.5                      2.25
9,3      6      18         18                       9
9,6      7.5    4.5        4.5                      2.25
9,9      9      0          0                        0
Mean value of s² (divide by n-1): 54/9 = 6, equal to σ²
Mean value of s² (divide by n): 27/9 = 3, too small
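The table can be regenerated by enumerating every sample of size 2 drawn with replacement. This is a small sketch; the helper `variance` with its `ddof` parameter (how much to subtract from n in the denominator) is a name chosen for this illustration.

```python
from itertools import product

def variance(values, ddof):
    """Sum of squared deviations divided by (n - ddof)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)

population = [3, 6, 9]
sigma2 = variance(population, ddof=0)            # population variance

# All 9 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))

avg_s2_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)
avg_s2_n = sum(variance(s, ddof=0) for s in samples) / len(samples)

print(f"sigma^2 = {sigma2}")
print(f"average s^2 dividing by n-1: {avg_s2_n_minus_1}")
print(f"average s^2 dividing by n:   {avg_s2_n}")
```

Dividing by n-1 recovers the population variance on average, while dividing by n underestimates it.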
Measures of relative standing
Useful for comparing different
data sets
• z scores
– Number of standard
deviations that a value x
is above or below the
mean
– sample: $z = \frac{x - \bar{x}}{s}$
– population: $z = \frac{x - \mu}{\sigma}$
[Figure: z axis from -3 to 3 with regions labeled unusual values, ordinary values, unusual values]
Example:
• NBA Jordan 78, μ=69, σ=2.8
• WNBA Lobo 76, μ=63.6, σ=2.5
– J: z=(x-μ)/σ=(78-69)/2.8=3.21
– L: z=(x-μ)/σ=(76-63.6)/2.5=4.96
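The two z scores in the example can be checked with a one-line function. The name `z_score` is hypothetical; the means and standard deviations are those given for each league.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_jordan = z_score(78, 69, 2.8)     # NBA: mu = 69, sigma = 2.8
z_lobo = z_score(76, 63.6, 2.5)     # WNBA: mu = 63.6, sigma = 2.5

print(f"Jordan: z = {z_jordan:.2f}")
print(f"Lobo:   z = {z_lobo:.2f}")
```

Although Jordan's raw value is larger, Lobo's value is further from her group's mean in relative terms.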
• Percentiles:
– Percentile of value x, Px:
Px = (number of values less than x)/(total number of values) · 100
Example
data point 48 in Smoker data
8/40*100 = 20th percentile = P20
Exercise:
Locate the percentiles of data
points 1, 130 and 250.
Quartiles and percentiles
SMOKERS values (n = 40):
1 0 131 173 265 210 44 277 32 3
35 112 477 289 227 103 222 149 313 491
130 234 164 198 17 253 87 121 266 290
123 167 250 245 48 86 284 1 208 173
SMOKERS sorted (positions 1-10, 11-20, 21-30, 31-40):
0 1 1 3 17 32 35 44 48 86
87 103 112 121 123 130 131 149 164 167
173 173 198 208 210 222 227 234 245 250
253 265 266 277 284 289 290 313 477 491
CLASS grades (n = 20):
83 98 57 73 78 86 82 95 92 66
69 71 64 85 89 84 83 92 74 78
CLASS sorted (positions 1-10, 11-20):
57 64 66 69 71 73 74 78 78 82
83 83 84 85 86 89 92 92 95 98
Percentiles and Quartiles
Percentile of value x: k = (number of values less than x)/(total number of values) · 100
• Quartiles:
– Q1 = P25, Q2 = P50 = median, Q3 = P75
If x is the Lth value in the sorted list of n values: k = (L – 1)/n · 100
Example: data point 48 in Smoker data is 9th
on the sorted table, n = 40.
(9 – 1)/40 · 100 = 20 → 48 is at P20, the 20th
percentile, which lies in the first quarter of the data (below Q1).
Data point 234 is 28th. k = (28 – 1)/40 · 100 = 67.5, about the
68th percentile, which lies in the third quarter of the data (between Q2 and Q3).
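The percentile of a given value can be computed by counting the values below it. This is a sketch with a hypothetical function name `percentile_of`; the list is the sorted Smokers data.

```python
def percentile_of(value, data):
    """Percentile of a value: 100 * (number of values less than it) / n."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

smokers_sorted = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
    87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
    173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
    253, 265, 266, 277, 284, 289, 290, 313, 477, 491,
]

p_48 = percentile_of(48, smokers_sorted)
p_234 = percentile_of(234, smokers_sorted)

print(f"48  -> {p_48}th percentile")
print(f"234 -> {p_234} (about the 68th percentile)")
```

Counting values strictly below x gives the same result as (L – 1)/n · 100 whenever x is not tied with a smaller position in the list.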
Conversely, if you are looking for the data value at the
kth percentile:
START
1. SORT DATA
2. Compute L = (k/100)*n
3. L whole number?
– Yes: take average of Lth and (L+1)st value as Pk
– No: ROUND UP; Pk is the Lth value
where
n: total number of values
k: percentile being used
L: locator that gives position of a value
(the 12th value in the sorted list: L=12)
Pk: kth percentile (ex: P25 is 25th percentile)
Example: In class table (n = 20), sorted (positions 1 to 20):
57 64 66 69 71 73 74 78 78 82 83 83 84 85 86 89 92 92 95 98
• find value of 21st percentile
– L = 21/100 * 20 = 4.2
– round up to 5th data point
– --> P21 = 71
• find the 80th percentile:
– L = 80/100 * 20 = 16,
– WHOLE NUMBER: average the 16th and 17th values
– P80 = (89+92)/2 = 90.5
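The flowchart above can be sketched as a short function. The name `percentile_value` is hypothetical, and the sketch assumes 0 < k < 100; other textbooks and software use slightly different percentile conventions, so results may differ from library routines.

```python
import math

def percentile_value(k, data):
    """Value at the kth percentile using the locator L = (k/100)*n (0 < k < 100)."""
    ordered = sorted(data)                 # SORT DATA
    n = len(ordered)
    locator = k * n / 100                  # L = (k/100)*n
    if locator.is_integer():               # L whole number?
        pos = int(locator)
        # Yes: average the Lth and (L+1)st values (1-based positions)
        return (ordered[pos - 1] + ordered[pos]) / 2
    # No: round up, Pk is the Lth value
    return ordered[math.ceil(locator) - 1]

grades = [83, 98, 57, 73, 78, 86, 82, 95, 92, 66,
          69, 71, 64, 85, 89, 84, 83, 92, 74, 78]

p21 = percentile_value(21, grades)
p80 = percentile_value(80, grades)
median_grade = percentile_value(50, grades)   # Q2 = P50

print(p21, p80, median_grade)
```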
Exploratory Data Analysis
Exploratory data analysis is
the process of using
statistical tools (graphs,
measures of center and
variation) to investigate data
sets in order to understand
their characteristics.
Outlier: Extreme value.
(often they are typos when
collecting data, but not
always).
• can have a dramatic effect
on mean
• can have dramatic effect on
standard deviation
• … on histogram
[Figure: box plot on a 0 to 500 axis, marking Min, Q1, Median, Q3, Max]
• Box plots have less
information than
histograms and stem-and-leaf plots
• Not that often used with
only one set of data
• Good when comparing
many different sets of
data