Summarizing and Describing Numerical Data


Business Statistics
Spring 2005
Summarizing and Describing
Numerical Data
Topics
•Measures of Central Tendency
Mean, Median, Mode, Midrange, Midhinge
•Quartile
•Measures of Variation
The Range, Interquartile Range, Variance and
Standard Deviation, Coefficient of variation
•Shape
Symmetric, Skewed, using Box-and-Whisker
Plots
Numerical Data Properties
Central Tendency
(Location)
Variation
(Dispersion)
Shape
Measures of Central Tendency
for
Ungrouped Data
Raw Data
Summary Measures
•Central Tendency
Mean, Median, Mode, Midrange, Midhinge
•Quartile
•Variation
Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
•Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
•Median
•Mode
•Midrange
•Midhinge
Population Mean
For ungrouped data, the population mean is the
sum of all the population values divided by the
total number of population values:
$\mu = \frac{\sum X}{N}$
where µ stands for the population mean.
N is the total number of observations in the
population.
X stands for a particular value.
Σ indicates the operation of adding.
Population Mean Example
Parameter: a measurable characteristic of a
population.
The Kane family owns four cars. The
following is the mileage attained by each
car: 56,000, 23,000, 42,000, and 73,000.
Find the average miles covered by each
car.
The mean is (56,000 + 23,000 + 42,000 +
73,000)/4 = 48,500
Sample Mean
For ungrouped data, the sample mean is the
sum of all the sample values divided by
the number of sample values:
$\bar{X} = \frac{\sum X}{n}$
where $\bar{X}$ stands for the sample mean
n is the total number of values in the sample
Return on Stock
Year     Stock X   Stock Y
1998     10%       17%
1997      8%       -2%
1996     12%       16%
1995      2%        1%
1994      8%        8%
Total    40%       40%
Average Return on Stock = 40 / 5 = 8%
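The average-return calculation above can be reproduced with a short Python sketch. The names `stock_x`, `stock_y`, and `mean` are illustrative and do not come from the slides; the returns are entered as percentages in the order 1998 down to 1994.

```python
# Annual returns (%) for 1998 down to 1994, copied from the table.
stock_x = [10, 8, 12, 2, 8]
stock_y = [17, -2, 16, 1, 8]


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


print(mean(stock_x))  # 8.0
print(mean(stock_y))  # 8.0
```

Both stocks have the same mean return, which is why the later slides turn to measures of variation to tell them apart.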
The Mean (Arithmetic Average)
•It is the Arithmetic Average of data values:
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
•The Most Common Measure of Central Tendency
•Affected by Extreme Values (Outliers)
[Figure: two number lines. Axis from 0 to 10: Mean = 5. Axis from 0 to 14: Mean = 6]
Properties of the Arithmetic Mean
Every set of interval-level and ratio-level data has
a mean.
All the values are included in computing the
mean.
A set of data has a unique mean.
The mean is affected by unusually large or small
data values.
The mean is relatively reliable.
The arithmetic mean is the only measure of central
tendency where the sum of the deviations of
each value from the mean is zero.
EXAMPLE
Consider the set of values: 3, 8, and 4.
The mean is 5.
Illustrating the last property,
(3-5) + (8-5) + (4-5) = -2 + 3 - 1 = 0. In
other words,
$\sum (X - \bar{X}) = 0$
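The zero-sum property can be checked numerically. This is a minimal sketch using the three values from the example; the variable names are illustrative.

```python
values = [3, 8, 4]

mean = sum(values) / len(values)          # 15 / 3 = 5.0
deviations = [x - mean for x in values]   # [-2.0, 3.0, -1.0]

print(sum(deviations))  # 0.0
```

With data whose mean is not exactly representable in floating point, the sum may come out as a very small number rather than exactly zero.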
The Median
Median: The midpoint of the values after
they have been ordered from the smallest
to the largest, or the largest to the
smallest. There are as many values above
the median as below it in the data array.
Note: For an even number of values, the median
is the arithmetic average of the two
middle values.
Median
Position of Median in Sequence:
Positioning Point = $\frac{n + 1}{2}$
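The positioning-point rule can be sketched in Python as follows. The function name `median` is illustrative; the first call reuses the Kane family mileages from the population mean example, and the second reuses the values 3, 8, and 4.

```python
def median(values):
    """Median via the positioning point (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered array
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ordered[int(position) - 1]
    # Even n: the position falls halfway between the two middle values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median([56000, 23000, 42000, 73000]))  # 49000.0
print(median([3, 8, 4]))                     # 4
```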
The Median
•Important Measure of Central Tendency
•In an ordered array, the median is the
“middle” number.
•If n is odd, the median is the middle number.
•If n is even, the median is the average of the 2
middle numbers.
•Not Affected by Extreme Values
[Figure: two number lines. Axis from 0 to 10: Median = 5. Axis from 0 to 14: Median = 5]
Properties of the Median
• There is a unique median for each data set.
• It is not affected by extremely large or small values
and is therefore a valuable measure of central
tendency when such values occur.
• It can be computed for ratio-level,
interval-level, and ordinal-level data.
• It can be computed for an open-ended frequency
distribution if the median does not lie in an open-ended class.
• No arithmetic properties
The Mode
•A Measure of Central Tendency
•Value that Occurs Most Often
•Not Affected by Extreme Values
•There May Not be a Mode
•There May be Several Modes
•Used for Either Numerical or Categorical Data
[Figure: two number lines. Axis from 0 to 14: Mode = 9. Axis from 0 to 6: No Mode]
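The three cases listed above (one mode, no mode, several modes) can be illustrated with a small sketch. The function name `modes` and all sample data here are hypothetical, chosen only to trigger each case; treating a data set in which every value occurs once as having no mode follows the slide's convention.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs exactly once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([8, 9, 9, 10, 14]))        # [9]
print(modes([1, 2, 3, 4, 5]))          # []
print(modes([1, 1, 2, 2, 3]))          # [1, 2]
print(modes(["red", "blue", "red"]))   # ['red']
```

The last call shows that the mode also works for categorical data, since it only counts occurrences.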
Midrange
•A Measure of Central Tendency
•Average of Smallest and Largest
Observation:
Midrange = $\frac{x_{largest} + x_{smallest}}{2}$
•Affected by Extreme Values
[Figure: two number lines, each with an axis from 0 to 10 and Midrange = 5]
Quartiles
•Not a Measure of Central Tendency
•Split Ordered Data into 4 Quarters
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
•Position of i-th Quartile:
position of point $Q_i = \frac{i(n+1)}{4}$
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of $Q_1 = \frac{1 \cdot (9 + 1)}{4} = 2.50$
$Q_1 = 12.5$
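The position formula can be turned into a small Python sketch. The function name `quartile` is illustrative, and linear interpolation between neighboring values is an assumption made here: it reproduces the slide's result for a position of 2.50 (halfway between the 2nd and 3rd values, 12 and 13), but the textbook's own rounding rules for other fractional positions may differ. The sketch also assumes the position falls between 1 and n.

```python
def quartile(values, i):
    """Value at 1-based position i * (n + 1) / 4 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / 4
    whole = int(position)          # whole part of the position
    fraction = position - whole    # how far toward the next value
    if fraction == 0:
        return ordered[whole - 1]
    below = ordered[whole - 1]
    above = ordered[whole]
    # Assumption: interpolate linearly between the two neighbors.
    return below + fraction * (above - below)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartile(data, 1))  # 12.5
print(quartile(data, 2))  # 16
print(quartile(data, 3))  # 19.5
```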
Quartiles
•See text page 107 for “rounding rules” for
position of the i-th quartile
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
• Position (not value) of i-th Quartile:
$Q_i = \frac{i(n+1)}{4}$
Midhinge
• A Measure of Central Tendency
• The Midpoint of the 1st and 3rd Quartiles
Midhinge = $\frac{Q_1 + Q_3}{2}$
• Used to Overcome Extreme Values
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Midhinge = $\frac{Q_1 + Q_3}{2} = \frac{12.5 + 19.5}{2} = 16$
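As a rough illustration, the midrange and midhinge of the nine-value data set can be computed side by side. This sketch assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places the i-th quartile at position i(n+1)/4 and interpolates linearly; the variable names are illustrative.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]

# Midrange: average of the largest and smallest observations.
midrange = (max(data) + min(data)) / 2

# Midhinge: average of the first and third quartiles.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
midhinge = (q1 + q3) / 2

print(midrange)  # 16.5
print(midhinge)  # 16.0
```

The midrange depends only on the two most extreme values, while the midhinge ignores everything outside the middle 50% of the data.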
Summary Measures
•Central Tendency
Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median, Mode, Midrange, Midhinge
•Quartile
•Variation
Range
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
Coefficient of Variation
Measures of Variation
•Range
•Interquartile Range
•Variance
Population Variance, Sample Variance
•Standard Deviation
Population Standard Deviation, Sample Standard Deviation
•Coefficient of Variation
$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$
The Range
• Measure of Variation
• Difference Between Largest & Smallest
Observations:
Range = $x_{Largest} - x_{Smallest}$
• Ignores How Data Are Distributed:
[Figure: two data sets plotted on an axis from 7 to 12; each has Range = 12 - 7 = 5]
Return on Stock
Year     Stock X   Stock Y
1998     10%       17%
1997      8%       -2%
1996     12%       16%
1995      2%        1%
1994      8%        8%
Range on Stock X = 12 - 2 = 10%
Range on Stock Y = 17 - (-2) = 19%
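The two ranges can be verified with a short sketch; the names `stock_x`, `stock_y`, and `value_range` are illustrative.

```python
# Annual returns (%) for 1998 down to 1994, copied from the table.
stock_x = [10, 8, 12, 2, 8]
stock_y = [17, -2, 16, 1, 8]


def value_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)


print(value_range(stock_x))  # 10
print(value_range(stock_y))  # 19
```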
Interquartile Range
• Measure of Variation
• Also Known as Midspread:
Spread in the Middle 50%
• Difference Between Third & First
Quartiles: Interquartile Range = $Q_3 - Q_1$
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
• $Q_3 - Q_1$ = 17.5 - 12.5 = 5
Interquartile Range
• IQR = 75th percentile - 25th percentile
•The IQR is useful for checking for outliers
•Not Affected by Extreme Values
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
$Q_3 - Q_1$ = 17.5 - 12.5 = 5
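The sketch below recomputes the interquartile range for this data set and shows one common way the IQR is used to screen for outliers. The 1.5 × IQR fences are a widely used convention that is assumed here, not something stated on the slides; the sketch also assumes Python 3.8 or later for `statistics.quantiles`.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 17, 18, 21]

# method="exclusive" places quartile i at position i * (n + 1) / 4.
q1, _, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Assumed convention: flag values beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)               # 12.5 17.5 5.0
print(lower_fence, upper_fence)  # 5.0 25.0
print(outliers)                  # []
```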
Variance &
Standard Deviation
Measures of Dispersion
Most Common Measures
Consider How Data Are Distributed
Show Variation About Mean ($\bar{X}$ or $\mu$)
[Figure: data plotted on an axis from 4 to 12, with $\bar{X}$ = 8.3 marked]
Variance
•Important Measure of Variation
•Shows Variation About the Mean:
•For the Population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
•For the Sample: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$
For the Population: use N in the denominator.
For the Sample: use n - 1 in the denominator.
Population Variance
The population variance for ungrouped data
is the arithmetic mean of the squared
deviations from the population mean.
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Population Variance EXAMPLE
The ages of the Dunn family are 2, 18, 34, and 42
years. What is the population variance?
X       µ      (X - µ)    (X - µ)²
2       24     -22        484
18      24     -6         36
34      24     10         100
42      24     18         324
Total                     944
$\mu = \sum X / N = 96 / 4 = 24$
$\sigma^2 = \sum (X - \mu)^2 / N = 944 / 4 = 236$
Population
Standard Deviation
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Population Standard Deviation
EXAMPLE
The ages of the Dunn family are 2, 18, 34, and 42 years.
What is the population standard deviation?
X       µ      (X - µ)    (X - µ)²
2       24     -22        484
18      24     -6         36
34      24     10         100
42      24     18         324
Total                     944
$\mu = \sum X / N = 96 / 4 = 24$
$\sigma = \sqrt{\sum (X - \mu)^2 / N} = \sqrt{944 / 4} = \sqrt{236} \approx 15.36$
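The Dunn family example can be followed step by step in Python. This is a minimal sketch that mirrors the columns of the worked table; the variable names are illustrative.

```python
from math import sqrt

ages = [2, 18, 34, 42]
N = len(ages)

mu = sum(ages) / N                                   # 96 / 4
squared_deviations = [(x - mu) ** 2 for x in ages]   # (X - mu)^2 column
variance = sum(squared_deviations) / N               # divide by N (population)
std_dev = sqrt(variance)

print(mu)                  # 24.0
print(squared_deviations)  # [484.0, 36.0, 100.0, 324.0]
print(variance)            # 236.0
print(round(std_dev, 2))   # 15.36
```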
Standard Deviation
•Most Important Measure of Variation
•Shows Variation About the Mean:
•For the Population: $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
•For the Sample: $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$
For the Population: use N in the denominator.
For the Sample: use n - 1 in the denominator.
Sample Variance and
Standard Deviation
The sample variance estimates the population
variance.
$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
NOTE: important computational formula:
$S^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$
The sample standard deviation = $s = \sqrt{s^2}$
Example of Standard Deviation
Amount (X)   Mean (X̄)   Deviation from Mean (X - X̄)   (X - X̄)²
600          435        600 - 435 = 165               27,225
350          435        350 - 435 = -85               7,225
275          435        275 - 435 = -160              25,600
430          435        430 - 435 = -5                25
520          435        520 - 435 = 85                7,225
Total                               0                 67,300
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{67,300}{4}} = \sqrt{16,825} = 129.71$
Example of Standard Deviation
(Computational Version)
Amount (X)   Mean (X̄)   (X - X̄)   (X - X̄)²   X²
600          435        165       27,225     360,000
350          435        -85       7,225      122,500
275          435        -160      25,600     75,625
430          435        -5        25         184,900
520          435        85        7,225      270,400
Total 2,175                       67,300     1,013,425
$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{1,013,425 - \frac{(2,175)^2}{5}}{5 - 1}} = 129.71$
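The two routes to the sample standard deviation (deviations from the mean versus the computational formula) can be compared in a short sketch. The variable names are illustrative; the data are the five amounts from the example.

```python
from math import sqrt

amounts = [600, 350, 275, 430, 520]
n = len(amounts)
mean = sum(amounts) / n  # 2175 / 5 = 435.0

# Definitional version: sum of squared deviations from the mean.
ss_definitional = sum((x - mean) ** 2 for x in amounts)

# Computational version: sum of X^2 minus (sum of X)^2 / n.
ss_computational = sum(x * x for x in amounts) - sum(amounts) ** 2 / n

s = sqrt(ss_definitional / (n - 1))  # sample: divide by n - 1

print(ss_definitional)   # 67300.0
print(ss_computational)  # 67300.0
print(round(s, 2))       # 129.71
```

Both versions give the same sum of squares; the computational form avoids computing each deviation, which mattered for hand calculation.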
Sample Standard Deviation
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$
NOTE: For the Sample: use n - 1 in the denominator.
Data: $X_i$: 10 12 14 15 17 18 18 24
n = 8, Mean = 16
$s = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + \cdots + (18 - 16)^2 + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$
Interpretation and Uses of the
Standard Deviation
Chebyshev’s theorem: For any set of
observations, the proportion of
the values that lie within k standard
deviations of the mean is at least $1 - 1/k^2$,
where k is any constant greater than 1.
Multiply by 100% to get the minimum percentage of values within k
standard deviations of the mean.
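Chebyshev's bound is easy to evaluate and to compare against actual data. In this sketch the function names `chebyshev_bound` and `proportion_within` are illustrative; the eight data values are the ones used in the sample standard deviation slide, treated here as a complete population so that `pstdev` applies.

```python
from statistics import mean, pstdev


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations."""
    center = mean(values)
    spread = pstdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)


data = [10, 12, 14, 15, 17, 18, 18, 24]
print(chebyshev_bound(2))          # 0.75
print(proportion_within(data, 2))  # 1.0
```

The observed proportion can be well above the bound; Chebyshev's theorem only guarantees the minimum, whatever the shape of the distribution.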
Interpretation and Uses of the
Standard Deviation
Empirical Rule: For any symmetrical, bell-shaped
distribution, approximately 68%
of the observations will lie within $\pm 1\sigma$
of the mean ($\mu \pm 1\sigma$); approximately 95% of
the observations will lie within $\pm 2\sigma$ of
the mean ($\mu \pm 2\sigma$); approximately 99.7% will
lie within $\pm 3\sigma$
of the mean ($\mu \pm 3\sigma$).
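The percentages in the Empirical Rule match the normal distribution, which is the usual model for a symmetrical, bell-shaped distribution. The sketch below assumes exactly that model (an assumption, since the slide does not name a specific distribution) and uses `statistics.NormalDist`, available in Python 3.8 and later; `share_within` is an illustrative name.

```python
from statistics import NormalDist

# Assumption: model the bell-shaped distribution as a standard normal.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```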
Comparing Standard Deviations
Data: $X_i$: 10 12 14 15 17 18 18 24
N = 8, Mean = 16
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = 4.3095$
$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 4.0311$
Value for the Standard Deviation is larger for data considered as a Sample.
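The effect of the denominator can be checked with the standard library, where `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by N. The variable names are illustrative. For these eight values the sum of squared deviations about the mean of 16 is 130.

```python
from statistics import pstdev, stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]

sample_sd = stdev(data)        # sqrt(130 / 7), denominator n - 1
population_sd = pstdev(data)   # sqrt(130 / 8), denominator N

print(round(sample_sd, 4))      # 4.3095
print(round(population_sd, 4))  # 4.0311
```

Dividing by the smaller number n - 1 always yields the larger result, so the sample value exceeds the population value for the same data.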
Comparing Standard Deviations
Data A: [Figure: data plotted on an axis from 11 to 21] Mean = 15.5, s = 3.338
Data B: [Figure: data plotted on an axis from 11 to 21] Mean = 15.5, s = 0.9258
Data C: [Figure: data plotted on an axis from 11 to 21] Mean = 15.5, s = 4.57
Coefficient of Variation
•Measure of Relative Variation
•Always a %
•Shows Variation Relative to Mean
•Used to Compare 2 or More Groups
•Formula (for Sample):
$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$
Comparing Coefficient of Variation
Stock A: Average Price last year = $50
Standard Deviation = $5
Stock B: Average Price last year = $100
Standard Deviation = $5
$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$
Coefficient of Variation:
Stock A: CV = (5 / 50) · 100% = 10%
Stock B: CV = (5 / 100) · 100% = 5%
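The comparison of the two stocks can be reproduced with a one-line function. The name `coefficient_of_variation` is illustrative; the inputs are the standard deviations and average prices given for Stock A and Stock B.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variation: standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


print(coefficient_of_variation(5, 50))   # 10.0  (Stock A)
print(coefficient_of_variation(5, 100))  # 5.0   (Stock B)
```

Both stocks have the same standard deviation in dollars, but Stock A varies twice as much relative to its price.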
Shape
•Describes How Data Are Distributed
•Measures of Shape:
Symmetric or skewed
Left-Skewed: Mean < Median < Mode
Symmetric: Mean = Median = Mode
Right-Skewed: Mode < Median < Mean
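The ordering of mean and median gives a quick, rough indication of shape. The sketch below is only a rule-of-thumb check built on that ordering: the function name `describe_shape` and the three small data sets are hypothetical, and equal mean and median suggest symmetry without proving it.

```python
from statistics import mean, median


def describe_shape(values):
    """Rough shape label from the ordering of mean and median."""
    m, md = mean(values), median(values)
    if m < md:
        return "left-skewed"    # mean pulled toward low extreme values
    if m > md:
        return "right-skewed"   # mean pulled toward high extreme values
    return "symmetric"


print(describe_shape([1, 17, 18, 19, 20]))  # left-skewed
print(describe_shape([1, 2, 3, 4, 5]))      # symmetric
print(describe_shape([1, 2, 3, 4, 20]))     # right-skewed
```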
Box-and-Whisker Plot
Graphical Display of Data Using
5-Number Summary
[Figure: box-and-whisker plot marking Xsmallest, Q1, Median, Q3, and Xlargest on an axis from 4 to 12]
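The five numbers behind a box-and-whisker plot can be computed directly. This sketch reuses the nine-value ordered array from the quartile slides and assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places quartile i at position i(n+1)/4; `five_number_summary` is an illustrative name.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    q1, med, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, med, q3, max(values)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(five_number_summary(data))  # (11, 12.5, 16.0, 19.5, 22)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend to the smallest and largest values.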
Distribution Shape &
Box-and-Whisker Plots
[Figure: box-and-whisker plots for Left-Skewed, Symmetric, and Right-Skewed distributions, each marking Q1, Median, and Q3]
Summary
• Discussed Measures of Central Tendency
Mean, Median, Mode, Midrange, Midhinge
• Quartiles
• Addressed Measures of Variation
The Range, Interquartile Range, Variance,
Standard Deviation, Coefficient of Variation
• Determined Shape of Distributions
Symmetric, Skewed, Box-and-Whisker Plot