Chapter 2
Describing Data
Summarizing and Describing
Data
• Tables and Graphs
• Numerical Measures
Classification of Variables
• Discrete numerical variable
• Continuous numerical variable
• Categorical variable
Classification of Variables
Discrete Numerical Variable
A variable that produces a response
that comes from a counting process.
Classification of Variables
Continuous Numerical Variable
A variable that produces a response
that is the outcome of a measurement
process.
Classification of Variables
Categorical Variables
Variables that produce responses that
belong to groups (sometimes called
“classes”) or categories.
Measurement Levels
Nominal and Ordinal Levels of Measurement
refer to data obtained from categorical
questions.
• A nominal scale indicates assignments to
groups or classes.
• Ordinal data indicate rank ordering of
items.
Frequency Distributions
A frequency distribution is a table used to
organize data. The left column (called classes or
groups) includes numerical intervals on a variable
being studied. The right column is a list of the
frequencies, or number of observations, for each
class. Intervals are normally of equal size, must
cover the range of the sample observations, and be
non-overlapping.
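As a minimal sketch of the definition above, the following Python tallies observations into equal-width, non-overlapping classes. The function name, the data values, and the choice of half-open intervals (lower limit included, upper limit excluded) are assumptions made for illustration.

```python
def frequency_distribution(data, lower_start, width, k):
    """Count observations in k equal-width classes [lower, upper)."""
    counts = [0] * k
    for x in data:
        index = int((x - lower_start) // width)
        if not 0 <= index < k:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    # Pair each class (lower limit, upper limit) with its frequency.
    return [
        ((lower_start + i * width, lower_start + (i + 1) * width), counts[i])
        for i in range(k)
    ]

# Hypothetical weights in mL, not taken from the slides.
weights = [221, 226, 229, 231, 234, 234, 238, 241, 249]
table = frequency_distribution(weights, lower_start=220, width=5, k=6)
for (lower, upper), count in table:
    print(f"{lower} less than {upper}: {count}")
```

Raising an error for out-of-range values enforces the requirement that the classes cover the whole range of the sample.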
Construction of a Frequency
Distribution
• Rule 1: Intervals (classes) must be inclusive and nonoverlapping;
• Rule 2: Determine k, the number of classes;
• Rule 3: Intervals should be the same width, w; the
width is determined by the following:
$$ w = \text{Interval Width} = \frac{\text{Largest Number} - \text{Smallest Number}}{\text{Number of Intervals}} $$
Both k and w should be rounded upward, possibly to the next largest integer.
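The width rule can be sketched in Python as follows. The smallest and largest values (221 and 249) and the starting limit of 220 are hypothetical, chosen only to show the rounding step.

```python
import math

def interval_width(largest, smallest, k):
    """Rule 3: (largest - smallest) / k, rounded up to the next integer."""
    return math.ceil((largest - smallest) / k)

def class_limits(lower_start, width, k):
    """Return k adjacent, non-overlapping (lower, upper) limits."""
    return [
        (lower_start + i * width, lower_start + (i + 1) * width)
        for i in range(k)
    ]

# Hypothetical extremes: smallest = 221, largest = 249, k = 6 classes.
w = interval_width(largest=249, smallest=221, k=6)  # 28 / 6 = 4.67, rounded up
limits = class_limits(lower_start=220, width=w, k=6)
print(w, limits)
```

Rounding the width up rather than to the nearest integer guarantees that k classes are enough to cover the full range.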
Construction of a Frequency
Distribution
Quick Guide to Number of Classes for a Frequency Distribution
Sample Size        Number of Classes
Fewer than 50      5 – 6 classes
50 to 100          6 – 8 classes
Over 100           8 – 10 classes
Example of a Frequency Distribution
Example 2.1
Table 2.2 A Frequency Distribution for the Suntan Lotion Example
Weights (in mL)        Number of Bottles
220 less than 225       1
225 less than 230       4
230 less than 235      29
235 less than 240      34
240 less than 245      26
245 less than 250       6
Cumulative Frequency
Distributions
A cumulative frequency distribution contains
the number of observations whose values are
less than the upper limit of each interval. It is
constructed by adding the frequencies of all
frequency distribution intervals up to and
including the present interval.
Relative Cumulative Frequency
Distributions
A relative cumulative frequency
distribution converts all cumulative
frequencies to cumulative percentages.
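A short Python sketch of both constructions, using the frequencies from Table 2.2. The variable names are illustrative only.

```python
from itertools import accumulate

# Upper class limits and frequencies from Table 2.2.
upper_limits = [225, 230, 235, 240, 245, 250]
frequencies = [1, 4, 29, 34, 26, 6]

# Running total of frequencies up to and including each interval.
cumulative = list(accumulate(frequencies))

# Convert each cumulative frequency to a cumulative percentage.
total = cumulative[-1]
relative_cumulative = [100 * c / total for c in cumulative]

for limit, c, pct in zip(upper_limits, cumulative, relative_cumulative):
    print(f"less than {limit}: {c} bottles ({pct:.0f}%)")
```

Because the sample here has exactly 100 bottles, the cumulative counts and cumulative percentages happen to coincide numerically.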
Example of a Frequency Distribution
Example 2.1
Table 2.3 A Cumulative Frequency Distribution for the Suntan
Lotion Example
Weights (in mL)        Number of Bottles
less than 225            1
less than 230            5
less than 235           34
less than 240           68
less than 245           94
less than 250          100
Histograms and Ogives
A histogram is a bar graph that consists of
vertical bars constructed on a horizontal line
that is marked off with intervals for the variable
being displayed. The intervals correspond to
those in a frequency distribution table. The
height of each bar is proportional to the number
of observations in that interval.
Histograms and Ogives
An ogive, sometimes called a cumulative line
graph, is a line that connects points that are the
cumulative percentage of observations below
the upper limit of each class in a cumulative
frequency distribution.
Histogram and Ogive for Example 2.1
[Figure: Histogram of Weights for Example 2.1, with ogive. Horizontal axis: Interval Weights (mL), marked at 224.5, 229.5, 234.5, 239.5, 244.5 and 249.5. Vertical axis: Frequency, 0 to 40; a second vertical scale runs from 0 to 100.]
Stem-and-Leaf Display
A stem-and-leaf display is an exploratory data
analysis graph that is an alternative to the histogram.
Data are grouped according to their leading digits
(called the stem) while listing the final digits (called
leaves) separately for each member of a class. The
leaves are displayed individually in ascending order
after each of the stems.
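The grouping described above can be sketched in Python for whole-number data. The function name and the data values are hypothetical; the stem unit of 10 means the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group integers by leading digits (stem) and list final digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in ascending order
        stem, leaf = divmod(v, stem_unit)
        groups[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in leaves)
        for stem, leaves in sorted(groups.items())
    }

# Hypothetical observations, not the data from the slides.
data = [12, 27, 11, 31, 25, 18, 40, 22]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Unlike a histogram, the display keeps every individual observation recoverable from its stem and leaf.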
Stem-and-Leaf Display
Stem-and-Leaf Display for Gilotti’s Deli Example
Stem unit: 10
  9     1     124678899
 (9)    2     122246899
  7     3     01234
  2     4     02
Tables - Bar and Pie Charts
Frequency and Relative Frequency Distribution for
Top Company Employers Example
Industry           Number of Employees    Percent
Tourism            85,287                 0.35
Retail             49,424                 0.2
Health Care        39,588                 0.16
Restaurants        16,050                 0.06
Communications     11,750                 0.05
Technology         11,144                 0.05
Space              11,418                 0.05
Other              21,336                 0.08
Tables - Bar and Pie Charts
Figure 2.9 Bar Chart for Top Company Employers Example
[Figure: 1999 Top Company Employers in Central Florida. Horizontal axis: Industry Category (Tourism, Retail, Health Care, Restaurants, Communications, Technology, Space, Other). Bar labels, in that order: 0.35, 0.2, 0.16, 0.06, 0.05, 0.05, 0.05, 0.08.]
Tables - Bar and Pie Charts
Figure 2.10 Pie Chart for Top Company Employers Example
[Figure: 1999 Top Company Employers in Central Florida. Slices: Tourism 35%, Retail 20%, Health Care 16%, Others 29%.]
Pareto Diagrams
A Pareto diagram is a bar chart that displays the
frequency of defect causes. The bar at the left
indicates the most frequent cause and bars to the
right indicate causes in decreasing frequency. A
Pareto diagram is used to separate the “vital few”
from the “trivial many.”
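A minimal Python sketch of the ordering behind a Pareto diagram. The defect causes and counts are hypothetical, and the cumulative percentage column is an addition that is often drawn alongside the bars, not something the slide specifies.

```python
def pareto_table(cause_counts):
    """Order causes by decreasing frequency and add cumulative percentages."""
    total = sum(cause_counts.values())
    ordered = sorted(cause_counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for cause, count in ordered:
        running += count
        rows.append((cause, count, 100 * running / total))
    return rows

# Hypothetical defect counts, invented for illustration.
defects = {"scratch": 12, "misalignment": 50, "crack": 8, "discoloration": 30}
rows = pareto_table(defects)
for cause, count, cumulative_pct in rows:
    print(f"{cause:15s} {count:3d} {cumulative_pct:6.1f}%")
```

In this made-up example the first two causes account for 80% of defects, which is the kind of "vital few" the diagram is meant to expose.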
Line Charts
A line chart, also called a time plot, is a series of data
plotted at various time intervals. Measuring time
along the horizontal axis and the numerical quantity of
interest along the vertical axis yields a point on the
graph for each observation. Joining points adjacent in
time by straight lines produces a time plot.
Line Charts
[Figure: Growth Trends in Internet Use by Age, 1997 to 1999 (April 1997 to July 1999). Horizontal axis: quarterly dates from Apr-97 to Jul-99. Vertical axis: Millions of Adults, 0 to 35. Series: Age 18 to 29, Age 30 to 49, Age 50+.]
Parameters and Statistics
A statistic is a descriptive measure computed
from a sample of data. A parameter is a
descriptive measure computed from an entire
population of data.
Measures of Central Tendency - Arithmetic Mean
The arithmetic mean of a set of data is
the sum of the data values divided by
the number of observations.
Sample Mean
If the data set is from a sample, then the
sample mean, $\bar{X}$, is:
$$ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} $$
Population Mean
If the data set is from a population, then the
population mean, $\mu$, is:
$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} $$
Measures of Central Tendency - Median
An ordered array is an arrangement of data in
either ascending or descending order. Once the
data are arranged in ascending order, the median is
the value such that 50% of the observations are
smaller and 50% of the observations are larger.
If the sample size n is an odd number, the median,
$X_m$, is the middle observation. If the sample size n
is an even number, the median, $X_m$, is the average
of the two middle observations. The median will
be located in the 0.50(n + 1)th ordered position.
Measures of Central Tendency - Mode
The mode, if one exists, is the most
frequently occurring observation in
the sample or population.
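The three measures of central tendency can be illustrated with Python's standard `statistics` module. The sample values below are hypothetical.

```python
import statistics

sample = [7, 3, 5, 10, 5]          # hypothetical observations
ordered = sorted(sample)           # ordered array: [3, 5, 5, 7, 10]
n = len(ordered)

mean = sum(sample) / n             # sum of values divided by n
position = 0.50 * (n + 1)          # ordered position of the median
median = statistics.median(sample)
mode = statistics.mode(sample)     # most frequently occurring value

# With an even sample size the median averages the two middle values.
even_sample = [3, 5, 7, 10]
even_median = statistics.median(even_sample)

print(mean, position, median, mode, even_median)
```

For the odd-sized sample the position 0.50(n + 1) is a whole number, so the median is an actual observation; for the even-sized sample it falls halfway between two observations.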
Shape of the Distribution
The shape of the distribution is said to
be symmetric if the observations are
balanced, or evenly distributed, about
the mean. In a symmetric distribution
the mean and median are equal.
Shape of the Distribution
A distribution is skewed if the observations are
not symmetrically distributed above and below
the mean. A positively skewed (or skewed to
the right) distribution has a tail that extends to
the right in the direction of positive values. A
negatively skewed (or skewed to the left)
distribution has a tail that extends to the left in
the direction of negative values.
Shapes of the Distribution
[Figure: three histograms titled Symmetric Distribution, Positively Skewed Distribution, and Negatively Skewed Distribution. Vertical axis: Frequency. Horizontal axis: values 1 to 9.]
Measures of Central Tendency - Geometric Mean
The Geometric Mean is the nth root of the product
of n numbers:
$$ \bar{X}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} $$
The Geometric Mean is used to obtain mean
growth over several periods given compounded
growth from each period.
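As a hedged illustration of the compounded-growth use, the sketch below averages three hypothetical per-period growth factors (1.10 means 10% growth). The function name and the figures are assumptions.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

# Hypothetical growth factors for three periods: +10%, +20%, -5%.
growth_factors = [1.10, 1.20, 0.95]
mean_factor = geometric_mean(growth_factors)
mean_growth_pct = 100 * (mean_factor - 1)

# Compounding the mean factor over three periods reproduces total growth.
total_growth = math.prod(growth_factors)
print(f"mean growth per period: {mean_growth_pct:.2f}%")
```

The arithmetic mean of the same factors would overstate the per-period growth, because it ignores compounding.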
Measures of Variability - The Range
The range of a set of data is the
difference between the largest and
smallest observations.
Measures of Variability - Sample Variance
The sample variance, $s^2$, is the sum of the squared
differences between each observation and the
sample mean divided by the sample size minus 1.
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} $$
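Measures of Variability - Short-cut Formulas for Sample Variance
Short-cut formulas for the sample variance are:
$$ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \quad \text{or} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n - 1} $$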
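The sketch below checks, on a small hypothetical sample, that a short-cut formula gives the same result as the sum of squared deviations from the sample mean divided by n - 1. Function names and data are illustrative assumptions.

```python
def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Short-cut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

sample = [2, 4, 4, 5, 7, 8]   # hypothetical observations
by_definition = sample_variance(sample)
by_shortcut = sample_variance_shortcut(sample)
print(by_definition, by_shortcut)
```

The short-cut form avoids computing each deviation, but with large values and floating-point arithmetic it can lose precision, so the deviation form is usually the safer choice in code.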
Measures of Variability - Population Variance
The population variance, $\sigma^2$, is the sum of the
squared differences between each observation and
the population mean divided by the population
size, N.
$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$
Measures of Variability - Sample Standard Deviation
The sample standard deviation, s, is the positive
square root of the variance, and is defined as:
$$ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}} $$
Measures of Variability - Population Standard Deviation
The population standard deviation, $\sigma$, is
$$ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} $$
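To make the difference between the two divisors concrete, this sketch computes both standard deviations for the same hypothetical data. Treating one list as both a population and a sample is done purely for comparison.

```python
import math

def population_std(xs):
    """Square root of the squared deviations from the mean, divided by N."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sample_std(xs):
    """Square root of the squared deviations from the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations
sigma = population_std(data)      # squared deviations sum to 32; 32 / 8 = 4
s = sample_std(data)              # 32 / 7 under the square root
print(sigma, s)
```

The sample version is always slightly larger because it divides by n - 1 rather than N.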
The Empirical Rule
(the 68%, 95%, or almost all rule)
For a set of data with a mound-shaped histogram, the
Empirical Rule is:
• approximately 68% of the observations are contained
within a distance of one standard deviation around the
mean, $\mu \pm 1\sigma$;
• approximately 95% of the observations are contained
within a distance of two standard deviations around the
mean, $\mu \pm 2\sigma$;
• almost all of the observations are contained within a
distance of three standard deviations around the mean,
$\mu \pm 3\sigma$.
Coefficient of Variation
The Coefficient of Variation, CV, is a measure of
relative dispersion that expresses the standard
deviation as a percentage of the mean (provided the
mean is positive).
The sample coefficient of variation is
$$ CV = \frac{s}{\bar{X}} \times 100\% \quad \text{if } \bar{X} > 0 $$
The population coefficient of variation is
$$ CV = \frac{\sigma}{\mu} \times 100\% \quad \text{if } \mu > 0 $$
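A brief sketch of the sample coefficient of variation in Python, using the standard `statistics` module. The function name and data are hypothetical; the check for a positive mean mirrors the condition stated above.

```python
import statistics

def sample_cv(data):
    """Sample standard deviation as a percentage of the sample mean."""
    mean = statistics.fmean(data)
    if mean <= 0:
        raise ValueError("the coefficient of variation needs a positive mean")
    return 100 * statistics.stdev(data) / mean

sample = [2, 4, 4, 5, 7, 8]       # hypothetical observations, mean 5
cv = sample_cv(sample)
print(f"CV = {cv:.1f}%")
```

Because the result is a percentage of the mean, it can be used to compare dispersion across data sets measured in different units or at different scales.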
Percentiles and Quartiles
Data must first be in ascending order.
Percentiles separate large ordered data sets into
100ths. The Pth percentile is a number such that
P percent of all the observations are at or below
that number.
Quartiles are descriptive measures that separate
large ordered data sets into four quarters.
Percentiles and Quartiles
The first quartile, $Q_1$, is another name for the
25th percentile. The first quartile divides the
ordered data such that 25% of the observations
are at or below this value. $Q_1$ is located in the
0.25(n + 1)th position when the data are in
ascending order. That is,
$$ Q_1 = \text{the } \frac{n + 1}{4}\text{th ordered position} $$
Percentiles and Quartiles
The third quartile, $Q_3$, is another name for the
75th percentile. The third quartile divides the
ordered data such that 75% of the observations
are at or below this value. $Q_3$ is located in the
0.75(n + 1)th position when the data are in
ascending order. That is,
$$ Q_3 = \text{the } \frac{3(n + 1)}{4}\text{th ordered position} $$
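The position formulas can be sketched in Python as below. The slides give only the ordered position; interpolating linearly between the two neighboring observations when that position is fractional is an assumption made here, as are the function names and data.

```python
def value_at_position(ordered, position):
    """Value at a 1-based ordered position, interpolating between neighbors."""
    lower = int(position)            # whole part of the position
    fraction = position - lower      # fractional part of the position
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(data):
    """Q1, median and Q3 from the 0.25, 0.50 and 0.75 (n + 1) positions."""
    ordered = sorted(data)           # data must be in ascending order
    n = len(ordered)
    return tuple(value_at_position(ordered, p * (n + 1)) for p in (0.25, 0.50, 0.75))

data = [22, 12, 30, 17, 25, 15, 28, 20]   # hypothetical observations
q1, median, q3 = quartiles(data)
print(q1, median, q3)
```

With n = 8 the positions are 2.25, 4.5 and 6.75, so each quartile lies between two observations.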
Interquartile Range
The Interquartile Range (IQR) measures the
spread in the middle 50% of the data; that is,
the difference between the observations at the
25th and the 75th percentiles:
$$ IQR = Q_3 - Q_1 $$
Five-Number Summary
The Five-Number Summary refers to the five
descriptive measures: minimum, first quartile,
median, third quartile, and the maximum.
$$ X_{\text{minimum}} < Q_1 < \text{Median} < Q_3 < X_{\text{maximum}} $$
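A compact sketch of the five-number summary and the interquartile range using Python's standard library. It assumes `statistics.quantiles` with `method="exclusive"`, which places quartiles at (n + 1)-based positions; the data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Minimum, Q1, median, Q3 and maximum of a data set."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return min(data), q1, median, q3, max(data)

data = [22, 12, 30, 17, 25, 15, 28, 20]   # hypothetical observations
summary = five_number_summary(data)
minimum, q1, median, q3, maximum = summary
iqr = q3 - q1                              # spread of the middle 50%
print(summary, iqr)
```

These five values are also what a box-and-whisker plot needs in order to be drawn.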
Box-and-Whisker Plots
A Box-and-Whisker Plot is a graphical procedure
that uses the Five-Number summary.
A Box-and-Whisker Plot consists of
• an inner box that shows the numbers which span
the range from Q1 to Q3.
• a line drawn through the box at the median.
The “whiskers” are lines drawn from Q1 to the
minimum value, and from Q3 to the maximum
value.
Box-and-Whisker Plots (Excel)
[Figure: Box-and-whisker plot produced in Excel. Vertical scale: 10 to 45.]
Grouped Data Mean
For a population of N observations the mean is
$$ \mu = \frac{\sum_{i=1}^{K} f_i m_i}{N} $$
For a sample of n observations, the mean is
$$ \bar{X} = \frac{\sum_{i=1}^{K} f_i m_i}{n} $$
where the data set contains observation values m1, m2, . . ., mK occurring with
frequencies f1, f2, . . ., fK respectively.
Grouped Data Variance
For a population of N observations the variance is
$$ \sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N} = \frac{\sum_{i=1}^{K} f_i m_i^2}{N} - \mu^2 $$
For a sample of n observations, the variance is
$$ s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{K} f_i m_i^2 - n\bar{X}^2}{n - 1} $$
where the data set contains observation values m1, m2, . . ., mK occurring with
frequencies f1, f2, . . ., fK respectively.
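As a worked sketch, the grouped sample formulas can be applied to the frequencies in Table 2.2. Using each class midpoint (222.5, 227.5, and so on) as the value m_i is an assumption; the slides do not state which value represents each class.

```python
# Class midpoints (assumed representative values) and Table 2.2 frequencies.
midpoints = [222.5, 227.5, 232.5, 237.5, 242.5, 247.5]
frequencies = [1, 4, 29, 34, 26, 6]

n = sum(frequencies)

# Grouped sample mean: sum of f_i * m_i divided by n.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample variance: sum of f_i * (m_i - mean)^2 divided by n - 1.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

# Equivalent short-cut form: (sum of f_i * m_i^2 - n * mean^2) / (n - 1).
shortcut_variance = (
    sum(f * m * m for f, m in zip(frequencies, midpoints)) - n * grouped_mean ** 2
) / (n - 1)

print(f"mean = {grouped_mean:.1f} mL, variance = {grouped_variance:.2f}")
```

The results approximate the mean and variance of the raw weights, since every bottle in a class is treated as if it weighed the class midpoint.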
Key Words
• Arithmetic Mean
• Bar Chart
• Box-and-Whisker Plot
• Categorical Variable
• Coefficient of Variation
• Continuous Numerical Variable
• Cumulative Frequency Distribution
• Discrete Numerical Variable
• Empirical Rule
• First Quartile
• Five-Number Summary
• Frequency Distribution
• Geometric Mean
• Histogram
• Interquartile Range (IQR)
• Line Chart (Time Plot)
• Measurement Levels
• Median
• Mode
Key Words
(continued)
• Numerical Variables
• Ogive
• Outlier
• Parameter
• Pareto Diagram
• Percentiles
• Pie Chart
• Qualitative
• Quantitative Variables
• Quartiles
• Range
• Relative Cumulative Frequency Distribution
• Short-cut Formula for $s^2$
• Skewness
• Standard Deviation
• Statistic
• Stem-and-Leaf Display
• Third Quartile
• Variance