Exploring Data


Chapter 1 (Yates et al. 2002)
Introduction
Individuals – objects described by data.
Individuals may be people, animals or
objects.
 Variable – any characteristic of an
individual. A variable can take different
values for different individuals.
 A data set contains information
(organized as variables) about some
group of individuals.


When you view a new data set, ask the
following questions:
 Who?
○ Individuals?
○ How many individuals?
 What?
○ How many variables?
○ Definitions for each variable?
○ Units for numerical values?
○ Reason to mistrust information?
 Why?
○ For what reason was the data gathered?

Categorical variable

Quantitative variable, or numerical
variable

Litmus test: Does it make sense to
calculate an average for the variable? If
so it is a quantitative variable.
Example 1.1 Education in the United States.

State  Region  Pop. (1000s)  SAT Verbal  SAT Math  Percent taking  Percent no HS  Teachers’ pay ($1000)
:
CA     PAC     33,871        497         514       49              23.8           43.7
CO     MTN     4,301         536         540       32              15.6           37.1
CT     NE      3,406         510         509       80              20.8           50.7
:
Each row of data is called a case.
Can you answer the three
“W” questions about this data set?
Who?
Individuals?
How many individuals?
What?
How many variables?
Definitions for each variable?
Units for numerical values?
Reason to mistrust information?
Why?
For what reason was the data
gathered?

Distribution – the distribution of a variable
tells us what values the variable takes and
how often it takes these values.
 Think of it as the pattern of variation.

Examining data in order to describe its main
features is called exploratory data analysis.
 Examine variables individually, then look for
relationships among variables.
○ These examinations start with graph(s), then
numerical summaries of specific aspects of the data.
Do exercises 1.1 to 1.3
1.1 Displaying Distributions with
Graphs
Displaying categorical variables: bar graphs
and pie charts
 We already practiced displaying frequency
distributions and relative frequency
distributions with tables and bar graphs.
 According to Yates et al. (2002), the
distribution of a categorical variable can be
displayed as either the count or the
percent of individuals that fall in each
category.
Example 1.2. The most popular soft drink. The following table displays the
sales figures and market share (percent of total sales) achieved by several
major soft drink companies in 1999.
Company                      Cases sold (millions)   Market share (percent)
Coca-Cola Co.                4377.5                  44.1
Pepsi-Cola Co.               3119.5                  31.4
Dr. Pepper/7-UP (Cadbury)    1455.1                  14.7
Cott Corp.                   310.0                   3.1
National Beverage            205.0                   2.1
Royal Crown                  115.4                   1.2
Other                        347.5                   3.4
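The market-share column is each company's count expressed as a percent of the total. A minimal Python sketch of that calculation follows; the cases-sold figures are copied from the table, while the function name `market_shares` is made up for illustration. The computed shares should agree with the printed ones to within about a tenth of a percentage point, since the printed column is rounded.

```python
# Cases sold (millions), copied from the Example 1.2 table.
cases_sold = {
    "Coca-Cola Co.": 4377.5,
    "Pepsi-Cola Co.": 3119.5,
    "Dr. Pepper/7-UP (Cadbury)": 1455.1,
    "Cott Corp.": 310.0,
    "National Beverage": 205.0,
    "Royal Crown": 115.4,
    "Other": 347.5,
}


def market_shares(counts):
    """Return each category's percent of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = market_shares(cases_sold)
for company, pct in shares.items():
    print(f"{company:28s}{pct:5.1f}%")
```

Either column (counts or percents) describes the same distribution; the percents are what a pie chart displays.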
[Figure: bar graph titled “1999 Soft Drink Sales”, vertical axis from 0 to 5000, one bar each for Coca-Cola Co., Pepsi-Cola Co., Dr. Pepper/7-UP (Cadbury), Cott Corp., National Beverage, Royal Crown, and Other.]
[Figure: pie chart titled “1999 Soft Drink Sales: Market Shares”, one slice for each of the same seven categories.]
How to construct a bar graph and pie
chart. See p. 8.
 Bar graphs and pie charts help an
audience grasp the distribution quickly.
 Example 1.3, p. 10 gives an example of
categorical data that would not be
appropriate for a pie chart.

Do exercises 1.5 to 1.6
Displaying quantitative variables: dotplots
and stemplots
 Example 1.4., p. 11, shows how to
construct a dotplot.
 Figure 1.3, p. 11, is an example of a dotplot.

Making a graph is not an end in itself. The
graph must be interpreted.
 Look for overall pattern and also for striking
deviations from that pattern.
○ Give the center and the spread.
○ Describe the shape of the distribution.
 Symmetrical? Bimodal? Skewed right or skewed left?
 Is there an outlier, an individual observation that falls
outside the pattern?
When the values are too spread out a
stemplot may be useful.
 Example 1.5, p. 13, shows how to
construct a stemplot.
 Figure 1.4, p. 14.

 Sometimes stems can be split to better resolve
the distribution.
 Too few stems will result in a skyscraper-shaped
plot. Too many and you get a “pancake” graph.
 Round if the data have too many digits.
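To make the stem-and-leaf idea concrete, here is a minimal Python sketch. It is not the textbook's procedure from p. 13; it assumes non-negative values, rounds each value to the nearest `leaf_unit`, uses the last digit as the leaf and the remaining digits as the stem. The function name `stemplot` and the sample data are made up for illustration.

```python
from collections import defaultdict


def stemplot(values, leaf_unit=1):
    """Return the lines of a simple stemplot for non-negative values.

    Each value is rounded to the nearest multiple of leaf_unit; the last
    digit becomes the leaf and the remaining digits become the stem.
    """
    scaled = sorted(round(v / leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        stem, leaf = divmod(s, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # Print every stem from smallest to largest, including empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(leaves.get(stem, []))
        lines.append(f"{stem:3d} | {row}".rstrip())
    return lines


# Hypothetical data, made up for illustration only.
for line in stemplot([12, 15, 21, 23, 23, 40]):
    print(line)
```

Keeping the empty stem 3 in the output is deliberate: dropping it would hide the gap between the main body of the data and the value 40.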

See Technology Toolbox for interpreting
data from computer output.
Do exercises 1.8 to 1.11
Displaying quantitative variables: histogram
 Quantitative variables often take on many
values. A graph of the distribution is
clearer if we group nearby values as in a
histogram.
 Example 1.6, p. 19, shows how to make a
histogram.
 Let’s interpret Figure 1.7, p. 20.
 Do Technology Toolbox, p. 21.
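The grouping step behind a histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: classes have equal width, each class includes its lower endpoint but not its upper one, and the data are hypothetical. The function name `class_counts` is made up.

```python
def class_counts(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
        else:
            raise ValueError(f"{v} falls outside the classes")
    return counts


# Hypothetical data, made up for illustration only.
ages = [42, 46, 47, 51, 51, 54, 55, 56, 57, 61, 64, 68]
counts = class_counts(ages, start=40, width=5, n_classes=6)

# A text histogram: one row per class, one star per observation.
for i, c in enumerate(counts):
    lower = 40 + 5 * i
    print(f"{lower}-{lower + 4}: {'*' * c}")
```

Changing `width` changes the appearance of the histogram, which is why the choice of classes matters when interpreting shape.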
Do exercises 1.12 and 1.15
More about shape
 Look for:
 Major peaks
 Clear outliers
 Rough symmetry or clear skewness
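Symmetric – describes a distribution whose
histogram has right and left sides that are
approximately mirror images of each other.
 Skewed right – the right side of the histogram (the larger values) extends much farther out than the left side.
 Skewed left – the left side of the histogram (the smaller values) extends much farther out than the right side.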


Example 1.7 Lightning Flashes and
Shakespeare.
 Figure 1.8, p. 26
 Figure 1.9, p. 26
○ Note that the vertical scale here is a percent not a
count.
 Convenient when counts are large or when we want to
compare several distributions.

Overall shape of a distribution provides
important information about a variable.
 Some types of data regularly produce distributions that
are symmetrical or that are skewed.
 Some display neither.
Do exercises 1.16 and 1.17
Relative frequency, cumulative frequency,
and ogives.
 So you received a test score report that
said you were in the 85th percentile. So
what?
 The pth percentile of a distribution is
the value such that p percent of the
observations fall at or below it.
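The definition can be turned around to find the standing of a single observation: count the observations at or below it and express that count as a percent. A small Python sketch, using hypothetical test scores and a made-up function name `percent_at_or_below`:

```python
def percent_at_or_below(observations, value):
    """Percent of observations that fall at or below the given value."""
    at_or_below = sum(1 for x in observations if x <= value)
    return 100 * at_or_below / len(observations)


# Hypothetical test scores, made up for illustration only.
scores = [55, 62, 68, 70, 71, 75, 78, 80, 83, 85,
          86, 88, 90, 91, 93, 94, 95, 96, 98, 99]

# 17 of the 20 scores are at or below 95, so a score of 95
# sits at the 85th percentile of this group.
print(percent_at_or_below(scores, 95))
```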

Relative cumulative frequency graph,
or ogive – shows the relative standing
of an individual observation.
How to construct a relative cumulative frequency graph, pp. 28-30.
• Decide on intervals and make a frequency table.
• Add three columns: relative frequency, cumulative frequency, and relative
cumulative frequency.
• To get the values of the relative frequency, divide the count in each class by
the total number of individuals.
• To fill the cumulative frequency column, add the counts in the frequency column
that fall in or below the current class interval.
• For relative cumulative frequency, divide the entries in the cumulative frequency
column by the total number of individuals.
Class   Frequency   Relative Frequency         Cumulative Frequency   Relative Cumulative Freq
40-44   2           2/43 = 0.047, or 4.7%      2                      2/43 = 0.047, or 4.7%
45-49   6           6/43 = 0.140, or 14.0%     8                      8/43 = 0.186, or 18.6%
50-54   13          13/43 = 0.302, or 30.2%    21                     21/43 = 0.488, or 48.8%
55-59   12          12/43 = 0.279, or 27.9%    33                     33/43 = 0.767, or 76.7%
60-64   7           7/43 = 0.163, or 16.3%     40                     40/43 = 0.930, or 93.0%
65-69   3           3/43 = 0.070, or 7.0%      43                     43/43 = 1.000, or 100.0%
Total   43
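The three added columns can be computed directly from the frequency column. The following Python sketch uses the class counts from the table; the variable names are made up for illustration.

```python
from itertools import accumulate

# Class intervals and counts copied from the frequency table.
classes = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
frequency = [2, 6, 13, 12, 7, 3]

total = sum(frequency)

# Relative frequency: count in each class divided by the total.
relative = [f / total for f in frequency]

# Cumulative frequency: running total of the counts.
cumulative = list(accumulate(frequency))

# Relative cumulative frequency: running total divided by the total.
relative_cumulative = [c / total for c in cumulative]

for row in zip(classes, frequency, relative, cumulative, relative_cumulative):
    print("{:6s}{:4d}{:8.1%}{:5d}{:8.1%}".format(*row))
```

The last entry of the relative cumulative frequency column is always 100%, which is a quick check on the arithmetic.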

Next, graph the relative cumulative
frequency against the variable of
interest.
 The vertical axis is scaled from 0% to 100%.
 The horizontal axis is scaled to fit the
highest value and each major tick is labeled
with the left end-point of the next class
interval.
How do you locate an individual within the distribution?
How do you locate a value corresponding to a percentile?
How can you find the center of the distribution?
Figure 1.12 Ogive of presidents’ ages at inauguration.
[Graph: horizontal axis Age at Inauguration, 30 to 75; vertical axis Rel. Cum. Freq., 0% to 100%.]
Do exercise 1.19.
Time plots
 Many variables are measured at intervals
over time.
 Example: Height of a growing child.

A time plot of a variable plots each
observation against the time at which it
was measured.
 Time is always placed on the horizontal axis.
 Variable of interest on the vertical axis.
 If there are not too many points, connect the
points to show the pattern of change over time.

When examining a time plot look for
 Overall pattern
 Strong deviations from the pattern
A common overall pattern is a trend: a
long-term upward or downward movement over time.
 A pattern that repeats itself at regular
time intervals is called seasonal
variation.

Do exercise 1.21.
1.2 Describing Distributions with
Numbers
Shape, center, and spread provide a
good description of the overall pattern of
any distribution of a quantitative
variable.
Measuring center: the mean
 The mean is the most common measure of
center; it is the ordinary arithmetic average.

$\bar{x} = \frac{1}{n} \sum x_i$

Example 1.10 Barry Bonds vs. Hank Aaron
illustrates that the mean is sensitive to
extreme values (outliers); also to skewed
distributions.
 The mean is not a resistant measure of center.
Measuring center: the median
 The median, M, is the formal version of
the midpoint, with a specific rule for
calculation.
 Number such that half the observations are
smaller and the other half larger. See p. 39.
Example 1.11, Finding Medians, tells a
different story in the comparison of
Barry Bonds’s and Hank Aaron’s
home runs.
Comparing the mean and the median
 The mean and median of a symmetric
distribution are close together.
 In a skewed distribution, the mean is
farther out in the long tail than the
median.
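A small numerical illustration of how one extreme value pulls the mean but not the median. The data below are hypothetical (they are not the Bonds or Aaron home-run counts), and the sketch relies only on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical right-skewed data (for example, house prices in $1000s),
# made up for illustration only. The last value is an extreme observation.
prices = [120, 135, 150, 160, 175, 190, 950]

print("with the extreme value:   ",
      f"mean = {mean(prices):.1f}, median = {median(prices)}")

# Drop the extreme value and compare again.
trimmed = prices[:-1]
print("without the extreme value:",
      f"mean = {mean(trimmed):.1f}, median = {median(trimmed)}")
```

Removing the single large value changes the mean by more than 100 while the median moves by only 5, which is what "resistant" means in practice.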

Reports of house prices, incomes, and
other strongly skewed distributions
usually give the median rather than the
mean.
 However, if you are a tax assessor
interested in figuring out the total value
of all the homes in your area, use the
mean.

Do exercises 1.31 to 1.35
Measuring spread: the quartiles
 A measure of center alone does not tell us
how variable the values of a variable are.
 The simplest useful numerical description of
a distribution contains both the measure of
center and a measure of spread.

Range is the difference between the
largest and smallest observation.
 But the range depends only on the two most extreme observations, which may be outliers.

Measure of spread can be improved by
looking at the spread of the middle half of
the data.
 Quartiles mark out the middle half.
○ First quartile, Q1, lies one-quarter of the way up
an ordered list of values for a variable.
 Larger than 25% of the observations.
○ Third quartile, Q3, lies three-quarters of the way up
an ordered list of values for a variable.
 Larger than 75% of the observations.
○ Second quartile is the median, M, previously
defined.
 Larger than 50% of the observations.

How to calculate quartiles: see p. 42.
 Calculators and computer software may use
slightly different rules, but the slight
differences are no problem.
Example 1.12, p. 43.
 Distance between the first and third
quartiles is a simple measure of spread
called interquartile range (IQR).

$IQR = Q_3 - Q_1$

If an observation falls between Q1 and Q3, it
is not unusual.
 IQR is the basis for determining outliers.

Criterion for determining outliers
 A value is called an outlier if it falls more than $1.5 \times IQR$ above the third quartile
or below the first quartile.
○ Upper cutoff: $Q_3 + 1.5 \times IQR$
○ Lower cutoff: $Q_1 - 1.5 \times IQR$
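The quartiles, the IQR, and the outlier cutoffs can be sketched together in Python. The rule used here is an assumption: Q1 and Q3 are taken as the medians of the observations below and above the location of the overall median, a common textbook rule that may differ slightly from what a calculator or software package reports. The function names and the data are made up for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule (an assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # observations below the median's location
    upper = data[half + n % 2:]   # observations above the median's location
    return median(lower), median(data), median(upper)


def outliers(values):
    """Return the values that fall beyond the 1.5 x IQR cutoffs."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    return [x for x in values if x < lower_cutoff or x > upper_cutoff]


# Hypothetical data, made up for illustration only.
data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 80]
q1, m, q3 = quartiles(data)
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {q3 - q1}")
print("outliers:", outliers(data))
```

Here the IQR is 19, so the upper cutoff is 29 + 28.5 = 57.5 and only the value 80 is flagged.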
The five-number summary and boxplots
 The smallest and largest values give us information
about the tails of the distribution.
 Combined with the median and quartiles, they form the
five-number summary, a quick summary of center and spread:
 Minimum  Q1  M  Q3  Maximum

A graph called a boxplot can be used to view
the five-number summary for a variable and to
compare groups of data for the same variable.
 Figure 1.17, p. 45. Comparing Barry Bonds and
Hank Aaron

When looking at a boxplot
 Locate median
 And identify the spread, the quartiles and the
extreme values.

The boxplot also gives an indication of
symmetry.
 In a right-skewed distribution, the third
quartile will be farther above the median than the
first quartile is below it. What about a left-skewed distribution?

Modified boxplot plots the outliers as
isolated points. See Figure 1.18, p. 46.
Do Technology Toolbox and exercises 1.36 to 1.39
Measuring spread: the standard deviation
 The most common numerical description
of a distribution is the mean together with the
standard deviation, which measures spread
by looking at how far the observations
are from their mean.
 Variance ($s^2$) – the average of the
squares of the deviations of the observations from their
mean, with the sum divided by n − 1:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard deviation (s) – the square root of the
variance:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Example 1.14 Metabolic Rate, p. 49
 Figure 1.20, p. 50, illustrates the idea of a
deviation using the metabolic rate data.
 Variance (and standard deviation) is large if
the observations are widely spread about their
mean; opposite is true if the observations are
all close to the mean.
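The variance and standard deviation formulas can be followed step by step in a short Python sketch. The data are hypothetical (they are not the metabolic rate data of Example 1.14), the function name `sample_variance` is made up, and the result is compared with `statistics.stdev` from the standard library, which also divides by n − 1.

```python
from math import sqrt
from statistics import stdev


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical data, made up for illustration only.
data = [1, 2, 3, 4, 5]

s2 = sample_variance(data)   # deviations -2, -1, 0, 1, 2; squares sum to 10
s = sqrt(s2)

print(f"variance = {s2}, standard deviation = {s:.4f}")
print(f"statistics.stdev gives {stdev(data):.4f}")
```

With five observations the sum of squared deviations, 10, is divided by 4, giving a variance of 2.5.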

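Degrees of freedom – the divisor n − 1 in the variance. Because the deviations always sum to 0, only n − 1 of them can vary freely.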
 Basic properties of standard deviation as a
measure of spread.

 Use s only when the mean is the measure of center.
Why?
 s = 0 only when there is no spread, i.e. all
observations have the same value. Otherwise, s
> 0. As observations become more spread out
about their mean, s gets larger.
 s, like the mean, is not resistant.
Choosing measures of center and spread
 Use the five-number summary when
describing a skewed distribution with
strong outliers
 Use mean and standard deviation when
describing a reasonably symmetric
distribution that is free of outliers.
Always plot your data!
Do Exercises 1.40 to 1.43
Changing the unit of measurement
 Done by a linear transformation.
$x_{new} = a + bx$
 Adding a constant a shifts x up or down by the
same amount.
 Multiplying by a positive constant b changes the
size of the unit of measurement.
 Example:
$T_F = \frac{9}{5} T_C + 32$

Effects of linear transformation on spread (p.
55)
 Multiplying each observation by a positive number b
multiplies both measures of center (mean and
median) and measures of spread (s and IQR) by b.
 Adding the same number a (either positive or
negative) to each observation adds a to the
measures of center and to quartiles but does NOT
change measures of spread.
 Overall, a linear transformation does not change
the shape of the distribution.

Example 1.15 Los Angeles Lakers’ Salaries
(pp. 53-55)
Stem-and-Leaf Display: Base Salary   Stem-and-Leaf Display: +.1 Bonus     Stem-and-Leaf Display: 10% Bonus

Stem-and-leaf of Base Salary         Stem-and-leaf of +.1 Bonus           Stem-and-leaf of 10% Bonus
N = 14  Leaf Unit = 0.10             N = 14  Leaf Unit = 0.10             N = 14  Leaf Unit = 0.10

 3   0  378                           3   0  489                           3   0  378
 5   1  00                            5   1  11                            5   1  11
 7   2  01                            7   2  12                            7   2  23
 7   3  1                             7   3  2                             7   3  4
 6   4  235                           6   4  346                           6   4  679
 3   5  0                             3   5  1                             3   5  5
 2   6                                2   6                                2   6
 2   7                                2   7                                2   7
 2   8                                2   8                                2   8
 2   9                                2   9                                2   9
 2  10                                2  10                                2  10
 2  11  8                             2  11  9                             2  11
 1  12                                1  12                                2  12  9
 1  13                                1  13                                1  13
 1  14                                1  14                                1  14
 1  15                                1  15                                1  15
 1  16                                1  16                                1  16
 1  17  1                             1  17  2                             1  17
                                                                           1  18  8
Figure 1.21 Stemplots of the salaries of LA Laker players before linear transformation,
after adding 0.1, and after a 10% increase (multiplying by 1.1).
Figure 1.21A Boxplots of the salaries of LA Laker players before linear transformation,
after adding 0.1, and after a 10% increase (multiplying by 1.1).
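The effect of the two bonuses on center and spread can be checked numerically. In this Python sketch the base salaries are read off the base-salary stemplot (leaf unit 0.10), so they carry only one decimal place; the function name `transform` is made up for illustration, and the flat bonus and 10% bonus are modeled as x_new = 0.1 + x and x_new = 1.1x.

```python
from statistics import mean, median, stdev

# Base salaries as read from the base-salary stemplot (leaf unit 0.10).
base = [0.3, 0.7, 0.8, 1.0, 1.0, 2.0, 2.1, 3.1, 4.2, 4.3, 4.5, 5.0, 11.8, 17.1]


def transform(values, a=0.0, b=1.0):
    """Apply the linear transformation x_new = a + b * x to every value."""
    return [a + b * x for x in values]


flat_bonus = transform(base, a=0.1)   # add 0.1 to every salary
pct_bonus = transform(base, b=1.1)    # 10% increase: multiply by 1.1

for label, data in [("base", base), ("+0.1", flat_bonus), ("x1.1", pct_bonus)]:
    print(f"{label:5s} mean = {mean(data):.3f}  "
          f"median = {median(data):.3f}  s = {stdev(data):.3f}")
```

Adding 0.1 moves the mean and median up by 0.1 and leaves s unchanged; multiplying by 1.1 multiplies the mean, the median, and s by 1.1.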
Do Exercises 1.44 to 1.46
Comparing distributions
 Side-by-side bar graphs for categorical
data.
 Back-to-back stemplots for small
quantitative data sets.
 Side-by-side boxplots.
Figure 1.22 Favorite car and truck colors for 1998.
[Side-by-side bar graph. Vertical axis: Percent Purchased, 0 to 25. Horizontal axis: Color (Med or dark green, White, Light brown, Silver, Black). Bars: Full-sized or intermediate-sized car; Light truck or van.]
        Male |   | Female
             | 0 | 57
             | 1 | 0489
       87550 | 2 | 59
       76431 | 3 | 13
           4 | 4 |
          90 | 5 |
             | 6 |
             | 7 |
          65 | 8 |
Figure 1.23 Back-to-back stemplot of the number of cesarean
sections performed by male and female Swiss doctors.
Descriptive statistics for the number of cesarean sections performed by male
and female Swiss doctors. Compare the distributions.

                 Mean   SD      Min   Q1   M      Q3   Max   IQR
Male doctors     41.3   20.61   20    27   34     50   86    23
Female doctors   19.1   10.13   5     10   18.5   29   33    19
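One way to start the comparison is to apply the 1.5 × IQR criterion to each group using only the summary statistics in the table. The Python sketch below does that; the dictionary layout and the function name `cutoffs` are made up for illustration.

```python
# Summary statistics copied from the table of descriptive statistics.
summary = {
    "Male doctors": {"min": 20, "q1": 27, "q3": 50, "max": 86},
    "Female doctors": {"min": 5, "q1": 10, "q3": 29, "max": 33},
}


def cutoffs(q1, q3):
    """Return the (lower, upper) 1.5 x IQR cutoffs."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


for group, s in summary.items():
    low, high = cutoffs(s["q1"], s["q3"])
    beyond = s["min"] < low or s["max"] > high
    print(f"{group}: cutoffs ({low}, {high}); "
          f"extreme value beyond a cutoff: {beyond}")
```

For the male doctors the upper cutoff is 50 + 1.5 × 23 = 84.5, so the maximum of 86 counts as an outlier; for the female doctors the upper cutoff is 57.5 and the maximum of 33 lies well inside it.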
Do Exercises 1.47 to 1.49