Turning Data Into Information


Never let time idle away aimlessly.
Chapters 1, 2: Turning Data into Information
• Types of data
• Displaying distributions
• Describing distributions
What are Data?
Any set of data contains information about
some group of individuals. The information is
organized in variables.
Individuals are the objects described by a set of data.
Could be animals, people, or things.
A variable is any characteristic of an individual.
A variable can take different values for different
individuals.
Population/Sample/Raw Data
• A population is a collection of all individuals about which information is desired.
• A sample is a subset of a population.
• Raw data: information that has been collected but not yet processed.
Example: A College's Student Dataset
The data set includes data about all currently enrolled students such as their ages, genders, heights, grades, and choices of major.
• Population/sample/raw data of study?
• Who? What individuals do the data describe?
• What? How many variables do the data describe? Give an example of variables.
Types of Variables
• A categorical variable places an individual into one of several groups or categories.
• A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
Q. Which variables are categorical? Which are quantitative?
A variable
Q: Does "average" make sense?
  No: Categorical/Qualitative
    Q: Is there any natural ordering among categories?
      No: Nominal variable
      Yes: Ordinal variable
  Yes: Numerical/Quantitative
    Q: Can all possible values be listed down?
      Yes: Discrete variable
      No: Continuous variable
Two Basic Strategies to Explore Data
• Begin by examining each variable by itself. Then move on to study the relationships among the variables.
• Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.
Summarizing Data
Goal: to study or estimate the distributions of variables
The distribution of a variable tells us what values/categories it takes and how often it takes those values/categories.
• Displaying distributions of data with graphs
• Describing distributions of data with numbers
A Dataset of CSUEB Students
Gender
10
Weight
(pounds)
155
College
M
Height
(inches)
68.5
F
F
M
F
61.2
63.0
70.0
68.6
99
115
205
170
Bsns
Arts
Arts
Arts
F
M
M
65.1
72.4
--
125
220
188
Bsns
Arts
Bsns
Bsns
Displaying Distributions of Categorical Variables
Calculate these first:
• Frequency/counts
• Relative frequency/percentage
Then choose a graph:
• Pie charts: good for one variable
• Bar graphs: good for one or two variables, and better than pie charts for ordinal variables
Example 1.3 (page 9)
Class Make-up on First Day
Year        Count   Percent
Freshman    18      41.9%
Sophomore   10      23.3%
Junior       6      14.0%
Senior       9      20.9%
Total       43      100.1%
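The counts and percentages in the class make-up table can be reproduced with a short Python sketch. It rebuilds the 43 individual class years from the counts in the table; the variable names are illustrative and not from the text.

```python
from collections import Counter

# Class years for the 43 students, rebuilt from the counts in the table.
years = (["Freshman"] * 18 + ["Sophomore"] * 10
         + ["Junior"] * 6 + ["Senior"] * 9)

counts = Counter(years)        # frequency (count) of each category
n = sum(counts.values())       # total number of individuals

# Relative frequency as a percentage, rounded to one decimal place.
percents = {year: round(100 * c / n, 1) for year, c in counts.items()}

for year in counts:
    print(f"{year:<10} {counts[year]:>3} {percents[year]:>5.1f}%")
```

Each percentage is rounded separately, which is why the rounded values add up to 100.1% rather than exactly 100%.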
Class Make-up on First Day: Pie Chart
[Figure: pie chart with slices Freshman 41.9%, Sophomore 23.3%, Junior 14.0%, Senior 20.9%]
Class Make-up on First Day: Bar Graph
[Figure: bar graph of Percent (vertical axis, 0.0% to 45.0%) by Year in School (horizontal axis): Freshman 41.9%, Sophomore 23.3%, Junior 14.0%, Senior 20.9%]
Displaying Distributions of Quantitative Variables
• Stem-and-leaf plots: good for small to medium datasets
• Histograms: similar to bar charts; good for medium to large datasets
How to Make a Histogram
1. Break the range of values of a variable into equal-width intervals (classes). Specify the classes precisely so that each individual falls into exactly one class.
2. Count the number of individuals in each interval. These counts are called frequencies, and the corresponding percentages are called relative frequencies.
3. Draw the histogram: the variable on the horizontal axis and the count (or percent) on the vertical axis.
*** work on blackboard for height ***
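The three histogram steps can be sketched in Python using the seven recorded heights from the small student dataset. The class boundaries (start at 60 inches, width 5 inches, three classes) are an assumption made for this illustration, not values given in the slides.

```python
# Heights (inches) from the small student dataset; the missing value is left out.
heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]

# Assumed class boundaries for this sketch.
start, width, num_classes = 60.0, 5.0, 3

# Step 1: equal-width classes [lo, hi). The left endpoint is included and the
# right endpoint excluded, so each height falls into exactly one class.
classes = [(start + i * width, start + (i + 1) * width) for i in range(num_classes)]

# Step 2: frequencies and relative frequencies.
freqs = [sum(lo <= h < hi for h in heights) for lo, hi in classes]
rel_freqs = [f / len(heights) for f in freqs]

# Step 3: a text histogram, one row of stars per class.
for (lo, hi), f, r in zip(classes, freqs, rel_freqs):
    print(f"[{lo:.0f}, {hi:.0f})  {'*' * f}  count={f}  ({r:.1%})")
```

The half-open classes decide where a boundary value goes: a height of exactly 70.0 is counted in the class from 70 to under 75, not in the class below it.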
17
Histograms: Class Intervals
• How many intervals?
– One rule is to calculate the square root of the sample size, and round up.
• Size of intervals?
– Divide the range of the data (max - min) by the number of intervals desired, and round to a convenient number.
• Pick intervals so each observation falls in exactly one interval (no overlap).
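As a rough illustration, the two rules can be applied to the weight data used later in these slides (53 values, minimum 100 pounds, maximum 260 pounds). The variable names are illustrative.

```python
import math

n = 53                        # sample size of the weight data
minimum, maximum = 100, 260   # smallest and largest weights (pounds)

# How many intervals? Square root of the sample size, rounded up.
num_intervals = math.ceil(math.sqrt(n))

# Size of intervals? Range of the data divided by the number of intervals.
width = (maximum - minimum) / num_intervals

print(num_intervals, width)
```

Here the width is already a convenient number, so no further rounding is needed. If the classes are half-open (for example, 100 to under 120), the maximum of 260 sits exactly on the last boundary, so either the last class must include its right endpoint or one more class is needed.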
How to Make a Stemplot
1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
Example: for a height of 68.5, the leaf is "5" and the other digits, "68", form the stem.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
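The stemplot steps above can be sketched in Python for whole-number data. The function name `stemplot` is an illustrative choice, and the input is the eight weights from the small student dataset; this is one possible way to code the steps, not the method of the text.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for non-negative whole numbers.

    The stem is everything but the final digit; the leaf is the final digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):               # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)
    rows = []
    # List every stem from smallest to largest, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {digits}".rstrip())
    return rows

# Weights (pounds) from the small student dataset.
weights = [155, 99, 115, 205, 170, 125, 220, 188]
rows = stemplot(weights)
print("\n".join(rows))
```

Stems with no observations are kept as empty rows so that gaps in the data stay visible.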
Weight Data: Stemplot (Stem & Leaf Plot)
Key: 20|3 means 203 pounds
Stems = 10's, Leaves = 1's
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Extended Stem-and-Leaf Plots
If there are very few stems (when the data
cover only a very small range of values),
then we may want to create more stems by
splitting the original stems.
Example: if all of the data values were between 150 and 179, then we may choose to use the following stems:
15
15
16
16
17
17
Leaves 0-4 would go on each upper stem (first "15"), and leaves 5-9 would go on each lower stem (second "15").
What Do We See from the Graphs?
Important features we should look for:
• Overall pattern
– Shape
– Center (the value around which the data tend to cluster)
– Spread (how widely the data vary)
• Outliers, the values that fall far outside the overall pattern
(for quantitative variables only)
Overall Pattern: Shape
• How many peaks, called modes? A distribution with one major peak is called unimodal.
• Symmetric or skewed?
– Symmetric if the large values are mirror images of small values
– Skewed to the right if the right tail (large values) is much longer than the left tail (small values)
– Skewed to the left if the left tail (small values) is much longer than the right tail (large values)
*** Show examples on blackboard. ***
Numerical Summaries for Quantitative Variables (Chapter 2)
• To measure center (location): Mode, Mean and Median
• To measure spread: Range, Interquartile Range (IQR) and Standard Deviation (SD)
• Five-number summaries
** show height
• Outliers
** give a large number for the missing height
Mean or Average
• Traditional measure of center
• Sum the values and divide by the number of values
\[ \bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i \]
Median (M)
• A resistant measure of the data's center
• At least half of the ordered values are less than or equal to the median value
• At least half of the ordered values are greater than or equal to the median value
• If n is odd, the median is the middle ordered value
• If n is even, the median is the average of the two middle ordered values
Median (M)
Location of the median: L(M) = (n+1)/2 ,
where n = sample size.
Example: If 25 data values are recorded, the
Median would be the
(25+1)/2 = 13th ordered value.
29
Median
• Example 1 data: 2 4 6
Median (M) = 4
• Example 2 data: 2 4 6 8
Median = 5 (ave. of 4 and 6)
• Example 3 data: 6 2 4
Median ≠ 2 (order the values: 2 4 6, so Median = 4)
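The location rule L(M) = (n+1)/2 and the three median examples can be checked with a short Python sketch. The function name `median_by_location` is an illustrative choice, not from the text.

```python
def median_by_location(values):
    """Median found from the location L(M) = (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)       # always order the values first
    n = len(ordered)
    location = (n + 1) / 2         # 1-based position; ends in .5 when n is even
    if n % 2 == 1:
        return ordered[int(location) - 1]        # the middle ordered value
    below = ordered[int(location) - 1]           # value just below the location
    above = ordered[int(location)]               # value just above the location
    return (below + above) / 2                   # average of the two middle values

print(median_by_location([2, 4, 6]))
print(median_by_location([2, 4, 6, 8]))
print(median_by_location([6, 2, 4]))
```

Sorting inside the function is what prevents the mistake in Example 3, where the middle value of the unordered list is 2 but the median is 4.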
Comparing the Mean & Median
• The mean and median of data from a symmetric distribution should be close together. The actual (true) mean and median of a symmetric distribution are exactly the same.
• In a skewed distribution, the mean is farther out in the long tail than is the median [the mean is 'pulled' in the direction of the possible outlier(s)].
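A small Python sketch shows the pull on the mean. Both data sets are hypothetical, made up for this illustration; the second replaces the largest value of the first with one far out in the right tail.

```python
from statistics import mean, median

symmetric = [2, 4, 6, 8, 10]    # hypothetical symmetric data
skewed = [2, 4, 6, 8, 100]      # same data with one very large value

print(mean(symmetric), median(symmetric))   # mean and median agree
print(mean(skewed), median(skewed))         # mean moves toward the large value
```

The median is the same for both lists because it depends only on the middle ordered value, while the mean uses every value and so follows the long tail.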
Question
A recent newspaper article in California
said that the median price of single-family
homes sold in the past year in the local
area was $136,000 and the mean price
was $149,160. Which do you think is more
useful to someone considering the
purchase of a home, the median or the
mean?
32
Spread, or Variability
• If all values are the same, then they all equal the mean. There is no variability.
• Variability exists when some values are different from (above or below) the mean.
• We will discuss the following measures of spread: range, IQR, and standard deviation
Range
• One way to measure spread is to give the smallest (minimum) and largest (maximum) values in the data set:
Range = max - min
• The range is strongly affected by outliers
Quartiles
• Three numbers which divide the ordered data into four equal-sized groups.
• Q1 has 25% of the data below it.
• Q2 has 50% of the data below it. (Median)
• Q3 has 75% of the data below it.
Obtaining the Quartiles
• Order the data.
• For Q2, just find the median.
• For Q1, look at the lower half of the data values, those to the left of the median location; find the median of this lower half.
• For Q3, look at the upper half of the data values, those to the right of the median location; find the median of this upper half.
Weight Data: Sorted
L(M)=(53+1)/2=27
L(Q1)=(26+1)/2=13.5
100 101 106 106 110 110 119 120 120 123
124 125 127 128 130 130 133 135 139 140
148 150 150 152 155 157 165 165 165 170
170 170 172 175 175 180 180 180 180 185
185 185 186 187 192 194 195 203 210 212
215 220 260
Weight Data: Quartiles
• Q1 = 127.5
• Q2 = 165 (Median)
• Q3 = 185
Five-Number Summary
• minimum = 100
• Q1 = 127.5
• M = 165
• Q3 = 185
• maximum = 260
IQR gives spread of middle 50% of the data:
Interquartile Range (IQR) = Q3 - Q1 = 57.5
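The five-number summary and IQR for the weight data can be reproduced with a Python sketch of the median-of-halves method described on the earlier slides. The function names are illustrative. Software packages often use other quartile rules, so their results may differ slightly from this method.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # left of the median location
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]  # right of it
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# The 53 sorted weights (pounds) from the slides.
weights = [
    100, 101, 106, 106, 110, 110, 119, 120, 120, 123,
    124, 125, 127, 128, 130, 130, 133, 135, 139, 140,
    148, 150, 150, 152, 155, 157, 165, 165, 165, 170,
    170, 170, 172, 175, 175, 180, 180, 180, 180, 185,
    185, 185, 186, 187, 192, 194, 195, 203, 210, 212,
    215, 220, 260,
]

summary = five_number_summary(weights)
iqr = summary[3] - summary[1]      # IQR = Q3 - Q1
print(summary, iqr)
```

When n is odd, as it is here, the median itself is left out of both halves, so each half holds 26 values and each quartile sits at location 13.5.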
Variance and Standard Deviation
• Recall that variability exists when some values are different from (above or below) the mean.
• Each data value has an associated deviation from the mean: \( x_i - \bar{x} \)
Deviations
• What is a typical deviation from the mean? (standard deviation)
• Small values of this typical deviation indicate small variability in the data
• Large values of this typical deviation indicate large variability in the data
Variance
• Find the mean
• Find the deviation of each value from the mean
• Square the deviations
• Sum the squared deviations
• Divide the sum by n-1
(gives typical squared deviation from mean)
Variance Formula
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Standard Deviation Formula
typical deviation from the mean
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
[ standard deviation = square root of the variance ]
Variance and Standard Deviation
Example from Text
Metabolic rates of 7 men (cal./24hr.):
1792 1666 1362 1614 1460 1867 1439
\[ \bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11{,}200}{7} = 1600 \]
Variance and Standard Deviation
Example from Text
Observations x_i   Deviations x_i - x̄       Squared deviations (x_i - x̄)^2
1792               1792 - 1600 = 192        (192)^2 = 36,864
1666               1666 - 1600 = 66         (66)^2 = 4,356
1362               1362 - 1600 = -238       (-238)^2 = 56,644
1614               1614 - 1600 = 14         (14)^2 = 196
1460               1460 - 1600 = -140       (-140)^2 = 19,600
1867               1867 - 1600 = 267        (267)^2 = 71,289
1439               1439 - 1600 = -161       (-161)^2 = 25,921
                   sum = 0                  sum = 214,870
Variance and Standard Deviation
Example from Text
\[ s^2 = \frac{214{,}870}{7 - 1} = 35{,}811.67 \]
\[ s = \sqrt{35{,}811.67} = 189.24 \text{ calories} \]
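The worked example can be checked step by step with a short Python sketch that follows the variance recipe (mean, deviations, squares, sum, divide by n-1). The variable names are illustrative.

```python
import math

# Metabolic rates of 7 men (cal./24hr.) from the example.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
mean = sum(rates) / n                          # find the mean
deviations = [x - mean for x in rates]         # deviation of each value from the mean
squared = [d ** 2 for d in deviations]         # square the deviations
sum_squared = sum(squared)                     # sum the squared deviations
variance = sum_squared / (n - 1)               # divide the sum by n - 1
sd = math.sqrt(variance)                       # standard deviation

print(mean, sum(deviations), sum_squared)
print(round(variance, 2), round(sd, 2))
```

The deviations always sum to zero, which is why they are squared before being added up.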
More Graphs for Quantitative Variables
• Boxplots (pages 46 - 49)
** to show location and spread, and identify outliers
• Scatterplots
** to see the relationship between two quantitative variables: height vs. weight
• Time plots
** a special scatterplot; time is the x-axis
** example 1.10, page 23
Boxplot
• Central box spans Q1 and Q3.
• A line in the box marks the median M.
• Lines extend from the box out to the minimum and maximum.
Weight Data: Boxplot
[Figure: boxplot of Weight with min, Q1, M, Q3, and max labeled; axis ticks from 100 to 275 in steps of 25]
Example from Text: Boxplots
51
Identifying Outliers
• The central box of a boxplot spans Q1 and Q3; recall that this distance is the Interquartile Range (IQR).
• We call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
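The 1.5 × IQR rule can be applied to the weight data as a minimal Python sketch, using the quartiles from the earlier slide (Q1 = 127.5, Q3 = 185). The names `lower_fence`, `upper_fence`, and `is_suspected_outlier` are illustrative, not from the text.

```python
q1, q3 = 127.5, 185.0          # quartiles of the weight data
iqr = q3 - q1                  # interquartile range

lower_fence = q1 - 1.5 * iqr   # below this: suspected outlier
upper_fence = q3 + 1.5 * iqr   # above this: suspected outlier

def is_suspected_outlier(x):
    """True if x is more than 1.5 x IQR below Q1 or above Q3."""
    return x < lower_fence or x > upper_fence

print(lower_fence, upper_fence)
print(is_suspected_outlier(100), is_suspected_outlier(260))
```

By this rule neither the minimum (100) nor the maximum (260) of the weight data is a suspected outlier, even though 260 stands apart from the other values in the stemplot.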
Time Plots
• A time plot shows behavior over time.
• Time is always on the horizontal axis, and the variable being measured is on the vertical axis.
• Look for an overall pattern (trend), and deviations from this trend. Connecting the data points by lines may emphasize this trend.
• Look for patterns that repeat at known regular intervals (seasonal variations).
Class Make-up on First Day (Fall Semesters: 1985-1993)
[Figure: time plot of Percent of Class That Are Freshman (vertical axis, 0% to 70%) against Year of Fall Semester (horizontal axis, 1985 to 1993)]
Average Tuition (Public vs. Private)
55
Graphs for the Relation of Two Variables
• 1 categorical + 1 quantitative variable:
• 2 quantitative variables:
• 2 categorical variables: