STA 291-021 Summer 2007 - University of Kentucky

Lecture 3
Dustin Lueker

Simple Random Sampling (SRS)
◦ Each possible sample of the same size has the same probability of
being selected

Stratified Random Sampling
◦ The population can be divided into a set of non-overlapping
subgroups (the strata)
◦ SRSs are drawn from each stratum

Cluster Sampling
◦ The population can be divided into a set of non-overlapping
subgroups (the clusters)
◦ The clusters are then selected at random, and all individuals in the
selected clusters are included in the sample

Systematic Sampling
◦ Useful when the population is available as a list
◦ A value K is specified. Then one of the first K individuals is
selected at random, after which every Kth observation is included
in the sample
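
The four sampling schemes can be sketched in a few lines of Python. This is only an illustration under assumptions: the population of 100 numbered individuals, the two strata, the ten clusters, and all function names are hypothetical and not part of the lecture.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS: every subset of size n is equally likely."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a separate SRS from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]


def systematic_sample(population, k, rng):
    """Random start among the first k, then every k-th individual."""
    start = rng.randrange(k)
    return population[start::k]


rng = random.Random(291)                      # fixed seed for repeatability
population = list(range(1, 101))              # hypothetical list of 100 IDs
strata = {"low": population[:50], "high": population[50:]}
clusters = {c: population[c * 10:(c + 1) * 10] for c in range(10)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
syst = systematic_sample(population, 10, rng)
```

Note that the systematic sample is fully determined by its random starting point, which is why it needs the population as an ordered list.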
STA 291 Spring 2010 Lecture 2

Summarize data
◦ Condense the information from the dataset
 Graphs
 Table
 Numbers

Interval data
◦ Histogram

Nominal/Ordinal data
◦ Bar chart
◦ Pie chart
State          Murder Rate
Alabama        11.6
Alaska          9.0
Arizona         8.6
Arkansas       10.2
California     13.1
Colorado        5.8
Connecticut     6.3
Delaware        5.0
DC             78.5
Florida         8.9
Georgia        11.4
Hawaii          3.8
…              …
Difficult to see the “big picture” from these
numbers
◦ We want to try to condense the data


Frequency distribution: a listing of intervals of possible values for a
variable, together with a tabulation of the number of
observations in each interval.
Murder Rate    Frequency
0-2.9           5
3-5.9          16
6-8.9          12
9-11.9         12
12-14.9         4
15-17.9         0
18-20.9         1
>21             1
Total          51

Conditions for intervals
◦ Equal length
◦ Mutually exclusive
 Any observation can only fall into one interval
◦ Collectively exhaustive
 All observations fall into an interval

Rule of thumb:
◦ If you have n observations, then the number of
intervals should be approximately √n
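
As a rough sketch of how equal-length, mutually exclusive, and collectively exhaustive intervals work in practice, the snippet below sorts the twelve murder rates shown in the earlier listing into the lecture's intervals. The helper name `interval_index` is hypothetical, and treating every value of 21 or more as belonging to the last (open-ended) class is an assumption.

```python
# The twelve murder rates visible in the state listing.
rates = [11.6, 9.0, 8.6, 10.2, 13.1, 5.8, 6.3, 5.0, 78.5, 8.9, 11.4, 3.8]

labels = ["0-2.9", "3-5.9", "6-8.9", "9-11.9",
          "12-14.9", "15-17.9", "18-20.9", ">21"]


def interval_index(value, width=3.0, last=len(labels) - 1):
    """Map a value to exactly one interval (mutually exclusive).

    Intervals have equal length `width` and start at 0; anything beyond
    the closed intervals lands in the final open-ended class, so every
    observation is covered (collectively exhaustive).
    """
    return min(int(value // width), last)


counts = [0] * len(labels)
for rate in rates:
    counts[interval_index(rate)] += 1

for label, count in zip(labels, counts):
    print(f"{label:>8}  {count}")
```

Because each value is mapped to a single index, the counts always add up to the number of observations.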

Relative frequency for an interval
◦ Proportion of sample observations that fall in that
interval
 Sometimes percentages are preferred to relative
frequencies
Murder Rate    Frequency    Relative Frequency    Percentage
0-2.9           5           .10                    10
3-5.9          16           .31                    31
6-8.9          12           .24                    24
9-11.9         12           .24                    24
12-14.9         4           .08                     8
15-17.9         0           0                       0
18-20.9         1           .02                     2
>21             1           .02                     2
Total          51           1                     100
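
The relative frequencies and percentages can be reproduced from the frequency column alone. A minimal sketch, using the counts from the murder-rate table (the variable names are illustrative):

```python
# Frequencies from the murder-rate table.
counts = {
    "0-2.9": 5, "3-5.9": 16, "6-8.9": 12, "9-11.9": 12,
    "12-14.9": 4, "15-17.9": 0, "18-20.9": 1, ">21": 1,
}

n = sum(counts.values())                         # total number of observations

# Relative frequency = proportion of observations in the interval.
relative = {label: c / n for label, c in counts.items()}
# Percentage = relative frequency times 100.
percent = {label: 100 * p for label, p in relative.items()}

for label, c in counts.items():
    print(f"{label:>8}  {c:>3}  {relative[label]:.2f}  {percent[label]:.0f}")
```

The unrounded relative frequencies sum to exactly 1; the rounded two-decimal values in the table need not.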

Notice that we had to group the observations
into intervals because the variable is
measured on a continuous scale
◦ For discrete data, grouping may not be necessary
 Except when there are many categories

Intervals are sometimes called classes
◦ Class Cumulative Frequency
 Number of observations that fall in the class and in
smaller classes
◦ Class Relative Cumulative Frequency
 Proportion of observations that fall in the class and in
smaller classes
Murder Rate    Frequency    Relative     Cumulative    Relative Cumulative
                            Frequency    Frequency     Frequency
0-2.9           5           .10           5            .10
3-5.9          16           .31          21            .41
6-8.9          12           .24          33            .65
9-11.9         12           .24          45            .89
12-14.9         4           .08          49            .97
15-17.9         0           0            49            .97
18-20.9         1           .02          50            .99
>21             1           .02          51            1
Total          51           1            51            1
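
Cumulative frequencies are running totals of the class frequencies. The sketch below computes them from the counts in the table. One caveat: the relative cumulative column in the table appears to add up the already rounded relative frequencies, so dividing the cumulative counts by n directly gives slightly different values in some rows (for example 45/51 is about .88 rather than .89).

```python
from itertools import accumulate

# Class frequencies in order from the smallest to the largest class.
counts = [5, 16, 12, 12, 4, 0, 1, 1]

# Running total: observations in this class and in all smaller classes.
cumulative = list(accumulate(counts))
n = cumulative[-1]

# Proportion of observations in this class and in all smaller classes.
relative_cumulative = [c / n for c in cumulative]

for c, rc in zip(cumulative, relative_cumulative):
    print(f"{c:>3}  {rc:.3f}")
```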

Histogram: use the numbers from the frequency
distribution to create a graph
◦ Draw a bar over each interval; the height of the bar
represents the relative frequency for that interval
◦ Bars should be touching
 Equally extend the width of the bar at the upper and
lower limits so that the bars are touching.
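
One way to read "equally extend the width of the bar" is to split the gap between adjacent class limits in half. The sketch below does this for the closed murder-rate intervals; it is an interpretation rather than a prescribed procedure, and the open-ended last class is left out because it has no upper limit.

```python
# Stated class limits (lower, upper) for the closed intervals.
limits = [(0.0, 2.9), (3.0, 5.9), (6.0, 8.9), (9.0, 11.9),
          (12.0, 14.9), (15.0, 17.9), (18.0, 20.9)]

# Gap between one class's upper limit and the next class's lower limit.
gap = limits[1][0] - limits[0][1]
half = gap / 2

# Extend each bar by half the gap on both sides so neighbours touch.
# Rounding removes floating-point noise such as 2.9500000000000004.
bars = [(round(lo - half, 2), round(hi + half, 2)) for lo, hi in limits]

for left, right in bars:
    print(f"bar from {left} to {right}")
```

With these boundaries the right edge of each bar coincides with the left edge of the next one.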

Histogram: for interval (quantitative) data
Bar graph is almost the same, but for qualitative data

Difference:
◦ The bars are usually separated to emphasize that
the variable is categorical rather than quantitative
◦ For nominal variables (no natural ordering), order
the bars by frequency, except possibly for a
category “other” that is always last
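
The ordering rule for nominal variables can be sketched as a small sort. The commuting data and the function name `bar_order` below are made up purely for illustration.

```python
def bar_order(frequencies, last="Other"):
    """Order categories by decreasing frequency, keeping `last` at the end."""
    named = [cat for cat in frequencies if cat != last]
    ordered = sorted(named, key=lambda cat: frequencies[cat], reverse=True)
    if last in frequencies:
        ordered.append(last)
    return ordered


# Hypothetical nominal data: how students get to campus.
commute = {"Bus": 12, "Car": 40, "Other": 15, "Bicycle": 8, "Walk": 20}

order = bar_order(commute)
print(order)
```

"Other" stays last even though it is more frequent than "Bus" and "Bicycle".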

First Step
◦ Create a frequency distribution
Highest Degree Obtained    Frequency (Number of Employees)
Grade School                15
High School                200
Bachelor’s                 185
Master’s                    55
Doctorate                   70
Other                       25
Total                      550

Bar graph
◦ If the data is ordinal, classes are presented in the
natural ordering
[Figure: bar graph of Highest Degree Obtained; vertical axis from 0 to 250; bars for Grade School, High School, Bachelor's, Master's, Doctorate, Other]

Pie chart: the pie is divided into slices
◦ Area of each slice is proportional to the frequency
of each class
Highest Degree    Relative Frequency    Angle ( = Rel. Freq. x 360)
Grade School      15/550 = .027           9.72
High School       200/550 = .364        131.04
Bachelor’s        185/550 = .336        120.96
Master’s          55/550 = .1            36.0
Doctorate         70/550 = .127          45.72
Other             25/550 = .045          16.2
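
The angles in the table follow from the frequencies in two steps. The sketch below rounds each relative frequency to three decimals before multiplying by 360, which is an assumption made to match the table's figures.

```python
# Frequencies from the Highest Degree Obtained table.
counts = {
    "Grade School": 15, "High School": 200, "Bachelor's": 185,
    "Master's": 55, "Doctorate": 70, "Other": 25,
}

total = sum(counts.values())

angles = {}
for degree, count in counts.items():
    rel_freq = round(count / total, 3)   # rounded as in the table
    angles[degree] = rel_freq * 360      # slice angle in degrees

for degree, angle in angles.items():
    print(f"{degree:<13} {angle:7.2f}")
```

Because the relative frequencies are rounded first, the angles add up to 359.64 rather than exactly 360; using the unrounded proportions would remove that small gap.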
[Figure: pie chart of Highest Degree with slices for Grade School, High School, Bachelor's, Master's, Doctorate, Other]

Stem and leaf plot: write the observations ordered from smallest
to largest
◦ Looks like a histogram sideways
◦ Contains more information than a histogram,
because every single observation can be recovered
 Each observation is represented by a stem and a leaf
 Stem = leading digit(s)
 Leaf = final digit
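
A stem and leaf plot for whole numbers can be built by splitting each value into its leading digits and its final digit. The function below is a minimal sketch (its name and layout are illustrative); the sample values are the ages listed for the first twelve presidents later in these notes.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem and leaf plot for non-negative integers.

    Stem = all digits except the last, leaf = last digit.
    Stems with no observations are kept so gaps stay visible.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)

    low, high = min(leaves), max(leaves)
    width = len(str(high))
    lines = []
    for stem in range(low, high + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>{width}} | {digits}".rstrip())
    return lines


ages = [67, 90, 83, 85, 73, 80, 78, 79, 68, 71, 53, 65]
plot = stem_and_leaf(ages)
print("\n".join(plot))
```

Every original value can be read back from the plot, for example stem 7 with leaf 3 is 73.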
Stem  Leaf          #
  20  3             1
  19
  18
  17
  16
  15
  14
  13  135           3
  12  7             1
  11  334469        6
  10  2234          4
   9  08            2
   8  03469         5
   7  5             1
   6  034689        6
   5  0238          4
   4  46            2
   3  0144468999   10
   2  039           3
   1  67            2
      ----+----+----+----+

Useful for small data sets
◦ Fewer than 100 observations

Can also be used to compare groups
◦ Back-to-Back Stem and Leaf Plots, using the same
stems for both groups.
 Murder Rate Data from U.S. and Canada
◦ Note: it doesn’t really matter whether the smallest
stem is at top or bottom of the table
PRESIDENT     AGE    PRESIDENT     AGE    PRESIDENT     AGE
Washington     67    Fillmore       74    Roosevelt      60
Adams          90    Pierce         64    Taft           72
Jefferson      83    Buchanan       77    Wilson         67
Madison        85    Lincoln        56    Harding        57
Monroe         73    Johnson        66    Coolidge       60
Adams          80    Grant          63    Hoover         90
Jackson        78    Hayes          70    Roosevelt      63
Van Buren      79    Garfield       49    Truman         88
Harrison       68    Arthur         56    Eisenhower     78
Tyler          71    Cleveland      71    Kennedy        46
Polk           53    Harrison       67    Johnson        64
Taylor         65    McKinley       58    Nixon          81
                                          Ford           93
                                          Reagan         93

Stem    Leaf

Discrete data
◦ Frequency distribution

Continuous data
◦ Grouped frequency distribution
 Grouping intervals should be of the same length, but may be dictated
more by subject-matter considerations

Small data sets
◦ Stem and leaf plot

Interval data
◦ Histogram

Categorical data
◦ Bar chart
◦ Pie chart

Present large data sets concisely and coherently
Can replace a thousand words and still be clearly understood and comprehended
Encourage the viewer to compare two or more variables
Do not replace substance by form
Do not distort what the data reveal

Features of misleading graphics:
Don’t have a scale on the axis
Have a misleading caption
Distort by using absolute values where relative/proportional values are more appropriate
Distort by stretching/shrinking the vertical or horizontal axis
Use bar charts with bars of unequal width

Frequency distributions and histograms exist for the population as well as for the sample
Population distribution vs. sample distribution
As the sample size increases, the sample distribution looks more and more like the population distribution
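
A small simulation can illustrate how the sample distribution approaches the population distribution. The population here is an assumption chosen for simplicity: rolls of a fair six-sided die, where each face has population relative frequency 1/6. The function name and seed are arbitrary.

```python
import random
from collections import Counter


def sample_distribution(n, rng):
    """Relative frequency of each face in n simulated rolls of a fair die."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    counts = Counter(rolls)
    return {face: counts[face] / n for face in range(1, 7)}


rng = random.Random(2010)          # fixed seed so the run is repeatable

# Largest gap between sample and population relative frequency.
for n in (60, 600, 60_000):
    dist = sample_distribution(n, rng)
    worst = max(abs(p - 1 / 6) for p in dist.values())
    print(f"n = {n:>6}: largest deviation from 1/6 = {worst:.4f}")

large = sample_distribution(100_000, rng)
```

For large n the deviations are typically much smaller than for small n, although any single run can fluctuate.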

The population distribution for a continuous
variable is usually represented by a smooth curve
◦ Like a histogram that gets finer and finer
 Similar to the idea of using smaller and smaller rectangles to
calculate the area under a curve when learning how to
integrate

Symmetric distributions:
◦ Bell-shaped
◦ U-shaped
◦ Uniform

Not symmetric distributions:
◦ Left-skewed
◦ Right-skewed
◦ Skewed
[Figure: example curves of a symmetric, a right-skewed, and a left-skewed distribution]