Last Update
23rd February 2011
Introduction to Statistics
Florian Boehlandt
University of Stellenbosch Business School
Learning Objectives Part 1
1. Sampling (Random Sampling)
2. Sampling Error
3. Nonsampling Error
• Why? Cost!
• The sample proportions are used as an
estimate for the population proportions
• Examples:
– Nielsen ratings (1,000 television viewers)
– Quality Management (destroy items?)
• Target Population: the population about
which statisticians want to draw inferences
• Sampled Population: The actual population
from which the sample is taken
• The sample statistic is a good estimator of the
population parameter if target population =
sampled population
• Self-selected samples are almost always biased,
because individuals who participate are more
keenly interested in the issue than nonparticipants (SLOP = self-selected opinion poll)
Sampling Plan
• A simple random sample is a sample selected
in such a way that every possible sample with
the same number of observations is equally
likely to be chosen.
• A stratified random sample is obtained by
separating the population into mutually
exclusive sets (strata), and then drawing
simple random samples from each stratum.
Sampling Plan
• A cluster sample is a simple random sample of
groups or clusters of elements
Simple Random Sampling
• Concept: Raffles → each element of the chosen
population is assigned a unique number and then
‘drawn from a hat’
+ Social security numbers
+ Student numbers
– Telephone numbers
• A random number table / random number generator
(Excel: RAND) can be used to select sample numbers.
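The raffle idea can be sketched with a random number generator in Python instead of Excel. This is only an illustration: the sampling frame of 1,000 numbered elements, the sample size of 50 and the seed are assumptions, not figures from the lecture.

```python
import random

# Hypothetical sampling frame: every element gets a unique number 1..1000.
population = list(range(1, 1001))

# A fixed seed (an arbitrary choice) makes the draw reproducible.
rng = random.Random(2011)

# random.sample draws without replacement, so every possible set of
# 50 elements is equally likely to be chosen.
sample = rng.sample(population, k=50)

print(sorted(sample))
```

Because the draw is without replacement, no element can appear twice in the sample.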
Simple Random Sampling
• Example Tax Returns (Keller 2006: p. 148)
Stratified Random Sampling
• Concept: Increase the amount of information about
the population
• Examples of criteria for separating the population into
strata: Household Income
Stratified Random Sampling
• Example Proposed Tax Increase:
1. Draw random samples from four income groups
according to their proportions in the population
2. Make adjustments before making inferences about the
entire population
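Drawing from each stratum in proportion to its share of the population might look like the sketch below. The four group names, their sizes and the total sample size of 1,000 are hypothetical placeholders, not the income figures from the lecture's table.

```python
import random

# Hypothetical strata sizes (NOT the figures from the lecture's table).
strata_sizes = {"group 1": 2500, "group 2": 4000, "group 3": 3000, "group 4": 500}

# Build a mutually exclusive list of member ids for each stratum.
strata = {}
next_id = 0
for name, size in strata_sizes.items():
    strata[name] = list(range(next_id, next_id + size))
    next_id += size

total = sum(strata_sizes.values())
n = 1000  # overall sample size (assumed)
rng = random.Random(7)

# Draw a simple random sample from each stratum, sized by its population share.
sample = {}
for name, members in strata.items():
    n_h = round(n * len(members) / total)
    sample[name] = rng.sample(members, n_h)

for name, drawn in sample.items():
    print(name, len(drawn))
```

With proportional allocation each stratum contributes the same fraction of its members, here one in ten.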
[Table: Income (‘000s) against Population %, with income groups from Under 25 to Over 75]
Systematic Sampling
• Concept: sample members are chosen in a
regular manner working progressively through
the list
• Example Vega students:
500 students from Vega’s 8,500 enrolled
students: 8,500 / 500 = 17. Thus, every 17th
student would be selected
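The Vega example can be reproduced in a few lines. The random starting position is an added assumption (a common way to begin a systematic sample); the slide itself only states that every 17th student is selected.

```python
import random

population_size = 8500   # enrolled students
sample_size = 500        # students to select

# Sampling interval: 8,500 / 500 = 17.
step = population_size // sample_size

# Assumption: start at a random position within the first interval.
rng = random.Random(0)
start = rng.randrange(step)

# Positions (0-based) of the selected students on the list.
selected = list(range(start, population_size, step))

print(step, len(selected), selected[:5])
```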
Cluster Sampling
• Concept: Useful when it is difficult or costly to
develop a complete list of population members (i.e.
making it difficult to draw a simple random sample)
or when the population elements are widely
dispersed (geographically)
• Example: Each block within a city represents a
cluster. A sample of clusters could then be selected
and every household within these clusters is
questioned (sampling error? → sample size)
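A rough sketch of the city-block example: the clusters are sampled at random, and then every household in a chosen cluster is included. The number of blocks, households per block and clusters drawn are all hypothetical.

```python
import random

# Hypothetical city: 40 blocks (clusters) with 25 households each.
blocks = {b: [f"block{b}-house{h}" for h in range(25)] for b in range(40)}

rng = random.Random(1)

# Step 1: simple random sample of clusters, not of households.
chosen_blocks = rng.sample(sorted(blocks), k=5)

# Step 2: question every household inside the chosen clusters.
households = [h for b in chosen_blocks for h in blocks[b]]

print(chosen_blocks, len(households))
```

Only a list of blocks is needed up front; no complete list of households has to exist before sampling.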
Sampling Error
• Sampling error refers to the differences between the
sample and the population that exist because of the
observations that happened to be selected for the
sample. The value of the sample mean will deviate
from the population mean simply by chance
• The difference between the true (unknown) value of
the population mean μ and its estimate (the sample
mean x-bar) is the sampling error
• The only way to reduce the expected size of the sampling error is to
increase the sample size n
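A small simulation can illustrate how the sampling error behaves as n grows. The population below is artificial (10,000 values from a normal distribution with mean 50 and standard deviation 10); these figures and the sample sizes are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Artificial population; in practice mu is unknown.
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = statistics.fmean(population)

def mean_abs_error(n, repeats=200):
    """Average |x-bar - mu| over repeated simple random samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

small = mean_abs_error(25)
large = mean_abs_error(400)
print(f"n = 25:  average sampling error {small:.2f}")
print(f"n = 400: average sampling error {large:.2f}")
```

Individual samples still deviate by chance; it is the typical size of the deviation that shrinks with larger n.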
Nonsampling Error
• Nonsampling errors are due to mistakes made
in the acquisition of data or due to the sample
observations being selected improperly
• Nonsampling errors are more serious than
sampling errors, because taking a larger
sample won’t diminish the size, or possibility
of occurrence, of this error
Types of Nonsampling Error
• Errors in data acquisition: incorrect
measurements/responses, inaccurate recording
• Nonresponse error: refers to bias introduced when
responses are not obtained from some members of
the sample (not representative of target
population); self-administered surveys
• Selection bias: Some members of the target
population cannot possibly be included in the sample
(e.g. members have no phone)
Learning Objectives Part 2
4. Frequency Tables
5. Histograms
6. Class Intervals and Width
Frequency Tables – Data Types
• Interval Data → Class Intervals: Count the number of
observations that fall into each of a series of intervals
• Nominal Data / Ordinal Data: Count the number of times
each category of the variable occurs → Bar Chart
Frequency Tables – Data Types
• There are times when a data set contains a large
number of values (even when the data type is
nominal) that would result in a table with too many
rows to be convenient. We can overcome this
problem by grouping the data into fewer categories
or classes and then compiling a grouped frequency table.
Frequency Tables – Data Types
• Grouped Data → Class Intervals: Count the number of
observations that fall into each of a series of intervals
• Count the number of times each category of the
variable occurs → Bar Chart
Frequency Tables – Data Types
• Example 1: Coffee refills
Data type nominal; Data ungrouped → Categories
• Example 2: Class marks out of 100
Data type nominal; BUT: Data may be grouped →
Class intervals (approximately interval)
• Example 3: Waiting times at supermarket cashiers
Data type interval → Class intervals
Number of Categories
Nominal / not grouped:
1. Determine maximum and minimum observation
2. Define categories including all distinct (integer)
observations in between
Example tossing two dice:
Min: 2
Max: 12
Other possible outcomes: 3 4 5 6 7 8 9 10 11
(all outcomes accounted for)
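The two-dice categories can be derived mechanically from the minimum and maximum. As a stand-in for observed data, this sketch enumerates all 36 combinations of two fair dice, which is an assumption made only to have something to count.

```python
from collections import Counter
from itertools import product

# Stand-in data: the sum of every possible pair of faces (36 combinations).
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]

# Steps 1 and 2: categories are all integers from the minimum to the maximum.
categories = list(range(min(outcomes), max(outcomes) + 1))

# Frequency table: how often each category occurs.
counts = Counter(outcomes)
table = {c: counts[c] for c in categories}

for category, frequency in table.items():
    print(category, frequency)
```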
Number of Class Intervals
Interval or grouped data:
The more observations there are, the larger the number
of class intervals required. Sturges’ formula:
Number of class intervals = 1 + 3.3 log10(n) OR
Number of class intervals = 1 + 1.4 ln(n)
Example n = 50:
Number of class intervals = 1 + 3.3 log10(50) = 1 + 3.3 * 1.70 = 6.61 ≈ 7
Number of class intervals = 1 + 1.4 ln(50) = 1 + 1.4 * 3.91 = 6.48 ≈ 6
(The two versions can round differently because 1.4 is a rounded value of 3.3 / ln(10) ≈ 1.43.)
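Both versions of the formula are easy to check numerically. The function names below are made up for this sketch.

```python
import math

def sturges_log10(n):
    """Number of class intervals using the base-10 version."""
    return 1 + 3.3 * math.log10(n)

def sturges_ln(n):
    """Number of class intervals using the natural-log version."""
    return 1 + 1.4 * math.log(n)

n = 50
print(round(sturges_log10(n), 2), round(sturges_log10(n)))
print(round(sturges_ln(n), 2), round(sturges_ln(n)))
```

For n = 50 the first version rounds up to 7 and the second rounds down to 6; either is a reasonable starting point, since the result is only a guideline.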
Excursion Logarithms
• The logarithm of a number to a given base is the exponent to which the
base must be raised in order to produce that number. (Example: 10^1.70 = 50)
• The natural logarithm is the logarithm to the base e, where e is an
irrational constant approximately equal to 2.718. The natural logarithm of
a number x (written as ln(x)) is the power to which e would have to be
raised to equal x. (Example: e^3.91 = 50)
• The mathematical constant e (Euler’s number) is the unique real number
such that the value of the derivative d/dx (slope of the tangent line) of the
function f(x) = e^x at the point x = 0 is equal to 1. It is called the exponential function.
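The logarithm examples can be verified with a few lines of Python. The step size h used to approximate the slope is an arbitrary choice for this sketch.

```python
import math

# Raising the base to the logarithm recovers (approximately) the number.
power_of_ten = 10 ** 1.70     # close to 50, since log10(50) is about 1.70
power_of_e = math.exp(3.91)   # close to 50, since ln(50) is about 3.91

# Numerical slope of e^x at x = 0 (central difference), expected to be about 1.
h = 1e-5
slope_at_zero = (math.exp(h) - math.exp(-h)) / (2 * h)

print(round(power_of_ten, 2), round(power_of_e, 2), round(slope_at_zero, 6))
```

The results are only approximately 50 because 1.70 and 3.91 are rounded logarithms.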
Class Interval Width
Class width:
1. Subtract smallest observation from largest
2. Divide by number of classes (Sturges)
3. Round class width to convenient value
4. Select a lower limit so that the first class interval
contains the smallest observation. Determine all
other intervals consecutively by adding (multiples)
of the class width
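The four steps can be followed on a small data set. The 20 observations are invented, and rounding the width up to the next whole number is just one possible reading of "convenient value".

```python
import math

# Hypothetical observations (illustrative only, not from the lecture).
data = [3, 7, 12, 15, 18, 21, 22, 25, 27, 30,
        31, 33, 36, 38, 41, 44, 47, 52, 58, 64]

# Number of classes from Sturges' formula, rounded to a whole number.
k = round(1 + 3.3 * math.log10(len(data)))

# Steps 1 to 3: range divided by the number of classes, rounded up.
width = math.ceil((max(data) - min(data)) / k)

# Step 4: start at or below the smallest observation, then add the width.
lower = math.floor(min(data))
intervals = []
lo = lower
while lo <= max(data):
    intervals.append((lo, lo + width))  # interval covers lo up to, not including, lo + width
    lo += width

# Count the observations that fall into each interval.
counts = [0] * len(intervals)
for x in data:
    counts[int((x - lower) // width)] += 1

for (lo, hi), c in zip(intervals, counts):
    print(f"{lo} to under {hi}: {c}")
```

Treating each interval as "lower limit up to, but not including, upper limit" keeps the classes from overlapping.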
• Class Mark or Class Midpoint: Add the lower class
limit to the upper class limit and divide by two →
used to plot the frequency polygon
• Width of a class interval or class length: The
difference between the upper class limit and the
lower class limit. Usually, all classes are of equal
width / length (Sturges)
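Class marks and widths follow directly from the class limits. The three classes below are hypothetical.

```python
# Hypothetical class limits as (lower, upper) pairs.
classes = [(0, 10), (10, 20), (20, 30)]

# Class mark (midpoint): (lower limit + upper limit) / 2.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Class width: upper limit minus lower limit.
class_widths = [upper - lower for lower, upper in classes]

print(class_marks)
print(class_widths)
```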
• Class Boundaries: The class limits are stated in such a
way that there is no overlap between classes
Class Boundaries: The class limits are stated in such a way that
there is no overlap between classes. Limits are stated in this
manner so that there cannot be any doubt as to which class a
certain value (observation) is to be allocated. Since data is often
rounded, the true class limits are not the same as the stated
class limits.
Weights recorded to the nearest kilogram
Stated class interval: 60 – 62
True class interval or class boundaries: 59.5 – 62.5
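The step from stated limits to true class boundaries can be written as a small helper. The function name and its `unit` parameter (the precision to which the data were rounded) are introduced here for illustration.

```python
def class_boundaries(lower, upper, unit=1):
    """Widen stated class limits by half the rounding unit on each side."""
    half = unit / 2
    return (lower - half, upper + half)

# Weights recorded to the nearest kilogram, stated class 60 to 62.
boundaries = class_boundaries(60, 62, unit=1)
true_width = boundaries[1] - boundaries[0]

print(boundaries, true_width)
```

A weight of 59.6 kg is recorded as 60 and so belongs to this class, which is why the lower boundary sits at 59.5 rather than 60.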