Graphical Descriptive Techniques
Economics 173
Business Statistics
Lectures 1 & 2
Summer, 2001
Professor J. Petry
1
Introduction
• The purpose of statistics is to extract information from data
– “without data, ours is just another opinion”
– “without statistics, we are just another person on data
overload”
• Because of its broad usage across disciplines, Statistics is
probably the most useful course irrespective of major.
– More data, properly analyzed, allows for better decisions in
personal as well as professional life
– Applicable in nearly all areas of business as well as social
sciences
– Greatly enhances credibility
2
Statistics as “Tool Chest”
• Different types of data allow different types of
analysis
• Quantitative data
– values are real numbers, arithmetic calculations are valid
• Qualitative data
– categorical data, values are arbitrary names of possible categories,
calculations involve how many observations in each category
• Ranked data
– categorical data, values must represent the ranked order of
responses, calculations are based on an ordering process.
• Time series data
– data collected across different points of time
• Cross-sectional data
– data collected at a certain point in time
3
Statistics as “Tool Chest”
• Different objectives call for alternative tool usage
– Describe a single population
– Compare two populations
– Compare two or more populations
– Analyze relationship between two variables
– Analyze relationship among two or more variables
• By conclusion of Econ 172 & 173, you will have
about 35 separate tools to select from depending
upon your data type and objective
4
Problem objective?
– Describe a single population
– Compare two populations
– Compare two or more populations
– Analyze relationships between two variables
– Analyze relationships among two or more variables
5
Describe a single population
Data type?
• Quantitative: type of descriptive measurement?
– Central location: t-test & estimator of μ
– Variability: χ²-test & estimator of σ²
• Qualitative: number of categories?
– Two: Z-test & estimator of p
– Two or more: χ² goodness-of-fit test
6
Compare two populations
Data type?
• Quantitative: type of descriptive measurement?
– Central location: depends on the experimental design (continued on the next slide)
– Variability: F-test & estimator of σ₁²/σ₂²
• Ranked: experimental design?
– Independent samples: Wilcoxon rank sum test
– Matched pairs: Sign test
• Qualitative: number of categories?
– Two: Z-test & estimator of p₁ - p₂
– Two or more: χ²-test of a contingency table
7
Continue: experimental design?
• Independent samples: population distribution?
– Normal, equal population variances: t-test & estimator of μ₁ - μ₂ (equal variances)
– Normal, unequal population variances: t-test & estimator of μ₁ - μ₂ (unequal variances)
– Nonnormal: Wilcoxon rank sum test
• Matched pairs: distribution of differences?
– Normal: t-test & estimator of μD
– Nonnormal: Wilcoxon signed rank sum test
8
Compare two or more populations
Data type?
• Quantitative: experimental design?
– Independent samples, normal population distribution: ANOVA (independent samples)
– Independent samples, nonnormal population distribution: Kruskal-Wallis test
– Blocks, normal population distribution: ANOVA (randomized blocks)
– Blocks, nonnormal population distribution: Friedman test
• Ranked: experimental design?
– Independent samples: Kruskal-Wallis test
– Blocks: Friedman test
• Qualitative: χ²-test of a contingency table
9
Analyze relationship between two variables
Data type?
• Quantitative: population distribution?
– Error is normal, or x and y are bivariate normal: simple linear regression and correlation
– x and y are not bivariate normal: Spearman rank correlation
• Ranked: Spearman rank correlation
• Qualitative: χ²-test of a contingency table
Analyze relationship among two or more variables
Data type?
• Quantitative: multiple regression
• Ranked: not covered
• Qualitative: not covered
10
Numerical Descriptive Measures
• Measures of central location
– arithmetic mean, median, mode, (geometric mean)
• Measures of variability
– range, variance, standard deviation, coefficient of
variation
• Measures of association
– covariance, coefficient of correlation
11
Measures of Central Location
Arithmetic mean
– This is the most popular and useful measure of
central location
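$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$
Sample mean (n is the sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean (N is the population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$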
12
• Example
The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is given by
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5$$
• Example
Calculate the mean of 212, -46, 52, -14, 66
Answer: (212 - 46 + 52 - 14 + 66)/5 = 270/5 = 54
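The two mean calculations above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `arithmetic_mean` is made up for illustration, and the standard-library `statistics.mean` is shown next to it for comparison.

```python
from statistics import mean

# Data from the two examples on the slide
sample = [7, 3, 9, -2, 4, 6]
second = [212, -46, 52, -14, 66]


def arithmetic_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


print(arithmetic_mean(sample))  # 4.5
print(arithmetic_mean(second))  # 54.0
print(mean(sample))             # 4.5, same result from the standard library
```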
13
The median
– The median of a set of measurements is the value
that falls in the middle when the measurements are
arranged in order of magnitude.
Example 4.4
Seven employee salaries were recorded (in $1000s): 28, 60, 26, 32, 30, 26, 29.
Find the median salary.
– Odd number of observations. First, sort the salaries. Then, locate the value in the middle:
26, 26, 28, 29, 30, 32, 60
The median is 29.
Suppose one employee’s salary of $31,000 was added to the group recorded before.
Find the median salary.
– Even number of observations. First, sort the salaries. Then, locate the values in the middle:
26, 26, 28, 29, 30, 31, 32, 60
There are two middle values, 29 and 30. The median is their average, 29.5.
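The sort-then-pick-the-middle procedure for the salary example can be sketched in Python as follows. The function name `median_by_sorting` is a hypothetical helper introduced here; `statistics.median` from the standard library gives the same answers.

```python
from statistics import median

salaries = [28, 60, 26, 32, 30, 26, 29]   # in $1000s, from Example 4.4
salaries_plus_one = salaries + [31]       # after adding the $31,000 salary


def median_by_sorting(values):
    """Sort the values, then take the middle one (odd count)
    or the average of the two middle ones (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_sorting(salaries))           # 29
print(median_by_sorting(salaries_plus_one))  # 29.5
```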
14
The mode
– The mode of a set of measurements is the value that
occurs most frequently.
– A set of data may have one mode (or modal class), or
two or more modes.
[Figure: a distribution with the modal class marked]
15
– Example
The manager of a men’s store observes the waist
size (in inches) of trousers sold yesterday: 31, 34,
36, 33, 28, 34, 30, 34, 32, 40.
• What is the modal value?
34
This information seems valuable
(for example, for the design of a
new display in the store), much
more than “the mean is 33.2 in.”.
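As a quick check of the waist-size example, the sketch below uses the Python standard library to find the mode and, for comparison, the mean and median of the same ten values. It assumes Python 3.8 or later, where `statistics.multimode` is available.

```python
from statistics import mean, median, multimode

# Waist sizes (in inches) of trousers sold yesterday
waist = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

modes = multimode(waist)       # list of the most frequent value(s)
waist_mean = mean(waist)       # arithmetic mean of the ten sizes
waist_median = median(waist)   # average of the two middle sizes

print(modes)         # [34]
print(waist_mean)    # 33.2
print(waist_median)  # 33.5
```

`multimode` returns a list, so it also handles data sets with two or more modes.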
16
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean,
median and mode coincide
• If a distribution is non symmetrical, and
skewed to the left or to the right, the three
measures differ.
A positively skewed distribution (“skewed to the right”): typically Mode < Median < Mean
17
A negatively skewed distribution (“skewed to the left”): typically Mean < Median < Mode
18
• Example
A professor of statistics wants to report the results of a midterm
exam, taken by 100 students. He calculates the mean, median, and
mode using Excel. Describe the information Excel provides.
Excel results (Marks):
Mean                 73.98
Standard Error       2.1502163
Median               81
Mode                 84
Standard Deviation   21.502163
Sample Variance      462.34303
Kurtosis             0.3936606
Skewness             -1.073098
Range                89
Minimum              11
Maximum              100
Sum                  7398
Count                100
– The mean provides information about the overall performance level
of the class. It can serve as a tool for making comparisons with other
classes and/or other exams.
– The median indicates that half of the class received a grade below
81%, and half of the class received a grade above 81%.
– The mode must be used when data is qualitative. If marks are
classified by letter grade, the frequency of each grade can be
calculated. Then, the mode becomes a logical measure to compute.
19
Measures of variability
(Looking beyond the average)
• Measures of central location fail to tell the whole story
about the distribution.
• A question of interest still remains unanswered:
How typical is the average value of all
the measurements in the data set?
or
How spread out are the measurements
about the average value?
20
Observe two hypothetical data sets
– Low variability data set: the average value provides
a good representation of the values in the data set.
– High variability data set: the same average value does not
provide as good a representation of the values in the data set as before.
21
The range
– The range of a set of measurements is the difference
between the largest and smallest measurements.
– Its major advantage is the ease with which it can be
computed.
– Its major shortcoming is its failure to provide
information on the dispersion of the values between
the two end points.
[Figure: a line from the smallest measurement to the largest measurement, with the range marked and question marks in between]
But, how do all the measurements spread out?
The range cannot assist in answering this question.
22
The variance
– This measure of dispersion reflects the values of all
the measurements.
– The variance of a population of N measurements
x1, x2, …, xN having a mean μ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
– The variance of a sample of n measurements
x1, x2, …, xn having a mean x̄ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
23
Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B
are much more dispersed than those in A.
Thus, a measure of dispersion is needed that agrees with this
observation.
Let us start by calculating the sum of deviations.
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero in both cases,
therefore, another measure is needed.
24
The sum of squared deviations
is used in calculating the variance.
25
Let us calculate the variance of the two populations
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$
Why is the variance defined as
the average squared deviation?
Why not use the sum of squared
deviations as a measure of
dispersion instead?
After all, the sum of squared
deviations increases in
magnitude when the dispersion
of a data set increases!
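The population variances of A (8, 9, 10, 11, 12) and B (4, 7, 10, 13, 16) can be reproduced with a short Python sketch. The helper `population_variance` is an illustrative name that follows the definition directly (average squared deviation from the mean); `statistics.pvariance` is the standard-library equivalent.

```python
from statistics import pvariance

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def population_variance(values):
    """Average squared deviation from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


print(population_variance(population_a))  # 2.0
print(population_variance(population_b))  # 18.0
print(pvariance(population_a))            # 2, standard-library check
```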
26
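Which data set has a larger dispersion?
Data set A: the value 1 five times and the value 3 five times (mean 2)
Data set B: the values 1 and 5 (mean 3)
Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets
$$\text{Sum}_A = \underbrace{(1-2)^2 + \dots + (1-2)^2}_{5 \text{ times}} + \underbrace{(3-2)^2 + \dots + (3-2)^2}_{5 \text{ times}} = 10$$
$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$
The sum of squared deviations ranks A above B only because A has more observations.
However, when calculated on a “per observation” basis (variance),
the data set dispersions are properly ranked:
$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$
$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$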
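The point of this slide can be illustrated with a small sketch, assuming (as the sums suggest) that data set A holds the value 1 five times and the value 3 five times, and that data set B holds just 1 and 5. The function name `sum_squared_deviations` is hypothetical.

```python
data_a = [1] * 5 + [3] * 5   # ten observations, mean 2
data_b = [1, 5]              # two observations, mean 3


def sum_squared_deviations(values):
    """Sum of squared deviations from the mean (not divided by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)


sum_a = sum_squared_deviations(data_a)
sum_b = sum_squared_deviations(data_b)

# The raw sums rank A above B only because A has more observations.
print(sum_a, sum_b)  # 10.0 8.0

# Dividing by the number of observations gives the population variance,
# which ranks B as the more dispersed data set.
var_a = sum_a / len(data_a)
var_b = sum_b / len(data_b)
print(var_a, var_b)  # 1.0 4.0
```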
27
– Example
Find the mean and the variance of the following
sample of measurements (in years).
3.4, 2.5, 4.1, 1.2, 2.8, 3.7
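– Solution
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$$
A shortcut formula:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
$$s^2 = \frac{1}{5}\left[(3.4^2 + 2.5^2 + \dots + 3.7^2) - \frac{(17.7)^2}{6}\right] = 1.075 \ \text{(years}^2\text{)}$$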
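The shortcut formula for the sample variance can be verified numerically. The sketch below is illustrative only: `sample_variance_shortcut` is a made-up helper, and `statistics.variance` (which uses the n - 1 divisor) serves as an independent check.

```python
from statistics import variance

sample = [3.4, 2.5, 4.1, 1.2, 2.8, 3.7]  # measurements in years


def sample_variance_shortcut(values):
    """s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


s2_shortcut = sample_variance_shortcut(sample)
s2_library = variance(sample)

print(round(s2_shortcut, 3))  # 1.075
print(round(s2_library, 3))   # 1.075
```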
28
– The standard deviation of a set of measurements is the
square root of the variance of the measurements.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
– Example
Rates of return over the past 10 years for two mutual
funds are shown below. Which one has a higher level of
risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
29
– Solution
– Let’s use the Excel printout that is run from the
“Descriptive statistics” sub-menu
                     Fund A    Fund B
Mean                 16        12
Standard Error       5.295     3.152
Median               14.6      11.75
Mode                 #N/A      #N/A
Standard Deviation   16.74     9.969
Sample Variance      280.3     99.37
Kurtosis             -1.34     -0.46
Skewness             0.217     0.107
Range                49.1      30.6
Minimum              -6.2      -2.8
Maximum              42.9      27.8
Sum                  160       120
Count                10        10
Fund A should be considered
riskier because its standard
deviation is larger
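The comparison of the two funds can be reproduced outside Excel. The sketch below assumes the last Fund A return is 30.5 and the last Fund B return is 11.4, since those values reproduce the sums (160 and 120) and sample variances reported in the printout. It uses `statistics.stdev`, the sample standard deviation with the n - 1 divisor.

```python
from statistics import stdev

# Rates of return over 10 years (last values assumed, see note above)
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

sd_a = stdev(fund_a)  # sample standard deviation of Fund A
sd_b = stdev(fund_b)  # sample standard deviation of Fund B

# A larger standard deviation is read here as a higher level of risk.
riskier = "Fund A" if sd_a > sd_b else "Fund B"

print(round(sd_a, 2), round(sd_b, 3))  # 16.74 9.969
print(riskier)                         # Fund A
```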
30
The coefficient of variation
– The coefficient of variation of a set of measurements
is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = \dfrac{s}{\bar{x}}$
Population coefficient of variation: $CV = \dfrac{\sigma}{\mu}$
– This coefficient provides a proportionate measure of
variation.
A standard deviation of 10 may be perceived
as large when the mean value is 100, but only
moderately large when the mean value is 500
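The remark about a standard deviation of 10 can be made concrete with a two-line calculation. This is a minimal sketch and the function name `coefficient_of_variation` is chosen here for illustration.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation divided by the mean value."""
    return std_dev / mean_value


cv_small_mean = coefficient_of_variation(10, 100)  # mean value 100
cv_large_mean = coefficient_of_variation(10, 500)  # mean value 500

print(cv_small_mean)  # 0.1
print(cv_large_mean)  # 0.02
```

The same standard deviation is 10% of the mean in the first case but only 2% of the mean in the second.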
31
Interpreting Standard Deviation
• The standard deviation can be used to
– compare the variability of several distributions
– make a statement about the general shape of a
distribution.
• The empirical rule: If a sample of measurements has
a mound-shaped distribution, the interval
– $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
– $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
– $(\bar{x} - 3s, \bar{x} + 3s)$ contains virtually all of the measurements
32
– Example
The durations of 30 long-distance telephone calls are
shown next. Check the empirical rule for this set
of measurements.
• Solution
First check if the histogram has an approximate
mound shape
[Histogram of call durations: frequency axis from 0 to 10; bins 2, 5, 8, 11, 14, 17, 20, More]
33
• Calculate the mean and the standard deviation:
Mean = 10.26; Standard deviation = 4.29.
• Calculate the intervals:
$(\bar{x} - s, \bar{x} + s) = (10.26 - 4.29, 10.26 + 4.29) = (5.97, 14.55)$
$(\bar{x} - 2s, \bar{x} + 2s) = (1.68, 18.84)$
$(\bar{x} - 3s, \bar{x} + 3s) = (-2.61, 23.13)$
Interval          Empirical rule   Actual percentage
5.97, 14.55       68%              70%
1.68, 18.84       95%              96.7%
-2.61, 23.13      100%             100%
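The three intervals can be computed from the reported mean and standard deviation with a short Python sketch. The 30 call durations themselves are not reproduced in this transcript, so `share_within` is only a hypothetical helper showing how the "actual percentage" column could be obtained from raw data.

```python
def empirical_intervals(mean_value, std_dev):
    """Intervals mean ± k·s for k = 1, 2, 3, rounded to two decimals."""
    return [
        (round(mean_value - k * std_dev, 2), round(mean_value + k * std_dev, 2))
        for k in (1, 2, 3)
    ]


def share_within(values, mean_value, half_width):
    """Fraction of observations inside [mean - half_width, mean + half_width]."""
    inside = [x for x in values
              if mean_value - half_width <= x <= mean_value + half_width]
    return len(inside) / len(values)


intervals = empirical_intervals(10.26, 4.29)
for low, high in intervals:
    print(low, high)
# 5.97 14.55
# 1.68 18.84
# -2.61 23.13
```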
34
Measures of Association
• Two numerical measures are presented to describe
the linear relationship between two
variables depicted in a scatter diagram.
– Covariance: is there any pattern to the way two
variables move together?
– Correlation coefficient: how strong is the linear
relationship between two variables?
35
The covariance
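Population covariance:
$$COV(X,Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
μx (μy) is the population mean of the variable X (Y).
N is the population size. n is the sample size.
Sample covariance:
$$cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
x̄ (ȳ) is the sample mean of the variable X (Y).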
36
• If the two variables move in the same direction
(both increase or both decrease), the covariance
is a large positive number.
• If the two variables move in two opposite
directions, (one increases when the other one
decreases), the covariance is a large negative
number.
• If the two variables are unrelated, the covariance
will be close to zero.
37
The coefficient of correlation
Population coefficient of correlation:
$$\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$$
Sample coefficient of correlation:
$$r = \frac{cov(X,Y)}{s_x s_y}$$
– This coefficient answers the question: How strong is
the association between X and Y?
38
Scale of ρ (or r):
– +1: strong positive linear relationship
– between 0 and +1: COV(X,Y) > 0
– 0: no linear relationship, COV(X,Y) = 0
– between -1 and 0: COV(X,Y) < 0
– -1: strong negative linear relationship
39
• If the two variables are very strongly positively
related, the coefficient value is close to +1
(strong positive linear relationship).
• If the two variables are very strongly negatively
related, the coefficient value is close to -1 (strong
negative linear relationship).
• No straight line relationship is indicated by a
coefficient close to zero.
40
– Example
Compute the covariance and the coefficient of
correlation to measure how advertising expenditure
and sales level are related to one another.
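Advert   Sales
1        30
3        40
5        40
4        50
2        35
5        50
3        35
2        25
Shortcut formulas
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}$$
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$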
41
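• Use the procedure below to obtain the required summations
Month   x    y     xy     x²    y²
1       1    30    30     1     900
2       3    40    120    9     1600
3       5    40    200    25    1600
4       4    50    200    16    2500
5       2    35    70     4     1225
6       5    50    250    25    2500
7       3    35    105    9     1225
8       2    25    50     4     625
Sum     25   305   1025   93    12175
$$cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right] = \frac{1}{7}\left[1025 - \frac{25 \times 305}{8}\right] = 10.268$$
$$s_x^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{7}\left[93 - \frac{25^2}{8}\right] = 2.125$$
$$s_x = \sqrt{2.125} = 1.458$$
Similarly, sy = 8.839
$$r = \frac{cov(X,Y)}{s_x s_y} = \frac{10.268}{1.458 \times 8.839} = .797$$
42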
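The summations and shortcut formulas for the advertising example can be checked with a short Python sketch. Variable names such as `cov_xy` and `var_x` are chosen here for illustration; only the standard library is used.

```python
from math import sqrt

advert = [1, 3, 5, 4, 2, 5, 3, 2]          # x: advertising expenditure
sales = [30, 40, 40, 50, 35, 50, 35, 25]   # y: sales level
n = len(advert)

# Required summations
sum_x = sum(advert)                                # 25
sum_y = sum(sales)                                 # 305
sum_xy = sum(x * y for x, y in zip(advert, sales)) # 1025
sum_x2 = sum(x * x for x in advert)                # 93
sum_y2 = sum(y * y for y in sales)                 # 12175

# Shortcut formulas with the n - 1 divisor (sample statistics)
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# Coefficient of correlation
r = cov_xy / (sqrt(var_x) * sqrt(var_y))

print(round(cov_xy, 3))                           # 10.268
print(var_x, var_y)                               # 2.125 78.125
print(round(sqrt(var_x), 3), round(sqrt(var_y), 3))  # 1.458 8.839
print(round(r, 3))                                # 0.797
```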
• Excel printout
Covariance matrix
             Advertsmnt   Sales
Advertsmnt   2.125
Sales        10.2679      78.125
Correlation matrix
             Advertsmnt   Sales
Advertsmnt   1
Sales        0.7969       1
• Interpretation
– The covariance (10.2679) indicates that
advertisement expenditure and sales level are
positively related
– The coefficient of correlation (.797) indicates that
there is a strong positive linear relationship between
advertisement expenditure and sales level.
43