Graphical Excellence
Review of Measures of Central
Tendency, Dispersion & Association
• Graphical Excellence
• Measures of Central Tendency
– Mean, Median, Mode
• Measures of Dispersion
– Variance, Standard Deviation, Range
• Measures of Association
– Covariance, Correlation Coefficient
• Relationship of basic stats to OLS
Graphical Excellence
• Learning from Monkeys
Why Graphs & Stats?
• Graphs and descriptive statistics, when used
properly, can summarize many lines of data effectively
for the reader. What's a good approximation of
the age of students in this class?
• We use graphs and basic stats (mean, variance,
covariance, etc.) to highlight trends and to motivate
the research question.
• We use other tools for analysis – regression, case
study, content analysis, etc.
What story does this graph tell?
What questions does the graph raise?
[Figure: line chart "HHI Over Time"; x-axis: Year, 1970 to 2000; y-axis: HHI index, 0.034 to 0.048]
Graphical Excellence
• The graph presents large data sets concisely and
coherently – label your axes
• The ideas and concepts to be delivered are clearly
understood by the viewer – state the units used
(e.g., $ or $ in millions)
What’s the problem here?
[Figure: bar chart comparing 1950 and 1991; vertical scale 0.00 to 90.00; data labels 77.67, 43.51, 26.09, 30.50; no axis titles or units are shown]
Graphical Excellence
• The display induces the viewer to address
the substance of the data and not the form of
the graph. – Select the appropriate type of
graph (bar chart for levels, scatter plot
for trends etc.)
• There is no distortion of what the data
reveal. – Make sure the axes are not
stretched or compressed to make a point
Do New Stadiums Bring People in?
[Figure: bar chart of average annual attendance in millions, y-axis 0 to 3.5; MLB Average 2.18, New Stadiums 3.01, Old Stadiums 2.04]
Do New Stadiums Bring People in?
[Figure: bar chart of average annual attendance in millions, y-axis 0 to 20; Old Stadiums 2.04, MLB Average 2.18, New Stadiums 3.01]
Things to be cautious about when
observing a graph:
– Is a scale missing on one axis?
– Do not be influenced by a graph's caption.
– Are changes presented in absolute values only,
or in percent form too?
Numerical Descriptive Measures
• Measures of Central Tendency
– Mean, Median, Mode
• Measures of Dispersion
– Variance, Standard Deviation
• Measures of Association
– Covariance, Correlation Coefficient
Arithmetic mean
– This is the most popular and useful measure of
central location
Mean = (Sum of the measurements) / (Number of measurements)
Sample mean (n is the sample size):
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean (N is the population size):
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
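• Example 1
The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is given by
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{x_1 + x_2 + \cdots + x_6}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5$
• Example 2
Suppose the telephone bills of example 2.1 represent a population
of measurements. The population mean is
$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \cdots + x_{200}}{200} = \frac{42.19 + 15.30 + \cdots + 53.21}{200} = 43.59$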
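The arithmetic in Example 1 can be checked with a minimal Python sketch. The helper name `arithmetic_mean` is illustrative only and does not come from the slides; the standard library's `statistics.mean` is used as a cross-check.

```python
from statistics import mean

# The six measurements from Example 1.
sample = [7, 3, 9, -2, 4, 6]


def arithmetic_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(sample)
library_mean = mean(sample)  # standard-library cross-check
print(sample_mean, library_mean)
```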
• Example 3
When many of the measurements have the same value, the
measurements can be summarized in a frequency table. Suppose
the number of children in a sample of 16 employees was recorded
as follows:
NUMBER OF CHILDREN     0   1   2   3
NUMBER OF EMPLOYEES    3   4   7   2   (16 employees)
$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \cdots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$
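When data arrive as a frequency table, the mean is a frequency-weighted sum. The following is a small Python sketch of that idea for Example 3; the dictionary layout and the name `mean_from_frequencies` are assumptions made for illustration.

```python
# Frequency table from Example 3: number of children -> number of employees.
children_counts = {0: 3, 1: 4, 2: 7, 3: 2}


def mean_from_frequencies(freq):
    """Weight each value by how often it occurs, then divide by the total count."""
    total_observations = sum(freq.values())
    weighted_sum = sum(value * count for value, count in freq.items())
    return weighted_sum / total_observations


total_employees = sum(children_counts.values())
freq_mean = mean_from_frequencies(children_counts)
print(total_employees, freq_mean)
```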
The median
– The median of a set of measurements is the value that
falls in the middle when the measurements are arranged
in order of magnitude.
Example 4
Seven employee salaries were recorded
(in 1000s): 28, 60, 26, 32, 30, 26, 29.
Find the median salary.
Odd number of observations:
First, sort the salaries: 26, 26, 28, 29, 30, 32, 60.
Then, locate the value in the middle: the median is 29.
Suppose one employee's salary of $31,000
was added to the group recorded before.
Find the median salary.
Even number of observations:
First, sort the salaries: 26, 26, 28, 29, 30, 31, 32, 60.
Then, locate the values in the middle. There are two middle values,
29 and 30, so the median is their average, 29.5.
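The sort-then-pick-the-middle procedure of Example 4 can be sketched in Python as follows. The function name `median_by_sorting` is hypothetical; `statistics.median` from the standard library applies the same rule and is included for comparison.

```python
from statistics import median

# Salaries in $1000s from Example 4.
salaries = [28, 60, 26, 32, 30, 26, 29]


def median_by_sorting(values):
    """Sort, then take the middle value (odd n) or average the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_by_sorting(salaries)            # seven observations
even_case = median_by_sorting(salaries + [31])    # eight observations
print(odd_case, even_case)
```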
The mode
– The mode of a set of measurements is the value
that occurs most frequently.
– A set of data may have one mode (or modal
class), or two or more modes.
The modal class
For large data sets
the modal class is
much more relevant
than a single-value mode.
– Example 5
• The manager of a men's store observes the waist
size (in inches) of trousers sold yesterday: 31, 34,
36, 33, 28, 34, 30, 34, 32, 40.
• The mode of this data set is 34 in.
This information seems valuable
(for example, for the design of a
new display in the store), much
more than "the mean is 33.2 in.".
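For the waist sizes in Example 5, all three measures of central tendency can be computed side by side. This is a minimal sketch using Python's standard `statistics` module; running it shows that 33.2 is the arithmetic mean of these ten values, while the median works out to 33.5.

```python
from statistics import mean, median, mode

# Waist sizes (inches) of trousers sold, from Example 5.
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

waist_mode = mode(waist_sizes)      # most frequent value
waist_mean = mean(waist_sizes)      # sum divided by count
waist_median = median(waist_sizes)  # average of the two middle values (n is even)
print(waist_mode, waist_mean, waist_median)
```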
Relationship among Mean, Median, and
Mode
• If a distribution is symmetrical, the mean,
median and mode coincide
• If a distribution is non-symmetrical, and
skewed to the left or to the right, the three
measures differ.
[Figure: a positively skewed distribution ("skewed to the right") with the mode, median, and mean marked]
[Figure: a negatively skewed distribution ("skewed to the left") with the mean, median, and mode marked]
Measures of variability
(Looking beyond the average)
• Measures of central location fail to tell the
whole story about the distribution.
• A question of interest still remains unanswered:
How typical is the average value of all
the measurements in the data set?
or
How spread out are the measurements
about the average value?
Observe two hypothetical data sets
Low variability data set:
The average value provides
a good representation of the
values in the data set.
High variability data set:
The same average value does not
provide as good a representation of the
values in the data set as before.
The range
– The range of a set of measurements is the
difference between the largest and smallest
measurements.
– Its major advantage is the ease with which it
can be computed.
– Its major shortcoming is its failure to provide
information on the dispersion of the values
between the two end points.
[Figure: the range drawn as the span from the smallest measurement to the largest measurement]
But, how do all the measurements spread out?
The range cannot assist in answering this question.
The variance
– This measure of dispersion reflects the values of all the
measurements.
– The variance of a population of N measurements
$x_1, x_2, \ldots, x_N$ having a mean $\mu$ is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ (Excel uses the Varp formula)
– The variance of a sample of n measurements
$x_1, x_2, \ldots, x_n$ having a mean $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (Excel uses the Var formula)
Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B
are much more dispersed than those in A.
Thus, a measure of dispersion is needed that agrees with this
observation.
Let us start by calculating the sum of deviations.
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero in both cases,
therefore, another measure is needed.
The sum of squared deviations
is used in calculating the variance.
See example next.
Let us calculate the variance of the two populations
$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
Why is the variance defined as
the average squared deviation?
Why not use the sum of squared
deviations as a measure of
dispersion instead?
After all, the sum of squared
deviations increases in
magnitude when the dispersion
of a data set increases!!
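The two-population comparison can be reproduced with a short Python sketch. The function names are illustrative; `statistics.pvariance` is the standard-library population variance and is used here only as a cross-check of the hand-written formula.

```python
from statistics import pvariance

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def sum_of_deviations(values):
    """Deviations from the mean always cancel out, so this is zero."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)


def population_variance(values):
    """Average squared deviation from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


var_a = population_variance(population_a)
var_b = population_variance(population_b)
print(sum_of_deviations(population_a), sum_of_deviations(population_b))
print(var_a, var_b)
```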
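Which data set has a larger dispersion?
[Figure: data set A has five observations at 1 and five at 3 (mean 2); data set B has one observation at 1 and one at 5 (mean 3)]
Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets
$Sum_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$ (each term appears 5 times)
$Sum_B = (1-3)^2 + (5-3)^2 = 8$
The sum for A is larger, even though B is more dispersed!
However, when calculated on a "per observation" basis (variance),
the data set dispersions are properly ranked:
$\sigma_A^2 = Sum_A / N = 10/10 = 1$
$\sigma_B^2 = Sum_B / N = 8/2 = 4$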
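A small Python sketch makes the "sum versus per-observation" point concrete. It assumes, based on the "5 times" annotations on the slide, that data set A consists of five observations equal to 1 and five equal to 3, and that data set B is the pair 1 and 5; the function name is hypothetical.

```python
# Assumed reading of the slide: A has five 1s and five 3s, B is {1, 5}.
data_a = [1] * 5 + [3] * 5
data_b = [1, 5]


def sum_squared_deviations(values):
    """Total squared distance from the mean; grows with the number of observations."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)


sum_a = sum_squared_deviations(data_a)
sum_b = sum_squared_deviations(data_b)

# Dividing by the number of observations removes the effect of data set size.
var_a = sum_a / len(data_a)
var_b = sum_b / len(data_b)
print(sum_a, sum_b, var_a, var_b)
```

The raw sums rank A above B only because A has more observations; the per-observation figures rank B above A.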
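– Example 6
• Find the mean and the variance of the following
sample of measurements (in years).
3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$
A shortcut formula:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right]$
$= \frac{1}{5} \left[ (3.4^2 + 2.5^2 + \cdots + 3.7^2) - \frac{(17.7)^2}{6} \right] = 1.075 \text{ (years)}^2$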
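The shortcut formula in Example 6 should give the same result as the definitional formula. The Python sketch below computes both for the six measurements; the variable names are illustrative, and `statistics.variance` (which divides by n - 1) serves as an independent check.

```python
from statistics import variance

# Sample of six measurements (in years) from Example 6.
sample = [3.4, 2.5, 4.1, 1.2, 2.8, 3.7]
n = len(sample)
x_bar = sum(sample) / n

# Definitional form: squared deviations from the sample mean, divided by n - 1.
definitional = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Shortcut form: uses only the sum of squares and the square of the sum.
shortcut = (sum(x * x for x in sample) - sum(sample) ** 2 / n) / (n - 1)

print(round(x_bar, 2), round(definitional, 3), round(shortcut, 3))
```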
– The standard deviation of a set of measurements is
the square root of the variance of the measurements.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
– Example 4.9
• Rates of return over the past 10 years for two mutual
funds are shown below. Which one has a higher level of
risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1,
30.5
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3,
11.4
– Solution
– Let us use the Excel printout that is run from the
"Descriptive statistics" sub-menu (use file Xm0410)
                      Fund A    Fund B
Mean                  16        12
Standard Error        5.295     3.152
Median                14.6      11.75
Mode                  #N/A      #N/A
Standard Deviation    16.74     9.969
Sample Variance       280.3     99.37
Kurtosis              -1.34     -0.46
Skewness              0.217     0.107
Range                 49.1      30.6
Minimum               -6.2      -2.8
Maximum               42.9      27.8
Sum                   160       120
Count                 10        10
Fund A should be considered
riskier because its standard
deviation is larger
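The standard deviations in the printout can be reproduced without Excel. This is a minimal Python sketch using `statistics.stdev`, which applies the sample (n - 1) formula. It assumes the last Fund A return is 30.5, the value consistent with the printout's Sum of 160 and Sample Variance of 280.3.

```python
from statistics import stdev

# Annual rates of return (%). The final Fund A value is assumed to be 30.5,
# which matches the Sum (160) and Sample Variance (280.3) in the printout.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

sd_a = stdev(fund_a)  # sample standard deviation
sd_b = stdev(fund_b)

# The fund with the larger standard deviation is treated as the riskier one.
riskier = "Fund A" if sd_a > sd_b else "Fund B"
print(round(sd_a, 2), round(sd_b, 3), riskier)
```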
The coefficient of variation
– The coefficient of variation of a set of
measurements is the standard deviation divided
by the mean value.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
– This coefficient provides a proportionate
measure of variation.
A standard deviation of 10 may be perceived
as large when the mean value is 100, but only
moderately large when the mean value is 500
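The proportionate nature of the coefficient of variation can be seen in a two-line calculation. The sketch below is illustrative Python; the function name and the zero-mean guard are assumptions added for the example.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a proportion of the mean."""
    if mean_value == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean_value


# The same standard deviation of 10 against two different means.
cv_mean_100 = coefficient_of_variation(10, 100)
cv_mean_500 = coefficient_of_variation(10, 500)
print(cv_mean_100, cv_mean_500)
```

Relative to the mean, the variation is five times larger in the first case than in the second.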
Interpreting Standard Deviation
• The standard deviation can be used to
– compare the variability of several distributions
– make a statement about the general shape of a
distribution.
Measures of Association
• Two numerical measures are presented to
describe the linear relationship
between two variables depicted in the
scatter diagram.
– Covariance - is there any pattern to the way two
variables move together?
– Correlation coefficient - how strong is the
linear relationship between two variables?
The covariance
Population covariance:
$COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
(Excel uses this formula to calculate Cov)
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y)
N is the population size. n is the sample size.
Sample covariance:
$cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
NOTE: The formula in Excel does not give you
sample covariance
• If the two variables move in the same
direction (both increase or both decrease),
the covariance is a large positive number.
• If the two variables move in opposite
directions (one increases when the other
one decreases), the covariance is a large
negative number.
• If the two variables are unrelated, the
covariance will be close to zero.
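The sign behaviour described above can be illustrated with a tiny Python sketch. The three data series are hypothetical values chosen for the illustration, not data from the slides, and `sample_covariance` is an illustrative helper that implements the n - 1 formula.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]      # rises as x rises
opposite_direction = [10, 8, 6, 4, 2]  # falls as x rises
no_linear_pattern = [4, 1, 0, 1, 4]    # symmetric about x = 3, no linear trend

cov_same = sample_covariance(x, same_direction)
cov_opposite = sample_covariance(x, opposite_direction)
cov_none = sample_covariance(x, no_linear_pattern)
print(cov_same, cov_opposite, cov_none)
```

The third series shows that a covariance of zero rules out a linear pattern only; the values there still follow a curve.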
The coefficient of correlation
Population coefficient of correlation:
$\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation:
$r = \frac{cov(X, Y)}{s_x s_y}$
– This coefficient answers the question: How strong is the
association between X and Y?
$\rho$ or r = +1: strong positive linear relationship (COV(X,Y) > 0)
$\rho$ or r = 0: no linear relationship (COV(X,Y) = 0)
$\rho$ or r = -1: strong negative linear relationship (COV(X,Y) < 0)
• If the two variables are very strongly
positively related, the coefficient value is
close to +1 (strong positive linear
relationship).
• If the two variables are very strongly
negatively related, the coefficient value is
close to -1 (strong negative linear
relationship).
• No straight line relationship is indicated by
a coefficient close to zero.
– Example 7
• Compute the covariance and the coefficient of
correlation to measure how advertising expenditure
and sales level are related to one another.
• Base your calculation on the data provided in
example 2.3
Advert: 1, 3, 5, 4, 2, 5, 3, 2
Sales: 30, 40, 40, 50, 35, 50, 35, 25
Shortcut Formulas
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}$
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}$
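• Use the procedure below to obtain the required summations
Month    x     y      xy      x^2    y^2
1        1     30     30      1      900
2        3     40     120     9      1600
3        5     40     200     25     1600
4        4     50     200     16     2500
5        2     35     70      4      1225
6        5     50     250     25     2500
7        3     35     105     9      1225
8        2     25     50      4      625
Sum      25    305    1025    93     12175
$cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} \right] = \frac{1}{7} \left[ 1025 - \frac{25 \times 305}{8} \right] = 10.268$
$s_x^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] = \frac{1}{7} \left[ 93 - \frac{25^2}{8} \right] = 2.125$
$s_x = \sqrt{2.125} = 1.458$
Similarly, $s_y = 8.839$
$r = \frac{cov(X, Y)}{s_x s_y} = \frac{10.268}{1.458 \times 8.839} = .797$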
• Excel printout
Covariance matrix
              Advertsmnt    Sales
Advertsmnt    2.125
Sales         10.2679       78.125
Correlation matrix
              Advertsmnt    Sales
Advertsmnt    1
Sales         0.7969        1
• Interpretation
– The covariance (10.2679) indicates that
advertisement expenditure and sales level are
positively related
– The coefficient of correlation (.797) indicates
that there is a strong positive linear relationship
between advertisement expenditure and sales
level.
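The figures in Example 7 can be recomputed from the eight monthly observations. The following Python sketch applies the shortcut formulas directly; the variable names are illustrative, and the n - 1 divisor is used throughout to match the sample formulas on the slides.

```python
from math import sqrt

# Monthly advertising expenditure and sales from Example 7.
advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]
n = len(advert)

# Required summations (the "Sum" row of the worksheet).
sum_x = sum(advert)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(advert, sales))
sum_x2 = sum(x * x for x in advert)
sum_y2 = sum(y * y for y in sales)

# Shortcut formulas for the sample covariance and sample variances.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# Sample coefficient of correlation.
r = cov_xy / (sqrt(var_x) * sqrt(var_y))
print(round(cov_xy, 4), round(var_x, 3), round(var_y, 3), round(r, 4))
```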
• The Least Squares Method
– We are seeking a line that best fits the data
– We define the "best fit line" as a line for which the
sum of squared differences between it and the
data points is minimized.
Minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where $y_i$ is the actual y value of point i, and
$\hat{y}_i = b_0 + b_1 x_i$ is the y value of point i
calculated from the equation of the line
[Figure: scatter plot of Y against X with candidate lines and the errors between the data points and each line]
Different lines generate different errors,
thus different sums of squares of errors.
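The coefficients b0 and b1 of the line
that minimizes the sum of squares of errors
are calculated from the data.
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \quad b_0 = \bar{y} - b_1 \bar{x}$
where $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$ and $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$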
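The formulas for b0 and b1 can be tried on data. The Python sketch below applies them to the advertising and sales figures from Example 7; fitting a line to that data set is an illustration chosen here, not a step taken on the slides, and the function names are hypothetical. Note that the numerator and denominator of b1 are the same sums used in the sample covariance and the sample variance of x, so b1 equals cov(X, Y) divided by the variance of x once the n - 1 factors cancel.

```python
# Data reused from Example 7 for illustration.
advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]


def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y = b0 + b1 * x that minimizes squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between actual y values and the line's predictions."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


b0, b1 = least_squares_line(advert, sales)
best_sse = sum_squared_errors(advert, sales, b0, b1)

# Any other line produces a larger sum of squared errors.
shifted_sse = sum_squared_errors(advert, sales, b0 + 1.0, b1)
tilted_sse = sum_squared_errors(advert, sales, b0, b1 + 0.5)
print(round(b0, 3), round(b1, 3), round(best_sse, 3))
```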