or
How to make the
numbers say
whatever you want.
Acknowledgements:
Darrell Huff’s book: How to Lie With Statistics
(First published in 1954)
Danny Oppenheimer’s psychology course
Other books that may be interesting:
Statistical Tricks and Traps - by Ennis C. Almer.
“There are lies, damned lies, and statistics”
-- Disraeli
Anecdotal evidence is unreliable
Why does the phone always ring when you’re in
the shower?
Determining the difference between chance
and real effects
• Practically all statistics are based on a
sample of a population. So…
– How was the sample chosen?
– How big is the sample?
– What population does it claim to represent?
– What population does it actually represent?
Data = Signal + Noise
Signal = What we’re trying to measure
Noise = Error in our measurement
If noise is random, then as the sample
size increases, noise tends to cancel,
leaving only signal.
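The cancelling of random noise can be sketched with a small simulation. This is only an illustration under assumptions that are not in the slides: the signal is fixed at 10, the noise is Gaussian with standard deviation 2, and `estimate_error` is a hypothetical helper name.

```python
import random
import statistics

def estimate_error(signal, noise_sd, n, seed=0):
    """Return how far the sample mean lands from the true signal.

    Each observation is signal + random noise (assumed Gaussian here).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = [signal + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return abs(statistics.fmean(sample) - signal)

# Assumed values: signal = 10, noise standard deviation = 2.
for n in (5, 100, 10_000, 100_000):
    print(f"n = {n:>7}: error of the mean = {estimate_error(10.0, 2.0, n):.4f}")
```

Any single small sample can get lucky, but the error for the largest sample should be a small fraction of the noise level, because the random noise averages out.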
• Flip a coin 5 times
– Heads four times
– 80 % heads
• Flip 100 times
– Same results???
• In general, the larger the sample size, the better
the estimation.
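To put a number on the coin example, here is a minimal sketch that computes the exact chance of getting at least 80% heads, for 5 flips and for 100 flips. It assumes a fair coin, and the function name `prob_at_least` is my own.

```python
from math import comb

def prob_at_least(heads, flips, p=0.5):
    """Exact binomial probability of `heads` or more heads in `flips` tosses."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

p_small = prob_at_least(4, 5)      # 4 or 5 heads out of 5 (80% or more)
p_large = prob_at_least(80, 100)   # 80 or more heads out of 100

print(f"5 flips:   {p_small:.4f}")
print(f"100 flips: {p_large:.2e}")
```

With 5 flips, a fair coin shows 80% heads or more about 19% of the time (6 chances in 32). With 100 flips the same proportion is vanishingly rare, so seeing it would be evidence of a real effect rather than chance.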
• A telephone poll was taken during the U.S.
presidential campaign of Franklin D.
Roosevelt (1936).
• Based on the results of that poll, the pollsters
predicted that FDR would not win. FDR did
win, however.
• The poll did NOT accurately reflect all of the
voters because the opinions of only one part of
the population (wealthy people with
telephones) were taken into account.
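The effect of polling only one part of the population can be sketched with made-up numbers. Every figure below is invented for illustration and is not historical data; the point is only that a sample drawn from one group reports that group's opinion, whatever the sample size.

```python
# All shares and support levels are hypothetical, chosen only to show the mechanism.
groups = {
    "has telephone": {"share_of_voters": 0.30, "support": 0.35},
    "no telephone":  {"share_of_voters": 0.70, "support": 0.70},
}

# True support: weight each group's opinion by its share of all voters.
true_support = sum(g["share_of_voters"] * g["support"] for g in groups.values())

# A poll that reaches only telephone owners sees just that group's opinion.
polled_support = groups["has telephone"]["support"]

print(f"support among all voters:      {true_support:.1%}")
print(f"support in the telephone poll: {polled_support:.1%}")
```

In this sketch the poll predicts a loss while the full population delivers a win, and calling more telephone owners would not close the gap.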
Center
• Mode
• Median
• Mean
Spread
• Variability
• Standard Deviation
• Range
Mode
• The mode is any value that occurs most
frequently
• 10 15 20 20 22 30 30 40 40 50 50 50 60 60 70 70 79
80 100 100 125 200 200 300 400 450
• The mode in this case is 50
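Finding the mode amounts to counting how often each value occurs. A minimal sketch, using the CD counts from this deck (the raw survey responses contain two 60s, so both are included here); collecting every tied value is my own design choice, since a data set can have more than one mode.

```python
from collections import Counter

# CD counts in sorted order, 26 responses in total.
cds = [10, 15, 20, 20, 22, 30, 30, 40, 40, 50, 50, 50, 60, 60, 70, 70, 79,
       80, 100, 100, 125, 200, 200, 300, 400, 450]

counts = Counter(cds)                 # value -> number of occurrences
top = max(counts.values())            # highest frequency seen
modes = [value for value, count in counts.items() if count == top]

print(f"mode(s): {modes}, each occurring {top} times")
```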
Median (M)
• The midpoint of a
distribution.
• Half (50%) of the
observations are
larger, half (50%) are
smaller.
Steps to Find Median
• Arrange all the observations in order of size,
from smallest to largest.
• Find the position of the median by counting the
number of observations and finding the one in the
middle.
• If there is an odd number of observations, the
median will be one of your observations
• If there is an even number of observations, the
median is the average of the two center observations
in the ordered list and may not be one of your observations.
Example
• How many CDs do you own?
• 60 20 15 79 30 200 200 400 10 40 50 22 125 20
60 40 80 100 30 50 100 300 70 70 50 450
• 10 15 20 20 22 30 30 40 40 50 50 50 60 60 70 70 79
80 100 100 125 200 200 300 400 450
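The steps for finding the median can be sketched directly in code, applied to the raw CD responses above. The function is a plain restatement of those steps, not a library routine.

```python
# Raw survey responses, in the order they were given.
responses = [60, 20, 15, 79, 30, 200, 200, 400, 10, 40, 50, 22, 125, 20,
             60, 40, 80, 100, 30, 50, 100, 300, 70, 70, 50, 450]

def median(values):
    """Sort, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)           # step 1: smallest to largest
    n = len(ordered)
    mid = n // 2                       # step 2: locate the middle position
    if n % 2 == 1:
        return ordered[mid]            # odd count: the middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of two

cd_median = median(responses)
print(f"{len(responses)} responses, median = {cd_median}")
```

There are 26 responses, so the median is the average of the 13th and 14th ordered values. Both are 60, which shows that with an even count the median can still coincide with an actual observation.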
Mean
• Steps
– Add the values of the observations
– Divide the total by the number of observations
– That’s the mean!
Mean – arithmetic average = Σx/n
Median – the halfway point
Mode – the most common answer
Average could mean any of them…
Incomes:
• $9000
• $9000
• $9000
• $12,000
• $120,000
• $85,000
• $15,000
Mean = $37,000
Median = $12,000
Mode = $9000
Each is a legitimate average but
can serve conflicting purposes
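The three "averages" for the incomes above can be checked with Python's standard `statistics` module. This is just a verification sketch of the figures on the slide.

```python
import statistics

incomes = [9000, 9000, 9000, 12000, 120000, 85000, 15000]

mean_income = statistics.mean(incomes)      # total divided by count
median_income = statistics.median(incomes)  # middle value once sorted
mode_income = statistics.mode(incomes)      # most frequent value

print(f"mean   = ${mean_income:,.0f}")
print(f"median = ${median_income:,.0f}")
print(f"mode   = ${mode_income:,.0f}")
```

Someone who wants the group to look prosperous would quote the mean; someone who wants it to look poor would quote the mode.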
Range – Overall difference between
the highest and lowest scores.
Variance – Average squared difference from
the mean.
Standard Deviation
• Measures the spread by looking at how far the
observations are from their mean.
• Measures the average distance of the observations from
their mean.
• Variance (s^2) is the average of the squared distances from
the mean.
• Standard Deviation is the square root of the variance.
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \]
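The definition above (squared distances from the mean, summed and divided by n - 1, then a square root for the standard deviation) can be sketched as follows. The data values are made up for illustration, and the result is compared against the standard library's `statistics.variance` as a sanity check.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, not from the slides

print(f"variance           = {sample_variance(data):.3f}")
print(f"standard deviation = {sample_std(data):.3f}")
print(f"library variance   = {statistics.variance(data):.3f}")
```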
Identical Range
1 9 9 11 11 11 9 9 9 11 11 19
1 1 1 19 19 19 1 1 1 19 19 19
Identical Variance
1 9 9 11 11 11 9 9 9 11 11 19
6 6 6 13 14 14 6 6 7 14 14 14
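The two comparisons above can be checked numerically. This sketch assumes each slide shows two sets of twelve values, with the first set shared between the slides; the labels A, B and C are my own.

```python
import statistics

set_a = [1, 9, 9, 11, 11, 11, 9, 9, 9, 11, 11, 19]    # appears on both slides
set_b = [1, 1, 1, 19, 19, 19, 1, 1, 1, 19, 19, 19]    # "Identical Range" partner
set_c = [6, 6, 6, 13, 14, 14, 6, 6, 7, 14, 14, 14]    # "Identical Variance" partner

def value_range(xs):
    """Overall difference between the highest and lowest values."""
    return max(xs) - min(xs)

for name, xs in (("A", set_a), ("B", set_b), ("C", set_c)):
    print(f"set {name}: mean = {statistics.mean(xs)}, "
          f"range = {value_range(xs)}, "
          f"variance = {statistics.variance(xs):.2f}")
```

Sets A and B share a range of 18 but differ greatly in variance. Sets A and C have variances that are close (about 15.6 and 16.2 with the n - 1 formula) although their ranges are 18 and 8. Neither measure alone describes the spread.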
Median vs. Mean
• The median and mean both
describe the center, but
which is better?
• The mean is strongly
influenced by a few extreme
observations, and the median
is not.
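How much each measure reacts to an extreme observation can be sketched with the income figures used earlier in this deck: drop the single largest income and see how far the mean and the median move.

```python
import statistics

incomes = [9000, 9000, 9000, 12000, 120000, 85000, 15000]

# Remove the one extreme observation (the $120,000 income).
without_top = [x for x in incomes if x != max(incomes)]

mean_shift = statistics.mean(incomes) - statistics.mean(without_top)
median_shift = statistics.median(incomes) - statistics.median(without_top)

print(f"mean moves by   ${mean_shift:,.0f}")
print(f"median moves by ${median_shift:,.0f}")
```

Removing one person moves the mean by nearly $14,000 but the median by only $1,500.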
The Normal Curve
Central Limit Theorem
Any time you have a measure which is created by
summing many independent individual trials of data (Signal
+ Noise), you will end up with approximately a normal curve
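One way to see this without random simulation is to compute the exact distribution of a sum of dice. The sketch below sums 10 fair six-sided dice (my choice of example) and checks what share of the probability lies within one standard deviation of the mean; for a normal curve that share is about 68%.

```python
import math

def sum_distribution(num_dice, sides=6):
    """Exact probability of each possible total when rolling `num_dice` fair dice."""
    dist = {0: 1.0}
    for _ in range(num_dice):
        new_dist = {}
        for total, prob in dist.items():
            for face in range(1, sides + 1):
                new_dist[total + face] = new_dist.get(total + face, 0.0) + prob / sides
        dist = new_dist
    return dist

dist = sum_distribution(10)
mean = sum(total * p for total, p in dist.items())
sd = math.sqrt(sum((total - mean) ** 2 * p for total, p in dist.items()))
within_one_sd = sum(p for total, p in dist.items() if abs(total - mean) <= sd)

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"share within one sd of the mean = {within_one_sd:.1%}")
```

A single die is uniform, yet the sum of ten is already bell-shaped: symmetric around 35, with roughly two thirds of its probability within one standard deviation.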
Bimodal Distributions
Skewed Distributions
Uniform distributions
Correlation measures the strength
of a relationship between two
variables.
Positive Correlation: 0 < R < 1
Negative Correlation: -1 < R < 0
No Correlation: R = 0
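The correlation coefficient R can be computed directly from its definition (Pearson's formula). The three small data sets below are invented for illustration, one for each case on the slide.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_var = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_var / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
rises = [1, 3, 2, 5, 4]      # hypothetical: tends to go up with x
falls = [5, 3, 4, 1, 2]      # hypothetical: tends to go down as x goes up
unrelated = [2, 1, 3, 1, 2]  # hypothetical: no linear trend with x

r_pos = pearson_r(x, rises)
r_neg = pearson_r(x, falls)
r_none = pearson_r(x, unrelated)

print(f"positive: R = {r_pos:.2f}")
print(f"negative: R = {r_neg:.2f}")
print(f"none:     R = {r_none:.2f}")
```

R measures only how tightly the points follow a straight line. It says nothing about which variable, if either, causes the other.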
Correlation does not imply
causation
A person wearing red is 129 times more
likely to be gored by a charging bull!
But what is the base rate?
(more likely than what or who?)
98% of Americans have internet access
available.
The average temperature in Oklahoma
City over the past 100 years is 63.4
degrees. (and each year has been within
1 degree of that average)
Consumer spending on prescription drugs
has doubled since 1980.
85% of all car accidents occur within 10
miles of the home.
10 times as many people die each year in
plane crashes as in train crashes.
During the Spanish-American war, the
death rate for soldiers was 9 per 1000.
The death rate for civilians in New York
City was 16 per 1000.
Fluffy O’s Cereal Gives a body Energy!!
[Chart: vertical axis marked 5 and 10 with no label or units (??); horizontal axis: 1 minute, 2 minutes]
[Chart: Earnings, 1st Qtr to 4th Qtr; vertical axis runs from 60 to 62.5]
[Chart: Earnings, 1st Qtr to 4th Qtr; vertical axis runs from 0 to 100]
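The difference between the two earnings charts comes down to where the vertical axis starts. The sketch below compares how tall two bars look relative to each other on an axis starting at 60 versus one starting at 0. The earnings values are hypothetical, since the actual figures are not given in the slides.

```python
def apparent_ratio(small, large, axis_start):
    """Ratio of drawn bar heights when the vertical axis begins at `axis_start`."""
    return (large - axis_start) / (small - axis_start)

# Hypothetical earnings for two quarters (not taken from the slides).
q1, q4 = 60.5, 62.0

truncated = apparent_ratio(q1, q4, axis_start=60)   # axis runs 60 to 62.5
full_scale = apparent_ratio(q1, q4, axis_start=0)   # axis runs 0 to 100

print(f"axis from 60: the taller bar looks {truncated:.1f}x as high")
print(f"axis from 0:  the taller bar looks {full_scale:.3f}x as high")
```

With these assumed values, a rise of about 2.5% is drawn as a bar four times as tall once the axis is cut off at 60.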
Even when the scale is
fine, graphs can still be
deceptive
[Chart: 1st Qtr to 4th Qtr; vertical axis runs from 0 to 40]
Even when the scale is fine, graphs can still be
deceptive
[Chart: 1st Qtr to 4th Qtr; vertical axis runs from 0 to 40]
As you can see, Bob earns substantially more
than Joe
Joe’s income
Bob’s Income
Is it really a two to one ratio that’s being displayed?
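The question can be answered with a little arithmetic, under the assumption (the image is not available here) that Bob's picture is Joe's picture enlarged in every direction rather than only made taller. Scaling a picture by a factor multiplies its area by the square of that factor, and the apparent volume of a drawn object by the cube.

```python
def pictogram_ratios(scale):
    """Apparent size ratios when a picture is enlarged by `scale` in every direction."""
    return {
        "height": scale,        # what the data actually says
        "area": scale ** 2,     # what a flat picture shows the eye
        "volume": scale ** 3,   # what a drawn 3-D object suggests
    }

ratios = pictogram_ratios(2)
for measure, ratio in ratios.items():
    print(f"{measure:>6}: {ratio} to 1")
```

A two-to-one difference in income drawn this way reads as four to one on the page, or eight to one if the objects look three-dimensional.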
Portion of US income going to federal taxes
Federal spending has become equal to the total income of the
people of the black states.
Percent of US income going to federal taxes
Tylenol is used by 90% of doctors for their own
aches and pains!
So are Aspirin,
Ibuprofen, and
Bayer!
When Dewey was elected Governor in 1942, the
minimum teacher’s salary in some districts was
as low as $900 per year. Upon Governor
Dewey’s recommendation… the Legislature in
1947 appropriated $32,000,000 out of state
surplus to provide an increase in teacher’s
salaries. As a result, the minimum salaries of
teachers in New York City range from $2500 to
$5325.
1) Stanford is the #1 program in psychology
2) I am the only (and therefore #1) student at
Stanford studying Decision Errors
Therefore:
I am the top student studying Decision Errors in
the country.
1) Statistics are useful for telling random
noise apart from real effects
2) Numbers are not absolute, and they can be
easily manipulated
3) Always scrutinize data closely, and draw
your own conclusions.
4) 85% of all statistics are made up on the
spot: the rest are all wrong