normaldists - Shane Stevens


Normal Distributions
(aka Bell Curves, Gaussians)
Spring 2010
You’ve probably all seen a bell curve…
The Normal distribution is common




Lots of real data follows a normal shape. For
example
1) Many/most biometric measurements
(heights, femur lengths, skull diameters, etc.)
2) Scores on many standardized exams (IQ
tests) are forced into a normal shape before
reporting
3) Many quality control measurements, if you
take the log first, have a normal shape.
When sampling from a normal


Normal distributions are typically
characterized by two numbers, their mean or
“expected value” which corresponds to the
peak, and their “standard deviation” which is
the distance from the mean to the inflection
point.
Large standard deviations result in “spread
out” normals. Small standard deviations
result in “strongly peaked” distributions.
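As a rough check of these two statements, the sketch below uses Python's standard-library statistics.NormalDist. The mean of 25 matches the figure on the next slide; the two standard deviations (2 and 8) and the helper name peak_and_ratio are arbitrary choices made only for illustration.

```python
from statistics import NormalDist

MU = 25.0  # mean used in the slide's figure

def peak_and_ratio(sigma):
    """Return (peak height, height at MU + sigma divided by the peak height)."""
    dist = NormalDist(MU, sigma)
    peak = dist.pdf(MU)
    return peak, dist.pdf(MU + sigma) / peak

for sigma in (2.0, 8.0):  # illustrative values, not from the slides
    peak, ratio = peak_and_ratio(sigma)
    print(f"sigma={sigma}: peak height {peak:.4f}, ratio one sigma out {ratio:.4f}")
```

The smaller standard deviation gives the taller, narrower curve, and in both cases the height one standard deviation from the mean is about 60% of the peak.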
Mean (µ) and Standard deviation (σ) for
a normal distribution
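[Figure: normal curve centered at µ (here 25). The horizontal distance from µ to the inflection point is σ; at that point the curve's height is about 60% of the peak.]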
Two normals, corresponding to
different standard deviations.


Mean=100, std.dev = 16
Mean=100, std.dev = 4
The EDA of Normal distributions



The central location of a normal distribution
is given by the mean μ. (The median is also
μ).
The spread is given by the standard
deviation σ. (The interquartile range is
1.35σ, but you do NOT need to know that)
Normal distributions are symmetric and
typically have few, if any, outliers. If your data
has a lot of outliers, but is otherwise
symmetric and unimodal, it may have a “t”
distribution (discussed later in class).
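For the curious, the 1.35σ figure for the interquartile range can be checked with a short sketch. It assumes Python's statistics.NormalDist as the tool (the slides do not use Python), and the function name iqr_in_sigmas is made up for this example.

```python
from statistics import NormalDist

def iqr_in_sigmas():
    """Interquartile range of a normal distribution, measured in standard deviations."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.inv_cdf(0.75) - z.inv_cdf(0.25)

print(round(iqr_in_sigmas(), 2))  # 1.35
```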
Probabilities from a Normal
distribution


Normal distributions have a nice property
that, knowing the mean (μ) and standard
deviation (σ), we can tell how much data will
fall in any region.
Examples – the normal distribution is
symmetric, so 50% of the data is smaller
than μ and 50% is larger than μ.
More Normal Probabilities

It is always true that about 68% of the data
appears within 1 standard deviation of the
mean (so about 68% of the data appears in
the region μ±σ)
Yet more normal probabilities


It is also true that about 95% of the data appears
within 2 standard deviations of the mean, and
about 99.7% of the data appears within 3
standard deviations of the mean (so it’s
VERY rare to go beyond 3 standard
deviations).
In quality control applications, one often is
interested in “6-sigma”. Note that the region within 6
standard deviations of the mean includes about
99.9999998% of the data.
95% within 2 standard deviations,
99.7% within 3 standard deviations
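The 68%, 95%, 99.7% and 6-sigma figures can be reproduced with a few lines of code. This is a minimal sketch using Python's statistics.NormalDist; the helper name within_k_sigma is an assumption of this example, not something from the slides.

```python
from statistics import NormalDist

def within_k_sigma(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal; the answer is the same for any mean and sigma
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3, 6):
    print(f"within {k} standard deviation(s): {100 * within_k_sigma(k):.7f}%")
```

The result does not depend on μ or σ, which is why the rule can be stated once for every normal distribution.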
Computing more general probabilities

Suppose you want to know how much data
appears within 1.5 standard deviations of the
mean, or how much data appears between
1.3 and 1.7 standard deviations of the mean.
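One possible way to answer both questions with software is sketched below, using Python's statistics.NormalDist. "Between 1.3 and 1.7 standard deviations of the mean" is read here as a band on one side of the mean; that reading is an assumption, and the amount on both sides together is simply double.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: distances are measured in standard deviations

# Data within 1.5 standard deviations of the mean (both sides).
within_1_5 = Z.cdf(1.5) - Z.cdf(-1.5)

# Data between 1.3 and 1.7 standard deviations above the mean (one side only).
one_side = Z.cdf(1.7) - Z.cdf(1.3)

# By symmetry, the same band below the mean holds the same amount.
both_sides = 2 * one_side

print(f"within 1.5 sd: {within_1_5:.4f}")
print(f"1.3 to 1.7 sd, one side: {one_side:.4f}, both sides: {both_sides:.4f}")
```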
Another way


There is another way of computing normal
probabilities that is 1) the way it used to be
done, back in pre-handy-computer days, and 2)
useful for understanding more about the
normal distribution.
The number of standard deviations an
observation is from the mean is called the Z-score for that observation.
Z-score examples



If μ=100 and σ=16 (this is true of IQ scores in
the U.S.), then an observation X=125 is 25
points above the mean, which corresponds
to 25/16 = 1.5625 standard deviations above
the mean.
In general, a Z-score for an observation X is
Z=(X-μ)/σ
Observations above the mean get positive Z-scores, observations below the mean get
negative Z-scores.
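The formula translates directly into code. A minimal sketch in Python, with z_score as an illustrative function name:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

print(z_score(125, 100, 16))  # 1.5625, the IQ example above
print(z_score(90, 100, 16))   # -0.625, an observation below the mean
```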
Computing probabilities with Z-scores



Fortunately, the Z-score is all you need to
know to compute probabilities from a normal
distribution.
The reason is that Z-scores map directly to
percentiles.
For each Z-score Stata can provide the
percentile. For example, if the Z-score is 1,
the percentile is 84.13%. If the Z-score is 2.3,
then the percentile is 98.93%
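The slides use Stata for this lookup. As a stand-in, the sketch below does the same Z-score-to-percentile conversion with Python's statistics.NormalDist; the function name percentile_from_z is chosen for this example only.

```python
from statistics import NormalDist

def percentile_from_z(z):
    """Percent of a normal distribution lying below the given Z-score."""
    return 100 * NormalDist().cdf(z)

print(round(percentile_from_z(1.0), 2))  # 84.13
print(round(percentile_from_z(2.3), 2))  # 98.93
```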
Using a Z-table

Z-table shown separately in class and
illustrated.
Probabilities between Z-scores





Again, IQ scores are normally distributed with mean
100 and standard deviation 16.
How many people have IQ scores between 90 and
120?
Compute the corresponding Z-scores. For 90, the Z-score is (90-100)/16 = -0.625. For 120, the Z-score is
(120-100)/16 = 1.25.
Find the corresponding percentiles (SAS). The
percentile for Z=1.25 is 89.44%. The percentile for
Z=(-0.625) is 26.60%.
The amount between these is 89.44 – 26.60 =
62.84%
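The same three steps (Z-scores, percentiles, difference) can be sketched in Python with statistics.NormalDist standing in for the software lookup. The variable names are illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 100, 16  # IQ mean and standard deviation from the slide

# Step 1: Z-scores for the two endpoints.
z_low = (90 - MU) / SIGMA    # -0.625
z_high = (120 - MU) / SIGMA  # 1.25

# Step 2: percentiles (proportion below each Z-score).
standard = NormalDist()
p_low = standard.cdf(z_low)
p_high = standard.cdf(z_high)

# Step 3: the proportion between the two values is the difference.
between = p_high - p_low
print(f"{100 * p_low:.2f}% below 90, {100 * p_high:.2f}% below 120")
print(f"{100 * between:.2f}% between 90 and 120")
```

The result is a little under 63%, in line with the hand calculation.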
Comparing observations from different
normal distributions


The central idea is that a Z-score
corresponds to a percentile for the
observations.
If you have observations from multiple
normal distributions, you can compute the Z-score for each observation and compare
which has the “better” score.
Example



Suppose you have two students, one with a 23 on
the ACT (mean 22 and standard deviation 3) and
another with a 1220 on the SAT (mean 900 and
standard deviation 250).
The Z-score for the student with the ACT is (23-22)/3
= 0.33 while the Z-score for the student with the SAT
is (1220-900)/250 = 1.28.
The student with the SAT performed much better
(relative to peers on the exam).
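A short sketch of this comparison in Python follows. The exam means and standard deviations are the ones given on the slide; the percentile step uses statistics.NormalDist and is an addition of this example, included only to show what each Z-score means in terms of rank.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    return (x - mu) / sigma

z_act = z_score(23, 22, 3)        # ACT student
z_sat = z_score(1220, 900, 250)   # SAT student

standard = NormalDist()
for name, z in (("ACT", z_act), ("SAT", z_sat)):
    print(f"{name}: Z = {z:.2f}, percentile = {100 * standard.cdf(z):.1f}%")

better = "SAT" if z_sat > z_act else "ACT"
print(f"The {better} student scored higher relative to peers.")
```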
Review





The mean and standard deviation of a normal are all
you need to know to compute any percentage.
Z-scores map directly to percentiles. A “Z-table”
provides these percentiles.
To find the probability below a value, compute the Z-score and look up the percentile.
To find the probability above a value, find the Z-score and percentile, then take 1 minus that
percentile (e.g. if the probability of being below is
0.24, the probability above is 0.76).
To find the probability between two values, find the
Z-scores and percentiles for each value, then take
the difference.
Example

Suppose you have data which is normally distributed
with mean 70 and standard deviation 4 (this describes
U.S. male heights in inches).
What proportion of data is less than 70? Half the data is always
below the mean, so this is 50%.
What proportion of data is between 68 and 74? The Z-scores for
68 and 74 are (68-70)/4 = (-0.50) and (74-70)/4 = 1.00,
corresponding to percentiles from the table of 30.85% and
84.13%. The difference between these
percentiles is 84.13% - 30.85% = 53.28%
What proportion of data is greater than 76? The Z-score for 76 is
(76-70)/4 = 1.5 which corresponds to a percentile of 93.32%.
Remember that percentiles are always in terms of “less than the
value”. To get greater than, we subtract from 100% to get 6.68%
as our answer.
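The three kinds of question (below, between, above) can be sketched as three small Python helpers. The helper names are invented for this example, and statistics.NormalDist replaces the Z-table lookup.

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=4)  # mean and standard deviation from the slide

def prob_below(dist, x):
    """Proportion of the distribution less than x (the percentile of x)."""
    return dist.cdf(x)

def prob_above(dist, x):
    """Proportion greater than x: 1 minus the percentile."""
    return 1 - dist.cdf(x)

def prob_between(dist, low, high):
    """Proportion between low and high: the difference of the two percentiles."""
    return dist.cdf(high) - dist.cdf(low)

print(f"below 70:         {100 * prob_below(heights, 70):.2f}%")
print(f"between 68 and 74: {100 * prob_between(heights, 68, 74):.2f}%")
print(f"above 76:         {100 * prob_above(heights, 76):.2f}%")
```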
Example

One manufacturing plant (plant A) produces bolts
whose lengths are normally distributed with mean 2
inches and standard deviation 0.009 inches. Another
plant (plant B) produces bolts whose lengths are
normally distributed with mean 1.99 inches and
standard deviation 0.003 inches. For the bolts to be
usable, their length must be between 1.98 and 2.02
inches. Which plant has the higher percentage of
usable bolts? (this gets to the issue that spread may
be more important than central location in some
quality control examples).
Example continued




Plant A is N(2.00,0.009)
Plant B is N(1.99,0.003)
For each plant, we want the percentage of
bolts between 1.98 and 2.02 inches.
Thus, we need the Z-scores for 1.98 and
2.02 from each plant.
Example continued





Plant A is N(2.00,0.009). Z-scores for 1.98 and 2.02
are Z=(1.98-2.00)/0.009=(-2.22) and Z=(2.02-2.00)/0.009=2.22.
These correspond to percentiles of 0.0132 and
0.9868. Thus, the probability is 0.9868-0.0132 =
0.9736
Plant B is N(1.99,0.003). Z-scores for 1.98 and 2.02
are Z=(1.98-1.99)/0.003=(-3.33) and Z=(2.02-1.99)/0.003=10.
These correspond to percentiles of 0.0004 and
1.0000 (for values off the chart, use 0 or 1). Thus,
the probability is 1.0000-0.0004 = 0.9996.
Thus, even though Plant A averages a “perfect bolt”,
its larger spread means a smaller share of its bolts is usable.
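The plant comparison can be checked with a short Python sketch. It uses statistics.NormalDist with unrounded Z-scores, so the results may differ from the table-based figures in the last decimal place; the function name usable_fraction is an assumption of this example.

```python
from statistics import NormalDist

LOW, HIGH = 1.98, 2.02  # usable bolt lengths, in inches

def usable_fraction(mean, sd):
    """Proportion of bolts with length between LOW and HIGH."""
    plant = NormalDist(mean, sd)
    return plant.cdf(HIGH) - plant.cdf(LOW)

usable_a = usable_fraction(2.00, 0.009)
usable_b = usable_fraction(1.99, 0.003)

print(f"Plant A usable: {usable_a:.4f}")
print(f"Plant B usable: {usable_b:.4f}")
```

Plant B comes out ahead even though its mean is off target, because nearly all of its narrow distribution still fits inside the usable range.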
Percentiles back to Z-scores


We can go both directions between Z-scores
and percentiles. Each Z-score corresponds
to a percentile. Similarly, each percentile has
a corresponding Z-score.
For example, what IQ score has 80% of the
people below it?
Going from percents to Z-scores





In the U.S., IQ scores are normally distributed with
mean 100 and standard deviation 16.
What is the 80th percentile of IQs?
In other words, for what IQ do 80% of the people fall
below it?
The first step is to find the Z-score corresponding to
80% (look for the number closest to 0.8000 in the
BODY of the table, then find the corresponding Z-score)
That Z-score is Z=0.84
Z-scores back to IQ scores






The Z-score corresponding to 80% is 0.84
Remember the Z-score is Z=(X-μ)/σ
This formula can be reversed to find
X=σZ + μ
Our Z=0.84, μ=100, and σ=16, so
X = (0.84)(16) + 100 = 113.44
So 80% of people have an IQ score below
113.44
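The reverse lookup can also be done in software. The sketch below assumes Python's statistics.NormalDist, whose inv_cdf method goes from a proportion back to a value. Because it does not round the Z-score to two decimals as the table does, its answer is slightly higher than the table-based 113.44.

```python
from statistics import NormalDist

MU, SIGMA = 100, 16  # IQ mean and standard deviation from the slide

# Step 1: percentile -> Z-score (the "body of the table" lookup).
z = NormalDist().inv_cdf(0.80)

# Step 2: Z-score -> IQ score, using X = sigma * Z + mu.
x = SIGMA * z + MU

# Both steps at once, working directly with the IQ distribution.
x_direct = NormalDist(MU, SIGMA).inv_cdf(0.80)

print(f"Z = {z:.4f}, 80th percentile IQ = {x:.2f} (direct: {x_direct:.2f})")
```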
New Example




What about the middle 50% of IQ scores?
What percentiles does this correspond to?
To get the middle 50%, we need to stretch
from the 25th percentile to the 75th percentile.
The 25th percentile corresponds to Z-score of
(-0.67), while the 75th percentile corresponds
to a Z-score of 0.67
These correspond to IQ’s of (-0.67)(16)+100
= 89.28 and (0.67)(16) + 100 = 110.72
Related example




What about the middle 95% of values (removing
2.5% from each tail)?
Quick answer is that we’ve already said going 2
standard deviations in either direction contains
approximately 95% of the values, which would
correspond to IQs between 68 and 132.
Exact answer, however, is that we need the 2.5th
and 97.5th percentiles. These correspond to Z-scores of -1.96
and +1.96.
Going 1.96 standard deviations in either direction
from the mean 100 means the middle 95% of IQ
scores fall between 68.64 and 131.36
Another example



What about the middle 99% of values?
We need to find the 0.5th and 99.5th
percentiles (removing one half a percent
from each tail, leaving 99% in the middle)
These correspond to Z-scores of -2.576 and 2.576, which would
correspond to IQ scores between
(-2.576)(16) + 100 = 58.78 and (2.576)(16) +
100 = 141.22
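The "middle 50%", "middle 95%" and "middle 99%" examples all follow one pattern: split the leftover percentage evenly between the two tails, then convert the two resulting percentiles back to IQ scores. A minimal Python sketch of that pattern follows; the function name central_interval is hypothetical, and because statistics.NormalDist uses unrounded Z-scores its endpoints can differ from the table-based ones in the last digit or two.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)  # IQ distribution from the slides

def central_interval(dist, middle):
    """Return (low, high) values enclosing the given middle proportion of dist."""
    tail = (1 - middle) / 2  # proportion removed from each tail
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

for middle in (0.50, 0.95, 0.99):
    low, high = central_interval(iq, middle)
    print(f"middle {middle:.0%}: {low:.2f} to {high:.2f}")
```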