Normal Distribution


DEFINITIONS
We are studying the effect of environmental conditions on human
subjects.
The SYSTEM is the human subject.
The POPULATION is the temperature readings of human subjects
under the experimental conditions.
The RANDOM VARIABLE is the measurement recorded on a
thermometer.
The SYSTEM OUTPUT is the reading on the thermometer.
In this example the MODEL is assumed to be a normal distribution.
There are two PARAMETERS of interest in a normal distribution: μ and σ
Ten subjects are tested and the DATA are:
98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4.
Entering this data in SPSS we obtain ESTIMATES of μ as 97.85 and σ as
0.550.
These are the estimates of the MEAN and STANDARD DEVIATION of the
population temperatures.
The estimates of skewness (√b1) and kurtosis (b2) can be obtained from
the SPSS output and are -0.798 and 0.975.
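The estimates above can be checked without SPSS. The following is a minimal Python sketch using only the standard library; the variable names are illustrative, and the bias-adjusted skewness formula is an assumption about the form of the statistic that SPSS reports.

```python
import math
import statistics

# Temperature readings for the ten subjects
temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]

n = len(temps)
mean = statistics.mean(temps)   # estimate of mu
sd = statistics.stdev(temps)    # estimate of sigma (n - 1 divisor)

# Central moments with divisor n
m2 = sum((y - mean) ** 2 for y in temps) / n
m3 = sum((y - mean) ** 3 for y in temps) / n

# Bias-adjusted sample skewness (assumed to match the SPSS definition)
skew = math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print(round(mean, 2), round(sd, 3), round(skew, 3))
```

The negative skewness reflects the single low reading of 96.7.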
Estimate the 90th percentile:
1) If we assume that the normal distribution is
the correct model then the 90th PERCENTILE of
the standard normal distribution, Z.90, is 1.282.
Using the equation
Y.90 = μ + σZ.90
and substituting the estimates for the parameters μ
and σ yields Y.90 = 97.85 + 0.550(1.282) = 98.555.
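A minimal Python sketch of the model-based percentile calculation, using the standard library's `statistics.NormalDist` in place of a normal table. The function and variable names are illustrative, not part of the course material.

```python
from statistics import NormalDist

mu_hat, sigma_hat = 97.85, 0.550   # estimates from the sample

def normal_percentile(p, mu, sigma):
    """Return the 100p-th percentile of a normal(mu, sigma) model."""
    z_p = NormalDist().inv_cdf(p)   # percentile of the standard normal
    return mu + sigma * z_p

z90 = NormalDist().inv_cdf(0.90)    # about 1.282 (1.645 belongs to p = 0.95)
y90 = normal_percentile(0.90, mu_hat, sigma_hat)
print(round(z90, 3), round(y90, 3))
```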
2) If we do not assume a statistical model and want to estimate the
percentile directly from the data the following equation is used:
Let i = np + 1/2
where n is the number of data points, p is the desired percentile expressed
as a proportion, and i is the order number in the ordered data list
corresponding to the desired percentile.
To find the estimate of the 75th percentile for our example
i = 10 x 0.75 + 0.5 = 8.0.
Thus the eighth value in the ordered list is the 75th percentile. If i is not an
integer we use linear interpolation between the two neighboring values to
find the estimate.
The ordered list is denoted the ORDER STATISTICS. The order statistics
for our example are:
96.7, 97.4, 97.6, 97.7, 97.8, 97.9, 98.2, 98.2, 98.4, 98.6
Thus 98.2 is the estimate of the 75th percentile.
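The model-free rule i = np + 1/2 with linear interpolation can be sketched in Python as follows. The function name is illustrative, and clamping i to the range 1..n for extreme p is an assumption, since the rule above does not say what to do there.

```python
def empirical_percentile(data, p):
    """Estimate the 100p-th percentile from the order statistics."""
    ordered = sorted(data)
    n = len(ordered)
    i = n * p + 0.5
    i = min(max(i, 1.0), float(n))   # assumption: clamp to a valid order number
    lower = int(i)                   # whole part of the order number
    frac = i - lower
    if frac == 0 or lower == n:
        return ordered[lower - 1]
    # Linear interpolation between order statistics `lower` and `lower + 1`
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
p75 = empirical_percentile(temps, 0.75)   # i = 8.0, the eighth order statistic
p50 = empirical_percentile(temps, 0.50)   # i = 5.5, halfway between 5th and 6th
print(p75, p50)
```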
To find P{Y < 99.0} we use the DISTRIBUTION FUNCTION of the
standard normal distribution, F(Z).
To obtain this probability, we first standardize using the equation:
Z = (Y - μ)/σ
with the estimates of μ and σ, and then use the table of the normal
distribution to find F(Z).
For this example:
Z = (99.0 - 97.85)/0.55 = 2.09 and F(2.09) = 0.982.
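As a quick check of the table lookup, here is a small Python sketch that standardizes and evaluates the normal distribution function with the standard library. The variable names are illustrative.

```python
from statistics import NormalDist

mu_hat, sigma_hat = 97.85, 0.55

z = (99.0 - mu_hat) / sigma_hat           # standardized value
prob_below_99 = NormalDist().cdf(z)       # F(Z) for the standard normal
print(round(z, 2), round(prob_below_99, 3))
```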
If we want to obtain a confidence interval for μ we use the Student t
distribution.
The 95% interval from SPSS is (97.46, 98.24).
Testing Hypotheses
To TEST the HYPOTHESIS that:
H0: μ = 98.7 vs. HA: μ < 98.7
a lower tailed t test is used. This can be done using SPSS.
For this example t = -4.89 and p = 0.00043.
Thus the conclusion is that the data provide strong evidence against H0:
a sample mean this far below 98.7 would be very unlikely if μ = 98.7.
Note that for a one-tailed test we must divide the two-tailed p-value
from SPSS by two.
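The interval and the t statistic can be reproduced by hand. The sketch below is a hedged illustration: the tabled value 2.262 (97.5th percentile of Student t with 9 degrees of freedom) is typed in rather than computed, and the hypothesized mean is taken as 98.7, the value consistent with the reported t = -4.89; both are assumptions of this example.

```python
import math
import statistics

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]

n = len(temps)
mean = statistics.mean(temps)
std_error = statistics.stdev(temps) / math.sqrt(n)

T_975_DF9 = 2.262   # from a Student t table, 9 degrees of freedom
ci_95 = (mean - T_975_DF9 * std_error, mean + T_975_DF9 * std_error)

mu0 = 98.7          # hypothesized mean (assumed, see note above)
t_stat = (mean - mu0) / std_error

print([round(v, 2) for v in ci_95], round(t_stat, 2))
```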
CONTINUOUS DISTRIBUTIONS
Normal Distribution
The normal is the most commonly used distribution to
model system output. The normal distribution is used to
represent system output that results from the additive
effect of many factors.
Consider the blood pressure measurement from a single
individual. As you know, one's blood pressure varies each
time it is measured. The reading obtained is the result of
many factors both physical and mental at the time of the
reading. Thus it would be logical to use the normal
distribution as a model for blood pressure readings. Many
of the system outputs that are measured are the result of
a series of added effects.
The density function of the normal distribution is given by:
f(t) = [σ²(2π)]^{-1/2} exp[-(1/2){(t - μ)/σ}²],
-∞ < t < ∞, σ > 0, -∞ < μ < ∞, where μ and σ are the mean and standard deviation of
the distribution.
The integral of the density can only be evaluated numerically and hence a
table or computer program must be used to determine the
distribution function F(y). You learned in your statistics class
how to estimate μ and σ using the sample mean and
sample standard deviation. (See equations (2-31) and (2-51a) in your text.)
The normal distribution is symmetrical
about its mean, hence its skewness measure is zero. The
kurtosis measure is 3. The Student t distribution is used to
obtain confidence intervals for the parameter μ and the chi-squared
distribution is used for the parameter σ. Review your
statistics notes for these formulas.
[Figure: Normal Distribution densities, curves labeled by μ]
The Half Normal Distribution
In some problems the random variable of
interest is the absolute value of the
measurement. Thus if we are measuring the
deviation from a standard, are not interested
in whether the reading is positive or negative,
and the distribution of the
original data is normal with mean zero, then the appropriate
model to use for the absolute values is a half-normal distribution.
The density of the half-normal distribution is:
f(t) = [2/(πσ²)]^{1/2} exp[-t²/(2σ²)], t > 0, σ > 0.
There is only one parameter for the half-normal: σ
The mean of the distribution is 0.798σ and the variance is
0.363σ².
The graph of the distribution looks like the positive portion of
the normal curve.
Thus it is skewed with a skewness measure of 0.995 and a
kurtosis measure of 3.869.
[Figure: Half-Normal Distribution]
Matching of Moments
The method of MATCHING OF MOMENTS is a useful
technique for estimating the parameters of a
distribution. In this method we equate the moments
obtained from the data (sample mean, sample
variance, sample measure of skewness and/or sample
measure of kurtosis) with the moments of the selected
model. We match one moment for each unknown
parameter. Thus for the half-normal distribution we
match the sample mean with the mean of the half-normal, 0.798σ. If a distribution had two parameters we
would match the sample mean and variance with the
mean and variance of the model.
EXAMPLE
A pacemaker is being tested to determine the average error from
the nominal of 60 beats per minute. Thus the random variable
of interest is the deviation from the nominal of 60.
We are not concerned whether the deviation is positive or
negative. The readings are:
6.5 1.6 6.9 11.7 5.7 2.1 2.5 5.3 1.8 2.0 10.9 6.0 13.2
The sample mean is 5.86. Therefore we match 5.86 with 0.798σ,
yielding the estimate of σ of 5.86/0.798 = 7.34. Tables and
other methods of estimation can be found in the 1961 issue of
Technometrics, volume 3 on page 543, in an article by Leone and
Nelson entitled “The Folded Normal Distribution”. The folded
normal distribution is more general than the half-normal, which
is the special case of the folded normal where the distribution is
folded at zero.
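The matching-of-moments step for the pacemaker data can be sketched in a few lines of Python. The constant 0.798 is the rounded value of sqrt(2/π); the names below are illustrative.

```python
import math
import statistics

# Absolute deviations from the nominal 60 beats per minute
deviations = [6.5, 1.6, 6.9, 11.7, 5.7, 2.1, 2.5, 5.3, 1.8, 2.0, 10.9, 6.0, 13.2]

MEAN_FACTOR = math.sqrt(2 / math.pi)   # half-normal mean is MEAN_FACTOR * sigma

sample_mean = statistics.mean(deviations)
sigma_hat = sample_mean / MEAN_FACTOR  # equate sample mean with model mean

print(round(MEAN_FACTOR, 3), round(sample_mean, 2), round(sigma_hat, 2))
```

Using the unrounded factor gives a value within about 0.01 of the hand calculation 5.86/0.798.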
Exponential distribution
Consider some outcome that represents the time to the end
of life of a system under study, the time of death of a
subject, the time of failure of a piece of equipment, or the
time between occurrences of an event where the
probability of occurrence is proportional to the length of the
time interval and the rate of occurrence is constant over
time.
Let F(t) = pr{T < t}, the distribution function. The
probability that a failure occurs in the next
instant of time, given that the system is working at time t, is
the conditional probability of failure in the interval (t, t+Δt).
This probability is:
[F(t+Δt) - F(t)]/[1 - F(t)] = λΔt
where λ is the constant rate of occurrence. Dividing by Δt and
letting Δt → 0 gives F'(t)/[1 - F(t)] = λ. The solution
of this equation is:
1 - F(t) = e^{-λt} where t > 0 and λ > 0
The density function is obtained by differentiating with
respect to t.
This yields the exponential distribution
f(t) = λe^{-λt}, t > 0 and λ > 0
The distribution function is F(t) = 1 - e^{-λt}
[Figure: Exponential Distribution]
The Conditional Failure Rate
Function
A useful function which can serve as a guide for the
selection of a failure model is r(t), the CONDITIONAL
FAILURE RATE FUNCTION. This yields the conditional
probability of a failure occurring in the next instant of time
given that it has not failed up to time t.
r(t) = f(t)/ [1 – F(t)]
The value of r(t) for the exponential distribution is
r(t) = λe^{-λt}/{1 - (1 - e^{-λt})} = λ
The conditional failure rate for the exponential distribution is a
constant, so it is used for the distribution of the time to failure
for systems which do not wear out.
The reciprocal of the parameter λ is both the mean and standard
deviation of this distribution. The reciprocal of the sample mean, t̄, is
used to estimate this parameter.
A confidence interval for this parameter can be obtained using the fact
that the statistic 2nλt̄ has a chi-squared distribution with 2n degrees
of freedom, where n is the sample size. Hence a (1-α)100%
confidence interval follows from the fact that:
pr{χ²_{2n,α/2} < 2nλt̄ < χ²_{2n,1-α/2}} = 1 - α
yielding the interval (χ²_{2n,α/2}/(2nt̄), χ²_{2n,1-α/2}/(2nt̄)).
Review the use of the chi-squared tables.
In an experiment to test the durability of a new design of
pacemakers an accelerated test was conducted on 50
units. The average time to failure was 4.05 months.
Estimate the parameter λ and obtain a 95% confidence
interval.
The estimate of λ is 1/4.05 = 0.247. The confidence
interval is obtained via use of the chi-square table with
100 degrees of freedom using α = 0.05, and thus the
0.025 and 0.975 percentiles are found in the table to be
74.2 and 129.6.
This yields the interval (74.2/(100x4.05), 129.6/405) =
(0.183, 0.320). We conclude that the best estimate of the
mean life of the pacemakers is 4.05 months and we are
95% confident that the mean life lies
between 3.13 months and 5.46 months in the accelerated
environment.
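The arithmetic of this example can be sketched in Python. The two chi-squared percentiles are entered from the table as in the text rather than computed; the variable names are illustrative.

```python
n = 50          # number of units tested
t_bar = 4.05    # average time to failure, months

# Chi-squared percentiles for 2n = 100 degrees of freedom, read from a table
CHI2_025, CHI2_975 = 74.2, 129.6

lam_hat = 1 / t_bar
lam_ci = (CHI2_025 / (2 * n * t_bar), CHI2_975 / (2 * n * t_bar))

# The mean life is 1/lambda, so the interval endpoints invert and swap
mean_life_ci = (1 / lam_ci[1], 1 / lam_ci[0])

print(round(lam_hat, 3), [round(v, 3) for v in lam_ci],
      [round(v, 2) for v in mean_life_ci])
```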
Gamma distribution
Let us now consider a model that can be used when the
event or failure does not occur until there have been η
suboccurrences and the times between these suboccurrences
have independent exponential distributions.
Examples are the life of an animal that does not die
until it is attacked five times by a predator, or the time to
overhaul a machine after it is repaired six times, where the
times between events are independent exponential variables
with a constant value of λ. Thus the random variable of
interest is the sum of the times between occurrences until the
ηth occurrence, where the times between occurrences are
independent exponential variables each with the parameter λ.
This random variable has a gamma distribution whose density is
f(t) = [λ^η/Γ(η)] t^{η-1} e^{-λt}; t > 0, λ > 0, η > 0
The function Γ(η) = (η-1)! when η is a positive integer. When η is
not an integer the value of the function must be looked up in a table of
the gamma function.
The distribution function can only be evaluated analytically when η is
an integer. The distribution function F(t), when η is an integer, is given
by
F(t) = 1 - Σ_{k=0}^{η-1} (λt)^k e^{-λt}/k!.
This sum can be obtained from a Poisson table with parameter λt
and y = η - 1.
Check this out in your text.
The mean of this distribution is η/λ and the variance is η/λ².
The parameters can be estimated by the method of matching
the moments. Since there are two parameters we use the
sample mean and sample variance in the matching process.
The estimate of λ is t̄/s². The estimate of η is t̄²/s², i.e. the
sample mean squared divided by the sample variance.
If the parameter η can only take on integer values the
gamma distribution is sometimes called the Erlangian
distribution. Computation of confidence intervals and the
conditional failure rate are complicated and can be found in
Statistical Modeling Techniques by
Shapiro and Gross published by Marcel Dekker, 1981. The
failure rate increases with time when η > 1 and decreases
when it is less than one.
An experiment is run to estimate the average time it takes
for a machine to require a complete overhauling. A
machine is overhauled after it needs to be recalibrated six
times. The times between recalibrations have independent
exponential distributions.
The average time between overhauls is 525.5 hours and
the standard deviation is 207.7. The estimate of λ is
525.5/207.7² = 0.0122 and the estimate of η is 0.0122
(525.5) = 6.4.
Find the probability that a machine will need overhauling
in less than 300 hours.
We can get an approximate answer to this if we:
• Assume that η is equal to 6.0
• Use the Poisson Table with y = 6-1 = 5 and the column
value of λt = (0.0122)300 = 3.66.
Using the closest column value to 3.66, the value of the sum for
y = 5 is approximately 0.844.
Therefore F(300) = 1 - 0.844 = 0.156.
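When the shape parameter is an integer, the Poisson sum can be evaluated directly instead of read from a table. The following Python sketch assumes the shape is rounded to 6 as in the example; the function name `erlang_cdf` is illustrative. Because it evaluates the sum at the exact value of λt rather than the nearest tabulated column, its answer may differ slightly from the table-based 0.156.

```python
import math

def erlang_cdf(t, eta, lam):
    """F(t) for a gamma distribution with integer shape eta and rate lam."""
    if t <= 0:
        return 0.0
    x = lam * t
    # Poisson sum over k = 0 .. eta - 1
    poisson_sum = sum(x ** k * math.exp(-x) / math.factorial(k)
                      for k in range(eta))
    return 1.0 - poisson_sum

t_bar, s = 525.5, 207.7
lam_hat = t_bar / s ** 2        # matching of moments: mean / variance
eta_hat = t_bar ** 2 / s ** 2   # matching of moments: mean^2 / variance

prob_300 = erlang_cdf(300, 6, lam_hat)   # shape rounded to 6 (assumption)
print(round(lam_hat, 4), round(eta_hat, 1), round(prob_300, 3))
```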
S. Shapiro and L. Chen, “Composite Test for the Gamma Distribution”, Journal
of Quality Technology,33, 47-59, (1998)
[Figure: Gamma distribution densities for several values of η]
Weibull distribution
The exponential model is limited as a lifetime model since it can
only be used in situations where the conditional failure rate is constant. If
we start with the function:
r(t) = [η/σ](t/σ)^{η-1}
then if η > 1 the function increases with time and if η < 1 it decreases.
Note that when η = 1, r(t) is constant and it is the function for the
exponential distribution. Setting r(t) equal to f(t)/(1-F(t)) yields:
f(t) = (η/σ)[t/σ]^{η-1} exp[-(t/σ)^η], t ≥ 0, σ > 0, η > 0 and
F(t) = 1 - exp{-[t/σ]^η}, t ≥ 0.
The mean of this distribution is σΓ(1 + 1/η), where Γ(x) is the
gamma function discussed previously. The variance of the distribution is
σ²{Γ(1 + 2/η) - [Γ(1 + 1/η)]²}. The estimation of the
parameters requires a numerical procedure or a graphical procedure to be
discussed later. Once the parameters are known the distribution function
can be used to obtain probabilities.
In a study of 20 patients receiving an analgesic to relieve
headache pain a Weibull distribution was used to model the
time to the cessation of pain.
The estimate of η was 2.79 and the estimate of σ was
2.14. The mean relief time was 1.89 hours.
The probability that the relief time will exceed 4.0 hours is
obtained from
1-F(4) = exp{-(4/2.14)^2.79} = 0.003.
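A short Python sketch of the Weibull calculations, taking 2.79 as the shape parameter η and 2.14 as the scale parameter σ. The function names are illustrative, and the model mean is computed from the gamma-function formula, so it need not equal the reported mean relief time exactly.

```python
import math

def weibull_survival(t, eta, sigma):
    """1 - F(t) for a Weibull with shape eta and scale sigma."""
    return math.exp(-((t / sigma) ** eta))

def weibull_mean(eta, sigma):
    """Model mean sigma * Gamma(1 + 1/eta)."""
    return sigma * math.gamma(1 + 1 / eta)

eta_hat, sigma_hat = 2.79, 2.14
p_exceed_4 = weibull_survival(4.0, eta_hat, sigma_hat)
model_mean = weibull_mean(eta_hat, sigma_hat)
print(round(p_exceed_4, 3), round(model_mean, 2))
```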
L. Chen and S. Shapiro, “Can the Idea of the QH Test for Normality be Used for
Testing the Weibull Distribution”, J. Statistical Computation and Simulation, 55,
258-263. (1996)
[Figure: Weibull distribution densities for several values of η; σ = 1 for all plots]
The Rayleigh distribution is a
Weibull with η = 2.
Lognormal distribution
It was stated that the normal distribution is the statistical model for
events that represent additive effects. We now consider a model that is
used when the event is caused by multiplicative effects.
If T = X1X2……Xn then Y = ln(T) = Σ ln Xi, so Y is a sum of random
variables and can be modeled by a normal distribution. Then T has a
lognormal distribution with density function
f(t) = [2πt²σ²]^{-1/2} exp[-(1/(2σ²)){ln t - μ}²], t > 0, σ > 0, -∞ < μ < ∞
The mean and variance of the lognormal are exp[μ + σ²/2] and
exp[2μ + σ²]{exp[σ²] - 1}.
Estimation of the parameters is simple. Take the natural log of the data;
the estimate of μ is the sample mean of the logs and the estimate
of σ is the sample standard deviation of the logs.
Remember that these are not the mean and the standard deviation
of the data!
An experiment is run to estimate the growth of a strain of bacteria in a
period of two days. The growth at any point in time depends on the size
at the instant prior to the measurement. There is a multiplicative effect
that determines the size of the bacteria colony. We will model the size
using a lognormal distribution. The following sizes of ten colonies were
measured after two days:
9.98 10.36 10.04 12.82 10.86 10.39 9.06 11.17 10.29 10.78
Taking the natural log of these numbers yields:
2.30 2.34 2.31 2.55 2.36 2.34 2.20 2.41 2.33 2.38
The estimate of μ is 2.352 and the estimate of σ is 0.039.
The mean size of the colonies is:
exp[2.352 + 0.039²/2] = 10.514
The variance of the colony size is
exp[2(2.352) + 0.039²]{exp[0.039²] - 1} = 110.56(.0015) = 0.17
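The step from the log-scale estimates to the mean and variance of the colony size can be sketched in Python. The function name is illustrative, and the inputs are the estimates quoted in the example rather than values recomputed from the raw data.

```python
import math

def lognormal_mean_var(mu, sigma):
    """Mean and variance of a lognormal with log-scale parameters mu, sigma."""
    mean = math.exp(mu + sigma ** 2 / 2)
    var = math.exp(2 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1)
    return mean, var

mu_hat, sigma_hat = 2.352, 0.039   # estimates quoted in the example
size_mean, size_var = lognormal_mean_var(mu_hat, sigma_hat)
print(round(size_mean, 3), round(size_var, 2))
```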
[Figure: Lognormal distribution]
Logistic distribution
The logistic distribution plays a major role in describing growth
processes, survival data and demographic studies. Some of the early
applications were in the study of population growth, along with
numerous other growth function studies which are
referenced in the Handbook of the Logistic Distribution by
Balakrishnan (1992). More recently it has been used in bioassay
studies and in the analysis of survival distributions.
The distribution function is:
F(x; μ, σ) = 1/[1 + e^{-π(x-μ)/(σ√3)}], -∞ < x < ∞, σ > 0, -∞ < μ < ∞
with corresponding density function given by:
f(x; μ, σ) = [π/(σ√3)] e^{-π(x-μ)/(σ√3)}/[1 + e^{-π(x-μ)/(σ√3)}]², -∞ < x < ∞, σ > 0, -∞ < μ < ∞
The mean and variance of the distribution are μ and σ².
Like the normal distribution, practitioners often work with
the standard logistic random variable, Z = π(X - μ)/(σ√3).
Z has density:
f(z) = e^{-z}/[1 + e^{-z}]², -∞ < z < ∞
The standard logistic distribution function is
F(z) = 1/[1 + e^{-z}], -∞ < z < ∞
The moment estimators of the parameters are the sample
mean for μ and the sample
variance for σ². The distribution resembles the normal
distribution but has heavier tails.
The growth of a bacteria colony can be modeled by a logistic
distribution. In an experiment the diameter of a colony is measured
after three days growth. Ten colonies are measured and the results are:
23 46 32 38 30 68 44 37 36 30
The sample mean is 38.4 and the sample standard deviation is 12.44.
Hence, the estimate of μ is 38.4 and the estimate of σ is 12.44.
The probability that the size of a colony will be less than 22 is given by
F(22).
Converting this to the standard variable, Z = 3.14(22-38.4)/(12.44(1.73)) =
-2.39, then
F(-2.39) = 0.084.
Thus the probability that a colony will
have a diameter of less than 22 is approximately 0.084.
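A minimal Python sketch of this calculation, using the logistic distribution function parameterized by its mean μ and standard deviation σ. The function name is illustrative.

```python
import math

def logistic_cdf(x, mu, sigma):
    """F(x) for a logistic distribution with mean mu and std deviation sigma."""
    z = math.pi * (x - mu) / (sigma * math.sqrt(3))   # standard logistic variable
    return 1 / (1 + math.exp(-z))

mu_hat, sigma_hat = 38.4, 12.44
prob_below_22 = logistic_cdf(22, mu_hat, sigma_hat)
print(round(prob_below_22, 3))
```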
[Figure: Logistic distribution densities for several values of σ]
See pages 122-132 of HS for a summary description of this
material.
There are several other models which are not covered in this
course.
Models such as the Pareto and the extreme value
distribution can be found in the literature.
DISCRETE DISTRIBUTIONS
Discrete distributions
The foregoing distributions are used when the data
being observed are measurements. In some cases the
data are counts (discrete) and hence a discrete
distribution must be chosen. The selection process
depends on two factors: the type of experiment and
the question under study. There are two types of
experiments that we will cover in this course. The first
involves Bernoulli trials.
Bernoulli trial
A Bernoulli trial is one in which the outcome of the experiment has only
two possibilities. We will call these success and failure. These are
coded one and zero. This covers experiments such as live or die, win
or lose, success or failure, cure or not cure, etc.
The Bernoulli random variable is:
Y = 0 if a failure occurs
Y = 1 if a success occurs
The probability function is:
f(y) = p^y (1 - p)^{1-y}, y = 0, 1.
This is not a useful model but serves as a building block for other
discrete distributions.
The Binomial Distribution
Experiment: n independent Bernoulli trials where the
pr{1} = p on each trial.
Question: What is the probability of Y successes in n
trials?
Answer: Binomial Distribution:
f(y) = [n!/{y!(n-y)!}] p^y (1-p)^{n-y}; y = 0,1,…,n; 0 < p < 1.
Here n is the sample size, Y is the number of ones and p
is the probability of a one. The estimator of the parameter
p is Y/n, the mean of Y is np and the variance is np(1-p).
A new antibiotic is being tested on 20 subjects that have
been infected with a virus.
At the end of three days they are tested to see if the
infection is gone.
There are two possible outcomes for this experiment: Yes
and No
The trials are independent and we are interested in the
probability that two or fewer people will not be cured (Y ≤ 2),
where Y is the number of persons not cured.
The results were that only one person was not cured, so the
estimate of p, the probability of not being cured, is 1/20 = 0.05.
Using a binomial table or the computer the probability that
Y is less than or equal to two is 0.9245.
Thus, the estimate of the probability that someone will be
cured by the treatment is 0.95.
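The binomial table lookup can be replaced by a direct sum of the probability function. A minimal Python sketch with illustrative function names:

```python
import math

def binomial_pmf(y, n, p):
    """Probability of exactly y ones in n independent Bernoulli trials."""
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

def binomial_cdf(y, n, p):
    """Probability of y or fewer ones."""
    return sum(binomial_pmf(k, n, p) for k in range(y + 1))

n_subjects = 20
p_hat = 1 / n_subjects          # one subject out of 20 was not cured
prob_at_most_2 = binomial_cdf(2, n_subjects, p_hat)
print(round(prob_at_most_2, 4))
```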
The Geometric distribution
Experiment: Independent Bernoulli trials where the pr{1}
= p on all trials.
Question: What is the probability that the first success
comes on trial Y?
Answer: Geometric distribution
In this situation the random variable is the number of trials
while in the binomial distribution it was the number of ones
occurring in n trials.
f(y) = (1-p)^{y-1} p, y = 1,2,…, 0 < p ≤ 1
The mean of the distribution is 1/p and the variance is (1-p)/p².
The estimator of the parameter, p, is the reciprocal of
the sample mean.
In the use of a defibrillator the device is used until the patient's heart
starts beating on its own. A study was conducted to estimate the
probability that the heart starts beating only after at least five uses of the
device. The data available are the number of trials required on 20 patients
for a successful use of the device. The data are:
2 5 3 6 7 4 2 8 2 1 4 6 3 7 1 4 2 4 5 2
The sample mean is 78/20 = 3.9. Thus the estimate of p is 1/3.9 = 0.256.
P{Y ≥ y} = (1-p)^{y-1} and P{Y < y} = 1 - (1-p)^{y-1}
The estimated probability that a defibrillator will be used at least five
times is (1 - 0.256)^4 = 0.306
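The geometric calculation can be sketched in Python as follows. The function name is illustrative; the code keeps the unrounded estimate of p, so the result agrees with the hand calculation only to about three decimals.

```python
import statistics

# Number of defibrillator uses needed for each of the 20 patients
trials = [2, 5, 3, 6, 7, 4, 2, 8, 2, 1, 4, 6, 3, 7, 1, 4, 2, 4, 5, 2]

p_hat = 1 / statistics.mean(trials)   # reciprocal of the sample mean

def prob_at_least(y, p):
    """P{Y >= y}: the first y - 1 trials must all be failures."""
    return (1 - p) ** (y - 1)

prob_5_or_more = prob_at_least(5, p_hat)
print(round(p_hat, 3), round(prob_5_or_more, 3))
```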
A generalization of the geometric is the negative binomial or Pascal
distribution. The experiment is the same as in the geometric case;
however the question is what is the probability that the sth success occurs
on trial Y.
The formula and other information can be found in HS.
The Hypergeometric distribution
Experiment: Dependent Bernoulli trials in a finite population of size N
from which a sample of size n is taken where there are k ones and N-k
zeros.
Question: What is the probability of Y successes in the sample of n?
The probability function is:
f(y) = [k!/{y!(k-y)!}][(N-k)!/{(n-y)!(N-k-n+y)!}]/[N!/{n!(N-n)!}],
y = 0,1,2,…,n; y ≤ k; n-y ≤ N-k.
The mean of the distribution is nk/N and the variance is [nk(N-k)(N-n)]/{N²(N-1)}.
The parameter of interest in this case is k; its estimator is the greatest
integer less than or equal to (N+1)y/n. In some problems N is the
unknown parameter. If that is the case the estimator of N is the greatest
integer less than or equal to kn/y.
A public health official is interested in estimating the
number of persons with a given disease in a group of
20.
The examination is costly so he takes a random sample
of 10 people and he finds that 4 have the disease.
He wishes to estimate, k, the number of sick
individuals.
The estimate of k is 21(4)/10 = 8.4, hence the estimate
is 8 sick individuals.
Using this estimate, the probability of finding 4 people
out of ten with the disease would be:
{8!/(4!4!)} x [12!/(6!6!)]/[20!/(10!10!)] = 0.35
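The estimate of k and the resulting probability can be sketched in Python with binomial coefficients from the standard library. The function names are illustrative.

```python
import math

def estimate_k(y, N, n):
    """Greatest integer less than or equal to (N + 1) * y / n."""
    return ((N + 1) * y) // n

def hypergeom_pmf(y, N, n, k):
    """P{Y = y}: y ones in a sample of n from N items containing k ones."""
    return math.comb(k, y) * math.comb(N - k, n - y) / math.comb(N, n)

N, n, y = 20, 10, 4
k_hat = estimate_k(y, N, n)               # 21 * 4 / 10 = 8.4, so 8
prob_4_sick = hypergeom_pmf(y, N, n, k_hat)
print(k_hat, round(prob_4_sick, 2))
```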
The Poisson distribution
Experiment: Count the number of occurrences of independent events
that occur at rate λ in a period of time or in an area, volume, unit, etc.
Question: What is the probability of Y occurrences in the period of time,
area, etc.?
Answer: Poisson distribution.
This experiment does not involve Bernoulli trials nor is there a sample
size. Examples are the number of times a pacemaker has to be replaced
in a patient in a period of one year, the number of tumors found on a
patient, the number of fish caught in a five minute period, the number of
arrivals of ambulances in a one hour period, etc. The probability function
is:
f(y) = λ^y e^{-λ}/y!; λ > 0, y = 0,1,2,…
The distribution function is tabled in most statistics books. Note we used
these tables to evaluate F(t) for the gamma distribution when η is an
integer. The mean and variance of the distribution are both λ. The sample
mean is an estimator of the parameter λ.
In an experiment the number of bacterial colonies on a
slide is counted.
This count can be modeled by a Poisson distribution.
A total of 25 slides are examined yielding a sample mean
of 24.0.
The estimate of λ is 24 colonies per slide.
If we wish to estimate the probability that there could be
fewer than 30 colonies on a slide we find:
p{Y<30} = 0.868
from the Poisson table using F(29).
TESTING MODEL FIT
Test of Model Fit
An important part of analyzing data is to check on
whether the model you intend to use is a reasonable
approximation of the system output. A test of the
distributional assumption should precede any data
analysis. There are tests for all the models that we
covered in the previous lecture. I will show you some of
these; for the others I will supply references. There are
many procedures available for testing model fit; many
of the ones presented here were developed by your
professor who obviously has a bias in their selection.
Probability Plotting
The simplest method of assessing model fit is to use a
graphical procedure known as probability plotting. This
technique can be used for distributions that do not have a
shape parameter or can be transformed to one that does
not have a shape parameter. A shape parameter is one
that changes the shape of the distribution.
The normal, logistic and exponential distributions have no
shape parameters. The lognormal and Weibull can be
transformed to distributions without shape parameters. If
the shape parameter is known then a probability plot can be
made for the specified value of this parameter.
To construct a probability plot one follows these steps:
1. Obtain a sheet of probability paper for the distribution being tested.
2. Rank the observations from smallest to largest.
3. Plot the ranked observations vs p = i/(n+1). The letter “i” is the rank
number of the data point, which runs from 1 to n.
4. Depending on the brand of the paper chosen one of the two axes will
have a preprinted scale. The values of p are plotted on this axis.
5. The values of the ordered observations are plotted on the other axis.
You must scale this axis according to the values of the data. Try to use as
much of the axis as possible.
6. Plot the data.
7. If the data fall in an approximate straight line then you have chosen
a reasonable model. It helps to draw a line on the paper representing a
straight line to judge whether it reasonably fits the data. Remember the
data are random variables and will not form a perfect straight line.
Look for departures in the tails. See chapter 8 in HS for examples of
plots.
8. If the plot is approximately linear crude estimators of the parameters
can be obtained from the plot.
9. In evaluating a plot remember that the variance of points in the
tail(s) is higher than those in the middle of the distribution. Thus the
relative linearity of the plot near the tails of a distribution will often
seem poorer than at the center of the distribution even if the correct
model was chosen. This statement is not applicable if a tail is
bounded. Thus for the exponential distribution it is not true for the
lower tail.
10. The plotted points are ordered and hence not independent. Each
point is higher than the previous one. Thus we should not expect them
to be randomly scattered about the plotted line.
11. A model can never be proven to be correct even if a straight line
appears to be appropriate. This is especially true for small sample
size where only extreme differences from the selected model can be
detected. The best one can say is that there is no evidence from the
data that the model is unreasonable.
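The steps above can be imitated without probability paper by converting each plotting position to a standard normal percentile and plotting the ordered data against it. The Python sketch below is an illustration under stated assumptions: it uses the plotting position i/(n+1), and it replaces the line drawn by eye with an ordinary least-squares line, whose intercept and slope serve as crude estimates of μ and σ. The function names are illustrative.

```python
from statistics import NormalDist, mean

def normal_plot_points(data):
    """Pairs (z_i, y_i): normal percentile of i/(n+1) vs i-th ordered value."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf(i / (n + 1)), y)
            for i, y in enumerate(ordered, start=1)]

def fit_line(points):
    """Least-squares intercept and slope of y on z (stand-in for a line by eye)."""
    zs = [z for z, _ in points]
    ys = [y for _, y in points]
    z_bar, y_bar = mean(zs), mean(ys)
    slope = (sum((z - z_bar) * (y - y_bar) for z, y in points)
             / sum((z - z_bar) ** 2 for z in zs))
    return y_bar - slope * z_bar, slope

temps = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
points = normal_plot_points(temps)
mu_plot, sigma_plot = fit_line(points)   # intercept at z = 0, and slope
print(round(mu_plot, 2), round(sigma_plot, 2))
```

If the points lie close to the fitted line, the normal model is reasonable; the slope is only a crude estimate of σ and will generally differ from the sample standard deviation.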
Using the data from the class assignment prepare a normal probability
plot by hand on the probability paper given to you. Normal probability
plots can also be done using SPSS.
Enter the data in the computer and use SPSS to prepare the plot.
Compare the two plots.
Normal Distribution
An estimate of μ, the mean of the distribution, can be
obtained from the plotted line by determining the data value
at which the line crosses the 50th percentile line (the preprinted
scale). The parameter σ can be estimated from the slope of
the plotted line. For any normal distribution the standard
deviation equals approximately two-fifths of the difference
between the 90th and the 10th percentiles. The percentiles are
obtained from the plotted line: they are the data values where the
plotted line crosses the .10 and .90 percentile lines on the preprinted
scale. Any percentile can be obtained directly from the
plotted line in like manner. Use the plot to estimate the
standard deviation for the above data set and compare the
result with that obtained from SPSS. Find the value from the
distribution that will not be exceeded by more than five
percent of the values.
Weibull Distribution
The plot for the Weibull distribution is done exactly as that
for the normal distribution except the scales on the paper
are different and there are two scales for each axis.
On the paper shown in HS the values on the bottom X
scale are Z = ln X.
The scaling of this axis makes the transformation.
You need only to plot the data using the preprinted scale.
The parameter η can be estimated from the slope of the
plotted line as follows:
1. Select 2 values of W, the Y axis values at the right hand
side of the paper. Call these W2 (the larger value chosen)
and W1 (the smaller value chosen).
2. Using the scale at the top of the X axis find values Z2 and
Z1 corresponding to where the chosen W lines cross the
plotted line.
3. The estimator of η is b = [W2 - W1]/[Z2 - Z1].
4. The estimator of σ is obtained by finding a, the intercept
where the plotted line crosses the line Z = 0 on the
top scale.
Other Probability Plots
Probability plots for the lognormal distribution can be done in
two ways. If you do not have a sheet of lognormal paper you
need only take the log of the data and plot the transformed
data on regular normal probability paper. Use the same
procedures for estimating the two parameters as was done
for the normal case. If you have a sheet of log normal paper
then simply plot the data without transforming them. This
paper has a log scale.
There is also probability paper for the gamma distribution if
you know the value of η and it is an integer. The paper varies
with the value of η. Note that when η is equal to one the
distribution is the exponential. The scale parameter of the
gamma distribution can be estimated by the slope of the line.
See HS for further details.
Tests Of Distributional Fit
Probability plots are a simple method of checking whether a chosen
model is reasonable.
However it is a subjective procedure and two individuals looking at
the same plot could come to different conclusions.
A formal test of hypothesis can be made for which a p-value can be
obtained and used for making a decision as to the reasonableness
of the chosen model.
Most of these procedures are complex and only references to them
will be given.
There are several general approaches to these test procedures; one
that I have used will be given for this class.
We have seen that in the normal probability plotting
procedure one can estimate the variance of the distribution
from the square of the slope of the plotted line.
However this estimate is a valid estimator only if the points
plotted fall in an approximate linear pattern. If the points do
not fall in an approximate straight line then the estimated
variance is incorrect.
The sample variance is an unbiased estimator of the
variance whether or not the sample data came from a
normal model. Similar to the ANOVA technique that you
learned in your statistics class a test of the null hypothesis
that the model is correct can be made by comparing these
two estimates by computing their ratio.
Shapiro-Wilk W Test For The Normal
Distribution
The numerator and denominator of the ratio for testing for
normality are not independent and hence the ratio does not
have an F distribution and the slope of the line can not be
obtained by simple linear regression.
The steps for computing the ratio and obtaining the test
statistic, W, can be found in HS. This test procedure is
included in most general software packages including
SPSS.
Use the data from the normal and probability plot examples
to test for normality.
The Brain-Shapiro Test for the Exponential
Distribution
The test for the exponential distribution in HS is outdated. A better test was
devised by Brain and Shapiro and published in Technometrics, Vol. 25, pp. 60-76,
1983, entitled “A Regression Test for Exponentiality: Censored and Complete
Samples”. The test statistic is computed as follows:
1. Order the data from smallest to largest: X1 ≤ X2 ≤ … ≤ Xn
2. Compute the quantities Y_{i+1} = [n-i][X_{i+1} - X_i], i = 1, 2, 3, …, n-1
3. Compute Z = [12/(n-2)]^{1/2} [Σ {i-(n/2)}Y_{i+1}]/[Σ Y_{i+1}], with sums over i = 1, 2, …, n-1
4. Compute V = [5/{4(n+1)(n-2)(n-3)}]^{1/2} [12 Σ {i-(n/2)}²Y_{i+1} - n(n-2) Σ Y_{i+1}]/[Σ Y_{i+1}]
5. Calculate the test statistic U = Z² + V².
6. For large samples U has a chi-squared distribution with two
degrees of freedom when the data come from an exponential
distribution. This is an upper tail test; non-exponentiality will result in
large values of U. When n > 15 the critical values for the test are as
follows:
90th percentile: U0.90 = 4.605 - 2.5/n
95th percentile: U0.95 = 5.991 - 1.25/n
97.5th percentile: U0.975 = 7.378 + 3.333/n.
7. Thus if the value of U is greater than one of the three values above
then the p-value is less than 1 minus the corresponding subscript.
Hence if it is greater than the 90th percentile then the p-value is less
than 0.10.
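The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from Brain and Shapiro or HS: the function names `brain_shapiro_u` and `critical_values` are made up for this example, and the spacings are weighted by (n - i), the usual normalized spacings for ordered exponential data.

```python
import math


def brain_shapiro_u(data):
    """Return (Z, V, U) for the regression test of exponentiality."""
    x = sorted(data)                      # step 1: order the data
    n = len(x)
    if n < 4:
        raise ValueError("at least 4 observations are needed")
    # step 2: y[i - 1] holds Y_{i+1} = (n - i)(X_{i+1} - X_i), i = 1..n-1
    y = [(n - i) * (x[i] - x[i - 1]) for i in range(1, n)]
    total = sum(y)
    if total == 0:
        raise ValueError("all observations are equal")
    s1 = sum((i - n / 2) * y[i - 1] for i in range(1, n))
    s2 = sum((i - n / 2) ** 2 * y[i - 1] for i in range(1, n))
    # steps 3 to 5
    z = math.sqrt(12 / (n - 2)) * s1 / total
    v = (math.sqrt(5 / (4 * (n + 1) * (n - 2) * (n - 3)))
         * (12 * s2 - n * (n - 2) * total) / total)
    return z, v, z * z + v * v


def critical_values(n):
    """Step 6: upper-tail critical values of U, intended for n > 15."""
    return {0.90: 4.605 - 2.5 / n,
            0.95: 5.991 - 1.25 / n,
            0.975: 7.378 + 3.333 / n}


# Equally spaced data are far from exponential.
z, v, u = brain_shapiro_u(range(1, 21))
reject_at_5_percent = u > critical_values(20)[0.95]
```

For the equally spaced values 1, 2, ..., 20 the statistic works out to U = 6.0, which exceeds the 95th percentile 5.991 - 1.25/20, so the p-value is less than 0.05.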
Chi-squared Goodness-of-Fit Test
The chi-squared goodness of fit test is used to test
whether a selected discrete model is appropriate to
model a set of discrete data. We will not discuss this
procedure since it was covered in your prior statistics
course. The procedure is described in HS.
Goodness-of-Fit Tests for Other Distributions
The following are references for tests for other
models.
Johnson Distribution
Scientists are often faced with the problem of summarizing a set of
data by means of a mathematical function which fits the data and
allows them to obtain estimates of percentiles and probabilities. A
common practice is to use a flexible family of distributions to
accomplish this. In most cases the family has four parameters. One
such system is the Johnson system, which consists of three families,
each with four parameters.
These families are described in HS; however, the procedures given there
for determining which of the three families to use and for estimating the
parameters are out of date. The current procedures are found in an article
by Slifker and Shapiro entitled “The Johnson System: Selection and Parameter
Estimation” published in Technometrics, Vol. 22, Pg. 239-246, 1980.
The following is abstracted from that article:
The system was devised by using a transformation of a
standard normal variable using the equation:
$z = \gamma + \eta\, k_i(x; \lambda, \varepsilon)$
where z is a standard normal variable and $k_i$ is a function
that includes a wide variety of shapes. The parameters
γ and η are shape parameters, λ is a scale parameter
and ε is a location parameter.
The three families are obtained by letting the function $k_i$ equal
the following:
$k_1(x; \lambda, \varepsilon) = \sinh^{-1}[(x - \varepsilon)/\lambda]$, denoted the SU distribution, $-\infty < x < \infty$.
$k_2(x; \lambda, \varepsilon) = \ln[(x - \varepsilon)/(\lambda + \varepsilon - x)]$, denoted the SB distribution, $\varepsilon \le x \le \varepsilon + \lambda$.
$k_3(x; \lambda, \varepsilon) = \ln[(x - \varepsilon)/\lambda]$, denoted the SL distribution, $x \ge \varepsilon$.
The SL distribution is a form of the log-normal distribution having
three parameters. The first step in using this system is to determine
which of the three families to use. The following is the procedure to
accomplish this:
1. Choose a value of z > 0. Any value will do; however, if you want to
have a good fit in the tail of the distribution, a value close
to 0.5 is recommended for moderate sample sizes and 1.0 or
higher for large sample sizes. A good value is 0.548.
2. Compute 3z. If the recommended value of 0.548 is used then 3z
= 1.645.
3. Determine from a table of the normal distribution the cumulative
probabilities corresponding to -3z, -z, z, and 3z. For z = 0.548
these are 0.05, 0.292, 0.708 and 0.95. Call these the $p_i$.
4. We next estimate the data percentiles corresponding to these percentages
using the equation $i = np_i + 1/2$ to determine the ith ranking in an ordered
list of the data. Note that n here is the sample size. The value of i will
usually not be an integer, and linear interpolation will be necessary.
5. Next compute the quantities m, n and p using the data values
($x_{jz}$; j = -3, -1, 1, 3) from the prior step as follows. (In this step n is
one of the three differences, not the sample size.)
$m = x_{3z} - x_{z}$
$n = x_{-z} - x_{-3z}$
$p = x_{z} - x_{-z}$
6. If $mn/p^2 > 1$ use the SU distribution, if it is less than one use the SB,
and if it equals one use the SL.
7. Once the proper family is selected the next step is to estimate the
parameters. The estimation formulas for each family are different; you must
select the appropriate set depending on the selection from the last step.
i) Johnson SU Distribution
The values of the parameters are presented in such a way as to
emphasize their dependence on the ratios m/p and n/p.
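Parameter estimates for Johnson SU Distribution:
$\eta = 2z / \cosh^{-1}\left[\tfrac{1}{2}(m/p + n/p)\right]$
$\gamma = \eta \sinh^{-1}\left[ (n/p - m/p) / \left( 2\{(m/p)(n/p) - 1\}^{1/2} \right) \right]$
$\lambda = 2p\{(m/p)(n/p) - 1\}^{1/2} / \left[ (m/p + n/p - 2)(m/p + n/p + 2)^{1/2} \right]$
$\varepsilon = (x_{z} + x_{-z})/2 + p(n/p - m/p) / \left[ 2(m/p + n/p - 2) \right]$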
ii) Johnson SB Distribution
The solutions for the SB parameters turn out to depend on the ratios p/m
and p/n ( as opposed to m/p and n/p for SU).
Parameter estimates for Johnson SB Distribution:
iii) Johnson SL Distribution
Parameter estimates for Johnson SL Distribution:
8. To determine the value of F(x) for the data set one simply
substitutes the estimates of the parameters into the
transformation for z for the selected model and uses a table
of the normal distribution.
9. To determine a data percentile corresponding to a given
percentage point p we solve the defining equation of $z_p$ to
obtain $x_p$, where $z_p$ is the standard normal value
corresponding to the desired value of p. With $w = (z_p - \gamma)/\eta$:
For SL use $x_p = \lambda e^{w} + \varepsilon$.
For SB use $x_p = [\varepsilon + (\lambda + \varepsilon) e^{w}]/[1 + e^{w}]$.
For SU use $x_p = \lambda [e^{2w} - 1]/(2e^{w}) + \varepsilon$.
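Steps 3 through 6 of the selection procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are invented for this example, ranks outside 1..n are clamped to the smallest or largest observation, and a small tolerance `tol` stands in for "equals one", since an exact equality will essentially never occur with real data.

```python
import math
from statistics import NormalDist


def data_percentile(data, prob):
    """Estimate a percentile from the order statistics using i = n*p + 1/2."""
    x = sorted(data)
    n = len(x)
    i = n * prob + 0.5                    # 1-based, possibly fractional rank
    i = min(max(i, 1.0), float(n))        # assumption: clamp to the data range
    lower = math.floor(i)
    frac = i - lower
    if frac == 0:
        return x[lower - 1]
    # linear interpolation between neighbouring order statistics
    return x[lower - 1] + frac * (x[lower] - x[lower - 1])


def select_family(x_m3z, x_mz, x_z, x_3z, tol=1e-6):
    """Return (family, m, n, p) from the four data percentiles."""
    m = x_3z - x_z
    n = x_mz - x_m3z
    p = x_z - x_mz
    ratio = m * n / p ** 2
    if abs(ratio - 1.0) <= tol:
        family = "SL"
    elif ratio > 1.0:
        family = "SU"
    else:
        family = "SB"
    return family, m, n, p


def family_for_data(data, z=0.548):
    """Steps 3 to 6: percentages, data percentiles, then the mn/p^2 rule."""
    phi = NormalDist().cdf
    points = [data_percentile(data, phi(k * z)) for k in (-3, -1, 1, 3)]
    return select_family(*points)


temperatures = [98.2, 97.6, 97.7, 98.6, 98.2, 97.8, 96.7, 98.4, 97.9, 97.4]
p75 = data_percentile(temperatures, 0.75)   # eighth ordered value, 98.2
```

The last line repeats the 75th percentile calculation from the temperature example, where i = 10(0.75) + 0.5 = 8 selects the eighth ordered value.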
a. The following example was taken from a large sample of 9440
measurements.
b. Since this is a very large sample we use a value of z = 1.0.
c. The first step is to order the data from the smallest to the largest.
d. Using the chosen value of z we obtain the percentages corresponding to:
-3z (p = 0.0014), -z (p = 0.1587), z (p = 0.8413) and 3z (p = 0.9986).
e. Using $i = np_i + 1/2$ we find that:
$x_{-3z} = 10.16$, $x_{-z} = 13.58$, $x_{z} = 15.24$ and $x_{3z} = 16.68$.
f. Thus m = 1.447, n = 3.172 and p = 1.661. (Note that the numbers on
the above line were rounded off.) Hence $mn/p^2 = 1.664$. This indicates
that the proper model is the SU distribution.
g. Using the estimating equations for this model, with m/p = 0.871 and
n/p = 1.910, we find that the estimates of the parameters are:
$\eta = 2(1)/\cosh^{-1}\{\tfrac{1}{2}(0.871 + 1.910)\} = 2.333$
$\gamma = 2.333 \sinh^{-1}[\{1.910 - 0.871\}/(2\{(1.910)(0.871) - 1\}^{1/2})] = 1.402$
$\lambda = 2(1.66)(1.664 - 1)^{1/2}/[(0.871 + 1.910 - 2)(0.871 + 1.910 + 2)^{1/2}] = 1.585$
$\varepsilon = [15.242 + 13.58]/2 + 1.661(1.910 - 0.871)/[2(1.910 + 0.871 - 2)] = 15.516$
In order to find the probability that a measurement will be smaller than
9.0 we compute F(9.0) by first finding the corresponding z value:
$z = \gamma + \eta \sinh^{-1}[(x - \varepsilon)/\lambda]$
$z = 1.402 + 2.333 \sinh^{-1}[\{9.0 - 15.516\}/1.585] = 1.402 + 2.333 \sinh^{-1}(-4.11) = -3.54$
F(9.0) = Φ(-3.54) ≈ 0.0002.
Thus there is a very low probability of getting a measurement below 9.0.
If one desires the median of the distribution (p = 0.5), which corresponds
to a value of z of zero, then
$x_{0.50} = \lambda [e^{2w} - 1]/(2e^{w}) + \varepsilon$
with $w = (z_p - \gamma)/\eta = (0 - 1.402)/2.333 = -0.6$, so that
$x_{0.50} = 1.585[0.30 - 1]/[2(0.549)] + 15.516 \approx 14.5$.
Thus one-half of the distribution is below about 14.5.
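The SU calculations in this example can be checked with a short script. It is a sketch only: the estimating equations are written as they appear numerically in step g of the example, the function names are invented here, and `statistics.NormalDist` from the Python standard library replaces the table of the normal distribution.

```python
import math
from statistics import NormalDist


def su_parameters(z, x_minus_z, x_plus_z, m, n, p):
    """SU estimates (gamma, eta, lam, eps), following step g of the example."""
    mp, n_p = m / p, n / p
    root = math.sqrt(mp * n_p - 1.0)      # requires mn/p^2 > 1 (SU case)
    eta = 2.0 * z / math.acosh(0.5 * (mp + n_p))
    gamma = eta * math.asinh((n_p - mp) / (2.0 * root))
    lam = 2.0 * p * root / ((mp + n_p - 2.0) * math.sqrt(mp + n_p + 2.0))
    eps = (x_plus_z + x_minus_z) / 2.0 + p * (n_p - mp) / (2.0 * (mp + n_p - 2.0))
    return gamma, eta, lam, eps


def su_cdf(x, gamma, eta, lam, eps):
    """F(x): transform x to z, then look up the standard normal probability."""
    z = gamma + eta * math.asinh((x - eps) / lam)
    return NormalDist().cdf(z)


def su_percentile(prob, gamma, eta, lam, eps):
    """x_p = lam * sinh(w) + eps with w = (z_p - gamma) / eta."""
    w = (NormalDist().inv_cdf(prob) - gamma) / eta
    return lam * math.sinh(w) + eps


# Rounded inputs from the example: z = 1, x_{-z} = 13.58, x_z = 15.24
gamma, eta, lam, eps = su_parameters(1.0, 13.58, 15.24, 1.447, 3.172, 1.661)
prob_below_9 = su_cdf(9.0, gamma, eta, lam, eps)
median = su_percentile(0.5, gamma, eta, lam, eps)
```

Note that [e^{2w} - 1]/(2e^{w}) is simply sinh(w), which is why `su_percentile` can call `math.sinh` directly. Because the inputs are rounded, the results agree with the hand calculation only to about two decimal places.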
1. S. Gulati and S. Shapiro, “Goodness of Fit for the Pareto Distribution” in
Statistical Models and Biomedical and Technical Systems published by
Birkhauser, Boston, 263-277.(2007)
2. S. Gulati and S. Shapiro, “Goodness of Fit Tests for the Logistic Distribution”,
Journal of Statistical Theory and Practice, 1,
Analyzing Random Models Via Simulation
In the prior two weeks you learned about modeling of biological
systems. However, these models only represent the average output to
be expected, because the variables in the model are all constants. In
real life each subject has a different value for each constant, and to get
a more realistic picture of the output these constants should be replaced
by random variables that have distributions like the ones we have
discussed. To do this it is necessary to choose a statistical
distribution for each variable in the model, assign values to the
parameters of each random variable, and then run the model over and
over again on a computer, inserting a new value for each of the random
variables on every run. This technique is called Monte Carlo simulation,
and the model is typically run 1000 or more times. Thus you generate a
data set that gives the distribution of the output, as opposed to a static
model that gives you one value. Then you can use the Johnson system to
fit a model to the output and find desired probabilities. Thus you are
able to get a more comprehensive picture of the properties of the output.
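As a rough illustration of the idea, the sketch below runs a made-up model 1000 times. Everything specific in it is hypothetical and chosen only for the example: the model (a dose diluted in a volume and decaying exponentially), the choice of a normal distribution for the volume and a log-normal distribution for the decay rate, and all parameter values.

```python
import math
import random
from statistics import mean


def simulate_outputs(runs=1000, seed=1):
    """Run a hypothetical model repeatedly with random inputs."""
    rng = random.Random(seed)
    dose, t = 100.0, 5.0                  # fixed inputs (hypothetical)
    outputs = []
    for _ in range(runs):
        # volume ~ normal(mean 50, sd 5); rate ~ log-normal with median 0.1
        volume = 50.0 + 5.0 * rng.gauss(0.0, 1.0)
        rate = math.exp(math.log(0.1) + 0.2 * rng.gauss(0.0, 1.0))
        outputs.append(dose / volume * math.exp(-rate * t))
    return outputs


outputs = simulate_outputs()
# The static model uses the constants 50 and 0.1 and gives a single value.
static_value = 100.0 / 50.0 * math.exp(-0.1 * 5.0)
average_output = mean(outputs)
```

The list `outputs` plays the role of the generated data set: it can be ordered, summarized by percentiles, or fitted with one of the Johnson families, whereas `static_value` is the single number the constant-parameter model would give.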
In this lecture we will first discuss how to generate
random numbers from some of the distributions covered
earlier. Most computer systems have programs for
generating random variates from the normal and uniform
distributions. We will use these programs to generate
variates from other models. Define Z as a normal variate
with mean zero and variance one and U as a uniform
variate on the interval (0,1).
Normal Distribution
Suppose we desire random variates, Y, from a normal
distribution with mean μ and variance σ².
Setting
$Y = \mu + \sigma Z$
will yield the desired random variable.
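A minimal sketch of this transformation, assuming Python's standard `random` module as the source of standard normal variates; the function name `normal_variate` is made up for the example, and the parameter values are the temperature estimates 97.85 and 0.550 from earlier in the notes.

```python
import random


def normal_variate(z, mu, sigma):
    """Turn a standard normal variate z into a normal(mu, sigma^2) variate."""
    return mu + sigma * z


rng = random.Random(0)                    # seeded for repeatability
sample = [normal_variate(rng.gauss(0.0, 1.0), 97.85, 0.550)
          for _ in range(10000)]
sample_mean = sum(sample) / len(sample)
```

The same function reproduces the earlier percentile calculation: with z = 1.645 it returns 97.85 + 0.550(1.645), the model-based estimate of the 90th percentile.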
Exponential Distribution
For any continuous distribution, F(Y) has a uniform distribution on (0,1).
If F(y) can be written as an analytical function, then setting
F(Y) = U and solving for Y yields a random variate from that
distribution.
For the exponential distribution $F(y) = 1 - e^{-\lambda y} = U$;
thus $Y = -(1/\lambda) \ln(1 - U)$ yields the desired random
variate from the exponential distribution.
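The inversion can be sketched as below. The function name `exponential_variate` and the value λ = 2 are illustrative assumptions; `random.random()` returns values in [0, 1), so 1 - U is never zero and the logarithm is always defined.

```python
import math
import random


def exponential_variate(u, lam):
    """Solve 1 - exp(-lam * y) = u for y."""
    return -math.log(1.0 - u) / lam


rng = random.Random(0)                    # seeded for repeatability
sample = [exponential_variate(rng.random(), 2.0) for _ in range(10000)]
sample_mean = sum(sample) / len(sample)   # should be close to 1/lam = 0.5
```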
Gamma Distribution
(integer shape parameter)
The gamma distribution with an integer shape parameter η
can be viewed as a sum of η independent exponential
variables with scale parameter λ.
Thus to generate a random variate we merely add η
exponential variables using the above formula:
$Y = -\dfrac{1}{\lambda} \sum_{i=1}^{\eta} \ln U_i$
where $U_1, \dots, U_\eta$ are independent uniform variates.
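A short sketch of the sum, with hypothetical function names and illustrative values η = 3 and λ = 2. Since `random.random()` can return exactly 0, the uniforms are drawn as 1 - `random.random()`, which lies in (0, 1] and is equally uniform.

```python
import math
import random


def gamma_variate(uniforms, lam):
    """Sum of len(uniforms) exponential variates: -(1/lam) * sum(ln U_i)."""
    return -sum(math.log(u) for u in uniforms) / lam


def draw_gamma(rng, eta, lam):
    """Draw one gamma variate with integer shape eta."""
    uniforms = [1.0 - rng.random() for _ in range(eta)]   # values in (0, 1]
    return gamma_variate(uniforms, lam)


rng = random.Random(0)                    # seeded for repeatability
sample = [draw_gamma(rng, 3, 2.0) for _ in range(10000)]
sample_mean = sum(sample) / len(sample)   # should be close to eta/lam = 1.5
```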
Log-normal Distribution
The log of the log-normal variable has a normal
distribution with parameters μ and σ.
Thus, we generate the normal variate μ + σZ and
exponentiate it:
$Y = e^{\sigma Z + \mu}$
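In code this is the normal transformation followed by an exponential. The name `lognormal_variate` and the values μ = 1 and σ = 0.5 are assumptions made for this sketch.

```python
import math
import random


def lognormal_variate(z, mu, sigma):
    """Exponentiate the normal variate mu + sigma * z."""
    return math.exp(mu + sigma * z)


rng = random.Random(0)                    # seeded for repeatability
sample = [lognormal_variate(rng.gauss(0.0, 1.0), 1.0, 0.5)
          for _ in range(10000)]
# The logs of the sample should look normal with mean mu = 1.0.
mean_log = sum(math.log(y) for y in sample) / len(sample)
```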
Weibull Distribution
Since $F(y) = 1 - \exp[-(y/\sigma)^{\eta}] = U$, where σ is the scale
parameter and η is the shape parameter,
then solving for Y yields:
$Y = \sigma[-\ln(1 - U)]^{1/\eta}$.
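The Weibull case follows the same inversion pattern as the exponential. In this sketch `sigma` is taken to be the scale parameter and `eta` the shape parameter, and the function name is invented for the example.

```python
import math
import random


def weibull_variate(u, sigma, eta):
    """Solve 1 - exp(-(y / sigma) ** eta) = u for y."""
    return sigma * (-math.log(1.0 - u)) ** (1.0 / eta)


rng = random.Random(0)                    # seeded for repeatability
sample = [weibull_variate(rng.random(), 3.0, 2.0) for _ in range(1000)]
```

With a shape parameter of 1 the formula reduces to the exponential generator with mean equal to the scale parameter, which gives a quick consistency check.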
Other Distributions
Formulas for some discrete and other continuous
distributions can be found in HS.