Probability distribution functions

• Normal distribution
• Lognormal distribution
• Mean, median and mode
• Tails
• Extreme value distributions

Normal (Gaussian) distribution

• Probability density function (PDF)

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

• What does the figure tell about the cumulative distribution function (CDF)?

$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
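The relation between the PDF and the CDF can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `normal_pdf` and `normal_cdf` are illustrative choices, not part of the slides.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# The CDF is the area under the PDF to the left of x, so it rises fastest
# where the PDF peaks (x = mu) and equals 0.5 there by symmetry.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  pdf={normal_pdf(x):.4f}  cdf={normal_cdf(x):.4f}")
```

The slope of the CDF at any x is the PDF value there, which is what the figure is meant to suggest.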

More on the normal distribution

• Normal distribution is denoted $N(\mu, \sigma^2)$, with the square giving the variance.
• If X is normal, Y = aX + b is also normal. What would be the mean and standard deviation of Y?
• Similarly, if X and Y are jointly normal variables (for example, independent normals), any linear combination aX + bY is also normal.
• A general function of a normal random variable can often be treated as approximately normal by using a linear Taylor expansion about the mean.
• Example: $X = N(10, 0.5^2)$ and $Y = X^2$. Then $X^2 \approx 100 + 20(X - 10)$, so $Y \approx N(100, 10^2)$.
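To see how good the linear Taylor approximation is for the example with X = N(10, 0.5²) and Y = X², the sketch below compares the linearized mean and standard deviation with the exact moments of X² and with a Monte Carlo estimate. The function names, the sample size, and the seed are arbitrary choices made for this illustration.

```python
import math
import random
import statistics

def linearized_square(mu, sigma):
    """First-order Taylor estimate of the mean and std of Y = X**2."""
    # Y ~= mu**2 + 2*mu*(X - mu): a linear function of X, hence normal.
    return mu * mu, abs(2.0 * mu) * sigma

def exact_square(mu, sigma):
    """Exact mean and std of X**2 for X ~ N(mu, sigma**2)."""
    mean = mu * mu + sigma * sigma
    var = 4.0 * mu * mu * sigma * sigma + 2.0 * sigma ** 4
    return mean, math.sqrt(var)

def simulated_square(mu, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of the mean and std of X**2."""
    rng = random.Random(seed)
    ys = [rng.gauss(mu, sigma) ** 2 for _ in range(n)]
    return statistics.mean(ys), statistics.stdev(ys)

print("linear Taylor:", linearized_square(10.0, 0.5))
print("exact moments:", exact_square(10.0, 0.5))
print("Monte Carlo:  ", simulated_square(10.0, 0.5))
```

The linearization works well here because the standard deviation is small compared with the mean; it would degrade if X had a large coefficient of variation.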

Estimating mean and standard deviation

• Given a sample from a normally distributed variable, the sample mean is the best linear unbiased estimator (BLUE) of the true mean.

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• For the variance, the equation below gives the best unbiased estimator, but its square root is not an unbiased estimate of the standard deviation.

$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$
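The two estimators can be written out directly. This is a small sketch with made-up data values; `sample_mean` and `sample_variance` are illustrative names, and the standard library's `statistics.variance` is used only as a cross-check because it also divides by n − 1.

```python
import math
import statistics

def sample_mean(xs):
    """Sample mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased variance estimate: divides by n - 1, not n."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.1, 3.4, 2.9, 4.2, 3.0]  # made-up values for illustration only

print("mean:", sample_mean(data))
print("variance (n-1):", sample_variance(data))
print("library check:", statistics.variance(data))
# The square root of the unbiased variance is still a biased estimate of sigma.
print("std estimate:", math.sqrt(sample_variance(data)))
```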

Lognormal distribution

• If ln(X) has a normal distribution, X has a lognormal distribution. That is, if X is normally distributed, exp(X) is lognormally distributed.
• Notation: $\ln N(\mu, \sigma^2)$
• PDF

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$

• Mean and variance

$$\mu_X = \exp\left(\mu + \frac{\sigma^2}{2}\right), \qquad \sigma_X^2 = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$$

Question

• Suppose the income of a family of four in the United States follows a lognormal distribution with $\mu = \ln(20{,}000)$ and $\sigma^2 = 1.0$ (so that $\mu_X = 32974$ and $\sigma_X = 43224$). See figure. What is your estimate of the mode (that is, the most common income)? The median?

Mean, mode and median

• Mode (highest point) = $\exp(\mu - \sigma^2)$
• Median (50% of samples) = $e^{\mu}$
• Figure for $\mu = 0$.
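As a worked check, the sketch below evaluates the lognormal mean, standard deviation, median, and mode for the income question (µ = ln(20,000), σ² = 1.0). The function name `lognormal_summary` is a hypothetical helper introduced for this example.

```python
import math

def lognormal_summary(mu, sigma):
    """Mean, std, median and mode of lnN(mu, sigma^2)."""
    var = sigma * sigma
    return {
        "mean": math.exp(mu + 0.5 * var),
        "std": math.sqrt((math.exp(var) - 1.0) * math.exp(2.0 * mu + var)),
        "median": math.exp(mu),
        "mode": math.exp(mu - var),
    }

income = lognormal_summary(math.log(20000.0), 1.0)
for name, value in income.items():
    print(f"{name:>6}: {value:10.0f}")
```

The mean and standard deviation reproduce the 32974 and 43224 quoted in the question, while the median is 20,000 and the mode is about 7,358. The ordering mode < median < mean is typical of a distribution skewed to the right.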

Light and heavy tails

• Normal distribution has a light tail; 4.5 sigma is equivalent to a failure or defect probability of 3.4e-6.
• Lognormal can have a heavy tail. Probability of exceeding the mean by 4.5 standard deviations:
  – $\mu = 0$, $\sigma = 1$: 0.0075
  – $\mu = 0$, $\sigma = 0.25$: 7.5e-4
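The tail figures can be reproduced under the assumption that each number is the probability of exceeding the mean by 4.5 standard deviations of the distribution in question. That reading is an interpretation of the slide, and the helper names below are illustrative.

```python
import math

def normal_sf(z):
    """Upper tail P(Z > z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def lognormal_tail_beyond(mu, sigma, k):
    """P(X > mean + k*std) for X ~ lnN(mu, sigma^2)."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    std = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    threshold = mean + k * std
    # X > threshold is the same event as ln(X) > ln(threshold).
    return normal_sf((math.log(threshold) - mu) / sigma)

print(f"normal, 4.5 sigma:       {normal_sf(4.5):.2e}")
print(f"lognormal, sigma = 1:    {lognormal_tail_beyond(0.0, 1.0, 4.5):.2e}")
print(f"lognormal, sigma = 0.25: {lognormal_tail_beyond(0.0, 0.25, 4.5):.2e}")
```

Even the mildly skewed lognormal with σ = 0.25 puts roughly two orders of magnitude more probability beyond 4.5 standard deviations than the normal distribution does.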

Fitting distribution to data

• Usually fit the CDF to minimize the maximum distance (Kolmogorov-Smirnov test).
• Generated 20 points from $N(3, 1^2)$.
• Normal fit: $N(3.48, 0.93^2)$
• Lognormal fit: $\ln N(1.24, 0.26^2)$, with almost the same mean and standard deviation.

[Figure: experimental CDF of the 20 points compared with the lognormal and normal fits; horizontal axis x from 0 to 8, vertical axis from 0 to 1.]
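The Kolmogorov-Smirnov distance itself is simple to compute: it is the largest vertical gap between the empirical CDF and a candidate CDF. The sketch below is not the fitting procedure used for the slide. It estimates the parameters from sample moments rather than by minimizing the distance, uses an arbitrary seed, and redraws the rare non-positive values so that the logarithm is defined; all names are illustrative.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def lognormal_cdf(x, mu, sigma):
    return normal_cdf(math.log(x), mu, sigma) if x > 0.0 else 0.0

def ks_distance(sample, cdf):
    """Largest gap between the empirical CDF of sample and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# 20 points from N(3, 1^2); non-positive draws are redrawn (sketch assumption).
rng = random.Random(1)
data = []
while len(data) < 20:
    x = rng.gauss(3.0, 1.0)
    if x > 0.0:
        data.append(x)

# Moment estimates, NOT a KS-minimizing fit.
mu_n, sd_n = statistics.mean(data), statistics.stdev(data)
logs = [math.log(x) for x in data]
mu_l, sd_l = statistics.mean(logs), statistics.stdev(logs)

d_normal = ks_distance(data, lambda x: normal_cdf(x, mu_n, sd_n))
d_lognormal = ks_distance(data, lambda x: lognormal_cdf(x, mu_l, sd_l))
print(f"normal fit    N({mu_n:.2f}, {sd_n:.2f}^2): KS distance {d_normal:.3f}")
print(f"lognormal fit lnN({mu_l:.2f}, {sd_l:.2f}^2): KS distance {d_lognormal:.3f}")
```

A true KS fit would wrap `ks_distance` in an optimizer over the distribution parameters; with only 20 points, both families usually end up with similar distances.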

Extreme value distributions

• No matter what distribution (with finite variance) you sample from, the mean of the sample tends to be normally distributed as the sample size increases (with what mean and standard deviation?).
• Similarly, the minimum (or maximum) of a sample tends to follow one of a small family of distributions.
• Even though there is an infinite number of distributions, there are only three extreme value distributions.
  – Type I (Gumbel), derived from normal.
  – Type II (Fréchet), e.g. maximum daily rainfall.
  – Type III (Weibull), e.g. weakest link failure.
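The first point can be illustrated by simulation. The sketch below draws repeated samples from an exponential distribution (clearly not normal, with mean 1 and standard deviation 1) and looks at the spread of the sample means; the sample size of 25, the number of repetitions, and the seed are arbitrary choices for this illustration.

```python
import math
import random
import statistics

def sample_means(n, repetitions, seed=0):
    """Means of `repetitions` independent samples of size n from Exp(1)."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(repetitions)
    ]

n = 25
means = sample_means(n, repetitions=4000)

# The parent distribution has mean 1 and standard deviation 1, so the sample
# mean should have mean 1 and standard deviation 1 / sqrt(n).
print("mean of sample means:", round(statistics.mean(means), 3))
print("std of sample means: ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(1.0 / math.sqrt(n), 3))
```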

Maximum of normal samples

With a normal distribution, the maximum of a sample is more narrowly distributed than the original distribution.

[Figure: histograms of the sample maximum. Max of 10 standard normal samples: mean 1.54, standard deviation 0.59. Max of 100 standard normal samples: mean 2.50, standard deviation 0.43.]
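The histogram statistics for the maximum of standard normal samples can be approximated with a short simulation. This is a sketch with an arbitrary seed and 5000 repetitions, so the printed values will differ slightly from the figures quoted on the slide.

```python
import random
import statistics

def maxima(sample_size, repetitions=5000, seed=0):
    """Maximum of `sample_size` standard normal draws, repeated many times."""
    rng = random.Random(seed)
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(sample_size))
        for _ in range(repetitions)
    ]

max10 = maxima(10)
max100 = maxima(100)

for label, values in (("max of 10", max10), ("max of 100", max100)):
    print(f"{label}: mean {statistics.mean(values):.2f}, "
          f"std {statistics.stdev(values):.2f}")
```

Both standard deviations come out well below 1, the standard deviation of the parent distribution, and the spread shrinks further as the sample size grows.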

Gumbel distribution

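• PDF

$$f(x) = \frac{1}{\beta} \exp\left(-\left(z + e^{-z}\right)\right), \qquad z = \frac{x - \mu}{\beta}$$

• CDF

$$F(x) = \exp\left(-e^{-z}\right)$$

• Mean, median, mode and variance

$$\text{mean} = \mu + \gamma\beta, \qquad \text{median} = \mu - \beta\ln(\ln 2), \qquad \text{mode} = \mu, \qquad \text{variance} = \frac{\pi^2}{6}\beta^2$$

• Euler-Mascheroni constant $\gamma \approx 0.5772$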

[Figure: fitted ev1 (type I extreme value) distribution compared with the -max10 data and with the -max100 data; vertical axis from 0 to 1.]
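The Gumbel formulas can be exercised with a small sketch, assuming the usual parameterization for maxima with location µ and scale β, so that z = (x − µ)/β. The moment-matching helper `gumbel_from_moments` is an illustrative shortcut, not the fitting method behind the figure, and the mean 1.54 and standard deviation 0.59 are the values quoted earlier for the maximum of 10 standard normal samples.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def gumbel_pdf(x, mu, beta):
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta

def gumbel_cdf(x, mu, beta):
    z = (x - mu) / beta
    return math.exp(-math.exp(-z))

def gumbel_stats(mu, beta):
    """Mean, median, mode and variance of the Gumbel distribution."""
    return {
        "mean": mu + EULER_GAMMA * beta,
        "median": mu - beta * math.log(math.log(2.0)),
        "mode": mu,
        "variance": (math.pi * beta) ** 2 / 6.0,
    }

def gumbel_from_moments(mean, std):
    """Location and scale that reproduce a given mean and std."""
    beta = std * math.sqrt(6.0) / math.pi
    return mean - EULER_GAMMA * beta, beta

mu, beta = gumbel_from_moments(1.54, 0.59)
stats = gumbel_stats(mu, beta)
print(f"mu = {mu:.3f}, beta = {beta:.3f}")
print({name: round(value, 3) for name, value in stats.items()})
print("CDF at the median:", round(gumbel_cdf(stats["median"], mu, beta), 3))
```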

Weibull distribution

• Probability distribution

$$f(x; \lambda, k) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, \qquad x \ge 0,\ k > 0,\ \lambda > 0$$

• Its log has a Gumbel distribution.
• Used to describe the distribution of strength or fatigue life in brittle materials.
• If it describes time to failure, then k<1 indicates that the failure rate decreases with time, k=1 indicates a constant rate, and k>1 indicates an increasing rate.
• Can add a 3rd parameter by replacing x by x-c.

[Figure: log of Weibull data with ev1 fit; horizontal axis from -6 to 4.]
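The statement about failure rates follows from the Weibull hazard function, the density divided by the survival probability, which simplifies to (k/λ)(x/λ)^(k−1). The sketch below evaluates it for three shape parameters; the function names and the choice λ = 1 are illustrative, and the hazard is only evaluated at x > 0.

```python
import math

def weibull_pdf(x, lam, k):
    return (k / lam) * (x / lam) ** (k - 1.0) * math.exp(-((x / lam) ** k))

def weibull_cdf(x, lam, k):
    return 1.0 - math.exp(-((x / lam) ** k))

def weibull_hazard(x, lam, k):
    """Failure rate pdf / (1 - cdf), in closed form; valid for x > 0."""
    return (k / lam) * (x / lam) ** (k - 1.0)

times = (0.5, 1.0, 2.0, 4.0)
for k in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, 1.0, k) for t in times]
    print(f"k = {k}: " + ", ".join(f"{r:.3f}" for r in rates))
```

The printed rates fall with time for k = 0.5, stay constant for k = 1 (the exponential distribution), and rise for k = 2.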

Exercises

1. Estimate how much rain Gainesville will have in 2014, as well as the aleatory and epistemic uncertainty in your estimate.

2. Find how many samples of normally distributed numbers you need in order to estimate the mean with an error that will be less than 5% of the true standard deviation 90% of the time. Use the fact that the mean of a sample of a normal variable has the same mean and a standard deviation that is reduced by the square root of the number of samples.

3. Both the lognormal and Weibull distributions are used to model strength. Fit 100 data points generated from a standard lognormal distribution by both lognormal and Weibull distributions. Repeat with 5 randomly generated samples. In each case measure the distance using the K-S distance, and translate the result to a sentence of the following format: The maximum difference between the two CDFs is at x=2, where the true probability of x<2 is 60%, the probability from the experimental CDF is 61%, the probability from the lognormal fit is 62% and the probability from the Weibull fit is 64% (these numbers are invented for the purpose of illustrating the format).

4. Generate a histogram of word lengths in this assignment, including hyphens and the math (e.g., x=2 is a 3-letter word), but not punctuation marks. Select an appropriate number of bins for the histogram and explain your selection. Then fit the distribution of word lengths with five standard distributions, including normal, lognormal, and Weibull, using the K-S criterion. What distribution fits best? Compare the graphs of the CDFs.