Probability Density Functions

Transcript Probability Density Functions

Probability Plots

Jake Blanchard Spring 2010 Uncertainty Analysis for Engineers 1

Introduction

  Probability plots allow us to assess the degree to which a set of data fits a particular distribution The idea is to scale the x-axis of a CDF such that the result would be a straight line if the data conforms to the assumed distribution Uncertainty Analysis for Engineers 2

An Example

    Suppose we have a set of data that we suspect is normal.

First, we form an empirical cdf ◦

[f,xx]=ecdf(x)

Then scale cdf so that each unit on the axis corresponds to 1 standard deviation ◦

z=norminv(ff);

Then we plot the data (sorted) against this new axis ◦

figure, plot(y,z,'+')

Uncertainty Analysis for Engineers 3

The Script

n=100; x=normrnd(10,3,n,1); y=sort(x); [f, xx]=ecdf(x) for i=1:n ff(i)=(f(i)+f(i+1))/2; end z=norminv(ff); figure, plot(y,z,'+')

Uncertainty Analysis for Engineers 4

3 2 -1 -2 1 0 -3 0 2 4 6 8 10 12 14 16 Uncertainty Analysis for Engineers 18 5

Matlab has an alternative

  probplot('norm',x) ◦ ◦ ◦ ◦ Options are ◦ exponential ◦ extreme value lognormal normal rayleigh weibull Uncertainty Analysis for Engineers 6

Probability plot for Normal distribution 0.995

0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

0.005

0 2 4 6 8 Data 10 Vertical axis here is cdf, not number of standard deviations 12 14 16 Uncertainty Analysis for Engineers 18 7

Look at some lognormal data

Probability plot for Normal distribution 0.995

0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

0.005

0 0.995

0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

0.005

10 0 2 Normal probability plot 4 6 8 10 12 x 10 6 0.995

0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

0.005

2 4 6 Probability plot for Normal distribution 8 10 Data 12 14 10 2 10 4 Data Lognormal probability plot 10 6 10 8 Normal probability plot of log of data Uncertainty Analysis for Engineers 16 18 8

Now some exponential data

Probability plot for Lognormal distribution 0.995

0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

0.005

10 -3 10 -2 10 -1 Data 10 0 Lognormal probability plot 10 1 0.995

0.99

10 2 0.95

0.9

0.75

0.5

0.25

0.1

0 Probability plot for Exponential distribution Exponential probability plot 10 20 30 40 Data 50 60 Uncertainty Analysis for Engineers 70 80 9

Facts

    On normal probability plots, the intercept is the mean On exponential paper, the slope is 1/  Results at the extremes are expected to deviate from the straight line more than those in the middle On the other hand, for some data, multiple distributions will fit in the center, but not in the tails Uncertainty Analysis for Engineers 10

Results

 We are dealing with samples, so our conclusions tend to be one of ◦ The model appears to be adequate ◦ The model is questionable ◦ The model is not adequate Uncertainty Analysis for Engineers 11

Comparison

     Take some wind data (maximum measured wind velocity over a given period) 20 data points taken over 20 years Compare all 6 Matlab probability plots Compare looking at CDFs Compare other error measures Uncertainty Analysis for Engineers 12

Probability plot for Extreme value distribution 0.999

0.99

0.95

0.9

0.75

0.5

Probability plot for Normal distribution 0.95

0.9

0.75

0.5

0.25

0.1

0.05

60 70 80 Data 90 100 Probability plot for Weibull distribution 0.999

0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

Probability plot for Exponential distribution 0.95

0.9

60 70 80 Data 90 100 Probability plot for Lognormal distribution 0.95

0.9

0.75

0.5

0.25

0.1

0.05

10 1.84

10 1.93

Data Probability plot for Rayleigh distribution 0.95

0.9

0.75

0.1

0.05

10 1.84

Data 10 1.93

0.75

0.5

0.25

0.005

60 70 80 Data 90 100 0.5

0.25

0.1

0.05

0.01

60 70 80 Data 90 100 Uncertainty Analysis for Engineers 13

Empirical CDF 1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0 68 70 EV Norm Exp LogN Weibull Rayleigh 72 74 76 x 78 80 82 84 Uncertainty Analysis for Engineers 86 14

Goodness of Fit Statistics

 ◦ ◦ ◦ ◦ For discrete and continuous sampled data distributions ◦ Chi-square statistic Kolmogorov-Smirnoff (K-S) statistic ◦ Anderson-Darling (A-D) statistic Root Mean Square Error (RMS). Value is limited if there are fewer than about 30 data points.

The lower the value, the closer the distribution appears to fit the data. But they do not provide a measure that the data actually come from the distribution.

Chi-square statistic

 This goodness-of-fit statistic measures ◦ The oldest, most commonly used ◦ Data are grouped into frequency cells and compared to the expected number of observations based on the proposed distribution.

◦ Definition  2 

i N

  1 

O i E E

 

 2 ◦   Where O(i) is the observed frequency of the i th and histogram bar E(i) is the expected frequency from the fitted distribution of x values falling within the x range of the i th histogram bar. It can be overly sensitive to large errors

Chi-Squared Tests

 First we divide values into groups; suggestion is

 4 

 0 .



/ 5  1  2  0 .

 200

 200  For example, if we have n=500 data points, then this gives us about 45 groups (I would use 50 for convenience)

Example (cont.)

   Sort data and divide into 50 cells Find upper and lower bound of values in each cell Calculate expected number of data points in each cell by subracting cdf of lower bound from cdf of upper bound

Example (cont.)

 Compare this value to the value of the chi squared distribution for k-np-1 degrees of freedom and a desired confidence level, where np is the number of parameters in the model (eg, 2 for a normal distribution)

Example (cont.)

  In Matlab, we can get this chi-squared distribution from

chi2inv(p,v)

the number of DOF , where p is the confidence level (0

Kolmogorov-Smirnov Test

   Compare measured cumulative frequency with CDF of assumed theoretical distribution Compare the maximum discrepancy between these two with a critical value of a test statistic and reject fit if former exceeds latter Good when we don’t have many data points

D n

Kolmogorov-Smirnoff Statistic

 max 

F n



   ◦ ◦ ◦ ◦ Where Dn is the K-S distance, n is the total number of data points, F(x) is the distribution function of the fitted distribution, and Fn(x)=i/n and i is the cumulative rank of the data point. K-S is better than χ 2 because data are assessed at all points— ◦ avoids problem of number of bars (bins).

But value determined by the one largest discrepancy ◦ So it takes no account of lack of fit across entire distribution

More on K-S statistic

  The position of D likely to occur away from the low probability tails.

n along the x-axis is more ◦ This insensitivity to lack of fit at the extremes is corrected for in the Anderson-Darling statistic.

Some statistical literature is critical about distribution fitting software that use this statistic as a goodness-of-fit test. ◦ Because the statistic assumes the fitted distribution is fully specified so that the critical region of the curve can be checked.

Process

  Sort n data points Make a step-wise cdf ◦ (cdfplot in Matlab)

S n

(

)      



x i



x i

 1



x n

   Fit data to a model to obtain model cdf Find maximum difference between these two cdf’s over each of the steps in the first cdf Look up comparison data in tables

KS Critical Values

10 20 30 40 50 >50 

=0.2

0.32

0.23

0.19

0.17

0.15

1.07/sqrt(n) 

=0.1

0.37

0.26

0.22

0.19

0.17

1.22/sqrt(n) 

=0.05

0.41

0.29

0.24

0.21

0.19

1.36/sqrt(n) 

=0.01

0.49

0.36

0.29

0.25

0.23

1.63/sqrt(n)

What is the Significance Level?

    Our hypothesis is that the fit is a good fit.

If difference in cdf’s exceeds that of the test statistic, we reject the hypothesis There are two possibilities: ◦ Fit really is bad, or ◦ We are rejecting a good fit Significance level is probability that we are rejecting a good fit

Fit Tests in Matlab

 

chi2gof(x) – normal distribution only kstest(x,CDF)

Matlab Code

w1=data(:,4); n=numel(w1); hist(w1,25); ncells=128; npoints=51; nused=ncells*npoints; datasort=sort(w1(1:nused)); a=4.52997; b=2.72931; %from dfittol chisqr=0; upperbound=min(datasort); for i=1:ncells indx=(i-1)*(npoints)+1; thisset=datasort(indx:indx+npoints-1); lowerbound=upperbound; upperbound=max(thisset); ee=n*(gamcdf(upperbound,a,b) gamcdf(lowerbound,a,b)); chisqr=chisqr+(npoints-ee)^2/ee; end chisqr param=2; v=ncells-param-1; chpdf=chi2inv(0.95,v)

KS Test

a=13.9761; %from dfittool b=2.34506; %from dfittol cdfvals=wblcdf(xvals,a,b); kstest(datasort,[xvals' cdfvals'])

Anderson-Darling Statistic

  This is a more sophisticated and complex version of the K-S, It is more powerful because ◦ The f(x) weights the observed distances by the probability that the value will be generated at that x value.  This helps focus the difference measure more equitably.

◦  The vertical distances are integrated over ALL values of x rather than just looking at the maximum. This makes maximum use of the observed data

Root Mean Square Error

  RMS error is available as a test statistic in BESTFIT for expert data that is sampled using percentiles.

Measures the area between the distribution fit and the data.

◦ The smaller the better. Does not provide fine distinction.

Back to Example - Tests

  All pass KS test except exponential and rayleigh at 5% significance level Same holds at 2% level Uncertainty Analysis for Engineers 32

Script

vel=[78.2 75.8 81.8 85.2 75.9 78.2 72.3 69.3 76.1 74.8 ...

78.4 76.4 72.9 76.0 79.3 77.4 77.1 80.8 70.6 73.5]; sortv=sort(vel); subplot(2,3,1),probplot('ev',vel) subplot(2,3,2),probplot('norm',vel) subplot(2,3,3),probplot('lognorm',vel) subplot(2,3,4),probplot('weibull',vel) subplot(2,3,5),probplot('exponential',vel) subplot(2,3,6),probplot('rayleigh',vel) x=68:0.1:86; alpha=0.02; zev=evfit(vel); sev=evcdf(x,zev(1),zev(2)); kev=kstest(sortv,[x' sev'],alpha) [an bn]=normfit(vel); sn=normcdf(x,an,bn); kn=kstest(sortv,[x' sn'],alpha) Rayleigh and Exponential fail test (kx=kr=1) zx=expfit(vel); sx=expcdf(x,zx); kx=kstest(sortv,[x' sx'],alpha) zl=lognfit(vel); sl=logncdf(x,zl(1),zl(2)); kl=kstest(sortv,[x' sl'],alpha) zw=wblfit(vel); sw=wblcdf(x,zw(1),zw(2)); kw=kstest(sortv,[x' sw'],alpha) zr=raylfit(vel); sr=raylcdf(x,zr); kr=kstest(sortv,[x' sr'],alpha) figure,plot(x,sev,'o',x,sn,'r',x,sx,'b',x,sl,'+',x,s w,'m',x,sr,'c',...

'LineWidth',2) legend('EV','Norm','Exp','LogN','Weibull','Ra yleigh') hold on cdfplot(vel) Uncertainty Analysis for Engineers 33

A Second Example

        A new controller was installed on 96 diesel locomotives The mileage at failure for each was recorded 37 failed at less than 135,000 miles All we know about the others is that each lasted beyond 135k miles Thus, we have to “censor” the data We assume we have 96 data points, but only plot 37 This is important when we compute CDF Goal is 80,000 mile warranty Uncertainty Analysis for Engineers 34

Censoring in Matlab

  Tell Matlab which data points are censored and how many of each there are We’ll use a lognormal plot in the example because failure mechanism indicates this is appropriate Uncertainty Analysis for Engineers 35

Uncensored “Normal” probability plot

Probability plot for Normal distribution 0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

20 40 60 80 Data 100 120 140 Uncertainty Analysis for Engineers 36

Un-Censored “Lognormal” Plot

Probability plot for Lognormal distribution 0.99

0.95

0.9

0.75

0.5

0.25

0.1

0.05

0.01

10 1 10 2 Data Uncertainty Analysis for Engineers 10 3 37

Censored “Lognormal” Plot

Probability plot for Lognormal distribution 0.5

0.25

0.1

0.05

0.01

0.005

10 1 10 2 Data Uncertainty Analysis for Engineers 10 3 38

Conclusions

  The lognormal distribution looks like a good fit Probability of failure is approximately 15% Uncertainty Analysis for Engineers 39

Script

d=[22.2 37.5 46.0 48.5 51.5 ...

53 54.5 57.5 66.5 68 ...

69.5 76.5 77 78.5 80 ...

81.5 82 83 84 91.5 93.5 ...

102.5 107 108.5 112.5 113.7 ...

116 117 118.5 119 120 ...

122.5 123 127.5 131.0 132.5 134] data=[d'; 135]; bools=zeros(size(data)); bools(end)=1; freq=bools*59+1; probplot('logn',data) figure, probplot('logn',data,bools,freq)

Uncertainty Analysis for Engineers 40