DATA ANALYSIS - DCU School of Computing


DATA ANALYSIS
Module Code: CA660
Lecture Block 3
Standard Statistical Distributions
Importance
• Modelling practical applications
• Mathematical properties are known
• Described by a few parameters, which have natural interpretations.
Bernoulli Distribution.
This is used to model a trial/experiment which gives rise to two outcomes:
success/failure: male/female, 0/1, …
Let p be the probability that the outcome is one and q = 1 − p the probability
that the outcome is zero.
x          0       1
P{X = x}   1 − p   p
E[X] = p·(1) + (1 − p)·(0) = p
VAR[X] = p·(1)² + (1 − p)·(0)² − E[X]² = p(1 − p).
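As a quick numerical check of these moments, a minimal sketch using scipy.stats.bernoulli (the value p = 0.3 is an arbitrary illustration, not from the notes):

```python
from scipy.stats import bernoulli

p = 0.3                      # arbitrary illustrative success probability
X = bernoulli(p)

print(X.pmf(1), X.pmf(0))    # P{X = 1} = p, P{X = 0} = 1 - p
print(X.mean(), X.var())     # E[X] = p = 0.3, VAR[X] = p(1 - p) = 0.21
```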
Standard distributions - Binomial
Binomial Distribution.
Suppose that we are interested in the number of successes X
in n independent repetitions of a Bernoulli trial, where the
probability of success in an individual trial is p. Then
Prob{X = k} = nCk p^k (1 − p)^(n−k),   (k = 0, 1, …, n)
E[X] = n p
VAR[X] = n p (1 - p)
This is the appropriate distribution to model
e.g. Preferences expressed between two brands
e.g. Number of recombinant gametes produced
by a heterozygous parent for a 2-locus model.
Extension for ≥ 3 loci (brands) is the multinomial.
[Figure: Binomial p.m.f. for n = 4, p = 0.2, with the mean np marked.]
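A minimal sketch of the binomial p.m.f. and moments, reusing the figure's values n = 4, p = 0.2:

```python
from scipy.stats import binom

n, p = 4, 0.2                 # the values shown in the figure above
X = binom(n, p)

for k in range(n + 1):        # Prob{X = k} = nCk p^k (1-p)^(n-k)
    print(k, X.pmf(k))

print(X.mean(), X.var())      # E[X] = np = 0.8, VAR[X] = np(1-p) = 0.64
```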
Standard distributions - Poisson
Poisson Distribution.
The Poisson distribution arises as a limiting case of the binomial distribution,
where n → ∞ and p → 0 in such a way that np → λ (constant).
P{X = k} = exp(−λ) λ^k / k!   (k = 0, 1, 2, …)
E[X] = λ
VAR[X] = λ.
Poisson is used to model the no. of occurrences of a certain phenomenon in a
fixed period of time or space, e.g.
• particles emitted by a radioactive source in a fixed direction in an interval ΔT
• people arriving in a queue in a fixed interval of time
• genomic mapping functions, e.g. cross-over as a random event
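The limiting relationship is easy to check numerically; a sketch holding np = λ fixed while n grows (λ = 2 and k = 3 are arbitrary choices):

```python
from scipy.stats import binom, poisson

lam, k = 2.0, 3                         # arbitrary: np = lambda held constant
print(poisson(lam).pmf(k))              # exp(-lam) lam^k / k! ~ 0.1804
for n in (10, 100, 1000):
    print(n, binom(n, lam / n).pmf(k))  # approaches the Poisson value
```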
Other Standard examples: e.g.
Hypergeometric, Exponential….
• Hypergeometric. Consider a population of M items, of which W are
deemed to be successes.
• Let X be the number of successes that occur in a sample of size n,
drawn without replacement from the finite population, then
Prob{X = k} = (WCk)(M−WCn−k) / (MCn)   (k = 0, 1, 2, …)
• Then E[X] = nW/M
VAR[X] = nW(M − W)(M − n) / {M²(M − 1)}
• Exponential: special case of the Gamma distribution with n = 1, used
e.g. to model inter-arrival time of customers or time to arrival of first
customer in a simple queue, e.g. fragment lengths in genome
mapping etc.
• The p.d.f. is
f(x) = λ exp(−λx),   x ≥ 0, λ > 0
     = 0             otherwise
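Both distributions are in scipy.stats; a minimal sketch with arbitrary illustrative parameters (M = 40, W = 8, n = 6; λ = 0.5). Note that scipy's hypergeom takes (population size, successes in population, sample size), and expon is parameterised by scale = 1/λ:

```python
from scipy.stats import hypergeom, expon

M, W, n = 40, 8, 6             # population, successes in it, sample size
X = hypergeom(M, W, n)
print(X.pmf(1))                # (WC1)(M-WCn-1) / (MCn)
print(X.mean(), n * W / M)     # both give E[X] = nW/M = 1.2
print(X.var())                 # nW(M-W)(M-n)/{M^2 (M-1)} ~ 0.837

lam = 0.5                      # arbitrary rate
T = expon(scale=1 / lam)       # f(x) = lam exp(-lam x), x >= 0
print(T.pdf(0.0), T.mean())    # pdf at 0 is lam; mean is 1/lam
```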
Standard p.d.f.’s - Gaussian/ Normal
• A random variable X has a normal distribution with mean m and
standard deviation s if it has density
f(x) = (1 / (s√(2π))) exp{ −½ ((x − m)/s)² },   −∞ < x < ∞
     = 0 otherwise
with E(X) = m and V(X) = s²
• Arises naturally as the limiting distribution of the average of a set of
independent, identically distributed random variables with finite
variances.
• Plays a central role in sampling theory and is a good approximation to a
large class of empirical distributions. The default assumption in many
empirical studies is that each observation is approx. ~ N(m, s²).
• Note: Statistical tables of the Normal distribution are of great
importance in analysing practical data sets. X is said to be a
Standardised Normal variable if m = 0 and s = 1.
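A minimal sketch of the density and of standardising, with arbitrary m = 10, s = 2:

```python
import numpy as np
from scipy.stats import norm

m, s = 10.0, 2.0                  # arbitrary mean and standard deviation
X = norm(loc=m, scale=s)

x = 12.0
# f(x) = (1/(s sqrt(2 pi))) exp{-0.5 ((x - m)/s)^2}
by_hand = np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
print(by_hand, X.pdf(x))          # identical

# standardising: P{X <= x} = Phi((x - m)/s), Phi the N(0,1) c.d.f.
print(X.cdf(x), norm.cdf((x - m) / s))
```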
Standard p.d.f.’s :
Student’s t-distribution
• A random variable X has a t-distribution with ν d.o.f. (t_ν) if it has density
f(t) = [ Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) ] (1 + t²/ν)^(−(ν+1)/2),   −∞ < t < ∞
     = 0 otherwise.
Symmetrical about the origin, with E[X] = 0 and V[X] = ν/(ν − 2) for ν > 2.
• For small ν, the t_ν distribution is very flat.
• For ν ≳ 25, the t_ν distribution ≈ the Standard Normal curve.
• Suppose Z is a standard Normal variable, W has a χ²_ν distribution, and Z
and W are independent; then the r.v. X = Z / √(W/ν) has a t_ν distribution.
• If x₁, x₂, …, xₙ is a random sample from N(m, σ²), and if we define
s² = Σ(xᵢ − x̄)² / (n − 1)
then
(x̄ − m) / (s/√n) ~ t_{n−1}
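A sketch checking the two facts above: the t_ν density approaches N(0, 1) as ν grows, and (x̄ − m)/(s/√n) behaves as t_{n−1} in simulation (m = 5, σ = 2, n = 10 and the seed are arbitrary):

```python
import numpy as np
from scipy.stats import t, norm

for nu in (2, 10, 25):                 # t_nu density at 0 approaches N(0,1)
    print(nu, t(nu).pdf(0.0), norm.pdf(0.0))

rng = np.random.default_rng(0)         # arbitrary seed
m, sigma, n = 5.0, 2.0, 10
x = rng.normal(m, sigma, size=(20000, n))
stat = (x.mean(axis=1) - m) / (x.std(axis=1, ddof=1) / np.sqrt(n))
print(np.var(stat), (n - 1) / (n - 3)) # ~ V[t_{n-1}] = (n-1)/((n-1)-2)
```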
Chi-Square Distribution
• A r.v. X has a Chi-square distribution with n degrees of freedom (n a positive
integer) if it is a Gamma distribution with shape n/2 and rate λ = ½, so its p.d.f. is
f(x) = x^(n/2 − 1) exp(−x/2) / (2^(n/2) Γ(n/2)),   x ≥ 0
     = 0 otherwise
[Figure: χ²_ν density.]
E[X] = n ;  Var[X] = 2n
• Two important applications:
- If X₁, X₂, …, Xₙ is a sequence of independently distributed Standardised
Normal Random Variables, then the sum of squares
X₁² + X₂² + … + Xₙ² has a χ² distribution (n degrees of freedom).
- If x₁, x₂, …, xₙ is a random sample from N(m, σ²), then
x̄ = (1/n) Σᵢ xᵢ   and   s² = Σᵢ (xᵢ − x̄)² / (n − 1)
and Σᵢ (xᵢ − x̄)² / σ² = (n − 1)s²/σ² has a χ² distribution, n − 1 d.o.f., with the
r.v.'s x̄ and s² independent.
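A sketch verifying the first application by simulation (n = 5 and the seed are arbitrary):

```python
import numpy as np
from scipy.stats import chi2

n = 5                               # arbitrary degrees of freedom
rng = np.random.default_rng(1)
Z = rng.standard_normal((100000, n))
Q = (Z ** 2).sum(axis=1)            # X1^2 + X2^2 + ... + Xn^2

print(Q.mean(), chi2(n).mean())     # both near E[X] = n
print(Q.var(), chi2(n).var())       # both near Var[X] = 2n
```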
F-Distribution
• A r.v. X has an F distribution with m and n d.o.f. if it has a density
function = ratio of gamma functions for x>0 and = 0 otherwise.
• E[X] = n/(n − 2) for n > 2
Var[X] = 2n²(m + n − 2) / {m(n − 4)(n − 2)²} for n > 4
• For X and Y independent r.v.'s, X ~ χ²_m and Y ~ χ²_n, then
F_{m,n} = (X/m) / (Y/n)
• One consequence: if x₁, x₂, …, x_m (m ≥ 2) is a random sample from
N(m₁, σ₁²), and y₁, y₂, …, yₙ (n ≥ 2) is a random sample from N(m₂, σ₂²),
then (for σ₁² = σ₂²)
[ Σ(xᵢ − x̄)² / (m − 1) ] / [ Σ(yᵢ − ȳ)² / (n − 1) ] ~ F_{m−1, n−1}
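A sketch of F_{m,n} as the ratio of scaled independent chi-squares (m = 5, n = 10 and the seed are arbitrary):

```python
import numpy as np
from scipy.stats import chi2, f

m, n = 5, 10                        # arbitrary degrees of freedom
rng = np.random.default_rng(2)
X = chi2(m).rvs(100000, random_state=rng)
Y = chi2(n).rvs(100000, random_state=rng)
Fsim = (X / m) / (Y / n)            # F_{m,n} = (X/m)/(Y/n)

print(Fsim.mean(), n / (n - 2))     # E[X] = n/(n-2) for n > 2
print(Fsim.var(), f(m, n).var())    # matches 2n^2(m+n-2)/{m(n-4)(n-2)^2}
```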
Sampling and Sampling Distributions –
Extended Examples: refer to primer
Central Limit Theorem
If X₁, X₂, …, Xₙ are a random sample of a r.v. X (mean m, variance σ²),
then, in the limit as n → ∞, the sampling distribution of means has a
Standard Normal distribution, N(0,1), where each observation is standardised as
xᵢ' = (xᵢ − m)/σ,   i = 1, 2, …, n
Probabilities for sampling distribution – limits
• for large n
P{ a ≤ (x̄ − m_x̄)/σ_x̄ ≤ b } ≈ P{ a ≤ U ≤ b }
U (or Z) = the standardised Normal deviate
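A minimal simulation of the theorem with a deliberately non-Normal parent, the exponential with mean and s.d. 1 (sample size and seed are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)            # arbitrary seed
m, sigma, n = 1.0, 1.0, 100               # exponential(1) parent: mean 1, sd 1
means = rng.exponential(m, size=(50000, n)).mean(axis=1)
U = (means - m) / (sigma / np.sqrt(n))    # standardised sample means

a, b = -1.0, 1.0
print(np.mean((U >= a) & (U <= b)))       # simulated P{a <= U <= b}
print(norm.cdf(b) - norm.cdf(a))          # N(0,1) value ~ 0.6827
```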
Large Sample theory
• In particular
P{ |x̄ − m| ≤ r } = P{ −r ≤ x̄ − m ≤ r }
= P{ −r/σ_x̄ ≤ (x̄ − m)/σ_x̄ ≤ r/σ_x̄ }
≈ Φ( r/(σ/√n) ) − Φ( −r/(σ/√n) )
• Φ is the C.D.F. (or D.F.)
• In general, the closer the random variable X's behaviour is to the Normal,
the faster the approximation approaches U. Generally, n ≈ 25 suffices:
"Large sample" theory
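The probability above is direct to evaluate; a sketch with arbitrary r = 0.5, σ = 2, n = 25:

```python
import numpy as np
from scipy.stats import norm

r, sigma, n = 0.5, 2.0, 25          # arbitrary illustrative values
z = r / (sigma / np.sqrt(n))
# P{|xbar - m| <= r} ~ Phi(z) - Phi(-z)
print(norm.cdf(z) - norm.cdf(-z))   # = 2 Phi(1.25) - 1 ~ 0.789
```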
Attribute and Proportionate Sampling
Recall from the primer that the sample proportion p̂ and the sample mean x̄ are synonymous here.
Probability Statements
If X and Y are independent Binomially distributed r.v.'s with parameters (n, p)
and (m, p) respectively, then X + Y ~ B(n + m, p)
• So, Y = X₁ + X₂ + … + Xₙ ~ B(n, p) for IID Xᵢ ~ B(1, p)
• Since we know μ_Y = np, σ_Y = √(npq) and, clearly, Y = nx̄, then
(x̄ − μ_x̄)/σ_x̄ = (Y/n − μ_Y/n) / (σ_Y/n) = (Y − np)/√(npq) → N(0, 1) as n → ∞
• and, further,
U = (p̂ − p) / √(pq/n) ~ N(0, 1)
is the sampling distribution of a proportion
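A sketch checking the sampling distribution of a proportion by simulation (p = 0.3, n = 200 and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)       # arbitrary seed
p, n = 0.3, 200                      # arbitrary parent proportion, sample size
phat = rng.binomial(n, p, size=50000) / n
U = (phat - p) / np.sqrt(p * (1 - p) / n)

print(U.mean(), U.std())             # near 0 and 1, i.e. ~ N(0, 1)
```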
Differences in Proportions
• Can use χ²: contingency-table type set-up
• Can set up as parallel to a difference estimate or a test of 2 means
(independent), so for a 100(1 − α)% C.I.:
(p̂₁ − p̂₂) ± U_{α/2} √( p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂ )    (2-sided; S.E. form for n₁, n₂ large)
• Under H₀: P₁ − P₂ = 0, can write the S.E. as
√( p̂q̂ (1/n₁ + 1/n₂) )
using the pooled estimate
p̂ = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂) = (X + Y)/(n₁ + n₂),   X & Y = no. of successes
(for small samples, use n − 1 in the pooled estimate).
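A minimal sketch of both the C.I. and the pooled-S.E. statistic; the helper name two_prop and its interface are mine, not from the notes:

```python
import numpy as np
from scipy.stats import norm

def two_prop(x, n1, y, n2, alpha=0.05):
    """C.I. for p1 - p2, and U under H0: P1 - P2 = 0 (n1, n2 large)."""
    p1, p2 = x / n1, y / n2
    # C.I. half-width uses the unpooled S.E.
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    u = norm.ppf(1 - alpha / 2)
    ci = (p1 - p2 - u * se, p1 - p2 + u * se)
    # the test uses the pooled estimate phat = (X + Y)/(n1 + n2)
    pp = (x + y) / (n1 + n2)
    U = (p1 - p2) / np.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
    return ci, U
```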
C.L.T. and Approximations summary
• General form of theorem: for an infinite sequence of independent r.v.'s,
with means and variances as before, the approximation → U for n large
enough.
Note: No condition on the form of the distribution of the X's (the raw data)
• Strictly, for approximations of discrete distributions, one can improve by
considering a correction for continuity, e.g.
(X − λ ± 0.5)/√λ → U   for X Poisson, parameter λ
((x ± 0.5)/n − p) / √(pq/n) → U,   x = no. in sample, so observed/sample proportion = p̂ = x/n
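A quick check of the Poisson continuity correction (λ = 10 and k = 12 are arbitrary):

```python
import numpy as np
from scipy.stats import norm, poisson

lam, k = 10.0, 12                                # arbitrary values
print(poisson(lam).cdf(k))                       # exact P{X <= k} ~ 0.792
print(norm.cdf((k - lam) / np.sqrt(lam)))        # uncorrected ~ 0.736
print(norm.cdf((k + 0.5 - lam) / np.sqrt(lam)))  # corrected   ~ 0.785
```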
Generalising Sampling Distn. Concept
-see primer
• For the sampling distribution of any statistic, a sample characteristic is an
unbiased estimator of the parent population characteristic if the mean of the
corresponding sampling distribution is equal to the parent characteristic:
E{x̄} = m. Also, the sample proportion is an unbiased estimator of the
parent proportion: E{p̂} = P.
• Sampling without replacement from a finite population gives the
Hypergeometric distribution.
finite population correction (fpc) = √[(N − n)/(N − 1)],
where N, n are the parent population and sample sizes respectively.
• Above applies to variance also.
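A small simulation of the fpc: sampling without replacement shrinks the variance of x̄ by fpc² (N = 100, n = 20, the seed and the parent population are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)          # arbitrary seed
N, n = 100, 20                          # parent population and sample sizes
pop = rng.normal(0, 1, N)               # an arbitrary finite population

means = [rng.choice(pop, n, replace=False).mean() for _ in range(20000)]
fpc2 = (N - n) / (N - 1)                # fpc = sqrt[(N - n)/(N - 1)], squared
print(np.var(means))                    # ~ (sigma^2 / n) * fpc^2
print(pop.var() / n * fpc2)
```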
Examples
A large-scale 1980 survey in a country showed 30% of the adult population with a
given classification. If this is still the current rate, what is the probability that, in a
random sample of 1000, the number with this classification will be
(a) < 280, (b) 316 or more?
Soln. Let X = no. of successes (with the trait) in the sample. So, for an expected
proportion of 0.3 in the population, we suppose X ~ B(1000, 0.3).
Since np = 300 and √(npq) = √210 = 14.49, the distn. of X is approx. N(300, 14.49²)
(a) P{X < 280} = P{X ≤ 279} ≈ P{ U ≤ (279.5 − 300)/14.49 } = P{U ≤ −1.415} = 0.0786
(b) P{X ≥ 316} ≈ P{ U ≥ (315.5 − 300)/14.49 } = P{U ≥ 1.07} = 1 − 0.8577 = 0.1423
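A sketch reproducing these figures, with the exact binomial values for comparison:

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.3
mu, sd = n * p, np.sqrt(n * p * (1 - p))          # 300 and 14.49

print(norm.cdf((279.5 - mu) / sd))                # (a) ~ 0.0786
print(norm.sf((315.5 - mu) / sd))                 # (b) ~ 0.1423
print(binom(n, p).cdf(279), binom(n, p).sf(315))  # exact values
```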
Examples
Auditors are checking whether a certain firm is overstating the value of its inventory
items. They decide to randomly select 15 items. For each, they determine the recorded
amount (R) and the audited (exact) amount (A), and hence the difference between the
two, X = R − A, the variable of interest. Of particular interest is whether the average
difference > 250 Euro.
170 350 310 220 500 420 560 230 270 380 200 250 430 450 210
So n = 15, x̄ = €330 and s = €121.5
H₀: m ≤ €250
H₁: m > €250
Decision Rule: Reject H₀ if t = (x̄ − 250)/(s/√n) > t_{0.05,14} = 1.761, where the d.o.f. = n − 1 = 14
Value from data:
t = (330 − 250) / (121.5/√15) = 2.55
Since 2.55 > 1.761, reject H₀. Also, the p-value is the area to the right of 2.55. It
is between 0.01 and 0.025 (so less than α = 0.05), so again - reject H₀.
The data indicate that the firm is overstating the value of its inventory items by
more than €250 on average
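A sketch reproducing the test directly from the data; the manual arithmetic mirrors the notes, with scipy's t supplying the critical value and p-value:

```python
import numpy as np
from scipy.stats import t

x = np.array([170, 350, 310, 220, 500, 420, 560, 230,
              270, 380, 200, 250, 430, 450, 210])
n, xbar, s = len(x), x.mean(), x.std(ddof=1)  # 15, 330.0, ~121.5

t_stat = (xbar - 250) / (s / np.sqrt(n))      # ~ 2.55
print(t_stat, t.ppf(0.95, n - 1))             # 2.55 > t_{0.05,14} = 1.761
print(t.sf(t_stat, n - 1))                    # one-sided p-value ~ 0.011
```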
Examples contd.
Blood pressure readings before and after 6 months on medication were taken in
women students (aged 25-35); sample of 15. Calculate (a) a 95% C.I. for the
mean change in B.P.; (b) test at the 1% level of significance (α = 0.01) that the
medication reduces B.P.
Data:
Subject   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
1st (x)  70 80 72 76 76 76 72 78 82 64 74 92 74 68 84
2nd (y)  68 72 62 70 58 66 68 52 64 72 74 60 74 72 74
d = x−y   2  8 10  6 18 10  4 26 18 −8  0 32  0 −4 10
(a) So for 95% C. limits: d̄ ± t_{0.025} (s/√15), where
d̄ = Σdᵢ/15 = 8.80   and   s = √( Σ(dᵢ − d̄)²/14 ) = 10.98
Contd.
The value for t_{0.025} is based on d.o.f. = 14. From the t-table, t_{0.025} = 2.145.
So, the 95% C.I. is:
P{ 8.80 − 2.145 (10.98/√15) ≤ μ_D ≤ 8.80 + 2.145 (10.98/√15) } = 0.95
i.e. the limits are 8.80 ± 6.08 or (2.72, 14.88), so we are 95% confident that there is a
mean difference (reduction) in B.P. of between 2.72 and 14.88
(b) The claim is that μ_D > 0, so we look at H₀: μ_D = 0 vs H₁: μ_D > 0.
So the t-statistic is as before, but with a right-tailed (one-sided only) Rejection Region.
For d.o.f. = 14, t_{0.01} = 2.624. So the calculated value from our data is
t = d̄ / (s/√n) = 8.80 / (10.98/√15) = 3.10
This is clearly in the Rejection region, so H₀ is
rejected in favour of H₁ at α = 0.01.
A reduction in B.P. after medication is
strongly supported by the data.
[Figure: t₁₄ density, with the 1% right-tail Rejection region beyond t_{0.01} = 2.624.]
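A sketch reproducing both parts from the differences d:

```python
import numpy as np
from scipy.stats import t

d = np.array([2, 8, 10, 6, 18, 10, 4, 26, 18, -8, 0, 32, 0, -4, 10])
n, dbar, s = len(d), d.mean(), d.std(ddof=1)  # 15, 8.80, ~10.98

half = t.ppf(0.975, n - 1) * s / np.sqrt(n)   # 2.145 x 10.98/sqrt(15) ~ 6.08
print(dbar - half, dbar + half)               # (a) C.I. ~ (2.72, 14.88)

t_stat = dbar / (s / np.sqrt(n))              # (b) ~ 3.10
print(t_stat, t.ppf(0.99, n - 1))             # 3.10 > t_{0.01,14} = 2.624
```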
Examples
Rates of preference were recorded for product P1 among a given age group of
children. Of 113 boys tested, 34 indicate a positive preference, while of 139 girls
tested, 54 are positive. Is the evidence strong for a higher preference rate in
girls?
H₀: p₁ = p₂ vs H₁: p₁ < p₂ (where p₁, p₂ are the proportions of boys, girls with
+ve preference respectively).
Soln.
p̂₁ = 34/113 = 0.301,   p̂₂ = 54/139 = 0.388
Pooled: p̂ = (34 + 54)/(113 + 139) = 0.349
U = (0.301 − 0.388) / √( 0.349 × 0.651 × (1/113 + 1/139) ) = −1.44
Cannot reject H₀ at the 5% level.
Actual p-value = P{U ≤ −1.44} = 0.0749
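A sketch reproducing the calculation (the same pooled-S.E. statistic sketched in the Differences in Proportions section):

```python
import numpy as np
from scipy.stats import norm

x, n1, y, n2 = 34, 113, 54, 139
p1, p2 = x / n1, y / n2                 # 0.301 and 0.388
pp = (x + y) / (n1 + n2)                # pooled phat = 0.349

U = (p1 - p2) / np.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
print(U, norm.cdf(U))                   # ~ -1.44, p-value ~ 0.075
```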
Developed Examples using Standard
Distributions/sampling distributions
Lot Acceptance Sampling in SPC. Binomial frequently used. Suppose shipment
of 500 calculator chips arrives at electronics firm; acceptable if a sample of size
10 has no more than one defective chip.
What is the probability of accepting lot if, in fact, (i) 10% (50 chips) are
defective (ii) 20% (100) are defective?
n = 10 trials, each with 2 outcomes: Success = defective; Failure = not defective
p = P{Success} = 0.10 (assumed constant for simplicity)
X= no. successes out of n trials = No. defective out of 10 sampled
i.e. Electronics Firm will accept shipment if X = 0 or 1
(i) P{accept} = P{0 or 1} = P{0} + P{1} = P{X ≤ 1} (cumulative)
From tables: n = 10, p = 0.10: P{0} = 0.349, P{1} = 0.387
So, P{accept} = 0.736, i.e. a 73.6% chance
(ii) For p = 0.20: P{0} = 0.107, P{1} = 0.268, so P{accept} = 0.375, or a 37.5% chance
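A sketch of both acceptance probabilities:

```python
from scipy.stats import binom

for p in (0.10, 0.20):
    # P{accept} = P{X <= 1} = P{0} + P{1}
    print(p, binom(10, p).cdf(1))   # ~ 0.736 and ~ 0.376
```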
Example contd.
Suppose have a shipment of 50 chips, similar set up to before – check for lot
acceptance, still selecting sample of size 10 and assuming 10% defective.
Success and Failure as before
Now, though, p = P{Success on 1st trial} = 5/50 = 0.1, but conditionally:
P{Success on 2nd trial} = 5/49 = 0.102 if the 1st pick is a failure (not defective), OR
P{Success on 2nd trial} = 4/49 = 0.082 if the 1st is defective (a success). Hypergeometric.
Think of two sets in shipment – one having 5 S’s, the other 45 F’s
Taking 10 chips randomly from the two sections
If x are selected from S set, then 10-x must be selected from F set, i.e. N = 50,
k = 5, n = 10
So P{1 S and 9 Fs} = P{1} = (5C1)(45C9) / (50C10) = 0.431
and P{0}, from the similar expression, = 0.31
c.f. Binomial
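A sketch with scipy's hypergeom (argument order: population size, successes in the population, sample size):

```python
from scipy.stats import hypergeom

X = hypergeom(50, 5, 10)    # 50 chips, 5 defective, sample of 10
print(X.pmf(1))             # (5C1)(45C9)/(50C10) ~ 0.431
print(X.pmf(0))             # ~ 0.311
print(X.cdf(1))             # P{accept} = P{0} + P{1} ~ 0.742
```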
Example contd.
Approximations: Poisson to Binomial
Suppose shipment = 2500 chips and want to test 100. Accept lot if sample
contains no more than one defective.
Assuming 5% defective. What is probability of accepting lot?
Note: n = 100, N = 2500; sampling ratio n/N = 0.04, i.e. < 5%, so can avoid the work of
the hypergeometric, as it is approximately Binomial, with n = 100, p = 0.05
So Binomial random variable X here = no. defective chips out of 100
P{accept lot} = P{X ≤ 1} = P{0} + P{1}
P{accept} = (100C0)(0.05)^0 (0.95)^100 + (100C1)(0.05)^1 (0.95)^99 = 0.037
A lot of work, and not tabulated.
Alternative: the Poisson approx. to the Binomial, which works well where n > 20,
np ≤ 7. So take the probability from the Poisson table, where
m = np = (100)(0.05) = 5
P{0} = 0.0067
P{1} = 0.0337
P{X ≤ 1} = 0.0067 + 0.0337 = 0.0404
close to the result for the Binomial
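A sketch comparing the exact binomial with its Poisson approximation:

```python
from scipy.stats import binom, poisson

n, p = 100, 0.05
print(binom(n, p).cdf(1))      # exact P{X <= 1} ~ 0.037
print(poisson(n * p).cdf(1))   # Poisson, m = np = 5: ~ 0.0404
```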
Example contd.
Approximations: Normal to discrete distribution
Suppose we still want to sample 100 chips, but 10% of chips are expected to be
defective.
The rule for the Poisson approximation of the Binomial is that n is large, p small,
with np < 7. Now p = 0.10, so np = 10, and Poisson is not a good approximation.
However, n is large and np = 10, n(1 − p) = 90, both > 5, so we can use the Normal
approximation.
Then X is a Binomial r.v. with m = np = (100)(0.1) = 10 and
s = √(np(1 − p)) = √(npq) = √9 = 3
So we have
P_Binomial{X ≤ 1} ≈ P_Normal{X ≤ 1.5}
P_Normal{X ≤ 1.5} = P{ U ≤ (1.5 − 10)/3 } = P{U ≤ −2.83} = 0.0023
Very small chance of accepting lot with this many defectives.
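A sketch of the Normal approximation with the continuity correction, against the exact binomial value:

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.10
mu, sd = n * p, np.sqrt(n * p * (1 - p))   # 10 and 3

print(norm.cdf((1.5 - mu) / sd))  # P_Normal{X <= 1.5} ~ 0.0023
print(binom(n, p).cdf(1))         # exact P{X <= 1} ~ 0.0003 (tail approx. is rough)
```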