DATA ANALYSIS - DCU School of Computing

DATA ANALYSIS
Module Code: CA660
Lecture Block 3
MEASURING PROBABILITIES – RANDOM VARIABLES & DISTRIBUTIONS

(Primer) If a statistical experiment only gives rise to real numbers, the outcome of the experiment is called a random variable. If a random variable X takes values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, then the expected or average value of X is defined as

$$E[X] = \sum_{j=1}^{n} p_j x_j$$

and its variance is

$$\mathrm{VAR}[X] = E[X^2] - E[X]^2 = \sum_{j=1}^{n} p_j x_j^2 - E[X]^2$$
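These definitions map directly onto code. A minimal sketch in Python (the values x and probabilities p are illustrative, not from the slides):

```python
import numpy as np

# Illustrative discrete random variable: values and their probabilities
x = np.array([1, 2, 3])
p = np.array([0.2, 0.5, 0.3])    # assumed probabilities; must sum to 1

E_X = np.sum(p * x)              # E[X] = sum_j p_j x_j
E_X2 = np.sum(p * x**2)          # E[X^2]
Var_X = E_X2 - E_X**2            # VAR[X] = E[X^2] - E[X]^2
print(E_X, Var_X)                # 2.1 0.49
```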
Random Variable PROPERTIES
• Sums and Differences of Random Variables
Define the covariance of two random variables to be

$$\mathrm{COVAR}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$

If X and Y are independent, COVAR[X, Y] = 0. Lemmas:

$$E[X \pm Y] = E[X] \pm E[Y]$$

$$\mathrm{VAR}[X \pm Y] = \mathrm{VAR}[X] + \mathrm{VAR}[Y] \pm 2\,\mathrm{COVAR}[X, Y]$$

and $E[kX] = k\,E[X]$, $\mathrm{VAR}[kX] = k^2\,\mathrm{VAR}[X]$ for a constant k.
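The lemmas are easy to check by simulation. A sketch with deliberately correlated variables (the construction of y from x is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)    # correlated with x by construction

cov = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
lhs = np.var(x + y)                              # VAR[X + Y]
rhs = np.var(x) + np.var(y) + 2 * cov            # lemma's right-hand side
print(lhs, rhs)                                  # agree up to sampling error
```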
Example: R.V. characteristic properties

Joint frequency table of observed counts for two random variables B and R:

            B = 1   B = 2   B = 3   Totals
    R = 1     8      10       9       27
    R = 2     5       7       4       16
    R = 3     6       6       7       19
    Totals   19      23      20       62

$$E[B] = \{1(19) + 2(23) + 3(20)\}/62 = 2.02$$
$$E[B^2] = \{1^2(19) + 2^2(23) + 3^2(20)\}/62 = 4.69$$
VAR[B] = ?

$$E[R] = \{1(27) + 2(16) + 3(19)\}/62 = 1.87$$
$$E[R^2] = \{1^2(27) + 2^2(16) + 3^2(19)\}/62 = 4.23$$
VAR[R] = ?
Example Contd.

$$E[B+R] = \{2(8) + 3(10) + 4(9) + 3(5) + 4(7) + 5(4) + 4(6) + 5(6) + 6(7)\}/62 = 3.89$$
$$E[(B+R)^2] = \{2^2(8) + 3^2(10) + 4^2(9) + 3^2(5) + 4^2(7) + 5^2(4) + 4^2(6) + 5^2(6) + 6^2(7)\}/62 = 16.47$$
VAR[B+R] = ? *

$$E[BR] = \{1(8) + 2(10) + 3(9) + 2(5) + 4(7) + 6(4) + 3(6) + 6(6) + 9(7)\}/62 = 3.77$$
COVAR(B, R) = ?

Alternative calculation to *: VAR[B] + VAR[R] + 2 COVAR[B, R]. Comment?
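A sketch of the full calculation for this example in Python; the last line verifies the alternative calculation marked * (numbers match the slide up to rounding):

```python
import numpy as np

# Joint frequency table: rows R = 1..3, columns B = 1..3
freq = np.array([[8, 10, 9],
                 [5,  7, 4],
                 [6,  6, 7]])
n = freq.sum()                       # 62
b = np.array([1, 2, 3])              # column values (B)
r = np.array([1, 2, 3])              # row values (R)
p = freq / n                         # joint probabilities

pB = p.sum(axis=0)                   # marginal of B: 19/62, 23/62, 20/62
pR = p.sum(axis=1)                   # marginal of R: 27/62, 16/62, 19/62

E_B, E_R = pB @ b, pR @ r            # 2.02, 1.87
Var_B = pB @ b**2 - E_B**2           # 4.69 - 2.02^2 ~ 0.63
Var_R = pR @ r**2 - E_R**2           # 4.23 - 1.87^2 ~ 0.73
E_BR = r @ p @ b                     # E[BR] ~ 3.77
cov = E_BR - E_B * E_R               # COVAR(B, R) ~ 0.002

s = np.add.outer(r, b)               # values of B + R in each cell
Var_sum = (p * s**2).sum() - ((p * s).sum())**2   # direct VAR[B+R]
print(Var_sum, Var_B + Var_R + 2 * cov)  # equal; COVAR ~ 0 here - comment!
```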
DISTRIBUTIONS - e.g. MENDEL’s PEAS
P.D.F./C.D.F.
• If X is a R.V. with a finite countable set of possible outcomes, {x₁, x₂, …}, then the discrete probability distribution of X is

$$f(x) \;\text{or}\; p_X(x_i) = \begin{cases} P\{X = x_i\} & \text{if } x = x_i,\; i = 1, 2, \ldots \\ 0 & \text{if } x \ne x_i \end{cases}$$

and the D.F. or C.D.F. is

$$F(x_i) = P\{X \le x_i\} = \sum_{x_j \le x_i} P\{X = x_j\}$$

• While, similarly, for X a R.V. taking any value along an interval of the real number line,

$$F(x) = P\{X \le x\} = \int_{-\infty}^{x} f(u)\,du$$

So if the first derivative $F'(x)$ exists, then

$$f(x) = \frac{dF(x)}{dx} = F'(x)$$

is the continuous p.d.f., with

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
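The derivative relationship between C.D.F. and p.d.f. can be checked numerically. A sketch using scipy's standard normal (an arbitrary choice of continuous distribution):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

x, h = 0.7, 1e-6
# f(x) = dF(x)/dx: central finite difference of the C.D.F.
deriv = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(deriv, norm.pdf(x))          # agree to ~6 decimal places

# and the density integrates to 1 over the whole line
area, _ = quad(norm.pdf, -np.inf, np.inf)
print(area)                        # 1.0
```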
EXPECTATION/VARIANCE
• Clearly,

$$E(X) = \begin{cases} \displaystyle\sum_{i \in S} x_i f(x_i) & \text{discrete} \\[6pt] \displaystyle\int_{-\infty}^{\infty} x f(x)\,dx & \text{continuous} \end{cases}$$

• and

$$\mathrm{Var}(X) = \begin{cases} \displaystyle\sum_{x_i \in S} [x_i - E(X)]^2 f(x_i) & \text{discrete} \\[6pt] \displaystyle\int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\,dx & \text{continuous} \end{cases}$$
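Both branches of these definitions in a short sketch (the p.m.f. and the exponential density with λ = 2 are illustrative choices):

```python
import numpy as np
from scipy.integrate import quad

# Discrete branch: E(X) and Var(X) from a p.m.f.
x = np.array([0, 1, 2]); f = np.array([0.25, 0.5, 0.25])
E = np.sum(x * f)
Var = np.sum((x - E)**2 * f)
print(E, Var)                                  # 1.0 0.5

# Continuous branch: f(x) = 2 exp(-2x), x >= 0 (lambda = 2, illustrative)
pdf = lambda u: 2 * np.exp(-2 * u)
E_c, _ = quad(lambda u: u * pdf(u), 0, np.inf)
Var_c, _ = quad(lambda u: (u - E_c)**2 * pdf(u), 0, np.inf)
print(E_c, Var_c)                              # 0.5 0.25
```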
Moments and M.G.F.'s
• For a R.V. X, and any non-negative integer k, the kth moment about the origin is defined as the expected value of $X^k$.
• Central Moments (about the Mean): 1st = 0, i.e. $E\{X\} = \mu$; second = variance, $\mathrm{Var}\{X\}$.
• To obtain moments, use the Moment Generating Function.
• If X has a p.d.f. f(x), the m.g.f. is the expected value of $e^{tX}$.

For a continuous variable,

$$\mathrm{mgf}(X) = E\{e^{tX}\} = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

For a discrete variable,

$$\mathrm{mgf}(X) = E\{e^{tX}\} = \sum_{x} e^{tx} f(x)$$

• Generally: the rth moment of the R.V. is the rth derivative of the m.g.f. evaluated at t = 0.
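The m.g.f. mechanism can be sketched symbolically with sympy; the exponential density λe^(−λx) with λ = 3 is an assumed example, not from the slides:

```python
import sympy as sp

t, x = sp.symbols('t x')
lam = sp.Rational(3)                      # illustrative rate, lambda = 3

# mgf(X) = E{e^{tX}} = integral of e^{tx} f(x) dx, f(x) = lam*exp(-lam*x)
mgf = sp.integrate(sp.exp(t * x) * lam * sp.exp(-lam * x),
                   (x, 0, sp.oo), conds='none')   # lam/(lam - t), t < lam

m1 = sp.diff(mgf, t, 1).subs(t, 0)        # 1st moment: E[X] = 1/3
m2 = sp.diff(mgf, t, 2).subs(t, 0)        # 2nd moment: E[X^2] = 2/9
print(m1, m2 - m1**2)                     # mean 1/3, variance 1/9
```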
PROPERTIES - Expectation/Variance etc. Prob. Distributions (p.d.f.s)
• As for R.V.'s generally. For X a discrete R.V. with p.d.f. p{X}, then for any real-valued function g,

$$E\{g(X)\} = \sum_{x} g(x)\,p\{X = x\}$$

• e.g.

$$E\{X + Y\} = E\{X\} + E\{Y\}$$
$$E\{XY\} = E\{X\}\,E\{Y\} \quad \text{(for X, Y independent)}$$

Applies for more than 2 R.V.'s also.
• Variance - again has similar properties to previously:
• e.g.

$$V\{aX + b\} = a^2 V\{X\} = a^2 \left( E\{X^2\} - [E\{X\}]^2 \right)$$
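A small sketch of E{g(X)} = Σ g(x) p{X = x} for a nonlinear g, which also shows that in general E{g(X)} ≠ g(E{X}) (the p.m.f. is illustrative):

```python
import numpy as np

x = np.array([0, 1, 2])
p = np.array([0.25, 0.5, 0.25])     # assumed p.m.f.
g = lambda v: v**2                  # real-valued function of X

E_gX = np.sum(g(x) * p)             # E{g(X)} = sum g(x) p{X = x}
print(E_gX, g(np.sum(x * p)))       # 1.5 vs 1.0: E{g(X)} != g(E{X})
```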
MENDEL's Example
• Let X record the no. of dominant A alleles in a randomly chosen genotype; then X is a R.V. with sample space S = {0, 1, 2}
• Outcomes in S correspond to events

$$X = \begin{cases} 0 & \text{if } aa \\ 1 & \text{if } aA,\ Aa \\ 2 & \text{if } AA \end{cases}$$

• Note: Further, any function of X is also a R.V., e.g.

$$Z = g(X) = \begin{cases} 0 & \text{if } aa\ (X = 0) \\ 1 & \text{if } AA,\ Aa,\ aA\ (X > 0) \end{cases}$$

• where Z is a variable for seed character phenotype.
Example contd.
• So that, for Mendel's data,

$$Z = \begin{cases} 0 & \text{Wrinkled} \\ 1 & \text{Round} \end{cases}$$

• And

$$f(z) = \begin{cases} P\{Z = 0\} = \tfrac{1}{4} \\[2pt] P\{Z = 1\} = \tfrac{3}{4} \end{cases} \qquad \text{with } E(Z) = \tfrac{3}{4}$$

• And

$$\mathrm{Var}(Z) = \sum_{i} [z_i - E(Z)]^2 f(z_i) = \left(0 - \tfrac{3}{4}\right)^2 \tfrac{1}{4} + \left(1 - \tfrac{3}{4}\right)^2 \tfrac{3}{4} = \tfrac{3}{16}$$

• Note: Z = 'dummy' or indicator. Could have chosen e.g. Q as a function of X s.t. Q = 0 round (X > 0), Q = 1 wrinkled (X = 0). Then probabilities for Q are opposite to those for Z, with

$$E(Q) = \tfrac{1}{4} \quad \text{and} \quad \mathrm{Var}(Q) = \sum_{i} [q_i - E(Q)]^2 f(q_i) = \left(0 - \tfrac{1}{4}\right)^2 \tfrac{3}{4} + \left(1 - \tfrac{1}{4}\right)^2 \tfrac{1}{4} = \tfrac{3}{16}$$

so the variance is unchanged by the re-coding.
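These indicator calculations, done exactly with fractions:

```python
from fractions import Fraction as F

# Z: seed phenotype indicator (0 = wrinkled, 1 = round)
z, fz = [0, 1], [F(1, 4), F(3, 4)]
E_Z = sum(zi * fi for zi, fi in zip(z, fz))
Var_Z = sum((zi - E_Z)**2 * fi for zi, fi in zip(z, fz))

# Q: opposite coding (0 = round, 1 = wrinkled)
q, fq = [0, 1], [F(3, 4), F(1, 4)]
E_Q = sum(qi * fi for qi, fi in zip(q, fq))
Var_Q = sum((qi - E_Q)**2 * fi for qi, fi in zip(q, fq))

print(E_Z, Var_Z)   # 3/4 3/16
print(E_Q, Var_Q)   # 1/4 3/16 - variance unchanged by the re-coding
```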
JOINT/MARGINAL DISTRIBUTIONS
• Joint cumulative distribution of X and Y, marginal cumulative for X (without regard to Y), and joint distribution (p.d.f.) of X and Y are, respectively,

$$F(x, y) = P\{X \le x,\ Y \le y\} \qquad (1)$$

$$F_X(x) = P\{X \le x,\ Y \le \infty\} = F(x, \infty) \qquad (2)$$

$$p(x, y) = P\{X = x,\ Y = y\} \qquad (3)$$

with

$$\sum_{i}\sum_{j} p(x_i, y_j) = 1$$

• where similarly for the continuous case, e.g. (2) becomes

$$F_X(x) = \int_{-\infty}^{x}\int_{-\infty}^{\infty} f(u, v)\,dv\,du = \int_{-\infty}^{x} f_X(u)\,du = F(x, \infty) \qquad (2a)$$
Example: Backcross 2-locus model (AaBb × aabb)
Observed and Expected frequencies
Genotypic S.R. 1:1; Expected S.R. crosses 1:1:1:1

    Genotype     Cross 1     Cross 2    Cross 3     Cross 4      Pooled
    AaBb         310 (300)   36 (30)    360 (300)    74 (60)     780 (690)
    Aabb         287 (300)   23 (30)    230 (300)    50 (60)     590 (690)
    aaBb         288 (300)   23 (30)    230 (300)    44 (60)     585 (690)
    aabb         315 (300)   38 (30)    380 (300)    72 (60)     805 (690)
    Marginal A
      Aa         597 (600)   59 (60)    590 (600)   124 (120)   1370 (1380)
      aa         603 (600)   61 (60)    610 (600)   116 (120)   1390 (1380)
    Marginal B
      Bb         598 (600)   59 (60)    590 (600)   118 (120)   1365 (1380)
      bb         602 (600)   61 (60)    610 (600)   122 (120)   1395 (1380)
    Sum          1200        120        1200         240         2760
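The marginal and pooled rows follow mechanically from the four genotype rows; a sketch of that bookkeeping:

```python
import numpy as np

# Observed counts: rows AaBb, Aabb, aaBb, aabb; columns crosses 1-4
obs = np.array([[310,  36, 360,  74],
                [287,  23, 230,  50],
                [288,  23, 230,  44],
                [315,  38, 380,  72]])

pooled = obs.sum(axis=1)          # [780, 590, 585, 805]
marg_Aa = obs[0] + obs[1]         # AaBb + Aabb: [597, 59, 590, 124]
marg_aa = obs[2] + obs[3]         # aaBb + aabb: [603, 61, 610, 116]
marg_Bb = obs[0] + obs[2]         # AaBb + aaBb: [598, 59, 590, 118]
marg_bb = obs[1] + obs[3]         # Aabb + aabb: [602, 61, 610, 122]
print(obs.sum())                  # grand total: 2760
```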
CONDITIONAL DISTRIBUTIONS
• Conditional distribution of X, given that Y = y:

$$p(x \mid y) = P\{X = x \mid Y = y\} = \frac{p(x, y)}{p(y)} = \frac{P\{X = x,\ Y = y\}}{P\{Y = y\}}$$

(the numerator is the JOINT probability, i.e. the intersection), and similarly for $p(y \mid x)$.
• where for X and Y independent, $p(x \mid y) = p(x)$ and $p(y \mid x) = p(y)$.
• Example: Mendel's expt. Probability that a round seed (Z = 1) is a homozygote AA, i.e. (X = 2):

$$P\{X = 2 \mid Z = 1\} = \frac{P\{X = 2,\ Z = 1\}}{P\{Z = 1\}} = \frac{1/4}{3/4} = \frac{1}{3}$$
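The conditional calculation as a short exact check (genotype probabilities 1/4, 1/2, 1/4 for aa, Aa/aA, AA in Mendel's cross, as above):

```python
from fractions import Fraction as F

p_X = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}   # aa, Aa/aA, AA
p_Z1 = p_X[1] + p_X[2]                       # P{Z = 1} = 3/4 (round)
p_joint = p_X[2]                             # P{X = 2, Z = 1}: AA is round
print(p_joint / p_Z1)                        # 1/3
```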
Standard Statistical Distributions
Importance
• Modelling practical applications
• Mathematical properties are known
• Described by few parameters, which have natural interpretations

Bernoulli Distribution
This is used to model a trial/expt. which gives rise to two outcomes: success/failure, male/female, 0/1, …
Let p be the probability that the outcome is one and q = 1 − p the probability that the outcome is zero.

[Bar chart of the p.m.f.: P{X = 0} = 1 − p, P{X = 1} = p]

$$E[X] = p(1) + (1 - p)(0) = p$$
$$\mathrm{VAR}[X] = p(1)^2 + (1 - p)(0)^2 - E[X]^2 = p(1 - p)$$
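A sketch using scipy's built-in Bernoulli distribution (p = 0.3 is illustrative):

```python
from scipy.stats import bernoulli

p = 0.3                                # illustrative success probability
X = bernoulli(p)
print(X.pmf(1), X.pmf(0))              # p and 1 - p
print(X.mean(), X.var())               # p and p(1 - p): 0.3 0.21
```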
Standard distributions - Binomial
Binomial Distribution
Suppose that we are interested in the number of successes X in n independent repetitions of a Bernoulli trial, where the probability of success in an individual trial is p. Then

$$P\{X = k\} = \binom{n}{k} p^k (1 - p)^{n-k}, \quad (k = 0, 1, \ldots, n)$$
$$E[X] = np$$
$$\mathrm{VAR}[X] = np(1 - p)$$

[Bar chart of the p.m.f. for n = 4, p = 0.2]

This is the appropriate distribution to model e.g. the number of recombinant gametes produced by a heterozygous parent for a 2-locus model. The extension for ≥ 3 loci is the multinomial.
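The case plotted on the slide (n = 4, p = 0.2), reproduced with scipy:

```python
import numpy as np
from scipy.stats import binom

n, p = 4, 0.2                              # the case plotted on the slide
k = np.arange(n + 1)
print(binom.pmf(k, n, p))                  # P{X = k} for k = 0..4
print(binom.mean(n, p), binom.var(n, p))   # np = 0.8, np(1-p) = 0.64
```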
Standard distributions - Poisson
Poisson Distribution
The Poisson distribution arises as a limiting case of the binomial distribution, where n → ∞ and p → 0 in such a way that np → λ (constant).

$$P\{X = k\} = \frac{e^{-\lambda} \lambda^k}{k!} \quad (k = 0, 1, 2, \ldots)$$
$$E[X] = \lambda$$
$$\mathrm{VAR}[X] = \lambda$$

[Bar chart of the p.m.f.]

The Poisson is used to model the no. of occurrences of a certain phenomenon in a fixed period of time or space, e.g.
• particles emitted by a radioactive source in a fixed direction for interval ΔT
• people arriving in a queue in a fixed interval of time
• genomic mapping functions, e.g. cross-over as a random event
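The limit can be seen numerically: binomial probabilities with n growing, p shrinking and np = λ fixed approach the Poisson p.m.f. (λ = 2 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom, poisson

lam, k = 2.0, np.arange(6)
for n in (10, 100, 10_000):
    p = lam / n                            # np = lambda held constant
    print(n, np.round(binom.pmf(k, n, p), 4))
print('Poisson', np.round(poisson.pmf(k, lam), 4))
```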
Other Standard examples: e.g. Hypergeometric, Exponential…
• Consider a population of M items, of which W are deemed to be successes. Let X be the number of successes that occur in a sample of size n, drawn without replacement from the finite population.

$$P\{X = k\} = \frac{\binom{W}{k}\binom{M-W}{n-k}}{\binom{M}{n}} \quad (k = 0, 1, 2, \ldots)$$

• Then

$$E[X] = \frac{nW}{M}$$
$$\mathrm{VAR}[X] = \frac{nW(M - W)(M - n)}{M^2 (M - 1)}$$

• Exponential: special case of the Gamma distribution with n = 1, used e.g. to model inter-arrival time of customers or time to arrival of the first customer in a simple queue, e.g. fragment lengths in genome mapping etc.
• The p.d.f. is

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0,\ \lambda > 0 \\ 0 & \text{otherwise} \end{cases}$$
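Both distributions are available in scipy; a sketch with illustrative parameters (M = 50, W = 10, n = 5, λ = 0.5):

```python
from scipy.stats import hypergeom, expon

M, W, n = 50, 10, 5                    # population, successes, sample size
X = hypergeom(M, W, n)                 # scipy argument order: M, W, n
print(X.pmf(2))                        # P{X = 2}
print(X.mean(), n * W / M)             # both give nW/M = 1.0

lam = 0.5                              # illustrative rate
T = expon(scale=1 / lam)               # f(x) = lam * exp(-lam * x)
print(T.mean(), T.var())               # 1/lam = 2.0, 1/lam^2 = 4.0
```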
Standard p.d.f.'s - Gaussian/Normal
• A random variable X has a normal distribution with mean μ and standard deviation σ if it has density

$$f(x) = \begin{cases} \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2\right] & -\infty < x < \infty \\ 0 & \text{otherwise} \end{cases}$$

with $E(X) = \mu$ and $V(X) = \sigma^2$.
• Arises naturally as the limiting distribution of the average of a set of independent, identically distributed random variables with finite variances.
• Plays a central role in sampling theory and is a good approximation to a large class of empirical distributions. The default assumption in many empirical studies is that each observation is approx. ~ N(μ, σ²).
• Statistical tables of the Normal distribution are of great importance in analysing practical data sets. X is said to be a Standardised Normal variable if μ = 0 and σ = 1.
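Standardisation in practice: any N(μ, σ²) probability reduces to a standard normal ("table") value. A sketch with illustrative μ and σ:

```python
from scipy.stats import norm

mu, sigma = 10.0, 2.0                  # illustrative parameters
x = 13.0
z = (x - mu) / sigma                   # standardised value, here 1.5

# P{X <= x} under N(mu, sigma^2) equals P{Z <= z} under N(0, 1)
print(norm.cdf(x, loc=mu, scale=sigma), norm.cdf(z))   # both ~ 0.9332
```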
Standard p.d.f.'s: Student's t-distribution
• A random variable X has a t-distribution with n d.o.f. ($t_n$) if it has density

$$f(t) = \begin{cases} \dfrac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \dfrac{t^2}{n}\right)^{-(n+1)/2} & -\infty < t < \infty \\ 0 & \text{otherwise} \end{cases}$$

Symmetrical about the origin, with E[X] = 0 and V[X] = n/(n − 2) (for n > 2).
• For small n, the $t_n$ distribution is very flat.
• For n ≥ 25, the $t_n$ distribution ≈ standard normal curve.
• Suppose Z is a standard Normal variable, W has a $\chi^2_n$ distribution, and Z and W are independent; then the r.v.

$$X = \frac{Z}{\sqrt{W/n}}$$

has a $t_n$ distribution.
• If x₁, x₂, …, xₙ is a random sample from N(μ, σ²), and if we define

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}, \quad \text{then} \quad \frac{(\bar{x} - \mu)}{s/\sqrt{n}} \sim t_{n-1}$$
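The sampling result can be checked by simulation: the statistic (x̄ − μ)/(s/√n) behaves like t_{n−1} (parameters below are illustrative):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
mu, sigma, n = 5.0, 2.0, 10            # illustrative parameters

x = rng.normal(mu, sigma, size=(50_000, n))
s = x.std(axis=1, ddof=1)              # s^2 = sum (x_i - xbar)^2 / (n - 1)
tstat = (x.mean(axis=1) - mu) / (s / np.sqrt(n))

# Compare empirical and theoretical tail probability P{T > 2}
print(np.mean(tstat > 2), t.sf(2, df=n - 1))   # both ~ 0.038
```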
Chi-Square Distribution
• A r.v. X has a Chi-square distribution with n degrees of freedom (n a positive integer) if it is a Gamma distribution with shape n/2 and rate λ = 1/2, so its p.d.f. is

$$f(x) = \begin{cases} \dfrac{x^{n/2 - 1} e^{-x/2}}{2^{n/2}\,\Gamma(n/2)} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = n; \quad \mathrm{Var}[X] = 2n$$

[Plot of the $\chi^2_\nu(x)$ p.d.f.]

• Two important applications:
- If X₁, X₂, …, Xₙ is a sequence of independently distributed Standardised Normal random variables, then the sum of squares X₁² + X₂² + … + Xₙ² has a χ² distribution (n degrees of freedom).
- If x₁, x₂, …, xₙ is a random sample from N(μ, σ²), then with

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$$

$(n-1)s^2/\sigma^2$ has a χ² distribution with n − 1 d.o.f., and the r.v.'s $\bar{x}$ and $s^2$ are independent.
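The first application, checked by simulation (the sum of squares of n standard normals against the χ²_n tail):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 5
ss = (rng.normal(size=(100_000, n))**2).sum(axis=1)  # X1^2 + ... + Xn^2

print(ss.mean(), ss.var())        # ~ n and 2n: here ~ 5 and 10
print(np.mean(ss > 11.07), chi2.sf(11.07, df=n))     # both ~ 0.05
```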
F-Distribution
• A r.v. X has an F distribution with m and n d.o.f. if it has a density function equal to a ratio of gamma functions for x > 0, and 0 otherwise.

$$E[X] = \frac{n}{n - 2} \quad \text{if } n > 2$$
$$\mathrm{Var}[X] = \frac{2n^2(m + n - 2)}{m(n - 4)(n - 2)^2} \quad \text{if } n > 4$$

• For X and Y independent r.v.'s, X ~ $\chi^2_m$ and Y ~ $\chi^2_n$, then

$$F_{m,n} = \frac{X/m}{Y/n}$$

• One consequence: if x₁, x₂, …, x_m (m ≥ 2) is a random sample from N(μ₁, σ₁²), and y₁, y₂, …, yₙ (n ≥ 2) a random sample from N(μ₂, σ₂²), then (for σ₁² = σ₂²)

$$\frac{\sum_i (x_i - \bar{x})^2 / (m - 1)}{\sum_i (y_i - \bar{y})^2 / (n - 1)} \sim F_{m-1,\ n-1}$$
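And the defining ratio, checked by simulation: independent χ² variables divided by their d.o.f. give an F variable (m = 4, n = 12 are illustrative):

```python
import numpy as np
from scipy.stats import chi2, f

rng = np.random.default_rng(3)
m, n = 4, 12
X = chi2.rvs(m, size=100_000, random_state=rng)
Y = chi2.rvs(n, size=100_000, random_state=rng)
F_vals = (X / m) / (Y / n)             # F_{m,n} = (X/m) / (Y/n)

print(F_vals.mean(), n / (n - 2))      # E[X] = n/(n-2) = 1.2 for n > 2
print(np.mean(F_vals > 3.26), f.sf(3.26, m, n))   # both ~ 0.05
```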