Lecture 3: Resemblance Between Relatives


Lecture 1: Basic Statistical Tools
Discrete and Continuous Random Variables
A random variable (RV) = an outcome (realization) that is not a set
value, but rather is drawn from some probability distribution
A discrete RV x takes on values X1, X2, ..., Xk
Probability distribution: P_i = Pr(x = X_i), where P_i > 0 and \sum_i P_i = 1
A continuous RV x can take on any possible value in some
interval (or set of intervals)
The probability distribution is defined by the
probability density function, p(x)
The pdf satisfies p(x) \ge 0 and \int_{-\infty}^{+\infty} p(x) \, dx = 1, with

Pr(x_1 < x \le x_2) = \int_{x_1}^{x_2} p(x) \, dx
Joint and Conditional Probabilities
The probability for a pair (x,y) of random variables is
specified by the joint probability density function, p(x,y)
Pr(y_1 \le y \le y_2, \; x_1 \le x \le x_2) = \int_{y_1}^{y_2} \int_{x_1}^{x_2} p(x, y) \, dx \, dy

The marginal density of x:

p(x) = \int_{-\infty}^{+\infty} p(x, y) \, dy

p(y | x), the conditional density of y given x:

Pr(y_1 \le y \le y_2 \mid x) = \int_{y_1}^{y_2} p(y | x) \, dy

Relationships among p(x), p(x,y), and p(y|x):

p(x, y) = p(y | x) p(x); hence p(y | x) = p(x, y) / p(x)

x and y are said to be independent if p(x, y) = p(x) p(y)
Note that p(y|x) = p(y) if x and y are independent
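
As a quick numerical illustration of these relationships, here is a minimal Python sketch (assuming NumPy is available; the joint density of two independent unit normals is just an illustrative choice). It recovers the marginal p(x) by integrating p(x,y) over y and checks that p(y|x) = p(x,y)/p(x) equals p(y) under independence.

```python
import numpy as np

# Grid over a wide enough range that the densities are ~0 at the edges
y = np.linspace(-6, 6, 1201)
dy = y[1] - y[0]
x0 = 0.5  # fix x at an arbitrary value

def norm_pdf(z, mu=0.0, sigma=1.0):
    return np.exp(-(z - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Joint density of two independent unit normals: p(x,y) = p(x) p(y)
p_joint = norm_pdf(x0) * norm_pdf(y)

# Marginal of x at x0: integrate the joint density over y
p_x = np.sum(p_joint) * dy                        # ~ norm_pdf(x0)

# Conditional density p(y | x = x0) = p(x0, y) / p(x0)
p_y_given_x = p_joint / p_x

print(p_x, norm_pdf(x0))                          # marginal matches p(x0)
print(np.max(np.abs(p_y_given_x - norm_pdf(y))))  # ~0: p(y|x) = p(y) under independence
```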
Bayes’ Theorem
Suppose an unobservable RV takes on values b1 .. bn
Suppose that we observe the outcome A of an RV correlated
with b. What can we say about b given A?
Bayes’ theorem:

Pr(b_j \mid A) = \frac{ \Pr(b_j) \Pr(A \mid b_j) }{ \Pr(A) } = \frac{ \Pr(b_j) \Pr(A \mid b_j) }{ \sum_{i=1}^{n} \Pr(b_i) \Pr(A \mid b_i) }
A typical application in genetics is that A is some
phenotype and b indexes some underlying (but unknown)
genotype
Genotype                      QQ     Qq     qq
Freq(genotype)                0.5    0.3    0.2
Pr(height > 70 | genotype)    0.3    0.6    0.9
Pr(height > 70) = 0.3*0.5 + 0.6*0.3 + 0.9*0.2 = 0.51

Pr(QQ | height > 70) = Pr(QQ) * Pr(height > 70 | QQ) / Pr(height > 70)
= 0.5*0.3 / 0.51 = 0.294
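
The same calculation in a minimal Python sketch (plain Python, no extra libraries; the numbers come straight from the table above):

```python
# Prior frequencies and Pr(height > 70 | genotype), from the table above
prior = {"QQ": 0.5, "Qq": 0.3, "qq": 0.2}
p_tall_given_g = {"QQ": 0.3, "Qq": 0.6, "qq": 0.9}

# Pr(height > 70) = sum over genotypes of Pr(g) * Pr(height > 70 | g)
p_tall = sum(prior[g] * p_tall_given_g[g] for g in prior)

# Bayes' theorem: Pr(g | height > 70) = Pr(g) * Pr(height > 70 | g) / Pr(height > 70)
posterior = {g: prior[g] * p_tall_given_g[g] / p_tall for g in prior}

print(p_tall)            # 0.51
print(posterior["QQ"])   # ~0.294
```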
Expectations of Random Variables
The expected value, E[f(x)], of some function f(x) of the
random variable x is just the average value of that function:

E[f(x)] = \sum_i \Pr(x = X_i) f(X_i)    (x discrete)

E[f(x)] = \int_{-\infty}^{+\infty} f(x) \, p(x) \, dx    (x continuous)

E[x] = the (arithmetic) mean, \mu, of a random variable x:

E(x) = \mu = \int_{-\infty}^{+\infty} x \, p(x) \, dx

E[(x - \mu)^2] = \sigma^2, the variance of x:

E[(x - \mu)^2] = \sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 \, p(x) \, dx

More generally, the rth moment about the mean is given by E[(x - \mu)^r]:
r = 2: variance
r = 3: skew
r = 4: (scaled) kurtosis

Useful properties of expectations:
E [g(x) + f (y)] = E [g(x)] + E [f (y)]
E (c x) = c E (x)
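
A small Monte Carlo sketch of these definitions and properties (Python with NumPy; the normal with \mu = 2 and \sigma^2 = 9 is an arbitrary example): sample averages approximate the expectations above.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, 1_000_000)

# E[x] ~ sample mean; E[(x - mu)^2] ~ sample variance
print(x.mean())                    # ~2.0
print(np.mean((x - x.mean())**2))  # ~9.0

# Higher central moments: r = 3 (skew) and r = 4 (unscaled kurtosis)
print(np.mean((x - x.mean())**3))  # ~0 for a normal
print(np.mean((x - x.mean())**4))  # ~3 * sigma^4 = 243 for a normal

# Linearity: E[g(x) + f(y)] = E[g(x)] + E[f(y)] and E[c x] = c E[x]
y = rng.normal(0.0, 1.0, 1_000_000)
print(np.mean(x**2 + y), np.mean(x**2) + np.mean(y))  # approximately equal
print(np.mean(5 * x), 5 * x.mean())                   # approximately equal
```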
The Normal (or Gaussian) Distribution
A normal RV with mean \mu and variance \sigma^2 has density:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]
Mean (\mu) = peak of the distribution. The variance is a measure
of spread about the mean: the smaller \sigma^2, the narrower the
distribution about the mean.
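
A quick check of the density formula (Python with NumPy/SciPy; \mu = 1 and \sigma = 2 are arbitrary example values), comparing the expression above against scipy.stats.norm.pdf:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0          # arbitrary example values
x = np.linspace(-5, 7, 7)

# Density written out directly from the formula above
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(np.allclose(p, norm.pdf(x, loc=mu, scale=sigma)))  # True
```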
The truncated normal
Only consider values of T or above in a normal.

Density function of the truncated distribution:

p(x | x > T) = \frac{p(x)}{\Pr(x > T)}

Mean of the truncated distribution:

E[z \mid z > T] = \int_T^{\infty} \frac{z \, p(z)}{\Pr(z > T)} \, dz = \mu + \frac{\sigma^2 \, p_T}{\pi_T}

Here p_T is the height of the normal at the truncation point T,

p_T = (2\pi\sigma^2)^{-1/2} \exp\left[ -\frac{(T - \mu)^2}{2\sigma^2} \right]

and \pi_T = \Pr(z > T).

Variance of the truncated distribution:

\sigma^2(z \mid z > T) = \sigma^2 \left[ 1 + \frac{(T - \mu) \, p_T}{\pi_T} - \left( \frac{\sigma \, p_T}{\pi_T} \right)^2 \right]
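
A minimal sketch (Python with NumPy/SciPy; \mu, \sigma, and the truncation point T are arbitrary example values) that evaluates the truncated-normal mean and variance formulas above and checks them against simulation:

```python
import numpy as np
from scipy.stats import norm

mu, sigma, T = 10.0, 2.0, 12.0            # example values
pT = norm.pdf(T, loc=mu, scale=sigma)     # height of the normal at the truncation point
piT = norm.sf(T, loc=mu, scale=sigma)     # pi_T = Pr(z > T)

mean_trunc = mu + sigma**2 * pT / piT
var_trunc = sigma**2 * (1 + (T - mu) * pT / piT - (sigma * pT / piT)**2)

# Check by simulation: keep only draws above T
rng = np.random.default_rng(2)
z = rng.normal(mu, sigma, 2_000_000)
z = z[z > T]
print(mean_trunc, z.mean())   # formula vs simulated mean
print(var_trunc, z.var())     # formula vs simulated variance
```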
Covariances
• Cov(x,y) = E[(x - \mu_x)(y - \mu_y)] = E[x*y] - E[x]*E[y]
• Cov(x,y) > 0, positive (linear) association between x & y
• Cov(x,y) < 0, negative (linear) association between x & y
• Cov(x,y) = 0, no linear association between x & y
• Cov(x,y) = 0 DOES NOT imply no association between x & y
[Figure: scatterplots of Y against X illustrating cov(X,Y) > 0, cov(X,Y) < 0, and cov(X,Y) = 0]
Correlation
Cov = 10 tells us nothing about the strength of an
association
What is needed is an absolute measure of association
This is provided by the correlation, r(x,y)
r(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x) \, Var(y)}}
r = 1 implies a perfect (positive) linear association
r = - 1 implies a perfect (negative) linear association
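
A short numerical sketch of these definitions (Python with NumPy; the bivariate data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100_000)
y = 0.5 * x + rng.normal(0, 1, 100_000)   # y has a positive linear association with x

# Cov(x,y) = E[xy] - E[x]E[y]
cov_xy = np.mean(x * y) - x.mean() * y.mean()

# r(x,y) = Cov(x,y) / sqrt(Var(x) Var(y))
r_xy = cov_xy / np.sqrt(x.var() * y.var())

print(cov_xy)                          # ~0.5
print(r_xy, np.corrcoef(x, y)[0, 1])   # both ~0.45
```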
Useful Properties of Variances and Covariances
• Symmetry, Cov(x,y) = Cov(y,x)
• The covariance of a variable with itself is the
variance, Cov(x,x) = Var(x)
• If a is a constant, then
– Cov(ax,y) = a Cov(x,y)
– Var(ax) = a^2 Var(x), since Var(ax) = Cov(ax,ax) = a^2 Cov(x,x) = a^2 Var(x)
• Cov(x+y,z) = Cov(x,z) + Cov(y,z)
More generally,

Cov\left( \sum_{i=1}^{n} x_i, \; \sum_{j=1}^{m} y_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov(x_i, y_j)

Var(x + y) = Var(x) + Var(y) + 2 Cov(x, y)
Hence, the variance of a sum equals the sum of the
variances ONLY when the elements are uncorrelated
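
A tiny numerical check of the variance-of-a-sum identity (Python with NumPy; the correlated data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 200_000)
y = 0.8 * x + rng.normal(0, 1, 200_000)   # x and y are correlated

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * (np.mean(x * y) - x.mean() * y.mean())
print(np.allclose(lhs, rhs))   # True: ignoring Cov(x,y) would understate Var(x + y)
```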
Regressions
Consider the best (linear) predictor of y given we know x
\hat{y} = \bar{y} + b_{y|x} (x - \bar{x})

The slope of this linear regression is a function of Cov:

b_{y|x} = \frac{Cov(x, y)}{Var(x)}

The fraction of the variation in y accounted for by knowing
x, i.e., 1 - Var(y - \hat{y})/Var(y), is r^2.

[Figure: scatterplots with fitted regression lines for r^2 = 0.9 and r^2 = 0.6]
Relationship between the correlation and the regression slope:

r(x, y) = \frac{Cov(x, y)}{\sqrt{Var(x) \, Var(y)}} = b_{y|x} \sqrt{\frac{Var(x)}{Var(y)}}

If Var(x) = Var(y), then b_{y|x} = b_{x|y} = r(x,y)
In this case, the fraction of variation accounted for
by the regression is b^2
Properties of Least-squares Regressions
The slope and intercept obtained by least squares
minimize the sum of squared residuals:

\sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - a - b x_i)^2

• The regression line passes through the means of both x and y
• The average value of the residual is zero
• The LS solution maximizes the amount of variation in
y that can be explained by a linear regression on x
• Fraction of variance in y accounted for by the regression
is r^2
• The residual errors around the least-squares regression
are uncorrelated with the predictor variable x
• Homoscedastic vs. heteroscedastic residual variances
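
A minimal sketch of these properties (Python with NumPy; the data are simulated for illustration): the slope is computed as Cov(x,y)/Var(x), the intercept forces the line through the means, and the residual and r^2 properties listed above are checked numerically.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 50_000)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50_000)

# Least-squares slope and intercept: b = Cov(x,y)/Var(x), a = ybar - b*xbar
cov_xy = np.mean(x * y) - x.mean() * y.mean()
b = cov_xy / np.var(x)
a = y.mean() - b * x.mean()          # so the fitted line passes through (xbar, ybar)

yhat = a + b * x
resid = y - yhat

print(np.mean(resid))                                 # ~0: residuals average zero
print(np.mean(resid * x) - resid.mean() * x.mean())   # ~0: residuals uncorrelated with x

# Fraction of variance in y explained by the regression equals r^2
r2 = (cov_xy / np.sqrt(np.var(x) * np.var(y)))**2
print(np.var(yhat) / np.var(y), r2)                   # both ~0.69
```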
Maximum Likelihood
p(x1,…, xn | θ) = density of the observed data (x1,…, xn)
given the (unknown) distribution parameter(s) θ

Fisher (yup, the same one) suggested the method of
maximum likelihood: given the data (x1,…, xn), find the
value(s) of θ that maximize p(x1,…, xn | θ)

We usually express p(x1,…, xn | θ) as a likelihood
function l(θ | x1,…, xn) to remind us that it is dependent
on the observed data

The Maximum Likelihood Estimator (MLE) of θ is the
value (or values) that maximizes the likelihood function l given
the observed data x1,…, xn.
[Figure: likelihood function l(θ | x) plotted against θ, with its peak at the MLE of θ]
The curvature of the likelihood surface in the neighborhood of the
MLE informs us as to the precision of the estimator: a narrow peak
= high precision, a broad peak = lower precision.

This is formalized by looking at the log-likelihood surface,
L = ln[ l(θ | x) ]. Since ln is a monotonic function, the value of θ
that maximizes l also maximizes L.

Var(MLE) = \frac{-1}{\partial^2 L(\theta \mid z) / \partial \theta^2}

The 2nd derivative of L measures curvature. The curvature is negative
at a maximum, and the larger the curvature, the smaller the variance.
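
A minimal sketch of these ideas (Python with NumPy/SciPy; normal data with known \sigma = 1 are an arbitrary example, chosen so that Var(MLE) = \sigma^2/n is known exactly): maximize the log-likelihood numerically, then estimate Var(MLE) from the curvature at the maximum.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.normal(3.0, 1.0, 200)          # data: unknown mean, known sigma = 1

def logL(theta):
    # log-likelihood L(theta | x) for a normal mean with sigma = 1
    return np.sum(norm.logpdf(x, loc=theta, scale=1.0))

# MLE: maximize L (i.e., minimize -L); for this model the MLE is the sample mean
fit = minimize_scalar(lambda t: -logL(t), bounds=(-10, 10), method="bounded")
mle = fit.x
print(mle, x.mean())                   # essentially identical

# Curvature: numerical second derivative of L at the MLE (negative at a maximum)
h = 1e-4
d2L = (logL(mle + h) - 2 * logL(mle) + logL(mle - h)) / h**2
print(-1 / d2L, 1.0 / len(x))          # Var(MLE) = -1/L'' = sigma^2 / n = 0.005
```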
Likelihood Ratio tests
Hypothesis testing in the ML framework occurs
through likelihood-ratio (LR) tests:

LR = -2 \ln\left[ \frac{l(\hat{\theta}_r \mid z)}{l(\hat{\theta} \mid z)} \right] = -2 \left[ L(\hat{\theta}_r \mid z) - L(\hat{\theta} \mid z) \right]

This is the ratio of the maximum value of the likelihood function
under the null hypothesis (typically with r parameters assigned fixed
values) vs. the maximum value of the likelihood function under the
alternative. For large sample sizes, the LR statistic (generally)
approaches a chi-square distribution under the null hypothesis with
r df (r = number of parameters assigned fixed values under the null).
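
A small LR-test sketch (Python with NumPy/SciPy; normal data with known \sigma = 1, testing the null \mu = 0 against an unrestricted \mu, so r = 1 parameter is fixed under the null):

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(7)
x = rng.normal(0.2, 1.0, 100)        # data: true mean 0.2, known sigma = 1

def logL(mu):
    return np.sum(norm.logpdf(x, loc=mu, scale=1.0))

L_null = logL(0.0)        # likelihood maximized under the null (mu fixed at 0)
L_alt = logL(x.mean())    # unrestricted MLE of mu is the sample mean

LR = -2 * (L_null - L_alt)
p_value = chi2.sf(LR, df=1)   # compare with a chi-square distribution, 1 df
print(LR, p_value)
```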
Bayesian Statistics
An extension of likelihood is Bayesian statistics. Instead of simply
estimating a point estimate (e.g., the MLE), the goal is to estimate
the entire distribution for the unknown parameter θ given the data x:

p(θ | x) = C * l(x | θ) p(θ)

Here p(θ | x) is the posterior distribution of θ given x, l(x | θ) is
the likelihood function, p(θ) is the prior distribution for θ, and C
is the appropriate constant so that the posterior integrates to one.

Why Bayesian?
• Exact for any sample size
• Marginal posteriors
• Efficient use of any prior information
• MCMC (such as Gibbs sampling) methods
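
A minimal sketch of the posterior relationship above (Python with NumPy/SciPy; a normal likelihood for an unknown mean θ with known \sigma = 1 and a normal prior are arbitrary example choices), computed on a grid and normalized so the posterior integrates to one:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.normal(1.0, 1.0, 20)                 # data: unknown mean theta, known sigma = 1

theta = np.linspace(-4, 6, 2001)             # grid of candidate theta values
dtheta = theta[1] - theta[0]

prior = norm.pdf(theta, loc=0.0, scale=2.0)  # prior p(theta): N(0, 2^2)
logL = np.array([np.sum(norm.logpdf(x, loc=t, scale=1.0)) for t in theta])
like = np.exp(logL - logL.max())             # likelihood l(x | theta), rescaled for stability

posterior = like * prior
posterior /= np.sum(posterior) * dtheta      # the constant C makes it integrate to one

# Posterior mean (for this conjugate setup it could also be checked against the closed form)
post_mean = np.sum(theta * posterior) * dtheta
print(post_mean)
```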