Transcript Document

Two Discrete Random Variables
The probability mass function (pmf) of a single discrete rv X
specifies how much probability mass is placed on each
possible X value.
The joint pmf of two discrete rv’s X and Y describes how
much probability mass is placed on each possible pair of
values (x, y).
Definition
Let X and Y be two discrete rv’s defined on the sample
space of an experiment. The joint probability mass
function p(x, y) is defined for each pair of numbers (x, y)
by
p(x, y) = P(X = x and Y = y)
It must be the case that p(x, y) ≥ 0 and Σ_x Σ_y p(x, y) = 1.
Now let A be any set consisting of pairs of (x, y) values
(e.g., A = {(x, y): x + y = 5} or {(x, y): max(x, y) ≤ 3}).
Then the probability P[(X, Y) ∈ A] is obtained by summing
the joint pmf over pairs in A:
P[(X, Y) ∈ A] = ΣΣ_{(x, y) ∈ A} p(x, y)
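As a quick illustration (not from the original slides), the short Python sketch below stores a hypothetical joint pmf as a dictionary, verifies that the probabilities sum to 1, and computes P[(X, Y) ∈ A] for A = {(x, y): x + y = 5}. All numerical values are assumptions chosen for the example.

# A hypothetical joint pmf for two discrete rv's X and Y,
# stored as {(x, y): p(x, y)}; values are illustrative only.
joint_pmf = {
    (1, 2): 0.10, (1, 3): 0.20, (1, 4): 0.10,
    (2, 2): 0.15, (2, 3): 0.25, (2, 4): 0.20,
}

# The pmf must sum to 1 over all possible (x, y) pairs.
assert abs(sum(joint_pmf.values()) - 1.0) < 1e-9

# P[(X, Y) in A] is obtained by summing p(x, y) over the pairs in A,
# here A = {(x, y): x + y = 5}.
prob_A = sum(p for (x, y), p in joint_pmf.items() if x + y == 5)
print(prob_A)   # 0.10 + 0.25 = 0.35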
Definition
The marginal probability mass function of X, denoted by
pX (x), is given by
pX(x) = Σ_y p(x, y)    for each possible value x.
Similarly, the marginal probability mass function of Y is
pY(y) = Σ_x p(x, y)    for each possible value y.
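Continuing the same hypothetical example (illustrative values, not from the slides), this sketch computes both marginal pmf's by summing the joint pmf over the other variable.

from collections import defaultdict

# Hypothetical joint pmf {(x, y): p(x, y)} used purely for illustration.
joint_pmf = {
    (1, 2): 0.10, (1, 3): 0.20, (1, 4): 0.10,
    (2, 2): 0.15, (2, 3): 0.25, (2, 4): 0.20,
}

# pX(x) = sum over y of p(x, y);  pY(y) = sum over x of p(x, y).
pX, pY = defaultdict(float), defaultdict(float)
for (x, y), p in joint_pmf.items():
    pX[x] += p
    pY[y] += p

print(dict(pX))   # {1: 0.40, 2: 0.60}  (up to float rounding)
print(dict(pY))   # {2: 0.25, 3: 0.45, 4: 0.30}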
Two Continuous Random Variables
The probability that the observed value of a continuous rv X
lies in a one-dimensional set A (such as an interval) is
obtained by integrating the pdf f(x) over the set A.
Similarly, the probability that the pair (X, Y) of continuous
rv’s falls in a two-dimensional set A (such as a rectangle) is
obtained by integrating a function called the joint density
function.
Definition
Let X and Y be continuous rv’s. A joint probability density
function f(x, y) for these two variables is a function
satisfying f(x, y) ≥ 0 and ∫∫ f(x, y) dx dy = 1, where the
double integral is taken over the entire plane.
Then for any two-dimensional set A,
P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy
In particular, if A is the two-dimensional rectangle
{(x, y): a ≤ x ≤ b, c ≤ y ≤ d}, then
P[(X, Y) ∈ A] = P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx
We can think of f(x, y) as specifying a surface at height
f(x, y) above the point (x, y) in a three-dimensional
coordinate system.
Then P[(X, Y)  A] is the volume underneath this surface
and above the region A, analogous to the area under a
curve in the case of a single rv.
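A small numerical check of the rectangle formula, assuming the illustrative joint pdf f(x, y) = x + y on the unit square (an assumed density, not one from the slides), can be done with scipy.integrate.dblquad:

from scipy.integrate import dblquad

# Illustrative joint pdf: f(x, y) = x + y on 0 <= x <= 1, 0 <= y <= 1 (0 elsewhere).
f = lambda y, x: x + y          # dblquad integrates the inner variable (y) first

# The total probability (volume under the whole surface) should be 1.
total, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 1)
print(total)                    # 1.0

# P(0 <= X <= 0.5, 0 <= Y <= 0.5): integrate f over that rectangle.
prob, _ = dblquad(f, 0, 0.5, lambda x: 0, lambda x: 0.5)
print(prob)                     # 0.125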
This is illustrated in Figure 5.1.
P[(X, Y) ∈ A] = volume under density surface above A
Figure 5.1
The marginal pdf of each variable can be obtained in a
manner analogous to what we did in the case of two
discrete variables.
The marginal pdf of X at the value x results from holding x
fixed in the pair (x, y) and integrating the joint pdf over y.
Integrating the joint pdf with respect to x gives the marginal
pdf of Y.
Definition
The marginal probability density functions of X and Y,
denoted by fX(x) and fY(y), respectively, are given by
fX(x) = ∫_{–∞}^{∞} f(x, y) dy    for –∞ < x < ∞
fY(y) = ∫_{–∞}^{∞} f(x, y) dx    for –∞ < y < ∞
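For example, under the same assumed density f(x, y) = x + y on the unit square used in the earlier sketch, the marginal of X is fX(x) = ∫_0^1 (x + y) dy = x + 1/2. The sympy sketch below checks this symbolically; the density is an illustrative assumption.

import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = x + y                      # illustrative joint pdf on the unit square

# Marginal pdf of X: integrate the joint pdf over y.
fX = sp.integrate(f, (y, 0, 1))
print(fX)                      # x + 1/2

# Marginal pdf of Y: integrate the joint pdf over x.
fY = sp.integrate(f, (x, 0, 1))
print(fY)                      # y + 1/2

# Each marginal should itself integrate to 1.
print(sp.integrate(fX, (x, 0, 1)), sp.integrate(fY, (y, 0, 1)))   # 1 1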
Independent Random Variables
Independence of two events was defined earlier; here is an
analogous definition for the independence of two rv’s.
Definition
Two random variables X and Y are said to be independent
if for every pair of x and y values
p(x, y) = pX(x) · pY(y)    when X and Y are discrete
or                                                            (5.1)
f(x, y) = fX(x) · fY(y)    when X and Y are continuous
If (5.1) is not satisfied for all (x, y), then X and Y are said to
be dependent.
The definition says that two variables are independent if
their joint pmf or pdf is the product of the two marginal
pmf’s or pdf’s.
Intuitively, independence says that knowing the value of
one of the variables does not provide additional information
about what the value of the other variable might be.
Independence of two random variables is most useful when
the description of the experiment under study suggests that
X and Y have no effect on one another.
Then once the marginal pmf’s or pdf’s have been specified,
the joint pmf or pdf is simply the product of the two marginal
functions. It follows that
P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) · P(c ≤ Y ≤ d)
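A hedged sketch of this product rule, using a hypothetical joint pmf built from assumed marginals: if p(x, y) factors as pX(x) · pY(y) for every pair, the variables are independent, and rectangle probabilities factor as above.

import itertools

# Hypothetical independent setup: marginals chosen first, joint built as the product.
pX = {0: 0.3, 1: 0.7}
pY = {0: 0.5, 1: 0.2, 2: 0.3}
joint = {(x, y): pX[x] * pY[y] for x, y in itertools.product(pX, pY)}

# Independence check: p(x, y) == pX(x) * pY(y) for every pair (x, y).
independent = all(abs(joint[(x, y)] - pX[x] * pY[y]) < 1e-12 for x, y in joint)
print(independent)   # True by construction

# Consequently P(X = 1, Y <= 1) = P(X = 1) * P(Y <= 1).
lhs = sum(p for (x, y), p in joint.items() if x == 1 and y <= 1)
rhs = pX[1] * (pY[0] + pY[1])
print(abs(lhs - rhs) < 1e-12)   # True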
More Than Two Random Variables
To model the joint behavior of more than two random
variables, we extend the concept of a joint distribution of
two variables.
Definition
If X1, X2, . . ., Xn are all discrete random variables, the joint
pmf of the variables is the function
p(x1, x2, . . . , xn) = P(X1 = x1, X2 = x2, . . . , Xn = xn)
If the variables are continuous, the joint pdf of X1, . . ., Xn is
the function f(x1, x2, . . ., xn) such that for any n intervals
[a1, b1], . . . , [an, bn],
P(a1 ≤ X1 ≤ b1, . . . , an ≤ Xn ≤ bn) = ∫_{a1}^{b1} . . . ∫_{an}^{bn} f(x1, . . . , xn) dxn . . . dx1
In a binomial experiment, each trial could result in one of
only two possible outcomes.
Consider now an experiment consisting of n independent
and identical trials, in which each trial can result in any one
of r possible outcomes.
Let pi = P(outcome i on any particular trial), and define
random variables by Xi = the number of trials resulting in
outcome i (i = 1, . . . , r).
Such an experiment is called a multinomial experiment,
and the joint pmf of X1, . . . , Xr is called the multinomial
distribution.
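When x1 + . . . + xr = n, the multinomial pmf takes the standard form p(x1, . . . , xr) = n!/(x1! · · · xr!) · p1^x1 · · · pr^xr (and is 0 otherwise). The sketch below evaluates this formula directly; the trial counts and probabilities are illustrative assumptions, not values from the slides.

from math import factorial, prod

def multinomial_pmf(counts, probs):
    """p(x1, ..., xr) = n!/(x1!...xr!) * p1^x1 * ... * pr^xr, with n = sum of counts."""
    n = sum(counts)
    coef = factorial(n) // prod(factorial(x) for x in counts)
    return coef * prod(p ** x for x, p in zip(counts, probs))

# Illustrative example: 10 trials, 3 possible outcomes per trial.
print(multinomial_pmf([3, 5, 2], [0.2, 0.5, 0.3]))   # approximately 0.0567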
The notion of independence of more than two random
variables is similar to the notion of independence of more
than two events.
Definition
The random variables X1, X2, . . . , Xn are said to be
independent if for every subset Xi1, Xi2, . . . , Xik of the
variables (each pair, each triple, and so on), the joint pmf
or pdf of the subset is equal to the product of the marginal
pmf’s or pdf’s.
Conditional Distributions
Definition
Let X and Y be two continuous rv’s with joint pdf f(x, y) and
marginal X pdf fX(x). Then for any X value x for which
fX(x) > 0, the conditional probability density function of
Y given that X = x is
fY | X(y | x) = f(x, y) / fX(x)    for –∞ < y < ∞
If X and Y are discrete, replacing pdf’s by pmf’s in this
definition gives the conditional probability mass function
of Y when X = x.
Notice that the definition of fY | X(y | x) parallels that of
P(B | A), the conditional probability that B will occur, given
that A has occurred.
Once the conditional pdf or pmf has been determined,
questions about probabilities for Y given an observed value
of X can be answered by integrating or summing over an
appropriate set of Y values.
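As a worked illustration, using the same assumed density f(x, y) = x + y on the unit square as in the earlier sketches (not a density from the slides), the conditional pdf of Y given X = x is (x + y)/(x + 1/2). The sympy sketch checks that it integrates to 1 and uses it for a conditional probability.

import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = x + y                                   # illustrative joint pdf on the unit square
fX = sp.integrate(f, (y, 0, 1))             # marginal pdf of X: x + 1/2

# Conditional pdf of Y given X = x (valid wherever fX(x) > 0).
f_Y_given_X = sp.simplify(f / fX)
print(f_Y_given_X)                          # equivalent to (x + y)/(x + 1/2)

# It integrates to 1 over y for any fixed x...
print(sp.simplify(sp.integrate(f_Y_given_X, (y, 0, 1))))     # 1

# ...and P(Y <= 1/2 | X = 1/4) is obtained by integrating over y in [0, 1/2].
print(sp.integrate(f_Y_given_X.subs(x, sp.Rational(1, 4)), (y, 0, sp.Rational(1, 2))))   # 1/3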
Expected Values, Covariance, and Correlation
Proposition
Let X and Y be jointly distributed rv’s with pmf p(x, y) or
pdf f(x, y) according to whether the variables are discrete or
continuous.
Then the expected value of a function h(X, Y), denoted by
E[h(X, Y)] or μh(X, Y), is given by
E[h(X, Y)] = Σ_x Σ_y h(x, y) · p(x, y)    if X and Y are discrete
E[h(X, Y)] = ∫_{–∞}^{∞} ∫_{–∞}^{∞} h(x, y) · f(x, y) dx dy    if X and Y are continuous
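In the discrete case this is just a double sum. The sketch below, reusing the hypothetical joint pmf from the earlier examples, computes E[XY] and E[max(X, Y)] by summing h(x, y) · p(x, y) over all pairs.

# Hypothetical joint pmf {(x, y): p(x, y)} (illustrative values).
joint_pmf = {
    (1, 2): 0.10, (1, 3): 0.20, (1, 4): 0.10,
    (2, 2): 0.15, (2, 3): 0.25, (2, 4): 0.20,
}

def expected_value(h, pmf):
    """E[h(X, Y)] = sum over all (x, y) of h(x, y) * p(x, y)."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())

print(expected_value(lambda x, y: x * y, joint_pmf))       # E[XY]
print(expected_value(lambda x, y: max(x, y), joint_pmf))   # E[max(X, Y)]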
Covariance
When two random variables X and Y are not independent,
it is frequently of interest to assess how strongly they are
related to one another.
Definition
The covariance between two rv’s X and Y is
Cov(X, Y) = E[(X – μX)(Y – μY)]
= Σ_x Σ_y (x – μX)(y – μY) p(x, y)    X, Y discrete
= ∫_{–∞}^{∞} ∫_{–∞}^{∞} (x – μX)(y – μY) f(x, y) dx dy    X, Y continuous
Suppose X and Y have a strong positive relationship, so that
large values of X tend to occur with large values of Y and
small values of X with small values of Y.
Then most of the probability mass or density will be
associated with (x – μX) and (y – μY), either both positive
(both X and Y above their respective means) or both
negative, so the product (x – μX)(y – μY) will tend to be
positive.
Thus for a strong positive relationship, Cov(X, Y) should be
quite positive.
For a strong negative relationship, the signs of (x – μX) and
(y – μY) will tend to be opposite, yielding a negative
product.
Thus for a strong negative relationship, Cov(X, Y) should
be quite negative.
If X and Y are not strongly related, positive and negative
products will tend to cancel one another, yielding a
covariance near 0.
Figure 5.4 illustrates the different possibilities. The
covariance depends on both the set of possible pairs and
the probabilities. In Figure 5.4, the probabilities could be
changed without altering the set of possible pairs, and this
could drastically change the value of Cov(X, Y).
Figure 5.4  p(x, y) = 1/10 for each of ten pairs corresponding to the indicated points:
(a) positive covariance; (b) negative covariance; (c) covariance near zero
The following shortcut formula for Cov(X, Y) simplifies the
computations.
Proposition
Cov(X, Y) = E(XY) – μX · μY
According to this formula, no intermediate subtractions are
necessary; only at the end of the computation is μX · μY
subtracted from E(XY). The proof involves expanding
(X – μX)(Y – μY) and then taking the expected value of each
term separately.
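A minimal numerical check of the shortcut formula, again with the hypothetical joint pmf from the earlier sketches: compute Cov(X, Y) both from the definition and as E(XY) – μX · μY.

joint_pmf = {
    (1, 2): 0.10, (1, 3): 0.20, (1, 4): 0.10,
    (2, 2): 0.15, (2, 3): 0.25, (2, 4): 0.20,
}

# E[h(X, Y)] as a double sum over the joint pmf.
E = lambda h: sum(h(x, y) * p for (x, y), p in joint_pmf.items())

mu_X = E(lambda x, y: x)
mu_Y = E(lambda x, y: y)

# Definition: Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)].
cov_def = E(lambda x, y: (x - mu_X) * (y - mu_Y))

# Shortcut: Cov(X, Y) = E(XY) - mu_X * mu_Y.
cov_short = E(lambda x, y: x * y) - mu_X * mu_Y

print(cov_def, cov_short)   # identical up to float rounding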
Correlation
Definition
The correlation coefficient of X and Y, denoted by
Corr(X, Y), ρX,Y, or just ρ, is defined by
ρX,Y = Cov(X, Y) / (σX · σY)
The following proposition shows that ρ remedies the main
defect of Cov(X, Y), its dependence on the measurement
scales of X and Y, and also suggests how to recognize the
existence of a strong (linear) relationship.
Proposition
1. If a and c are either both positive or both negative,
Corr(aX + b, cY + d) = Corr(X, Y)
2. For any two rv’s X and Y, –1 ≤ Corr(X, Y) ≤ 1.
If we think of p(x, y) or f(x, y) as prescribing a mathematical
model for how the two numerical variables X and Y are
distributed in some population (height and weight, verbal
SAT score and quantitative SAT score, etc.), then ρ is a
population characteristic or parameter that measures how
strongly X and Y are related in the population.
We will consider taking a sample of pairs (x1, y1), . . . , (xn, yn)
from the population.
The sample correlation coefficient r will then be defined and
used to make inferences about ρ.
The correlation coefficient ρ is actually not a completely
general measure of the strength of a relationship.
Proposition
1. If X and Y are independent, then ρ = 0, but ρ = 0 does
not imply independence.
2. ρ = 1 or –1 iff Y = aX + b for some numbers a and b with
a ≠ 0.
This proposition says that ρ is a measure of the degree of
linear relationship between X and Y, and only when the
two variables are perfectly related in a linear manner will
ρ be as positive or negative as it can be.
A ρ less than 1 in absolute value indicates only that the
relationship is not completely linear, but there may still be a
very strong nonlinear relation.
Also, ρ = 0 does not imply that X and Y are independent,
but only that there is a complete absence of a linear
relationship. When ρ = 0, X and Y are said to be
uncorrelated.
Two variables could be uncorrelated yet highly dependent
because there is a strong nonlinear relationship, so be
careful not to conclude too much from knowing that ρ = 0.
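A classic illustration of this caution (a sketch, not an example from the slides): let X be uniform on {–1, 0, 1} and Y = X². Then Y is completely determined by X, yet Cov(X, Y) = 0, so ρ = 0.

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic (hence strongly dependent) function of X.
support = [-1, 0, 1]
pmf_X = {x: 1/3 for x in support}

E_X  = sum(x * p for x, p in pmf_X.items())          # 0
E_Y  = sum(x**2 * p for x, p in pmf_X.items())       # 2/3
E_XY = sum(x * x**2 * p for x, p in pmf_X.items())   # E[X^3] = 0

cov = E_XY - E_X * E_Y
print(cov)   # 0.0 -> rho = 0 even though Y = X^2 depends completely on X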
A value of ρ near 1 does not necessarily imply that
increasing the value of X causes Y to increase. It implies
only that large X values are associated with large Y values.
For example, in the population of children, vocabulary size
and number of cavities are quite positively correlated, but it
is certainly not true that cavities cause vocabulary
to grow.
Instead, the values of both these variables tend to increase
as the value of age, a third variable, increases.
For children of a fixed age, there is probably a low
correlation between number of cavities and vocabulary
size.
In summary, association (a high correlation) is not the same
as causation.
The Distribution of the Sample Mean
The importance of the sample mean X̄ springs from its use
in drawing conclusions about the population mean μ. Some
of the most frequently used inferential procedures are
based on properties of the sampling distribution of X̄.
A preview of these properties appeared in the calculations
and simulation experiments of the previous section, where
we noted relationships between E(X̄) and μ and also
among V(X̄), σ², and n.
Proposition
Let X1, X2, . . . , Xn be a random sample from a distribution
with mean value μ and standard deviation σ. Then
1. E(X̄) = μX̄ = μ
2. V(X̄) = σ²/n, so σX̄ = σ/√n
In addition, with T0 = X1 + . . . + Xn (the sample total),
E(T0) = nμ, V(T0) = nσ², and σT0 = √n σ.
The standard deviation σX̄ = σ/√n is often called the
standard error of the mean; it describes the magnitude of a
typical or representative deviation of the sample mean from
the population mean.
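A quick Monte Carlo check of these facts, under assumed simulation settings (population and sample size chosen for illustration, not taken from the slides): draw many samples of size n and compare the empirical mean and standard deviation of X̄ with μ and σ/√n.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 4.0, 25, 100_000      # illustrative settings

# Each row is one random sample of size n; each row mean is one realization of X-bar.
samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)

print(xbar.mean())        # close to mu = 10
print(xbar.std(ddof=1))   # close to sigma / sqrt(n) = 4/5 = 0.8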
The Case of a Normal Population Distribution
Proposition
Let X1, X2, . . . , Xn be a random sample from a normal
distribution with mean μ and standard deviation σ. Then for
any n, X̄ is normally distributed (with mean μ and standard
deviation σ/√n), as is T0 (with mean nμ and standard
deviation √n σ).
We know everything there is to know about the X̄ and T0
distributions when the population distribution is normal. In
particular, probabilities such as P(a ≤ X̄ ≤ b) and
P(c ≤ T0 ≤ d) can be obtained simply by standardizing.
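For instance, with assumed values μ = 10, σ = 4, and n = 25 (illustrative, not from the slides), X̄ is normal with mean 10 and standard deviation 0.8, and P(9 ≤ X̄ ≤ 11) follows by standardizing; the sketch evaluates it with scipy.stats.norm.

from scipy.stats import norm

mu, sigma, n = 10.0, 4.0, 25      # illustrative population values and sample size
se = sigma / n**0.5               # standard error of the mean: 0.8

# P(9 <= X-bar <= 11) via standardization: Phi((11 - mu)/se) - Phi((9 - mu)/se).
a, b = 9.0, 11.0
prob = norm.cdf((b - mu) / se) - norm.cdf((a - mu) / se)
print(prob)                       # about 0.789, since (b - mu)/se = 1.25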
The Central Limit Theorem
When the Xi’s are normally distributed, so is X̄, for every
sample size n.
Even when the population distribution is highly nonnormal,
averaging produces a distribution more bell-shaped than
the one being sampled.
A reasonable conjecture is that if n is large, a suitable
normal curve will approximate the actual distribution of X̄.
The formal statement of this result is the most important
theorem of probability.
Theorem
The Central Limit Theorem (CLT)
Let X1, X2, . . . , Xn be a random sample from a distribution
with mean μ and variance σ². Then if n is sufficiently large,
X̄ has approximately a normal distribution with μX̄ = μ and
σ²X̄ = σ²/n, and T0 also has approximately a normal
distribution with μT0 = nμ and σ²T0 = nσ². The larger the
value of n, the better the approximation.
According to the CLT, when n is large and we wish to
calculate a probability such as P(a ≤ X̄ ≤ b), we need only
“pretend” that X̄ is normal, standardize it, and use the
normal table.
The resulting answer will be approximately correct. The
exact answer could be obtained only by first finding the
distribution of X̄, so the CLT provides a truly impressive
shortcut.
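The following simulation sketch (settings are illustrative assumptions) shows the CLT in action: sample means of a heavily skewed exponential population behave like draws from the approximating normal distribution.

import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 1.0, 40, 100_000     # exponential population (mean 1, sd 1), samples of size 40

# Means of many samples of size n from a skewed (exponential) population.
xbar = rng.exponential(1 / lam, size=(reps, n)).mean(axis=1)

# CLT: X-bar is approximately normal with mean 1 and sd 1/sqrt(40) ~ 0.158.
print(xbar.mean(), xbar.std(ddof=1))   # about 1.0 and 0.158
print(np.mean(xbar <= 1.1))            # close to Phi((1.1 - 1)/0.158), roughly 0.74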
Other Applications of the Central Limit Theorem
The CLT can be used to justify the normal approximation to
the binomial distribution discussed earlier.
We know that a binomial variable X is the number of
successes in a binomial experiment consisting of n
independent success/failure trials with p = P(S) for any
particular trial. Define a new rv X1 by
X1 = 1 if the first trial results in a success (S)
X1 = 0 if the first trial results in a failure (F)
and define X2, X3, . . . , Xn analogously for the other n – 1
trials. Each Xi indicates whether or not there is a success
on the corresponding trial.
Because the trials are independent and P(S) is constant
from trial to trial, the Xi ’s are iid (a random sample from a
Bernoulli distribution).
The CLT then implies that if n is sufficiently large, both the
sum and the average of the Xi’s have approximately normal
distributions.
When the Xi’s are summed, a 1 is added for every S that
occurs and a 0 for every F, so X1 + . . . + Xn = X. The
sample mean of the Xi’s is X/n, the sample proportion of
successes.
That is, both X and X/n are approximately normal when n is
large.
The necessary sample size for this approximation depends
on the value of p: When p is close to .5, the distribution of
each Xi is reasonably symmetric (see Figure 5.19),
whereas the distribution is quite skewed when p is near
0 or 1. Using the approximation only if both np ≥ 10 and
n(1 – p) ≥ 10 ensures that n is large enough to overcome
any skewness in the underlying Bernoulli distribution.
Figure 5.19  Two Bernoulli distributions: (a) p = .4 (reasonably symmetric); (b) p = .1 (very skewed)
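A hedged numerical sketch of this approximation (the parameter values are illustrative, and the continuity correction used here is an added refinement, not something stated above): for n = 100 and p = .4, both np = 40 and n(1 – p) = 60 exceed 10, and the normal approximation tracks the exact binomial probability closely.

from scipy.stats import binom, norm

n, p = 100, 0.4                      # illustrative; np = 40 and n(1 - p) = 60 both exceed 10
mu, sd = n * p, (n * p * (1 - p)) ** 0.5

# Exact P(X <= 45) from the binomial distribution...
exact = binom.cdf(45, n, p)

# ...versus the CLT-based normal approximation with a continuity correction.
approx = norm.cdf((45 + 0.5 - mu) / sd)

print(exact, approx)                 # both about 0.87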
The Distribution of a Linear Combination
The sample mean X̄ and sample total T0 are special cases
of a type of random variable that arises very frequently in
statistical applications.
Definition
Given a collection of n random variables X1, . . . , Xn and
n numerical constants a1, . . . , an, the rv
Y = a1X1 + a2X2 + . . . + anXn = Σ_{i=1}^{n} aiXi    (5.7)
is called a linear combination of the Xi’s.
Proposition
Let X1, X2, . . . , Xn have mean values μ1, . . . , μn,
respectively, and variances σ1², . . . , σn², respectively.
1. Whether or not the Xi’s are independent,
E(a1X1 + a2X2 + . . . + anXn) = a1E(X1) + a2E(X2) + . . .
+ anE(Xn)
= a11 + . . . + ann
(5.8)
2. If X1, . . . , Xn are independent,
V(a1X1 + a2X2 + . . . + anXn) = a1²V(X1) + a2²V(X2) + . . . + an²V(Xn)
= a1²σ1² + . . . + an²σn²    (5.9)
And
σ_{a1X1 + . . . + anXn} = √(a1²σ1² + . . . + an²σn²)    (5.10)
3. For any X1, . . . , Xn,
V(a1X1 + . . . + anXn) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj)    (5.11)
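A short Monte Carlo sketch checking (5.8) and (5.9) for a linear combination of independent rv's; the coefficients and the three distributions are illustrative assumptions, not examples from the slides.

import numpy as np

rng = np.random.default_rng(2)
a1, a2, a3 = 2.0, -1.0, 0.5          # illustrative coefficients
reps = 200_000

# Three independent rv's with known means and variances (illustrative choices).
X1 = rng.normal(1.0, 2.0, reps)      # mu1 = 1, var1 = 4
X2 = rng.uniform(0.0, 6.0, reps)     # mu2 = 3, var2 = 3
X3 = rng.exponential(2.0, reps)      # mu3 = 2, var3 = 4

Y = a1 * X1 + a2 * X2 + a3 * X3

# (5.8): E(Y) = a1*mu1 + a2*mu2 + a3*mu3 = 2 - 3 + 1 = 0
# (5.9): V(Y) = a1^2*var1 + a2^2*var2 + a3^2*var3 = 16 + 3 + 1 = 20
print(Y.mean(), Y.var(ddof=1))       # close to 0 and 20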