A Brief Review of Some Important Statistical Concepts
Lecture 3
A Brief Review of Some Important Statistical Concepts
The Meaning of a Variable
A variable refers to any quantity that may take on more than one value
Population is a variable because it is not fixed or constant; it changes over time
The unemployment rate is a variable because it may take on any value from 0% to 100%
A random variable can be thought of as an unknown value that may
change every time it is inspected.
A random variable may be either discrete or continuous
A variable is discrete if its possible values have jumps or breaks
Population - measured in integers or whole units: 1, 2, 3, …
A variable is continuous if there are no jumps or breaks
Unemployment rate – need not be measured in whole units:
1.77, ..., 8.99, ...
Descriptive Statistics
Descriptive statistics are used to describe the main features of a
collection of data in quantitative terms.
Descriptive statistics aim to quantitatively summarize a data set
Some statistical summaries are especially common in descriptive
analyses. For example
Frequency Distribution
Central Tendency
Dispersion
Association
Frequency Distribution
Every set of data can be described in terms of how frequently certain
values occur.
In statistics, a frequency distribution is a tabulation of the values that
one or more variables take in a sample.
Consider the hypothetical prices of Dec CME Live Cattle Futures
Month       Price (cents/lb)
May         67.05
June        66.89
July        67.45
August      68.39
September   67.45
October     70.10
November    68.39
Frequency Distribution
Univariate frequency distributions are often presented as lists
ordered by quantity showing the number of times each value appears.
A frequency distribution may be grouped or ungrouped
For a small number of observations - ungrouped frequency distribution
For a large number of observations - grouped frequency distribution
Ungrouped                    Grouped
Price (X)   Frequency        Price (X)     Frequency
67.05       1                65.00-66.99   1
66.89       1                67.00-68.99   5
67.45       2                69.00-70.99   1
68.39       2                71.00-72.99   0
70.10       1                73.00-74.99   0
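The counts above can be reproduced with a short Python sketch (Python is not part of the lecture; this is only an illustrative check, and the class intervals are the ones from the grouped table):

```python
from collections import Counter

# Hypothetical Dec CME Live Cattle futures prices from the table above
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

# Ungrouped distribution: frequency of each exact price
ungrouped = Counter(prices)

# Grouped distribution: frequency of prices in each 2-cent class interval
intervals = [(65.00, 66.99), (67.00, 68.99), (69.00, 70.99),
             (71.00, 72.99), (73.00, 74.99)]
grouped = {iv: sum(1 for p in prices if iv[0] <= p <= iv[1])
           for iv in intervals}
```

Note that the seven observations must distribute across the intervals so the grouped frequencies sum to 7.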
Central Tendency
In statistics, the term central tendency relates to the way in which
quantitative data tend to cluster around a “central value”.
A measure of central tendency is any of a number of ways of
specifying this "central value."
There are three important descriptive statistics that give
measures of the central tendency of a variable:
The Mean
The Median
The Mode
The Mean
The arithmetic mean is the most commonly-used type of average and
is often referred to simply as the average.
In mathematics and statistics, the arithmetic mean (or simply the
mean) of a list of numbers is the sum of all numbers in the list divided
by the number of items in the list.
If the list is a statistical population, then the mean of that population
is called a population mean.
If the list is a statistical sample, we call the resulting statistic a
sample mean.
If we denote a set of data by X = (x1, x2, ..., xn), then the sample mean
is typically denoted with a horizontal bar over the variable (X̄,
pronounced "x bar").
The Greek letter μ is used to denote the arithmetic mean of an entire
population.
The Sample Mean
In mathematical notation, the sample mean of a set of data denoted as
X = (x1, x2, ..., xn) is given by

X̄ = (1/n) Σ_{i=1}^n X_i = (1/n)(X_1 + X_2 + ... + X_n)
To calculate the mean, all of the observations (values) of X are added
and the result is divided by the number of observations (n)
In the previous example, the mean price of Dec CME Live Cattle futures
contract is
X̄ = (1/n) Σ_{i=1}^n X_i = (1/7)(67.05 + 66.89 + ... + 68.39) = 67.96
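As an illustrative check (not part of the lecture), the mean price can be computed in a few lines of Python:

```python
# Sample mean of the hypothetical Dec CME Live Cattle futures prices
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
n = len(prices)
xbar = sum(prices) / n  # (1/n) * (X1 + X2 + ... + Xn)
```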
The Median
In statistics, a median is described as the numeric value separating
the higher half of a sample or population from the lower half.
The median of a finite list of numbers can be found by arranging all
the observations from lowest value to highest value and picking the
middle one.
If there is an even number of observations, then there is no single
middle value, so one often takes the mean of the two middle values.
Organize the price data in the previous example in ascending order
66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10
The median of this price series is 67.45
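The arrange-and-pick-the-middle procedure can be sketched in Python (illustrative only; it also handles the even-n case by averaging the two middle values, as described above):

```python
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
ordered = sorted(prices)              # arrange from lowest to highest
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]          # odd n: the single middle value
else:
    # even n: mean of the two middle values
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
```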
The Mode
In statistics, the mode is the value that occurs the most frequently in
a data set.
The mode is not necessarily unique, since the same maximum
frequency may be attained at different values.
Organize the price data in the previous example in ascending order
66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10
There are two modes in the given price data – 67.45 and 68.39
Thus the mode of the sample data is not unique
The sample price dataset may be said to be bimodal
A population or sample data may be unimodal, bimodal, or
multimodal
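A sketch of finding the mode(s) in Python (illustrative only); collecting every value that attains the maximum frequency naturally handles the bimodal case above:

```python
from collections import Counter

prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
counts = Counter(prices)
max_freq = max(counts.values())
# All values attaining the maximum frequency; more than one => multimodal
modes = sorted(v for v, c in counts.items() if c == max_freq)
```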
Statistical Dispersion
In statistics, statistical dispersion (also called statistical variability
or variation) is the variability or spread in a variable or probability
distribution.
In particular, a measure of dispersion is a statistic (formula) that
indicates how dispersed (i.e., spread out) the values of a given variable are
Common measures of statistical dispersion are
The Variance, and
The Standard Deviation
Dispersion is contrasted with location or central tendency, and
together they are the most commonly used properties of distributions
The Variance
In statistics, the variance of a random variable or distribution is the
expected (mean) value of the square of the deviation of that variable
from its expected value or mean.
Thus the variance is a measure of the amount of variation within the
values of that variable, taking account of all possible values and their
probabilities.
If a random variable X has the expected (mean) value E[X]=μ, then
the variance of X can be given by:
Var(X) = E[(X − μ)²] = σ²_x
The Variance
The above definition of variance encompasses random variables that
are discrete or continuous. It can be expanded as follows:
Var(X) = E[(X − μ)²]
       = E[X² − 2μX + μ²]
       = E[X²] − 2μE[X] + μ²
       = E[X²] − 2μ² + μ²
       = E[X²] − μ²
       = E[X²] − (E[X])²
The Variance: Properties
Variance is non-negative because the squares are positive or zero.
The variance of a constant a is zero, and the variance of a variable
in a data set is 0 if and only if all entries have the same value.
Var(a) = 0
Variance is invariant with respect to changes in a location
parameter. That is, if a constant is added to all values of the
variable, the variance is unchanged.
Var(X + a) = Var(X)
If all values are scaled by a constant, the variance is scaled by the
square of that constant.
Var(aX) = a²Var(X)
Var(aX + b) = a²Var(X)
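These location and scale properties also hold for the sample variance, so they can be verified numerically with a small Python sketch (illustrative only; a = 3 and b = 10 are arbitrary constants):

```python
from statistics import variance  # sample variance, denominator n - 1

x = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
a, b = 3.0, 10.0

shifted = [xi + b for xi in x]      # adding a constant: Var(X + b) = Var(X)
scaled = [a * xi + b for xi in x]   # Var(aX + b) = a^2 * Var(X)

var_x = variance(x)
var_shifted = variance(shifted)
var_scaled = variance(scaled)
```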
The Sample Variance
If we have a series of n measurements of a random
variable X as Xi, where i = 1, 2, ..., n, then the sample
variance can be used to estimate the population variance
of X = (x1, x2, ..., xn). The sample variance is calculated as

S²_x = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²
     = (1/(n − 1)) [(X_1 − X̄)² + (X_2 − X̄)² + ... + (X_n − X̄)²]
The Sample Variance
The denominator (n − 1) is known as the degrees of freedom in
calculating S²_x. Intuitively, once X̄ is known, only n − 1
observations are free to vary; one is predetermined by X̄.
When n = 1 the variance of a single sample is obviously zero
regardless of the true variance. This bias needs to be corrected for
when n is small.

S²_x = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²
     = (1/(n − 1)) [(X_1 − X̄)² + (X_2 − X̄)² + ... + (X_n − X̄)²]
The Sample Variance
For the hypothetical price data for Dec CME Live Cattle futures
contract, 67.05, 66.89, 67.45, 67.45, 68.39, 68.39, 70.10, the sample
variance can be calculated as
S²_x = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²
     = (1/(7 − 1)) [(67.05 − 67.96)² + ... + (70.10 − 67.96)²]
     = 1.24
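This calculation can be checked with a short Python sketch (illustrative only); note the division by the degrees of freedom, n − 1:

```python
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
n = len(prices)
xbar = sum(prices) / n
# Sum of squared deviations divided by the degrees of freedom (n - 1)
s2 = sum((x - xbar) ** 2 for x in prices) / (n - 1)
```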
The Standard Deviation
In statistics, the standard deviation of a random variable
or distribution is the square root of its variance.
If a random variable X has the expected value (mean)
E[X]=μ, then the standard deviation of X can be given by:
σ_x = √(E[(X − μ)²]) = √(σ²_x)
That is, the standard deviation σ (sigma) is the square root
of the average value of (X − μ)2.
The Standard Deviation
If we have a series of n measurements of a random
variable X as Xi, where i = 1, 2, ..., n, then the sample
standard deviation, can be used to estimate the
population standard deviation of X = (x1, x2, ..., xn). The
sample standard deviation is calculated as
S_x = √(S²_x) = √( (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² ) = √1.24 ≈ 1.11
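The same value follows from taking the square root of the sample variance computed earlier; a quick Python check (illustrative only):

```python
import math

prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
n = len(prices)
xbar = sum(prices) / n
s2 = sum((x - xbar) ** 2 for x in prices) / (n - 1)  # sample variance
s = math.sqrt(s2)                                    # sample standard deviation
```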
The Mean Absolute Deviation
The mean or average deviation of X from its mean,

d̄ = (1/n) Σ_{i=1}^n (X_i − X̄),

is always zero. The positive and negative deviations cancel out in the
summation, which makes it a useless measure of dispersion.
The mean absolute deviation (MAD), calculated by

MAD = (1/n) Σ_{i=1}^n |X_i − X̄|,

solves the "canceling out" problem.
The MSD and RMSD
The alternative way to address the canceling out problem is by
squaring the deviations from the mean to obtain the mean squared
deviation (MSD):
MSD = (1/n) Σ_{i=1}^n (X_i − X̄)²

The squaring changes the units of measurement; this can be undone by
taking the square root of the MSD to obtain the root mean squared
deviation (RMSD):

RMSD = √(MSD) = √( (1/n) Σ_{i=1}^n (X_i − X̄)² )
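All three deviation measures can be computed directly from their definitions; a Python sketch for the price data (illustrative only; note each uses the n denominator, not n − 1):

```python
import math

prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
n = len(prices)
xbar = sum(prices) / n

mad = sum(abs(x - xbar) for x in prices) / n     # mean absolute deviation
msd = sum((x - xbar) ** 2 for x in prices) / n   # mean squared deviation
rmsd = math.sqrt(msd)                            # root mean squared deviation
```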
RMSD vs. Standard Deviation
When calculating the RMSD, the squaring of the deviations gives a
greater importance to the deviations that are larger in absolute value,
which may or may not be desirable.
For statistical reasons, it turns out that a slight variation of the RMSD,
known as the standard deviation (SX), is more desirable as a measure
of dispersion.
RMSD = √( (1/n) Σ_{i=1}^n (X_i − X̄)² )

S_x = √( (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² )
Variance vs. MSD
Standard Deviation vs. RMSD
Price (X)   Mean    (Xi−Mean)   |Xi−Mean|   (Xi−Mean)²
67.05       67.96   −0.91       0.91        0.83
66.89       67.96   −1.07       1.07        1.14
67.45       67.96   −0.51       0.51        0.26
68.39       67.96    0.43       0.43        0.18
67.45       67.96   −0.51       0.51        0.26
70.10       67.96    2.14       2.14        4.58
68.39       67.96    0.43       0.43        0.18
Total                0.00       6.00        7.44

Variance  = 1.24    MAD  = 0.86
Std. Dev. = 1.11    MSD  = 1.06
                    RMSD = 1.03
Association
Bivariate statistics can be used to examine the degree to
which two variables are related or associated, without
implying that one causes the other
Multivariate statistics can be used to examine the degree to
which multiple variables are related or associated, without
implying that one causes any or some of the others
Two common measures of bivariate and multivariate statistics are
Covariance
Correlation Coefficient
Association: Bivariate Statistics
In Figure 3.3 (a) Y and X are positively but weakly correlated while
in 3.3 (b) they are negatively and strongly correlated
The Covariance
The covariance between two real-valued random variables X and Y,
with means (expected values) E[X] = μ and E[Y] = ν, is

Cov(X, Y) = E[(X − μ)(Y − ν)]
          = E[XY − νX − μY + μν]
          = E[XY] − νE[X] − μE[Y] + μν
          = E[XY] − νμ − μν + μν
          = E[XY] − μν
Cov(X, Y) can be negative, zero, or positive
Random variables whose covariance is zero are called uncorrelated
Covariance
If X and Y are independent, then their covariance is zero. This
follows because under independence,
E[XY] = E[X]·E[Y] = μν

Recalling the final form of the covariance derivation given above,
and substituting, we get

Cov(X, Y) = E[XY] − μν = μν − μν = 0
The converse, however, is generally not true: Some pairs of random
variables have covariance zero although they are not independent.
The Covariance: Properties
If X and Y are real-valued random variables and a and b are
constants ("constant" in this context means non-random), then the
following facts are a consequence of the definition of covariance:
Cov(X, a) = 0
Cov(X, X) = Var(X)
Cov(X, Y) = Cov(Y, X)
Cov(aX, bY) = ab·Cov(X, Y)
Cov(X + a, Y + b) = Cov(X, Y)
Variance of the Sum of Correlated
Random Variables
If X and Y are real-valued random variables and a and b are
constants ("constant" in this context means non-random), then the
following facts are a consequence of the definition of variance and
covariance:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y)

The variance of a finite sum of uncorrelated random variables is
equal to the sum of their variances:

Var(X + Y) = Var(X) + Var(Y)
This is because, if X and Y are uncorrelated, their covariance is 0.
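The first identity holds exactly for the sample analogues as well (sample variance, sample covariance, both with the n − 1 denominator), so it can be verified numerically. A Python sketch, where y is a hypothetical second series invented for the illustration:

```python
from statistics import variance  # sample variance, denominator n - 1

def sample_cov(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

x = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
y = [1.0, 2.0, 1.5, 2.5, 1.0, 3.0, 2.0]   # hypothetical second series

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
lhs = variance([a + b for a, b in zip(x, y)])
rhs = variance(x) + variance(y) + 2 * sample_cov(x, y)
```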
The Sample Covariance
The covariance is one measure of how closely the values taken by
two variables X and Y vary together:
If we have a series of n measurements of X and Y written as Xi and
Yi where i = 1, 2, ..., n, then the sample covariance can be used to
estimate the population covariance between X=(X1, X2, …, Xn) and
Y=(Y1, Y2, …, Yn). The sample covariance is calculated as
S_{x,y} = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ)
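The sample covariance formula translates directly into code. An illustrative Python sketch, where the second series y is hypothetical (built as x shifted by a constant, so its covariance with x equals the sample variance of x, by the properties above):

```python
def sample_cov(x, y):
    """S_xy = (1/(n-1)) * sum of (Xi - Xbar)(Yi - Ybar)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

x = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
y = [a + 1.0 for a in x]   # hypothetical series moving one-for-one with x

cov_xy = sample_cov(x, y)
s2_x = sample_cov(x, x)    # covariance of x with itself is the sample variance
```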
Correlation Coefficient
A disadvantage of the covariance statistic is that its magnitude
cannot be easily interpreted, since it depends on the units in
which we measure X and Y
The related and more widely used correlation coefficient remedies
this disadvantage by standardizing the deviations from the
mean:

ρ_{x,y} = Cov(X, Y) / √(Var(X)·Var(Y)) = σ_{x,y} / (σ_x·σ_y)

The correlation coefficient is symmetric, that is

ρ_{x,y} = ρ_{y,x}
Correlation Coefficient
If we have a series of n measurements of X and Y written as Xi
and Yi, where i = 1, 2, ..., n, then the sample correlation
coefficient can be used to estimate the population correlation
coefficient between X and Y. The sample correlation coefficient
is calculated as

r_{x,y} = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / ((n − 1)·S_x·S_y)
Correlation Coefficient
The value of the correlation coefficient falls between −1 and 1:

−1 ≤ r_{x,y} ≤ 1

r_{x,y} = 0  =>  X and Y are uncorrelated
r_{x,y} = 1  =>  X and Y are perfectly positively correlated
r_{x,y} = −1  =>  X and Y are perfectly negatively correlated
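The two boundary cases can be demonstrated numerically: an exact linear relation with a positive slope gives r = 1, and one with a negative slope gives r = −1. A Python sketch (illustrative only; the two y series are hypothetical linear transforms of x):

```python
import math

def sample_corr(x, y):
    """r_xy = sum of (Xi - Xbar)(Yi - Ybar) / ((n - 1) * Sx * Sy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    return sxy / (sx * sy)

x = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
r_pos = sample_corr(x, [2 * a + 1 for a in x])   # perfect positive relation
r_neg = sample_corr(x, [-a for a in x])          # perfect negative relation
```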