DATA ANALYSIS - DCU School of Computing

DATA ANALYSIS
Module Code :CA660
(Application Areas: Bio-, Business,
Social, Environment etc.)
STRUCTURE of Investigation/DA
[Diagram. Labels: 1, 2, many samples; Replication, E.D., Regn., C.T.; Assays, Counts; Estimation/H.T.; Study techniques; Lab. techniques; Non-Parametric; Parametric; Distributional Assumptions, Probability, Estimation properties. Basis: Size/Type of Data Set/Tools; Level of Measurement.]
Probability & Statistics Primer
-overview
Note: Short overview. Other statistical distributions in lectures
Summary Statistics- Descriptive
In analysis of practical sets of data, useful to define a small number of values that
summarise main features present. We derive (i) representative values, (ii)
measures of spread and (iii) measures of skewness and other characteristics.
Representative Values
Sometimes called measures of location or measures of central tendency.
1. Random Value
Given a set of data S = { x1, x2, … , xn }, we select a random number, say k, in the
range 1 to n and return the value xk. This method of generating a representative
value is straightforward, but it suffers from the fact that extreme values can occur
and successive values could vary considerably from one another.
2. Arithmetic Mean
For the set S above, the arithmetic mean (or just mean) is
$\bar{x} = (x_1 + x_2 + \dots + x_n) / n$.
If $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on, we get the formula
$\bar{x} = (f_1 x_1 + f_2 x_2 + \dots + f_n x_n) / (f_1 + f_2 + \dots + f_n)$,
written
$\bar{x} = \sum_i f_i x_i / \sum_i f_i$
Example 1.
Data are student marks in an examination. Find the average mark for the class.
Note 1: Marks are given as ranges, so care is needed in interpreting the ranges.
All intervals must be of equal rank and there must be no gaps in the classification.
We interpret the range 0 - 19 to contain marks from 0 up to, but not including, 20.
Thus, the mid-point is 10. The other intervals are interpreted accordingly.

Mark Range   Mid-Point of Range xi   Number of Students fi   fi xi
0 - 19       10                      2                       20
20 - 39      30                      6                       180
40 - 59      50                      12                      600
60 - 79      70                      25                      1750
80 - 99      90                      5                       450
Sum          -                       50                      3000

The arithmetic mean is $\bar{x}$ = 3000 / 50 = 60 marks.
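The frequency-weighted mean of Example 1 can be checked with a few lines of Python. This is a minimal sketch; the function name `weighted_mean` is an illustrative choice, not something from the slides.

```python
# Class mid-points and frequencies from Example 1.
midpoints = [10, 30, 50, 70, 90]
frequencies = [2, 6, 12, 25, 5]


def weighted_mean(values, weights):
    """Return sum(f_i * x_i) / sum(f_i)."""
    total_weight = sum(weights)
    return sum(f * x for x, f in zip(values, weights)) / total_weight


mean_mark = weighted_mean(midpoints, frequencies)
print(mean_mark)  # 3000 / 50 = 60.0
```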
Note 2: Pivot. If weights of size fi are suspended
from a metre stick at the points xi, then the
average is the centre of gravity of the
distribution. Consequently, it is very sensitive
to outlying values.
[Figure: weights f1, f2, …, fn suspended from a metre stick at the points x1, x2, …, xn, with the pivot at the mean.]
Note 3: The population should be homogeneous for the average to be meaningful. For example, if
we assume that the typical height of girls in a class is less than that of boys, then the average height
of all students is indicative of neither the girls nor the boys.
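3. The Mode
This is the value that occurs most frequently. By common agreement,
it is calculated from the histogram using linear interpolation on the modal class.
[Figure: histogram of Example 1 (frequency against mark). The modal class, 60 to 80, has frequency 25, which exceeds its left neighbour by 13 and its right neighbour by 20.]
The various similar triangles in the diagram generate the common ratios. In our case,
the modal class starts at 60 and has width 20, so the mode is
60 + (13 / 33)(20) = 67.88, or about 67.9 marks.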
4. The Median
The middle point of the distribution.
If { x1, x2, … , xn } are the marks of students in a class, arranged in nondecreasing
order, then the median is the mark of the (n + 1)/2-th student.
We often use the ogive, or cumulative frequency diagram, to calculate it.
[Figure: ogive for Example 1, with cumulative frequencies 2, 8, 20, 45 and 50 at marks 20, 40, 60, 80 and 100; the median is read off at cumulative frequency 25.5.]
In our case, 20 students score below 60 and the class from 60 to 80 holds 25 students, so
the median is
60 + (5.5 / 25)(20) = 64.4 marks.
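The two interpolation rules can be sketched in Python as follows. The function names and the use of class lower bounds 0, 20, 40, 60, 80 with a common width of 20 are assumptions made for illustration; the frequencies are those of Example 1.

```python
# Class lower bounds, common width and frequencies for Example 1.
lower_bounds = [0, 20, 40, 60, 80]
width = 20
freqs = [2, 6, 12, 25, 5]


def grouped_mode(lower_bounds, freqs, width):
    """Interpolate the mode inside the modal class of a histogram."""
    k = max(range(len(freqs)), key=freqs.__getitem__)
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k + 1 < len(freqs) else 0
    d1, d2 = freqs[k] - before, freqs[k] - after
    return lower_bounds[k] + d1 / (d1 + d2) * width


def grouped_median(lower_bounds, freqs, width):
    """Interpolate the mark of the (n + 1)/2-th student on the ogive."""
    target = (sum(freqs) + 1) / 2
    cumulative = 0
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("no observations")


mode_mark = grouped_mode(lower_bounds, freqs, width)      # 60 + (13/33) * 20
median_mark = grouped_median(lower_bounds, freqs, width)  # 60 + (5.5/25) * 20
print(round(mode_mark, 2), round(median_mark, 2))
```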
Measures of Dispersion or Scattering
Example 2. The distribution shown has the same
arithmetic mean as Example 1, but values are more
dispersed. Illustrates that an average value alone
may not adequately describe statistical
distributions.
To devise a formula that captures degree to which a
distribution is concentrated about the average, we
consider the deviations of the values from the average.
If distribution is concentrated around the mean,
then deviations will be small, while if it is very scattered,
then deviations will be large.
The average of the squares of the deviations is called
the variance and this is used as a measure of dispersion.
The square root of the variance is the standard
deviation, which has the same units of measurement as
the original values and is the preferred measure of
dispersion in many applications.
Marks xj   Frequency fj   fj xj
10         6              60
30         8              240
50         6              300
70         15             1050
90         15             1350
Sums       50             3000

[Figure: the values x1, …, x6 marked on a line together with the mean, illustrating the deviations from the mean.]
Variance & Standard Deviation
$s^2$ = VAR[X] = Average of the Squared Deviations
= $\sum f$ { Squared Deviations } / $\sum f$
= $\sum f_i (x_i - \bar{x})^2 / \sum f_i$
= $\sum f_i x_i^2 / \sum f_i - \bar{x}^2$, called the product moment formula.
$s$ = Standard Deviation = $\sqrt{\text{Variance}}$
Example 1
f      x      f x     f x^2
2      10     20      200
6      30     180     5400
12     50     600     30000
25     70     1750    122500
5      90     450     40500
50            3000    198600
VAR [X] = 198600 / 50 - (60)^2 = 372 marks^2

Example 2
f      x      f x     f x^2
6      10     60      600
8      30     240     7200
6      50     300     15000
15     70     1050    73500
15     90     1350    121500
50            3000    217800
VAR [X] = 217800 / 50 - (60)^2 = 756 marks^2
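As a check on the product moment formula, the following sketch recomputes the variance and standard deviation for both examples. The function name `grouped_variance` is illustrative only.

```python
from math import sqrt

midpoints = [10, 30, 50, 70, 90]
freqs_example_1 = [2, 6, 12, 25, 5]
freqs_example_2 = [6, 8, 6, 15, 15]


def grouped_variance(values, freqs):
    """Product moment formula: sum(f x^2) / sum(f) - mean^2."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(values, freqs)) / n
    return mean_of_squares - mean ** 2


var_1 = grouped_variance(midpoints, freqs_example_1)  # 3972 - 3600
var_2 = grouped_variance(midpoints, freqs_example_2)  # 4356 - 3600
sd_1, sd_2 = sqrt(var_1), sqrt(var_2)
print(var_1, var_2, round(sd_1, 3), round(sd_2, 3))
```

Both examples share the mean of 60 marks, yet the second variance is roughly double the first, which is the point made by Example 2.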
Other Summary Statistics
Skewness
An important attribute of a statistical distribution is its degree of symmetry. The word “skew”
refers to a tail, so distributions with a large tail of outlying values on the right-hand side
are positively skewed or skewed to the right. The notion of negative skewness is defined
similarly. A simple formula for skewness is
Skewness = ( Mean - Mode ) / Standard Deviation
which for Example 1 is:
Skewness = (60 - 67.88) / 19.287 = - 0.409.
Coefficient of Variation
This formula was devised to ‘standardise’ the spread relative to the arithmetic mean so comparisons can be
drawn between different distributions. Not universally used.
Coefficient of Variation = Standard Deviation / Mean.
Semi-Interquartile Range
The Median is the mid or 0.5 point in a distribution. The quartiles Q1, Q2, Q3 correspond
to the 0.25, 0.50 and 0.75 points. An alternative measure of dispersion is thus
Semi-Interquartile Range = ( Q3 - Q1 ) / 2.
Geometric Mean
For data that grow geometrically, e.g. economic data with a high inflation effect, another
mean is sometimes used. The G.M. is the N-th root of the product of the values, each raised
to the power of its frequency, where $N = \sum f$:
G.M. = $(x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k})^{1/N}$
Regression
[Example 3.] As a motivating example, suppose we model sales data over time.
TIME    1990   1991   1992   1993   1994   1995
SALES   3      5      4      5      6      7
We want the straight line “Y = m X + c” that best
approximates the data. “Best” in this case is
the line which minimizes the sum of squares
of vertical deviations of points from the line:
SSQ = SS = $\sum (Y_i - [m X_i + c])^2$
Setting partial derivatives of SS w.r.t. m
and c to zero leads to the “Normal Equations”
$\sum Y = m \sum X + n c$
$\sum XY = m \sum X^2 + c \sum X$, where n = # points
Let 1990 correspond to Year 0.
X      Y      X.X    X.Y    Y.Y
0      3      0      0      9
1      5      1      5      25
2      4      4      8      16
3      5      9      15     25
4      6      16     24     36
5      7      25     35     49
15     30     55     87     160   (sums)
[Figure: scatter plot of Sales against Time with the fitted line Y = m X + c; the vertical deviation of each point Yi from m Xi + c is marked.]
Example 3 - Working
The normal equations are:
30 = 15 m + 6 c     =>   150 = 75 m + 30 c
87 = 55 m + 15 c    =>   174 = 110 m + 30 c
=> 24 = 35 m, so m = 24/35
=> 30 = 15 (24 / 35) + 6 c => c = 23/7
Thus the regression line of Y on X is
Y = (24/35) X + (23/7)
and to plot the line we just need two points, so
X = 0 => Y = 23/7
and X = 5 => Y = (24/35) 5 + 23/7 = 47/7.
It is easy to see that ( $\bar{X}$, $\bar{Y}$ ) satisfies the first normal equation (divide it by n), so that the regression line
of Y on X passes through the “Centre of Gravity” of the data. By expanding terms, we get
$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - [m X_i + c])^2 + \sum ([m X_i + c] - \bar{Y})^2$
Total Sum of Squares (SST) = Error Sum of Squares (SSE) + Regression Sum of Squares (SSR)
Distinguish the independent and dependent variables
(X and Y respectively)
[Figure: a data point Yi, the fitted value m Xi + c and the mean $\bar{Y}$ plotted against X, showing how the total deviation splits into an error part and a regression part.]
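The working for Example 3 can be reproduced exactly with rational arithmetic. The sketch below solves the two normal equations in closed form; `fit_line` is an illustrative name, and `Fraction` is used only so that 24/35 and 23/7 come out as exact fractions.

```python
from fractions import Fraction

years_since_1990 = [0, 1, 2, 3, 4, 5]
sales = [3, 5, 4, 5, 6, 7]


def fit_line(xs, ys):
    """Solve the normal equations for slope m and intercept c."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = Fraction(n * sxy - sx * sy, n * sxx - sx * sx)
    c = (sy - m * sx) / n
    return m, c


m, c = fit_line(years_since_1990, sales)
print(m, c)       # 24/35 23/7
print(m * 5 + c)  # 47/7, the fitted value at X = 5
```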
Correlation
The coefficient of determination $r^2$ (which takes values in the range 0 to 1) is a
measure of the proportion of the total variation that is associated with the
regression process:
$r^2$ = SSR / SST = 1 - SSE / SST.
The coefficient of correlation ‘r’ (values in the range -1 to +1) is a more common
measure of the degree to which a mathematical relationship exists between X
and Y. It can be calculated as:
$r = \sum (X - \bar{X})(Y - \bar{Y}) / \sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}$
$= (n \sum XY - \sum X \sum Y) / \sqrt{\{ n \sum X^2 - (\sum X)^2 \} \{ n \sum Y^2 - (\sum Y)^2 \}}$
Example. In our case, r = {6(87) - (15)(30)} / $\sqrt{\{ 6(55) - (15)^2 \} \{ 6(160) - (30)^2 \}}$ = 0.907.
[Figure: three scatter plots illustrating r = -1, r = 0 and r = +1.]
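A short sketch of the computational formula for r, applied to the sales data of Example 3 (X = years since 1990). The function name `correlation` is illustrative.

```python
from math import sqrt

xs = [0, 1, 2, 3, 4, 5]
ys = [3, 5, 4, 5, 6, 7]


def correlation(xs, ys):
    """r = (n SXY - SX SY) / sqrt((n SXX - SX^2)(n SYY - SY^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


r = correlation(xs, ys)
print(round(r, 3), round(r ** 2, 3))  # r and the coefficient of determination
```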
Collinearity
For a correlation coefficient value > 0.9 or < - 0.9, we would take this to mean that
there is a mathematical relationship between the variables. This does not imply that
a cause-and-effect relationship exists.
E.g. consider a country with a slowly changing population size, where a certain political
party retains a relatively stable percentage of the poll in elections. Let
X = Number of people that vote for the party in an election
Y = Number of people that die of a given disease in a year
Z = Population size.
Then, correlation coefficient between X and Y is ~1, indicating a mathematical relationship
between them (i.e.) X is a function of Z and Y is a function of Z also. It would clearly be silly
to suggest that the incidence of disease is caused by the number of people that vote for the
given political party. This is known as the problem of collinearity.
Spotting hidden dependencies is non-trivial. Statistical experimentation can only
be used to disprove hypotheses, or to lend evidence to support the view that
reputed relationships between variables may be valid. Thus, the fact of a high
correlation coefficient between deaths due to heart failure in a given year with
the number of cigarettes consumed twenty years earlier does not establish a
cause-and-effect relationship, though may be useful to guide research.
Overview of Probability Theory
In statistical theory, an experiment is any operation that can be replicated infinitely often
and gives rise to a set of elementary outcomes, which are deemed to be equally likely. The
sample space S of the experiment is the set of all possible outcomes of the experiment.
Any subset E of the sample space is called an event. An event E occurs whenever any of its
elements is an outcome of the experiment. The probability of occurrence of E is
P {E} = (Number of elementary outcomes in E) / (Number of elementary outcomes in S)
[Figure: Venn diagram of an event E inside the sample space S.]
The complement $\bar{E}$ of an event E is the set of all elements that belong to S but not to E. The
union of two events $E_1 \cup E_2$ is the set of all outcomes that belong to E1 or to E2 or to both.
The intersection of two events $E_1 \cap E_2$ is the set of all outcomes that belong to both E1 and E2.
Two events are mutually exclusive if occurrence of either precludes occurrence of the
other, i.e. their intersection is the empty set $\emptyset$. Two events are independent if
occurrence of either is unaffected by occurrence or non-occurrence of the other event.
Theorem of Total Probability.
$P\{E_1 \cup E_2\} = P\{E_1\} + P\{E_2\} - P\{E_1 \cap E_2\}$
[Figure: Venn diagram of E1 and E2 in S. Of the n outcomes, $n_{1,0}$ lie in E1 only, $n_{1,2}$ in both E1 and E2, $n_{0,2}$ in E2 only and $n_{0,0}$ in neither, so $n = n_{0,0} + n_{1,0} + n_{0,2} + n_{1,2}$.]
Proof.
$P\{E_1 \cup E_2\} = (n_{1,0} + n_{1,2} + n_{0,2}) / n$
$= (n_{1,0} + n_{1,2}) / n + (n_{1,2} + n_{0,2}) / n - n_{1,2} / n$
$= P\{E_1\} + P\{E_2\} - P\{E_1 \cap E_2\}$
Corollary.
If E1 and E2 are mutually exclusive, $P\{E_1 \cup E_2\} = P\{E_1\} + P\{E_2\}$
- see Axioms and Addition Rule
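The probability $P\{E_1 | E_2\}$ that E1 occurs, given that E2 has occurred (or must occur), is called the
conditional probability of E1. Note: the possible outcomes of the experiment are now confined to
E2 and not to S.
[Figure: Venn diagram of E1 and E2 in S with counts $n_{1,0}$ (E1 only), $n_{1,2}$ (both), $n_{0,2}$ (E2 only) and $n_{0,0}$ (neither).]
Theorem of Compound Probability (Multiplication Rule).
$P\{E_1 \cap E_2\} = P\{E_1 | E_2\} \times P\{E_2\}$.
Proof.
$P\{E_1 \cap E_2\} = n_{1,2} / n$
$= \{ n_{1,2} / (n_{1,2} + n_{0,2}) \} \times \{ (n_{1,2} + n_{0,2}) / n \}$
Corollary
If E1 and E2 are independent, $P\{E_1 \cap E_2\} = P\{E_1\} \times P\{E_2\}$. This is a special case of the Multiplication Rule.
Note: If E is itself compound, the rule expands further (the Chain Rule): $P\{E_7 \cap E_8 \cap E_9\} = P\{E_7 \cap (E_8 \cap E_9)\} = P\{E_7 | E_8 \cap E_9\} \times P\{E_8 | E_9\} \times P\{E_9\}$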
Counting possible outcomes of an event is crucial to calculating probabilities. A permutation of
size r of n different items is an arrangement of r of the items, where order of arrangement is
important. If order is not important, the arrangement is called a combination.
Example. There are 5 × 4 = 20 permutations and (5 × 4) / (2 × 1) = 10 combinations of size 2 of A, B, C, D, E
Permutations:
AB, BA, AC, CA, AD, DA, AE, EA
BC, CB, BD, DB, BE, EB
CD, DC, CE, EC
DE, ED
Combinations:
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
Standard reference books on probability theory give a comprehensive treatment of how these
ideas are used to calculate the probability of occurrence of the outcomes of games of chance.
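The counts in this example can be confirmed by enumeration. The sketch below uses Python's standard `itertools` and `math` modules (`math.perm` and `math.comb` need Python 3.8 or later).

```python
from itertools import combinations, permutations
from math import comb, perm

items = "ABCDE"

# Ordered arrangements of size 2: AB and BA are counted separately.
perms = ["".join(p) for p in permutations(items, 2)]
# Unordered selections of size 2: AB and BA are the same combination.
combs = ["".join(c) for c in combinations(items, 2)]

print(len(perms), perm(5, 2))  # 20 20
print(len(combs), comb(5, 2))  # 10 10
print(combs)
```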
Bayes’ Rule (Theorem): For a series of mutually exclusive and exhaustive events $B_r$,
where the union of the $B_r$ = $B_1 \cup B_2 \cup B_3 \cup \dots$ = all possibilities for B,
then:
$P\{B_s | A\} = P\{A | B_s\} P\{B_s\} / \sum_r P\{A | B_r\} P\{B_r\}$
where the denominator is the total probability of A occurring.
Ex. Paternity indices: based on actual genotypes of mother, child, and alleged father.
Before collection of any evidence, we have a prior probability of paternity P{C}. So, what
is the situation after the genetic evidence ‘E’ is in?
From Bayes’:
$\frac{P\{\text{man is father} | E\}}{P\{\text{man not father} | E\}} = \frac{P\{E | \text{man is father}\}}{P\{E | \text{man not father}\}} \times \frac{P\{\text{man is father}\}}{P\{\text{man not father}\}}$
This is written in terms of the ratio of posterior probs. (= LHS), the paternity index (L say) and the ratio
of prior probs. (both on the RHS). Rearrange and substitute in the above to give the prob. of an alleged
man with particular genotype ‘C’ being the true father:
$P\{C | E\} = L\,P\{C\} / (L\,P\{C\} + 1 - P\{C\})$
NB: L is a way of ‘weighting’ the genetic evidence; the issue is setting a prior.
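A minimal sketch of the paternity calculation follows. The paternity index of 100 and the prior of 0.5 are hypothetical values chosen purely for illustration; they do not come from the slides.

```python
from math import isclose


def posterior_paternity(paternity_index, prior):
    """P{C | E} = L P{C} / (L P{C} + 1 - P{C})."""
    weighted = paternity_index * prior
    return weighted / (weighted + 1 - prior)


# Hypothetical inputs: L = 100 and a prior probability of 0.5.
L_index, prior = 100, 0.5
posterior = posterior_paternity(L_index, prior)
print(round(posterior, 4))

# Odds form of Bayes' rule: posterior odds = L x prior odds.
posterior_odds = posterior / (1 - posterior)
prior_odds = prior / (1 - prior)
print(isclose(posterior_odds, L_index * prior_odds))
```

Rerunning the function with a smaller prior shows how strongly the result depends on the prior, which is the issue flagged in the NB.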
Statistical Distributions- Characterisation
If a statistical experiment only gives rise to real numbers, the outcome of the experiment is
called a random variable. If a random variable X takes values $X_1, X_2, \dots, X_n$
with probabilities $p_1, p_2, \dots, p_n$
then the expected (average) value of X is defined to be
$E[X] = \sum_{j=1}^{n} p_j X_j$
and its variance is
$VAR[X] = E[X^2] - E[X]^2 = \sum_{j=1}^{n} p_j X_j^2 - E[X]^2$.
Example. Let X be a random variable measuring
the distance in kilometres travelled by children
to a school and suppose that the following data
apply.
Prob. pj   Distance Xj   pj Xj   pj Xj^2
0.15       2.0           0.30    0.60
0.40       4.0           1.60    6.40
0.20       6.0           1.20    7.20
0.15       8.0           1.20    9.60
0.10       10.0          1.00    10.00
1.00       -             5.30    33.80
Then the mean and variance are
E[X] = 5.30 kilometres
VAR[X] = 33.80 - 5.30^2 = 5.71 km^2
Similar concepts apply to continuous distributions.
The distribution function is defined by
$F(t) = P\{ X \le t \}$
and its derivative is the frequency function
$f(t) = dF(t) / dt$
so that
$F(t) = \int_{-\infty}^{t} f(x)\,dx$.
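The school-distance example can be recomputed directly from the definitions of E[X] and VAR[X]. A minimal sketch, with illustrative variable names:

```python
probabilities = [0.15, 0.40, 0.20, 0.15, 0.10]
distances_km = [2.0, 4.0, 6.0, 8.0, 10.0]

# E[X] = sum p_j X_j and E[X^2] = sum p_j X_j^2.
expected = sum(p * x for p, x in zip(probabilities, distances_km))
expected_sq = sum(p * x * x for p, x in zip(probabilities, distances_km))
variance = expected_sq - expected ** 2

print(round(expected, 2), round(expected_sq, 2), round(variance, 2))
```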
Sums and Differences of Random Variables
Define the covariance of two random variables to be
COVAR [ X, Y] = E [(X - E[X]) (Y - E[Y]) ] = E[X Y] - E[X] E[Y].
If X and Y are independent, COVAR [X, Y] = 0.
Lemma
$E[X \pm Y] = E[X] \pm E[Y]$
$VAR[X \pm Y] = VAR[X] + VAR[Y] \pm 2\,COVAR[X, Y]$
$E[kX] = k\,E[X]$ and $VAR[kX] = k^2\,VAR[X]$ for a constant k.
Example. A company records the journey time X
of a lorry from a depot to customers and
the unloading times Y, as shown.
          X = 1   X = 2   X = 3   X = 4   Totals
Y = 1     7       5       4       4       20
Y = 2     2       6       8       3       19
Y = 3     1       2       5       3       11
Totals    10      13      17      10      50
E[X] = {1(10)+2(13)+3(17)+4(10)}/50 = 2.54
E[X^2] = {1^2(10)+2^2(13)+3^2(17)+4^2(10)}/50 = 7.5
VAR[X] = 7.5 - (2.54)^2 = 1.0484
E[Y] = {1(20)+2(19)+3(11)}/50 = 1.82
E[Y^2] = {1^2(20)+2^2(19)+3^2(11)}/50 = 3.9
VAR[Y] = 3.9 - (1.82)^2 = 0.5876
E[X+Y] = { 2(7)+3(5)+4(4)+5(4)+3(2)+4(6)+5(8)+6(3)+4(1)+5(2)+6(5)+7(3)}/50 = 4.36
E[(X+Y)^2] = {2^2(7)+3^2(5)+4^2(4)+5^2(4)+3^2(2)+4^2(6)+5^2(8)+6^2(3)+4^2(1)+5^2(2)+6^2(5)+7^2(3)}/50 = 21.04
VAR[(X+Y)] = 21.04 - (4.36)^2 = 2.0304
E[X Y] = {1(7)+2(5)+3(4)+4(4)+2(2)+4(6)+6(8)+8(3)+3(1)+6(2)+9(5)+12(3)}/50 = 4.82
COVAR (X, Y) = 4.82 - (2.54)(1.82) = 0.1972
VAR[X] + VAR[Y] + 2 COVAR[ X, Y] = 1.0484 + 0.5876 + 2 ( 0.1972) = 2.0304
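The lorry example can be verified from the joint frequency table. The sketch below stores the 50 journeys as counts keyed by (x, y) and checks that VAR[X + Y] equals VAR[X] + VAR[Y] + 2 COVAR[X, Y]; the helper `expect` is an illustrative name.

```python
# Joint frequencies: counts[(x, y)] = number of journeys with X = x, Y = y.
counts = {
    (1, 1): 7, (2, 1): 5, (3, 1): 4, (4, 1): 4,
    (1, 2): 2, (2, 2): 6, (3, 2): 8, (4, 2): 3,
    (1, 3): 1, (2, 3): 2, (3, 3): 5, (4, 3): 3,
}
total = sum(counts.values())


def expect(g):
    """Expected value of g(X, Y) under the empirical joint distribution."""
    return sum(n * g(x, y) for (x, y), n in counts.items()) / total


var_x = expect(lambda x, y: x * x) - expect(lambda x, y: x) ** 2
var_y = expect(lambda x, y: y * y) - expect(lambda x, y: y) ** 2
covar = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
var_sum = expect(lambda x, y: (x + y) ** 2) - expect(lambda x, y: x + y) ** 2

print(round(var_x, 4), round(var_y, 4), round(covar, 4), round(var_sum, 4))
```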
Standard Statistical Distributions
Most elementary statistical books provide a survey of commonly used statistical
distributions.
Importantly, we can characterise them by their expectation and variance (as for
random variables) and by the parameters on which these are based; (see lecture notes
for those we refer to).
So, e.g. for a Binomial distribution, the parameters are p, the probability of ‘success in
an individual trial’, and n, the number of trials. The probability of success remains constant;
otherwise, another distribution applies.
Use of the correct distribution is core to statistical inference, i.e. estimating what is
happening in the population on the basis of a (correctly drawn, probabilistic) sample.
The sample is then representative of the population.
Fundamental to statistical inference is the Normal (or Gaussian) distribution, with parameters $\mu$,
the mean (or formally the expectation of the distribution), and $\sigma$ (SD) or variance ($\sigma^2$).
For small samples, or when $\sigma^2$ is not known but must be estimated from a sample, a
slightly more conservative distribution, the Student’s T or just ‘t’ distribution, applies.
This introduces the degrees of freedom concept.
Student’s t Distribution
A random variable X has a t distribution with n degrees of freedom ( $t_n$ ).
The t distribution is symmetrical about the origin, with
E[X] = 0
VAR [X] = n / (n - 2)   (for n > 2).
For small values of n, the $t_n$ distribution is very flat. As n is increased the density
assumes a bell shape. For values of n ≥ 25, the $t_n$ distribution is practically
indistinguishable from the Standard Normal curve.
O If X and Y are independent random variables, X has a standard normal distribution
and Y has a $\chi^2_n$ distribution, then $X / \sqrt{Y / n}$ has a $t_n$ distribution.
O If $x_1, x_2, \dots, x_n$ is a random sample from a normal distribution with
mean $\mu$ and variance $\sigma^2$, and if we define the estimated sample variance
$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$,
then $(\bar{x} - \mu) / (s / \sqrt{n})$ has a $t_{n-1}$ distribution
- see calculators, tables etc.
+ Many other standard distributions
Sampling Theory
To draw a random sample from a distribution, assign numbers 1, 2, … to the
elements of the distribution and use random number tables or a generated set to decide
which elements are included in the sample. If the same element cannot be selected
more than once, we say that the sample is drawn without replacement; otherwise,
the sample is said to be drawn with replacement.
The usual convention in sampling is that lower case letters designate the sample
characteristics, with capital letters used for the (finite) parent population and Greek
letters for the infinite. Thus if the sample size = n, its elements are designated $x_1, x_2, \dots, x_n$,
its mean is $\bar{x}$ and its modified variance is $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$.
Corresponding parent population characteristics = N, $\bar{X}$ and $S^2$ (or $\infty$, $\mu$ and $\sigma^2$)
Suppose we repeatedly draw random samples of size n (with replacement) from a
distribution with mean $\mu$ and variance $\sigma^2$. Let $\bar{x}_1, \bar{x}_2, \dots$ be the collection of sample
means and let
$\bar{x}_i' = (\bar{x}_i - \mu) / (\sigma / \sqrt{n})$   (i = 1, 2, … )
The collection $\bar{x}_1', \bar{x}_2', \dots$ is called the sampling distribution of means (usually denoted U or Z).
Central Limit Theorem.
In the limit, as sample size n tends to infinity, the sampling distribution of means has
a Standard Normal distribution. Basis for Statistical Inference.
Attribute and Proportionate Sampling
If sample elements are a measurement of some characteristic, then we have
attribute sampling. However, if all sample elements are 1 or 0 (success/failure,
agree/do-not agree), we have proportionate sampling.
For proportionate sampling, the sample average $\bar{x}$ and the sample proportion p
are synonymous (just as for mean $\mu$ and proportion P for the parent population).
From our results on the Binomial distribution, the sample variance is p (1 - p) and
the variance of the parent distribution is P (1 - P) in the proportionate case.
The ‘sampling distribution’ of means concept generalizes to get the sampling
distribution of any statistic. We say that a sample characteristic is an unbiased
estimator of the parent population characteristic if the expectation of the
corresponding sampling distribution is equal to the parent characteristic.
Lemma.
The sample average (proportion) is an unbiased estimator of the parent
average (proportion):
$E[\bar{x}] = \mu$;
so E [p] = P.
The quantity $\sqrt{(N - n) / (N - 1)}$ is called the finite population correction (fpc). If the parent
population is infinite or we have sampling with replacement, the fpc = 1.
Lemma.
E [s] = S × fpc for estimated sample S.D. with fpc
Confidence Intervals
From the statistical tables for a Standard Normal (Gaussian)
distribution, we note that
Area Under Density Function   From     To
0.90                          -1.64    1.64
0.95                          -1.96    1.96
0.99                          -2.58    2.58
[Figure: N(0,1) density with area 0.95 between -1.96 and +1.96.]
From the central limit theorem, if $\bar{x}$ and $s^2$ are the mean and variance of a random sample
of size n (with n greater than 25) drawn from a large parent population of size N, then
the following statement about the unknown parent mean $\mu$ applies:
$Prob\{ -1.64 \le (\bar{x} - \mu) / (s / \sqrt{n}) \le 1.64 \} = 0.90$
i.e.
$Prob\{ \bar{x} - 1.64\,s/\sqrt{n} \le \mu \le \bar{x} + 1.64\,s/\sqrt{n} \} = 0.90$
The range $\bar{x} \pm 1.64\,s/\sqrt{n}$ is called a 90% confidence interval for the parent mean $\mu$.
Example [ Attribute Sampling]
A random sample of size 25 has $\bar{x}$ = 15 and s = 2. Then a 95% confidence interval for $\mu$ is
15 ± 1.96 (2 / 5)
(i.e.) 14.22 to 15.78
Example [ Proportionate Sampling]
A random sample of size n = 1000 has p = 0.40, so 1.96 $\sqrt{p(1 - p)/(n - 1)}$ = 0.03.
A 95% confidence interval for P is 0.40 ± 0.03 (i.e.) 0.37 to 0.43.
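Both worked intervals can be reproduced with a short sketch. The function names are illustrative, and the critical value 1.96 is taken from the table of Normal areas rather than computed.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% point of the Standard Normal, from tables


def mean_interval(mean, sd, n, z):
    """Confidence interval mean +/- z * s / sqrt(n)."""
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width


def proportion_interval(p, n, z):
    """Confidence interval p +/- z * sqrt(p (1 - p) / (n - 1))."""
    half_width = z * sqrt(p * (1 - p) / (n - 1))
    return p - half_width, p + half_width


attribute_ci = mean_interval(15, 2, 25, Z_95)
proportion_ci = proportion_interval(0.40, 1000, Z_95)
print([round(v, 2) for v in attribute_ci])   # 14.22 to 15.78
print([round(v, 2) for v in proportion_ci])  # 0.37 to 0.43
```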
Small Sampling Theory
For reference purposes, it is useful to regard the expression
$\bar{x} \pm 1.96\,s/\sqrt{n}$
as the “default formula” for a confidence interval and to modify it for particular circumstances.
O If we are dealing with proportionate sampling, the sample proportion is the
sample mean and the standard error (s.e.) term $s/\sqrt{n}$ simplifies as follows:
$\bar{x}$ -> p and $s/\sqrt{n}$ -> $\sqrt{p(1 - p)/(n - 1)}$.
(Also n-1 -> n)
O A 90% confidence interval will bring about the swap 1.96 -> 1.64.
O For sample size n less than 25, the Normal distribution must be replaced by
Student’s $t_{n-1}$ distribution.
O For sampling without replacement from a finite population, a fpc term must be
used.
The width of the confidence interval band increases with the confidence level.
Example. A random sample of size n = 10, drawn from a large parent population, has mean $\bar{x}$ = 12 and a
standard deviation s = 2. Then a 99% confidence interval for the parent mean is
$\bar{x} \pm 3.25\,s/\sqrt{n}$   (i.e.)   12 ± 3.25 (2)/3   (i.e.)   9.83 to 14.17
and a 95% confidence interval for the parent mean is
$\bar{x} \pm 2.262\,s/\sqrt{n}$   (i.e.)   12 ± 2.262 (2)/3   (i.e.)   10.492 to 13.508.
Note: For n = 1000, 1.96 $\sqrt{p(1 - p)/n} \approx 0.03$ for values of p between 0.3 and 0.7. This gives rise to the
statement that public opinion polls have an “inherent error of 3%”.
This simplifies calculations in the case of public opinion polls for large political parties.
Tests of Hypothesis
[Motivational Example]. It is claimed that the average grade of all 12 year olds in a country in
a particular aptitude test is 60%. A random sample of n = 49 students gives a mean $\bar{x}$ =
55% with standard deviation s = 2%. Is the sample finding consistent with the claim?
The original claim is regarded as a null hypothesis (H0) which is tentatively accepted as TRUE:
H0 : $\mu$ = 60
If the null hypothesis is true, the test statistic
TS = $(\bar{x} - \mu) / (s / \sqrt{n})$
is a Random Variable with a Normal (0, 1), i.e.
Standardised Normal Z(0,1) (or U(0,1)), distribution.
[Figure: Z(0,1) density with central area 0.95 between -1.96 and 1.96 and rejection regions in both tails.]
Thus
(55 - 60) / (2 / $\sqrt{49}$) = - 35 / 2 = - 17.5
is a random value from Z(0, 1).
But this lies outside the 95% confidence interval (falls in the rejection region), so either
(i) The null hypothesis is incorrect
or
(ii) An event with a probability of at most 0.05 has occurred.
Consequently, reject the null hypothesis, knowing a probability of 0.05 exists that we are
in error. Technically: reject the null hypothesis at the 0.05 level of significance.
The alternative to rejecting H0 is to declare the test to be inconclusive. This means that
there is some tentative evidence to support the view that H0 is approximately correct.
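The motivational example reduces to one standardisation and one comparison. A minimal sketch, with the two-sided 0.05 critical value 1.96 taken from tables and an illustrative function name:

```python
from math import sqrt


def z_statistic(sample_mean, claimed_mean, sd, n):
    """Test statistic (x-bar - mu) / (s / sqrt(n))."""
    return (sample_mean - claimed_mean) / (sd / sqrt(n))


CRITICAL_95 = 1.96  # two-sided 0.05 level, from Normal tables

z = z_statistic(55, 60, 2, 49)
reject_h0 = abs(z) > CRITICAL_95
print(z, reject_h0)  # -17.5 True
```

If `reject_h0` is False the test is declared inconclusive, matching the wording above.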
Modifications
Based on the properties of the Normal, Student ‘t’ and other distributions, we can
generalise these ideas. If the sample size n < 25, a $t_{n-1}$ distribution should be used; the
level of significance of the test may also be varied or the test applied to a proportionate
sampling environment.
Example. 40% of a random sample of 1000 people in a country indicate satisfaction with
government policy. Test at the 0.01 level of significance if this is consistent with the claim that 45% of
the people support government policy.
Here, H0: P = 0.45, p = 0.40, n = 1000
so $\sqrt{p(1 - p)/n}$ = 0.015 and test statistic = (0.40 - 0.45) / 0.015 = - 3.33
99% critical value = 2.58
so H0 is rejected at the 0.01 level of significance.
One-Tailed Tests
If the null hypothesis is of the form H0 : P ≥ 0.45 then arbitrarily large values of p are
acceptable, so that the rejection region for the test statistic lies in the left hand tail only.
Example. 40% of a random sample of 1000 people in a country indicate satisfaction with
government policy. Test at the 0.05 level of significance if this is consistent with the claim that
at least 45% of the people support government policy.
Here the critical value is -1.64 and the test statistic is again - 3.33, so
the null hypothesis H0: P ≥ 0.45
is rejected at the 0.05 level of significance.
[Figure: N(0,1) density with area 0.95 to the right of -1.64 and the rejection region in the left tail.]
Testing Differences between Means
Suppose that
$x_1, x_2, \dots, x_m$ is a random sample, mean $\bar{x}$ and standard deviation $s_1$,
drawn from a distribution with mean $\mu_1$ and
$y_1, y_2, \dots, y_n$ is a random sample, mean $\bar{y}$ and standard deviation $s_2$,
drawn from a distribution with mean $\mu_2$.
Suppose that we wish to test the null hypothesis that both samples are drawn from the
same parent population (i.e.)
H0: $\mu_1 = \mu_2$.
The pooled estimate of the parent variance is
$s^{*2} = s_p^2 = \{ (m - 1) s_1^2 + (n - 1) s_2^2 \} / (m + n - 2)$
and the variance of $(\bar{x} - \bar{y})$ is the variance of the difference of two independent random
variables, i.e.
$s_{diff}^2 = s_p^2 / m + s_p^2 / n$.
This allows us to construct the test statistic $(\bar{x} - \bar{y}) / s_{diff}$, which under H0 has a $t_{m+n-2}$ distribution.
Example. A random sample of size m = 25 has mean $\bar{x}$ = 2.5 and standard deviation $s_1$ = 2, while a
second sample of size n = 41 has mean $\bar{y}$ = 2.8 and standard deviation $s_2$ = 1. Test at the 0.05 level of
significance if the means of the parent populations are identical.
Here
H0 : $\mu_1 = \mu_2$
$\bar{x} - \bar{y}$ = - 0.3 and
$s_p^2$ = {24(4) + 40(1)} / 64 = 2.125
so the test statistic is
- 0.3 / $\sqrt{2.125/25 + 2.125/41}$ = - 0.81
The 0.05 critical value for Z(0, 1) is 1.96, so the test is inconclusive.
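The pooled-variance calculation in this example can be sketched as follows. The function name is illustrative, and the comparison uses the Normal critical value 1.96 quoted for large degrees of freedom (here m + n - 2 = 64).

```python
from math import sqrt


def pooled_test_statistic(mean_x, sd_x, m, mean_y, sd_y, n):
    """Return (pooled variance, test statistic) for H0: equal parent means."""
    pooled_var = ((m - 1) * sd_x ** 2 + (n - 1) * sd_y ** 2) / (m + n - 2)
    sd_diff = sqrt(pooled_var / m + pooled_var / n)
    return pooled_var, (mean_x - mean_y) / sd_diff


pooled_var, t_stat = pooled_test_statistic(2.5, 2, 25, 2.8, 1, 41)
inconclusive = abs(t_stat) < 1.96
print(pooled_var, round(t_stat, 2), inconclusive)
```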
Paired Tests
If the sample values ( xi , yi ) are paired, such as the marks of students in two examinations,
then let $d_i = x_i - y_i$ be their differences and treat these values as the elements of a sample to
generate a test statistic for the hypothesis
H0: $\mu_1 = \mu_2$.
The test statistic
$\bar{d} / (s_d / \sqrt{n})$
has a $t_{n-1}$ distribution if H0 is true.
Example. In a random sample of 100 students in a national examination their examination mark in
English is subtracted from their continuous assessment mark, giving a mean of 5 and a standard
deviation of 2. Test at the 0.01 level of significance if the true mean mark for both components is the
same.
Here
n = 100, $\bar{d}$ = 5, $s_d / \sqrt{n}$ = 2/10 = 0.2
so the test statistic is then
5 / 0.2 = 25.
The 0.01 critical value for a Z(0, 1) distribution is 2.58, so H0 is rejected at the 0.01 level of significance.
Tests for the Variance.
For normally distributed random variables, given
H0: $\sigma^2$ = k, a constant, then $(n - 1) s^2 / k$
has a $\chi^2_{n-1}$ distribution.
Example. A random sample of size 30 drawn from a normal distribution has variance $s^2$ = 5.
Test at the 0.05 level of significance if this is consistent with H0 : $\sigma^2$ = 2.
Test statistic = (29) 5 / 2 = 72.5, while the 0.05 critical value for $\chi^2_{29}$ is 45.72,
so H0 is rejected at the 0.05 level of significance.
Chi-Square Test of Goodness of Fit
Can be used to test the hypothesis H0 that a set of observations is consistent with a given
probability distribution. Given a set of categories with observed ($O_j$) and expected ($E_j$)
number of observations (frequency) in each category. Under H0
Test Statistic $\sum_j (O_j - E_j)^2 / E_j$ has a $\chi^2_{n-1}$ distribution, with n the number of categories.
Example. A pseudo random number generator is used to generate 40 random numbers in the
range 1 - 100. Test, at the 0.05 level of significance, if the results are consistent with the hypothesis
that the outcomes are randomly distributed.
Range      Observed Number   Expected Number
1 - 25     6                 10
26 - 50    12                10
51 - 75    14                10
76 - 100   8                 10
Total      40                40
Test statistic = (6-10)^2/10 + (12-10)^2/10 + (14-10)^2/10 + (8-10)^2/10 = 4.
The 0.05 critical value of $\chi^2_3$ = 7.81, so the test is inconclusive.
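The goodness-of-fit statistic for the random number example can be computed as below. This is a sketch with an illustrative function name; the critical value 7.81 is the tabulated figure quoted in the example, not something the code derives.

```python
observed = [6, 12, 14, 8]
expected = [10, 10, 10, 10]


def chi_square_statistic(observed, expected):
    """Sum of (O_j - E_j)^2 / E_j over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
CRITICAL_05_DF3 = 7.81  # tabulated 0.05 critical value for 3 d.f.

print(statistic, degrees_of_freedom, statistic > CRITICAL_05_DF3)  # 4.0 3 False
```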
Chi-Square Contingency Test
To test that two random variables are statistically independent, a set of observations can
be tabled, with m rows corresponding to categories for one random variable and
n columns for the other. Under H0, the expected number of observations for the cell in row
i and column j = (row total × column total) / (Grand total). Under H0
Test Statistic $\sum_i \sum_j (O_{ij} - E_{ij})^2 / E_{ij}$ has a $\chi^2_{(m-1)(n-1)}$ distribution.
Chi-Square Contingency Test - Example
In the following table, the figures in brackets are the expected values.
Results    Maths       History     Geography   Totals
Honours    100 (50)    70 (67)     30 (83)     200
Pass       130 (225)   320 (300)   450 (375)   900
Fail       70 (25)     10 (33)     20 (42)     100
Totals     300         400         500         1200
The test statistic is
$\sum [ (O_{ij} - E_{ij})^2 / E_{ij} ]$ = (100-50)^2/50 + (70-67)^2/67 + (30-83)^2/83 + (130-225)^2/225
+ (320-300)^2/300 + (450-375)^2/375 + (70-25)^2/25 + (10-33)^2/33 + (20-42)^2/42
= 248.976
The 0.05 critical value for $\chi^2_{2 \times 2}$ (4 degrees of freedom) is 9.49 so H0 is rejected at the 0.05 level of significance.
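The contingency calculation can be sketched from the observed counts alone, deriving each expected value as (row total × column total) / grand total. Because the code keeps the expected values unrounded (for example 66.67 rather than 67), its statistic comes out slightly above the 248.976 obtained with whole-number expected values; the conclusion is unchanged. Names are illustrative.

```python
# Observed counts: rows Honours, Pass, Fail; columns Maths, History, Geography.
observed = [
    [100, 70, 30],
    [130, 320, 450],
    [70, 10, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence for each cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(statistic, 2), degrees_of_freedom, statistic > 9.49)
```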
Note: In general, chi-squared tests tend to be very conservative vis-a-vis other tests of
hypothesis, (i.e.) they tend to give inconclusive results.
The meaning of the term “degrees of freedom” .
In simplified terms, as the chi-square distribution is the sum of, say k, squares of
independent random variables, it is defined in a k-dimensional space. When we impose a
constraint of the type that the sum of observed and expected observations in a column are
equal or estimate a parameter of the parent distribution, we reduce the dimensionality of
the space by 1. In the case of the chi-square contingency table, with m rows and n columns,
the expected values in the final row and column are predetermined, so the number of
degrees of freedom of the test statistic is (m-1)(n-1).
Analysis of Variance/Experimental Design
-Many samples, Means and Variances
• Analysis of Variance (AOV or ANOVA) was
originally devised for agricultural statistics
on crop yields etc. Typically, a field was laid out in row and column
format as small plots of a fixed size. The yield
$y_{i,j}$ within each plot was recorded.
1   y1,1   y1,2   y1,3   y1,4   y1,5
2   y2,1   y2,2   y2,3
3   y3,1   y3,2   y3,3
One Way classification
Model: $y_{i,j} = \mu + \alpha_i + \epsilon_{i,j}$, with $\epsilon_{i,j} \sim N(0, \sigma^2)$ in the limit as sample size large
where
$\mu$ = overall mean
$\alpha_i$ = effect of the ith factor
$\epsilon_{i,j}$ = error term.
Hypothesis: H0: $\alpha_1 = \alpha_2 = \dots = \alpha_m$
Factor   Observations                       Totals            Means
1        y1,1   y1,2   y1,3   …   y1,n1     T1 = Σ y1,j       ȳ1. = T1 / n1
2        y2,1   y2,2   y2,3   …   y2,n2     T2 = Σ y2,j       ȳ2. = T2 / n2
:
m        ym,1   ym,2   ym,3   …   ym,nm     Tm = Σ ym,j       ȳm. = Tm / nm
Overall mean $\bar{y} = \sum_i \sum_j y_{i,j} / n$, where $n = \sum_i n_i$
Decomposition (Partition) of Sums of Squares:
$\sum_i \sum_j (y_{i,j} - \bar{y})^2 = \sum_i n_i (\bar{y}_{i.} - \bar{y})^2 + \sum_i \sum_j (y_{i,j} - \bar{y}_{i.})^2$
Total Variation (Q) = Between Factors (Q1) + Residual Variation (QE)
Under H0 : $Q / \sigma^2 \to \chi^2_{n-1}$, $Q_1 / \sigma^2 \to \chi^2_{m-1}$, $Q_E / \sigma^2 \to \chi^2_{n-m}$
so that $\{ Q_1 / (m - 1) \} / \{ Q_E / (n - m) \} \to F_{m-1,\,n-m}$
AOV Table:
Variation   D.F.    Sums of Squares               Mean Squares         F
Between     m - 1   Q1 = Σ ni (ȳi. - ȳ)²          MS1 = Q1/(m - 1)     MS1/MSE
Residual    n - m   QE = ΣΣ (yi,j - ȳi.)²         MSE = QE/(n - m)
Total       n - 1   Q = ΣΣ (yi,j - ȳ)²            Q/(n - 1)
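Two-Way Classification
Data layout (Factor I across the columns, Factor II down the rows):
               Factor I                                                  Means
Factor II      $y_{1,1}$  $y_{1,2}$  $y_{1,3}$  ...  $y_{1,n}$            $\bar{y}_{1.}$
                  :          :          :              :                    :
               $y_{m,1}$  $y_{m,2}$  $y_{m,3}$  ...  $y_{m,n}$            $\bar{y}_{m.}$
Means          $\bar{y}_{.1}$  $\bar{y}_{.2}$  $\bar{y}_{.3}$  ...  $\bar{y}_{.n}$      $\bar{y}_{..}$ (so we write it as $\bar{y}$)
Partition SSQ:
$\sum_i \sum_j (y_{i,j} - \bar{y})^2 = n \sum_i (\bar{y}_{i.} - \bar{y})^2 + m \sum_j (\bar{y}_{.j} - \bar{y})^2 + \sum_i \sum_j (y_{i,j} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y})^2$
Total Variation = Between Rows + Between Columns + Residual Variation
Model: $y_{i,j} = \mu + \alpha_i + \beta_j + \epsilon_{i,j}$ with $\epsilon_{i,j} \sim N(0, \sigma^2)$
H0: all $\alpha_i$ are equal.
H0: all $\beta_j$ are equal.
AOV
Variation          D.F.          Sums of Squares                                                              Mean Squares              F
Between Rows       m - 1         $Q_1 = n \sum_i (\bar{y}_{i.} - \bar{y})^2$                                   MS1 = Q1/(m - 1)          MS1/MSE
Between Columns    n - 1         $Q_2 = m \sum_j (\bar{y}_{.j} - \bar{y})^2$                                   MS2 = Q2/(n - 1)          MS2/MSE
Residual           (m-1)(n-1)    $Q_E = \sum_i \sum_j (y_{i,j} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y})^2$     MSE = QE/((m-1)(n-1))
Total              mn - 1        $Q = \sum_i \sum_j (y_{i,j} - \bar{y})^2$                                     Q/(mn - 1)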
Two-Way Example
                      Factor I
               1       2       3       4       5     Totals   Means
Fact II   1    20      18      21      23      20      102     20.4
          2    19      18      17      18      18       90     18.0
          3    23      21      22      23      20      109     21.8
          4    17      16      18      16      17       84     16.8
Totals         79      73      78      80      75      385
Means       19.75   18.25   19.50   20.00   18.75              19.25
ANOVA outline
Variation   d.f.     SSQ      MSQ      F
Rows          3     76.95    25.65    18.86**
Columns       4      8.50     2.13     1.57
Residual     12     16.30
Total        19    101.75
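The sums of squares in the ANOVA outline can be checked by applying the two-way partition to the example data. This is a minimal sketch in plain Python; the function name `two_way_anova` is hypothetical, and one observation per cell is assumed. Because it carries full precision, the F ratios come out at about 18.88 and 1.56, differing from the tabulated 18.86 and 1.57 in the second decimal, presumably because the tabulated values use rounded mean squares.

```python
def two_way_anova(table):
    """Two-way AOV (one observation per cell) for a list of equal-length rows."""
    m, n = len(table), len(table[0])                  # m rows, n columns
    grand = sum(sum(row) for row in table) / (m * n)  # overall mean
    row_means = [sum(row) / n for row in table]
    col_means = [sum(row[j] for row in table) / m for j in range(n)]

    q1 = n * sum((r - grand) ** 2 for r in row_means)  # between rows
    q2 = m * sum((c - grand) ** 2 for c in col_means)  # between columns
    qe = sum(                                          # residual
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(m) for j in range(n)
    )
    q = sum((y - grand) ** 2 for row in table for y in row)  # total

    mse = qe / ((m - 1) * (n - 1))
    return {
        "Q1": q1, "Q2": q2, "QE": qe, "Q": q,
        "F_rows": (q1 / (m - 1)) / mse,
        "F_cols": (q2 / (n - 1)) / mse,
    }


# Data from the two-way example: rows = Factor II (1-4), columns = Factor I (1-5)
data = [
    [20, 18, 21, 23, 20],
    [19, 18, 17, 18, 18],
    [23, 21, 22, 23, 20],
    [17, 16, 18, 16, 17],
]
aov = two_way_anova(data)
for name, value in aov.items():
    print(f"{name}: {value:.2f}")
```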
FYI: software such as R, SAS, SPSS and MATLAB is designed for analysing such data. SPSS, for example, records the data as a spreadsheet, with variables in columns and individual observations in rows. Thus the ANOVA data above would be written as a set of columns or rows, e.g.
Var. value   20 18 21 23 20 19 18 17 18 18 23 21 22 23 20 17 16 18 16 17
Factor 1      1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4
Factor 2      1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5
(Here Factor 1 indexes the rows of the table above and Factor 2 its five columns.)
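Going from the two-way table to this one-observation-per-record layout is mechanical. The sketch below, in plain Python, reads the example table row by row and emits a value together with its two factor codes; the variable names are hypothetical. With five columns in the table, the second code cycles through 1 to 5 within each row.

```python
# Example table: each inner list is one row of the two-way example
table = [
    [20, 18, 21, 23, 20],
    [19, 18, 17, 18, 18],
    [23, 21, 22, 23, 20],
    [17, 16, 18, 16, 17],
]

# One record per observation: (value, row code, column code), codes starting at 1
records = [
    (value, i, j)
    for i, row in enumerate(table, start=1)
    for j, value in enumerate(row, start=1)
]

# Split into the three parallel variables that a statistics package expects
values = [r[0] for r in records]
factor1 = [r[1] for r in records]
factor2 = [r[2] for r in records]

print("Var. value", *values)
print("Factor 1  ", *factor1)
print("Factor 2  ", *factor2)
```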
DA APPLICATIONS CONTEXT e.g. BIO
• GENETICS : 5 branches; aim = ‘Laws’ of Chemistry,
Physics, Maths. for Biology
• GENOMICS : Study of Genomes (complete set of DNA
carried by Gamete) by integration of 5 branches of
Genetics with ‘Informatics and Automated systems’
• PURPOSE of GENOME RESEARCH : Info. on Structure,
Function, Evolution of all Genomes – past and present
• Techniques of Genomics from molecular, quantitative,
population genetics: Concepts and Terminology from
Mendelian genetics and cytogenetics
CONTEXT: GENETICS - BRANCHES
• Classical Mendelian – Gene and Locus, Allele, Segregation,
Gamete, Dominance, Mutation
• Cytogenetics – Cell, Chromosome, Meiosis and Mitosis,
Crossover and Linkage
• Molecular – DNA sequencing, Gene Regulation and
Transcription, Translation and Genetic Code Mutations
• Population – Allelic/Genotypic Frequencies, Equilibrium,
Selection, Drift, Migration, Mutation
• Quantitative – Heritability/Additive, Non-additive Genetic
Effects, Genetic by Environment Interaction, Plant and
Animal Breeding
CONTEXT+ : GENOMICS -LINKAGES
[Diagram. Branches: Mendelian, Cytogenetics, Molecular, Population, Quantitative. Central node: GENOMICS. Linking labels: Genetic markers, DNA Sequences, Linkage/Physical Maps, Gene Location, QTL Mapping.]
GENOMICS – some KEY QUESTIONS
• HOW do Genes determine total phenotype?
• HOW MANY functional genes necessary and sufficient in a
given system?
• WHAT are necessary Physical/Chemical aspects of gene
structure?
• IS gene location in Genome specific?
• WHAT DNA sequences/structures are needed for gene-specific functions?
• HOW MANY different functional genes in whole
biosphere?
• WHAT MEASURES of essential DNA sameness in different
species?
‘DATA’ : STATISTICAL GENOMICS
Some UNUSUAL/SPECIAL FEATURES
• Size – databases very large e.g. molecular marker and DNA
/ protein sequence data; unreconciled, Legacy
• Mixtures of variables - discrete/continuous e.g.
combination of genotypes of genetic markers (D) and
values of quantitative traits (C)
• Empirical Distributions needed for some Test Statistics
e.g. QTL analysis, H.T. of locus order
• Intensive Computation e.g. Linkage Analysis, QTL and
computationally greedy algorithms in locus ordering,
derivation of empirical distributions, sequence match etc.
• Likelihood Analysis - Linear Models typically insufficient
alone
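As an illustration of deriving an empirical distribution for a test statistic, the sketch below uses a permutation approach. This is an assumption: the slide does not name a method. The function name `permutation_null`, the choice of statistic (absolute difference in trait means between two marker genotype classes) and the data are all hypothetical.

```python
import random


def permutation_null(trait, genotype, n_perm=1000, seed=0):
    """Empirical null distribution of |mean difference| by shuffling genotype labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def stat(labels):
        # Absolute difference in mean trait value between genotype classes 0 and 1
        a = [t for t, g in zip(trait, labels) if g == 0]
        b = [t for t, g in zip(trait, labels) if g == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(genotype)
    shuffled = list(genotype)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any marker-trait association
        null.append(stat(shuffled))

    # Empirical p-value, counting the observed statistic as one of the permutations
    p_value = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, null, p_value


# Hypothetical quantitative trait values (C) and marker genotypes (D)
trait = [5.1, 4.8, 6.0, 6.3, 5.0, 6.1, 4.9, 6.4]
genotype = [0, 0, 1, 1, 0, 1, 0, 1]
observed, null, p_value = permutation_null(trait, genotype)
print(observed, p_value)
```

Each permutation recomputes the statistic in full, which is why such derivations are computationally intensive on large marker sets.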
DA APPLICATIONS CONTEXT e.g. BUSINESS/ FINANCE
• http://big.computing.dcu.ie/; http://sci-sym.dcu.ie
• Data-rich environments – under-utilisation of resources
• RAW DATA into useful information and knowledge
• Similar underpinning: (‘Laws’) – based on analysis
• Purpose – Informed decision-making
• Techniques – quantitative. Concepts & Nature – Pervasive, Dynamic, ‘Health’ subject to Internal/External environments. Key elements - Systems and people
• Forecasting/Prediction/Trigger
CONTEXT+ : FACTORS
[Diagram. Central node: HEALTH of ENTERPRISE (governmental, corporate, educational, non-profit). Surrounding factors: Supply Chain; Capital; Knowledge & Systems; Adaptability; Labour; Globalisation, technology.]
FRAMEWORK
• Status: Huge array of information systems & product software.
  - Challenges: include development, delivery, adoption, and implementation of IT solutions into usable and effective systems that mimic/support organisational processes. ‘KS alignment with work practice.’ (Toffler & Drucker – 80’s: organisations of 20th Century -> knowledge-based. Greater autonomy, revised management structures.)
  - Opportunities: KM popularity grew through 90’s, spawned ideas of ‘KM models’, ‘KM strategy’, concepts of ‘organisational learning’, ‘knowledge/practice networks’, ‘knowledge discovery’, ‘intellectual capital’.
• Objectives: To plan, develop, implement, operate, optimise, cost information/communication systems and interpret use.
• Starting point : understanding ICT opportunities requires both
technological and organisational perspective + understanding
of benefits associated with data capture and analysis.
Data Mining & KM
• The Knowledge Discovery Process
• Classification e.g. clusters, trees
• Exploratory Data Analysis
• Models (including Bayesian Networks), Graphical or other.
• Frequent Pattern Mining and special groups/subgroups
Key Features:
• ‘Learning models’ from data can be an important part of
building an intelligent decision support system.
• Sophistication of analyses – computationally expensive data
mining methods, complexity of algorithms, interpretation and
application of models.
Hot Topics in BI
• Business Process Management and Modelling
• Supply Chain Management and Logistics
• Innovation and ICT
• Analytical Information Systems, Databases and Data Warehousing
• Knowledge Management and Discovery
• Social Networks and Knowledge Communities
• Performance Indicators & Measurement systems/Information Quality
• Data Analytics, Integration and Interpretation
• Cost-benefit and Impact Analysis
• Reference Models and Modelling
• Process Simulation and Optimization
• Security and Privacy
• IT and IS Architectures/Management
• Info. Sys. development, Tools and Software Engineering
Example Questions
• What are the characteristics of internet purchases for a given age group? How can this be used to develop further E-business?
• What are key risk factors for profit/loss on a product on the basis of
historical data and demographic variables?
• Can we segment into/identify groups of similar customers on the basis of their
characteristics and purchase behaviour?
• Which products are typically bought together in one transaction by
customers?
• What are financial projections, given market volatility and knock-on
effects of a recent shock?
• What data should an in-house information system collect? What
design principles are involved for a large database?
• What is involved in modelling and IT-supported optimisation of key
business processes?
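The question of which products are typically bought together can be approached, in its simplest form, by counting how often each pair of products occurs in the same transaction. The sketch below is one minimal way to do this in plain Python and is not a full frequent-pattern mining algorithm; the function name `pair_support` and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Fraction of transactions in which each pair of products occurs together."""
    counts = Counter()
    for basket in transactions:
        # Sort the distinct items so each pair is counted under one canonical key
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical transactions
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
support = pair_support(baskets)
for pair, s in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, s)
```

Pairs whose support exceeds a chosen threshold are candidates for products bought together; the threshold itself is a judgement for the analyst.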