DATA ANALYSIS - DCU School of Computing


DATA ANALYSIS Module Code: CA660

STRUCTURE of Investigation/DA

[Slide diagram: one, two or many samples; Experimental Design (E.D.) and Regression; Estimation / Hypothesis Testing (H.T.), C.T.; study techniques (replication, assays, counts); lab techniques; parametric vs. non-parametric methods; distributional assumptions and probability basis; size/type of data set (descriptive, ordered).]

‘BIO’ CONTEXT here

GENETICS:
5 branches; aim = ‘Laws’ of Chemistry, Physics, Maths. for Biology

GENOMICS:
Study of Genomes (the complete set of DNA carried by a Gamete) by integration of the 5 branches of Genetics with ‘Informatics and Automated Systems’

PURPOSE of GENOME RESEARCH:
Information on Structure, Function and Evolution of all Genomes – past and present

Techniques of Genomics come from molecular, quantitative and population genetics; Concepts and Terminology come from Mendelian genetics and cytogenetics.

CONTEXT: GENETICS - BRANCHES

Classical Mendelian –
Gene and Locus, Allele, Segregation, Gamete, Dominance, Mutation

Cytogenetics –
Cell, Chromosome, Meiosis and Mitosis, Crossover and Linkage

Molecular –
Transcription, Translation and Genetic Code, Mutations, DNA Sequencing, Gene Regulation

Population –
Allelic/Genotypic Frequencies, Equilibrium, Selection, Drift, Migration, Mutation

Quantitative –
Heritability/Additive and Non-additive Genetic Effects, Genetic-by-Environment Interaction, Plant and Animal Breeding

CONTEXT+ : GENOMICS - LINKAGES

[Slide diagram: GENOMICS linking Mendelian genetics, Cytogenetics, Molecular genetics, Population genetics and Quantitative genetics via Genetic Markers, DNA Sequences, Linkage/Physical Maps, Gene Location and QTL Mapping.]

GENOMICS - FOCUS

CLASSICAL: Genetic Markers, Linkage Analysis, Gene Ordering, Multipoint Analysis, Genetic and QTL Mapping
INFORMATICS: Databases, Sequence Comparison, Data Communications, Automation
DNA SEQUENCE ANALYSIS: Sequence Assembly, Placement, Comparison

GENOMICS – some KEY QUESTIONS

HOW do genes determine total phenotype?
HOW MANY functional genes are necessary and sufficient in a given system?
WHAT physical/chemical aspects of gene structure are necessary?
IS gene location in the genome specific?
WHAT DNA sequences/structures are needed for gene-specific functions?
HOW MANY different functional genes are there in the whole biosphere?
WHAT MEASURES of essential DNA sameness apply across different species?

‘DATA’ : STATISTICAL GENOMICS

Some UNUSUAL/SPECIAL FEATURES:

Size – databases very large, e.g. molecular marker and DNA/protein sequence data
Mixtures of variables – discrete/continuous, e.g. combination of genotypes of genetic markers (D) and values of quantitative traits (C)
Empirical Distributions – needed for some test statistics, e.g. QTL analysis, H.T. of locus order
Intensive Computation – e.g. linkage analysis, QTL and computationally greedy algorithms in locus ordering, derivation of empirical distributions, etc.
Likelihood Analysis – linear models typically insufficient alone

EXAMPLE – Mendelian Genetics / Cytogenetics

GENE = unit of heredity. A single gene is passed between generations by Mendelian inheritance.

DIPLOID individual – carries two copies (alleles) of a gene (A)
HOMOZYGOSITY – AA, aa (genotypes)
HETEROZYGOSITY – Aa (genotype); multiple alleles are possible for a gene
PHENOTYPE – appearance/measurement of a gene characteristic:
  AA, Aa, aa all distinct (co-dominant)
  AA, Aa the same (A the dominant allele)

Example – contd.

Common Mating Schemes

Single gene: a haploid germ cell is a gamete; AA gives gamete A, while aa gives a.
F1 hybrid – gives diploid genotype Aa
If F1 x 1 parent (AA or aa) – Backcross
If F1 x F1 (self-pollination, or sibs if 2 sexes) – F2

continuing this example

Mendelian Laws

1. Segregation
Single-gene trait, simple heredity. Genotypic segregation ratio (G.S.R.) 1:1 for a co-dominant Backcross; G.S.R. 1:2:1 in the F2. Phenotypic segregation ratio (P.S.R.) 3:1 (dominant alleles in the F2).

2. Independent Assortment
(Inheritance of unlinked multiple genes.)
Each pair of alleles of a gene segregates independently of the segregation of alleles of another gene.
e.g. A, B, with 2 alleles each, gives 9 genotypes in the F2 progeny: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb & aabb.
Expect G.S.R. = [1:2:1]^2 = 1:2:1:2:4:2:1:2:1. For dominant genes there are 4 phenotypes, thus P.S.R. = 9:3:3:1 (i.e. A_B_, A_bb, aaB_, aabb), where _ denotes either the dominant or the recessive allele.

EXPERIMENTAL OUTCOMES

Estimation of the expected “frequency” of a specific genotype/phenotype in a population.

e.g. Frequency of A_B_ in the F2 = 9/16 in the previous example.

e.g. 4 independent loci: AaBBccDd in the F2 (with parental groups of cross AAbbCCdd & aaBBccDD):

    P{AaBBccDd} = (1/2)(1/4)(1/4)(1/2) = 1/64

Note: the basis for estimating/testing how closely the observed segregation fits the expected segregation is the chi-squared (χ²) statistic.
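For illustration (a sketch, not part of the original notes), these segregation calculations in Python; the only assumptions are the per-locus F2 probabilities 1/4 : 1/2 : 1/4 and independence across loci:

```python
# Minimal sketch: expected F2 probability of a multi-locus genotype when
# loci assort independently, each locus segregating 1/4 : 1/2 : 1/4.
from fractions import Fraction

# Per-locus F2 genotype-class probabilities under Mendelian segregation.
# "AA" = homozygote for one allele, "Aa" = heterozygote, "aa" = other homozygote;
# BB, cc, Dd etc. map onto the same three classes.
locus = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}

def f2_genotype_prob(genotype_classes):
    """Probability of a multi-locus F2 genotype, loci assorting independently."""
    p = Fraction(1)
    for g in genotype_classes:
        p *= locus[g]
    return p

# P{AaBBccDd}: Aa is a heterozygote class, BB and cc homozygote classes, Dd heterozygote.
print(f2_genotype_prob(["Aa", "AA", "aa", "Aa"]))   # 1/2 * 1/4 * 1/4 * 1/2 = 1/64
```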

Genotypes/phenotypes for genes in F2 progeny under Hardy-Weinberg

For a gene with m = 4 alleles A1..A4 and allelic frequencies p_i (Σ p_i = 1), the expected genotype frequency is p_i² for a homozygote A_iA_i and 2 p_i p_j for a heterozygote A_iA_j; H is the total heterozygosity (the sum of the heterozygote frequencies).

(p1, p2, p3, p4)          A1A1    A1A2   A1A3   A1A4   A2A2    A2A3   A2A4   A3A3    A3A4   A4A4      H
Expected frequency        p1²     2p1p2  2p1p3  2p1p4  p2²     2p2p3  2p2p4  p3²     2p3p4  p4²
(0.25, 0.25, 0.25, 0.25)  0.0625  0.125  0.125  0.125  0.0625  0.125  0.125  0.0625  0.125  0.0625  0.75
(0.3, 0.3, 0.2, 0.2)      0.09    0.18   0.12   0.12   0.09    0.12   0.12   0.04    0.08   0.04    0.74
(0.4, 0.4, 0.1, 0.1)      0.16    0.32   0.08   0.08   0.16    0.08   0.08   0.01    0.02   0.01    0.66
(0.4, 0.3, 0.2, 0.1)      0.16    0.24   0.16   0.08   0.09    0.12   0.06   0.04    0.04   0.01    0.70
(0.7, 0.1, 0.1, 0.1)      0.49    0.14   0.14   0.14   0.01    0.02   0.02   0.01    0.02   0.01    0.48
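A small sketch of the table's arithmetic (helper functions of my own, not from the notes): Hardy-Weinberg genotype frequencies and total heterozygosity for a vector of allele frequencies:

```python
# Sketch: Hardy-Weinberg genotype frequencies and heterozygosity H,
# reproducing a row of the table above.
def hw_genotype_freqs(p):
    """Return {(i, j): freq}: p_i^2 for homozygotes, 2*p_i*p_j for heterozygotes."""
    m = len(p)
    return {(i, j): (p[i] ** 2 if i == j else 2 * p[i] * p[j])
            for i in range(m) for j in range(i, m)}

def heterozygosity(p):
    """H = sum of heterozygote frequencies = 1 - sum of p_i^2."""
    return 1 - sum(q ** 2 for q in p)

print(hw_genotype_freqs([0.4, 0.3, 0.2, 0.1]))  # matches the fourth row
print(heterozygosity([0.4, 0.3, 0.2, 0.1]))     # 0.70
```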

Mechanisms of MENDELIAN HEREDITY:

Cell division – mitosis, meiosis

Genetic Linkage – association of genes on the same chromosome.
Effects: segregation ratios are no longer Mendelian. The result of recombination (at meiosis) is the existence of non-parental chromosomes among the cellular meiotic products. Each crossover (exchange of chromosomal segments between homologs) gives 2 reciprocal recombinant (non-parental) gametes. Measurement: the recombinant fraction.

Recombination – generally random along chromosomes: recombination between loci is associated with their distance apart. (This is the basic premise of genetic/genomic mapping.)

Models: linkage phase (co-dominance, experimental data), factors affecting recombination

Example

For 2 loci, A and B, on the same chromosome, with two alleles segregating at each locus (A/a, B/b), the gametes possible by meiosis are AB, Ab, aB, ab. If AB and ab are the parental types, then Ab and aB are the non-parental recombinants.

Sampling from a population, observe n_r recombinant gametes (Ab and aB) out of a total of n sampled gametes. Recombinant Fraction r = n_r / n.

Notes: Usually we observe phenotypic rather than gamete frequencies. Estimation of the R.F. using phenotypic data involves constructing the phenotypic likelihood and estimation using maximum likelihood. With more than 2 or 3 loci there are several R.F.s and crossover interference; this is complex, so a Mapping Function is used, e.g. based on the Poisson (see the sketch below).
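As a sketch of the mapping-function idea (the counts here are hypothetical): Haldane's map function, d = -0.5 ln(1 - 2r), is one standard Poisson-based choice consistent with the note above, though the slides do not name a specific function:

```python
# Sketch: recombinant fraction and a Poisson-based (Haldane) mapping function.
import math

def recombinant_fraction(n_recombinant, n_total):
    """r = n_r / n, from counts of recombinant vs. all sampled gametes."""
    return n_recombinant / n_total

def haldane_map_distance(r):
    """Map distance in Morgans from recombinant fraction r (requires r < 0.5)."""
    return -0.5 * math.log(1 - 2 * r)

r = recombinant_fraction(18, 100)      # hypothetical: 18 recombinants in 100 gametes
print(r, haldane_map_distance(r))      # 0.18, ~0.223 Morgans
```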

Population Genetics

Focus – frequencies, distributions and origins of genes in populations, and changes due to mutation, migration and selection.

Allelic frequencies (probabilities) of the possible alleles A_i for crosses between two parents:

Cross      No. possible A_i   A1 (a)   A2 (b)   A3 (c)   A4 (d)
ab x cd          4             0.25     0.25     0.25     0.25
ab x cc          3             0.25     0.25     0.5      0
ab x ab        2 (F2)          0.5      0.5      0        0
ab x aa      Backcross         0.75     0.25     0        0
aa x aa        Fixed           1.0      0        0        0

Quantitative Genetics

Focus – inheritance of quantitative traits. As the number of genes controlling a trait increases, and their effects on the phenotype accumulate, the ability to model the trait through simple Mendelian inheritance diminishes.

Single Gene Model
A single locus A has two alleles, A and a, so 3 possible genotypes AA, Aa & aa. Three values a, d and -a are assigned (arbitrarily) to the genotypes AA, Aa and aa respectively. The population is assumed to be in Hardy-Weinberg equilibrium (gene and genotypic frequencies constant, generation to generation). The two alleles have frequencies p and (1 - p) = q.

Set-up on the genotypic-value scale:  -a (aa) … d (Aa) … +a (AA), where the population mean, in terms of allelic frequencies and genotypic values, is

    μ = p·a + 2pq·d - q·a = a(p - q) + 2pqd
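A one-function sketch of that mean (my own wrapper, not from the notes):

```python
# Sketch: single-locus population mean mu = a(p - q) + 2pqd under Hardy-Weinberg,
# with genotypic values a (AA), d (Aa), -a (aa) and allele frequencies p, q = 1 - p.
def single_locus_mean(p, a, d):
    q = 1 - p
    return a * (p - q) + 2 * p * q * d

# e.g. at p = 0.5 the additive term vanishes and only dominance contributes: mu = d/2.
print(single_locus_mean(0.5, a=1.0, d=0.25))   # 0.125
```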

Quantitative Genetics measures - examples

Deviation of a genotypic value from the population mean (e.g. a - μ)
Average effect of gene substitution
Breeding Value (B.V.) = average genotypic value of progeny
Dominance Deviation (D.D.) = part of the genotypic value not explained by the breeding value
Total Genetic Variance in a Population = variance of genotypic values = sum of the variances for B.V. and D.D.
Heritability = ratio of genotypic to phenotypic variances
Trait Models (e.g. a linear model of what influences a trait):

    y_ij = μ + G_i + ε_ij

where y_ij is the trait value for genotype i in replication j, μ the population mean, G_i the genetic effect for genotype i, and ε_ij the error term associated with genotype i in replication j.
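A simulation sketch of that linear trait model (the effect sizes below are invented purely for illustration):

```python
# Sketch: simulate y_ij = mu + G_i + e_ij with N(0, sigma^2) errors.
import random

def simulate_trait(mu, genotype_effects, reps, sigma, seed=1):
    """Return {genotype: [y_i1, ..., y_i_reps]} under the linear trait model."""
    rng = random.Random(seed)
    return {
        g: [mu + G + rng.gauss(0, sigma) for _ in range(reps)]
        for g, G in genotype_effects.items()
    }

data = simulate_trait(mu=60, genotype_effects={"AA": 5, "Aa": 2, "aa": -5},
                      reps=100, sigma=3)
for g, ys in data.items():
    print(g, round(sum(ys) / len(ys), 2))   # sample means near 65, 62, 55
```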

Probability & Statistics Primer (appendix, if needed)

Note: Short overview. Other statistical distributions in lectures

Summary Statistics- Descriptive

When analysing practical sets of data, it is useful to be able to define a small number of values that summarise the main features present. We will derive (i) representative values, (ii) measures of spread and (iii) measures of skewness and other characteristics.

Representative Values

These are sometimes called measures of location or measures of central tendency.

1. Random Value

Given a set of data S = {x1, x2, …, xn}, we select a random number, say k, in the range 1 to n and return the value xk. This method of generating a representative value is straightforward, but it suffers from the fact that extreme values can occur, and successive values could vary considerably from one another.

2. Arithmetic Mean

For the set S above, the arithmetic mean (or just mean) is x̄ = (x1 + x2 + … + xn) / n.

If x1 occurs f1 times, x2 occurs f2 times and so on, we get the formula x̄ = (f1 x1 + f2 x2 + … + fn xn) / (f1 + f2 + … + fn), written x̄ = Σ fi xi / Σ fi.

Example 1. The data refer to the marks that students in a class obtained in an examination. Find the average mark for the class.

The first point to note is that the marks are presented as ranges, so we must be careful in our interpretation of the ranges. All the intervals must be of equal width and there must be no gaps in the classification. In our case, we interpret the range 0 - 19 to contain marks greater than 0 and less than or equal to 20; thus its mid-point is 10. The other intervals are interpreted accordingly.

Mark range   Mid-point of range xi   Number of students fi   fi xi
 0 - 19              10                       2                 20
21 - 39              30                       6                180
40 - 59              50                      12                600
60 - 79              70                      25               1750
80 - 99              90                       5                450
Sum                                          50               3000

The arithmetic mean is x̄ = 3000 / 50 = 60 marks.

Note that if weights of size fi are suspended from a metre stick at the points xi, then the average is the centre of gravity of the distribution. Consequently, it is very sensitive to outlying values. Equally, the population should be homogeneous for the average to be meaningful. For example, if we assume that the typical height of girls in a class is less than that of boys, then the average height of all students is representative of neither the girls nor the boys.

3. The Mode

This is the value in the distribution that occurs most frequently. By common agreement, it is calculated from the histogram using linear interpolation on the modal class.

[Histogram: similar triangles on the modal class (frequencies 2, 6, 12, 25, 5) generate the common ratios used in the interpolation.]

In our case, the differences between the modal frequency and its neighbours are 13 = 25 - 12 and 20 = 25 - 5, so the mode is 60 + 13/33 (20) = 67.8 marks.

4. The Median

This is the middle point of the distribution. It is used heavily in educational applications. If {x1, x2, …, xn} are the marks of students in a class, arranged in nondecreasing order, then the median is the mark of the (n + 1)/2 -th student.

It is often calculated from the ogive, or cumulative frequency diagram. In our case the median position is (50 + 1)/2 = 25.5; the cumulative frequency up to 60 marks is 20, so interpolating within the 60 - 79 class gives a median of 60 + 5.5/25 (20) = 64.4 marks.

[Ogive: cumulative frequency diagram over marks 20 to 100.]
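A sketch of both interpolation formulas (helpers of my own devising), checked against Example 1:

```python
# Sketch: linear interpolation for the mode and median of grouped data.
def grouped_mode(L, f_prev, f_modal, f_next, width):
    """Mode = L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width, on the modal class."""
    return L + (f_modal - f_prev) / ((f_modal - f_prev) + (f_modal - f_next)) * width

def grouped_median(L, n, cum_before, f_median, width):
    """Median = L + ((n + 1)/2 - cum_before) / f_median * width, from the ogive."""
    return L + ((n + 1) / 2 - cum_before) / f_median * width

print(grouped_mode(60, 12, 25, 5, 20))       # 67.9  (60 + 13/33 * 20)
print(grouped_median(60, 50, 20, 25, 20))    # 64.4  (60 + 5.5/25 * 20)
```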

Measures of Dispersion or Scattering

Example 2. The following distribution has the same arithmetic mean as Example 1, but the values are more dispersed. This illustrates the point that an average value on its own may not adequately describe a statistical distribution.

Marks x   Frequency f    f x
  10           6           60
  30           8          240
  50           6          300
  70          15         1050
  90          15         1350
Sums          50         3000

To devise a formula that captures the degree to which a distribution is concentrated about the average, we consider the deviations of the values from the average. If the distribution is concentrated around the mean, the deviations will be small, while if the distribution is very scattered, the deviations will be large. The average of the squares of the deviations is called the variance, and this is used as a measure of dispersion. The square root of the variance is called the standard deviation; it has the same units of measurement as the original values and is the preferred measure of dispersion in many applications.

Variance & Standard Deviation

s² ≡ VAR[X] = average of the squared deviations
            = Σ f (xi - x̄)² / Σ f
            = Σ f xi² / Σ f - x̄²,   called the product-moment formula.

s ≡ standard deviation = √variance

Example 1:
 x    f    f x    f x²
10    2     20      200
30    6    180     5400
50   12    600    30000
70   25   1750   122500
90    5    450    40500
     50   3000   198600

VAR[X] = 198600/50 - 60² = 372 marks²

Example 2:
 x    f    f x    f x²
10    6     60      600
30    8    240     7200
50    6    300    15000
70   15   1050    73500
90   15   1350   121500
     50   3000   217800

VAR[X] = 217800/50 - 60² = 756 marks²
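The product-moment calculation as a short sketch, reproducing both worked examples:

```python
# Sketch: mean, variance (product-moment formula) and standard deviation
# from grouped data (class midpoints xs, frequencies fs).
def grouped_stats(xs, fs):
    n = sum(fs)
    mean = sum(f * x for f, x in zip(fs, xs)) / n
    var = sum(f * x * x for f, x in zip(fs, xs)) / n - mean ** 2
    return mean, var, var ** 0.5

print(grouped_stats([10, 30, 50, 70, 90], [2, 6, 12, 25, 5]))   # (60.0, 372.0, 19.29)
print(grouped_stats([10, 30, 50, 70, 90], [6, 8, 6, 15, 15]))   # (60.0, 756.0, 27.50)
```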

Other Summary Statistics

Skewness

An important attribute of a statistical distribution relates to its degree of symmetry. The word “skew” means a tail, so distributions that have a large tail of outlying values on the right-hand side are called positively skewed, or skewed to the right. Negative skewness is defined similarly. A simple formula is

    Skewness = (Mean - Mode) / Standard Deviation

which in the case of Example 1 is: Skewness = (60 - 67.8) / 19.287 = -0.4044.

Coefficient of Variation

This measure was devised to standardise the dispersion relative to the arithmetic mean, so that comparisons can be drawn between different distributions. However, it has not won universal acceptance.

    Coefficient of Variation = Standard Deviation / Mean.

Semi-Interquartile Range

Just as the median corresponds to the 0.50 point in a distribution, the quartiles Q1, Q2, Q3 correspond to the 0.25, 0.50 and 0.75 points. An alternative measure of dispersion is

    Semi-Interquartile Range = (Q3 - Q1) / 2.

Geometric Mean

For data that are growing geometrically, such as economic data with a high inflation effect, an alternative to the arithmetic mean is preferred. It involves taking the N-th root (N = Σ f) of a product of terms:

    Geometric Mean = (x1^f1 · x2^f2 · … · xk^fk)^(1/N)

Regression

[Example 3.] As a motivating example, suppose we are modelling sales data over time:

TIME    1990  1991  1992  1993  1994  1995
SALES     3     5     4     5     6     7

We seek the straight line “Y = m X + c” that best approximates the data. By “best” in this case, we mean the line which minimizes the sum of squares of vertical deviations of points from the line:

    SS = Σ (Yi - [m Xi + c])².

Setting the partial derivatives of SS with respect to m and c to zero leads to the “normal equations”

    Σ Y  = m Σ X  + n c,    where n = # points
    Σ XY = m Σ X² + c Σ X.

Let 1990 correspond to Year 0. The working sums are:

 X    Y    X·X   X·Y   Y·Y
 0    3     0     0     9
 1    5     1     5    25
 2    4     4     8    16
 3    5     9    15    25
 4    6    16    24    36
 5    7    25    35    49
15   30    55    87   160

Example 3 - Workings.

The normal equations are:
    30 = 15 m + 6 c    =>   150 = 75 m + 30 c
    87 = 55 m + 15 c   =>   174 = 110 m + 30 c

Subtracting gives 24 = 35 m, so m = 24/35; then 30 = 15 (24/35) + 6 c => c = 23/7.

Thus the regression line of Y on X is Y = (24/35) X + (23/7), and to plot the line we need two points: X = 0 => Y = 23/7, and X = 5 => Y = (24/35)(5) + 23/7 = 47/7.

It is easy to see that (X̄, Ȳ) satisfies the normal equations, so the regression line of Y on X passes through the “centre of gravity” of the data. By expanding terms, we also get

    Σ (Yi - Ȳ)² = Σ (Yi - [m Xi + c])² + Σ ([m Xi + c] - Ȳ)²
    SST (Total Sum of Squares) = SSE (Error Sum of Squares) + SSR (Regression Sum of Squares)

In regression, we refer to the X variable as the independent variable and Y as the dependent variable.
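A sketch verifying the workings (exact arithmetic via fractions; not part of the original notes):

```python
# Sketch: solve the normal equations for the sales data, confirming
# m = 24/35 and c = 23/7.
from fractions import Fraction

X = [0, 1, 2, 3, 4, 5]
Y = [3, 5, 4, 5, 6, 7]
n = len(X)
Sx, Sy = sum(X), sum(Y)
Sxx = sum(x * x for x in X)
Sxy = sum(Fraction(x * y) for x, y in zip(X, Y))

# Normal equations: Sy = m*Sx + n*c ; Sxy = m*Sxx + c*Sx
m = (n * Sxy - Sx * Sy) / Fraction(n * Sxx - Sx * Sx)
c = (Fraction(Sy) - m * Sx) / n
print(m, c)   # 24/35 23/7
```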

Correlation

The coefficient of determination r² (which takes values in the range 0 to 1) is a measure of the proportion of the total variation that is associated with the regression process:

    r² = SSR / SST = 1 - SSE / SST.

The coefficient of correlation r (which takes values in the range -1 to +1) is more commonly used as a measure of the degree to which a mathematical relationship exists between X and Y. It can be calculated from the formula:

    r = Σ (X - X̄)(Y - Ȳ) / √[ Σ (X - X̄)² Σ (Y - Ȳ)² ]
      = [ n Σ XY - Σ X Σ Y ] / √[ {n Σ X² - (Σ X)²} {n Σ Y² - (Σ Y)²} ]

Example. In our case r = {6(87) - (15)(30)} / √[ {6(55) - 15²} {6(160) - 30²} ] = 0.907.

[Scatter diagrams: r = -1 (perfect negative), r = 0 (no linear association), r = +1 (perfect positive); dashed lines show association.]
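The summary-sum formula for r as a short sketch, on the same sales data:

```python
# Sketch: coefficient of correlation from summary sums.
import math

X = [0, 1, 2, 3, 4, 5]
Y = [3, 5, 4, 5, 6, 7]
n = len(X)
Sx, Sy = sum(X), sum(Y)
Sxx = sum(x * x for x in X)
Syy = sum(y * y for y in Y)
Sxy = sum(x * y for x, y in zip(X, Y))

r = (n * Sxy - Sx * Sy) / math.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))
print(round(r, 3))   # 0.907
```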

Collinearity

If the value of the correlation coefficient is greater than 0.9 or less than -0.9, we would take this to mean that there is a mathematical relationship between the variables. This does not imply that a cause-and-effect relationship exists.

Consider a country with a slowly changing population size, where a certain political party retains a relatively stable percentage of the poll in elections. Let
    X = number of people that vote for the party in an election,
    Y = number of people that die due to a given disease in a year,
    Z = population size.

Then the correlation coefficient between X and Y is likely to be close to 1, indicating that there is a mathematical relationship between them: X is a function of Z, and Y is a function of Z also. It would clearly be silly to suggest that the incidence of the disease is caused by the number of people that vote for the given political party. This is known as the problem of collinearity.

Spotting hidden dependencies between distributions can be difficult. Statistical experimentation can only be used to disprove hypotheses, or to lend evidence to support the view that reputed relationships between variables may be valid. Thus, the fact that we observe a high correlation coefficient between deaths due to heart failure in a given year and the number of cigarettes consumed twenty years earlier does not establish a cause-and-effect relationship. However, this result may be of value in directing biological research.

Overview of Probability Theory

In statistical theory, an experiment is any operation that can be replicated infinitely often and gives rise to a set of elementary outcomes, which are deemed to be equally likely. The sample space S of the experiment is the set of all possible outcomes of the experiment. Any subset E of the sample space is called an event. We say that an event E occurs whenever any of its elements is an outcome of the experiment. The probability of occurrence of E is

    P{E} = (number of elementary outcomes in E) / (number of elementary outcomes in S).

The complement E' of an event E is the set of all elements that belong to S but not to E. The union E1 ∪ E2 of two events is the set of all outcomes that belong to E1 or to E2 or to both. The intersection E1 ∩ E2 of two events is the set of all outcomes that belong to both E1 and E2.

Two events are mutually exclusive if the occurrence of either precludes the occurrence of the other, i.e. their intersection is the empty set. Two events are independent if the occurrence of either is unaffected by the occurrence or non-occurrence of the other event.

Theorem of Total Probability (Addition Rule).

    P{E1 ∪ E2} = P{E1} + P{E2} - P{E1 ∩ E2}

Proof. Classify the n equally likely outcomes as n_{1,0} (in E1 only), n_{1,2} (in both E1 and E2), n_{0,2} (in E2 only) and n_{0,0} (in neither), so that n = n_{0,0} + n_{1,0} + n_{0,2} + n_{1,2}. Then

    P{E1 ∪ E2} = (n_{1,0} + n_{1,2} + n_{0,2}) / n
               = (n_{1,0} + n_{1,2}) / n + (n_{1,2} + n_{0,2}) / n - n_{1,2} / n
               = P{E1} + P{E2} - P{E1 ∩ E2}.

Corollary. If E1 and E2 are mutually exclusive, P{E1 ∪ E2} = P{E1} + P{E2} - see Axioms and Addition Rule.

Conditional Probability.

The probability P{E1 | E2} that E1 occurs, given that E2 has occurred (or must occur), is called the conditional probability of E1. Note that in this case the only possible outcomes of the experiment are confined to E2 and not to S.

Theorem of Compound Probability (Multiplication Rule).

    P{E1 ∩ E2} = P{E1 | E2} * P{E2}.

Proof. P{E1 ∩ E2} = n_{1,2} / n = { n_{1,2} / (n_{1,2} + n_{0,2}) } * { (n_{1,2} + n_{0,2}) / n }.

Note: If E1 and E2 are independent, P{E1 ∩ E2} = P{E1} * P{E2} - a special case of the Multiplication Rule. If an event is itself compound, the rule expands further (Chain Rule), e.g.

    P{E7 ∩ E8 ∩ E9} = P{E7 ∩ (E8 ∩ E9)} = P{E7 | E8 ∩ E9} * P{E8 | E9} * P{E9}.

The ability to count the possible outcomes in an event is crucial to calculating probabilities. By a permutation of size r of n different items, we mean an arrangement of r of the items, where the order of the arrangement is important. If the order is not important, the arrangement is called a combination.

Example. There are 5*4 = 20 permutations and 5*4 / (2*1) = 10 combinations of size 2 of A, B, C, D, E.

Permutations: AB, BA, AC, CA, AD, DA, AE, EA, BC, CB, BD, DB, BE, EB, CD, DC, CE, EC, DE, ED
Combinations: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE

Standard reference books on probability theory give a comprehensive treatment of how these ideas are used to calculate the probability of occurrence of the outcomes of games of chance.

Bayes’ Rule (Theorem):

For a series of mutually exclusive and exhaustive events B_r (where the union B_1 ∪ B_2 ∪ … ∪ B_r covers all possibilities for B):

    P{B_s | A} = P{A | B_s} P{B_s} / Σ_r P{A | B_r} P{B_r}

where the denominator is the total probability of A occurring.

Ex. Paternity indices: based on the actual genotypes of mother, child, and alleged father. Before collection of any evidence we have a prior probability of paternity, P{C}. So, what is the situation after the genetic evidence E is in?

    P{man is father | E} / P{man not father | E}
        = [ P{E | man is father} / P{E | man not father} ] × [ P{C} / (1 - P{C}) ]

Written in terms of the ratio of posterior probabilities (= LHS), the paternity index (L, say) and the prior probabilities (RHS). Rearranging and substituting in the above gives the probability of an alleged man with a particular genotype being the true father:

    P{C | E} = L P{C} / [ L P{C} + (1 - P{C}) ]

NB: L is a way of ‘weighting’ the genetic evidence; the issue is setting a prior.
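A sketch of that rearranged formula (the index values and priors below are hypothetical, chosen only to show how much the prior matters):

```python
# Sketch: posterior probability of paternity from paternity index L and prior P{C}:
# P{C|E} = L * P{C} / (L * P{C} + (1 - P{C})).
def posterior_paternity(L, prior):
    return L * prior / (L * prior + (1 - prior))

print(posterior_paternity(L=100, prior=0.5))    # ~0.990: strong combined evidence
print(posterior_paternity(L=100, prior=0.01))   # ~0.503: same L, weak prior
```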

Statistical Distributions - Characterisation

If a statistical experiment only gives rise to real numbers, the outcome of the experiment is called a random variable. If a random variable X takes values X1, X2, …, Xn with probabilities p1, p2, …, pn, then the expected (average) value of X is defined to be

    E[X] = Σ_{j=1..n} pj Xj

and its variance is

    VAR[X] = E[X²] - E[X]² = Σ_{j=1..n} pj Xj² - E[X]².

Example. Let X be a random variable measuring the distance in kilometres travelled by children to a school, and suppose that the following data apply:

Prob. pj   Distance Xj   pj Xj   pj Xj²
0.15           2.0        0.30     0.60
0.40           4.0        1.60     6.40
0.20           6.0        1.20     7.20
0.15           8.0        1.20     9.60
0.10          10.0        1.00    10.00
1.00                      5.30    33.80

Then the mean and variance are E[X] = 5.30 kilometres and VAR[X] = 33.80 - 5.30² = 5.71 kilometres².

Similar concepts apply to continuous distributions. The distribution function is defined by F(t) = P{X ≤ t}, and its derivative is the frequency function f(t) = dF(t)/dt, so that

    F(t) = ∫_{-∞}^{t} f(x) dx.

Sums and Differences of Random Variables

Define the covariance of two random variables to be

    COVAR[X, Y] = E[ (X - E[X]) (Y - E[Y]) ] = E[X Y] - E[X] E[Y].

If X and Y are independent, COVAR[X, Y] = 0.

Lemma.
    E[X + Y] = E[X] + E[Y]
    VAR[X + Y] = VAR[X] + VAR[Y] + 2 COVAR[X, Y]
    E[k X] = k E[X],  VAR[k X] = k² VAR[X],   for a constant k.

Example. A company records the journey time X of a lorry from a depot to customers and the unloading times Y, as shown:

            X = 1   X = 2   X = 3   X = 4   Totals
Y = 1         7       5       4       4       20
Y = 2         2       6       8       3       19
Y = 3         1       2       5       3       11
Totals       10      13      17      10       50

E[X] = {1(10) + 2(13) + 3(17) + 4(10)} / 50 = 2.54
E[X²] = {1²(10) + 2²(13) + 3²(17) + 4²(10)} / 50 = 7.5
VAR[X] = 7.5 - 2.54² = 1.0484

E[Y] = {1(20) + 2(19) + 3(11)} / 50 = 1.82
E[Y²] = {1²(20) + 2²(19) + 3²(11)} / 50 = 3.9
VAR[Y] = 3.9 - 1.82² = 0.5876

E[X+Y] = {2(7)+3(5)+4(4)+5(4)+3(2)+4(6)+5(8)+6(3)+4(1)+5(2)+6(5)+7(3)} / 50 = 4.36
E[(X+Y)²] = {2²(7)+3²(5)+4²(4)+5²(4)+3²(2)+4²(6)+5²(8)+6²(3)+4²(1)+5²(2)+6²(5)+7²(3)} / 50 = 21.04
VAR[X+Y] = 21.04 - 4.36² = 2.0304

E[X Y] = {1(7)+2(5)+3(4)+4(4)+2(2)+4(6)+6(8)+8(3)+3(1)+6(2)+9(5)+12(3)} / 50 = 4.82
COVAR[X, Y] = 4.82 - (2.54)(1.82) = 0.1972

Check: VAR[X] + VAR[Y] + 2 COVAR[X, Y] = 1.0484 + 0.5876 + 2(0.1972) = 2.0304 = VAR[X+Y].
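A verification sketch of the identity on the same table (the cell counts are as reconstructed above):

```python
# Sketch: check VAR[X+Y] = VAR[X] + VAR[Y] + 2 COVAR[X,Y] on the lorry data.
counts = {  # (x, y): frequency
    (1, 1): 7, (2, 1): 5, (3, 1): 4, (4, 1): 4,
    (1, 2): 2, (2, 2): 6, (3, 2): 8, (4, 2): 3,
    (1, 3): 1, (2, 3): 2, (3, 3): 5, (4, 3): 3,
}
n = sum(counts.values())

def E(g):
    """Expectation of g(x, y) under the empirical joint distribution."""
    return sum(f * g(x, y) for (x, y), f in counts.items()) / n

var_x = E(lambda x, y: x * x) - E(lambda x, y: x) ** 2
var_y = E(lambda x, y: y * y) - E(lambda x, y: y) ** 2
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
var_sum = E(lambda x, y: (x + y) ** 2) - E(lambda x, y: x + y) ** 2
print(round(var_x, 4), round(var_y, 4), round(cov, 4))   # 1.0484 0.5876 0.1972
print(round(var_sum, 4), round(var_x + var_y + 2 * cov, 4))  # 2.0304 2.0304
```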

Standard Statistical Distributions

Most elementary statistics books provide a survey of the commonly used statistical distributions. Importantly, we can characterise them by their expectation and variance (as for random variables) and by the parameters on which these are based (see lecture notes for those we will refer to). So, e.g., for a Binomial distribution the parameters are p, the probability of success in an individual trial, and n, the number of trials. The probability of success remains constant - otherwise, another distribution applies.

Use of the correct distribution is core to statistical inference - i.e. estimating what is happening in the population on the basis of a (correctly drawn, probabilistic) sample. The sample is then representative of the population.

Fundamental to statistical inference is the Normal (or Gaussian) distribution, with parameters μ, the mean (or more formally the expectation of the distribution), and σ (or σ²), the standard deviation (variance). For small samples, or when σ² is not known but must be estimated from the sample, a slightly more conservative distribution applies: the Student’s t (or just ‘t’) distribution. This introduces the degrees of freedom concept.

Student’s t Distribution

A random variable X has a t distribution with n degrees of freedom (t_n). The t distribution is symmetrical about the origin, with E[X] = 0 and VAR[X] = n / (n - 2). For small values of n, the t_n distribution is very flat. As n is increased, the density assumes a bell shape. For values of n ≥ 25, the t_n distribution is practically indistinguishable from the standard normal curve.

o If X has a standard normal distribution and Y has a χ²_n distribution, and X and Y are independent random variables, then X / √(Y/n) has a t_n distribution.
o If x1, x2, …, xn is a random sample from a normal distribution with mean μ and variance σ², and if we define the estimated sample variance s² = 1/(n - 1) Σ (xi - x̄)², then (x̄ - μ) / (s / √n) has a t_{n-1} distribution.

Many other standard distributions exist (see tables).

Sampling Theory

The procedure for drawing a random sample from a distribution is that numbers 1, 2, … are assigned to the elements of the distribution, and tables of random numbers are then used to decide which elements are included in the sample. If the same element cannot be selected more than once, we say that the sample is drawn without replacement; otherwise, the sample is said to be drawn with replacement.

The usual convention in sampling is that lower-case letters designate the sample characteristics, with capital letters being used for the parent population. Thus if the sample size is n, its elements are designated x1, x2, …, xn, its mean is x̄ and its modified variance is

    s² = Σ (xi - x̄)² / (n - 1).

The corresponding parent population characteristics are N (or infinity), X̄ and S².

Suppose that we repeatedly draw random samples of size n (with replacement) from a distribution with mean μ and variance σ². Let x̄1, x̄2, … be the collection of sample means, and let

    x̄i' = (x̄i - μ) / (σ / √n)    (i = 1, 2, …).

The collection x̄1', x̄2', … is called the sampling distribution of means.

Central Limit Theorem. In the limit, as the sample size n tends to infinity, the sampling distribution of means has a standard normal distribution. This is the basis for statistical inference.

Attribute and Proportionate Sampling

If the sample elements are measurements of some characteristic, we are said to have attribute sampling. On the other hand, if all the sample elements are 1 or 0 (success/failure, agree/do-not-agree), we have proportionate sampling. For proportionate sampling, the sample average x̄ and the sample proportion p are synonymous, just as the mean μ and proportion P are for the parent population. From our results on the Binomial distribution, the sample variance is p(1 - p) and the variance of the parent distribution is P(1 - P).

We can generalise the concept of the sampling distribution of means to get the sampling distribution of any statistic. We say that a sample characteristic is an unbiased estimator of the parent population characteristic if the expectation of the corresponding sampling distribution is equal to the parent characteristic.

Lemma. The sample average (proportion) is an unbiased estimator of the parent average (proportion): E[x̄] = μ, and so E[p] = P.

The quantity √[(N - n) / (N - 1)] is called the finite population correction (fpc). If the parent population is infinite, or we have sampling with replacement, the fpc = 1.

Lemma. E[s] = S * fpc, for the estimated sample standard deviation s with the fpc applied.

Confidence Intervals

From the statistical tables for a standard normal distribution, we note:

Area under density function   From     To
0.90                         -1.64    1.64
0.95                         -1.96    1.96
0.99                         -2.58    2.58

From the Central Limit Theorem, if x̄ and s² are the mean and variance of a random sample of size n (with n greater than 25) drawn from a large parent population, then we can make the following statement about the unknown parent mean μ:

    Prob { -1.64 ≤ (x̄ - μ) / (s/√n) ≤ 1.64 } ≈ 0.90

i.e.  Prob { x̄ - 1.64 s/√n ≤ μ ≤ x̄ + 1.64 s/√n } ≈ 0.90.

The range x̄ ± 1.64 s/√n is called a 90% confidence interval for the parent mean μ.

Example [Attribute Sampling]. A random sample of size 25 has x̄ = 15 and s = 2. Then a 95% confidence interval for μ is 15 ± 1.96 (2/5), i.e. 14.22 to 15.78.

Example [Proportionate Sampling]. A random sample of size n = 1000 has p = 0.40, and 1.96 √[p(1 - p)/(n - 1)] = 0.03. A 95% confidence interval for P is 0.40 ± 0.03, i.e. 0.37 to 0.43.
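The two worked intervals as a sketch:

```python
# Sketch: large-sample confidence intervals for a mean and a proportion.
import math

def mean_ci(xbar, s, n, z=1.96):
    half = z * s / math.sqrt(n)
    return xbar - half, xbar + half

def proportion_ci(p, n, z=1.96):
    half = z * math.sqrt(p * (1 - p) / (n - 1))
    return p - half, p + half

print(mean_ci(15, 2, 25))          # (14.22, 15.78)
print(proportion_ci(0.40, 1000))   # (0.37, 0.43)
```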

Small Sampling Theory

For reference purposes, it is useful to regard the expression x̄ ± 1.96 s/√n as the “default formula” for a confidence interval and to modify it to suit particular circumstances.

o If we are dealing with proportionate sampling, the sample proportion is the sample mean: x̄ -> p, and the standard error (s.e.) term s/√n simplifies to √[p(1 - p)/(n - 1)] (also n - 1 -> n for large samples).
o A 90% confidence interval will bring about the swap 1.96 -> 1.64.
o If the sample size n is less than 25, the normal distribution must be replaced by Student’s t_{n-1} distribution.
o For sampling without replacement from a finite population, an fpc term must be used.

The width of the confidence interval band increases with the confidence level.

Example. A random sample of size n = 10, drawn from a large parent population, has a mean x̄ = 12 and a standard deviation s = 2. Then a 99% confidence interval for the parent mean is x̄ ± 3.25 s/√n, i.e. 12 ± 3.25 (2)/3 (taking √10 ≈ 3), i.e. 9.83 to 14.17; and a 95% confidence interval for the parent mean is x̄ ± 2.262 s/√n, i.e. 12 ± 2.262 (2)/3, i.e. 10.492 to 13.508.

Note that for n = 1000, 1.96 √[p(1 - p)/n] ≤ 0.03 for values of p between 0.3 and 0.7. This gives rise to the statement that public opinion polls have an “inherent error of 3%”, and it simplifies calculations in the case of public opinion polls for large political parties.

Tests of Hypothesis

[Motivational Example.] It is claimed that the average grade of all 12-year-old children in a country in a particular aptitude test is 60%. A random sample of n = 49 students gives a mean x̄ = 55% with a standard deviation s = 2%. Is the sample finding consistent with the claim?

We regard the original claim as a null hypothesis (H0), which is tentatively accepted as TRUE:

    H0: μ = 60.

If the null hypothesis is true, the test statistic

    t = (x̄ - μ) / (s / √n)

is a random variable with a Normal(0, 1) = standardised normal Z(0,1) (or U(0,1)) distribution. Thus

    (55 - 60) / (2/√49) = -35/2 = -17.5

is a random value from Z(0, 1). But this lies outside the 95% confidence interval (-1.96, +1.96), i.e. it falls in the rejection region, so either
(i) the null hypothesis is incorrect, or
(ii) an event with a probability of at most 0.05 has occurred.

Consequently, we reject the null hypothesis, knowing a probability of 0.05 exists that we are in error. Technically, we say we reject the null hypothesis at the 0.05 level of significance.

The alternative to rejecting H0 is to declare the test to be inconclusive. This means that there is some tentative evidence to support the view that H0 is approximately correct.
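The motivational example as a two-sided Z test, sketched:

```python
# Sketch: Z test statistic for H0: mu = 60 on the aptitude-test data.
import math

def z_statistic(xbar, mu0, s, n):
    return (xbar - mu0) / (s / math.sqrt(n))

z = z_statistic(55, 60, 2, 49)
print(z, abs(z) > 1.96)   # -17.5 True -> reject H0 at the 0.05 level
```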

Modifications

Based on the properties of the Normal, Student t and other distributions, we can generalise these ideas. If the sample size n < 25, we should use a t_{n-1} distribution; we can also vary the level of significance of the test, and we can apply the tests to proportionate sampling environments.

Example. 40% of a random sample of 1000 people in a country indicate satisfaction with government policy. Test at the .01 level of significance if this is consistent with the claim that 45% of the people support government policy.

Here p = 0.40, n = 1000, H0: P = 0.45, and √[p(1 - p)/n] = 0.015, so the

    test statistic = (0.40 - 0.45) / 0.015 = -3.33.

The 99% critical value is 2.58, so H0 is rejected at the .01 level of significance.

One-Tailed Tests

If the null hypothesis is of the form H0: P ≥ 0.45, then arbitrarily large values of p are acceptable, so the rejection region for the test statistic lies in the left-hand tail only.

Example. 40% of a random sample of 1000 people in a country indicate satisfaction with government policy. Test at the .05 level of significance if this is consistent with the claim that at least 45% of the people support government policy.

Here the critical value is -1.64 (the rejection region is the lower tail of Z(0,1), with 0.95 of the area above it). The test statistic -3.33 falls in this region, so the null hypothesis H0: P ≥ 0.45 is rejected at the .05 level of significance.

Testing Differences between Means

Suppose that x1 x2 … xm is a random sample with mean x̄ and standard deviation s1, drawn from a distribution with mean μ1, and that y1 y2 … yn is a random sample with mean ȳ and standard deviation s2, drawn from a distribution with mean μ2. Suppose we wish to test the null hypothesis that both samples are drawn from the same parent population, i.e.

    H0: μ1 = μ2.

The pooled estimate of the parent variance is

    s*² = { (m - 1) s1² + (n - 1) s2² } / (m + n - 2)

and the variance of x̄ - ȳ, being the variance of the difference of two independent random variables, is

    s'² = s*²/m + s*²/n.

This allows us to construct the test statistic (x̄ - ȳ) / s', which under H0 has a t_{m+n-2} distribution.

Example. A random sample of size m = 25 has mean x̄ = 2.5 and standard deviation s1 = 2, while a second sample of size n = 41 has mean ȳ = 2.8 and standard deviation s2 = 1. Test at the .05 level of significance if the means of the parent populations are identical.

Here H0: μ1 = μ2, x̄ - ȳ = -0.3 and s*² = {24(4) + 40(1)} / 64 = 2.125, so the test statistic is

    -0.3 / √(2.125/25 + 2.125/41) ≈ -0.81.

The .05 critical value for Z(0, 1) is 1.96, so the test is inconclusive.
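The pooled statistic from summary data, as a sketch:

```python
# Sketch: pooled two-sample test statistic (t with m+n-2 df under H0).
import math

def pooled_t(m, xbar, s1, n, ybar, s2):
    s2p = ((m - 1) * s1 ** 2 + (n - 1) * s2 ** 2) / (m + n - 2)
    se = math.sqrt(s2p / m + s2p / n)
    return (xbar - ybar) / se

print(round(pooled_t(25, 2.5, 2, 41, 2.8, 1), 2))   # -0.81, inside +/-1.96
```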

Paired Tests

If the sample values (xi, yi) are paired, such as the marks of students in two examinations, then let di = xi - yi be their differences, and treat these values as the elements of a sample to generate a test statistic for the hypothesis H0: μ1 = μ2. The test statistic d̄ / (s_d / √n) has a t_{n-1} distribution if H0 is true.

Example. In a random sample of 100 students in a national examination, their examination mark in English is subtracted from their continuous assessment mark, giving a mean of 5 and a standard deviation of 2. Test at the .01 level of significance if the true mean mark for both components is the same.

Here n = 100, d̄ = 5 and s_d / √n = 2/10 = 0.2, so the test statistic is 5 / 0.2 = 25. The .01 critical value for a Z(0, 1) distribution is 2.58, so H0 is rejected at the .01 level of significance.

Tests for the Variance.

For normally distributed random variables, given H0: σ² = k, a constant, the quantity (n - 1) s² / k has a χ²_{n-1} distribution.

Example. A random sample of size 30 drawn from a normal distribution has variance s² = 5. Test at the .05 level of significance if this is consistent with H0: σ² = 2.

    Test statistic = (29)(5)/2 = 72.5,

while the .05 critical value for χ²_29 is 45.72, so H0 is rejected at the .05 level of significance.

Chi-Square Test of Goodness of Fit

This can be used to test the hypothesis H0 that a set of observations is consistent with a given probability distribution. We are given a set of categories, and for each we record the observed Oj and expected Ej number of observations that fall in it. Under H0, the test statistic Σ (Oj - Ej)² / Ej has a χ²_{n-1} distribution, where n is the number of categories.

Example. A pseudo-random number generator is used to generate 40 random numbers in the range 1 - 100. Test at the .05 level of significance if the results are consistent with the hypothesis that the outcomes are randomly distributed.

Range             1-25   26-50   51-75   76-100   Total
Observed number     6      12      14       8       40
Expected number    10      10      10      10       40

Test statistic = (6-10)²/10 + (12-10)²/10 + (14-10)²/10 + (8-10)²/10 = 4.

The .05 critical value of χ²_3 is 7.81, so the test is inconclusive.
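The goodness-of-fit statistic for that example, sketched:

```python
# Sketch: chi-square goodness-of-fit statistic for the random-number example.
observed = [6, 12, 14, 8]
expected = [10, 10, 10, 10]
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(stat)   # 4.0, below the chi-square(3 df) .05 critical value of 7.81
```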

Chi-Square Contingency Test

To test whether two random variables are statistically independent, a set of observations can be recorded in a table with m rows corresponding to the categories of one random variable and n columns for the other. Under H0, the expected number of observations for the cell in row i and column j is the appropriate row total multiplied by the column total, divided by the grand total. Under H0, the test statistic Σ (Oij - Eij)² / Eij has a χ²_{(m-1)(n-1)} distribution.

Chi-Square Contingency Test - Example

In the following table, the figures in brackets are the expected values.

Results    Maths      History     Geography   Totals
Honours    100 (50)    70 (67)     30 (83)      200
Pass       130 (225)  320 (300)   450 (375)     900
Fail        70 (25)    10 (33)     20 (42)      100
Totals     300        400         500          1200

The test statistic is

    Σ (Oij - Eij)² / Eij = (100-50)²/50 + (70-67)²/67 + (30-83)²/83
                         + (130-225)²/225 + (320-300)²/300 + (450-375)²/375
                         + (70-25)²/25 + (10-33)²/33 + (20-42)²/42
                         = 248.976.

The .05 critical value for χ² with (3-1)(3-1) = 4 degrees of freedom is 9.49, so H0 is rejected at the .05 level of significance.

In general, the chi-square tests tend to be very conservative vis-à-vis other tests of hypothesis, i.e. they tend to give inconclusive results.
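The expected counts and statistic computed from scratch, as a sketch (unrounded expecteds give 249.3 rather than the slide's 248.976, which used the rounded bracketed values):

```python
# Sketch: expected counts and the contingency statistic for the results table.
obs = [[100, 70, 30],
       [130, 320, 450],
       [70, 10, 20]]
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
total = sum(row)
exp = [[row[i] * col[j] / total for j in range(3)] for i in range(3)]
stat = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(3) for j in range(3))
print(round(stat, 1))   # 249.3, far above the chi-square(4 df) .05 value of 9.49
```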

The meaning of the term “degrees of freedom”.

In simplified terms, as the chi-square distribution is the sum of, say, k squares of independent random variables, it is defined in a k-dimensional space. When we impose a constraint of the type that the sums of observed and expected observations in a column are equal, or estimate a parameter of the parent distribution, we reduce the dimensionality of the space by 1. In the case of the chi-square contingency table, with m rows and n columns, the expected values in the final row and column are predetermined, so the number of degrees of freedom of the test statistic is (m - 1)(n - 1).