Transcript, Slide 1 - Institut Pasteur
Multivariate Statistical Analysis Methods
Ahmed Rebaï Centre of Biotechnology of Sfax [email protected]
Basic statistical concepts and tools
Statistics
Statistics is concerned with the 'optimal' methods of analyzing data generated from some chance mechanism (random phenomena). 'Optimal' means an appropriate choice of what is to be computed from the data to carry out the statistical analysis.
Random variables
A random variable is a numerical quantity that, in some experiment involving some degree of randomness, takes one value from some set of possible values. The probability distribution is the set of values that this random variable takes, together with their associated probabilities.
The Normal distribution
Proposed by Gauss (1777-1855) as the distribution of errors (error function) in astronomical observations. Arises in many biological processes. It is the limiting distribution of sums of random variables for a large number of observations.
Whenever a natural phenomenon is the result of many contributing factors, each having a small contribution, you have a Normal distribution.
The Quincunx: a bell-shaped distribution
Distribution function
The distribution function is defined as
F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(t)\, dt
where
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
F is called the cumulative distribution function (cdf) and f the probability distribution function (pdf). \mu and \sigma^2 are respectively the mean of X and the variance of the distribution.
Moments of a distribution
The k-th moment E(X^k) is defined as
\mu'_k = E(X^k) = \int x^k f(x)\, dx
The first moment is the mean.
The k-th moment about the mean is
\mu_k = E(X - \mu)^k = \int (x - \mu)^k f(x)\, dx
The second moment about the mean is called the variance \sigma^2.
Kurtosis: a useful moments' function
Kurtosis \gamma_2 = \frac{\mu_4}{\mu_2^2} - 3 = \frac{\mu_4}{\sigma^4} - 3 = 0 for a normal distribution, so it is a measure of Normality.
Observations
Observations x_i are realizations of a random variable X. The pdf of X can be visualized by a histogram: a graphic showing the frequency of observations in classes.
Estimating moments
The mean of X is estimated from a set of n observations (x_1, x_2, \ldots, x_n) as
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
The variance is estimated by
\mathrm{Var}(X) = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
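As an illustrative sketch (not from the slides), the two estimators above can be computed directly with NumPy; the sample values are made up:

```python
import numpy as np

# A made-up sample of n observations
x = np.array([4.2, 5.1, 3.8, 6.0, 5.4, 4.7])
n = len(x)

# Sample mean: (1/n) * sum of x_i
mean = x.sum() / n

# Sample variance with the 1/(n-1) correction, as in the formula above
var = ((x - mean) ** 2).sum() / (n - 1)

# NumPy's built-ins agree (ddof=1 gives the same 1/(n-1) estimator)
assert np.isclose(mean, x.mean())
assert np.isclose(var, x.var(ddof=1))
```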
The fundamentals of statistics
Drawing conclusions about a population on the basis of a set of measurements or observations on a sample from that population.
Descriptive: draw conclusions based on summary measures and graphics (data driven).
Inferential: test hypotheses we have in mind before collecting the data (hypothesis driven).
What about having many variables?
Let X = (X_1, X_2, \ldots, X_p) be a set of p variables. What is the marginal distribution of each of the variables X_i, and what is their joint distribution? If f(X_1, X_2, \ldots, X_p) is the joint pdf, then the marginal pdf is
f(x_i) = \int \cdots \int f(x_1, \ldots, x_p)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_p
Independence
Variables are said to be independent if
f(X_1, X_2, \ldots, X_p) = f(X_1) \cdot f(X_2) \cdots f(X_p)
Covariance and correlation
Covariance is the first joint moment of two variables about their means, that is
\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y)
Correlation: a standardized covariance
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}
\rho is a number between -1 and +1.
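A minimal numerical sketch of these two formulas, on synthetic data generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)  # correlated with x by construction

# Cov(X,Y) = E(XY) - E(X)E(Y), estimated from the sample
cov = (x * y).mean() - x.mean() * y.mean()

# Correlation: covariance standardized by the two standard deviations
rho = cov / np.sqrt(x.var() * y.var())

# A correlation always lies between -1 and +1
assert -1.0 <= rho <= 1.0
```

The same quantity is returned by `np.corrcoef(x, y)[0, 1]`; the normalization constants cancel, so using population or sample moments gives the identical ratio.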
For example: a bivariate Normal
Two variables X and Y have a bivariate Normal distribution if
f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2} \right] \right\}
\rho is the correlation between X and Y.
Uncorrelatedness and independence
If \rho = 0 (Cov(X, Y) = 0) we say that the variables are uncorrelated. Uncorrelated variables are not necessarily independent; however, if their joint distribution is bivariate Normal, uncorrelatedness implies independence. Two independent variables are necessarily uncorrelated.
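The distinction can be shown numerically. In this illustrative sketch (not from the slides), Y is a deterministic function of X, hence clearly dependent on it, yet the two are nearly uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = x ** 2  # deterministic function of x: certainly NOT independent of x

# Yet Cov(X, X^2) = E(X^3) = 0 for a symmetric distribution,
# so the sample correlation is near zero
rho = np.corrcoef(x, y)[0, 1]
assert abs(rho) < 0.05
```

The pair (X, X^2) is of course not bivariate Normal, which is why zero correlation fails to imply independence here.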
Bivariate Normal
f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2} \right] \right\}
If \rho = 0 then
f(x, y) = \frac{1}{\sqrt{2\pi\sigma_1^2}} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}} e^{-\frac{(y-\mu_2)^2}{2\sigma_2^2}}
So f(x, y) = f(x) \cdot f(y): the two variables are thus independent.
Many variables
We can calculate the covariance or correlation matrix of (X_1, X_2, \ldots, X_p):

C = Var(X) =
| v(x_1)       c(x_1, x_2)  ...  c(x_1, x_p) |
| c(x_1, x_2)  v(x_2)       ...  c(x_2, x_p) |
| ...          ...          ...  ...         |
| c(x_1, x_p)  c(x_2, x_p)  ...  v(x_p)      |

A square (p x p) and symmetric matrix.
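A brief sketch, on made-up data, of how this p x p matrix is computed from a centred data matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))  # n=50 observations, p=3 variables

# p x p covariance matrix: C[i, j] = Cov(X_i, X_j)
Xc = X - X.mean(axis=0)            # centre each column
C = Xc.T @ Xc / (X.shape[0] - 1)   # 1/(n-1) as in the variance estimator

# Square, symmetric, with the variances on the diagonal
assert C.shape == (3, 3)
assert np.allclose(C, C.T)
assert np.allclose(np.diag(C), X.var(axis=0, ddof=1))
```

`np.cov(X, rowvar=False)` computes the same matrix in one call.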
A Short Excursion into Matrix Algebra
What is a matrix?
Operations on matrices Transpose
Properties
Some important properties
Other particular operations
Eigenvalues and Eigenvectors
Singular value decomposition
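The two decompositions listed above can be sketched with NumPy's linear-algebra routines; the matrix here is a made-up symmetric 2 x 2 example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, so its eigenvalues are real

# Eigenvalues and eigenvectors: A v = lambda v
vals, vecs = np.linalg.eigh(A)   # eigh: for symmetric/Hermitian matrices
assert np.allclose(A @ vecs[:, 0], vals[0] * vecs[:, 0])

# Singular value decomposition: A = U S V^T, defined for any rectangular matrix
U, s, Vt = np.linalg.svd(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```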
Multivariate Data
Multivariate Data
Data for which each observation consists of values for more than one variable; for example, each observation is a measure of the expression level of a gene i in a tissue j. Usually displayed as a data matrix.
Biological profile data
The data matrix

X =
| x_11  x_12  ...  x_1p |
| x_21  x_22  ...  x_2p |
| ...   ...   ...  ...  |
| x_n1  x_n2  ...  x_np |

n observations (rows) for p variables (columns): an n x p matrix.
Contingency tables
When observations on two categorical variables are cross-classified, entries in each cell are the number of individuals with the corresponding combination of variable values.
Hair colour \ Eye colour   Blue   Light   Medium   Dark
Fair                        326    688     343      98
Red                          38    116      84      48
Medium                      241    584     909     403
Dark                        110    188     412     681
Multivariate data analysis
Exploratory Data Analysis
Data analysis that emphasizes the use of informal graphical procedures, not based on prior assumptions about the structure of the data or on formal models for the data.
Data = smooth + rough, where the smooth is the underlying regularity or pattern in the data. The objective of EDA is to separate the smooth from the rough with minimal use of formal mathematics or statistical methods.
Reduce dimensionality without losing much information.
Overview of the techniques
Factor analysis
Principal components analysis
Correspondence analysis
Discriminant analysis
Cluster analysis
Factor analysis
A procedure that postulates that the correlations between a set of p observed variables arise from the relationship of these variables to a small number k of underlying, unobservable, latent variables, usually known as common factors, where k < p.
Principal components analysis
A procedure that transforms a set of variables into new ones that are uncorrelated and account for decreasing proportions of the variance in the data. The new variables, named principal components (PCs), are linear combinations of the original variables.
PCA
If the first few PCs account for a large percentage of the variance (say > 70%), then we can display the data in a graphic that depicts the original observations quite well.
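A compact PCA sketch via the singular value decomposition, on synthetic, strongly correlated data (chosen so that the first PC dominates):

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up correlated 2-D data: most variance lies along one direction
z = rng.normal(size=200)
X = np.column_stack([z + 0.1 * rng.normal(size=200),
                     z + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)               # centre the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # the principal components (PC scores)
explained = s**2 / (s**2).sum()       # proportion of variance per PC

# With this construction the first PC captures almost all the variance
assert explained[0] > 0.9
```

The rows of `Vt` are the loading vectors (the linear combinations of the original variables), and the PC scores are uncorrelated by construction.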
Example
Correspondence Analysis
A method for displaying relationships between categorical variables in a scatter plot. The new factors are combinations of rows and columns. A small number of these derived coordinate values (usually two) are then used to allow the table to be displayed graphically.
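As an illustrative sketch (not part of the original slides), the basic CA computation can be run on the hair/eye colour counts shown earlier; the standard recipe is an SVD of the matrix of standardized residuals:

```python
import numpy as np

# Hair (rows) x eye (columns) colour counts from the contingency-table slide
N = np.array([[326, 688, 343,  98],
              [ 38, 116,  84,  48],
              [241, 584, 909, 403],
              [110, 188, 412, 681]], dtype=float)

P = N / N.sum()                      # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses

# Standardized residuals; their SVD gives the CA axes
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

inertia = s**2                       # principal inertia carried by each axis
row_coords = (U * s) / np.sqrt(r)[:, None]   # principal row coordinates

# The first one or two columns of row_coords are what gets plotted
```

Total inertia equals chi-squared / n, which is a useful sanity check on any CA implementation.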
Example: analysis of codon usage and gene expression in
E. coli
(McInerny, 1997). A gene can be represented by a 59-dimensional vector (universal code). A genome consists of hundreds (thousands) of these genes. Variation in the variables (RSCU values) might be governed by only a small number of factors. For each gene and each codon i, calculate RSCU = # observed codons / # expected codons.
Codon usage in bacterial genomes
Evidence that not all synonymous codons were used with equal frequency: Fiers et al., 1975, A-protein gene of bacteriophage MS2, Nature 256, 273-278.
UUU Phe  6   UCU Ser  5   UAU Tyr  4   UGU Cys  0
UUC Phe 10   UCC Ser  6   UAC Tyr 12   UGC Cys  3
UUA Leu  8   UCA Ser  8   UAA Ter  *   UGA Ter  *
UUG Leu  6   UCG Ser 10   UAG Ter  *   UGG Trp 12
CUU Leu  6   CCU Pro  5   CAU His  2   CGU Arg  7
CUC Leu  9   CCC Pro  5   CAC His  3   CGC Arg  6
CUA Leu  5   CCA Pro  4   CAA Gln  9   CGA Arg  6
CUG Leu  2   CCG Pro  3   CAG Gln  9   CGG Arg  3
AUU Ile  1   ACU Thr 11   AAU Asn  2   AGU Ser  4
Multivariate reduction Attempts to reduce a high-dimensional space to a lower-dimensional one.
In other words, it tries to simplify the data set.
Many of the variables might co-vary; therefore there might be only one, or a few, sources of variation in the dataset. A gene can be represented by a 59-dimensional vector (universal code). A genome consists of hundreds (thousands) of these genes. Variation in the variables (RSCU values) might be governed by only a small number of factors.
Plot of the two most important axes, with genes grouped as: lowly-expressed genes, highly-expressed genes, recently acquired genes.
Discriminant analysis
Techniques that aim to assess whether or not a set of variables distinguishes or discriminates between two or more groups of individuals. Linear discriminant analysis (LDA) uses linear functions (called canonical discriminant functions) of the variables giving maximal separation between groups; it assumes that the covariance matrices within the groups are the same. If not, use Quadratic Discriminant Analysis (QDA).
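A minimal two-class LDA sketch on synthetic Gaussian groups sharing a covariance matrix, using Fisher's discriminant direction w = S_w^{-1}(mu_1 - mu_2); the data and group means are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two groups with the same covariance but different means (LDA's assumption)
g1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
g2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))

# Pooled within-group covariance matrix
S1 = np.cov(g1, rowvar=False)
S2 = np.cov(g2, rowvar=False)
Sw = ((len(g1) - 1) * S1 + (len(g2) - 1) * S2) / (len(g1) + len(g2) - 2)

# Fisher's linear discriminant direction
w = np.linalg.solve(Sw, g1.mean(axis=0) - g2.mean(axis=0))

# Classify each point by which side of the projected midpoint it falls on
threshold = w @ (g1.mean(axis=0) + g2.mean(axis=0)) / 2
pred_g1 = g1 @ w > threshold   # True = assigned to group 1
pred_g2 = g2 @ w > threshold
accuracy = (pred_g1.sum() + (~pred_g2).sum()) / 200
```

With the groups this far apart, the linear rule separates them almost perfectly.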
Example: internal exon prediction
Data: a set of exons and non-exons. Variables: a set of features: donor/acceptor site recognizers, octanucleotide preferences for coding regions, octanucleotide preferences for intron interiors on either side.
LDA or QDA
Cluster analysis
A set of methods (hierarchical clustering, K-means clustering, ...) for constructing sensible and informative classifications of an initially unclassified set of data. Can be used to cluster individuals or variables.
Example: Microarray data
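A bare-bones K-means (Lloyd's algorithm) sketch on two synthetic, well-separated blobs standing in for expression profiles:

```python
import numpy as np

rng = np.random.default_rng(5)
# Two made-up, well-separated blobs of 50 points each
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]  # random data points
for _ in range(20):  # Lloyd's algorithm: assign to nearest centre, then update
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# With blobs this far apart, each cluster should be pure
assert (labels[:50] == labels[0]).all() and labels[50] != labels[0]
```

Real implementations add multiple restarts and a convergence test; this sketch just runs a fixed number of iterations.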
Other methods
Independent component analysis (ICA): similar to PCA, but the components are defined as independent and not only uncorrelated; moreover they are neither orthogonal nor uniquely defined.
Multidimensional scaling (MDS): a technique that constructs a low-dimensional geometrical representation of a distance matrix (also called principal coordinates analysis).
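Classical MDS (principal coordinates analysis) can be sketched in a few lines: given only a distance matrix, double-centring and an eigendecomposition recover point coordinates. The four points here are made up:

```python
import numpy as np

# Four points in the plane; MDS sees only their distance matrix
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)

n = len(D)
J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances

vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]           # largest eigenvalues first
coords = vecs[:, order[:2]] * np.sqrt(vals[order[:2]])

# For Euclidean input the recovered 2-D configuration is exact
D2 = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
assert np.allclose(D, D2)
```

The recovered configuration can be rotated or reflected relative to the original points; only the distances are determined.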
Useful books: Data analysis
Useful book: R language