Basic principles of probability theory


Name: Garib Murshudov (when asking questions Garib is sufficient)
e-mail: [email protected]
location: Bioscience Building (New Biology), K065
webpage for lecture notes and exercises
www.ysbl.york.ac.uk/~garib/mres_course/2007/
You can also have a look at the previous year’s lectures: 2006
There will be two types of exercises:
• With numbers. They will be marked.
• With names. You can do them and I will mark them.
You can send questions about this course and other questions I can help with to
the above e-mail address.
Additional materials
• Linear and matrix algebra
– Eigenvalue/eigenvector decomposition
– Singular value decomposition
– Operations on matrices and vectors
• Basics of probabilities and statistics
– Probability concept
– Characteristic/moment generating/cumulant generating functions
– Entropy and maximum entropy
– Some standard distributions (e.g. normal, t, F, chisq distributions)
– Point and interval estimation
– Elements of hypothesis testing
– Sampling and sampling distributions
• Optimisation techniques
– Gradient methods
– Super-linear and second order techniques
Introduction to R
Examples of analysis in this course will be done using R. You can use any package you
are familiar with. However, I may not be able to help in these cases.
R is a multipurpose statistical package. It is freely available from:
http://www.r-project.org/
Or just type R into a Google search. The first hit is usually a link to R.
It should be straightforward to download.
R is an environment (in unix/linux terminology it is some sort of shell) that offers
everything from simple calculations to sophisticated statistical functions.
You can run programs available in R or write your own scripts using these programs. You
can also write a program in your favourite language (C, C++, FORTRAN)
and put it in R.
If you have the mind of a programmer then it is perfect for you. If you have the mind of a
user it gives you very good options to do what you want to do.
Here I give a very brief introduction to some of the commands of R. During the course I
will give some other useful commands for each technique.
To get started
If you are using Windows: Once you have downloaded R (the University already has
it) you can either follow the path Start/Programs/R or, if you have a
shortcut to R, double click that icon. An R window will then open.
If you are using unix/linux/MacOS: After defining the path where the R executables are, just
type R in one of your windows. Usually the path is defined during installation.
Useful commands for beginners:
help.start()
will usually start a web browser and you can start learning. A very useful section is
“An Introduction to R”. There is a search engine also.
To get information about a command, just type
?command
It will give some sort of help (sometimes helpful help).
command
Typing the name of a command without parentheses prints its R script, if available.
Reading these scripts may help you to write your own script or program.
Simple commands: assignment
The simplest command is that of assignment
v=5.0
or
v <- 5.0
the value of the variable v will become 5.0. (Although there are several ways of doing
assignment, I will always use =.)
If you type
v = c(1.0,2.0,10.0,1.5,2.5,6.5)
R will make a vector of length 6.
If you type
v
R will print the value(s) of the variable v.
v=c("mine","yours","his/hers","theirs","its")
will create a vector of characters. The type of a variable is determined on the fly.
To access a particular value of a vector use, for example,
v[1] – the first element
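As an illustration of the indexing described above, here is a small sketch that reuses the numeric vector from the text. The variable name first_three is only illustrative.

```r
# A numeric vector of length 6 (same values as in the text)
v = c(1.0, 2.0, 10.0, 1.5, 2.5, 6.5)

length(v)     # number of elements
v[1]          # the first element (indices start at 1 in R)
v[2:4]        # elements 2, 3 and 4
v[-1]         # everything except the first element
v[v > 2]      # only the elements larger than 2

# The result of indexing can be assigned like any other value
first_three = v[1:3]
```

Negative indices drop elements, and a logical condition inside the brackets keeps only the elements for which it is true.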
To create a matrix
The simplest way to create a matrix is to create a vector and then convert it to a matrix:
c = vector(length=100)
c = 1:100 # the values of c will become the integers from 1 to 100
dim(c) = c(5,20)
c
The dim assignment will work whenever you have a vector whose length equals the product
of the dimensions. The resulting c will be a matrix with dimensions 5x20.
You can also use:
d = matrix(c,nrow=5,ncol=20) or d = matrix(c,nrow=5) or d = matrix(c,ncol=20)
d
Then c will be kept intact and d will become a matrix. You can also give names to the
columns and rows (LETTERS is a built-in vector of the English letters):
rownames(d) = LETTERS[1:5]
colnames(d) = LETTERS[1:20]
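The following sketch shows how the values end up arranged in such a matrix and how named rows and columns can be used for access. The variable names x and d_byrow are only illustrative, and x is used instead of c to avoid confusion with the function c().

```r
x = 1:100
d = matrix(x, nrow = 5, ncol = 20)   # filled column by column
rownames(d) = LETTERS[1:5]
colnames(d) = LETTERS[1:20]

dim(d)          # number of rows and columns
d[2, 3]         # row 2, column 3
d["B", "C"]     # the same element, accessed by name
d[, 1]          # the whole first column

# byrow = TRUE fills the matrix row by row instead
d_byrow = matrix(x, nrow = 5, byrow = TRUE)
d_byrow[1, 2]
```

Because R fills matrices column by column by default, element [2,3] of d is the 12th value of x, while in d_byrow the first row holds the values 1 to 20.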
Simple calculations: arithmetic
Almost all elementary functions are available:
exp(v)
log(v)
tan(v)
cos(v) and others
These functions are applied to all elements of the vector (or matrix). The result has
the same form as the argument: a vector gives a vector and a matrix gives a matrix. It will of
course fail if v is a vector of characters and you are trying to use a function with a
real argument, or if the values are outside the range of the function’s argument space.
Apart from elementary functions there are many built-in special functions like Bessel
functions (besselI(x,n), besselK(x,n) etc.), gamma functions and many others. Just
have a look at help.start() and use “Search engine and Keywords”.
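A short sketch of the element-wise behaviour and of the two failure cases mentioned above. The names ev, back, bad and res are only illustrative; suppressWarnings and tryCatch are used here just so that the example runs to the end.

```r
v = c(1.0, 2.0, 10.0, 1.5, 2.5, 6.5)

ev = exp(v)          # exp is applied to every element
back = log(ev)       # log undoes exp, element by element
all.equal(back, v)

# A value outside the domain of the function gives NaN (with a warning)
bad = suppressWarnings(log(-1))
is.nan(bad)

# A character argument is an error for a numeric function
res = tryCatch(exp("a"), error = function(err) "error")
res
```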
Two commands for sorting
There are two commands for sorting. One of them is
sort(randu[,1])
It just sorts the data in ascending order. It has limited use. Another, more
important one does not sort but creates a vector of indices that corresponds to the
sorted data. That is:
order(randu[,1])
It gives the positions of the data in sorted order. It can now be used to access data in
ordered form. sort(data) and data[order(data)] are equivalent.
randu[order(randu[,1]),]
will reorder the rows of the data so that the first column is sorted.
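The relation between sort and order can be checked directly. This is a minimal sketch using the built-in randu data; the names x, idx and sorted are only illustrative.

```r
data(randu)
x = randu[, 1]

idx = order(x)                # indices that put x in ascending order
identical(sort(x), x[idx])    # sort(x) and x[order(x)] give the same values

# Reorder whole rows so that the first column is ascending;
# the other columns move together with it
sorted = randu[idx, ]
is.unsorted(sorted[, 1])      # FALSE when the column is sorted
```

This is why order is the more useful of the two: the same index vector can be applied to other columns or other objects of the same length.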
Reading from files
The simplest way of reading a table from a file is to use
d = read.table("name of the file")
It will read the table from the file (you may have some problems if you are using
Windows). Do not forget to end the final line with an end-of-line character if you are using
Windows.
scan is also a useful command for reading.
d = scan(file="name of the file")
There are also commands for reading other file formats, for example read.csv and
read.csv2 for comma-separated and semicolon-separated files.
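To try these commands without preparing a file by hand, one can write a small table to a temporary file and read it back. This is only a sketch: the data frame tab and the temporary file are made up for the illustration.

```r
# A small made-up table written to a temporary file
tmp = tempfile(fileext = ".txt")
tab = data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))
write.table(tab, file = tmp, row.names = FALSE)

# read.table returns a data frame; header = TRUE uses the first line as column names
d = read.table(tmp, header = TRUE)
d

# scan returns one long vector, read row by row; skip = 1 jumps over the header line
s = scan(file = tmp, skip = 1, quiet = TRUE)
s

unlink(tmp)   # remove the temporary file
```

read.table keeps the table structure, whereas scan flattens everything into a single numeric vector.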
Built in data
R has numerous built in datasets. You can view them using
data()
You can pick one of them and play with it. It is always a good idea to have a look at what
kind of data you are working with. There are also help pages for R datasets:
data(DNase)
?DNase
It will print information about DNase.
You can list all available data sets using
data(package = .packages(all.available = TRUE))
To take a data set from another package you can load the corresponding library using
library(name of library)
and then you can read the data set. This command will also load all functions in that
library.
Once you have data you can start analyzing them.
Simple statistics
The simplest statistics you can use are the mean, variance and standard deviation
data(randu)
mean(randu[,2])
var(randu[,2])
sd(randu[,2])
will calculate the mean, variance and standard deviation of column 2 of the data
set randu.
Another useful command is
summary(randu[,2])
It gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum values.
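To see what these commands compute, the same quantities can be calculated by hand. This is a minimal sketch; the names y, n, m, v and s are only illustrative. Note that var uses the n-1 denominator.

```r
data(randu)
y = randu[, 2]
n = length(y)

m = sum(y) / n                   # mean
v = sum((y - m)^2) / (n - 1)     # sample variance (n - 1 in the denominator)
s = sqrt(v)                      # standard deviation

# Compare with the built-in commands
all.equal(c(m, v, s), c(mean(y), var(y), sd(y)))
```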
Simple two sample statistics
Covariance between two samples:
cov(randu[,1],randu[,2])
Correlation between two samples:
cor(randu[,1],randu[,2])
When you have a matrix (columns are variables and rows are observations)
cov(randu)
will calculate the covariance between columns
cor(randu)
will calculate the correlation between columns.
If instead rows are variables and columns are observations, you can use the transpose of
the matrix:
cov(t(randu))
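Covariance and correlation are closely related: the correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on randu; the names x, y, r and C are only illustrative.

```r
data(randu)
x = randu[, 1]
y = randu[, 2]

# Correlation from covariance and standard deviations
r = cov(x, y) / (sd(x) * sd(y))
all.equal(r, cor(x, y))

# For a whole data set: the diagonal of the covariance matrix holds the variances
C = cov(randu)
diag(C)

# cov2cor rescales a covariance matrix into a correlation matrix
all.equal(cov2cor(C), cor(randu))
```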
Simple plots
There are several useful plot functions. We will learn some of them during the course.
Here are the simplest ones:
plot(randu[,2])
It plots values against indices. The x axis is the index of a data point and the y axis is its value.
Simple plots: boxplot
Another useful plot is boxplot.
boxplot(randu[,2])
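It produces a boxplot. It is a useful plot that may show extreme outliers and the overall
behaviour of the data under consideration. It plots the median, the 1st and 3rd quartiles, and
the minimum and maximum values (by default, points further than 1.5 times the interquartile
range from the box are drawn separately as outliers). In some sense it is a graphical
representation of the command summary.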
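The numbers behind a boxplot can be inspected without drawing it. This sketch assumes the default settings of boxplot; the names y and b are only illustrative.

```r
data(randu)
y = randu[, 2]

# plot = FALSE returns the numbers that would be drawn
b = boxplot(y, plot = FALSE)

b$stats      # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out        # points drawn separately as outliers (if any)

summary(y)   # compare with the numerical summary
```

The hinges are computed slightly differently from the quartiles reported by summary, so the two may not agree to the last digit.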
Simple plots: histogram
The histogram is another useful plot. It may give some idea about the underlying
distribution.
hist(randu[,2])
will plot a histogram. The x axis is the value of the data and the y axis is the number of
occurrences.
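The counts shown in a histogram can also be obtained as numbers. A minimal sketch, where h is only an illustrative name:

```r
data(randu)

# plot = FALSE returns the histogram as an object instead of drawing it
h = hist(randu[, 2], plot = FALSE)

h$breaks     # boundaries of the bins (x axis)
h$counts     # number of data points in each bin (y axis)

# Every data point falls into exactly one bin
sum(h$counts) == nrow(randu)
```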
Simple plots: qqplot
A useful way of checking whether data obey a particular distribution.
qqnorm(randu[,2])
is useful to see if the distribution is normal. For normal data the plot must be close to a
straight line. For randu it is clearly not, so the data are not normal.
Simple qqplot
Let us test another one: the uniform distribution.
qqplot(randu[,2],runif(1000))
runif is a random number generator for the uniform distribution. It is a useful
command.
The result looks much better: the points lie close to a straight line.
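A qq-plot compares the sorted data with the quantiles of a reference distribution. The sketch below does this by hand with exact quantiles (qunif, qnorm) instead of random numbers, and uses the correlation between the two sets of quantiles as a rough measure of how straight the plot is. That measure and the names y, n, p, q_unif and q_norm are choices made for this illustration only.

```r
data(randu)
y = sort(randu[, 2])      # sample quantiles are the sorted data
n = length(y)

p = ppoints(n)            # probabilities at which to evaluate the quantiles
q_unif = qunif(p)         # quantiles of the uniform distribution on (0,1)
q_norm = qnorm(p)         # quantiles of the standard normal distribution

# The closer to 1, the closer the qq-plot is to a straight line
cor(q_unif, y)
cor(q_norm, y)

# plot(q_unif, y) would draw the qq-plot against the uniform distribution
```

Using exact quantiles avoids the extra random scatter that comes from comparing against a sample produced by runif.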
Further reading
1) “Introduction to R” from package R
2) Dalgaard, P. “Introductory Statistics with R”