R from step 1 for the experimental biologist with an eye

Download Report

Transcript R from step 1 for the experimental biologist with an eye

To err is human – to R is divine
R from step 1 for the
experimental biologist with an
eye on the tomoRRow!
Schraga Schwartz, Bioinformatic Workshop, June 2009
Outline
• Why R
• How R
• iRis
• Down syndRome
• Where R
Why
?
R programming
language is a lot
like magic...
except instead of
spells you have
functions.
R from step 1 for the
experimental
biologist with an eye
on the tomoRRow!
=
muggle
SPSS and Excel users are like muggles. They are limited in
their ability to change their environment. The way they approach
a problem is constrained by how SPSS/Microsoft employed
programmers thought to approach them. And they have to pay
money to use these constraining softwares.
=
wizard
R users are like wizards. They can rely on functions (spells) that
have been developed for them by statistical researchers, but
they can also create their own. They don’t have to pay for the
use of them, and once experienced enough (like Dumbledore),
they are almost unlimited in their ability to change their
environment.
R’s strengths
•
•
•
•
•
•
Data management & manipulation
Statistics
Graphics
Programming language
Active user community
Free!
R’s weakness
•
•
•
•
Not user friendly at start.
Minimal GUI.
No commercial support
Substantially slower than programming
languages (e.g. perl, java, C++).
R graphics: the sky's the limit!
http://addictedtor.free.fr/graphiques/
How R?
R as a calculator
• Calculator
+, -, /, *, ^, log(), exp(), sqrt(), …:
(17*0.35)^(1/3)
log(10)
exp(1)
3^-1
Variables in R
• Variables are assigned using either “=“ or “<-”
x=12.6
x
[1] 12.6
Numeric vectors
A vector composed of numbers. Such a vector may be created:
1. Using the c() (short for concatenate) function:
y=c(3,7,9,11)
> y
[1]
3
7
9 11
2. Using the rep(what,how_many_times) function:
y=rep(3,30)
3. Using the “:” operator, signifiying “a series of integers between”
y=1:30
Vector manipulation
n=c(1,4,5,6,7,2,3,4,5,6) #creates a vector with the
numbers in the brackets, stores it in y
length(n) #number of elements
n[3] #extract 3rd element in y
n[-2] #extract all of y but 2nd element
n[1:3] #extract first three element of y
n[c(1,3,4)] #extract first, third, and fourth
element of y
n[n<4] #extract all elements in y smaller than 4
n[n<4 & n!=1] #extract element smaller than 4 AND
different from 1
n[n<4 | n!=1] #extract element smaller than 4 OR
different from 1
Vector manipulation…
n+1 #add 1 to all elements in y
n*2 #multiply by two all elements in y
sum(n)
mean(n)
median(n)
var(n)
min(n)
max(n)
log(n) #extract logs from all variables in y
sum(n[n<4]) #number of elements in y with
values smaller than 4
Fuctions in R
- Functions are bits of code which receive
something as input (termed: arguments),
and produce something as output (termed:
return value).
- A function can be recognized by the round
brackets "()" following the function name.
- The arguments of the "mean" function is a
vector of numbers; the return value is their
average.
Basic visualization of numbers
•barplot(n)
•plot(n)
•hist(n)
•boxplot(n)
•pie(n)
0
1
2
3
4
5
6
7
barplot(n,col="red")
4
3
2
1
y
5
6
7
plot(n,col="red")
2
4
6
Index
8
10
hist(n,col="red")
1.0
0.5
0.0
Frequency
1.5
2.0
Histogram of y
1
2
3
4
y
5
6
7
1
2
3
4
5
6
7
boxplot(n,col="red")
pie(n[1:3])
2
1
3
Help in R
Click ? + function_name.
? barplot
Help pages contain the following components:
- function_name(package) – if the package is not installed, this is the
time to install it and call it (using "library")
- Description: brief overview
- Usage
- Description of arguments (input)
- Details: more information
- Value: value returned by the function (output)
- See also: great way to learn new stuff you didn't even know you
wanted to do!
- Examples: Can be copy-pasted as is! Highly informative!
Other vectors
Character vectors:
nms=c("miriam","schragi","chaim","joc
hanan","ephraim","avraham","yemima",
"shakked","ayala","adi")
names(n)=nms #giving names to each
value in numeric vector y
n["shakked"]
Class Exercise: Redraw some of the previous plots
with modified n!
The paste() function
Concatenates different characters into a
single character, separated by the variable
defined by sep argument (default: sep=" ")
paste("To","err","is human.","To
R is","divine!",sep="_")
Boolean vectors!
Boolean vectors:
b=c(TRUE,FALSE,TRUE,FALSE,TRUE,TRU
E)
Factor vectors (We love factors!)
f=as.factor(c("stupid","stupid","s
mart","stupid","imbecile","smart
","smart","imbecile"))
levels(f) #possible values a
variable in y can have
summary(f) #provides the number of
time each factor occurs
Class Exercise: Compare summary(n),
summary(b), and summary(f) – note difference in
output!
The data.frame Class
(We also love data.frames!)
• A data.frame is simply a table
• Each column may be of a different class
(i.e. one column may be numeric, another
may be a character, a third may be
boolean and a fourth may be a factor)
• All rows in a given column must be of the
same class
age
gender
disease
50
M
TRUE
43
M
FALSE
• The number of rows in each
25
F
TRUE
18
M
TRUE
column must be identical.
72
F
FALSE
65
45
M
F
FALSE
TRUE
Iris database
Petal (‫)עלה כותרת‬
Sepal (‫)עלה גביע‬
The iris dataset
The fascinating questions
• What are typical lengths and widths of
sepals and petals?
• Do these change from one family of irises
to another?
• Do longer petals tend to be wider?
• Do longer petals tend to correlate with
longer (or wider) sepals?
• Do such correlations change from one
family of irises to another?
Playing with data frames - I
1. Set the work directory to the directory
you're working in:
setwd("F:/presentations/R presentation")
(Note: getwd() tells you which directory you're in)
2. Load the table you want to work with
(make sure you saved it as tab delimited
file!):
ir=read.table(file="iris_dataset.txt",sep="\t",header=T)
#loads iris_dataset.txt into variable "ir". Assumes that
the file is tab delimited, and that the first line is a
header.
Playing with data frames II
class(ir) #shows the class of ir
dim(ir) #returns the number of rows and
columns in ir
ir[1,2] #first line, second column in ir
ir[1,] #all columns in first line in ir
ir[,1] #all rows in first column of ir
ir$seplen #same as above
ir[,"seplen"] #same as above
ir[,c("seplen","sepwid")] OR ir[,1:2] #first
two columns of ir
summary(ir) #each of the columns is
summarized according to its class
Playing with data frames - III
ir$seplen>6 #returns a boolean vector with TRUE and
FALSE values depending on whether seplen is
greater than 6
ir[ir$seplen>6,] #returns a subset of ir containing
all columns of all rows in which seplen is greater
than 6
ir[ir$seplen>6,c("seplen","sepwid")] #returns same
rows as above, but only "seplen" and "sepwid"
columns
ir[ir$seplen>6 & ir$sepwid >3,c("seplen","sepwid")]
#returns same columns as above, but only rows in
which seplen is greater than 6 and sepwid is
greater than 3
Visualization
hist(ir$seplen) #histogram of
seplen
15
10
5
0
Frequency
20
25
30
Histogram of ir$seplen
4
5
6
ir$seplen
7
8
Visualization - II
hist(ir$seplen,30) #histogram of
seplen
6
4
2
0
Frequency
8
10
Histogram of ir$seplen
4.5
5.0
5.5
6.0
ir$seplen
6.5
7.0
7.5
8.0
Visualization - III
mean_seplen=mean(ir$seplen)
hist(ir$seplen,20,col="light blue",main="Distribution of
Septal lengths",xlab="Lengths of septal
(cm)",sub=paste("Mean septal length is",mean_seplen))
10
5
0
Frequency
15
Distribution of Septal lengths
5
6
7
Lengths of septal (cm)
Mean septal length is 5.84333333333333
8
The tapply() function
age
50
43
25
18
72
65
45
Suppose you want to obtain average
ages of patients (a numeric) variable,
as a function of their gender (a factor)
variable. And suppose the data is stored
in the data frame data. The magic spell is:
tapply(data$age,data$gender,mean)
gender
M
M
F
M
F
M
F
disease
TRUE
FALSE
TRUE
TRUE
FALSE
FALSE
TRUE
The tapply function – receives three parameters:
- A numeric distribution
- A factor variable, dividing the numeric distribution into
groups
- A function (mean,min,max,sd,sum)
Visualization - IV
mean_per_species=tapply(ir$seplen,ir$species,mean)
#calculates the mean value of ir$seplen after dividing it
into three groups based on ir$species
0
1
2
3
4
5
6
barplot(mean_per_species,col="red")
setosa
versicolor
virginica
Adding packages
Select mirror
Select library
Class exercise
Install the following three libraries:
gplots, lattice,car
These libraries will be used in subsequent
examples.
Visualization - V
5
4
3
2
1
0
Mean septal lengths
6
7
sd_per_species=tapply(ir$seplen,ir$species,sd) #caculate
standard deviation
library(gplots) #loads all functions in gplots into
workspace (including the barplot2 function)
barplot2(mean_per_species, plot.ci = T, ci.l =
mean_per_species-sd_per_species, ci.u =
mean_per_species+sd_per_species,col="red",ylab="Mean
septal lengths")
setosa
versicolor
virginica
Visualization - VI
6.0
5.5
5.0
Sepal length
6.5
library(gplots)
plotmeans(ir$seplen~ir$species,xlab="species",ylab="
Sepal length")
n=50
n=50
n=50
setosa
versicolor
virginica
species
Looking at correlations
1.5
1.0
0.5
ir$petwid
2.0
2.5
plot(ir$petlen,ir$petwid) #plotting one set of
numbers as a function of another
1
2
3
4
ir$petlen
5
6
7
Arguments of the plot function
Some parameters of plot() function (get more by typing "?
plot.default"):
x – x values (defaults 1:number of points)
y – the distribution
type – type: can be either "l" (line), "p" (points) or more
pch – type of bullets (values from 19-25)
col – color (either numbers of names of colors) – can receive
multiple colors
lwd – line width
lty – line type
xlab,ylab – X and Y labels
main, sub – main title (top of chart) and subtitle (beneath the X
label)
More sophisticated plotting
1.5
1.0
0.5
Petal length
2.0
2.5
plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),p
ch=19,xlab="Petal width",ylab="Petal length")
1
2
3
4
Petal width
5
6
7
And more sophisticated plot, with
legend and P values
stat=cor.test(ir$petlen,ir$petwid)
rval=stat$estimate
pval=stat$p.value
plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="
Petal width",ylab="Petal length",main=paste("R=",rval," ;
P=",pval,sep=""))
legend(x="topleft",legend=levels(ir$species),col=1:3,lty=1,lwd=2)
#adding a legend
2.5
R=0.962865431402796 ; P=0
1.5
1.0
0.5
Petal length
2.0
setosa
versicolor
virginica
1
2
3
4
Petal width
5
6
7
Plotting correlations as a function
of a third factor variable
library("lattice")
xyplot(ir$seplen ~ ir$sepwid | ir$species)
virginica
8
7
6
ir$seplen
5
setosa
versicolor
8
7
6
5
2.0
2.5
3.0
3.5
4.0
4.5
ir$sepwid
Looking at everything as a function
of everything else
pairs(iris[,1:4])
pairs(iris[,1:4],col=iris$Species,upper.panel=NULL)
Sepal.Length
4.0
4.5 5.5 6.5 7.5
4.5 5.5 6.5 7.5
2.5 1 2 3 4 5 6 7 2.0
3.0
Sepal.Width
1.5
1.5
2.5
Petal.Length
0.5
0.5
Petal.Width
4.5 5.5 6.5 7.5
2.0
3.0
4.0
1 2 3 4 5 6 7
0.5
1.5
2.5
Even more sophisticated…
library(car)
scatterplot.matrix(iris[,1:4],groups=iris$Species,ellipse=
T,levels=0.95,upper.panel=NULL, smooth=F)
Sepal.Length
||||||||||||||||||||||||
||||||||||||||||||||||||| || |
4.0
4.5 5.5 6.5 7.5
4.5 5.5 6.5 7.5
2.5 1 2 3 4 5 6 7 2.0
3.0
Sepal.Width
| || | | | | | | | | | | | |||| | | | | | | |
Petal.Length
| |||||||||||||||||||||
|||||
||||||||||||| |||
2.5
||||||||||||||
1.5
1.5
Petal.Width
setosa
versicolor
0.5
0.5
virginica
|| | | | |
||
4.5 5.5 6.5 7.5
2.0
3.0
4.0
1 2 3 4 5 6 7
0.5
| | | | |||| | | | | | | | | | |
1.5
2.5
And more (for the highly motivated
or extremly bored…)
upperpanel.cor <- function(x, y,method="pearson",digits=2,...) {
points(x,y,type="n");
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1));
correl <- cor.test(x, y,method=method);
r=correl$estimate;
pval=correl$p.value;
color="black";
if (pval<0.05) color="blue";
txt <- format(r,digits=2)
pval <- format(pval,digits=2)
txt <- paste("r=", txt, "\npval=",pval,sep="")
text(0.5, 0.5, txt,col=color)
}
scatterplot.matrix(iris[,1:4],groups=iris$Species,ellipse=T,level
s=0.95,upper.panel=upperpanel.cor,cex=0.3,smooth=F,main="This
is cool!!!")
Final output
This is cool!!!
2.5
3.0
3.5
4.0
0.5
1.0
1.5
2.0
2.5
r=0.87
pval=0
r=0.82
pval=0
r=-0.43
pval=4.5e-08
r=-0.37
pval=4.1e-06
4.5
r=-0.12
pval=0.15
5.5
Sepal.Length
6.5
7.5
2.0
|| || | || | || || | || || | || || | || | || || | | || |
2.0 2.5 3.0 3.5 4.0
Sepal.Width
| | | | | | | | | | | | | | | | | | | | |
|
7
|
| | || || || ||| |||| || ||||| || || ||| || || || | | |
0.5 1.0 1.5 2.0 2.5
| ||| |||| ||| |
1
2
3
r=0.96
pval=0
4
5
6
Petal.Length
Petal.Width
setosa
versicolor
virginica
| | | | | |
4.5
5.0
5.5
6.0
6.5
7.0
7.5
8.0
1
2
3
4
5
6
7
| | | | | | | | | | | | | | | |
Saving Graphics to Files
• Before running the visualizing function, redirect all
plots to a file of a certain type. Possibilities:
–
–
–
–
jpeg(filename)
png(filename)
pdf(filename)
postscript(filename)
• After running the visualization function, close
graphic device using dev.off()
Saving graphics
Example:
pdf("F:/test.pdf")
barplot(1:10,col="red")
dev.off()
Note:Different graphic functions can also
receive arguments regarding width and
height of canvas. Use "?" + function name
(e.g. ?jpeg to obtain arguments)
Statistics
•
•
•
•
•
t.test #Student t test
wilcox.test #Mann-Whitney test
kruskal.test #Kruskal-Wallis rank sum test
chisq.test #chi squared test
cor.test #pearson/spearman correlations
• lm(),glm() #linear and generalized linear models
• p.adjust #adjustment of P values to multiple
testing using FDR, bonferroni, or whatnot…
Down Syndrome
The fascinating research question
Do genes from any particular
chromosome alter their
expression levels in Down
syndrome?
GEO database: A paradise of
numbers
Getting the data!
Loading the data (look at it first!)
setwd("F:/Presentations/R presentation/")
#sets the work directory
a=read.table(file="GSE5390_series_matrix.txt
",sep="\t",header=T,comment.char="!")
#loads the gene expression values and
stores them in a
names(a)=c("id","down1","down2","down3","dow
n4","down5","down6","down7","healty1","hea
lty2","healty3","healty4","healty5","healt
y6","healty7","healty8") #give informative
names to columns in a
The merge() function
a=
name
inky
dinky
Jose
Oleg
Jesus
Cleopatra
Stanislav
age
50
43
25
18
72
65
45
gender
M
M
F
M
F
M
F
name
dinky
Cleopatra
Stanislav
disease
TRUE
FALSE
TRUE
TRUE
FALSE
FALSE
TRUE
age
43
65
45
b=
gender
M
M
F
name
dinky
Cleopatra
Stanislav
disease
FALSE
FALSE
TRUE
IQ
43
750
565
IQ
43
750
565
merge(a,b,by="name") OR merge(a,b,by.x="name",by.y="name")
Merging data
convert=read.table(file="convert_affyprobes_
2_chromosome_location_from_UCSC.txt",sep="
\t",header=T)
b=merge(a,convert,by="id") #merges a and
convert by the columns indicated by the by
arguments. In other words, the column "id"
in "a" is compared to the column "id" in
"convert". Only lines in which the two
values are identical are retained,
yielding a new data frame with shared
values & shared information.
Assign informative names
downcols=2:8
healthycols=9:16
allarraycols=c(downcols,healthycols)
Calculate Fold Change between
disease and healthy
Step 1: calculate mean expression values for all patients with Down
syndrome
b$meandown=apply(b[,downcols],1,mean)
Step 2: calculate mean expression values for all healthy subjects
b$meanhealthy=apply(b[,healthycols],1,mean)
Step 3: Calculate difference between the two (since data is log
transformed)
b$dif=b$meandown-b$meanhealthy
Step 4: anti-log the fold change
b$foldchange=2^b$dif
Calculate P values
Step 1: Create function which receives a line as input, and
knows how to break it up into disease and control groups
and yield a p value
GetPval=function(line) {
ttest=t.test(line[downcols-1],line[healthycols1])
ttest$p.value
}
Step 2: Apply this function to all rows of the data frame
b$pval=apply(b[,allarraycols],1,GetPval)
Step 3: Adjust P value to multiple testing
b$adjustedPval=p.adjust(b$pval,method="fdr")
Saving data frames to a file
write.table(b,file="DownWithPvals.txt",sep="
\t",row.names=F,col.names=T) #generates a tabdelimited file with column names, without row names
containing the data in the data frame b
Finding significant events
sigs=b[b$foldchange>1.75 &
b$adjustedPval<0.01,] #finding events with significant
fold change and significant P values
sigs=sigs[order(sigs$adjustedPval,decreasing=T)
,] #sorting table based on P values
Finding and plotting % significantly
over/under expressed genes per
chromosome
percentages=summary(sigs$chr)*100/sum
mary(b$chr) #divides the number of
times each chrosome appears in
"sigs" by number of time it appears
in original data
barplot(percentages,las=3,col="light
blue",ylab="% significant
genes",main="To R is divine!")
#barplot depicting the percentage of
genes from each chromosome within
sig
chr1
chr1_random
chr10
chr11
chr12
chr13
chr13_random
chr14
chr15
chr15_random
chr16
chr16_random
chr17
chr17_random
chr18
chr19
chr19_random
chr2
chr2_random
chr20
chr21
chr21_random
chr22
hr22_h2_hap1
chr22_random
chr3
chr3_random
chr4
chr4_random
chr5
chr5_h2_hap1
chr5_random
chr6
hr6_cox_hap1
chr6_qbl_hap2
chr6_random
chr7
chr7_random
chr8
chr8_random
chr9
chr9_random
chrM
chrX
chrX_random
chrY
0
1
2
3
% significant genes
4
5
R - for a better tomoRRow!
Even better plot…
validchrs=c(paste("chr",1:22,sep
=""),"chrX","chrY")
percentages=percentages[validchr
s]
barplot(percentages,las=3,col="l
ight blue",ylab="% significant
genes",main="R - for a better
tomoRRow!")
chrY
chrX
chr22
chr21
chr20
chr19
chr18
chr17
chr16
chr15
chr14
chr13
chr12
chr11
chr10
chr9
chr8
chr7
chr6
chr5
chr4
chr3
chr2
chr1
0
1
2
3
% significant genes
4
5
Results…
To R is divine!
Volcano plots: P values as a
measure of fold change
plot(log2(b$foldchange),log2(b$pval),col=(b$chr=="chr21")+1,pch=19,xlab="log foldchange",ylab="-log P value")
legend(x="topleft",legend=c("non chr 21","chr
21"),lty=1,col=1:2,lwd=3)
abline(h=-log2(0.001),col="blue",lty=3)
abline(v=c(log2(1.75),-log2(1.75)),col="blue",lty=3)
text(2,17,"Significantly\nOver-represented",col="blue")
text(-1.4,17,"Significantly\nUnder-represented",col="blue")
abline() function: adds either horizontal or vertical line/s (as well as more
sophisticated stuff as well), depending on whether the "h" or "v" arguments
are populated
text() function: receives x,y coordinates
|on plot, as well as text to plot
Volcano plot
A particular R strength: genetics
• Bioconductor is a suite of
additional functions and
some 200 packages
dedicated to analysis,
visualization, and
management of genetic
data
• Much more functionality
than software released by
Affy or Illumina
Where R?
R homepage: http://www.rproject.org/
Choose server…
Click on “Windows”
Click “base”
Click on “Download” link and follow
installation guidelines…
There you R!
Installing Tinn-R
• Go to: http://www.sciviews.org/Tinn-R/
• Scroll to bottom of page
Loading R from within Tinn-R
Final Tips
• Use http://www.rseek.org/ & google for finding help on
what you want
• Know your objects’ classes: class(x)
• Know your functions arguments. Use "?
function_name" to learn what arguments a function
receives & what its return values are.
• Each help files provides examples, which can be
copy-pasted into R as is. Extremely useful!
• MOST IMPORTANT - the more time you spend using
R, the more comfortable you become with it. DESPAIR
NOT – and you will never look back!
Final Words of Warning
• “Using R is a bit akin to
smoking. The beginning is
difficult, one may get
headaches and even gag the
first few times. But in the long
run,it becomes pleasurable
and even addictive. Yet, deep
down, for those willing to be
honest, there is something not
fully healthy in it.” --Francois
Pinard
R
Thank you!
May the R be
with you!
Todo
• multiple panels
• lists, loops, lapply, sapply
• regular expressions