Getting started with R

Transcript Getting started with R

Get started with R
Henrik Bengtsson
Mathematical Statistics,
Centre for Mathematical Sciences,
Lund University.
[email protected]
Slides available @ http://www.maths.lth.se/bioinformatics/calendar/20031113/
1 of 40
Why not R?
• Currently there are no good image analysis packages.
Matlab is still better here.
• Currently S-Plus (commercial software similar to R) can
handle larger data set.
• ...yeah, why not?
2 of 40
Why R?
•
•
•
•
•
•
•
•
•
•
•
•
3 of 40
R was written by scientists to be used by scientists.
R is a great language.
Runs on Windows, Mac, Unix, Linux, ...
Easy to install.
R is free.
Hundreds(!) of packages available.
Papers are often published with a R package.
Great help and documentation.
24/7 online mail support.
A great community – friendly and helpful people...
Easy to call Fortran, C, Java, ... libraries
Everyone else is and will be using R
(-Yes, I‘m willing to bet!)
Starting and quitting R
On Windows click on the R icon or as on Unix type R:
markov{hb}: R
R : Copyright 2003, The R Development Core Team
Version 1.7.1 (2003-06-16)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.
R is a collaborative project with many contributors.
Type `contributors()' for more information.
Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.
> 1+2+3
[1] 6
> q()
Save workspace image? [y/n/c]: n
markov{hb}:
4 of 40
Getting help
• Browse the help
> help.start()
> help.search(“linear model”)
Help files with alias or title matching 'linear model' using fuzzy
matching:
anova.glm(base)
anova.lm(base)
glm(base)
family.glm(base)
summary.glm(base)
lm(base)
family.lm(base)
summary.lm(base)
lm.fit(base)
loglin(base)
predict.lm(base)
cv.glm(boot)
glm.diag.plots(boot)
• Getting help on a specific function
> help(lm)
> ?lm
• Searching for help
> help.search(“linear model”)
5 of 40
Analysis of Deviance for Generalized Linear
Model Fits
ANOVA for Linear Model Fits
Fitting Generalized Linear Models
Accessing Generalized Linear Model Fits
Summarizing Generalized Linear Model Fits
Fitting Linear Models
Accessing Linear Model Fits
Summarizing Linear Model Fits
Fitter Functions for Linear Models
Fitting Log-Linear Models
Predict method for Linear Model Fits
Cross-validation for Generalized Linear Models
Diagnostics plots for generalized linear
Running help examples
> help(lm)
> example(lm)
lm> ctl <- c(4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17,
4.53, 5.33, 5.14)
lm> trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03,
4.89, 4.32, 4.69)
lm> group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
lm> weight <- c(ctl, trt)
lm> anova(lm.D9 <- lm(weight ~ group))
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
group
1 0.6882 0.6882 1.4191 0.249
Residuals 18 8.7293 0.4850
lm> summary(lm.D90 <- lm(weight ~ group - 1))
Call:
lm(formula = weight ~ group - 1)
Residuals:
Min
1Q
-1.0710 -0.4938
Median
0.0685
3Q
0.2462
Max
1.3690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
groupCtl
5.0320
0.2202
22.85 9.55e-15 ***
groupTrt
4.6610
0.2202
21.16 3.62e-14 ***
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-Squared: 0.9818,
Adjusted R-squared: 0.9798
F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16
lm> summary(resid(lm.D9) - resid(lm.D90))
Min.
1st Qu.
Median
Mean
-6.661e-16 -6.245e-17 2.776e-17 1.544e-17
3rd Qu.
1.110e-16
lm> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
lm> plot(lm.D9, las = 1)
lm> par(opar)
6 of 40
Max.
3.331e-16
r-help mailing list
• Ask questions on the r-help mailing list.
• Get answers 24/7 (including Xmas day!)
• Sign up on http://www.r-project.org/
• Things to think of before sending a
question:
– Make sure you read the help/docs first.
– Try to add a code example what you are trying to
do.
– Tell what version of R you are using.
7 of 40
A first example
Vectors:
> x <- c(6,5,4,3,2,1)
> x
[1] 6 5 4 3 2 1
> sum(x)
[1] 21
> x[c(1,3,5)]
[1] 6 4 2
> x <- 6:1
> x
[1] 6 5 4 3 2 1
> x[1:3]
[1] 6 5 4
> x[1:3]+x[6:4]
[1] 7 7 7
> y <- c(0,10)
> x+y
[1] 6 15 4 13 2 11
8 of 40
Matrices (default is to fill column by column):
> x <- matrix(1:18, nrow=3)
> x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]
1
4
7
10
13
16
[2,]
2
5
8
11
14
17
[3,]
3
6
9
12
15
18
> x[,3]
[1] 7 8 9
> x[-1,4:6]
[,1] [,2] [,3]
[1,]
11
14
17
[2,]
12
15
18
> str(x)
int [1:3, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
R is multiplying elementwise!
Matrices:
> X <- matrix(1:9, ncol=3, byrow=TRUE)
> X
[,1] [,2] [,3]
[1,]
1
2
3
[2,]
4
5
6
[3,]
7
8
9
> I <- diag(1, nrow=3)
> I
[,1] [,2] [,3]
[1,]
1
0
0
[2,]
0
1
0
[3,]
0
0
1
> I * X
[,1] [,2] [,3]
[1,]
1
0
0
[2,]
0
5
0
[3,]
0
0
9
> I %*% X
[,1] [,2] [,3]
[1,]
1
2
3
[2,]
4
5
6
[3,]
7
8
9
9 of 40
Why? Because it turns out
that it is more common
than vector and matrix
multiplications.
R is looping over missed values
- really useful indeed!
Vectors:
> x
> x
[1]
> y
> y
[1]
> x
[1]
<- 7:1
7 6 5 4 3 2 1
<- 1:3
1 2 3
+ y
8 8 8 5 5 5 2
R will expand the shortest
vector before adding!
# Zero out every 2nd:
> x <- 3:8
> y <- c(1,0)
> x * y
[1] 3 0 5 0 7 0
10 of 40
Matrices:
> x <- matrix(1:16, nrow=3)
Warning message:
Replacement length not a multiple of the
elements to replace in matrix(...)
> x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]
1
4
7
10
13
16
[2,]
2
5
8
11
14
1
[3,]
3
6
9
12
15
2
> y <- c(-1,1)
# Multiply (“moving down the columns”)
> x * y
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]
-1
4
-7
10 -13
16
[2,]
2
-5
8 -11
14
-1
[3,]
-3
6
-9
12 -15
2
A first plot example
>
>
#
>
#
>
x <- seq(from=1, to=2*pi, length=41)
y <- sin(x)
Scatter plot, with red data points of size 2
plot(x,y, col="red", cex=2)
A line with double width through data points
lines(x,y, lwd=2)
Loop over plot attributes:
> col <- c(“red”, “blue”)
> plot(x,y, col=col, cex=1:3, lwd=4)
Write last plot to file:
> dev.print(png, "sin.png")
> dev.print(postscript, “sin.ps”)
(also in menus on Windows)
11 of 40
Questions for Clarification?!
(2 minutes)
12 of 40
Putting names on things
- helps you avoid stupid bugs etc.
Vectors:
> x <- c(87,76.3,1.67)
> x
[1] 87.0 76.3 1.67
> names(x) <- c(“age”, “weight”, “height”)
> x
age weight height
87.0
76.3 1.67
# Alternatively
> x <- c(age=87, weight=76.3, height=1.67)
Why?
> x[“age”]
[1] 87.0
> bmi <- x[“weight”]/x[“height”]^2
cf.
13 of 40
> x[1]
[1] 87.0
> bmi <- x[2]/x[3]^2
Putting names on things
Matrices:
14 of 40
>
>
>
>
>
x1 <- c(87,76.3,1.67)
x2 <- c(78,96.3,1.84)
x3 <- c(45,62.9,1.54)
x <- matrix(c(x1,x2,x3), nrow=3, byrow=TRUE)
x
[,1] [,2] [,3]
[1,]
87 76.3 1.67
[2,]
78 96.3 1.84
[3,]
45 62.9 1.54
> colnames(x) <- c("age", "weight", "height")
> rownames(x) <- c(“jon”, “kim”, “dan”)
> x
age weight height
jon 87
76.3
1.67
kim 78
96.3
1.84
dan 45
62.9
1.54
> x[“jon”,]
age weight height
87.00 76.30
1.67
> x[,c("weight","age")]
weight age
jon
76.3 87
kim
96.3 78
dan
62.9 45
> bmi <- x[,"weight"]/x[,"height"]^2
> bmi
jon
kim
dan
27.35846 28.44400 26.52218
Lists
> x <- c(a=1:2, b=3, c=5:8)
> x
a1 a2 b c1 c2 c3 c4
1 2 3 5 6 7 8
> x <- list(a=1:2, b=3, c=5:8)
> x
$a
[1] 1 2
$b
[1] 3
$c
[1] 5 6 7 8
> x$a
[1] 1 2
> x[[“a”]]
[1] 1 2
> x[2:3]
$b
[1] 3
$c
[1] 5 6 7 8
> x[c(“b”,”c”)]
$b
[1] 3
$c
[1] 5 6 7 8
15 of 40
Vectors and matrices can only contain
one type of data at the time, e.g. either
numbers or strings, but lists can carry
mixed types:
> x <- list(a=1:4, b=4:6+2i, src=“GenePix”)
> x
$a
[1] 1 2 3 4
$b
[1] 4+2i 5+2i 6+2i
$src
[1] "GenePix“
Data frames
Data frames are very powerful. Simply speaking you can treat them as
both lists and matrices:
> df <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45),
weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54))
> df
name age weight height
1 jon 87
76.3
1.67
2 kim 78
96.3
1.84
3 dan 45
62.9
1.54
> df$weight
[1] 76.3 96.3 62.9
> df[,c("name","age")]
name age
1 jon 87
2 kim 78
3 dan 45
> as.matrix(df)
name age weight
1 "jon" "87" "76.3"
2 "kim" "78" "96.3"
3 "dan" "45" "62.9"
16 of 40
height
"1.67"
"1.84"
"1.54"
Importing data
Data in text (ASCII) files can be imported using read.table():
> df <- read.table(“foo.txt”, header=TRUE)
> dim(df)
[1] 221952
6
> summary(df)
slide
spot
R
Min.
:1.00
Min.
:
1
Min.
:
30.0
1st Qu.:1.75
1st Qu.:13873
1st Qu.:
60.0
Median :2.50
Median :27745
Median : 102.0
Mean
:2.50
Mean
:27745
Mean
: 487.5
3rd Qu.:3.25
3rd Qu.:41616
3rd Qu.: 233.0
Max.
:4.00
Max.
:55488
Max.
:65211.0
G
Rb
Gb
Min.
:
24.0
Min.
: 32.00
Min.
: 24.00
1st Qu.:
38.0
1st Qu.: 40.00
1st Qu.: 26.00
Median :
59.0
Median : 49.00
Median : 32.00
Mean
: 194.7
Mean
: 51.85
Mean
: 31.92
3rd Qu.: 110.0
3rd Qu.: 64.00
3rd Qu.: 36.00
Max.
:62341.0
Max.
:1234.00
Max.
:255.00
> str(df)
`data.frame':
221952 obs. of 6 variables:
$ slide: int 1 1 1 1 1 1 1 1 1 1 ...
$ spot : int 1 2 3 4 5 6 7 8 9 10 ...
$ R
: int 4416 335 39 568 42 43 56 40 7912 51 ...
$ G
: int 1533 155 47 211 50 110 64 45 4535 65 ...
$ Rb
: int 39 38 39 39 39 39 39 40 39 39 ...
$ Gb
: int 43 42 43 42 42 44 42 43 42 48 ...
17 of 40
Boxplots and density plots
> df <- read.table(“foo.txt”, header=TRUE)
> dim(df)
[1] 221952
6
> boxplot(df[,c("R","G","Rb","Gb")],
col=c("red","green"))
# Ooops
> boxplot(df[,c("R","G","Rb","Gb")],
col=c("red","green"), ylim=c(0,300))
# Better!
> plot(density(log(df$G,base=2)), col="green",
main="Densities of R and G")
> lines(density(log(df$R,base=2)), col="red")
18 of 40
Summary of “data types”
Vectors (only one type at the time):
> x
> x
[1]
> x
> x
[1]
<- 7:1
7 6 5 4 3 2 1
<- c("a",3,"e")
"a" “3" "e"
Lists (any shape, mixed type):
> x <- list(a=1:4, b=4:6+2i, src=“GenePix”)
> x
$a
[1] 1 2 3 4
$b
[1] 4+2i 5+2i 6+2i
$src
[1] "GenePix“
Matrices (rectangular, only one type at the time):
> x <- matrix(1:18, nrow=3)
> x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]
1
4
7
10
13
16
[2,]
2
5
8
11
14
17
[3,]
3
6
9
12
15
18
> x <- matrix(letters[1:12], nrow=3)
> x
[,1] [,2] [,3] [,4]
[1,] "a" "d" "g" "j"
[2,] "b" "e" "h" "k"
[3,] "c" "f" "i" "l"
19 of 40
Data frames (rectangular, mixed type):
> x <- data.frame(name=c(“jon”,”kim”,”dan”),
age=c(87,78,45), weight=c(76.3,96.3,62.9),
height=c(1.67,1.84,1.54))
> x
name age weight height
1 jon 87
76.3
1.67
2 kim 78
96.3
1.84
3 dan 45
62.9
1.54
Questions for Clarification?!
(2 minutes)
20 of 40
Installing R on Windows
21 of 40
Random numbers
> x <- rnorm(10000, mean=2, sd=4)
> length(x)
[1] 10000
> mean(x)
[1] 2.007904
> sd(x)
[1] 3.969784
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu.
Max.
-12.670
-0.607
2.000 2.008
4.701 15.850
> hist(x)
# or use probabilities (not counts) on y-axis
> hist(x, probability=TRUE)
22 of 40
Defining functions and scripts
> cubic <- function(x) 1 + x^2 - 0.5*x^3
> cubic(0:4)
[1]
1.0
1.5
1.0 -3.5 -15.0
# Read R code from file
> source(“cubic.R”)
> cubic(0:4)
[1]
1.0
1.5
1.0 -3.5 -15.0
> cubic
function(x) {
1 + x^2 - 0.5*x^3
}
23 of 40
Adding data points to an existing plot
> f <- function(x) x^2
> x <- seq(-10,10, by=0.01)
> eps <- rnorm(length(x), sd=4)
> y <- f(x) + eps
# plot() creates a new plot
> plot(x,y)
# points() add data points to
# an existing plot
> points(x,f(1.1*x), col="blue")
#
>
>
>
Same for abline() and others
abline(h=0, col="green")
abline(v=0, col="purple")
abline(a=0, b=1, col="orange")
# Same for curve() with add=TRUE
> curve(f, col="red“, add=TRUE)
(On request by Linda)
24 of 40
Packages
• On CRAN - Comprehensive R Archive
Network – there are today 300+ packages
published!
• Browse CRAN at
http://www.r-project.org/
• Find what you want.
• Dirt simple to install package!
• At the Centre for Mathematical Sciences
we try to keep install and update all
package on our system.
25 of 40
Install a package
• On Windows extremely easy!
• On all systems:
> install.packages(c("adapt","maptools"))
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES'
Content type `text/plain; charset=iso-8859-1' length 12674 bytes
opened URL
downloaded 12Kb
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/adapt_1.0-3.zip'
Content type àpplication/zip' length 39304 bytes
opened URL
downloaded 38Kb
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/maptools_0.3-2.zip
'
Content type àpplication/zip' length 129634 bytes
opened URL
downloaded 126Kb
Delete downloaded files (y/N)? y
updating HTML package descriptions
> library(maptools)
26 of 40
Update packages
• On all systems:
> update.packages()
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES'
Content type `text/plain; charset=iso-8859-1' length 12674 bytes
opened URL
downloaded 12Kb
cluster :
Version 1.7.3 in c:/PROGRA~1/R/rw1071/library
Version 1.7.6 on CRAN
Update (y/N)? y
foreign :
Version 0.6-1 in c:/PROGRA~1/R/rw1071/library
Version 0.6-3 on CRAN
Update (y/N)? y
...
trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/foreign_0.6-3.zip'
Content type àpplication/zip' length 109855 bytes
opened URL
downloaded 107Kb
Delete downloaded files (y/N)? y
updating HTML package descriptions
>
27 of 40
Using/loading a package
• On all systems:
> library(maptools)
# Some package loads other package too
> library(com.braju.sma)
Loading required package: R.oo
R.oo v0.44 (2003/10/29) was successfully loaded.
Loading required package: R.io
R.io v0.44 (2003/10/29) was successfully loaded.
Loading required package: R.graphics
R.graphics v0.44 (2003/10/29) was successfully loaded.
com.braju.sma v0.64 (2003/10/31) was successfully loaded.
> example(MAData)
28 of 40
Questions for Clarification?!
(2 minutes)
29 of 40
Fit a linear model (i)
1. Find function to use:
> help.search(“linear model”)
Help files with alias or title matching 'linear model' using fuzzy
matching:
anova.glm(base)
anova.lm(base)
glm(base)
family.glm(base)
summary.glm(base)
Analysis of Deviance for Generalized Linear
Model Fits
ANOVA for Linear Model Fits
Fitting Generalized Linear Models
Accessing Generalized Linear Model Fits
Summarizing Generalized Linear Model Fits
lm(base)
Fitting Linear Models
family.lm(base)
summary.lm(base)
lm.fit(base)
loglin(base)
predict.lm(base)
cv.glm(boot)
glm.diag.plots(boot)
Accessing Linear Model Fits
Summarizing Linear Model Fits
Fitter Functions for Linear Models
Fitting Log-Linear Models
Predict method for Linear Model Fits
Cross-validation for Generalized Linear Models
Diagnostics plots for generalized linear
models
It says that there is a function called lm() in a package called base.
(Package base is always loaded when R is started)
30 of 40
Fit a linear model (ii)
2. Look at the help:
> help(lm)
31 of 40
Fit a linear model (iii)
3. Test it on our data:
#
>
>
>
>
Data with noise
f <- function(x) 3.3 + 0.85*x
x <- seq(0,10, by=0.01)
eps <- rnorm(length(x), sd=2)
y <- f(x) + eps
# Plot data
> plot(x,y)
# Fit a linear model. See ?lm
> fit <- lm(y ~ x)
> fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)
x
3.2780
0.8386
# Plot fitted line
> abline(fit, col=“red")
32 of 40
Fit a linear model (iv)
4. Investigate fitted line in detail:
> summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
Min
1Q Median
-8.2449 -1.3234 0.1234
3Q
1.2670
Max
6.5321
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.27797
0.12608
26.00
<2e-16 ***
x
0.83863
0.02183
38.41
<2e-16 ***
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.996 on 999 degrees of freedom
Multiple R-Squared: 0.5963,
Adjusted R-squared: 0.5959
F-statistic: 1476 on 1 and 999 DF, p-value: < 2.2e-16
33 of 40
Fit a linear model (v)
5. What does lm() return?
> str(fit)
List of 12
$ coefficients : Named num [1:2] 3.357 0.824
..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
$ residuals
: Named num [1:1001] 1.09757 -0.00305 0.95878 1.64874 -2.59366 ...
..- attr(*, "names")= chr [1:1001] "1" "2" "3" "4" ...
$ effects
: Named num [1:1001] -236.55
75.33
-0.99
1.62
-2.63 ...
..- attr(*, "names")= chr [1:1001] "(Intercept)" "x" "" ""
...
$ rank
: int 2
$ fitted.values: Named num [1:1001] 3.36 3.36 3.37 3.38
3.39 ...
..- attr(*, "names")= chr [1:1001] "1" "2" "3" "4" ...
$ assign
: int [1:2] 0 1
$ qr
:List of 5
..$ qr
: num [1:1001, 1:2] -31.6386
0.0316
0.0316
0.0316
0.0316 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:1001] "1" "2" "3" "4" ...
[ SNIP ]
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=length 10 <environment>
.. .. ..- attr(*, "predvars")= language list(y, x)
attr(*, "class")= chr "lm"
34 of 40
Fit a linear model (vi)
6. Related functions etc.
> res <- residuals(fit)
> hist(res)
> qqnorm(res)
35 of 40
Questions for Clarification?!
(2 minutes)
36 of 40
Local Polynomial Regression Fitting (LOESS)
# Read microarray data
> df <- read.table("foo.txt", header=TRUE)
# Use only R & G from slide 1
> df <- df[df$slide == 1, c(“R”,”G”)]
# Get the log-ratios and log-intensities
> M <- log(df$R/df$G, base=2)
> A <- log(df$R*df$G, base=2)/2
# Plot M vs A
> plot(A,M)
#
>
>
>
Fit a LOESS curve
fit <- loess(M ~ A)
Mcurve <- predict(fit, A)
points(A, Mcurve, col="red")
# Remove trend
> M <- M - Mcurve
> points(A,M, col=”blue”)
37 of 40
Column-by-column standardization of matrix
# Subract the mean and scale by the std. dev.
> zscore <- function(y, ...) {
+ (y - mean(y, ...)) / sd(y, ...)
+ }
# Generate data
> x <- matrix(rnorm(1000, mean=2, sd=10), ncol=2)
#
#
#
>
Standardize data. If there are NA values, pass argument
na.rm=TRUE to both mean() and sd() via ..., which tells
them to remove NAs first)
z <- apply(x, MARGIN=2, FUN=zscore, na.rm=TRUE)
# Create a vector with the mean and the std. dev. of the first column
> c(mean(x[,1]),sd(x[,1]))
[1] 2.44114 10.25328
> c(mean(z[,1]),sd(z[,1]))
[1] -1.616485e-16 1.000000e+00
(On request by Halfdan)
38 of 40
Links
http://www.r-project.org/
- homepage of the R project
- download and install R
- sign up on the r-help mailing list
- more manuals and examples.
- additional packages (CRAN)
- conferences and other large-scale projects
- ...
http://www.maths.lth.se/help/R/
- Help @ MC
- Local packages
Although I’m finishing up my thesis... I can help out, but please make
sure you read the help first. Don’t forget about the r-help list. Thanks!
39 of 40
Session open for any questions
(remaining time)
40 of 40

Getting started with R

Transcript Getting started with R

Directory