R BootCamp - Washington University in St. Louis

R BootCamp
07/31/2012
Jingqin Luo
Course Overview
• Modules
– Module 1 (07/31 pm, 1:30~5:00): R basics
– Module 2 (08/01 am, 9:00~12:00): statistical
modeling in R
– Module 3 (08/01 pm , 1:30~5:00): R graphics and
advanced topics
• Instructors: Jingqin (Rosy) Luo, Jeff Gill, Tsung-han
Tsai
• R primer website:
http://artsci.wustl.edu/~jgill/biostat.R.primer.html
• Issues: survey, laptop/R installation, wireless timeout
Module 1 outline
• Downloading and Installing R
• Setting up help
• Basic syntax
• Data types and data structures
• Basic operations
• Data import and export
• Subsetting and managing data
• Quitting R and saving R objects
R: Statistical Computing and Graphics
• R webpages
– CRAN : http://cran.r-project.org
– WashU mirror: http://cran.wustl.edu
– Bioconductor: http://bioconductor.org
• Platform independent (Windows/Linux/Mac)
• R manuals: CRAN, Documentation → Manuals
• R: open source
– R BASE and some recommended packages
– Optional contributed packages: go to CRAN, Software → Packages
Install R
• CRAN: http://cran.r-project.org
• “Download and Install R” section in
the middle
–Linux
–Mac
–Windows: base
Start R
• Start R in Windows
• Start R in Shell: type “R”
• Start R in Emacs: ESS
• R console
– prompt “>” (can be changed)
R Help
• Windows R: Help menu
• Start the help browser in R session:
> help.start()
• packages
• “Search engine and keyword”
• help on a specific function
> help(plot)
> ?plot
• Pattern search: > help.search("wilcoxon")
• Help on a package: > help(package="MASS")
• CRAN search page: go to the CRAN webpage
“CRAN” → “Search” → “R site search”
• Google
Running R
• Script and R session
– Start R R session
– Open a new script : File  New script (.R)
– Windows  Tile vertically
• R working directory:
> getwd()
> dir.create(“C:/R_Primer”)
> setwd(“C:/R_primer”)
• Submit scripts:
– Copy & paste
– Highlight & “Ctrl+c+r”
• Repeat previous command (up arrow)
R Environment
• Comment line: #
• Object :
– store value(s)/object to an object with a name
– Naming conventions
• meaningful
• Can’t start with numbers
• Can use “.” or “_” for connecting long name
– Case sensitive
• Assign value/object etc. to an object: =, <-, ->
– inherit values, class, property
> a = 2
> b <- 2
> 2 -> c
> assign("d", 2)
• One command per line, or multiple commands on one line (separated by ;)
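The assignment forms listed above can be tried side by side. This is a minimal sketch; the object names (c_val, x.long_name, and so on) are made up for illustration.

```r
# Four equivalent ways to bind the value 2 to a name
a = 2             # equals sign
b <- 2            # left arrow (the most common R style)
2 -> c_val        # right arrow
assign("d", 2)    # name supplied as a character string

same_value <- identical(a, b) && identical(b, c_val) && identical(c_val, d)
print(same_value)

# Names are case sensitive: A is a different object from a
A <- 10
print(a == A)

# Several commands on one line, separated by ';'
x.long_name <- 1; y <- x.long_name + 1; print(y)
```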
Objects in memory
• List all objects in the workspace:
> ls()
> ls(pattern="a")
• Remove objects
> rm(a, b)
> rm(list=ls(pattern="a"))
> rm(list=ls())
Print to Screen: Cat and Print
• cat: prints scalars/vectors; you supply the separator and the line break
> cat("Hello, world! \n")
> a <- 2
> cat(a, "\n")
> cat("a=\t", a, "\n")
• print: prints any object; adds the line break automatically
> print(a)
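To see the difference between the two, the output of each call can be captured as text. A small sketch, using capture.output() only to make the printed text inspectable; the variable names are illustrative.

```r
a <- 2
cat("a =\t", a, "\n")    # cat pastes its arguments together; no [1] index, no quotes
print(a)                 # print adds the [1] index and the line break itself

v <- c(1.5, 2, 3)
cat_out   <- capture.output(cat(v, sep = ", "))   # text produced by cat
print_out <- capture.output(print("Hello"))       # text produced by print

cat_out
print_out
```

Here cat_out holds the bare values joined by the separator, while print_out keeps the index prefix and the quotes around the string.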
Quitting R
• Save command history/working objects
> savehistory(file="module1.Rhistory")
> save.image(file="module1.RData")
• Quit R: > q()
– save workspace image? yes/no/cancel (yes → two files with
default names)
• “.RData”: a binary file storing all objects
• “.Rhistory”: a text file storing the command
history
• .RData: can only be opened by R
– right click to open by R
– Load in saved objects by script
> load(".RData")
automatically load all saved objects into
memory
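A save/load round trip can be checked without touching the working directory. This sketch writes to a temporary folder; the file name module1_demo.RData is hypothetical, and save() is used for two named objects where save.image() would store everything.

```r
rdata_file <- file.path(tempdir(), "module1_demo.RData")  # hypothetical file name

a <- 2
fruit <- c("apple", "banana")

save(a, fruit, file = rdata_file)   # store selected objects in a binary file
rm(a, fruit)                        # remove them from memory

restored <- load(rdata_file)        # load() returns the names it restored
print(restored)
print(a)
print(fruit)
```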
Ex1: Start R, cat and print
1. Start Windows R
2. File → open a new script and name it
“Rprimer_module1.R”
3. In the .R file, write/implement the following
– Add a comment line by typing “# Module 1: Ex 1”
– Write script to create and change the R working
directory to “C:/R_Primer”
– write the scripts as covered in the previous slides
– Find relevant info on the R function “heatmap” and
the R package “randomForest”
4. Save the .R file, quit R and say “yes” to save the history
and objects
5. Now start R again, reopen the previous script, and check whether
the objects are still there ( ls() )
Data Structures in R
Data Structures
• Data storage mode: numeric, character, factor, logical, date
• Main types of data structures
– A scalar
– Array: of one type
• Vector: 1 dim
• Matrix:2 dim
• Arrays: ≥3 dim
– List : contains many components, different types,
different dim , length
– Data frame: a special list, a generalized Matrix
format
• row by column
• columns of different types
Operators
• Arithmetic operators: +, -, *, /, ^ (power), sqrt(x),
log(x), exp(x), abs(x)
• Character: nchar(x), substr(x, m1, m2)
• Comparison operator: >, >=, <, <=, ==, !=
• Logic:
– Or: |
– And: &
• Numeric vector operator:
– all arithmetic operators
> x <- c(5,7,4,2,3,1,10)
– prod(x), which(x>3), sum(x), max(x), min(x),
mean(x), var(x), crossprod(x,y), outer(x,y)
Data structures-----Scalar
• Scalar: x, length=1
• A numeric scalar:
> a <- 2 ; length(a)
> class(a) ; mode(a); str(a)
> a+2 ; b <- a^2
• A character scalar:
> b <- "apple"
> str(b);
> nchar(b); substr(b, 3, 4)
• A logical scalar:
> c <- T ; d <- F ;
> mode(c); class(c)
> a>2; a==2; a!=2
> b=="apple"; b!="apple"
> !c
Data structures-----Scalar (Cont’d)
– Date-time formats:
– POSIXct: more convenient to store in a data
frame
– POSIXlt: more human readable
> d1 <- "2012-09-01"
> d1 <- as.POSIXlt(d1)
> d2 <- strptime("09/01/2012", "%m/%d/%Y")
> time <- difftime(d1, d2, units="days")
Note: use “%Y” for a four-digit year (2012) and “%y” for a two-digit year (12)
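A short sketch of the date functions above. The second date (07/31/2012, the course date) and the tz = "UTC" argument are choices made here so that the result does not depend on the local time zone.

```r
d1 <- as.POSIXlt("2012-09-01", tz = "UTC")               # ISO format is parsed directly
d2 <- strptime("07/31/2012", "%m/%d/%Y", tz = "UTC")     # other layouts need a format string

gap <- difftime(d1, d2, units = "days")
print(gap)
gap_days <- as.numeric(gap)     # plain number of days

# "%y" reads a two-digit year, "%Y" a four-digit year
d3 <- strptime("09/01/12", "%m/%d/%y", tz = "UTC")
print(d3 == d1)

# POSIXlt behaves like a list of components; POSIXct is a single number
year_of_d1 <- d1$year + 1900
d1_ct <- as.POSIXct(d1)
class(d1_ct)
```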
Data structures-----Vector
• Vector: a collection of multiple values of same
type
• Concatenation function ( c ): arranges a collection
of values into a vector
– Numeric vector
> x <- c(1,2,3,4) ; str(x);
> x1 <- 1; x2 <- 3; x3 <- c(x1,x2)
– character vector
> y <- c("1", "2", "3", "4") or y <- c("a", "b", "a")
equivalent to the first: y <- as.character(x)
> y=="a" ; y>="a" (alphabetical comparison)
what happens if c("1", 2)?
• Named vector:
> z <- c("apple"=2, "banana"=5, "orange"=3);
Data structures-----Vector (cont’d)
• Attributes :
> length(x)
> names(z)
• Subset a vector: [ ]
– Numeric indices:
> x[1]; x[c(1,3)]; x[-2]
– Names
> z["apple"]; z[c("apple", "orange")]
! can't mix numeric indices with names
– Logical values:
> x[x>=3]
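The three ways of subsetting can be compared on one named vector. A minimal sketch, reusing the fruit example from the named-vector slide:

```r
z <- c(apple = 2, banana = 5, orange = 3)

by_index <- z[c(1, 3)]                  # positions 1 and 3
by_name  <- z[c("apple", "orange")]     # the same elements, selected by name
dropped  <- z[-2]                       # everything except position 2
by_logic <- z[z >= 3]                   # elements where the condition is TRUE

print(by_index)
print(dropped)
print(by_logic)
```

All of by_index, by_name, and dropped pick out the apple and orange entries; by_logic keeps banana and orange.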
Data structures-----Vector (cont’d)
• Special vectors
> x <- numeric();
> x[1] <- 2; x[2] <- 5
> y <- numeric(3)
> rep(1,5); z <- rep(c(1,2), each=5)
> x <- seq(1, 10, by=5)
> x <- 1:3
(equivalent: x <- seq(1, 3, by=1))
> y[x] <- c(3, 6, 1)
> z <- character(2)
Data structures-----Vector (cont’d)
• Operation on numeric vector
> x <- 1:5
> x+2 ; x^2; log(x); exp(x)
> prod(x); mean(x); summary(x)
• Operation on character/factor/logical vectors:
table
> gender <- rep(c("F","M"), each=20)
> table(gender)
> age <- rep(c("Young", "Old"), 20)
> table(gender, age)
• which
> which(x>=3) ;
> which(gender=="F" & age=="Young")
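The table and which calls can be run together on the gender/age example. A sketch, assuming the age vector is meant to alternate "Young" and "Old" over 40 entries:

```r
gender <- rep(c("F", "M"), each = 20)   # 20 F followed by 20 M
age    <- rep(c("Young", "Old"), 20)    # alternating Young, Old

tab <- table(gender, age)               # 2 x 2 cross-tabulation of counts
print(tab)

# which() returns the positions where the condition is TRUE
idx <- which(gender == "F" & age == "Young")
print(idx)
```

With this layout every cell of the table holds 10 observations, and the young women sit at the odd positions 1 through 19.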
Factor
• Factor type vector, usually for categorical
variables; ordered factors can compare levels
> age2 <- factor(age, levels=c("Young", "Old"))
> str(age2)
• Levels of a factor: alphabetical order by default, unless “levels” is given
> levels(age2)
> nlevels(age2)
• Relabel levels
> age3 <- factor(age2, labels=c("18~40", ">40"))
• Ordered factor
> age3 <- factor(age2, ordered=T)
> age3 >= "Young" ; age3=="Old"
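A small sketch of how level order, labels, and ordering interact. The six-element age vector and the object names are illustrative.

```r
age <- rep(c("Young", "Old"), 3)

f_default <- factor(age)                              # levels sorted alphabetically
f_set     <- factor(age, levels = c("Young", "Old"))  # levels in the order given
f_lab     <- factor(f_set, labels = c("18~40", ">40"))  # relabel, keeping level order
f_ord     <- factor(age, levels = c("Young", "Old"), ordered = TRUE)

levels(f_default)
levels(f_set)
as.character(f_lab)

# Comparisons such as >= only make sense for ordered factors
at_least_old <- f_ord >= "Old"
print(at_least_old)
```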
Ex2: vector
• Practice the scripts covered in the previous slides
• Generate a numeric vector of the values: 5, 1, 3, 2, 4
and then a named character vector of the elements:
“a”, “e”, “d”, “c”, “b” with names “L1” to “L5”
• Get simple descriptive statistics on the numeric
vector, including min, max, mean, and var
• Calculate the logged value of the numeric vector
• Subset the numeric vector for the 1st and 3rd element
and assign the subsetting results to a new vector
• Subset the named character vector for the 2nd and
4th element by indices and names
Data structures-----Array
• Array: any dimension, values are of the same type, usually
numeric
> A <- array(1:24, dim=c(3,4,2))
> B <- array(letters, dim=c(2,3,4))
• Define a matrix
> A <- matrix(1:12, nrow=3, ncol=4, byrow=F)
(default: filled by column)
Equivalent to: > A <- array(1:12, dim=c(3,4))
what if: > B <- matrix(1:12,3,5) ?
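The "what if" question can be answered by running it. In this sketch suppressWarnings() hides the length-mismatch warning that R issues when 12 values are asked to fill 15 cells; the object names are illustrative.

```r
A_col <- matrix(1:12, nrow = 3, ncol = 4)                # filled column by column
A_row <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)  # filled row by row
A_arr <- array(1:12, dim = c(3, 4))                      # same as A_col

A_col[1, ]   # first row when filling by column
A_row[1, ]   # first row when filling by row

# 12 values into a 3 x 5 matrix: values are recycled from the start
B <- suppressWarnings(matrix(1:12, 3, 5))
print(B)
```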
Data structures-----Matrix
• Attributes
– 2-dim (row and column):
> dim(A);
> nrow(A)
> ncol(A)
– row names/column names:
> rownames(A) <- paste("R", 1:3, sep="")
> colnames(A) <- paste("C", 1:4, sep="")
• Indexing: [ , ]
– Numeric indices
> A[,1]; A[,3]
> A[1, 2] <- 2; A[1,c(1,3,4)] <- c(2,5,6); A[1, ] <- c(2,5,6,1);
> A[1:2,1:2] <- matrix(c(1,6,3,5),2,2,byrow=T)
> B <- A[-c(2,4), -c(1,3)]
– Names ($ does not work on a matrix; use [ , ] with names)
> A["R1", ] ; A[, "C1"]
> A[A[, "C2"]>=3, c("C2", "C4")]
> A[, -c("C1", "C3")] No!!! (negative indices cannot be names)
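Indexing a matrix by name can be sketched as follows, assuming a 3 by 4 matrix with row names R1 to R3 and column names C1 to C4. Note that `$` is for lists and data frames and raises an error on a matrix; the %in% idiom is one common way to drop columns by name.

```r
A <- matrix(1:12, nrow = 3, ncol = 4)
rownames(A) <- paste("R", 1:3, sep = "")
colnames(A) <- paste("C", 1:4, sep = "")

A["R1", ]        # one row by name
A[, "C1"]        # one column by name

# Rows where column C2 is at least 5, keeping columns C2 and C4
sel <- A[A[, "C2"] >= 5, c("C2", "C4")]
print(sel)

# Dropping columns by name: build a logical index instead of using a minus sign
kept <- A[, !(colnames(A) %in% c("C1", "C3"))]
print(kept)

# `$` fails on a matrix
dollar_fails <- inherits(try(A$R1, silent = TRUE), "try-error")
print(dollar_fails)
```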
Data structures-----Matrix Operation
• Arithmetic operator: element-wise
> A+2 ; A*2 ; log(A)
• Transpose a matrix : > t(A)
• Matrix multiplication: A %*% B ## requires ncol(A)==nrow(B)
> A <- matrix(1:6,3,2) ; B <- matrix(1:6,2,3);
> A %*% B
• summary: column-wise
> summary(A)
• sum over all entries : > sum(A)
• Row-wise, column-wise operation
– Get row-wise sum/mean/var
> apply(A,1,sum)
> apply(A,1,mean)
> apply(A,1,var)
– Get col-wise sum/mean/var etc
>apply(A,2,sum)
>apply(A,2,mean)
> apply(A,2,var)
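A compact check of the operations above on the 3 by 2 and 2 by 3 matrices. The rowSums() and colMeans() calls are added here as base-R shortcuts for two of the apply() calls.

```r
A <- matrix(1:6, 3, 2)
B <- matrix(1:6, 2, 3)

P <- A %*% B            # (3 x 2) times (2 x 3) gives a 3 x 3 matrix
dim(P)

elementwise <- A * 2    # '*' works element by element, not as matrix product

row_sums  <- apply(A, 1, sum)    # margin 1 = rows
col_means <- apply(A, 2, mean)   # margin 2 = columns
print(row_sums)
print(col_means)

# Base-R shortcuts for the same quantities
rowSums(A)
colMeans(A)
```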
Matrix to vector conversion
• Vector to matrix: matrix()
• Matrix->vector:
> a <- A[1, 1:3]
> a <- as.vector(A)
> a <- c(A)
Ex3: matrix
• Practice the scripts covered on the slides
• Generate a numeric sequence of 1 to 20
• Transform it into a matrix of dim 4 by 5 with 1
to 5 as the 1st row
• Use dim, nrow, ncol to find out the dimensions
• Name the row and the column separately your
way
• Print the first 2 rows and then the last 2
columns of the matrix in multiple ways
• Get the row-wise and col-wise minimum,
maximum and mean
List
• List: a data structure holding a mixture of types and
lengths
> my.list <- list(x1=1:3, x2=5:10, x3=rep(c("M","F"),each=5),
x4=matrix(1:6,3,2))
> str(my.list)
• Subsetting a list: $, [[ ]]
> my.list$x1 Equivalently > my.list[[1]]
> my.list$x2 Equivalently > my.list[[2]]
> my.list$x3 Equivalently > my.list[[3]]
Compare to:
> str(my.list[1])
> str(my.list[[1]])
> list1 <- my.list[1:2]
List (continued)
> list0 <- list()
> list0 <- vector("list",5)
> list0[[1]] <- 1:5
> list0[[2]] <- matrix(1:6,3,2)
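The difference between single and double brackets is easiest to see by checking what each returns. A minimal sketch using the my.list example:

```r
my.list <- list(x1 = 1:3, x2 = 5:10,
                x3 = rep(c("M", "F"), each = 5),
                x4 = matrix(1:6, 3, 2))

one_bracket <- my.list[1]      # still a list, of length 1
two_bracket <- my.list[[1]]    # the component itself (an integer vector)

class(one_bracket)
class(two_bracket)

# $ and [[ ]] by name reach the same component
same <- identical(my.list$x1, my.list[["x1"]])

# Fill an empty list one component at a time
list0 <- vector("list", 2)
list0[[1]] <- 1:5
list0[[2]] <- matrix(1:6, 3, 2)
str(list0)
```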
Data frame: a special type of list
• Matrix-like, columns may be of different modes
> dat <- data.frame(ID=1:3, name=c("apple", "orange",
"peach"), wt=c(3,6,9))
> str(dat)
• Attributes
> rownames(dat)
> names(dat) equivalent to > colnames(dat)
• Subsetting: $, [, ]
> dat[1,3]; dat$name; dat$name[dat$ID==3]
• Transform matrix to data frame
> A <- matrix(1:6,3,2); B <- as.data.frame(A)
Operations on data frame
• Combining
> dat1 <- data.frame(ID=1:3, Name=letters[1:3])
> dat2 <- data.frame(ID=4:6, Name=letters[4:6])
> dat3 <- data.frame(age=15:17)
> dat4 <- rbind(dat1,dat2)
> dat5 <- cbind(dat1, dat3)
rbind: column names must be the same; cbind: number of rows must be the same
• Merge two data frames
> dat2 <- data.frame(ID=c(2,3), InStore=c(T,F))
> merge(dat1, dat2, by.x="ID", by.y="ID", all.x=T,
all.y=T)
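The effect of all.x and all.y shows up once the two ID columns only partly overlap. In this sketch the data frame named store and its ID value 7 are hypothetical, added so that one ID appears on each side only.

```r
dat1 <- data.frame(ID = 1:3, Name = letters[1:3])
dat2 <- data.frame(ID = 4:6, Name = letters[4:6])
dat3 <- data.frame(age = 15:17)

dat4 <- rbind(dat1, dat2)   # stack rows: column names must match
dat5 <- cbind(dat1, dat3)   # add columns: number of rows must match

# Hypothetical lookup table: IDs 2 and 3 overlap with dat1, ID 7 does not
store <- data.frame(ID = c(2, 3, 7), InStore = c(TRUE, FALSE, TRUE))

m_inner <- merge(dat1, store, by = "ID")               # only IDs present in both
m_all   <- merge(dat1, store, by = "ID", all = TRUE)   # all IDs, NA where missing
print(m_inner)
print(m_all)
```

Setting all = TRUE is shorthand for all.x = TRUE together with all.y = TRUE.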
Subsetting a data frame
• Matrix-like subsetting : $ , [,]
• The “subset” function
> subset(dat5, select=c(Name,age))
> subset(dat5, select=-ID)
> subset(dat5, subset=age>=16)
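The subset() calls have bracket equivalents. A sketch on a three-row data frame with columns ID, Name, and age (built here directly so that the block runs on its own):

```r
dat5 <- data.frame(ID = 1:3, Name = letters[1:3], age = 15:17)

cols_kept  <- subset(dat5, select = c(Name, age))   # keep two columns by name
no_id      <- subset(dat5, select = -ID)            # drop one column
older      <- subset(dat5, subset = age >= 16)      # filter rows; columns used directly

# The same row filter written with matrix-like indexing
older_idx  <- dat5[dat5$age >= 16, ]

print(no_id)
print(older)
```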
attach and detach
• Before attach, can’t use variables in data frame
directly
> name[1]
> age <- age+1 will not change dat$age
To change it,
> dat$age <- dat$age+1
• attach(dat) and use variables directly
> attach(dat)
> age[2]
> detach(dat)
Sort and order
• sort and order for a vector
> score <- c(1,7,6,4)
> sort(score)
> idx1 <- order(score, decreasing=F) ## from min to max
> score2 <- score[idx1]
• Sort a matrix/data frame
> idx <- order(dat5$age, decreasing=T);
> dat <- dat5[idx,]
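sort() returns the sorted values, while order() returns the positions that would sort them, which is what makes it usable for reordering rows. A sketch; the data frame and its name column are made up for illustration.

```r
score <- c(1, 7, 6, 4)

sorted <- sort(score)     # the values in increasing order
idx1   <- order(score)    # positions of the smallest, next smallest, ...
print(idx1)
print(score[idx1])        # indexing by the order gives the sorted vector

# Reorder the rows of a data frame by one column, largest first
dat <- data.frame(name = c("a", "b", "c", "d"), score = score)
dat_sorted <- dat[order(dat$score, decreasing = TRUE), ]
print(dat_sorted)
```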
Special Values in R
• Reserved special objects:
–NULL : empty, no value
–NA: missing value
–NaN: not a number, sqrt(-2) or 0/0
–Inf (-Inf) : infinity
• is.null(x); is.na(x); is.nan(x);
is.infinite(x); is.finite(x)
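The test functions can be compared on one vector holding each special value. A minimal sketch; note that is.na() is also TRUE for NaN, while is.nan() is FALSE for NA.

```r
x <- c(1, NA, 0/0, 1/0, -1/0)   # 1, NA, NaN, Inf, -Inf

na_flags  <- is.na(x)        # TRUE for NA and for NaN
nan_flags <- is.nan(x)       # TRUE only for NaN
inf_flags <- is.infinite(x)  # TRUE for Inf and -Inf
fin_flags <- is.finite(x)    # TRUE only for ordinary numbers

print(rbind(na_flags, nan_flags, inf_flags, fin_flags))

# NULL is an empty object of length 0, not a missing value
is.null(NULL)
length(NULL)

# Keep only the ordinary values before summarizing
mean(x[is.finite(x)])
```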
Objects
• Properties of objects
– str(x): structure (mode, dimension/length, elements)
– class(x): numeric (integer, double), character, factor, logical,
array, data frame, list, table
• unclass(x): lose class property
– typeof(x) and mode(x): numeric/character/logical/list
– attributes(x): a matrix (dimension, row names, col names etc)
– length(x); dim(x)
– names(x); rownames(x); colnames(x)
• Is an object of a type?
is.numeric(x); is.factor(x); is.integer(x); is.character(x);
is.matrix(x); is.data.frame(x) etc
• Switch types: as.numeric(x); as.integer(x); as.character(x);
as.factor(x); as.matrix(x); as.data.frame(x) etc
Ex4: Data frame
• Create two data frames for a class of 5 students
(fake some values)
– the 1st contains columns “ID”, “Age”, “Sex”
– The 2nd includes columns “ID” , “MathScore”
(note: ID should have overlaps)
• Merge the two datasets by ID
• Obtain summary information on Age,
MathScore
• Practice on the dataset the scripts covered so far
Data Export and import
Data Export
• write.table:
> write.table(dat, file="dat.txt", sep="\t", col.names=T,
row.names=F, append=F)
> write.table(dat, file="dat.csv", sep=",", col.names=T,
row.names=F, append=F)
• save a “list” object
> dput(my.list, file="my.list.txt")
• cat
> cat("2 3 5 7", "11 13 17 19", file="catfile.dat",
sep="\n")
• package “xlsReadWrite”: save excel spreadsheet
> write.xls(dat, file="dat.xls", colNames = TRUE, sheet
= 1, from = 1, rowNames = NA)
Data Import
• read.table() : Rectangular spreadsheet-like data, read in as
a data frame
> read.table(file="dat.txt", header=T, sep="\t")
• read.csv
> read.csv(file="dat.csv", header=T)
Note: see the help pages for other useful arguments!
• Read in a saved list
> dget("my.list.txt")
• Read in an excel file: read.xls (package “xlsReadWrite”)
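An export/import round trip can be verified by writing a small data frame and reading it back. This sketch uses only base R and temporary files; the file names are hypothetical and stringsAsFactors = FALSE is set so that text columns stay character in any R version.

```r
dat <- data.frame(ID = 1:3, name = c("apple", "orange", "peach"),
                  wt = c(3, 6, 9), stringsAsFactors = FALSE)

txt_file  <- file.path(tempdir(), "dat_demo.txt")      # hypothetical file names
csv_file  <- file.path(tempdir(), "dat_demo.csv")
list_file <- file.path(tempdir(), "my_list_demo.txt")

# Tab-delimited text
write.table(dat, file = txt_file, sep = "\t", col.names = TRUE, row.names = FALSE)
dat_txt <- read.table(txt_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Comma-separated values
write.csv(dat, file = csv_file, row.names = FALSE)
dat_csv <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)

# A list saved as R source text and read back
my_list <- list(x1 = 1:3, x3 = c("M", "F"))
dput(my_list, file = list_file)
list_back <- dget(list_file)

str(dat_txt)
str(list_back)
```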
Dump/save
• Dump/save
– save is a more reliable version of dump
– dump takes a vector of names of R objects and produces
text representations of the objects on a file or connection.
– A dump file can usually be sourced into another R (or S) session.
> x <- 1; y <- 1:10
> dump(ls(pattern='^[xyz]'), "xyz.Rdmped")
> save(x, y, file="x.y.RData")
> save.image(): save all current objects
• To recall the R objects:
> source("xyz.Rdmped")
> source("Rfunct.r") ## all my R functions
Read in data generated from SAS and
other statistical software
• “foreign” package
– SAS permanent dataset (.sas7bdat or .ssd0x suffix)
Note: must be able to run SAS
library(foreign)
sashome <- "C:/Program Files/SAS/SAS 9.1"
read.ssd(libname="C:/RCourse07/ImportSASData/ExampleSASData",
sectionnames="hsb2", sascmd=file.path(sashome,"sas.exe"))
– Others: SPSS, S etc: see help(package=foreign)
read.spss()
Install R Packages
• Package dependency: (e.g., “DiagTest3Grp”)
• Windows menu: Packages → Install package(s)
(e.g., “DiagTest3Grp”)
• Windows and Linux: R session
> install.packages("DiagTest3Grp", lib="/home/rosy/R")
– Linux command:
R CMD INSTALL -l your-dir package.tar.gz
“-l”: the library directory to install into (e.g., a local one)
• load package:
> library(package, lib.loc="/home/rosy/R")
Ex5
• Practice on data import and export
Set operator functions
> a <- c(1,3,6,4,2) ; b <- c(5,2,1,4) ; c <- c(1,2)
> intersect(a, c): shared elements
> union(a, b)
> setdiff(a, b): in a but not in b
> setdiff(b, a): in b but not in a
> setequal(a, b)
> is.element(c(1,2), a) (equivalently c(1,2) %in% a)
Define intersection operator:
> `%n%` <- intersect
> a %n% b %n% c
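The set functions and the user-defined operator can be run as one script. A sketch with the three vectors above; the third vector is named c3 here to avoid reusing the name of the c() function, and a custom operator name must be wrapped in backticks when it is defined.

```r
a  <- c(1, 3, 6, 4, 2)
b  <- c(5, 2, 1, 4)
c3 <- c(1, 2)

shared  <- intersect(a, c3)   # elements in both
either  <- union(a, b)        # elements in at least one, without duplicates
only_a  <- setdiff(a, b)      # in a but not in b
only_b  <- setdiff(b, a)      # in b but not in a
same    <- setequal(a, b)     # TRUE only if both hold the same elements
members <- c(1, 2) %in% a     # same as is.element(c(1, 2), a)

# A user-defined infix operator: the name goes between % signs, in backticks
`%n%` <- intersect
common_all <- a %n% b %n% c3
print(common_all)
```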