R Programming for Life Sciences

Download Report

Transcript R Programming for Life Sciences

R Programming for Life Scientists
Version 2.0
Raymond R. Balise, Ph.D.
Health Research and Policy
Spectrum
Roadmap
•
•
•
•
•
•
•
What makes R different for the rest?
Setting up R
Types of data
Working with collections of data
Importing and exporting data
Writing functions
Graphics
When to Use R
• Shoestring budget
• Cutting edge statistics
• Developing your own or fine-tuning existing
methods
• Local expertise
Programming Languages
• Procedural languages
– C, Fortran, Cobol, Basic
– use a model where the logic flows from the top of
the page to the bottom with calls to “goto”
subroutines as needed
• It is hard to encapsulate the code.
• Object oriented languages
• C++, Visual Basic, JAVA
• involves creating “objects” and then operating on them
R is Object Oriented (OO)
• You create objects
• vector of numbers, a graphic, etc.
• You call methods/functions to operate on the
objects.
• Working with an OO language requires you to
learn about special methods to create, access,
modify, or destroy objects and their properties.
– R hides these processes.
– It helps a lot if you want to write new statistics and
methods and is required for making new packages.
OO Example
• With R you write code in the editor which I will
show you in a minute.
• You can create an object which holds a bunch of
numbers (a vector, if you remember math)
• You can then use (aka call) a function (aka
method) to operate on the object.
– The summary() function
• Create and display a numeric summary object
– The plot() function
• Create and display a graphic summary object
Make the
ages object
Call the
summary
function.
Call the
plot
function.
But wait … there’s more!
• There is a lot of functionality built into R. It
ships with libraries that do many different
tasks. And you can download more.
Activate the map
datasets and
functions.
Map most of the
USA.
But hold on…. There is MORE!
• You can add options to the function calls to
make them do fancy things like color….
• Or you can have one function act on the
output of another function….
• And you can save output as objects!
Important Objects
• Vectors are lists of numbers.
• Dataframes are like database or spreadsheets.
Everything is an Object
Every object in R has a mode which determines how much space it uses.
R objects can have a class which indicates what they are used for.
Class:
Mode:
character
numeric
logical
function
complex
raw
character
numeric
logical
function
factor
data.frame
list
matrix
lm
table
… lots more…
Holding “simple” data elements
Structures to hold data
Columns (of same length) with data like database
Columns of data (of different length)
Grid of data (all same type)
Output from summary/graphic procedures
Where to Get R
• R has two main websites. One describes the project:
http://www.r-project.org/
• The other has most of the stuff you want to
download:
http://cran.r-project.org/
• Because the R project has people working all over
the globe, the software download site is “mirrored”
everywhere. The closest mirror is USA CA1 (aka UC
Berkeley).
http://cran.cnr.berkeley.edu/
• There is an R installer for all the common
operating systems:
• cran.cnr.berkeley.edu/bin/windows/base/
• cran.cnr.berkeley.edu/bin/macosx/
• cran.cnr.berkeley.edu/bin/linux/
• Each is basically self explanatory.
Installing on Windows
• Double click the installer and just push next
until you get to this screen.
Specify that you
want to do
customized startup.
This will let you set
up R to work with
other programs
nicely.
Customize
• Use these options, then hit Next> a bunch.
• help.start() and push enter to start the help.
• q() and push enter to quit but don’t yet.
Use the built in
editor.
GUI
Save or restore all
the objects in use.
Save or reload the
code from the
console.
Set the working
directory to save
objects.
Keep all the text in
the console for the
session.
Edit existing data.
Tweak the
appearance of the
console.
GUI
Rprofile.site
• If you have instructions that you always want
run when R starts up, you can include them in
the Rprofile.site file:
GUI
Common
commands.
Show the add on
packages currently
accessible.
Packages in R
• User-supplied packages are typically found at
one of three places:
– CRAN for all kinds of stuff
– Omegahat for web-based statistics
– Bioconductor for genomic analysis
• R packages update often.
• Your colleagues will recommend task-specific
packages.
– Rcmdr is my favorite.
Use a previously
downloaded package.
I type library(name)
instead.
USA (CA1) is closest
GUI
to Stanford.
Choose which set of
packages to look at.
See the HUGE list of
packages.
Update often!
GUI
This is useful.
HTML
help…
This will not find
information if you
have not installed
the packages.
This is useful
but not
Google.
Rseek.org is Google-driven
• I highly recommend it.
Search help for the
word "map".
Search for details
on a function if the
package is loaded
and you know the
functions name.
Mac Quick Help
Search help for the
word "map".
Load the package.
Windows Quick Help
Search help for the
function named
"map".
Mac Install
• Download and double click the dmg file.
Click customize and
make sure Tcl/Tk is
checked on.
X11
• Some packages for R on the Mac (like Rcmdr)
require X11 to be installed.
– I think it is part of the standard Leopard
installation but was an option with Tiger. If you
need it, try to install it off of the DVD that came
with your machine because people have reported
using the dmg files from Apple.com.
X11 and Add-on Packages
You can click here
to make sure X11
works.
To get add on
packages, use this
menu.
Getting or Updating Packages
• Click Get List, click the package name, be sure
install dependencies is checked on, then click
install.
Instead of Point and Click
• You can also run this code to have Mac or
Windows R download a list of packages:
usefulPackages = c("car", "foreign", "hexbin", "gdata",
"ggplot2", "gmodels", "gplots", "Hmisc", "reshape",
"Rcmdr")
install.packages(usefulPackages, dependencies = TRUE)
Be sure to take note of any packages that do not install.
‘marray’, ‘affy’, ‘Biobase’, ‘Rgraphviz’ were not available
Your First Package
• I suggest you install the Rcmdr package first
thing.
– Use the Install packages… option on the package
menu to download Rcmdr
• To make it available for your R session type:
library(Rcmdr)
– CAPITALIZATION MATTERS!
– The first time you run it, it will ask you if it can
download additional packages.
On a Mac you can
not directly import
from Excel.
If you are on Windows you
can directly import Excel.
Hate Typing?
• Tab is your fiend. It will auto-complete if it
can or give you a list of functions that match
what you have typed. It woks very well on the
Mac. In Windows sometimes you need to type
tab twice.
• In Windows if you type tab after a ( it displays
options for the function or they just appear in
the Mac.
Before Analysis
• Rcmdr makes the most commonly done
analyses easy or if you are told the names of
the functions to use, writing the code is
almost tolerable.
• Data manipulation is relatively difficult in R
compared to other analysis and data
management tools.
• You need to know how to manipulate the data
objects.
Data Set Objects
• Vectors
– A bunch of data in a single row or column
– All of the same type
• Matrix
– A row and column arrangement of data
– All of the same type
• Data frame
– A row and column arrangement of data
– Columns are of different types
• List
– Very free-form structure
– A grouping of different types of data
Like a “good” spreadsheet
or relational database file
Types of Data Vectors
• Numeric
– Integer, real, and complex are different types but
you will not need to pay attention to the details
– NA means missing
– NAN means not a number
• String
– Characters of the alphabet
• Logical
– TRUE, FALSE or NA
Making a Vector
• Make a sequence
R is case sensitive.
# means ignore the rest of the line.
; means a new command follows.
OneToThirty = seq(1, 30)
OneToThirty # same as print(OneToThirty)
oneToThirty = seq(1, 30, by = 2); oneToThirty
This will be VERY useful.
x1230 = 1:30
Surround the expression
(x1to30 = 1:30)
with () to display the result
automatically.
Making Vectors With c()
• c stands for concatenate
ages = c(9, 11, 40, 41) ; ages
stooges = c("Larry", "Moe", "Curly", "Shemp"); stooges
Getting Details
• You can use is functions and length to get
details on a vector.
is.vector(ages)
is.numeric(ages)
is.logical(ages)
length(ages)
Recycling and Vectorizing
• You can add one to all four ages.
ages + c(1,1,1,1)
• If you provide the scalar integer, R will
temporarily vectorize the 1 by recycling that
value to match the length of the ages vector.
ages + 1
• It will recycle a series also.
ages
ages + c(1,2)
Naming Parts of a Vector
• You can assign names to the elements of a vector.
This allows later access to the elements using the
names instead of the position.
names(ages) = stooges
ages
• To erase them:
names(ages) = NULL; ages
• Notice what happens when the lengths differ:
stooges= c("Larry", "Moe", "Curly")
names(ages) = stooges
ages
Attributes
• When you add names to things (objects) they
acquire or change their “names attribute.”
attributes(ages)
• When you strip off the names, the vector is
left with no attributes.
names(ages) = NULL
attributes(ages)
Complex Objects
• A data frame is an object with many
attributes.
• R ships with a lot of datasets if you want one…
help.start()
• Click packages then datasets.
esoph
?esoph
attributes(esoph)
Getting at Parts of a Vector
• Specify the element number.
heyMoe = ages[2] ; heyMoe
• Specify to drop everything except the element
number.
ages[c(-1, -3, -4)]
• Specify a list with TRUE and FALSE
ages[c(FALSE, TRUE, FALSE, FALSE)]
Getting Parts with Names
ages = c(9, 11, 40, 41) ; ages
names(ages) = c("Larry", "Moe", "Curly", "Shemp")
ages
• Specify the name.
heyMoe = ages["Moe"]
Duplicate Names
• That code only returns the first one if there
are duplicates.
names(ages)[4] = "Moe"
ages
heyMoe = ages["Moe"]
heyMoe
• Gives all if duplicates
names(ages) %in% "Moe"
ages[names(ages) %in% "Moe"]
Parts of a Data Frame
• You can select columns of a data frame just
like you selected elements from a vector.
booze = esoph["alcgp"]
is.data.frame(booze)
esoph[2]
esoph[c(4,5)]
Choosing Records
• If you put a single item or series inside of the
square brackets, R thinks you are requesting
columns.
• If you want to get access to specific rows, you
include a comma after the rows.
• blah[rows, columns]
esoph[ 1 , ]
esoph[ 1:3 , ]
Smarter Access to a Vector
• You can use logic checks to find the record
numbers in a vector which meet your criteria.
ages < 21
which(ages < 21)
• You can then subset down your data to the
records of interest using the [ ] subset
operator.
ages[which(ages < 21)]
ages[ages < 21]
Smarter Access to a Data Frame
• You can find the records that pass a logic
check inside a data frame.
esoph["ncases"] > 0
esoph$ncases > 0
which(esoph$ncases > 0)
Subset a Data Frame
• Recall that you can select rows with
frameName[rows,columns] and if you do not
include a comma, all records are chosen.
which(esoph$ncases > 0) gives you a list of
records which adhere to that rule. Therefore,
the code below gives you a subset
esoph[ which(esoph$ncases > 0) , ]
or
esoph[esoph$ncases > 0 , ]
Subset Data Frames Easily
• If that logic is rough to think about, use the
subset function.
subset(esoph, ncases > 0)
Choosing Values
• If you need specific values, you can use the &
(and) or the | (or) operators to get the
ordered set of TRUE and FALSE values.
ages > 21 & ages < 41
• ! means not
!(ages > 21 & ages < 41)
• Notice that it is applying the one logic check
to the vector of ages. How does it do that?
Math on Data Frame Columns
• You have seen how to do scalar and vector
algebra. Algebra on a data frame is easy.
names(esoph)
esoph$total=esoph$ncases + esoph$ncontrols
• To see the end of the data frame, use tail()
tail(esoph)
Comparing Against Vectors
• What happens when you try to compare a vector
to a set of things?
gender = c(NA, "Male", "Female", "Blue", "Female")
This one uses
gender == "Male" | gender == "Female"
recycling and
gives wrong
gender == c("Male", "Female")
answers.
• R recycles the shorter vector to be the longer
length, then does the comparison. Use the %in%
operator if you want to compare as if you wrote a
series of or statements.
gender %in% c("Male", "Female")
Categorical Variables
• R makes a distinction between variables holding
a bunch of characters from the alphabet and
variables holding categorical information. If
you have a classification/categorical variable,
you want R to treat it as a factor or an ordered
factor. Typical factors are treatment or gender.
dose = c("low", "placebo", "high", "low")
dose
typeof(dose)
Factors
• To convert a character variable to a factor, use the
as.factor function.
There are is. or as. predicate functions
doseF = as.factor(dose)
typeof(doseF)
class(doseF)
to check object types or convert
between types of objects.
• Behind the scenes, the character variable is
converted into numbers and the numbers are
given character strings to display.
• In modern R the levels of the factor are ordered
alphabetically and the first one is represented
with the digit 1, the second is 2, etc.
Comparing Factors
• You can compare a factor vs. a constant value.
doseF == "high"
as.integer(doseF) == 1
• Or you can compare vs. vectors (CAREFULLY).
doseF == c("high", "low")
doseF %in% c("high", "low")
Notice wrong answer thanks to recycling.
• R will stop you from comparing factors that have
different categories.
doseF2 = as.factor(c("blah", "placebo", "high", "low"))
doseF == doseF2
Recoding Factors
• Often you will want to regroup factor levels.
amount=as.factor(c("placebo", "10mg", "5mg", "10mg"))
levels(amount)
regroup = list(none="placebo", some=c("5mg", "10mg"))
none
levels(amount) = regroup
placebo
some
amount
5mg
10mg
Numeric Factors
• If you have numeric factors, be careful
converting from factors back to numbers.
ID = c(1000, 1000, 1001, 2)
IDf = factor(ID)
as.integer(IDf)
levels(IDf)
numbersAgain = as.numeric(levels(IDf))[IDf]
Recoding Values in a Vector
• R has functions like if and ifelse to process
values.
ifelse(ages < 21, "Young", "Old")
ifelse(ages<21, "Young", ifelse(ages<41, "Old",
"Fossilized") )
Easier Recoding
• Other packages like car have functions to recode:
library(car)
newAge2=recode(ages, ' 1:21="Young"; else= "Old" ')
newAge2
detach("package:car")
Attaching and Masking
• You can see the order in which R searches for
things using the search() function.
Remove the
package from
the search path.
Attaching Data Frames
• People who really hate typing attach data
frames so they can refer to them with short
names.
bmi = women$weight / women$height ^2 * 703
• Instead you can attach the women data frame
and an easier formula write:
attach(women)
search()
bmi = weight / height ^2 * 703; bmi
detach(women)
Keeping Track
• You can see what datasets are in each of the
work environments/packages with the list
stuff function ls().
rm(list=ls(all=TRUE))
ls()
search()
attach(women)
ls()
search()
head(women); head(height)
.GlobalEnv
women
datasets
height
weight
Look 1st
Look 2nd
…
women
…
Look 3rd
Adding a Variable & Making a DF
women$bmi = weight / height ^2 * 703; bmi
head(women)
The data frame
ls()
with bmi
.GlobalEnv
and the data
frame without
bmi
women
women
Look 1st
datasets
height
weight
Look 2nd
…
women
…
Look 3rd
Making a Data Frame
• Frequently you will want to make data frames for
analysis with Rcmdr. Use the data.frame()
command:
attach(sleep)
pair = data.frame(extra[group=="A"], extra[group=="B"])
Using Rcmdr for a paired t-test
Click here.
Loading Text Data into R
• Reading text files:
fakeAlleles=read.table("c:\\blah\\fakeAlleles.txt",
header=TRUE)
• See if it worked:
fakeAlleles
names(fakeAlleles)
summary(fakeAlleles)
fakeAlleles$dude = as.character(fakeAlleles$dude)
fakeAlleles
• A better option:
fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt", header =
TRUE, colClasses = c("character", "factor","factor"))
Other Text Formats
• Other text reading methods:
read.csv = coma separated values
read.csv2 = semicolon delimited files
read.delim = read tab delimited files
read.fwf = read fixed width format files
• Use same options as read.table
• If the data has bad or no column headings you may
also want to include:
read.table( …stuff…, col.names = c("name1", "name2") )
• To prevent characters from coming in as factors:
options(stringsAsFactors = FALSE)
Data Frames
• The data imported into a data frame.
class(fakeAlleles)
• A data frame really is a list of vectors where the
vectors are all the same length.
as.list(fakeAlleles)
• To select a column you specify the data frame $
variable name.
theDudes = fakeAlleles$dude
• All the stuff you saw for logic checks on vectors
can be used on the parts of a data frame.
fakeAlleles$allele1 == "A"
Subsetting Vectors (again)
• Recall that you can subset using the [ ]
operator:
ages = c(9, 11, 40, 41)
heyMoe = ages[2]
ages <21
ages[ages < 21]
• The same voodoo works on the vectors that
make up data frames!
dudeTwo = fakeAlleles$dude[2]
Subsetting Data Frames
• Parts (subsets) of data frames are referenced
by "column numbers comma row numbers":
– The first record: fakeAlleles[1, ]
– The 2nd and 3rd columns: fakeAlleles[ , c(2,3)]
– The genotype for record 6: fakeAlleles[6, c(2,3)]
• or by names:
fakeAlleles[, c("allele1", "allele2")]
Named Rows
• You can name your rows also:
fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt",
header = TRUE, colClasses = c("character",
"factor","factor"))
row.names(fakeAlleles) = fakeAlleles$dude
fakeAlleles = fakeAlleles[, -1]; fakeAlleles
fakeAlleles["006", c("allele1", "allele2")]
Getting Counts with Rcmdr
Subsetting Using Logic
• You can use logic checks to subset:
fakeAlleles$allele1 == "A" & fakeAlleles$allele2 =="A"
fakeAlleles[ fakeAlleles$allele1 == "A" &
fakeAlleles$allele2 =="A", ]
Importing From Excel
• If you have PERL on your machine, you can
use the read.xls() function in the gdata library
to easily get data out of Excel and into a data
frame.
– Mac has PERL
– Windows
http://www.activestate.com/activeperl/
Using read.xls
• Windows:
library(gdata)
sleepy = read.xls("c:\\blah\\sleep.xls")
• Mac:
library(gdata)
read.xls("/users/balise/desktop/sleep.xls")
• It’s that easy… Behind the scenes it is converting
the xls file into a csv so you can use the text
importing options.
• Do summary() on the data frame and notice what
happens to the missing value.
RODBC
• ODBC is a language/convention for accessing
databases. R allows you to use ODBC
connections to burrow directly into databases
and other data containers like Excel.
library(RODBC)
channel<-odbcConnectExcel("c:\\blah\\sleep.xlsx")
sqlQuery(channel, "select * from [Sheet1$]")
odbcCloseAll()
SQL
• If you have to learn one programming
language, learn SQL.
– With it you can manipulate data stored in nearly
every commercial database.
– You can aggregate, subset and modify data.
– It is well implemented inside of both R and SAS.
• SQL with R is nicely documented in Spector's
(2008) Data Manipulation with R. It is a must
own for people who want to learn R.
Exporting Text Files
• R can write objects full of data, including data
frames, into text files.
– By default, it will quote the character string and fill
in the letters NA where there were originally
missing values.
• This code exports back to the original
appearance.
write.table(sleepy, file = "c:\\blah\\exported.tab",
sep ="\t", quote = FALSE, na ="")
Office 2007 Excel
• ODBC connection
1.
2.
3.
4.
5.
6.
Control Pannels,
Double click Adminstrative Tools
Double click Data Sources (ODBC)
On the USER DNS tab choose ADD
Click Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)
Give the connection a name and browse to the file.
Jot down the name of the connection for the R code.
Using an ODBC connection
• Once the ODBC connection is set-up use code like this:
library(RODBC)
connection = odbcConnect("sleepODBC")
dataFromODBC= sqlFetch(connection, "Sheet1")
odbcClose(connection)
Creating Programs
• You can write line-by-line instructions in the R
console, use the editors built into R, or use a third
party editor (like Tinn-R for windows or JGR).
• Console
– Type history() to see the lines you have submitted
recently and then save to a file and re-run it later if
needed.
• Built-in Editor
– Mac: Click the blank page at top of the console
– Windows: File > New Script
Windows Tinn-R Editor
http://www.sciviews.org/Tinn-R/index.html
OO Programming in R
• OO programming requires
– objects
– classes
• describe specific properties for groups of objects
– inheritance
• classes related to eachother (derived from other classes) have related
properties
– polymorphism
• the same function name applied to different classes does different things
• R vs. JAVA: R typically has separate classes for actions instead of
bundling them with the data structures
– JAVA
• Animal -> domesticated -> dog (walks)
– R
• Animal -> domesticated -> dog
• Movement -> Walks
Polymorphism is Fun
• plot does different things depending on the
function arguments:
plot(sleepy)
plot(sleepy$extra)
plot(sleepy$baseline, sleepy$extra, sleepy$group)
• Take a look at how it works:
isS4(plot)
methods(plot)
getAnywhere(plot.factor)
Arguments
• Functions try to match arguments in the order they
appear in the definitions in the help files.
?plot
• You can explicitly reference the arguments' full names
and they typically allow abbreviations (but don’t
obfuscate your code).
• The … means that other arguments are allowed and
are passed along the class hierarchy to other methods.
Look at plot to see this.
– The … makes it easy to write bugged code because you if
misspell the argument name, it is silently passed along.
Writing Functions
• You can easily write functions, but notice that the last
thing calculated is returned:
MandM = function(x){mean(x); median(x)}
MandM(sleepy$extra) # returns only the median
• Store the values you want into a list:
MandM = function(x) {
blah = list(theMean=0, theMedian=0)
blah$theMean = mean(x)
blah$theMedian = median(x)
return (blah)
}
MandM(sleepy$extra)
Other Arguments
MandM(sleepy$baseline)
• It points out that we need to deal with missing
values. Look up mean and median and you will
see they allow the na.rm parameter to determine
if missing values are dropped.
• Using MandM(sleepy$baseline, na.rm=TRUE)
does not work because the parameter list does
not allow it. We want to allow that parameter to
be passed along. So rewrite the function.
M and M Again
• Recall that an … in the argument list means
"other stuff".
MandM = function(x, ...) {
blah = list(theMean=0, theMedian=0)
blah$theMean = mean(x , ...)
blah$theMedian = median(x, ...)
return (blah)
}
MandM(sleepy$extra)
MandM(sleepy$baseline, na.rm=TRUE)
Appling Your Function
• R does allow you to write loops to iterate over
records or variables but if you are not writing
novel math functions, they can generally be
avoided.
• R will try to vectorize and process:
MandM(c(sleepy$baseline,sleepy$extra))
• Use sapply to apply a function to a data frame:
sapply(sleepy, MandM, rm.na=TRUE)
Better M and M
MandM = function(x, ...) {
blah = list(theMean=0, theMedian=0)
if(is.numeric(x)== TRUE) {
blah$theMean = mean(x , ...)
blah$theMedian = median(x, ...)
}
return (blah)
}
sapply(sleepy, MandM, na.rm=TRUE)
Yummy M and M
MandM = function(x, ...) {
blah = list(theMean=NaN, theMedian=NaN)
if(is.numeric(x)== TRUE) {
blah$theMean = mean(x , ...)
blah$theMedian = median(x, ...)
}
return (blah)
}
MandM(sleepy$extra)
MandM(sleepy$baseline, na.rm=TRUE)
sapply(sleepy, MandM, na.rm=TRUE)
Writing Novel Functions
• Look hard on rseek.org before you reinvent
the wheel.
• R syntax is very similar to C.
– Select/Case logic is different (R short-circuits) .
• The R Book by Crawley is too big to buy for
just this topic but it is good for syntax. Get it
from the library and read the early chapters.
• The final chapter of Spector has a few
wonderful pages.
# The "what the function" (wtf) function
wtf = function(x){
if (missing(x)) stop("\n\tUsage: wtf(x)\n\twhere x exists")
item = as.character(sys.call())[2]
if(is.factor(x) == TRUE) {
thingy = cat(item,
"\n FACTOR \n length", length(x),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.vector(x) == TRUE) {
thingy = cat(item,
"\n VECTOR \n length", length(x),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.matrix(x) == TRUE) {
thingy = cat(item,
"\n MAXTRIX \n dimentions", paste(dim(x),
collapse="x"),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.list(x) == TRUE) {
thingy = cat(item,
"\n LIST \n Components:", paste(names(x), collapse=" "),
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.data.frame(x) == TRUE) {
thingy = cat(item,
"\n DATAFRAME \n Components:", paste(names(x),
collapse=" "),
"\n component lenght of", length(x[[1]]), "items",
"\n MODE:", mode(x),
"\n CLASS:", class(x), "\n"
)
}
if(is.function(x) == TRUE) {
thingy = cat(item,
"\n FUNCTION \n"
)
}
}
A Handy Function
Destroying Efficiency
• A matrix of data is really a vector with row and
column attributes added to it. This has profound
speed issues if you add to the size of a matrix
because the data has to be shifted all over the
place.
• If you plan on writing your own functions to
manipulate matrices, build an empty matrix of
the maximum size (or guess bigger) rather than
using the functions to add rows or columns.
Writing Efficient Code
• R has decent tools for profiling code.
• The Rprof and summaryRprof functions will
help you figure out what is bogging down your
code.
Rprof()
MandM(rnorm(1000000))
Rprof(NULL)
summaryRprof()
Debugging in R
• See Chapter 9 in Gentleman's book.
• The browser() function can be put inside a function to
pause execution and see what is going on.
• The codetools package is great for tweaking big
functions:
findLocals(), findGlobals(),
– shows you if variables and functions originate inside of a
function
checkUsage(), and checkUsagePackage()
– shows you what variables are modified or not touched in a
function
Creating Graphs
• Basic plots are easy but tweaking them for
publications can be rough because the
documentation on the function arguments is
appalling.
• Data Analysis and Graphics Using R by John
Maindonald and John Braun is extremely useful.
• There are myriad graphics built into the core of R
plus more in the packages.
addictedtor.free.fr/graphiques/thumbs.php
Test Scores
scores = read.table("c:\\blah\\walkerScores.txt", header = TRUE)
rapply(scores, class)
scores$CENTER = as.factor(scores$CENTER)
scores$PAT = as.character(scores$PAT)
rapply(scores, class)
scores$isSick = ifelse(scores$SCORE > 0, 1, 0);
library(car)
(scores$SEV = with(scores, recode(SCORE, '0 = "None" ;1:30 =
"Mild"; 31:69 = "Moderate"; 70:100 = "Severe"; else = "BAD
DATA"')))
(scores$SEV = factor(scores$SEV, levels = c("None", "Mild",
"Moderate", "Severe"), ordered = TRUE));
Common Plots are Easy
attach(scores) #to avoid typing scores$
plot(SEV, main = "MainTitle", xlab = "xlab", ylab =
"ylab")
plot(SCORE)
hist (SCORE)
boxplot(SCORE)
boxplot(SCORE ~ SEX, ylim = c(0,100))
detach(scores)
Graphics Tweaks
0.8
0.6
0.2
0.0
-2
-1
0
1
2
3
-3
-2
-1
0
1
2
z
z
Quantiles = qnorm
Random numbers = rnorm
100
50
-1
0
frequency
1
150
2
200
-3
0
Quantile (Z)
0.4
Probability
0.3
0.2
0.1
0.0
Probability density
1.0
Probability = pnorm
0.4
Density = dnorm
-2
mfrow is used to set
number of rows and
columns of graphics on
a page
0.0
0.2
0.4
0.6
p
0.8
1.0
-4
-2
0
z
2
3
Strip Charts for Small Datasets
HI
LO
PB
• par(cex = 1.5) # big font
• with(Gad, stripchart(HAMA ~ DOSEGRP, xlab =
"HAMA", pch = 16))
20
25
HAMA
30
35
3 Languages for the Price of 1…
• The graphics I have shown use the classic
graphic methods.
• There are trellis plots from the lattice package
that split the data into multiple panes
automatically.
• ggplot2 uses a "grammar of graphics"
approach (like SPSS).
Don’t play with pie!
•
•
•
•
The lattice package makes
trellis graphics (I didn’t make
up these names!).
library(lattice)
trellis.par.set(list(fontsize=list(points=20)))
trellis.par.set(list(fontsize=list(text=25)))
dotplot(table(Gad$DOSEGRP), xlim = c(-1, 21))
DOSEGRP
PB
HI
LO
LO
HI
PB
0
5
10
Freq
15
20
8 12 16
EE
EE
8 12 16
EE
EE
8 12 16
EE
EE
8 12 16
EE
EE
EE
Typical lattice plot with
banding to show
subsets
NOx (micrograms/J)
4
3
2
1
8 12 16
8 12 16
8 12 16
Compression Ratio
8 12 16
8 12 16
trellis.par.set(list(fontsize=list(points=15)))
trellis.par.set(list(fontsize=list(text=15)))
EE <- equal.count(ethanol$E, number=9,
overlap=1/4)
## Constructing panel functions on the fly; prepanel
xyplot(NOx ~ C | EE, data = ethanol,
prepanel = function(x, y) prepanel.loess(x, y,
span = 1),
xlab = "Compression Ratio", ylab = "NOx
(micrograms/J)",
panel = function(x, y) {
panel.grid(h=-1, v= 2)
panel.xyplot(x, y)
panel.loess(x,y, span=1)
},
aspect = "xy")
Basic plot + geometric
details + adding details
+ adding more details +
yet more details
qplot(carat, price, data = diamonds, geom
= c("point", "smooth"))
Use Rcmdr (R Commander)
• Rcmdr has A LOT of great graphics built into
the point and click interface.
library(Rcmdr)
• Look up my short course (5 talks) covering
basic statistics to see how to code many
graphics.
www.stanford.edu/~balise/HowToDoBiostatistics.htm
You are Going to Need More Help
• Data Manipulation with R by Spector.
– A must-have book on how to read and write data with or without SQL,
manipulate data with R, aggregate data, and reshape datasets easily.
• R Programming For Bioinformatics by Gentleman.
– A very good intermediate level book on how R object-oriented
programming really works.
• The R Book or Statistical Computing by Crawley.
– These have nicely written intermediate level statistics.
– But they are highly redundant across the two books.
Redundant
• A couple datasets used for this talk were from Glenn
A. Walker’s Common Statistical Methods for Clinical
Research with SAS Examples.
– Buy Walker’s book if you do clinical research and you use
SAS.
• Data Analysis and Graphics Using R by Maindonald and Braun
– I constantly use this book to figure out how to do graphics.
• Using R for Introductory Statistics by Verzani.
– This one has fantastic coverage of the R to do the common statistics.
– It also has a nearly useless index at the end of book.
Biostatistics
• John Fox, the guy who made Rcmdr, is an
excellent author and he provides an R based
supplement for his superb statitics book.
Spectrum
• If you are doing biomedical research and have
questions we are here to help.
– Study design
– Analysis plan
– Power and sample size calculation
– (Limited availability help with SAS and R code)
med.stanford.edu/spctrm/biostatistician.html
Special thanks to 5F-X, Assemblage 23