Transcript Slide 1

An Introduction to R!
Roni Kobrosly
Epidemiology Nuts and Bolts Talk
November 2009
My goal is to...




Explain the pros and cons of R
Introduce you to some basic syntax
Live demo of basic epidemiologic statistical
analyses
 Basic descriptive analysis and hypothesis
testing, linear regression, logistic
regression, survival analysis
Show you some helpful R resources!
2
Things you should have in front
of you



Data dictionaries for STAR and NCCTG Lung
Cancer datasets
A R reference sheet, by Tom Short
A copy of this presentation
3
The pros of using R!



It is 100% FREE
Huge community of users, which means lots of
help and support.
Available on virtually all platforms: OS X,
Unix/Linux, and Windows XP, Vista, and 7!

Documentation is available in many languages

Publication quality graphics with default options
4
The pros of using R! cont.


R library database has virtually every type of
statistical analysis that has ever been
conceived. The most cutting edge biostatistical
ideas are always implemented in R first.
R is very flexible. All models and graphics can
be completely customized and modified for the
analysis you’re working on.
5
The cons of R ...



The spreadsheet-style data manipulation is
poor for editing
Steep learning curve
The programming is largely done by
statisticians not computer scientists … so can
be slow during computationally intensive tasks
6
Throughout the presentation...


My narrative bullets will use “this” font style
R code and output will be shown with this font
style:
women <- c(5,7,3,13,5)
men <- c(3,8,8,2,3)
t.test(women,men,paired=F)
7
R terminology




Vector: One dimensional data object (numeric,
character, or logical)
Matrix: Two dimensional data object with
elements of all the same data type (numeric,
character, logical)
Array: Like a matrix, but can have more than
two dimensions
Dataframe: This is a dataset as we know it. The
first line contains variable names, and columns
can have different data types. All columns must
be of same length
8
R terminology cont.



Package/Library: downloadable set of software
that can contain data and functions. All basic R
functions are contained in the core R software.
If you want to do something fancy, will you need
to install packages.
Factor: A nominal variable
Ordered factor: An ordinal variable
9
Access R help files



If you ever want to learn about what a specific
function does or its syntax, using the “help()”
function
It will also provide a data dictionary for datasets
If you end up trying R, you will be using this
function often!
help(glm)
help(STAR)
10
R syntax


This table shows the
most common
operators.
R is object oriented,
so you will use the
assignment operator
and list indexing alot.
+, -, *, /, ^
Arithmetic
!=, &, |, ==,
>, >=, <, <=
Logical
~
Model Formula
<-, ->, =
Assignment
$
List Indexing
:
Sequence
11
Using the R workspace


You can type individual commands in the
console
You can also create an R script and save lines
of code, like you have done with SAS.

Press CTRL+R to run highlighted script lines

Press CTRL+L to clear the console of output

Press the up arrow to retrieve the last command
you just typed into the console
12
Writing commands in R...

Variables, datasets, and output can be saved
as objects with complex names like:
Roni.Data.Final.02_02_2009

Unlike SAS, R is case-sensitive. So the
variables...
HAPPYvar
Happyvar
and happyvar
… are recognized as different objects
13
Writing commands in R...

Spaces don't matter in R:
weight/(height^2)
is interpreted the same as...
weight

/
(height
^ 2)
Comments in R code can be made with “##”
## This is cumulative incidence
(incid / total.at.risk) → c.incid
14
Writing commands in R...

Use a semi-colon to have R run a batch of
commands in the console:
factorial(4); factorial(3); factorial(2);
factorial(1)
This code will display the factorials of 4,3,2, and 1
15
Basic command line operations

Simple arithmetic:
((2 + 4)^2)/3

Save this result to variable “a”
((2 + 4)^2)/3 -> a

Creating list of consecutive numbers (1-15)
1:15 -> nums

Create a vector
c(2,4,6,7) -> num.list
or c('a','b','c') -> char.list
16
Basic command line operations

Accessing the third number in the vector
num.list[3]

Logic:

Is element 3 of the num.list vector > 10?
num.list[3] > 10

Are any of the elements of the num.list vector = 0?
num.list[1] == 0 | num.list[2] == 0 | num.list[3]
== 0 | num.list[4] == 0
17
Basic command line operations

Create a 3x3 matrix
matrix( c(2,4,5,6,7,8,9,10,11), nrow=3, byrow=T)
-> my.matrix

Access the element in the 3rd row and 1st column
my.matrix[3,1]

Change the element in the 3rd row and 1st column
to the value '300'
300 -> my.matrix[3,1]
18
Basic command line operations

See what variables are saved
ls()

Remove the “nums” variable
rm(nums)

Remove all saved variables
rm(list=ls())
19
Working with vectors

Lets create two numeric data objects of the
same length. This is data from four, male
subjects:
weight <- c(175, 232, 145, 193)
height <- c(72, 75, 70, 64)


Simply type these variables names into the
console, and R will show you the contents of
these vectors.
We will create BMI values for these subjects...
but there is a problem, they are in units of
pounds and inches...
20
Working with vectors

To convert vectors, just multiply their assigned
names by a constant. Lets save these results to
new variables:
weight*0.454 -> kg.weight
height*0.0254 -> m.height

We can now create a vector of BMI values:
kg.weight / (m.height^2) -> BMI

When you are operating with vectors like this,
the vectors must be of equal length!
21
Working with vectors

Finally, lets combine metric weight, height, and
BMI vectors into one matrix (so we can see
them side by side).
matrix(c(kg.weight,m.height,BMI), ncol=3) ->
final
22
Misc. useful functions

Pick 5 random integers between 1 and 100
sample(1:100, 5, replace=F) -> lottery

Generate 100 random numbers from normal
distribution
rnorm(n=100, mean=70, sd=6)

Find left-sided area under curve of Z distribution
pnorm(110, mean=100, sd=10)

Find value for 95%, left-sided area under Z
distribution
qnorm(0.95, mean=100, sd=10)
23
Misc. useful functions

Find left-sided area under curve of t distribution
pt(2.015, df=5)

Find value for 95%, left-sided area under t
distribution
qt(0.95, df=5)

Find left-sided area under curve of chi-square
distribution
pchisq(3.84,df=1)

Find value for 95%, left-sided area under chisquare distribution
qchisq(0.95,df=1)
24
Let's install packages




The “epicalc” package is short for
“Epidemiological Calculator.”
This package is a hodge-podge of
epidemiological analysis tools, utilities, and
datasets to test these on.
The “epiR” package is also helpful
You will need an internet connection to get any
package! You only need to download a
package once!
25
Let's install a package
It is easy to do from the console:



Type: install.packages()
A window titled, “CRAN mirror” will appear,
giving you a choice of servers to download the
package from. Just choose anything from the
U.S. (it will download faster).
Then, you can choose what package to
download. R will automatically download any
pre-requisite packages too.
26
The “epicalc” package


Once the installation is complete, type this to
activate the package: library(epicalc)
Let's use the “cci” function to calculate an odds
ratio and 95% CI

Type: help(cci) to understand the syntax

An example:
cci(145,200,300,280, graph=F)
27
The “epiR” package


The “epi.studysize” function calculates power or
minimal sample size needed for various
epidemiologic study types.
This will give you power to detect an
approximate odds ratio of 2.0 using a two-sided
0.05 test when 188 cases and 940 controls are
available, assuming a 0.30 prevalence of
smoking in the male population
epi.studysize(treat = 2/100, control = 1/100, n =
1128, power = NA, sigma = 0.30, r = 0.2,
conf.level = 0.95, sided.test = 2, method =
28
"case.control")
Other helpful functions

These two packages (epicalc and epiR) also
provide functions relating to imputation, MantelHaenszel and CMH tests, tests for homogeneity
across strata, matched case-control analysis,
ROC curves, descriptive analysis, calculating
adjusted rates, and much more!
29
Lets invoke a dataset




From the “epicalc” package, lets load a simple
dataset with data on marriage
Type: data(Marryage)
If you type in Marryage, R will print out every
observation of the dataset
For large datasets, type head(Marryage) to see
the first six observations and variable names …
or try tail(Marryage) to see the last six
observations.
30
The “Marryage” dataset

To isolate a single variable for analysis or
editting, use the “$” operator:
Marryage$birthyr


To avoid having to constantly specify the
dataset of variables, type this: attach(Marryage)
detach(Marryage),
will reverse this option and will
again have to specify the dataset
31
On reading and writing datasets
not from a package ...

There are so many file formats that data can be
stored in:
.txt .dat .csv .ods .xls .sas7bdat .dta .sav
The list goes on and on...


I would recommend using the .CSV file format
to store and transfer your files. It will never
become obsolete!
R can read in datafiles from SAS, SPSS, and
Stata using the “foreign” package
32
How to read/import datasets

Let's say you've just finished cleaning a dataset
in Excel, you're ready to analyze it, and it is
saved to your desktop as a .CSV file.
read.csv(“C:/Documents and
Settings/Roni/Desktop/mydata.csv”,
header=T) -> mydata

IMPORTANT: notice how specifiying the file
location involves the use of forward-slashes (/),
not back-slashes (\)!!
33
How to write/export datasets

Writing/exporting datasets from R to your
desktop, for instance, is also very easy:
write.csv(mydata, “C:/Documents and
Settings/Roni/Desktop/mynewdata.csv”)
34
Don't like the way R looks?



If R doesn't look SAS-ish (SASy?) enough for
you, try out these meta-packages that changes
the appearance and capabilities of the R
console and output:
Java GUI R (JGR): Very sleek setup that helps
you complete input commands! With colorcoded text!
Rcommander: Simple, yet elegant console
35
User-created functions



You can create your own functions in R!
These are much more versatile than macros in
SAS
The R support community posts lots of
homemade functions on the message boards.
36
The STAR dataframe




Tennessee’s Student Teacher Achievement
Ratio (STAR) project was a large, four-year
study of the effect of reduced class size
Lets take the Star dataset from the “mlmRev”
package.
There are > 26000 observations!
Lets randomly choose 1000 observations to
analyze.
37
Preparing the STAR data
Lets invoke the mlmRev package and the data:
install.packages(), select the mlmRev package,
then type: library(mlmRev); data(star)


See what variables STAR contains:
head(star)

Lets removing all observations with missing
data:
na.omit(star) -> star2
38
Preparing the STAR data


We need to know how many observations there
are: length(star2$id)
There are now 22815 observations, lets
randomly choose 5000 of these:
sample(1:22815, 5000, replace=F) -> xyz
star2[xyz, ] -> star3
39
Descriptive analysis




Try this: summary(star3)
This will provide basic univariate analysis
output for all continuous variables, and one-way
frequency tables for categorical variables.
For a histogram of a continuous variable, try
this: hist(star3$read, breaks=20)
For user-defined quantiles (deciles for ex.):
quantile(star3$read,
c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))
40
Preparing the STAR data
Other useful functions for univariate analysis

Standard Deviation:

Variance:

Show a Q-Q Plot:

Show a Boxplot:

Frequency Tables:
sd()
var()
qqnorm()
boxplot()
table()
41
Preparing the STAR data

Let's dichotomize the “read” variable at the
median, and create a dummy variable for “read”
values above and below the median:
median(star3$read) -> cutoff
1 -> star3$readdummy[star3$read <= cutoff]
0 -> star3$readdummy[star3$read > cutoff]

Lets attach this new dataset ... we are ready for
analysis!
attach(star3)
42
Some bivariate analysis




Basic correlation between reading and math
scores: cor.test(read,math,method=”spearman”)
T-test to see if math scores differ between
males and females: t.test(math~sx)
ANOVA to see if reading scores differ among
the four school types: anova(lm(read~schtype))
Chi-sq to see if ethnicity is associated with
school type: chisq.test(eth,schtype)
43
Sub-setting your data

To examine subsets of data, you don't use a
“where” statement like SAS. To see if math
scores differ between males and females, only
among non-White students:
t.test(math[eth != 'W'] ~ sx[eth != 'W'])

What is the mean reading scores for 3rd grade,
inner city students?
mean(read[gr == '3' & schtype == 'inner'])
44
Linear Regression

Try simple linear regression. Use reading
scores to predict math scores. Here is the
syntax:
lm(math ~ read, data=star3) -> SLRmodel
summary(SLRmodel)

Let's make a scatterplot and include the
regression line:
plot(read,math,main=”Neat Scatterplot”)
abline(SLRmodel, col="red")
45
Multiple Linear Regression

Let's try multiple linear regression in predicting
math scores:
lm(math ~ read + eth + sx + schtype, data=star3)
-> MLRmodel
summary(MLRmodel)


Notice how R deals with factor variables (like
“eth,” “sx,” and “schtype”)
You can specifiy the reference group for factor
variables with the group() function, or just use
dummy variables
46
Comparing Models

R allows you to easily compare nested models
through a likelihood ratio test. You need to
obtain the “lmtest” package and use the
“lrtest()” function.
install.packages()
library(lmtest)
lrtest(SLRmodel, MLRmodel)
47
Logistic Regression

Let's use that dichtomous reading variable we
made for a logistic regression model. You now
have to use the glm() function:
glm(readdummy ~ math + schtype, family =
binomial(“logit”)) -> logmodel
summary(logmodel)

Note: These are not odds ratios that are printed,
they are log odds of having a low reading score!
Take the anti-log to get the odds ratios:
exp(logmodel$coeff)
48
Survival Curves and Cox Regression



R can generate beautiful Kaplan-Meier survival
curves and can perform Cox proportional
hazards regression (with or without timedependent covariates)
You will need the “survival” package. It is part of
the core R software, so you don't need to
download it. Just type: library(survival)
Invoke the North Central Cancer Treatment
Group (NCCTG) lung cancer dataset by typing:
data(lung)
49
Survival Curves and Cox Regression


Type: head(lung) and examine the data
dictionary you've got to get a sense of what's in
this dataset. Attach the dataset, so we don't
have to continuously type “lung$” before each
variable: attach(lung)
This will print out the standard KM survival
function output:
summary(survfit(Surv(time,status==2) ~ 1 ,
data=lung))
50
Survival Curves and Cox Regression

To see the KM survival curve for different
subsets (for males and females, for instance):
summary(survfit(Surv(time,status==2) ~ sex ,
data=lung))

Lets plot KM survival curves, stratified by
gender:
plot(survfit(Surv(time,status==2) ~ sex ,
data=lung), lty = 2:3, main=”KM survival curves
for men and women”, xlab=”days”, ylab=”%
survival”)
legend(600, 1, c("Male", "Female"), lty = 2:3)
51
Survival Curves and Cox Regression

Are these two curves different? Here is the
syntax for conducting a Log-Rank test:
survdiff(Surv(time,status==2) ~ sex , data=lung)

Lets perform Cox PH regression to see the
effect of gender on the risk of death, adjusting
for age, ECOG score, and calories consumed
at meals
summary(coxph(Surv(time, status==2) ~ sex + age +
ph.ecog + meal.cal))
52
The final verdict on R



R's data management and editing capabilities
leave something to be desired
R's statistical analysis and graphical output
outshines that of SAS, Stata, and SPSS
If you decide to use R, I would recommend
cleaning, organizing, managing data in SAS
first. Export the SAS dataset as a .CSV file,
then read it into R and conduct your analysis.
53
For help with R ...




Ask me for help! My email is:
[email protected]
A great website geared towards SAS users who
want to learn
R:http://www.statmethods.net/index.html
http://www.ats.ucla.edu/stat/r/
http://www.r-project.org/
54