Introduction to SAS - University of Toronto


Introduction to R
Workshop Plan
The R interface

The Console

The Script Editor

The “Workspace”

R programming rules…
How does R ‘think’

R Objects

The data frame
Importing Data
Data Manipulation
Simple Analyses
J. Charles Victor – Intro to R
What is R?
Programming environment
 Useful for statistics and powerful graphing
capabilities
But you will be programming, not clicking and
pointing

Free, ‘open’ software
Users create programs which are made available
to other users via web and installation interface
Based on S, S-Plus programming
First Step
Open R…
The R Console
The main window
 Commands are written and submitted
 Log of progress recorded
 Output (except graphs) produced
 Similar to STATA interface and function
Prompt ‘>’ indicates R is waiting for a command
 Try the following:
> x <- c(1,2,3,4,5) [ENTER]
> mean(x) [ENTER]

You should find the following result
[1] 3

R is telling you the mean of [1,2,3,4,5] is 3
The Script Editor
Accessible from the File menu item
 Used to create a series of commands (ie program)
that can be saved and run at a later date
 Similar to DO editor in STATA
 Will make SAS and SPSS syntax users more
comfortable
Write commands, highlight and click on submit button

Try opening the Script editor (‘New Script’) and
repeating the same commands as before
x <- c(1,2,3,4,5)
mean(x)

Now highlight this code and click on the submit button
Script Editor
Nothing Fancy – but VERY useful
Saves programs
The Workspace
The ‘Workspace’ is the R data and objects
 When exiting R, saving the workspace saves
your data and work

Let’s see our work thus far
Type: ls()

What do you see?

Try saving your work thus far
File -> Save Workspace
R Programming
General Programming
 R is generally case-sensitive

Character strings must be in quotes (straight " " or ' ', not curly “ ”)

Hitting ENTER submits a command
If a command is not finished when you hit ENTER, R shows a
‘+’ prompt and waits for the rest (you do not type the ‘+’ yourself)
Try the following:
> newy <- c(0,0,1,
+ 1,0)


Use ‘comments’ to identify what you have done
Comments begin with “#”
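A short sketch of these rules in one place. The object names score and Score are made up for illustration; only newy comes from the slide above.

```r
# Names are case-sensitive: 'score' and 'Score' are two different objects.
score <- c(58, 82, 90)
Score <- "a character string, in quotes"

# A command left unfinished at the end of a line continues on the next line.
newy <- c(0, 0, 1,
          1, 0)

mean(score)     # uses the numeric object
Score           # prints the character string
length(newy)    # counts the values entered across the two lines
```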
How does R think?
R thinks of data elements as ‘objects’
 Objects can be:
Single variables
Arrays of variables
Entire Datasets
Results from analyses (if saved as an object)
When you save the ‘Workspace’ you save all of
these objects
 So in a small sense, R works like Excel.
OK, I don’t understand this Object thing…
For data analysts it is usually easiest to start by equating
the term ‘object’ to mean ‘variable’ at first
 We have already created one variable called ‘x’
 We can create another variable (object) called ‘y’ that
has the values (20, 27, 18, 50, 99)
> y <- c(20,27,18,50,99)

To see all of the variables (objects) in memory we can
use the ‘list’ command
ls()


Or click on MISC -> LIST OBJECTS
What do you see?
DATA INPUT
Creating Data
How? No Spreadsheet??
 Create your own
Class of 5 students, need average test score
John Smith 58 M
Jaysharee Singh 82 F
Emily Xu 90 F
Ute VanDroglen 65 F
Charles Victor 90 M
First Attempt to Enter Data
Many ways to create and edit data in R
 First create variables (objects)
 Then compile the data set from the variables
Creating variables – 2 main ways
 Relatively few values
VARIABLENAME <- c(VALUE1,VALUE2,VALUE3….)
Character values in quotes: z <- c("ABC","DEF")
Many values
VARIABLENAME <- scan() ENTER
VALUE1 VALUE2 VALUE3 VALUE4 …. VALUE8 ENTER
VALUE9 VALUE10 ….. ENTER
ENTER
>

Try entering this data in
 Use method 1 for first name and last name and sex
 Use method 2 for exam mark
John Smith 58
Jaysharee Singh 82
Emily Xu 90
Ute VanDroglen 65
Charles Victor 90
After you create each variable, look at the variable to
see that it is correct by typing the variable name at the
command prompt
> firstname
A few notes on entering values
1) Variable names can contain letters, numbers, ‘.’ and ‘_’
(but cannot start with a number or ‘_’)
2) Missing values should be coded as
NA
3) To create a variable whose values are a
sequential list of numbers, use a colon (:)
studentid <- c(1:5)
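A small sketch of notes 2 and 3. The object mark_with_gap is hypothetical (the third mark is set to missing just to show how NA behaves); na.rm is a standard argument of mean().

```r
studentid <- 1:5                          # same values as c(1:5)
mark_with_gap <- c(58, 82, NA, 65, 90)    # hypothetical: third mark is missing

is.na(mark_with_gap)                 # TRUE only for the missing entry
mean(mark_with_gap)                  # NA, because one value is missing
mean(mark_with_gap, na.rm = TRUE)    # 73.75, the mean of the four observed marks
```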
Creating the Dataset
Currently we just have 5 variables (objects)
 These objects are independent of each other (ie the first
name John is not linked with the last name Smith)
To ‘link’ these objects we need to compile these variables
together in a dataset which R calls a ‘data frame’
 In R a data frame is an object just like a variable, and thus it
is created in a similar fashion
DATA_NAME <- data.frame (VARIABLE1,VARIABLE2,VARIABLE3)
Note: All variables must have the same number of
observations
Now take a look at the data by typing the dataset name
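One way the class of 5 students might be put together, as a sketch. The variable names follow the ones used elsewhere in these slides; the values are the class list shown earlier.

```r
studentid <- c(1:5)
firstname <- c("John", "Jaysharee", "Emily", "Ute", "Charles")
lastname  <- c("Smith", "Singh", "Xu", "VanDroglen", "Victor")
mark      <- c(58, 82, 90, 65, 90)
sex       <- c("M", "F", "F", "F", "M")

# All five variables have 5 observations, so they can be compiled together.
class <- data.frame(studentid, firstname, lastname, mark, sex)

class         # type the dataset name to look at it
dim(class)    # 5 observations, 5 variables
mean(mark)    # 77, the average test score
```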
Back to ‘Objects’
Look at the objects now in memory
 ls() or click MISC -> List objects
You should see all of the variables + the
dataset
You can now use the dataset similar to how
we have used variables
To see a variable, type the variable name
To see the dataset, type the dataset name
BUT…
Once compiled into a dataset, the variables (studentid,
firstname, lastname, mark, sex) on the dataset are separate copies,
different from the ‘objects’ of the same name in R’s memory

So we have
The object: mark
The variable mark on the class dataset

You may want to get rid of the ‘objects’ now that you
have compiled them onto the dataset – (any changes
made to the objects, will not be reflected on dataset)
rm(studentid, firstname, lastname, mark, sex)
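A minimal sketch of why the loose objects and the dataset variables are different things. It uses a one-variable version of the class dataset and the $ notation, which is covered in more detail further on.

```r
mark  <- c(58, 82, 90, 65, 90)
class <- data.frame(mark)      # the dataset gets its own copy of mark

mark[1] <- 0                   # edit the free-standing object only
mark[1]                        # 0
class$mark[1]                  # still 58: the dataset copy did not change

rm(mark)                       # drop the loose object; class$mark remains
class$mark
```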
Importing Existing Data into R
R has historically not been very friendly to foreign data formats
 But this is changing - rapidly

Optimally datasets need to be in the form of:
ASCII text
Tab delimited
Comma delimited

Best to convert Excel data into one of these
formats
Importing: ASCII text
Use command: read.table
OBJECT <- read.table("C:\\My Document\\FILE.TXT", header=TRUE)

Note: Windows pathways need a double backslash (\\) or a single forward slash (/)
If variable names are on the first row
Use header=TRUE option
Otherwise variables will be named V1 V2 V3…


Try to import the heart_rx dataset
If you are unsure of the pathway you can use the command:
file.choose() nested in the read.table
This will cause R to bring up a GUI to choose your file
OBJECT <- read.table(file.choose(), header=F)
Try to import the heart_rx_noheader dataset this way
Importing: Tab Delimited or Comma Separated or
Database File
Tab Delimited
 Use command: read.delim
OBJECT <- read.delim("C:\\My Document\\FILE.TXT", header=TRUE, sep="\t")
Comma Separated Value (CSV)
 Use command: read.csv
OBJECT <- read.csv("C:\\My Document\\FILE.CSV", header=TRUE, sep=",")
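To try the import commands without a file on disk, the sketch below writes a tiny made-up CSV to a temporary location and reads it back. The file contents and the names csv_path, with_header and no_header are hypothetical stand-ins, not the workshop's heart_rx data.

```r
# Hypothetical stand-in for FILE.CSV, written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,rx,rate", "1,A,72", "2,B,65", "3,A,80"), csv_path)

# Variable names are on the first row, so header = TRUE.
with_header <- read.csv(csv_path, header = TRUE)
names(with_header)     # "id" "rx" "rate"

# With header = FALSE the first row is treated as data and
# the variables are named V1, V2, V3.
no_header <- read.table(csv_path, header = FALSE, sep = ",")
names(no_header)       # "V1" "V2" "V3"
nrow(no_header)        # 4, because the header line was read as an observation
```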
Importing: Access, SPSS, Stata etc
Best method: 3rd party software to convert data to a
Delimited or CSV file

DBMS Copy is very popular

Stat Transfer is very good
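Some users have created functions (in the ‘foreign’ package) to read these directly
 read.spss
 read.xport (for SAS transport files)
 read.dta (for STATA files)
 But the package needs to be installed (it is bundled with most
R installations) and loaded before use (more on that later)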
Importing: R Dataset
If a workspace has been saved from a previous
session, simply load the workspace by ‘clicking
and pointing’
Or use the load command
load("PATHWAY\\FILENAME.Rdata")
Creating a Dataset from a Dataset
If you want to create a copy of a current dataset, this is a
simple function in R.

Simply create a new object (ie with a different name)
from the existing dataset
NEWDATA <- OLDDATA

To create a new dataset from an edited version of an
old dataset
NEWDATA <- edit(OLDDATA)
This will bring up the data editor (more on this later), and any
changes will be attributed to NEWDATA, but not to OLDDATA
DATA MANIPULATION
99% of the work
(don’t underestimate)
Data Manipulation: General
Most of your time should be spent in this phase
 R is probably not the ‘best’ package for this phase
Data manipulation includes (among other things)
 Renaming variables
 Getting rid of variables
 Creating variables
 Changing variables (eg categorising age)
 Changing values of specific observations
(eg someone reports age of 180)
 Getting rid of observations
 Merging datasets
A couple of things first….
R has MANY ways of accomplishing similar
tasks due to its open software construction

When referring to variables on a dataset you
must either:
Use: d_name$v_name
OR
“Attach” the dataset

attach(d_name)
But attaching the dataset only allows use of the dataset
variables, not manipulation of them
What is he talking about??
Let’s create a new dataset with two variables x and y

x will be the numbers 1 to 20

y will be 20 random values from a normal distribution
x <- c(1:20)
y <- rnorm(20)
testdata <- data.frame(x,y)

Remove the x and y objects
rm(x,y)

Print the dataset, and then x and y
testdata
x
y

Notice we could not access x and y this way. Try:
testdata$x
testdata$y

That worked, but is a lot of typing. So we could also:
attach(testdata)
x
y

That worked too! So attaching a dataset allows us to access the
variables on the dataset without using the $ format, but only for
visualizing and analysing, not editing (so I don’t like to do it)
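A sketch of the "not editing" caveat, assuming a fresh session with no other x or y in memory. Assigning to x while the dataset is attached creates a new loose object and leaves the dataset variable alone.

```r
testdata <- data.frame(x = 1:20, y = rnorm(20))

attach(testdata)
x <- x * 10            # creates a NEW object x in the workspace...
testdata$x[1:3]        # ...the dataset variable is still 1 2 3
detach(testdata)       # undo the attach
rm(x)                  # remove the stray object

# Editing the variable on the dataset needs the $ form.
testdata$x <- testdata$x * 10
testdata$x[1:3]        # 10 20 30
```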
Renaming Variables
Occasionally we need to rename a variable
 Many ways
We can edit the data like a spreadsheet
fix(d_name)
Or create a copy of the class dataset and edit the copy
NEWDATA <- edit(d_name)
OR we can copy the variable under a new name, then delete the old one
d_name$new_v_name <- d_name$old_v_name
d_name$old_v_name <- NULL
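One further way, not on the slide: the base R names() function can change a variable name in place. The small dataset below is hypothetical and only reuses the placeholder names from above.

```r
# Hypothetical two-variable dataset using the placeholder names from the slide.
d_name <- data.frame(old_v_name = c(1, 2, 3), other = c("a", "b", "c"))

# Replace the matching name; the values themselves are untouched.
names(d_name)[names(d_name) == "old_v_name"] <- "new_v_name"

names(d_name)    # "new_v_name" "other"
```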
Deleting and Creating Variables
To delete a variable set a variable to NULL
d_name$v_name <- NULL
To create a variable just set the new variable
equal to some value, using a similar construct
as before
 d_name$v_name <- SOME_VALUE OR EXPRESSION
Creating Variables
Suppose we want a variable identifying the
day the exam was written and a variable
identifying the maximum value for the exam
class$test_day <- c("Monday")
class$test_max <- c(100)
Creating Variables
We can also create variables based on other
variables
 Imagine that we now want to calculate each
student’s percentage on the exam
 d_name$newv_name <- expression
 For example:
class$prct <- (class$mark / class$test_max)*100
Remember the rules of BEDMAS
A Note on Mathematic Functions

+ = addition
- = subtraction
* = multiplication
/ = division
() = brackets
^ or ** = to the exponent
abs( x ) = absolute value of x
trunc( x ) = integer value of x
log( x ) = natural log of x (ie Ln to non-math types)
log10( x ) = log base 10 of x (ie Log to non-math types)
sqrt( x ) = square root of x
round( x, value) = round x, to value decimals
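A few of these tried at the console, as a quick sketch. The value of x is made up; trunc() is the base R function for the integer part.

```r
x <- -7.568          # made-up value

2 + 3 * 4            # 14: multiplication is done before addition
(2 + 3) * 4          # 20: brackets first
2^3                  # 8 (2**3 gives the same result)
abs(x)               # 7.568
trunc(x)             # -7: the integer part, dropping the decimals
sqrt(16)             # 4
log(exp(1))          # 1: log() is the natural log
log10(1000)          # 3
round(x, 1)          # -7.6
```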
Changing Variables
Let’s change the existing prct variable into letter
grades
 Map out which letter grades apply to which
percents
Below 50 = F
50 – 59 = D
60 – 69 = C
70 – 79 = B
80 – 100 = A
Changing Variables - Recoding

Two ways
1) Only for numeric variables
Using Base R
Cut function
d_name$new_v_name <- cut(d_name$old_v_name ,
breaks = c(breakpoints) OR breaks = #breaks,
labels = c("LABEL1", "LABEL2",….) )

EG
class$lettergrd <- cut(class$prct , breaks = c(-Inf,49,59,69,
79,100), labels = c("F","D","C","B","A") )
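A variant worth knowing, sketched under some assumptions: if percentages can have decimals (eg 49.5), setting right = FALSE makes each interval include its lower breakpoint, so the breakpoints can be written exactly as in the grade map. The prct values and the use of Inf as the top breakpoint are choices made for this illustration.

```r
prct <- c(58, 82, 90, 65, 49.5, 50)     # made-up percentages

lettergrd <- cut(prct,
                 breaks = c(-Inf, 50, 60, 70, 80, Inf),
                 labels = c("F", "D", "C", "B", "A"),
                 right  = FALSE)         # intervals are [low, high)

as.character(lettergrd)   # "D" "A" "A" "C" "F" "D"
```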
Recoding variables – Second Method
There is a “RECODE” function, but it has been
developed outside of the original Base R
 We can incorporate programs that have been
written by other people
 Often these programs are compiled into a
group of programs that are used for a similar
construct
 These groups of programs are called
“Packages”
Installing a Package
(to get a function that you do not have)
First, note that you do not have ‘recode’

help(recode)
Now (after searching google) you find out that a special function
called ‘recode’ is available in the package called ‘car’
Click PACKAGES -> INSTALL PACKAGE(S)

R will ask you to set a CRAN Mirror (site from which to download
packages)
Choose CANADA (ON)

R will now ask which package you want to download
Choose ‘car’
R will now download the ‘car’ package
BUT the car package has just been installed, it has not yet been
loaded
Click PACKAGES -> LOAD PACKAGE(S)

R will ask which package to Load from all that you have installed

Choose ‘car’
You can now use the recode function

Type help(recode)
Recoding – Second Method
Now that the ‘car’ package is installed and loaded, we can
use ‘recode’
d_name$new_v_name <- recode(d_name$old_v_name, recodes)
Where recodes can be in form of:
specific values: "c(99,999) = NA; c(1)='Y' "
range of values: "lo:49='F'; 50:59='D' "
class$lettergrd2 <- recode(class$prct, "lo:49='F';
50:59='D';…..")
Combining Conditional Statements
to Change Values within Observations
Your TA informs you that Jim Smith was sick for the Monday
Exam; instead he was given a makeup exam, out of 98
 To identify observations using conditional statements, we
use the R function ifelse
ifelse(condition/expression, value if true, value if false)
class$test_max <- ifelse(class$firstname == 'Jim' &
class$lastname == 'Smith', 98, class$test_max)
More complex…
You are then informed that the twins (Joan and
John Smith) cheated, so you have to give them
zeros:
class$mark <- ifelse((class$firstname ==
'Joan' | class$firstname == 'John') &
class$lastname == 'Smith', 0,
class$mark)
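The same idea on a made-up three-row class list, as a sketch. It assumes the exam result is stored in a variable called mark; the intermediate object is_twin is hypothetical and only there to make the condition easier to read.

```r
# Hypothetical three-row class list.
class <- data.frame(
  firstname = c("John", "Joan", "Emily"),
  lastname  = c("Smith", "Smith", "Xu"),
  mark      = c(58, 70, 90)
)

# TRUE for the rows that meet the combined condition.
is_twin <- (class$firstname == "Joan" | class$firstname == "John") &
  class$lastname == "Smith"

# Zero for the twins, the existing mark for everyone else.
class$mark <- ifelse(is_twin, 0, class$mark)
class$mark    # 0 0 90
```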
Logical Statements
< = Less than
<= = Less than or equal to
> = Greater than
>= = Greater than or equal to
!= = Not equal to
== = Equal to
& or && = Intersection boolean operator (AND)
| or || = Union boolean operator (OR)
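A quick sketch of the operators on the class marks. The single forms (& and |) compare element by element, which is what ifelse and subset need; the double forms (&& and ||) are meant for comparing single values.

```r
mark <- c(58, 82, 90, 65, 90)
sex  <- c("M", "F", "F", "F", "M")

mark >= 80 & sex == "F"     # FALSE TRUE TRUE FALSE FALSE
mark < 60 | mark == 90      # TRUE FALSE TRUE FALSE TRUE
sum(mark != 90)             # 3 marks are not equal to 90

# && works on single TRUE/FALSE values.
length(mark) == 5 && mean(mark) > 70    # TRUE
```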
Deleting Observations (or Subsetting)
Suppose we want to look at only the Female students



We need to either delete the Males or keep the
females
Better to create a new dataset with only females than
to delete observations from our original dataset
Many ways – Use subset command
new_d_name <- subset(old_d_name, condition,
select=variables wanted)
Females <- subset(class, class$sex == 'F')
Note, we can also select out certain variables
only
Males <- subset(class, class$sex == 'M',
select=c(firstname,lastname,lettergrd) )
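A self-contained sketch using the class list from earlier. Inside subset() the variables can be named without the class$ prefix; the object names females and high_males are made up for this example.

```r
class <- data.frame(
  firstname = c("John", "Jaysharee", "Emily", "Ute", "Charles"),
  lastname  = c("Smith", "Singh", "Xu", "VanDroglen", "Victor"),
  mark      = c(58, 82, 90, 65, 90),
  sex       = c("M", "F", "F", "F", "M")
)

# Keep only the female students (all variables).
females <- subset(class, sex == "F")

# Keep males with a mark of 80 or more, and only two of the variables.
high_males <- subset(class, sex == "M" & mark >= 80,
                     select = c(firstname, mark))

nrow(females)    # 3
high_males       # one row: Charles, 90
nrow(class)      # 5: the original dataset is untouched
```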
Data Merge
Two important types of merge

Concatenation
Adding new observations to a set of old
observations

Matched merge
Adding new variables (values) to an existing
dataset with the same observations
(eg we need to add mid-term marks to our exam
database)
Concatenation
Easy
 Use rbind function, and add all datasets
new_d_name <- rbind(d_name1, d_name2,…)
But all datasets must have same number (and
names) of variables!
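A minimal sketch with two made-up class sections that share the same variable names. The names section1, section2 and all_sections are hypothetical.

```r
section1 <- data.frame(lastname = c("Smith", "Singh"), mark = c(58, 82))
section2 <- data.frame(lastname = c("Xu", "Victor"),   mark = c(90, 90))

# Stack the observations of section2 under those of section1.
all_sections <- rbind(section1, section2)

all_sections
nrow(all_sections)    # 4 observations
```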
Matched Merge
A little more complex
 Use merge function
If there is a common variable on which to merge:
new_d_name <- merge(d_name1, d_name2,
by = "ID", all=TRUE)
If the matching variables have different names
new_d_name <- merge(d_name1, d_name2, by.x="IDX",
by.y="IDY",all=TRUE)
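A sketch of what all=TRUE does, using two made-up datasets that only partly overlap. The dataset and variable names (exam, midterm, both, matched) are hypothetical.

```r
exam    <- data.frame(ID = c(1, 2, 3), exam    = c(58, 82, 90))
midterm <- data.frame(ID = c(2, 3, 4), midterm = c(75, 88, 60))

# all = TRUE keeps every ID found in either dataset;
# values that are not available are filled with NA.
both <- merge(exam, midterm, by = "ID", all = TRUE)
both
nrow(both)       # 4

# Without all = TRUE only IDs present in both datasets are kept.
matched <- merge(exam, midterm, by = "ID")
nrow(matched)    # 2
```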