Alice in Wonderland & eval(parse()) in R


Advanced Data Manipulation in R
Matthew Keller
[Slide graphic: ... + data manipulation ability = ...]
Along with graphics and cutting-edge statistics, data
manipulation is one of R’s great strengths. Being good at
manipulating data opens up a world of possibilities!
Today’s lessons:
• Splicing data
• Sorting data
• Merging data
• Reshaping data
Caveat
• The one weakness in R (as in many programs, including
Matlab, SPSS, and Python, though SAS suffers less) is
the difficulty of working with HUGE datasets (e.g.,
datasets that approach half your RAM size)
• R stores data in a way that takes up a lot of RAM
(good for some things; not good for huge datasets)
• If you are working on datasets > 2 GB, you really need
to switch to 64-bit computing, which allows access to
> 3 GB of RAM. But you’re still capped by your
machine’s RAM
• New libraries are in the works to store big objects on
the hard disk instead of RAM, but so far, they are
pretty clunky
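To see how much RAM an object takes, base R's object.size() can be used. The sketch below is only an illustration (the vector length and the variable names are assumptions, not from the slides); it compares an integer vector with a numeric (double) vector of the same length.

```r
# Hypothetical vectors, used only to illustrate memory footprints.
n <- 1e6
int_vec <- integer(n)   # integers use 4 bytes per element
dbl_vec <- numeric(n)   # doubles use 8 bytes per element

# object.size() estimates the memory an object occupies.
int_bytes <- as.numeric(object.size(int_vec))
dbl_bytes <- as.numeric(object.size(dbl_vec))

# Print both sizes in megabytes for easier reading.
print(object.size(int_vec), units = "MB")
print(object.size(dbl_vec), units = "MB")
```

Checking sizes like this before loading or copying a large dataset gives a rough idea of whether it will fit comfortably in memory.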
Lesson 1: The most important skill for splicing
data well is building logical vectors that do
exactly what you want
• If you can translate complicated ideas
into logical vectors, you can select data
in as complicated a way as you want.
• Useful functions for creating logicals:
==, !=, >, >=, <, <=, %in%, duplicated,
is.na, is.null, is.numeric, etc…
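As a minimal sketch of how these functions combine into one selection vector (the vector x and the selection rule are hypothetical, chosen only for illustration):

```r
# Hypothetical data: one missing value and one repeated value.
x <- c(3, NA, 8, 8, 12)

not_missing <- !is.na(x)             # TRUE where a value is present
above_seven <- x > 7                 # NA wherever x is NA
first_seen  <- !duplicated(x)        # FALSE for repeats of an earlier value
in_set      <- x %in% c(3, 12)       # membership test, never returns NA

# Combine the pieces with & (and), | (or), and ! (not).
# FALSE & NA is FALSE, so the missing value is dropped cleanly.
keep <- not_missing & above_seven & first_seen

x[keep]
```

Building each condition as its own named vector makes a complicated selection easier to read and to debug one piece at a time.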
Data splicing example
Original dataset: a
Say we want a new dataset that has females who are
above 7 on v1 but below 10 on v10, and males who
are non-missing on v4
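The slide's dataset is not reproduced in this transcript, so the sketch below uses a made-up stand-in for a. The column name sex, its "F"/"M" coding, and all values are assumptions; only v1, v4, and v10 come from the example.

```r
# Hypothetical stand-in for the slide's dataset `a`.
a <- data.frame(
  sex = c("F", "F", "F", "M", "M", "M"),
  v1  = c(9, 5, 8, 2, 6, 7),
  v4  = c(1, 2, NA, 3, NA, 4),
  v10 = c(4, 3, 12, 8, 1, 15),
  stringsAsFactors = FALSE
)

# Females above 7 on v1 and below 10 on v10.
keep_f <- a$sex == "F" & a$v1 > 7 & a$v10 < 10

# Males with a non-missing v4.
keep_m <- a$sex == "M" & !is.na(a$v4)

# which() converts the logical vector to row numbers and drops any NA,
# so missing values in v1 or v10 cannot produce all-NA rows.
new_a <- a[which(keep_f | keep_m), ]
new_a
```

With these made-up values, the first female and the two males with an observed v4 are kept.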
Lesson 2: Learn to sort the R way
• Sorting seems roundabout in R (there’s
no all-in-one function), but it’s easy.
• Step 1 - create a numeric vector in the
order you want your data sorted
• Step 2 - use that vector as an index for
your rows or columns
• Helpful functions: order(), sort(),
which(), rev(), unique()
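The two steps can be sketched as follows. The data frame d and its columns are hypothetical; order() and rev() are the base R functions named above.

```r
# Hypothetical data frame to sort.
d <- data.frame(
  id    = c("x", "y", "z"),
  score = c(20, 5, 12),
  stringsAsFactors = FALSE
)

# Step 1: order() returns the row positions that put score in ascending order.
idx <- order(d$score)

# Step 2: use that numeric vector as a row index.
d_asc <- d[idx, ]

# Reversing the index gives descending order.
d_desc <- d[rev(idx), ]

d_asc
d_desc
```

Note that sort() returns the sorted values themselves, while order() returns positions, which is what is needed to rearrange whole rows.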
Sorting example
Example:
Say we want to sort a by v1 if male and
by v2 if female. How to do it?
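One possible reading of this task is: group the rows by sex, then sort males by v1 and females by v2. The sketch below follows that reading; it is not necessarily the solution shown on the slide, and the dataset, the sex column, and its "F"/"M" coding are assumptions.

```r
# Hypothetical stand-in for the slide's dataset `a`.
a <- data.frame(
  sex = c("M", "F", "M", "F", "M"),
  v1  = c(3, 9, 1, 4, 2),
  v2  = c(8, 5, 7, 2, 6),
  stringsAsFactors = FALSE
)

# Step 1: build a sort key that takes v1 for males and v2 for females,
# then ask order() for the row positions (sex first, key second).
key <- ifelse(a$sex == "M", a$v1, a$v2)
idx <- order(a$sex, key)

# Step 2: use the numeric vector as a row index.
a_sorted <- a[idx, ]
a_sorted
```

Here females come first (sorted by v2), followed by males (sorted by v1).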
Lesson 3: Merging is easy with merge()
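A minimal sketch of merge() on two made-up data frames (df1, df2, and their columns are hypothetical). The by argument names the key column; all, all.x, and all.y control which unmatched rows are kept.

```r
# Two hypothetical data frames sharing the key column `id`.
df1 <- data.frame(id = c(1, 2, 3), age = c(30, 41, 25))
df2 <- data.frame(id = c(2, 3, 4), score = c(88, 92, 75))

# Default: keep only ids present in both data frames.
inner <- merge(df1, df2, by = "id")

# all = TRUE: keep every id from either data frame, filling gaps with NA.
outer <- merge(df1, df2, by = "id", all = TRUE)

# all.x = TRUE: keep every row of df1, matched or not.
left <- merge(df1, df2, by = "id", all.x = TRUE)

inner
outer
left
```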
Lesson 4: reshape(): long to wide data formats
df3 is in ‘long’ format; the target is ‘wide’ format
You must use two arguments:
• idvar = one or more variables in the long format identifying rows that belong to the same individual
• timevar = one variable that differentiates multiple records from the same individual
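The slide's df3 is not reproduced in this transcript, so the sketch below builds a small hypothetical df3 (the columns id, time, and score are assumptions) and converts it from long to wide with the two arguments described above.

```r
# Hypothetical long-format data: two records per individual.
df3 <- data.frame(
  id    = c(1, 1, 2, 2),
  time  = c(1, 2, 1, 2),
  score = c(10, 12, 20, 23)
)

# idvar identifies the individual; timevar tells the records apart.
wide <- reshape(df3, idvar = "id", timevar = "time", direction = "wide")
wide
```

Each individual now occupies one row, and score is split into one column per value of time (score.1 and score.2).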
An alternative ‘wide’ format
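The alternative layout shown on the slide is not preserved here. As one illustration of how the same long data can yield a different wide layout, the sketch below swaps the roles of the two variables, so that each row is a time point and each column an individual. The data and column names are hypothetical, and this is an assumption about what "alternative" means.

```r
# Hypothetical long-format data: two records per individual.
df3 <- data.frame(
  id    = c(1, 1, 2, 2),
  time  = c(1, 2, 1, 2),
  score = c(10, 12, 20, 23)
)

# Swapping idvar and timevar: rows are now time points,
# and the column suffixes (.1, .2) refer to individuals.
wide_alt <- reshape(df3, idvar = "time", timevar = "id", direction = "wide")
wide_alt
```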
Lesson 4: reshape(): wide to long data formats
wd is in ‘wide’ format
• varying = a LIST in which each element is a vector of wide-format column
names that will be stacked into a single column in the long format
• v.names = a vector of names for those new columns in the long format
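A minimal sketch of the wide-to-long direction. The data frame wd below is a hypothetical stand-in for the one on the slide, and the column names are assumptions. Each element of the varying list is stacked into the column named at the same position in v.names.

```r
# Hypothetical wide-format data: two measures, each taken at two times.
wd <- data.frame(
  id      = c(1, 2),
  score.1 = c(10, 20),
  score.2 = c(12, 23),
  mood.1  = c(3, 5),
  mood.2  = c(4, 6)
)

long <- reshape(
  wd,
  direction = "long",
  idvar     = "id",
  # One list element per long-format column.
  varying   = list(c("score.1", "score.2"), c("mood.1", "mood.2")),
  # Names of the new long-format columns, in the same order as the list.
  v.names   = c("score", "mood"),
  timevar   = "time",
  times     = 1:2
)
long
```

The result has one row per individual per time point, with a time column recording which wide column each value came from.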
Damn! What’s the name of that function again? I should have made a cheat sheet!