Chapter 14 - Richard (Rick) Watson

Transcript Chapter 14 - Richard (Rick) Watson

Introduction to R
Statistics are no substitute for judgment
Henry Clay, U.S. congressman and senator
R
R is a free software environment for
statistical computing and graphics
Object-oriented
It runs on a wide variety of platforms
Highly extensible
Command line and GUI
Conflict between extensible and GUI
Scripts
Results
RStudio
Datasets
Files, plots, packages, & help
Creating a project
Store all R scripts and data in the same
folder or directory by creating a project
File > New Project…
Script
A script is a set of R commands
A program
# CO2 parts per million for 2000-2009
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009) # a range of values
# show values
co2
year
#compute mean and standard deviation
mean(co2)
sd(co2)
plot(year,co2)
c is short for combine in c(369.40, …)
Exercise
Plot kWh per square foot by year for the
following University of Georgia data.
year  sqfeet      kWh
2007  14,214,216  2,141,705
2008  14,359,041  2,108,088
2009  14,752,886  2,150,841
2010  15,341,886  2,211,414
2011  15,573,100  2,187,164
2012  15,740,742  2,057,364
Smart editing
1. Copy each column to a word processor
2. Convert table to text
3. Search and replace commas with null
4. Search and replace returns with commas
5. Edit to put R text around numbers
# Data in R format
year <- (2007:2012)
sqft <- c(14214216, 14359041, 14752886, 15341886, 15573100, 15740742)
kwh <- c(2141705, 2108088, 2150841, 2211414, 2187164, 2057364)
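One possible way to finish the exercise from these vectors is sketched below. The name kwhPerSqft is introduced here for illustration only. Dividing two vectors of equal length works element by element, so the result holds one ratio per year.

```r
# Data in R format, as on the slide
year <- (2007:2012)
sqft <- c(14214216, 14359041, 14752886, 15341886, 15573100, 15740742)
kwh <- c(2141705, 2108088, 2150841, 2211414, 2187164, 2057364)
# element-wise division gives one value per year
kwhPerSqft <- kwh / sqft   # hypothetical variable name
# show values, then plot them against year
kwhPerSqft
plot(year, kwhPerSqft)
```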
Datasets
A dataset is a table
One row for each observation
Columns contain observation values
Same as the relational model
R supports multiple data structures and
multiple data types
Data structures
Vector
A single row table where data are all of the
same type
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009)
co2[2] # get the second value
Matrix
A table where all data are of the same type
m <- matrix(1:12, nrow=4,ncol=3)
m[4,3]
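A short sketch of how matrix() lays out its values may help here. By default R fills a matrix column by column; the byrow argument switches to row by row. The name mByRow is introduced for illustration.

```r
m <- matrix(1:12, nrow = 4, ncol = 3)                      # filled column by column
mByRow <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)   # filled row by row
m[1, 2]        # 5: first row, second column
mByRow[1, 2]   # 2: same position, different fill order
m[4, 3]        # 12: the last cell is the same in both layouts
dim(m)         # 4 3: rows, then columns
```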
Exercise
Create a matrix with 6 rows and 3
columns containing the numbers 1
through 18
Data structures
Array
Extends a matrix beyond two dimensions
a <- array(1:24, c(4,3,2))
a[1,1,1]
Data frame
Same as a relational table
Columns can have different data types
Typically, read a file to create a data frame
gender <- c("m","f","f")
age <- c(5,8,3)
df <- data.frame(gender,age)
df[1,2]
df[1,]
df[,2]
Data structures
List
An ordered collection of objects
Can store a variety of objects under one
name
l <- list(co2,m,df)
l[[3]] # list 3
l[[1]][2] # second element of list 1
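The difference between single and double brackets on a list is easy to miss, so here is a minimal sketch. It uses a shortened co2 vector and the small data frame from the earlier slide; the list called named and its element names are assumptions added for illustration.

```r
co2 <- c(369.40, 371.07, 373.17)
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))
l <- list(co2, df)
class(l[1])      # "list": single brackets return a list holding the object
class(l[[1]])    # "numeric": double brackets return the object itself
l[[1]][2]        # 371.07: second element of the first object
l[[2]]$age       # the age column of the data frame stored second
# list elements can also be named and then reached with $
named <- list(co2 = co2, people = df)   # hypothetical names
named$people[1, 2]                      # 5: row 1, column 2 of the data frame
```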
Logical operations
Logical operator  Symbol
EQUAL             ==
AND               &
OR                |
NOT               !
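As a sketch of these operators in use, the co2 and year vectors from the earlier script can be filtered with logical expressions. Each comparison yields a vector of TRUE and FALSE values, which then selects the matching elements.

```r
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009)
year[year == 2005]               # EQUAL: 2005
year[co2 > 380 & year < 2008]    # AND: 2006 2007
year[co2 < 370 | co2 > 385]      # OR: 2000 2008
year[!(co2 > 370)]               # NOT: 2000
```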
Objects
Anything that can be assigned to a
variable
Constant
Data structure
Function
Graph
…
Types of data
Classification
Nominal
Sorting or ranking
Ordinal
Measurement
Interval
Ratio
Factors
Nominal and ordinal data are factors
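Before R 4.0.0, strings in a data frame were converted to factors by default
Since R 4.0.0, strings stay as character unless stringsAsFactors=TRUE is set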
Determine how data are analyzed and
presented
Failure to realize a column contains a
factor can cause confusion
Use str() to find out a frame’s data
structure
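A minimal sketch of what a factor looks like inside a data frame follows. stringsAsFactors = TRUE is set explicitly so the result does not depend on the default of the installed R version; the ordered factor named size and its levels are made up for illustration.

```r
gender <- c("m", "f", "f")
age <- c(5, 8, 3)
df <- data.frame(gender, age, stringsAsFactors = TRUE)
str(df)                  # reports gender as a Factor with 2 levels
levels(df$gender)        # "f" "m": levels are sorted alphabetically
as.integer(df$gender)    # 2 1 1: the integer codes stored behind the labels
# ordinal data: an ordered factor with an explicit level order
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)
size[1] < size[2]        # TRUE: low ranks below high
```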
Missing values
Missing values are indicated by NA (not
available)
Arithmetic expressions and functions
containing missing values generate
missing values
sum(c(1,NA,2))
Use the na.rm=T option to exclude
missing values from calculations
sum(c(1,NA,2),na.rm=T)
Missing values
You remove rows with missing values by
using na.omit()
gender <- c("m","f","f","f")
age <- c(5,8,3,NA)
df <- data.frame(gender,age)
df2 <- na.omit(df)
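The sketch below shows how to detect missing values before dropping them, using the same data as the slide. is.na() and complete.cases() are base R functions.

```r
gender <- c("m", "f", "f", "f")
age <- c(5, 8, 3, NA)
df <- data.frame(gender, age)
is.na(age)                  # FALSE FALSE FALSE TRUE
mean(age)                   # NA: the missing value propagates
mean(age, na.rm = TRUE)     # mean of 5, 8, and 3 only
complete.cases(df)          # TRUE TRUE TRUE FALSE: rows without any NA
df2 <- na.omit(df)          # keeps only the complete rows
nrow(df2)                   # 3
```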
Packages
R’s base set of packages can be extended
by installing additional packages
Over 4,000 packages
Search the R Project site to identify
packages and functions
Install using RStudio
Packages must be installed prior to use
and their use specified in a script
library(packagename)
Packages
# install ONCE on your computer
# can also use RStudio to install
install.packages("knitr")
# library EVERY TIME before using a package in a session
# loads the package to memory
library(knitr)
Exercise
Install the package birk and use one of
its functions to do the following
conversions:
100ºF to ºC
100 meters to feet
Compile a notebook
A notebook is a report of an analysis
Interweaves R code and output
File > Compile Notebook …
Select html, pdf, or Word output
Install knitr before use
Install suggested packages
PDF
Reading a file
R can read a wide variety of input
formats
Text
Statistical package formats (e.g., SAS)
DBMS
Reading a text file
Delimited text file, such as CSV
Creates a data frame
Specify as required
Presence of header
Separator
Row names
It will not find this
local file on your
computer.
t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep=',')
Reading a text file
Read a file using a URL
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
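To see what the header and sep options do without relying on a network connection, the sketch below writes a tiny comma-delimited file to a temporary location and reads it back. The column names match those used for the Central Park file in these slides, but the two data rows are made-up values for illustration, and temps and noHeader are hypothetical names.

```r
# write a small comma-delimited file (illustrative values, not real data)
path <- tempfile(fileext = ".txt")
writeLines(c("year,month,temperature",
             "1999,1,30.5",
             "1999,2,35.1"), path)
# header=TRUE takes column names from the first line
temps <- read.table(path, header = TRUE, sep = ",")
str(temps)
# without a header, skip the first line and supply the names yourself
noHeader <- read.table(path, header = FALSE, sep = ",", skip = 1,
                       col.names = c("year", "month", "temperature"))
```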
Learning about an object
Click on the name
of the file in the
top-right window
to see its content
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
head(t) # first few rows
tail(t) # last few rows
dim(t) # dimension
str(t) # structure of a dataset
class(t) #type of object
Click on the blue
icon of the file in the
top-right window to
see its structure
Referencing data
datasetName$columnName
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
# qualify with tablename to reference fields
mean(t$temperature)
max(t$year)
range(t$month)
Data set
Column
Creating a new column
library(birk)
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
# compute Celsius
t$Ctemp <- round(conv_unit(t$temperature,'F','C'),1)
Reshaping
Converting data from one format to another
Wide to narrow
Melt
Cast

Narrow format
year  month  co2
1959  1      315.62
1959  2      316.38
1959  3      316.71
1959  4      317.72
1959  5      318.29
1959  6      318.15
1959  7      316.54
1959  8      314.80
1959  9      313.84
1959  10     313.26
1959  11     314.80
1959  12     315.58

Wide format
Year  1       2       3       4       5       6       7       8      9       10      11     12
1959  315.62  316.38  316.71  317.72  318.29  318.15  316.54  314.8  313.84  313.26  314.8  315.58
External files & RStudio server
Upload a file
Download a file
More > Export …
Reshaping
library(reshape)
url <- 'http://people.terry.uga.edu/rwatson/data/meltExample.csv'
s <- read.table(url, header=F, sep=',')
colnames(s) <- c('year', 1:12)
# melt (normalization)
m <- melt(s,id='year')
colnames(m) <- c('year','month','co2')
# Cast – reverse of melt
c <- cast(m,year~month, value='co2')
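For readers without the reshape package installed, the same idea can be sketched with base R's reshape() function, which is a different tool from the melt and cast calls on the slide. The example uses only the first three months of 1959 from the table shown earlier; wide, narrow, and wideAgain are names chosen for illustration.

```r
# one wide row: a column per month (first three months of 1959 only)
wide <- data.frame(year = 1959, "1" = 315.62, "2" = 316.38, "3" = 316.71,
                   check.names = FALSE)
# wide to narrow: one row per year and month
narrow <- reshape(wide, direction = "long",
                  varying = list(c("1", "2", "3")), v.names = "co2",
                  timevar = "month", times = 1:3, idvar = "year")
narrow
# narrow back to wide: value columns are named co2.1, co2.2, co2.3
wideAgain <- reshape(narrow, direction = "wide",
                     idvar = "year", timevar = "month", v.names = "co2")
wideAgain
```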
Writing files
url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
t <- read.table(url, header=T, sep=',')
# compute Celsius and round to one decimal place
t$Ctemp = round((t$temperature-32)*5/9,1)
colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
write.table(t,"centralparktempsCF.txt")
The file is
stored in the
project's folder
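With its default settings, write.table() separates values with spaces, quotes strings, and writes row names. The sketch below shows one way to write a plain comma-delimited file instead and read it back. The data frame temps holds made-up values, and the output goes to a temporary directory rather than the project folder; both are assumptions for illustration.

```r
# illustrative values only, not the Central Park data
temps <- data.frame(year = c(1999, 1999), month = c(1, 2),
                    Ftemp = c(30.5, 35.1))
# compute Celsius and round to one decimal place
temps$Ctemp <- round((temps$Ftemp - 32) * 5 / 9, 1)
out <- file.path(tempdir(), "centralparktempsCF.csv")
# comma separator, no row names, no quotes
write.table(temps, out, sep = ",", row.names = FALSE, quote = FALSE)
# read the file back to check the round trip
back <- read.table(out, header = TRUE, sep = ",")
back
```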
sqldf
An R package for using SQL with data
frames
Returns a data frame
Supports MySQL
Subset
Selecting rows
library(sqldf)
options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
trowSQL <- sqldf("select * from t where year = 1999")
Selecting columns
tcolSQL <- sqldf("select year, month, Ctemp from t")
Selecting rows and columns
trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")
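The same three selections can also be written in base R, as sketched below. Note that SQL uses = and and inside the query string, whereas R expressions use == and &. The data frame temps and its values are made up for illustration.

```r
# illustrative values only
temps <- data.frame(year = c(1989, 1995, 1999, 2000),
                    month = c(1, 6, 7, 12),
                    Ctemp = c(-1.2, 21.5, 26.0, 2.4),
                    Ftemp = c(29.8, 70.7, 78.8, 36.3))
# selecting rows: logical condition before the comma
trow <- temps[temps$year == 1999, ]
# selecting columns: column names after the comma
tcol <- temps[, c("year", "month", "Ctemp")]
# selecting rows and columns in one step with subset()
trowcol <- subset(temps, year > 1989 & year < 2000,
                  select = c(year, month, Ctemp))
```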
Logical operator  Symbol
EQUAL             ==
AND               &
OR                |
NOT               !
Sort
Sorting on column name
sSQL <-
sqldf("select * from t order by year desc, month")
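A base R counterpart of this sort can be sketched with order(), which returns the row positions in sorted order. A minus sign in front of a numeric column sorts it in descending order. The data frame temps is made up for illustration.

```r
# illustrative values only
temps <- data.frame(year = c(1998, 1999, 1999), month = c(5, 7, 2))
# year descending, then month ascending within each year
sorted <- temps[order(-temps$year, temps$month), ]
sorted
```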
Recoding
Some analyses might be facilitated by
the recoding of data
Split a continuous measure into two
categories
t$Category <- 'Other'
t$Category[t$Ftemp >= 30] <- 'Hot'
Deleting a column
t$Category <- NULL
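Two other base R functions are commonly used for recoding and are sketched below: ifelse() for a two-way split and cut() for several bands. The temperatures, break points, and labels are made up for illustration.

```r
Ftemp <- c(25, 30, 45, 80)   # illustrative values only
# two categories in one step
category <- ifelse(Ftemp >= 30, "Hot", "Other")
category   # "Other" "Hot" "Hot" "Hot"
# several bands; right=FALSE makes each interval include its lower bound
band <- cut(Ftemp, breaks = c(-Inf, 30, 60, Inf),
            labels = c("low", "mid", "high"), right = FALSE)
as.character(band)   # "low" "mid" "mid" "high"
```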
Exercise
Download the spreadsheet of monthly mean
CO2 measurements (PPM) taken at the
Mauna Loa Observatory from 1958 onwards
http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html
Export a CSV file that contains three columns:
year, month, and average CO2
Read the file into R
Recode missing values (-99.99) to NA
Plot year versus CO2
Summarizing data
library(sqldf)
options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
t <- read.table(url, header=T, sep=',')
w <- sqldf("select year, avg(temperature) as mean from t group by year")
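The group by query has a base R counterpart in aggregate(), sketched here on a small made-up data frame called temps. The formula temperature ~ year reads as "temperature grouped by year".

```r
# illustrative values only
temps <- data.frame(year = c(1999, 1999, 2000, 2000),
                    temperature = c(30, 40, 32, 44))
# mean temperature for each year
w <- aggregate(temperature ~ year, data = temps, FUN = mean)
names(w)[2] <- "mean"   # rename to match the SQL alias
w
```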
Merging files
There must be a common column in
both files
library(sqldf)
options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
t <- read.table(url, header=T, sep=',')
# average monthly temp for each year
a <- sqldf("select year, avg(temperature) as mean from t group by year")
# read yearly carbon data (source: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt'
carbon <- read.table(url, header=T, sep=',')
m <- sqldf("select a.year, CO2, mean from a, carbon where a.year = carbon.year")
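Base R's merge() performs the same join on a common column, as sketched below. The CO2 values are the 2000 to 2002 figures from the script at the start of these slides; the mean temperatures are made up for illustration, and mLeft is a hypothetical name.

```r
a <- data.frame(year = c(1999, 2000, 2001),
                mean = c(55.1, 54.8, 56.0))          # illustrative means
carbon <- data.frame(year = c(2000, 2001, 2002),
                     CO2 = c(369.40, 371.07, 373.17))
# inner join: only years present in both data frames
m <- merge(a, carbon, by = "year")
m
# keep every row of a; years missing from carbon get NA for CO2
mLeft <- merge(a, carbon, by = "year", all.x = TRUE)
mLeft
```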
Correlation coefficient
cor.test(m$mean,m$CO2)
Pearson's product-moment correlation
data: m$mean and m$CO2
t = 3.1173, df = 51, p-value = 0.002997
95 percent confidence interval:
0.1454994 0.6049393
sample estimates:
cor
0.4000598
Significant
Concatenating files
Taking a set of files with the same
structure and creating a single file
Same type of data in corresponding
columns
Files should be in the same directory
Concatenating files
Local directory
# read the file names from a local directory
filenames <- list.files("homeC-all/homeC-power",
pattern="*.csv", full.names=TRUE)
# append the files one after another
for (i in 1:length(filenames)) {
# Create the concatenated data frame using the first file
if (i == 1) {
cp <- read.table(filenames[i], header=F, sep=',')
}
else {
temp <-read.table(filenames[i], header=F, sep=',')
cp <-rbind(cp, temp) #append to existing file
rm(temp)# remove the temporary file
}
}
colnames(cp) <- c('time','watts')
Takes a
while to run
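Appending with rbind() inside a loop copies the growing data frame on each pass, which is one likely reason the loop is slow. A common alternative, sketched below, reads all files into a list and binds them once. The two small files are created in a temporary directory with made-up values so the example is self-contained; frames is a hypothetical name.

```r
# create two small demo files (illustrative values only)
dir <- tempfile("power")
dir.create(dir)
write.table(data.frame(1:2, c(100, 110)), file.path(dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c(120, 130)), file.path(dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
# read the file names; the pattern is a regular expression
filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
# read every file into a list, then bind all rows in one step
frames <- lapply(filenames, read.table, header = FALSE, sep = ",")
cp <- do.call(rbind, frames)
colnames(cp) <- c('time', 'watts')
cp
```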
Concatenating files
Remote directory with FTP
# read the file names from a remote directory (FTP)
library(RCurl)
url <- "ftp://watson_ftp:bulldawg1989@people.terry.uga.edu/rwatson/power/"
dir <- getURL(url, dirlistonly = T)
filenames <- unlist(strsplit(dir,"\n")) # split into filenames
# append the files one after another
for (i in 1:length(filenames)) {
file <- paste(url,filenames[i],sep='') # concatenate for url
if (i == 1) {
cp <- read.table(file, header=F, sep=',')
}
else {
temp <-read.table(file, header=F, sep=',')
cp <-rbind(cp, temp) #append to existing file
rm(temp)# remove the temporary file
}
}
Database access
MySQL access
library(RMySQL)
conn <- dbConnect(RMySQL::MySQL(), "richardtwatson.com",
dbname="Weather", user="db2", password="student")
# Query the database and create file t for use with R
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)
Exercise
Using the Atlanta weather database and
the lubridate package
Compute the average temperature at 5 pm
in August
Determine the maximum temperature for
each day in August for each year
Resources
R books
Reference card
Quick-R
Key points
R is a platform for a wide variety of data
analytics
Statistical analysis
Data visualization
HDFS and MapReduce
Text mining
Energy Informatics
R is a programming language
Much to learn