Hands-on Introduction to R
Why Learn Programming?
• We live in oceans of data. Computers are essential to record it and help analyse it.
• Competent scientists speak C/C++, Java, MATLAB, Python, Perl, R and/or Mathematica
• Data collection and analysis have been very important in forensic science since the NAS 2009 report
• Using the above languages, code can easily be made available for review/discovery
Getting a computer to do anything useful
• All machines understand is on/off!
  • High/low voltage
  • High/low current
  • High/low charge
  • 1/0 binary digits (bits)
• To make a computer do anything, you have to speak machine language to it:
  000000 00001 00010 00110 00000 100000
  Add 1 and 2. Store the result. (Source: Wikipedia)
Getting a computer to do anything useful
• Machine language is not intuitive and can vary a great deal across designs
• The basic operations, however, are the same, e.g.:
  • Move data here
  • Combine these values
  • Store this data
  • Etc.
• “Human readable” language for basic machine operations: assembly language
Getting a computer to do anything useful
• Assembly is still cumbersome for (most) humans
  • A machine encoding: 10110000 01100001
  • Assembly: MOV AL, 61h
  • Meaning: move the number 97 (61 in hexadecimal) over to “storage area” AL
Getting a computer to do anything useful
• Better yet is a more “Englishy”, “high-level” language
  • Enter: C, C++, Fortran, Java, …
• Higher-level languages like these are translated (“compiled”) to machine language
  • Not exactly true for Java, but it’s something analogous…
Getting a computer to do anything useful
• Even more “Englishy” and “high-level” are interpreted languages
  • Enter: R, MATLAB, Perl, Python, Mathematica, Maple, …
• The “code” of these languages is “interpreted” as commands by a program that is already running
  • They make many assumptions behind the scenes
  • Much easier to program with
  • Much slower than compiled languages
Why R?
• R is not a black box!
  • Code available for review; totally transparent!
• R is maintained by a professional group of statisticians and computational scientists
  • From very simple to state-of-the-art procedures available
  • Very good graphics for exhibits and papers
• R is extensible (it is a full scripting language)
  • Coding/syntax similar to Python and MATLAB
  • Easy to link to C/C++ routines
Why R?
• Where to get information on R:
  • R: http://www.r-project.org/
    • Just need the base
  • RStudio: http://rstudio.org/
    • A great IDE for R
    • Works on all platforms
    • Sometimes slows down performance…
  • CRAN: http://cran.r-project.org/
    • Library repository for R
    • Click on Search on the left of the website to search for packages/info on packages
Finding our way around R/RStudio
Handy Commands:
• Basic Input and Output
  • Numeric input: x <- 4
  • Text (character) input: x <- "text goes in quotes"
  • Variables (here x): store information
  • Assignment operator: <-
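A minimal sketch of the input and output described above, as it might be typed at the R console. The variable names x and y are arbitrary choices for illustration.

```r
# Numeric input: store the number 4 in the variable x
x <- 4

# Text (character) input: text goes in straight double quotes
y <- "text goes in quotes"

# Basic output: typing a name at the console prints its value;
# print() does the same thing from inside a script
print(x)
print(y)

# Check what kind of information each variable stores
class(x)
class(y)
```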
Handy Commands:
• Get help on an R command:
  • If you know the name: ?commandname
    • ?plot brings up the HTML help page for the plot command
  • If you don’t know the name:
    • Use Google (my favorite)
    • ??keyword
Handy Commands:
• R is driven by functions:
  func(argument1, argument2)
  • func is the function name
  • Input to the function goes in parentheses
  • The function returns something; it gets dumped into x:
  x <- func(arg1, arg2)
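As a small illustration of the pattern x <- func(arg1, arg2), here are a few calls to functions that ship with base R. The particular functions and inputs are picked only as examples.

```r
# sqrt is the function name; 16 is the input; the result is dumped into x
x <- sqrt(16)

# Functions can take several arguments, matched by position or by name
y <- round(3.14159, digits = 2)
z <- seq(from = 1, to = 10, by = 3)

print(x)
print(y)
print(z)
```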
Handy Commands:
• Input from Excel
  • Save spreadsheet as a CSV file
  • Use read.csv function
  • Needs the path to the file
    • Mac e.g.: "/Users/npetraco/latex/papers/data.csv"
    • Windows e.g.: "C:\\Users\\npetraco\\latex\\papers\\data.csv" (in an R string, backslashes must be doubled; forward slashes also work)
*Exercise: basicIO.R
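A minimal sketch of the CSV round trip. So that it runs anywhere, it first writes a tiny CSV to a temporary file; in practice the path would point at the file saved from Excel. The column names fragment and nD, and the use of tempfile(), are assumptions made for this illustration.

```r
# Stand-in for a spreadsheet saved as CSV (written to a temporary file)
csv.path <- tempfile(fileext = ".csv")
writeLines(c("fragment,nD",
             "1,1.52005",
             "2,1.52003",
             "3,1.52001"), csv.path)

# Read the file; the first row is treated as column names by default
dat <- read.csv(csv.path)

dat        # print the data frame
dim(dat)   # number of rows and columns
dat$nD     # pull out one column by name
```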
Handy Commands:
• Matrices: X
  • X[,1] returns column 1 of matrix X
  • X[3,] returns row 3 of matrix X
• Handy functions for data frames and matrices:
  • dim, nrow, ncol, rbind, cbind
• User defined functions syntax:
  • func.name <- function(arguments) {
      do something
      return(output)
    }
  • To use it: func.name(values)
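The indexing, helper functions and function syntax above can be sketched on a small made-up matrix. The matrix contents and the function name col.ranges are hypothetical, chosen only to have something to run.

```r
# A 4 x 3 matrix, filled column by column with the numbers 1 to 12
X <- matrix(1:12, nrow = 4, ncol = 3)

X[, 1]    # column 1 of X
X[3, ]    # row 3 of X

dim(X)    # rows and columns together
nrow(X)
ncol(X)

# Glue on an extra row or an extra column
X.more.rows <- rbind(X, c(100, 200, 300))
X.more.cols <- cbind(X, 13:16)

# User defined function: range (max minus min) of every column
col.ranges <- function(M) {
  output <- apply(M, 2, max) - apply(M, 2, min)
  return(output)
}

# To use it:
col.ranges(X)
```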
Handy Commands:
• User defined function example:
  • Compute the intensities of the Planck distribution
  • Let the user input a temperature
  • Let the user input the wavelength endpoint. Assume it is in nm
    • Careful here. Make sure wavelength units are consistent with the other constants.
    • What is the “easiest” thing to do??
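One possible sketch of such a function, not necessarily the one used in the exercise. It assumes the Planck distribution is wanted as spectral radiance per unit wavelength, B(λ, T) = (2hc²/λ⁵) / (exp(hc/(λ k T)) − 1), with SI constants, and it handles the unit question by converting nm to m inside the function. The function name planck, the default starting wavelength and the number of grid points are all hypothetical choices.

```r
# Physical constants in SI units
h  <- 6.62607015e-34   # Planck constant, J s
cc <- 299792458        # speed of light, m/s
kB <- 1.380649e-23     # Boltzmann constant, J/K

# temp.K:          temperature in kelvin
# lambda.end.nm:   wavelength endpoint in nm (user input)
# lambda.start.nm: starting wavelength in nm (assumed default)
planck <- function(temp.K, lambda.end.nm, lambda.start.nm = 100, n.points = 1000) {
  lambda.nm <- seq(lambda.start.nm, lambda.end.nm, length.out = n.points)
  lambda.m  <- lambda.nm * 1e-9   # nm -> m, to match the SI constants
  intensity <- (2 * h * cc^2 / lambda.m^5) /
    expm1(h * cc / (lambda.m * kB * temp.K))   # expm1(z) is exp(z) - 1
  return(data.frame(lambda.nm = lambda.nm, intensity = intensity))
}

# To use it: a 5800 K source, wavelengths out to 3000 nm
spec <- planck(temp.K = 5800, lambda.end.nm = 3000)
plot(spec$lambda.nm, spec$intensity, type = "l",
     xlab = "Wavelength (nm)", ylab = "Spectral radiance (W sr^-1 m^-3)")

# Wavelength (nm) at which the intensity peaks
peak.nm <- spec$lambda.nm[which.max(spec$intensity)]
peak.nm
```

A quick sanity check on the units is Wien's displacement law: the peak wavelength times the temperature should come out near 2.898e-3 m K.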
First Thing: Look at your Data
• Explore the Glass dataset of the mlbench package
  • Source (load) all_data_source.R
• Scatter plots: plot any two variables against each other
*visualize_with_plots.r
[Figure: scatter plot; axis shown: RI]
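A minimal sketch of loading the data and drawing a scatter plot. It assumes the mlbench package is already installed (install.packages("mlbench")) and loads Glass directly rather than through all_data_source.R. Plotting RI against Ca is an arbitrary choice; any two variables work.

```r
library(mlbench)   # assumed to be installed
data(Glass)        # RI, eight oxide measurements, and the glass Type

str(Glass)         # quick look at the variables

# Scatter plot: one variable against another
plot(Glass$Ca, Glass$RI, xlab = "Ca", ylab = "RI")
```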
First Thing: Look at your Data
• Pairs plots: do many scatter plots at once
[Figure: pairs plot of Si, K and Ca]
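A pairs plot of the three variables named in the figure could be sketched as follows, assuming the mlbench package is installed.

```r
library(mlbench)   # assumed to be installed
data(Glass)

# Every pairwise scatter plot of the chosen columns in one figure
three.vars <- Glass[, c("Si", "K", "Ca")]
pairs(three.vars)
```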
First Thing: Look at your Data
• Histograms: “bin” a variable and plot frequencies
[Figure: histogram of RI; y-axis: frequency]
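A short sketch of the binning step using base R's hist, assuming the mlbench package is installed. The object returned by hist holds the bin edges and the counts that are drawn.

```r
library(mlbench)   # assumed to be installed
data(Glass)

# Bin RI and plot the frequencies
h <- hist(Glass$RI, xlab = "RI", main = "Histogram of RI")

h$breaks   # bin edges chosen by hist
h$counts   # number of fragments falling in each bin
```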
First Thing: Look at your Data
• Histograms conditioned on other variables: use lattice package
[Figure: “RIs Conditioned on glass group membership”; one histogram panel per glass group (1, 2, 3, 5, 6, 7); x-axis: RI]
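A conditioned histogram along these lines can be sketched with lattice's formula interface, where the variable after the | defines the panels. This assumes mlbench is installed (lattice ships with standard R installations) and that the glass group is the Type column of Glass.

```r
library(mlbench)   # assumed to be installed
library(lattice)
data(Glass)

# One histogram of RI per glass group (Type)
p <- histogram(~ RI | Type, data = Glass,
               main = "RIs Conditioned on glass group membership")
print(p)   # lattice plots need an explicit print() inside scripts

levels(Glass$Type)   # the group labels used for the panels
```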
First Thing: Look at your Data
• Probability density plots: also needs lattice
[Figure: probability density plot of RI]
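A density plot of RI might be drawn with lattice's densityplot, sketched here under the assumption that mlbench is installed. The default smoothing settings are used; the original figure may have used others.

```r
library(mlbench)   # assumed to be installed
library(lattice)
data(Glass)

# Smoothed estimate of the probability density of RI
d <- densityplot(~ RI, data = Glass, xlab = "RI")
print(d)   # lattice plots need an explicit print() inside scripts
```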
First Thing: Look at your Data
• Empirical Probability Distribution plots: also called the empirical cumulative distribution function (ECDF)
[Figure: empirical cumulative distribution of RI; y-axis from 0.0 to 1.0]
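A sketch using base R's ecdf, assuming mlbench is installed. ecdf returns a function, so besides plotting it can be evaluated at any value to get the fraction of observations at or below that value. The cut-off 1.52 is an arbitrary example.

```r
library(mlbench)   # assumed to be installed
data(Glass)

Fn <- ecdf(Glass$RI)   # Fn is itself a function
plot(Fn, xlab = "RI", ylab = "Cumulative proportion", main = "ECDF of RI")

Fn(1.52)               # fraction of fragments with RI <= 1.52
Fn(max(Glass$RI))      # reaches 1 at the largest observation
```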
First Thing: Look at your Data
• Box and Whiskers plots:
[Figure: annotated box and whiskers plot of RI, marking the range, the 25th percentile (1st quartile), the median (50th percentile), the 75th percentile (3rd quartile), and possible outliers on both sides]
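The quantities marked on a box and whiskers plot can be inspected numerically with boxplot.stats, sketched here assuming mlbench is installed. Note that base R draws the box at the "hinges", which are close to, but not always identical to, the quartiles.

```r
library(mlbench)   # assumed to be installed
data(Glass)

bs <- boxplot.stats(Glass$RI)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers: possible outliers

quantile(Glass$RI, c(0.25, 0.50, 0.75))   # quartiles, for comparison

boxplot(Glass$RI, horizontal = TRUE, xlab = "RI")
```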
Visualizing Data
• Note the relationship:
First Thing: Look at your Data
• Box and Whiskers plots:
[Figure: two panels, “Box-Whiskers plots for actual variable values” and “Box-Whiskers plots for scaled variable values”; variables: Al, Ba, Ca, Fe, K, Mg, Na, RI, Si]
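A sketch of the two panels, assuming mlbench is installed and that "scaled" means each variable is centered to mean 0 and divided by its standard deviation, which is what base R's scale does by default. The original figure may have used a different scaling.

```r
library(mlbench)   # assumed to be installed
data(Glass)

# Keep only the numeric measurements (drops the Type factor)
num.vars <- Glass[, sapply(Glass, is.numeric)]

# Actual values: variables on very different scales dominate the plot
boxplot(num.vars, main = "Box-Whiskers plots for actual variable values")

# Scaled values: every variable now has mean 0 and standard deviation 1
scaled.vars <- as.data.frame(scale(num.vars))
boxplot(scaled.vars, main = "Box-Whiskers plots for scaled variable values")
```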
Confidence Intervals
• A confidence interval (CI) gives a range in which a true population parameter may be found.
  • Specifically, $(1-\alpha)\times 100\%$ CIs for a parameter, constructed from a random sample (of a given sample size), will contain the true value of the parameter approximately $(1-\alpha)\times 100\%$ of the time.
• Different from tolerance and prediction intervals
Confidence Intervals
• Caution: IT IS NOT CORRECT to say that there is a $(1-\alpha)\times 100\%$ probability that the true value of a parameter is between the bounds of any given CI.
• Take a sample. Compute a CI.
[Figure: graphical representation of 90% CIs for a parameter, with the true value of the parameter marked. Here 90% of the CIs contain the true value of the parameter.]
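The "take a sample, compute a CI" idea can be sketched as a small simulation: draw many samples from a population whose mean is known, build a 90% CI for the mean from each, and count how often the interval covers the true value. The population (normal, mean 10, standard deviation 2), the sample size and the number of repetitions are all made up for this illustration.

```r
set.seed(42)       # make the simulation repeatable

n.sims <- 1000     # number of samples to take
n      <- 20       # size of each sample
mu     <- 10       # true population mean (known only because we simulate)
sigma  <- 2        # true population standard deviation
alpha  <- 0.10     # 90% confidence intervals

covered <- replicate(n.sims, {
  x   <- rnorm(n, mean = mu, sd = sigma)   # take a sample
  se  <- sd(x) / sqrt(n)                   # standard error of the mean
  t.c <- qt(1 - alpha / 2, df = n - 1)     # critical t-value
  lo  <- mean(x) - t.c * se                # compute a CI
  hi  <- mean(x) + t.c * se
  (lo <= mu) && (mu <= hi)                 # does it contain the true value?
})

coverage <- mean(covered)   # fraction of CIs that contain mu
coverage
```

The coverage should come out close to 0.90. Any single interval either contains the true mean or it does not; the 90% describes the long-run behavior of the procedure.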
Confidence Intervals
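• Construction of a CI for a mean depends on:
  • Sample size $n$
  • Standard error for means: $s_{\bar{x}} = s/\sqrt{n}$
  • Level of confidence $1-\alpha$
    • $\alpha$ is the significance level
    • Use $\alpha$ to compute the $t_c$-value
• The $(1-\alpha)\times 100\%$ CI for the population mean using a sample average and standard error is:
  $\left[\bar{x} - t_c\, s_{\bar{x}},\ \bar{x} + t_c\, s_{\bar{x}}\right]$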
Confidence Intervals
• Compute a 99% confidence interval for the mean using this sample set:

  Fragment #   Fragment nD
  1            1.52005
  2            1.52003
  3            1.52001
  4            1.52004
  5            1.52000
  6            1.52001
  7            1.52008
  8            1.52011
  9            1.52008
  10           1.52008
  11           1.52008

  $\bar{x} = 1.52005$, $s = 0.00004$, $s_{\bar{x}} = 0.00001$
  $t_c = 3.17$ ($\alpha/2 = 0.005$, $n - 1 = 10$ degrees of freedom)

Putting this together: $[1.52005 - (3.17)(0.00001),\ 1.52005 + (3.17)(0.00001)]$
99% CI for the mean = [1.52002, 1.52009] (endpoints computed from the unrounded values)
*Try out confidence_intervals.R
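The calculation above can be sketched in a few lines of base R. This is not the contents of confidence_intervals.R, just one way to reproduce the numbers from the eleven nD values in the table; the variable names are arbitrary.

```r
# The eleven fragment refractive indices from the table
nD <- c(1.52005, 1.52003, 1.52001, 1.52004, 1.52000, 1.52001,
        1.52008, 1.52011, 1.52008, 1.52008, 1.52008)

n     <- length(nD)
x.bar <- mean(nD)        # sample average
s     <- sd(nD)          # sample standard deviation
se    <- s / sqrt(n)     # standard error of the mean

alpha <- 0.01                            # 99% confidence
t.c   <- qt(1 - alpha / 2, df = n - 1)   # critical t-value

ci <- c(x.bar - t.c * se, x.bar + t.c * se)

round(t.c, 2)   # critical value, to two decimals
round(ci, 5)    # interval endpoints, to five decimals

# Cross-check with R's built-in one-sample t procedure
t.test(nD, conf.level = 0.99)$conf.int
```

Carrying the unrounded mean and standard error through the arithmetic is why the upper endpoint rounds to 1.52009 rather than the 1.52008 obtained from the rounded inputs.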