Transcript Document

Hands-on Introduction to R

Why Learn Programming?

• We live in oceans of data. Computers are essential to record it and help analyse it.
• Competent scientists speak C/C++, Java, MATLAB, Python, Perl, R and/or Mathematica
• Data collection and analysis have been very important in forensic science since the NAS 2009 report
• Using the above languages, code can easily be made available for review/discovery

Getting a computer to do anything useful

• All machines understand is on/off!
  • High/low voltage
  • High/low current
  • High/low charge
  • 1/0 binary digits (bits)
• To make a computer do anything, you have to speak machine language to it:

  000000 00001 00010 00110 00000 100000

  Add 1 and 2. Store the result. (Example from Wikipedia)

Getting a computer to do anything useful

• Machine language is not intuitive and can vary a great deal over designs
• The basic operations, however, are the same, e.g.:
  • Move data here
  • Combine these values
  • Store this data
  • Etc.
• “Human readable” language for basic machine operations: assembly language

Getting a computer to do anything useful

• Assembly is still cumbersome for (most) humans

  Machine encoding: 10110000 01100001
  Assembly: MOV AL, 61h
  Meaning: move the number 97 over to “storage area” AL
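A small sketch (not part of the original slides) to check that the machine encoding and the assembly above say the same thing: 61h is hexadecimal, 01100001 is the same number in binary, and both are 97 in decimal. The variable names are made up for illustration; strtoi() and as.hexmode() are base R functions.

```r
# "61h" in assembly means hexadecimal 61. Convert it to a decimal number.
value.from.hex <- strtoi("61", base = 16L)

# The second byte of the machine encoding is the same number written in binary.
value.from.bin <- strtoi("01100001", base = 2L)

print(value.from.hex)
print(value.from.bin)

# The first byte of the machine encoding is the instruction itself.
# Show it in hexadecimal, which is how assembly programmers usually write it.
opcode <- strtoi("10110000", base = 2L)
print(as.hexmode(opcode))
```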

Getting a computer to do anything useful

• Better yet is a more “Englishy”, “high-level” language
  • Enter: C, C++, Fortran, Java, …
• Higher level languages like these are translated (“compiled”) to machine language
  • Not exactly true for Java, but it’s something analogous…

Getting a computer to do anything useful

• Even more “Englishy” and “high-level” are interpreted languages
  • Enter: R, MATLAB, Perl, Python, Mathematica, Maple, …
• The “code” of these languages is “interpreted” as commands by a program that is already running
  • They make many assumptions behind the scenes
  • Much easier to program with
  • Typically much slower than compiled languages

Why R?

• R is not a black box!
  • Code is available for review; totally transparent!
• R is maintained by a professional group of statisticians and computational scientists
  • From very simple to state-of-the-art procedures available
  • Very good graphics for exhibits and papers
• R is extensible (it is a full scripting language)
  • Coding/syntax similar to Python and MATLAB
  • Easy to link to C/C++ routines

Why R?

• Where to get information on R:
  • R: http://www.r-project.org/
    • Just need the base
  • RStudio: http://rstudio.org/
    • A great IDE for R
    • Works on all platforms
    • Sometimes slows down performance…
  • CRAN: http://cran.r-project.org/
    • Library repository for R
    • Click on Search on the left of the website to search for packages/info on packages

Finding our way around R/RStudio

Handy Commands:

• Basic Input and Output
  • Variables store information; <- is the assignment operator
  • Numeric input: x <- 4
  • Text (character) input: x <- "text goes in quotes"
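A minimal sketch of the two kinds of input above. The names num and txt are chosen here only so both values can be kept at once; class() is a base R function that reports what kind of information a variable stores.

```r
# Numeric input: store the number 4 in a variable.
num <- 4
print(num)
print(class(num))   # reports the type of the stored value

# Text (character) input: text goes in (straight) quotes.
txt <- "text goes in quotes"
print(txt)
print(class(txt))
```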

Handy Commands:

• Get help on an R command:
  • If you know the name: ?command name
    • ?plot brings up the HTML help page for the plot command
  • If you don’t know the name:
    • Use Google (my favorite)
    • ??key word

Handy Commands:

• R is driven by functions: func(argument1, argument2)
  • func is the function name; the input to the function goes in parentheses
  • The function returns something, which gets dumped into x:
    x <- func(arg1, arg2)
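To make the pattern concrete, here is a short sketch using two base R functions, sqrt() and seq(), in place of the generic func. They are picked only as examples.

```r
# sqrt() takes one argument and returns its square root; the result goes into x.
x <- sqrt(16)
print(x)

# seq() takes several arguments. Naming them makes the call easier to read.
y <- seq(from = 1, to = 10, by = 3)
print(y)
```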

Handy Commands:

• Input from Excel
  • Save the spreadsheet as a CSV file
  • Use the read.csv function
  • Needs the path to the file
    Mac e.g.: "/Users/npetraco/latex/papers/data.csv"
    Windows e.g.: "C:\Users\npetraco\latex\papers\data.csv"
    (inside an R string, write each backslash as \\ or use forward slashes)

*Exercise: basicIO.R
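A self-contained sketch of the read.csv step. Since the spreadsheet from the slides is not available here, the code first writes a tiny stand-in CSV to a temporary file; the column names fragment and nD and the three rows are illustrative assumptions, and this is not the content of basicIO.R.

```r
# Create a small stand-in for a spreadsheet that was saved as CSV.
csv.path <- tempfile(fileext = ".csv")
writeLines(c("fragment,nD",
             "1,1.52005",
             "2,1.52003",
             "3,1.52001"),
           csv.path)

# read.csv needs the path to the file and returns a data frame.
dat <- read.csv(csv.path)
print(dat)
str(dat)

# With a real file, pass its path instead, e.g. on Windows either
#   read.csv("C:/Users/<name>/data.csv")       (forward slashes), or
#   read.csv("C:\\Users\\<name>\\data.csv")    (doubled backslashes)
```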

Handy Commands:

• Matrices: X
  • X[,1] returns column 1 of matrix X
  • X[3,] returns row 3 of matrix X
• Handy functions for data frames and matrices:
  • dim, nrow, ncol, rbind, cbind
• User defined functions syntax:
  • func.name <- function(arguments) {
      do something
      return(output)
    }
  • To use it: func.name(values)
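The indexing rules and the function syntax above can be tried on a small made-up matrix. The matrix contents and the function name col.range are hypothetical, chosen only for illustration.

```r
# A 4 x 3 matrix; R fills it column by column.
X <- matrix(1:12, nrow = 4, ncol = 3)
print(X)

print(X[, 1])   # column 1 of X
print(X[3, ])   # row 3 of X
print(dim(X))   # number of rows, then number of columns

# rbind adds a row, cbind adds a column.
X.more.rows <- rbind(X, c(100, 200, 300))
X.more.cols <- cbind(X, 13:16)

# A user defined function: the range (max minus min) of one column.
col.range <- function(mat, col.idx) {
  output <- max(mat[, col.idx]) - min(mat[, col.idx])
  return(output)
}

# To use it:
print(col.range(X, 2))
```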

Handy Commands:

• User defined function example:
  • Compute the intensities of the Planck distribution
  • Let the user input a temperature
  • Let the user input the endpoint. Assume it is in nm
    • Careful here. Make sure wavelength units are consistent with the other constants.
  • What is the “easiest” thing to do??
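One possible sketch of this exercise, not the author's solution. It assumes the Planck distribution is meant as spectral radiance per unit wavelength, 2hc²/λ⁵ · 1/(exp(hc/(λkT)) − 1), with all constants in SI units, and it takes the "easiest" fix for the units to be converting nm to metres inside the function. The function name, the argument names and the 1 nm starting point and step are assumptions.

```r
planck.intensity <- function(Temp, endpoint.nm, start.nm = 1, step.nm = 1) {
  h  <- 6.62607015e-34   # Planck constant, J s
  cc <- 2.99792458e8     # speed of light, m/s
  kB <- 1.380649e-23     # Boltzmann constant, J/K

  # Wavelength grid in nm, as the user supplied it.
  lambda.nm <- seq(from = start.nm, to = endpoint.nm, by = step.nm)

  # Convert nm to m so the wavelengths match the SI constants above.
  lambda.m <- lambda.nm * 1e-9

  intensity <- (2 * h * cc^2 / lambda.m^5) /
    (exp(h * cc / (lambda.m * kB * Temp)) - 1)

  return(data.frame(wavelength.nm = lambda.nm, intensity = intensity))
}

# To use it: a 5800 K source, wavelengths from 1 nm up to 2000 nm.
spec <- planck.intensity(Temp = 5800, endpoint.nm = 2000)
plot(spec$wavelength.nm, spec$intensity, type = "l",
     xlab = "wavelength (nm)", ylab = "intensity")

# Wavelength at which the computed intensity is largest.
peak.nm <- spec$wavelength.nm[which.max(spec$intensity)]
print(peak.nm)
```

As a sanity check, the peak should land close to the value given by Wien's displacement law, about 2.898e6 nm·K divided by the temperature.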

First Thing: Look at your Data

• Explore the Glass dataset of the mlbench package
  • Source (load) all_data_source.R
  • *visualize_with_plots.r
• Scatter plots: plot any two variables against each other

[Figure: scatter plot; one axis is RI]
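A minimal scatter plot sketch. It assumes the mlbench package is installed and loads Glass directly from it rather than through all_data_source.R, which is not shown in the slides. The choice of RI against Ca is arbitrary.

```r
library(mlbench)   # assumes mlbench has been installed from CRAN
data(Glass)        # loads the Glass data frame

# Look at the data first: variable names and the first few rows.
print(names(Glass))
print(head(Glass))

# Scatter plot of two variables against each other.
plot(Glass$RI, Glass$Ca, xlab = "RI", ylab = "Ca")
```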

First Thing: Look at your Data

Pairs plots: do many scatter plots at once

[Figure: pairs plot; panels include Si, K and Ca]
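A sketch of a pairs plot with base R's pairs() function, again assuming mlbench is installed. The three columns are the ones that can still be read off the figure; the original may have shown more.

```r
data(Glass, package = "mlbench")   # assumes mlbench is installed

# Pick a few variables and draw every pairwise scatter plot at once.
some.vars <- Glass[, c("Si", "K", "Ca")]
pairs(some.vars)
```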

First Thing: Look at your Data

Histograms: “bin” a variable and plot frequencies

[Figure: histogram of RI]
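A histogram sketch with base R's hist(), assuming mlbench is installed. hist() chooses the bins itself here; the slide's figure may have used different ones.

```r
data(Glass, package = "mlbench")   # assumes mlbench is installed

# hist() bins the variable, draws the plot and returns the bins and counts.
h <- hist(Glass$RI, xlab = "RI", main = "Histogram of RI")
print(h$breaks)   # bin edges
print(h$counts)   # number of observations in each bin
```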

First Thing: Look at your Data

Histograms conditioned on other variables: use lattice package

[Figure: histograms of RI, titled “RIs Conditioned on glass group membership”; one panel per glass group (1, 2, 3, 5, 6, 7)]
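A sketch of a conditioned histogram with lattice's histogram() function. It assumes lattice and mlbench are installed, and that the glass group membership is the Type column of Glass.

```r
library(lattice)                   # assumes lattice is installed
data(Glass, package = "mlbench")   # assumes mlbench is installed

# "~ RI | Type" reads: histogram of RI, one panel for each level of Type.
p <- histogram(~ RI | Type, data = Glass, type = "count",
               xlab = "RI",
               main = "RIs Conditioned on glass group membership")
print(p)   # lattice plots are drawn when printed
```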

First Thing: Look at your Data

Probability density plots: also needs lattice

[Figure: probability density plot of RI]
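A sketch of a density plot with lattice's densityplot() function, assuming lattice and mlbench are installed. The rug of data points along the axis is an optional extra, not something taken from the slide.

```r
library(lattice)                   # assumes lattice is installed
data(Glass, package = "mlbench")   # assumes mlbench is installed

# Smoothed estimate of the probability density of RI.
p.dens <- densityplot(~ RI, data = Glass, xlab = "RI", plot.points = "rug")
print(p.dens)
```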

First Thing: Look at your Data

Empirical Probability Distribution plots: also called the empirical cumulative distribution function (ECDF)

[Figure: empirical cumulative distribution of RI; vertical axis runs from 0.0 to 1.0]
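A sketch using base R's ecdf(), assuming mlbench is installed. ecdf() returns a function, called Fn here, that gives the fraction of observations at or below any value.

```r
data(Glass, package = "mlbench")   # assumes mlbench is installed

# Build the empirical cumulative distribution function of RI and plot it.
Fn <- ecdf(Glass$RI)
plot(Fn, xlab = "RI", ylab = "Fraction of observations <= RI",
     main = "Empirical cumulative distribution of RI")

# Fn can be evaluated like any other function.
print(Fn(median(Glass$RI)))   # at least half the data lie at or below the median
```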

First Thing: Look at your Data

Box and Whiskers plots:

[Figure: box-and-whiskers plot of RI, annotated with the range, the possible outliers, the 1st quartile (25th percentile), the median (50th percentile) and the 3rd quartile (75th percentile)]

Visualizing Data

• Note the relationship:

First Thing: Look at your Data

Box and Whiskers plots:

[Figure: two panels, “Box-Whiskers plots for actual variable values” and “Box-Whiskers plots for scaled variable values”, each showing Al, Ba, Ca, Fe, K, Mg, Na, RI and Si]
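A sketch of both panels with base R's boxplot() and scale(), assuming mlbench is installed. The slides do not say how the variables were scaled; centering each variable to mean 0 and standard deviation 1 with scale() is an assumption.

```r
data(Glass, package = "mlbench")   # assumes mlbench is installed

# Keep the numeric variables; Type is the group label, not a measurement.
vars <- Glass[, names(Glass) != "Type"]

# Actual values: variables on very different scales are hard to compare.
boxplot(vars, main = "Box-Whiskers plots for actual variable values")

# Scaled values: each variable centered to mean 0 and scaled to sd 1.
scaled.vars <- as.data.frame(scale(vars))
boxplot(scaled.vars, main = "Box-Whiskers plots for scaled variable values")
```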

Confidence Intervals

• A confidence interval (CI) gives a range in which a true population parameter may be found.
• Specifically, (1 − α)×100% CIs for a parameter, constructed from random samples (of a given sample size), will contain the true value of the parameter approximately (1 − α)×100% of the time.
• Different from tolerance and prediction intervals

Confidence Intervals

Caution: IT IS NOT CORRECT to say that there is a (1 − α)×100% probability that the true value of a parameter is between the bounds of any given CI.

[Figure: graphical representation of 90% CIs for a parameter. Take a sample, compute a CI, and repeat. Here 90% of the CIs contain the true value of the parameter.]
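The "take a sample, compute a CI" picture can be imitated with a small simulation. Everything in this sketch is made up for illustration: the normal population, its true mean of 10, the sample size of 20 and the 1000 repetitions.

```r
set.seed(1)          # makes the simulation repeatable

true.mean <- 10      # the "true value of the parameter" (known only because we simulate)
n         <- 20      # size of each sample
n.samples <- 1000    # how many times we take a sample and compute a CI

covers <- replicate(n.samples, {
  x  <- rnorm(n, mean = true.mean, sd = 2)       # take a sample
  ci <- t.test(x, conf.level = 0.90)$conf.int    # compute a 90% CI for the mean
  ci[1] <= true.mean && true.mean <= ci[2]       # does this CI contain the true value?
})

# Fraction of the CIs that contain the true mean; it should be near 0.90.
coverage <- mean(covers)
print(coverage)
```

Each individual interval either contains the true mean or it does not; the 90% describes how often the procedure succeeds over many samples.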

Confidence Intervals

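• Construction of a CI for a mean depends on:
  • Sample size $n$
  • Standard error for means $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
  • Level of confidence $1 - \alpha$
    • $\alpha$ is the significance level
    • Use $\alpha$ to compute the $t_c$-value (with $n - 1$ degrees of freedom)
• $(1 - \alpha) \times 100\%$ CI for population mean using a sample average and standard error is:

  $\left[\, \bar{x} - t_c\, s_{\bar{x}},\;\; \bar{x} + t_c\, s_{\bar{x}} \,\right]$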

Confidence Intervals

• Compute a 99% confidence interval for the mean using this sample set:

  Fragment #    Fragment nD
   1            1.52005
   2            1.52003
   3            1.52001
   4            1.52004
   5            1.52000
   6            1.52001
   7            1.52008
   8            1.52011
   9            1.52008
  10            1.52008
  11            1.52008

  $\bar{x} = 1.52005$
  $s = 0.00004$
  $s_{\bar{x}} = 0.00001$
  $t_c = 3.17$ ($\alpha/2 = 0.005$)

Putting this together: [1.52005 − (3.17)(0.00001), 1.52005 + (3.17)(0.00001)]

99% CI for the mean from this sample = [1.52002, 1.52009] (bounds computed from the unrounded values)

*Try out confidence_intervals.R
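The worked example can be reproduced in a few lines of R. This is a sketch and not the contents of confidence_intervals.R, which is not shown in the slides; the variable names are made up. It uses the eleven nD values listed above and the base R functions mean(), sd(), qt() and t.test().

```r
# The eleven fragment nD values from the sample set.
nD <- c(1.52005, 1.52003, 1.52001, 1.52004, 1.52000, 1.52001,
        1.52008, 1.52011, 1.52008, 1.52008, 1.52008)

n    <- length(nD)    # sample size
xbar <- mean(nD)      # sample average
s    <- sd(nD)        # sample standard deviation
se   <- s / sqrt(n)   # standard error for the mean

alpha <- 0.01                            # 99% confidence
tc    <- qt(1 - alpha / 2, df = n - 1)   # t_c-value with n - 1 degrees of freedom

ci <- c(xbar - tc * se, xbar + tc * se)

print(round(c(xbar = xbar, s = s, se = se, tc = tc), 5))
print(round(ci, 5))

# Cross-check: t.test() computes the same interval in one call.
ci.check <- as.numeric(t.test(nD, conf.level = 0.99)$conf.int)
print(round(ci.check, 5))
```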