Amazon PIRE Data Processing Tutorial


EPS 236 Preliminary Steps for Data Analysis Tutorial

Step-by-step preparations for effective file management, data formatting, visualization, and analysis

S. C. Wofsy, January 2010 version


Goals and scope of the introductory tutorial

Scientific data are typically created as, or converted to, electronic files providing information in terms of numerical data complemented by metadata in terms of data descriptors (time, location, units, etc.). Analysis of these data usually proceeds in a series of steps:

1. Acquisition of the data in electronic file format
2. Formatting of the data set to enable it to be read using a data analysis application
3. QA/QC of the data, using visualization tools, statistical tools, etc.
4. Assessment of the data
5. Analysis of the data to provide quantitative information and data products.

A key principle for data analysis is that all of the steps 1-5 above must be traceable and reproducible. Often students may wish to explore and assess data files using graphical user interfaces (GUIs), and the tutorial will help students develop their skills with GUIs. However, our key principle translates into the following requirement: the entire process must be repeatable starting from the most raw version of the data. We must therefore eschew commonly used spreadsheet programs in favor of much more capable object-oriented data analysis applications. These programs may be applied using both powerful GUIs, which record each command that you execute for future use, and scripts, which are essentially sets of command line instructions. Also, analysis of environmental data will lead us from simple statistical tests and figures to sophisticated, rigorous procedures and carefully customized graphics, providing additional impetus to develop the expertise to use data manipulation and analysis programs.

Finally, colleagues do not use the same computer systems or have the same licenses for software. The applications we will use, and our other data products, will be independent of platform and operating system, and will be open source and free of licensing fees.

Table of Contents: Preliminary Data Tutorial

Session 1. Preparing your computer.

Session 2a. Basic R-tutorial, part 1.

Session 3a. Basic R-tutorial, part 2.

Session 4a. Intermediate R-tutorial.

Session 2b. Basic Octave/Matlab tutorial, part 1.

Session 3b. Basic Octave/Matlab tutorial, part 2.

Session 4b. Intermediate Octave/Matlab tutorial.

This tutorial provides training for students to analyze data using the following applications, free for downloading (with proprietary equivalents):

• R (Splus)
• GNU-Octave (Matlab)

Students who know how to use IDL, and have licenses for this application, can use IDL for data analysis.

Important note:

Experience has shown that Excel and similar spreadsheet programs cannot be successfully applied to analysis of our data sets. Students can readily ingest data into the spreadsheets, but then find it extremely difficult to clean and assess data, and their manipulations are not traceable. Therefore the use of Excel for data analysis will not be permitted.

The tutorial provides specific help for students using the following operating systems on their computers:

• Microsoft Windows (XP)
• Apple/Mac Leopard and Snow Leopard
• Linux (Ubuntu)

Adjustments may need to be made for other versions of these operating systems.

Preparing your computer

• Download or install an easy-to-use syntax text editor. This is required so that you can edit data files and scripts without any changes in file format. These applications also provide color-coded "syntax highlighting" that greatly facilitates writing and editing program scripts.

Win: notepad++
Mac: Jedit.app (use the Java installation procedure); TextWrangler is 2nd choice
Linux: gedit

"Office" type applications (MS-Word/wordpad, Pages, etc) cannot be used. Vanilla notepad (Win) and TextEdit (Mac) are inadequate.

• Download and install your data analysis application; use R (the language used in the course) unless you are already familiar with Matlab/Octave.

R (http://www.r-project.org/) [Ubuntu users download from the Repository, r-cran]
Octave (http:// ) or Matlab (installation disk or keyed install from Harvard FAS)

• Win only: Download and install required file management tools from Gnuwin32 (http://gnuwin32.sourceforge.net/packages/packages.html): "coreutils", "which", "gzip", "tar", and "grep"

Important: Matlab or IDL installations require a license server!

Preparing your computer (continued)

• Win only: Make adjustments to your path environment variable.

Start => control panel => system => advanced tab
Click "Environment Variables"
Select "path" under system variables, and add the following to the end of path:

;c:\program files\gnuwin32\bin;c:\program files\R\\bin;c:\program files\notepad++

[use c:\program files (x86)\... for 64-bit Windows XP installations]

or similarly if you are using Octave or Matlab. To find the full path to your R version, use Windows Explorer to navigate to the application file "R.exe". You can copy the path from the address bar (do not include "R.exe" itself in the path).

Install a shortcut to cmd.exe (in C:\WINDOWS\system32) on your desktop or quicklaunch. Change the "properties" (right click on the title bar) to include "quick edit mode" and to display more lines, and apply to all windows from the original shortcut.

In Windows Explorer, click tools => folder options, uncheck "Hide extensions" and check "Show hidden files and folders".

Preparing your computer (continued)

• Linux, Mac: Put the Terminal application on your application bar.

Mac only: Install X-code from your Mac installation CD (to install packages). Install MacPorts/Darwinports from the Internet (http:// ). Install "gfortran".

Open the Terminal application, and in your home folder (/Users/your_username) add the following lines to the file .profile, using the editing program you have installed (create .profile if it does not exist, in folder $HOME).

defaults write com.apple.finder AppleShowAllFiles TRUE
killall Finder

Ubuntu only: Install c++, gcc, and gfortran from the repository (may be needed to install some R packages).

Learning to use your computer (as a computer)

Ultra-simple hands-on activities (be sure you can do these!):

• find out how to use a command
• create your data file structure
• list files; locate files (GUI ok)
• find out how big a file is, how many lines it has, etc.
• create and edit a simple file: a data file; an R script
• copy, move and delete files
• search for strings within files

Learning to use your computer (continued)

Find information on how to use a command

Win: From the Desktop: Look up the help information for cmd; all the commands are listed. From within the cmd window, type the command name followed by "/?" or "-h". Examples:

mkdir /?
pwd -h

Linux, Mac: From the Terminal: type "man" followed by the command name, e.g. man mkdir

Learning to use your computer (continued)

Data file structure

You will need a convenient place to put your data files and the scripts that will analyze them. Since the path to this folder will have to be specified in your scripts, keep the name short and the location easy to find. Do not include any symbols other than letters, numbers, and "_"; no spaces should be used.

Good locations might be c:\EPS236 (Win) or $HOME/EPS236 (Linux/Mac). You will need subfolders for data, scripts, etc.

You may use the GUI to do this (Windows Explorer (Win), Finder (Mac), or Nautilus (Ubuntu)), but this is a good place to start using the command window (Win) or Terminal application (Mac/Linux). Using the command/terminal window:

Win:
mkdir c:\EPS236
mkdir c:\EPS236\scripts
etc.

Linux, Mac:
mkdir $HOME/EPS236
mkdir $HOME/EPS236/scripts
etc.

Note: $HOME refers to your home directory on Linux/Mac (type "echo $HOME" from the terminal). Typing "mkdir EPS236" has the same effect as the above if you are working in c:\ (Win) or in the home folder (Linux/Mac); to get there, type "cd c:\" or "cd $HOME" ("cd" = change directory).
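The folder setup described above can be sketched for the Linux/Mac Terminal as follows. This is a minimal sketch, not part of the original tutorial: the variable name COURSE_DIR is introduced here only for illustration, and the -p option of mkdir creates any missing parent folders and does not complain if a folder already exists.

```shell
# Hypothetical variable name; the tutorial itself simply uses $HOME/EPS236.
COURSE_DIR="$HOME/EPS236"

# -p creates missing parent folders and is silent if the folder exists.
mkdir -p "$COURSE_DIR/data" "$COURSE_DIR/scripts"

# Confirm the result: list the new folder, one entry per line.
ls -1 "$COURSE_DIR"
```

Running the commands a second time is harmless, which is convenient when the same lines are kept in a setup script.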

Learning to use your computer (continued)

Find properties of the files in a folder

Before leaving c:\ or your home directory, try finding out about the files in the folder.

ls -al (lists files and their properties; ls -1: short list; ls -alt: in time order, …)
wc filename (gives the number of lines, number of words, and number of bytes in a file; wc reports on only the named file; "*" is a wildcard)

Some notable anomalies:

Linux treats upper and lower case commands, filenames, etc. as different. Windows ignores upper/lowercase. Mac sometimes ignores case and sometimes does not. To make your work transportable, assume upper and lower case filenames are different, but do not give different files the same name with different case.

Folder names in a path are separated by a forward slash "/" in Linux and Mac, and a backward slash "\" in Windows. Windows also recognizes the "/" but inconsistently, and all three recognize the "\" as an "escape character" that affects the treatment of the following character (Windows inconsistently). Spaces (" ") are used to separate parts of a command. To reference a file or folder with a space in its name, the name should be surrounded by quotes.

Avoid putting spaces in file names.
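The listing, counting, and quoting rules above can be tried out safely in a scratch folder. The sketch below is for the Linux/Mac Terminal; the file names notes.txt and "bad name.txt" are made up for illustration and are not part of the tutorial.

```shell
# Work in a scratch folder so nothing in your real folders is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a three-line file, and a second file whose name contains a space.
printf 'a\nb\nc\n' > notes.txt
printf 'x\n' > "bad name.txt"

ls -al              # all files, with sizes and dates
wc notes.txt        # lines, words, and bytes for one named file
wc -l *.txt         # "*" is a wildcard: line counts for every .txt file

# A name containing a space must be quoted, or the shell sees two names.
wc -l "bad name.txt"
```

Leaving the quotes off the last command makes wc look for two files, "bad" and "name.txt", which is one reason to avoid spaces in file names altogether.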

Learning to use your computer (continued)

Create and edit simple files from the cmd or Terminal window

Change directory to EPS236\data (Win) or EPS236/data (Linux/Mac). Open your editing application:

Win: notepad++.exe
Mac: open /Applications/Jedit.app
Linux: gedit

Create a file with the following content, and save it into the folder EPS236/data with file name "testfile.txt":

X Y
1 0.53
2 4.75
3 9.37
4 16.38
5 24.67
6 37.34
7 48.44
8 64.41
9 81.93
10 99.83
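Once the file exists, it is worth checking from the command line that it has the expected shape. The sketch below (Linux/Mac Terminal) writes the same content with a "here document" instead of the editor, purely so that the example is self-contained; the scratch folder is an assumption, and in the tutorial the file belongs in your data folder.

```shell
# Hypothetical scratch location, used only to keep this example self-contained.
workdir=$(mktemp -d)
cd "$workdir"

# A "here document" writes the lines between the EOF markers into the file.
cat > testfile.txt <<'EOF'
X Y
1 0.53
2 4.75
3 9.37
4 16.38
5 24.67
6 37.34
7 48.44
8 64.41
9 81.93
10 99.83
EOF

wc -l testfile.txt       # line count: 1 header line plus 10 data rows
head -n 3 testfile.txt   # show the header and the first two data rows
```

If the line count is off by one, the usual cause is a missing line ending after the last row.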

Learning to use your computer (continued)

Also, create the following file with name "testfile0.txt":

x.1 x.2
0 0.9053910
1 -0.4245758
2 2.1638530
3 3.2013392
4 1.0568681
5 2.7682038
6 2.7272512
7 2.8435819
8 6.3021333
9 6.0850179
0 5.9820819
11 6.8738404
12 5.6178844
13 5.9797397
14 6.8131798
15 8.2127355
16 8.5939752
18 8.9382829
19 8.6808897
11 8.9008140
20 10.0755642

Exercise 1.

1. Make copies of testfile.txt called testfile_copy.txt and dummy.txt using the command cp (in Windows, "copy" will also work). Check the result using your installed special editor (not the default editor). Then remove dummy.txt using the command rm (del will also work in Win). Then rename/move file testfile_copy.txt to testfile_newcopy.txt using the command mv (move will also work in Win). Make a listing of the contents of your folder using ls -al > filelist.txt.

2. The command grep " 6" filename > newfile selects every line in "filename" that contains a space followed by the number 6, and puts the output of the command grep into a file called newfile. Apply this command to find the lines that have a 5 in testfile0.txt and put them into a file called result.txt. Hints: First execute this command without the "> newfile" part to see the output on the screen; then run it with "> newfile" and inspect "newfile" to see if it contains the expected results. The symbol ">" directs the output of the command grep into file "newfile".

3. The command awk '{print $n}' filename > newfile extracts the nth column from file "filename" and places the output into "newfile" (replace n with the column number, e.g. $2). Extract the 2nd column from testfile.txt and put it into a file called testfile_col2.txt. Note: In Windows use " rather than ' in this command. Hand in testfile_col2.txt.
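The commands used in Exercise 1 can be rehearsed on a throwaway file before applying them to the tutorial files. The sketch below is for the Linux/Mac Terminal; the file sample.txt and its three columns are invented for illustration and are not one of the course data sets.

```shell
# Work in a scratch folder (hypothetical location).
workdir=$(mktemp -d)
cd "$workdir"

# Invented three-column sample file: a header line plus three data rows.
printf 'id temp flux\n1 16.2 0.41\n2 6.5 0.38\n3 17.9 0.62\n' > sample.txt

cp sample.txt sample_copy.txt        # copy
mv sample_copy.txt sample_new.txt    # rename/move
rm sample_new.txt                    # delete

# Keep only lines containing a space followed by 6;
# ">" sends the output to a file instead of the screen.
grep " 6" sample.txt > has_6.txt

# Keep only the third column; $3 is awk's name for field 3 of each line.
awk '{print $3}' sample.txt > col3.txt

cat has_6.txt col3.txt
```

Note that the pattern " 6" matches the row with 6.5 but not the rows with 16.2 or 0.62, because in those the 6 is not directly preceded by a space. The same kind of check is useful when reviewing your own grep results.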

R-tutorial

(Octave/Matlab users skip to "Octave Tutorial")

The basic R-tutorial covers the first two chapters (11 pages) of the document R intro.pdf "An introduction to R" by W. N. Venables, D. M. Smith and the R Development Core Team, plus items from some of the other sections listed below.

Getting started:

Read Chapters 1 and 2 of "An introduction to R", being sure to type into your computer each R command shown in the chapter. Take careful note of the results. Learn about, and try out, the command setwd("foldername") .

When you complete this reading, save the result in the file EPS236/scripts/tutorial.r using the savehistory("filename") command. After closing R, open this file with your editor and note the syntax highlighting.

Some notable anomalies:

Due to the conflict involving the Windows "\" symbol, folder separators in filenames referenced within R are designated with two backslashes "\\" or one forward slash ("/"). Don't mix these in one path/file name.

When a data frame is created by reading a file into R using "read.table()", columns of alphabetic data are by default made into "factors" (in R versions before 4.0.0). This should be prevented using the argument "as.is=T" in the invocation of read.table(). Example:

Win: read.table("c:/EPS236/data/testfile0.txt", as.is=T) or read.table("c:\\EPS236\\data\\testfile0.txt", as.is=T)
Linux, Mac: read.table("~/EPS236/data/testfile0.txt", as.is=T)

Note: R does not expand shell variables such as $HOME inside a file name; use "~" to refer to your home folder.

R-tutorial

(continued)

*Basic Tutorial components:
What is "Object oriented programming"? What are "attributes"?
Matrix and data frame: creating and manipulating
Plotting data, exploring data
Fitting data to a straight line; to a curved line; ordinary regressions and RMA regressions
Simple statistics on data: means, medians, quantiles, t-test, confidence intervals
Outliers; time series of data
Saving your work: objects, commands, functions, graphs

*Intermediate tutorial
Scripting: what, why, how.

*Data sets
Tree diameter data
Soil flux chamber data
Temperature data

R-tutorial

(continued)

Exercise 2

*Basic Tutorial

1. Create data frames from the files testfile.txt and testfile0.txt that you made earlier in the tutorial. Hint: use the header=T argument to ensure that the columns will have the colnames attribute.

2. Make graphs of Y vs X and x.2 vs x.1 using the names of the columns in the plotting command. Save the figures as "png" or "jpg" graphics files (use dev.copy() followed by dev.off()).

3. Fit the data to polynomials, e.g. Y = a1 + a2*X + a3*X^2 + …, selecting the order of the polynomial by looking at the graphs you have made. Hint: You will create an object with a command of the form object_name = lm(Y ~ X + I(X^2) + ..., data_frame_name). Inside an R formula a power term must be wrapped in I(); a bare X^2 is read as a formula operator and reduces to X.

4. Plot your best fit curve on the graph of Y vs X. Hint: look at what is accomplished by the function predict().

5. Use summary() to examine the parameters of the fit and their uncertainties.

6. Read in the file T-test-file.txt downloaded from the website. Read about the t test (http:// …). Examine the paired variables A and B from the file as to whether their respective means are different in a statistically significant way.

1 Introduction and preliminaries
1.1 The R environment
1.2 Related software and documentation
1.3 R and statistics
1.4 R and the window system
1.5 Using R interactively
1.6 An introductory session
1.7 Getting help with functions and features
1.8 R commands, case sensitivity, etc.
1.9 Recall and correction of previous commands
1.10 Executing commands from/diverting output to a file
1.11 Data permanency and removing objects
2 Simple manipulations; numbers and vectors
2.1 Vectors and assignment
2.2 Vector arithmetic
2.3 Generating regular sequences
2.4 Logical vectors
2.5 Missing values
2.6 Character vectors
2.7 Index vectors; selecting and modifying subsets of a data set
2.8 Other types of objects
3 Objects, their modes and attributes
3.1 Intrinsic attributes: mode and length
3.2 Changing the length of an object
3.3 Getting and setting attributes
5 Arrays and matrices
5.1 Arrays
5.2 Array indexing. Subsections of an array
6 Lists and data frames
6.1 Lists
6.2 Constructing and modifying lists
6.2.1 Concatenating lists
6.3 Data frames
6.3.1 Making data frames
6.3.2 attach() and detach()
6.3.3 Working with data frames
6.3.4 Attaching arbitrary lists
6.3.5 Managing the search path
7 Reading data from files
7.1 The read.table() function
7.2 The scan() function
7.3 Accessing builtin datasets
7.3.1 Loading data from other R packages
7.4 Editing data
9 Grouping, loops and conditional execution
9.1 Grouped expressions
9.2 Control statements
9.2.1 Conditional execution: if statements
9.2.2 Repetitive execution: for loops, repeat and while
Appendix A A sample session
Appendix B Invoking R
B.1 Invoking R from the command line
B.2 Invoking R under Windows
B.3 Invoking R under Mac OS X
B.4 Scripting with R
Appendix C The command-line editor
C.1 Preliminaries
C.2 Editing actions
C.3 Command-line editor summary
10 Writing your own functions
10.1 Simple examples
12 Graphical procedures
12.1 High-level plotting commands
12.1.1 The plot() function
12.1.2 Displaying multivariate data
12.1.3 Display graphics
12.1.4 Arguments to high-level plotting
12.2 Low-level plotting commands

optional reading

12.2.1 Mathematical annotation
12.3 Interacting with graphics
12.4 Using graphics parameters
12.4.1 Permanent changes: The par() function
12.4.2 Arguments to graphics functions
12.5 Graphics parameters list
12.5.1 Graphical elements
12.5.2 Axes and tick marks
12.5.3 Figure margins
12.5.4 Multiple figure environment
12.6 Device drivers
12.6.1 PostScript diagrams for typeset documents
12.6.2 Multiple graphics devices
12.7 Dynamic graphics
13 Packages
13.1 Standard packages
13.2 Contributed packages and CRAN
13.3 Namespaces

Default installations of R should have the following packages: base, stats, stats4, graphics, grDevices, and a few others (type "library()" to list what you have).

Using "install.packages()", try adding the following packages (some may not install; don't be concerned). Type "help(install.packages)" or "help.search("install packages")" to see how to use the function "install.packages()":

akima      Interpolation of irregularly spaced data
datasets   The R Datasets Package
fields     Tools for spatial data
foreign    Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...
gstat      Geostatistical package
lattice    Lattice Graphics
mapdata    Extra Map Databases
mapproj    Map Projections
maps       Draw Geographical Maps
matlab     MATLAB emulation package
Matrix     Sparse and Dense Matrix Classes and Methods
sp         Classes and methods for spatial data
spatial    Functions for Kriging and Point Pattern Analysis
splines    Regression Spline Functions and Classes
splus2R    S-PLUS functionality missing from R
tseries    Time series analysis and computational finance
utils      The R Utils Package