Programming Fundamentals


CCPR Computing Services
More Efficient Programming
Courtney Engel
October 12, 2007
Outline



Overview of programming
Thinking through a programming task
Ways of efficiently documenting and organizing your project
- Naming variables, programs, files
- Commenting code
- Including file header
- Implementing directory structure
Programming constructs
- Examples
Raw data -> finished product: Replicable?
Overview

“Recipe” to complete given task



Commands that tell your computer what to do
Language standards determine correct commands
Basic programming allows you to:



Read, write, and reformat data files
Perform data calculations
Have the computer complete mundane tasks and minimize human error
Before you start coding…



Conceptualize
Clearly define the problem in writing
Write down the solution/algorithm in English





Modularity
Create test (if reasonable)
Translate one section to code
Test the section thoroughly
Translate/Test next section, etc.
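The translate-then-test steps above can be sketched in Stata as follows. This is a minimal illustration rather than part of the original slides: it assumes Stata's shipped auto dataset, and the variable name logWt is chosen only for the example. assert stops the do-file with an error if its condition fails for any observation.

```stata
* Sketch: translate one section to code, then test it before moving on.
* Assumes the auto dataset that ships with Stata; logWt is an illustrative name.
sysuse auto, clear

* Section 1: create the log of weight
generate double logWt = ln(weight)

* Test section 1: no unexpected missing values, and the transform round-trips
assert !missing(logWt) if weight > 0 & !missing(weight)
assert abs(exp(logWt) - weight) < 1e-6 if !missing(logWt)
```

If either assert fails, the do-file halts at that line, so a problem is caught before the next section is written.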
Documentation - File Header

File header includes:
- Name (email)
- Project
- Project location
- Date
- Software Version
- Purpose of program
- Inputs
- Outputs
- Special Instructions
*Josie Bruin ([email protected])
*HRS project
*/u/socio/jbruin/HRS/
*October 5, 2007
*Stata version 8
*Purpose: Create and merge two datasets in Stata,
* then convert data to SAS
*Input programs:
* HRS/staprog/H2002.do,
* HRS/staprog/x2002.do,
* HRS/staprog/mergeFiles.do
*Output:
* HRS/stalog/H2002.log,
* HRS/stalog/x2002.log,
* HRS/stalog/mergeFiles.log
* HRS/stadata/Hx2002.dta
* HRS/sasdata/Hx2002.sas
*Special instructions: Check log files for errors
* check for duplicates upon new data release
Naming Files, Variables, and Functions






Use language standard (if it exists)
Be aware of language-specific rules
- Max length, underscore, case, reserved words
Meaningful variable names:
- LogWt vs. var1
- AgeLt30 vs. x
Procedure that cleans missing values of Age:
- fixMissingAge
Matrix multiplication X transpose times X
- matXX
Differentiating log files:
- Programs: MergeHH.sas, MergeHH.do
- Log files: MergeHHsas.log, MergeHHsta.log
Commenting Code
Good code is self-commenting
- Naming conventions, structure/formatting, header should explain 95%
Comments should explain
- Purpose of code, not every detail
- Tricks used
- Reasons for unusual coding
Comments do not
- Fix sloppy code
- Translate syntax
If it takes longer to read the comment than to read the code, don’t add a comment!
Commenting Code - Stata example
Compare formatting, comments, variable name and function names
SAMPLE 1
program def function1
foreach v of varlist _all {
local x = lower("`v'")
if `"`v'"' != `"`x'"' {
rename `v' `=lower("`v'")'
}
}
end
SAMPLE 2
*Convert names in dataset to lowercase.
program def lowerVarNames
    foreach v of varlist _all {
        local LowName = lower("`v'")
        if `"`v'"' != `"`LowName'"' {
            rename `v' `LowName'
        }
    }
end
Directory Structure




A project consists of many different types of files
Use folders to separate files in a logical way
Be consistent across projects if possible
ATTIC folder for older versions
HOME
PROJECT NAME
DATA
RESULTS
LOG
PROGRAMS
ATTIC
Stata example: using directory structure
** Paths:
global parentpath "C:\Documents and Settings\jbruin\Fall07\prog\progtips"
global pgmsloc "$parentpath\pgms"
global logsloc "$parentpath\logs"
global cleandataloc "$parentpath\data\clean"
global rawdataloc "$parentpath\data\raw"
log using "$logsloc\test200710", text replace
*********************************************************************
*INSERT FILE HEADER HERE...then it’s included in log file.
*********************************************************************
macro list
webuse union, clear
save "$rawdataloc\union.dta", replace
keep idcode year age grade
save "$cleandataloc\unionLJP.dta", replace
log close
Programming Constructs



Tools to simplify and clarify your coding
Available in virtually all languages
Constructs




Loops - for, foreach, do, while
If/elseif/else - if, then, else, case
continue
exit
Loop Construct
The syntax for foreach is
foreach lname {in | of listtype} list {
    Stata commands referring to `lname'
}
where lname is the name of the new local macro and listtype is the type of list on which you want to operate.
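The two forms of the syntax above can be sketched as follows. This is an illustration only, not from the original slides: it assumes Stata's shipped auto dataset, and the word list in the second loop is arbitrary.

```stata
* Sketch of both foreach forms; assumes the auto dataset that ships with Stata.
sysuse auto, clear

* "of varlist": Stata checks that each name is an existing variable
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean)
}

* "in": the list is taken literally, one word at a time
foreach word in alpha beta gamma {
    display "`word'"
}
```

Inside the braces the loop macro is always referenced as `lname', with an opening backtick and a closing straight apostrophe.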
Loop Example 1 – pulling from 2 lists
From Stata FAQ website
Code:
local animalgrp "cat dog cow pig"
local noisegrp "meow woof moo oinkoink"
local n : word count `animalgrp'

forvalues i = 1/`n' {
local animal : word `i' of `animalgrp'
local noise : word `i' of `noisegrp'
display "`animal' says `noise'"
}
Resulting output:
cat says meow
dog says woof
cow says moo
pig says oinkoink
Loop Example 2
Given indicator variables white, black, other, and continuous
variable EducYrs, create interaction variables
Solution using loop:
local allraces "white black other"
foreach race of varlist `allraces' {
generate `race'_educ=`race'*EducYrs
}


Obs #  White  Black  Other  EducYrs  White_educ  Black_educ  Other_educ
1      1      0      0      10       10          0           0
2      0      1      0      15       0           15          0
3      0      0      1      20       0           0           20
Loop Example 3

Problem:


Dataset contains variables over multiple years (1970-1990)
Need to perform a number of commands separately for 1970, 1975, 1980, 1985.

Solution without loop
bysort year: command1 if year==70 | year==75 | year==80 | year==85
bysort year: command2 if year==70 | year==75 | year==80 | year==85

Solution with loop
foreach year in 70 75 80 85 {
display as result "***Regression for year = `year':"
regress ln_wage grade tenure ttl_exp if year==`year'
display as result "***Summarize for year = `year':"
summarize ln_wage if year==`year'
}
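Because the years in this example are evenly spaced, the same loop could also be written with forvalues and a step. The sketch below is an illustration, not part of the original slides; it assumes the nlswork example dataset (loaded with webuse), which contains the year and ln_wage variables used in the loop above.

```stata
* Sketch: forvalues with a step of 5 in place of an explicit list of years.
* Assumes the nlswork example dataset, where year is stored as 70, 75, ...
webuse nlswork, clear

forvalues year = 70(5)85 {
    display as result "***Summarize for year = `year':"
    summarize ln_wage if year == `year'
}
```

foreach with an explicit list remains the better choice when the values are irregular.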
Constructs - If/then/else

Execute section of code if condition is true:
if condition then
{execute this code if condition true}
end

Execute one of two sections of code:
if condition then
{execute this code if condition true}
else
{execute this code if condition false}
end
If/Else Example


Problem: need to execute commands on an operating system, but only if the OS is Unix; the commands will fail if the OS is anything else
Solution:
if "`c(os)'"~="Unix" {
display as err "Sorry; this section requires Unix OS."
}
else {
** continue with unix commands…
}
Constructs - Elseif/case

Elseif - Execute one of many sections of code:
if condition1 then
{execute this code if condition1 true}
elseif condition2 then
{execute this code if condition2 true}
else
{execute this code if condition1 and condition2 are both false}
end

Case- same idea, different name
case condition1 then
{execute this code if condition1 true}
case condition2 then
{execute this code if condition2 true}
etc.
Elseif Example


Problem: Continue example from if…else, but execute different
section of code for Unix, Windows, and Mac
Solution:
if "`c(os)'"=="Unix" {
display "This is a Unix environment"
}
else if "`c(os)'" == "Windows" {
display "This is a Windows environment"
}
else if "`c(os)'" =="MacOSX" {
display "This is a MacOS environment."
}
else {
display as err "`c(os)' not recognized."
}
Example

Problem: Given 4 indicator variables (south, union, black, not_smsa) and 2 discrete variables (age, grade), generate 8 new indicator variables:
- south_age21 = south and age > 21
- south_gr12 = south and grade > 12
- Similarly for union, black, not_smsa
Solution without loop
- 8 lines of code similar to:
generate newvar = (south==1 & age>21 & age<.)
generate newvar = (south==1 & grade>12 & grade<.)
Solution with loop
foreach j in south union black not_smsa {
generate `j'_age21 = (age>21 & age<. & `j'==1)
generate `j'_gr12 = (grade>12 & grade<. & `j'==1)
}
Example, cont.
*CHECK GENERATED VARIABLES AGAINST ORIGINAL VARIABLES
foreach j in south union black not_smsa {
    quietly count if `j'==1 & age>21 & age<.
    local origCount = r(N)
    quietly count if `j'_age21==1
    if `origCount' ~= `r(N)' {
        display "Counts do not match for `j'_age21!"
    }
    else {
        display "Counts match for `j'_age21."
    }
    quietly count if `j'==1 & grade>12 & grade<.
    local origCount = r(N)
    quietly count if `j'_gr12==1
    if `origCount' ~= `r(N)' {
        display "Counts do not match for `j'_gr12!"
    }
    else {
        display "Counts match for `j'_gr12."
    }
}

Obs #  South  Age  Grade  South_age21  South_gr12
1      1      10   5      0            0
2      1      35   16     1            1
3      0      14   9      0            0
4      0      39   20     0            0
5      1      56   n/a    1            0
6      1      20   13     0            1
7      0      38   11     0            0
total  4                  2            2
Stata- If qualifier vs If command


The if command (ifcmd) was designed to be used with a single expression
Example:
- Given variable x with 5 observations: 1, 1, 2, 1, 3
- Compare the following three pieces of Stata code:
if x==2 {
replace x=99
}
if x==1 {
replace x=99
}
replace x=99 if x==1
Stata- If qualifier vs If command
Starting data:
list x
     +---+
     | x |
     |---|
  1. | 1 |
  2. | 1 |
  3. | 2 |
  4. | 1 |
  5. | 3 |
     +---+

Code 1:
if x==2 {
replace x=99
}
list x
     +---+
     | x |
     |---|
  1. | 1 |
  2. | 1 |
  3. | 2 |
  4. | 1 |
  5. | 3 |
     +---+

Code 2:
if x==1 {
replace x=99
}
(5 real changes made)
list x
     +----+
     |  x |
     |----|
  1. | 99 |
  2. | 99 |
  3. | 99 |
  4. | 99 |
  5. | 99 |
     +----+

Code 3:
replace x=99 if x==1
(3 real changes made)
list x
     +----+
     |  x |
     |----|
  1. | 99 |
  2. | 99 |
  3. |  2 |
  4. | 99 |
  5. |  3 |
     +----+
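The results above follow from when each form is evaluated: the if command evaluates its expression once, and a variable name there refers to the first observation, while the if qualifier is evaluated for every observation. The sketch below is an illustration, not part of the original slides; it re-creates the five-observation variable x and shows both behaviors without changing the data.

```stata
* Sketch: if command (evaluated once) versus if qualifier (evaluated per observation).
clear
input x
1
1
2
1
3
end

* if command: x here means x[1], so the block runs once or not at all
if x == 1 {
    display "x[1] equals 1, so this block runs once for the whole dataset"
}

* if qualifier: the condition is checked observation by observation
count if x == 1
```

To act on a subset of observations, use the if qualifier; reserve the if command for conditions about the program's state, such as a macro or a single scalar value.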
Constructs – Continue
Continue is used to exit the current iteration of a loop and continue with the next iteration
mod(10,3)=1 (10 divided by 3 is 3 remainder 1)
Example from Stata online help
The following two loops produce the same result:
forvalues x = 1/10 {
if mod(`x',2)==1 {
display "`x' is odd"
}
else {
display "`x' is even"
}
}
forvalues x = 1/10 {
if mod(`x',2)==1 {
display "`x' is odd"
continue
}
display "`x' is even"
}
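continue also accepts a break option, which leaves the loop entirely instead of skipping to the next iteration. The sketch below is an illustration, not part of the original slides; the cutoff of 3 is arbitrary.

```stata
* Sketch: continue, break exits the loop instead of moving to the next iteration.
forvalues x = 1/10 {
    if `x' > 3 {
        continue, break
    }
    display "`x'"
}
display "Loop ended early"
```

Only the values 1 to 3 are displayed before execution resumes after the closing brace.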
Constructs – Exit


Stop execution of program (only “hello” displayed):
display "hello"
exit
display "goodbye"
Examples:




Do-file contains a number of data checks followed by analysis commands. If data checks reveal something unacceptable, you can exit out of do-file before running analysis.
Program requires user input. If user enters “bad” information, need to quit program.
Debugging. If particular error occurs then break.
Check denominator prior to dividing. If equals zero, exit.
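The first use case above, a data check that stops the do-file before the analysis, can be sketched as follows. This is an illustration, not part of the original slides: it assumes the union example dataset used earlier, and the choice of age as the variable to check and 459 as the return code (Stata's general code for data that violate an expectation) are assumptions made for the example.

```stata
* Sketch: stop the do-file if a data check fails.
* Assumes the union example dataset; the check on age is illustrative.
webuse union, clear

quietly count if missing(age)
if r(N) > 0 {
    display as error "age is missing in " r(N) " observations; stopping before analysis"
    exit 459
}

display "Data checks passed; continuing with analysis"
```

Exiting with a nonzero code marks the run as failed in the log, which is easier to spot than a plain exit.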
Raw data to finished product
Raw data
Analysis data
Runs/results
Finished product
Raw Data -> Analysis Data



Always have two distinct data files: the raw data and analysis data
A program should completely re-create analysis data from raw data
NO interactive changes!! Final changes must go in a program!!
Raw Data -> Analysis Data

Document all of the following:





Outliers?
Errors?
Missing data?
Changes to the data?
Remember to check


Consistency across variables
Duplicates
Individual records, not just summary stats
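The checks listed above can be sketched in Stata as follows. This is an illustration, not part of the original slides: it assumes the union example dataset and assumes that idcode and year together should identify one record; isid stops with an error if they do not.

```stata
* Sketch of the checks above; assumes the union example dataset.
webuse union, clear

* Duplicates: idcode and year together should identify a single record
duplicates report idcode year
isid idcode year

* Consistency: an indicator variable should only take the values 0 and 1
assert inlist(south, 0, 1) if !missing(south)

* Individual records, not just summary statistics
list idcode year age grade south in 1/10
summarize age grade
```

Keeping these commands in the cleaning program means the checks are rerun automatically whenever the raw data are updated.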
Analysis Data -> Results



All results should be produced by a program
Program should use analysis data (not raw)
Have a “translation” of raw variable names -> analysis variable names -> publication variable names
Analysis Data -> Results

Document



How were variances estimated? Why?
What algorithms were used and why? Were results robust?
What starting values were used? Was convergence sensitive?
Did you perform diagnostics? Include in programs/documentation.
Log files




Your log file should tell a story to the reader.
As you print results to the log file, include words explaining the results
Include not only what your code is doing, but your reasoning and thought process
Don’t output everything to the log file; use quietly and noisily in a meaningful way.
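One way to use quietly and noisily together is sketched below. This is an illustration, not part of the original slides: it assumes Stata's shipped auto dataset, and the regression is arbitrary. The full output is suppressed and a single explanatory line is written to the log instead.

```stata
* Sketch: suppress routine output, but print one explanatory line to the log.
* Assumes the auto dataset that ships with Stata.
sysuse auto, clear

quietly {
    regress price mpg weight
    noisily display "Price regressed on mpg and weight: N = " e(N) ", R-squared = " %5.3f e(r2)
}
```

The reader of the log sees what was estimated and the key result, without pages of intermediate output.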
Project Clean-up





Create a zip file that contains everything necessary for complete replication
Use a readme.txt file to describe zip contents
Delete/archive unused or old files
Include any referenced files in zip
When you have a final zip archive containing everything
- Open it in its own directory and run the script
- Check that all the results match
CCPR’s Cluster and helping your research

Software and Data
- STATA, SAS, R, compilers, text editors, etc.
- HRS, CPS (Unicon version), AddHealth, IFLS, etc.
Efficiency
- Your PC is available for other work when you submit a job to the cluster
- Faster processors
- More RAM
Easy to share data, programs, etc. with colleagues via the cluster
Obtain access by requesting an account

http://lexis.ccpr.ucla.edu/account/request/
Questions/Feedback

Please email me if you need help in the future

[email protected]