Notes for SAS programming - University of Maryland

Download Report

Transcript Notes for SAS programming - University of Maryland

Notes for SAS programming
Econ424
Fall 2007
Why SAS?
•
Able to process large data set(s)
•
Easy to cope with multiple variables
•
Able to track all the operations on the data set(s)
•
Generate systematic output
• Summary statistics
• Graphs
• Regression results
•
Most government agencies and private sectors use
SAS
Roadmap
•
•
•
•
•
•
•
Basic rules
Read in data
Data cleaning commands
Summary statistics
Combine two or more datasets
Hypothesis testing
Regression
Basic rules (1) – organize files
•
•
•
•
•
.sas – program file
.log – notes, errors, warnings
.lst – output
.sas7bdat – data file
library – a cabinet to put data in
– Default: Work library
• temporary, erased after you close the session
– Permanent library
• libname mylib “m:\”;
• mylib.mydata
= a sas data file named “mydata” in library “mylib”
• run and recall .sas
Basic rules (2) -- program
• every command ends with ;
• format does not matter
if x=1 then y=1; else y=2; is the same as
if x=1 then y=1;
else y=2;
• case insensitive
• comment
* this is comment;
/* this is comment */;
Basic rule (3) – variable
• Type
– numeric (default, 8 digit, . stands for missing value)
– character ($, default 8 digit, blank stands for missing)
• Variable names
– <=32 characters if SAS 9.0 or above
– <=8 characters if SAS 8 or below
– case insensitive
• Must start with letter or “_”
_name, my_name, zip5, u_and_me
-name, my-name, 5zip, per%, u&me, my@w, my$sign
Basic rules (4) – data step
• Data step
DATA distancenew;
set distance;
miles=26.22;
kilometers=1.61*miles;
input data
“distance”
obs1
obs2
define “miles”
output data
“distancenew”
define “kilometers”
obs1
…
…
obs n
obs n
Basic rules (5) – proc step
• PROC step
PROC PRINT data=distancenew;
var miles kilometers;
title “print out distancenew”;
run;
Read in data (1) – table editor
• Tools – table editor
• choose an existing library or define a new
library
• rename variable
• save as
Read in data (2) -- datalines
data testdata1;
infile datalines;
input id height weight gender $ age;
datalines;
1 68 144 M 23
2 78 . M 34
No ; until you finish all
3 62 99 F 37
the data lines
;
/* you only need one semicolon at the end of all data lines, but the
semicolon must stand alone in one line */
proc contents data=testdata1; run;
proc print data=testdata1; run;
Read in data (3) – more datalines
/* read in data in fixed columns */
data testdata1;
infile datalines;
input id 1 height 2-3 weight 4-6 gender $7 age 8-9;
datalines;
168144M23
278 M34
36299 F37
;
Read in data (4) – more datalines
data testdata1;
infile datalines;
input id : 1. height : 2. weight : 3. gender : $1. age : 2.;
datalines;
1 68 144 M 23
2 78 . M 34
3 62 99 F 37
;
Read in data (4) – more datalines
*alternatively;
data testdata1;
infile datalines;
informat id 1. height 2. weight 3. gender $1. age 2.;
input id height weight gender age;
datalines;
1 68 144 M 23
2 78 . M 34
3 62 99 F 37
;
Read in data (5) – .csv
data testdata1;
infile datalines dlm=‘,’ dsd missover;
/* what if you do not have dsd and/or missover */
input id height weight gender $ age ;
datalines;
1, 68, 144, M, 23
2, 78, , M, 34
3, 62, 99, F, 37
; /* what if you forget to type 23 */
run;
Read in data (6) – .csv file
/* save the midterm data in a comma delimited file (.csv)
before running the following codes */
libname mylib "M:\";
filename midterm "M:\midterm-fall2005-1997rank.csv";
data mylib.midterm;
infile midterm firstobs=3 dlm=',' dsd missover lrecl=900;
input year: 4. name: $50. score reputation accept_rate graduate_rate;
run;
proc contents data=mylib.midterm;
run;
proc print data=mylib.midterm;
run;
Note the options in the infile command!
Read in data (7) – from excel
filename myexcel “M:\midterm-fall2005-1997rank.xls”;
proc import datafile=myexcel out=readin1
DBMS=excel replace;
No ; until
run;
the end of
data readin1;
the whole
set readin1;
sentence
acceptper=accept_rate*100;
gradper=graduate_rate*100;
run;
proc contents data=readin1; run;
proc print data=readin1; run;
16
Read in data (7) – from excel
Be careful !
SAS will read the first line as variable names, our original
data has two non-data lines – the first as labels, the
second as variable names. You need to delete the first line
to make sure correct data input into SAS.
What happens if you don’t delete the first line?
What happens if you delete the second line?
SAS assigns numeric and character type automatically.
Sometime it does make mistake.
17
Data cleaning (1) – if then
Format:
IF condition THEN action;
ELSE IF condition THEN action;
ELSE action;
Note:
(1) the if-then-else can be nested as many as you
want
(2) if you need multiple actions instead of one
action, use “DO; action1; action2; END; ”
Data cleaning (1) – if then
•
•
•
•
•
•
•
=
~=
>
<
>=
<=
in
or
or
or
or
or
or
EQ
NE
GT
LT
GE
LE
means equals
means not equal
means greater than
means less than
means greater than or equal
means less than or equal
means subset
– if gender in (‘M’, ‘F’) then ..;
• Multiple conditions: AND (&), OR(|)
Data cleaning (1) – if then
*reading in program of testdata1 is on page 14;
data testdata1;
set mylib.testdata1;
IF gender='M' AND age<=25 THEN testgroup=0;
ELSE testgroup=1;
run;
proc contents data=testdata1; run;
proc print data=testdata1; run;
Note:
(1) the code is less efficient if you replace ELSE ..; with
IF gender~= 'M' OR age>25 THEN ..;
(2) missing value is always counted as the smallest negative, so
age=. will satisfy the condition age<=25. If you want to ignore
the obs with age=. set the condition as 0<age<=25.
Data cleaning (1) – if then
* Multiple actions in each branch;
data testdata1;
set testdata1;
IF gender='M' AND age<=25 THEN DO;
group=0; activity='ballgame';
END;
ELSE DO;
group=1; activity='movie';
END;
run;
proc print data=testdata1; run;
Data cleaning (1) – if then
*Use if commands to choose a subsample;
data testdata2; /* note here we generate a new data set */
set testdata1;
IF age=. Then delete;
If age>18 & weight~=.;
run;
proc print data=testdata2; run;
Data cleaning (1) – exercise
still use testdata1.
define
weight in pounds
bmi=--------------------------------- x 703
(height in inches)^2
define group = 1 if bmi<18.5 (underweight)
2 if 18.5 <= bmi < 25 (normal)
3 if 25<= bmi <30 (overweight)
4 if bmi>=30 (obese).
Data cleaning (1) – exercise answer
data testdata1;
set testdata1;
bmi=(weight/(height*height))*703;
if bmi<18.5 then group=1;
else if bmi<25 then group=2;
else if bmi<30 then group=3;
else group=4;
run;
proc contents data=testdata1; run;
proc print data=testdata1; run;
Question: What if one observation has weight=.?
Pages 24-27 are optional material for
data cleaning.
They are not required, but you may
find them useful in the future.
We skip them in the regular class.
Data cleaning (2)
– convert variable type
Numeric to character:
age1=put(age, $2.);
age and age1 have the same contents but different formats
Character to numeric:
age2=input(age1, 1.);
now age and age2 are both numeric, but age2 is chopped at the first
digit
Take a sub string of a character
age3=substr(age1,2,1);
now age3 is a sub string of age1, starting from the second digit of
age1 (the meaning of “2”) and having one digit in total (the meaning
of “1”).
Data cleaning (2) - example
* we want to convert studid 012345678 to 012-345-678;
data testdata2;
infile datalines;
input studid : 9. studname : $1.;
datalines;
012345678 A
135792468 B
009876543 C
;
proc print; run;
data testdata2;
set testdata2;
if studid LT 1E+7 then studid1= '00’||compress(put(studid, $9.));
else if 1E+7 LE studid LT 1E+8 then studid1='0'||compress(put(studid,
$9.));
else studid1= put(studid, $9.);
studid2=substr(studid1,1,3)||'-'||substr(studid1,4,3)||''||substr(studid1,7,3);
proc print; run;
Data cleaning (2) - exercise
You have the following data, variables in sequence are
SSN, score1, score2, score3, score4, score5:
123-45-6789 100 98 96 95 92
344-56-7234 69 79 82 65 88
898-23-1234 80 80 82 86 92
Calculate the average and standard deviation of the five
scores for each individual. Use if-then command to find
out who has the highest average score, and report his SSN
without dashes.
Data summary roadmap
•
•
•
•
•
•
•
•
•
Proc contents – variable definitions
Proc print – raw data
Proc format – make your print look nicer
Proc sort – sort the data
Proc means – basic summary statistics
Proc univariate – detailed summary stat
Proc freq – frequency count
Proc chart – histogram
proc plot – scatter plot
proc format (1)
*Continue the data cleaning exercise on page 23;
data testdata1;
set testdata1;
bmi=(weight/(height*height))*703;
if bmi<18.5 then group=1;
else if bmi<25 then group=2;
Defining “group” as a
else if bmi<30 then group=3;
numeric variable will
save space
else group=4;
run;
proc format (2)
proc format;
value bmigroup 1=‘under weight’
2=‘normal weight’
3=‘overweight’
4=‘obese’;
run;
proc print data=testdata1;
format group bmigroup. bmi 2.1;
title ‘report bmi group in words’;
label bmi=‘body mass index’
group=‘bmi group 1-4’;
var bmi group;
run;
no ; here until
the end of the
value command
no ; here until
the end of the
label command
proc sort
proc sort data=testdata1;
by gender age;
run;
proc sort data=testdata1 out=testdata2;
by gender descending age;
run;
* note that missing value is always counted as the
smallest;
proc means and proc univariate
proc means data=testdata1;
class gender;
var height weight bmi;
run;
By default, proc means report
mean, stdev, min, max
Could choose what to report:
proc means data=testdata1
n mean median;
proc sort data=testdata1; by gender; run;
proc univariate data=testdata1;
by gender;
var bmi;
run;
By default, proc univariate
report median, and many
other statistics
Notes on proc means and proc univariate
*if you do not use class or by command, the
statistics are based on the full sample. If you use
class or by var x, the statistics are based on the
subsample defined by each value of var x.
*You can use class or by in proc means, but only by
in proc univariate;
*whenever you use “by var x”, the data set should
be sorted by var x beforehand;
More example on
proc means and proc univariate
data readin1;
set readin1;
if accept_rate<0.20 then acceptgrp=“low”;
By default, proc means report
else acceptgrp=“high”;
mean, stdev, min, max
run;
Could choose what to report:
proc means data=readin1;
proc means data=readin1
n mean median;
class acceptgrp;
var reputation accept_rate graduate_rate;
run;
proc sort data=readin1; by acceptgrp; run;
By default, proc univariate
proc univariate data=readin1;
report median, and many
by acceptgrp;
other statistics
var reputation;
run;
proc freq
data readin1;
set readin1;
if graduate_rate<0.5 then gradgrp=“low”;
else if 0.5<=graduate_rate<0.8 then gradgrp=‘middle’;
else gradgrp=‘high’;
run;
proc freq data=readin1;
tables acceptgrp
gradgrp
acceptgrp*gradgrp;
run;
proc chart -- histogram
proc chart data=readin1;
title ‘histogram for acceptgrp’;
vbar acceptgrp;
run;
proc chart data=readin1;
title ‘frequency by two variables’;
vbar acceptgrp / group=gradgrp;
run;
proc chart – continuous variable
proc chart data=readin1;
title “histogram for continuous variable’;
vbar accept_rate;
run;
proc chart data=readin1;
title ‘histogram with specific midpoints’;
vbar accept_rate / midpoints=0 to 1 by 0.1;
run;
proc plot – scatter plot
proc plot data=readin1;
title ‘scatter plot of accept rate and grad
rate’;
plot accept_rate*graduate_rate;
run;
fancy proc means
proc means data=readin1;
class acceptgrp gradgrp;
var score reputation;
output
out = summary1
mean = avgscore avgreputn;
run;
proc print data=summary1;
run;
The following page may be
useful in practice, but I am not
going to cover it in class.
some summary stat. in proc print
filename myexcel “M:\midterm-fall2005-1997rank.xls”;;
proc import datafile=myexcel out=readin1 DBMS=excel replace; run;
data readin1;
set readin1;
if accept_rate<0.2 then acceptgrp=‘low’;
else if 0.2<=accept_rate<0.5 then acceptgrp=‘middle’;
else acceptgrp=‘high’;
run;
proc sort data=readin1; by acceptgrp; run;
proc print data=readin1 n;
where graduate_rate>=0.8;
by acceptgrp;
sum score;
var name score accept_rate graduate_rate;
run;
merge and append
readin1: name
Harvard
score
reputation acceptgrp gradgrp
100
3.9
summary1:
low
high
acceptgrp gradgrp avgscore avgreputn
low
merged: name
score
Harvard 100
high
80
3.3
reputation acceptgrp gradgrp avgscore avgreputn
3.9
low
high
80
3.3
appended:
name
score
Harvard 100
.
reputation acceptgrp gradgrp avgscore avgreputn
3.9
low
.
low
high
high
.
80
.
3.3
merge two datasets
proc sort data=readin1;
by acceptgrp gradgrp;
run;
proc sort data=summary1;
by acceptgrp gradgrp;
run;
data merged;
merge readin1 (in=one) summary1 (in=two);
by acceptgrp gradgrp;
if one=1 & two=1;
What if this line is
run;
“if one=1 OR two=1;”?
Keep track of matched and
unmatched records
data allrecords;
merge readin1 (in=one) summary1 (in=two);
by acceptgrp gradgrp;
myone=one;
SAS will drop variables
mytwo=two;
“one” and “two”
if one=1 or two=1;
automatically at the end of
run;
the DATA step. If you want
proc freq data=allrecords;
to keep them, you can copy
tables myone*mytwo;
them into new variables
“myone” and “mytwo”
run;
be careful about merge!
• always put the merged data into a new data set
• must sort by the key variables before merge
• ok for one-to-one, multi-to-one, one-to-multi, but
no good for multi-to-multi
• be careful of what records you want to keep, and
what records you want to delete
• what if variable x appears in both datasets, but x
is not in the “by” statement?
– after the merge x takes the value defined in the last
dataset of the “merge” statement
append
data appended;
set readin1 summary1;
run;
proc print data=appended;
run;
proc print data=merged;
run;
Application : Project 5
rest_rating:
Restname Ratings Price
Loc1
Phone1
Addie’s
1
30
Rockville
301-881-0081
Andalucia
2
30
Rockville
301-770-1880
Haandi
2
10
Bethesda
301-718-0121 Falls Church
Loc
State
Pop
Income
Bethesda
MD
62936
38976
Falls Church
VA
9578
26709
Rockville
MD
44835
21484
city_demo:
How to merge the two data sets?
Loc2
Phone2
703-533-3501
Steps to merge the datasets
1.
reshape data set 1 – focus on loc1
rest_rating:
Restname Ratings Price
Loc1
Addie’s
1
30
Rockville
301-881-0081
Andalucia
2
30
Rockville
301-770-1880
Haandi
2
10
Bethesda
301-718-0121 Falls Church
restloc1
restname ..
loc
phone
Addie’s
Rockville
301-881-0081
Andalucia Rockville
301-770-1880
Haandi
301-718-0121
Bethesda
Phone1
Loc2
Phone2
703-533-3501
data restloc1;
set rest_rating;
loc=loc1;
phone=phone1;
drop loc1 loc2 loc3 loc4 phone1
phone2 phone3 phone4;
run;
Steps to merge the datasets
1.
reshape data set 1 – focus on loc2
rest_rating:
Restname Ratings Price
Loc1
Addie’s
1
30
Rockville
301-881-0081
Andalucia
2
30
Rockville
301-770-1880
Haandi
2
10
Bethesda
301-718-0121 Falls Church
restloc2
restname ..
loc
phone
Haandi
Falls Church 703-533-3501
Phone1
Loc2
Phone2
703-533-3501
data restloc2;
set rest_rating;
loc=loc2;
phone=phone2;
drop loc1 loc2 loc3 loc4 phone1
phone2 phone3 phone4;
If loc=‘’ then delete;
run;
Steps to merge the datasets
Similarly create restloc3 and restloc4
Steps to merge the datasets
2. append restloc1, restloc2, restloc3, restloc4 into restloc
restloc1
restloc
restname ..
loc
phone
Addie’s
Rockville
301-881-0081
Andalucia
Rockville
301-770-1880
Haandi
Bethesda
301-718-0121
loc
phone
Addie’s
Rockville
301-881-0081
Andalucia
Rockville
301-770-1880
Haandi
Bethesda
301-718-0121
Haandi
restloc2
restname .. loc
Haandi
restname ..
phone
Falls Church 703-533-3501
Falls Church 703-533-3501
data restloc;
set restloc1 restloc2 restloc3 restloc4;
run;
Steps to merge the datasets
3.
merge restloc with citydemo by loc
restloc
proc sort data=restloc;
restname ..
loc
phone
by loc; run;
Addie’s
Rockville
301-881-0081
proc sort data=city_demo;
Andalucia
Rockville
301-770-1880
by loc; run;
Haandi
Bethesda
301-718-0121
data merged;
Haandi
Falls Church 703-533-3501
merge restloc (in=one)
city_demo (in=two);
city_demo:
Loc
State Pop
Income
by loc;
Bethesda
MD
62936
38976
if one=1 | two=1;
Falls Church VA
9578
26709
inone=one; intwo=two;
Rockville
44835
21484
run;
MD
Questions
• How many restaurants have two locations? how
many have three?
• How many restaurants locate in Rockville?
• How many restaurants do not have any match
with city demographics?
• How many restaurants in cities where per capita
income is at least 30,000?
• How many are asian and locate in areas with per
capita income at least 30,000?
mean comparison: two groups
* compare the mean of two groups;
* specific question: do Asian and non-Asian restaurants charge systematically different prices?;
* define a dummy variable asianstyle=1 if the food style is Asian, Chinese, Japanese, Thai, Vietnamese,
Lebanese, or Korean, 0 otherwise;
proc glm data=merged;
class asianstyle;
model price=asianstyle; /* alternatively, model price=asianstyle/solution; */
means asianstyle;
run;
The F-value tests
H0: regression has no power explaining variation in income
H1: regression has some power.
In our context, this is equivalent to
H0: income is no different across asianstyle or non-asianstyle
H1: income is different.
This amounts to a two-tail test of equal mean.
If you use “/solution” at the end of the model command, the output would report regression results. The
t-stat of the coefficient for asianstyle or the p-value next to the t-stat could apply to a two-tail test
or a one-tail test of equal mean.
mean comparison: two groups
Alternatively, you can use
proc reg data=merged;
model price=asianstyle;
Run;
The t-stat of the coefficient for asianstyle or the p-value next to the t-stat could apply to a two-tail test
or a one-tail test of equal mean.
mean comparison: three or more groups
* compare the mean of four groups – ratings =1,2,3,4;
* specific question: do restaurants of different ratings charge
systematically different prices?;
proc glm data=merged;
class ratings;
model price=ratings;
means ratings/waller;
means ratings/lsd cldiff;
run;
notes on mean comparison
1.
The logic of mean comparison is the same as in Excel
2.
Be careful about one-tail and two-tail tests.
The standard SAS output of coefficient t-stat and p-value are based on a
two-tail test of H0: coeff=0, but could be used for a one-tail test with
appropriate adjustment on p-value. The F-value reported as the overall
regression statistics only applies to the right tail of F distribution, because
F is bounded to be positive. The F-value can test H0: regression explains
no variation in your dependent variable, against H1: regression explains
some variation. This is equivalent to a two-tail test of equal mean.
3.
Comparison across more than two groups
H0: all groups have the same mean  F-test of whole regression
OR
H0: group x and group y has the same mean  waller or lsd statistics
regression in SAS
• specific question: how do restaurant price relate
to ratings, city population and per capita income?
libname n “N:\share\notes”;
proc reg data=n.merged;
model price=ratings population
per_capita_income;
run;
Equivalent to:
Price=alpha+beta1*ratings+beta2*pop+beta3*per_cap_inc
regression in SAS
* specific question: how do restaurant prices relate to
ratings, city population, per capita income, and state
specifics?
* note that state itself is a character variable, and we would
like to separate MD, VA and DC. This amounts to define
three dummy variables denoting each state. This is easy
to achieve in SAS;
proc glm data=merged;
class state;
model price=ratings population per_capita_income
state / solution;
run;
Regression: exercise 1
• The coefficient of ratings in the first regression tells us
how much price will change by a unit change in ratings.
• However, one may think the change from 1 star to 2 star
is different from the change from 2 star to 3 star, or from
3 star to 4 star.
• To capture these differences, we need more than one
coefficents for ratings.
proc glm data=merged;
class ratings;
model price=ratings population per_capita_income
/ solution;
run;
Regression: exercise 2
•
•
•
What if we also want to control for states?
SAS does not allow more than one variables in the “class ” command, so we need to define one set
of dummy variables by ourselves;
Below we define 4 dummy variables for each value of ratings. Since the sum of the four variables
is always 1, we need to drop one of them in the regression to avoid their colinearity with the
intercept.
data merged;
set merged;
if ratings=1 then rating_1star=1; else rating_1star=0;
if ratings=2 then rating_2star=1; else rating_2star=0;
if ratings=3 then rating_3star=1; else rating_3star=0;
if ratings=4 then rating_4star=1; else rating_4star=0;
proc glm data=merged;
class state;
model price=rating_2star rating_3star rating_4star population per_capita_income state / solution;
run;
Regression: exercise 3
•
•
What if we focus on the average price change by one unit increase of ratings (like in regression 1
on page 59), but we would like to see whether this price change vary by states?
We need to define ratings*(dummy for MD), ratings*(dummy for VA),ratings*(dummy for DC)
data merged;
set merged;
if state=“MD”then state_MD=1; else state_MD=0;
if state=“VA” then state_VA=1; else state_VA=0;
if state=“DC” then state_DC=1; else state_DC=0;
ratings_MD=ratings*state_MD;
ratings_VA=ratings*state_VA;
ratings_DC=ratings*state_DC;
run;
proc reg data=merged;
model price=ratings_MD ratings_VA ratings_DC population per_capita_income;
run;
* Control for states;
proc glm data=merged;
class state;
model price=ratings_MD ratings_VA ratings_DC population per_capita_income state/solution;
run;
A comprehensive example
A review of
1. readin data
2. summary statistics
3. mean comparison
4. regression
reg-cityreg.sas in N:\share\notes\
A Case Study of Los Angeles Restaurants

Nov. 16-18, 1997 CBS 2 News “Behind the Kitchen Door”

January 16, 1998, LA county inspectors start issuing hygiene
grade cards


A grade if score of 90 to 100

B grade if score of 80 to 89

C grade if score of 70 to 79

score below 70 actual score shown
Grade cards are prominently displayed
in restaurant windows

Score not shown on grade cards
Take the ideas to data
better
Information
better
quality
better
revenue
regulation
hygiene
scores
restaurant
revenue
calculated by
sales tax
by county
by city
Data complications
Unit of analysis:
individual restaurant? city? zipcode? census tract?
Unit of time:
each inspection? per month? per quarter? per year?
Define information:
county regulation? city regulation? the date of passing the regulation?
days since passing the regulation? % of days under regulation?
Define quality:
average hygiene score? the number of A restaurants? % of A
restaurants?
Define revenue: (only collected by quarter!)
total revenue? average revenue per restaurant?
How to test the idea?
• Regression 1:
quality = α+β*information+error
+something else?
• Regression 2:
revenue=α+β*quality+γ*information+error
+something else?
What is the something else?
year trend, seasonality, quality*information, ....
real test
reg-cityreg.sas in N:\share\notes\
Questions
• How many observations in the sample?
– log of the first data step, or output from proc contents
• How many variables in the sample? How many are numerical, how
many are characters?
– Output from proc contents
• On average, how much tax revenue can we collect from one city in
one quarter?
– Output from proc means, on sumtax
• How many percentage of restaurants have A quality in a typical citymonth?
– Output from proc means, on per_A
• What is the maximum tax revenue in a city-quarter? What is the
minimum?
– Output from proc means, on sumtax
Questions
• What is the difference between cityreg and ctyreg? We know county
regulation came earlier than city regulation, is that reflected in our
data?
– Yes, cityreg<=ctyreg in every observation
– We can check this in proc means for cityreg and ctyreg, or add a proc
print to eyeball each obs
• What is the difference between cityreg and citymper? What is the
mean of cityreg? What is the mean of citymper? Are they consistent
with their definitions?
– The unit of cityreg is # of days, so it should be a non-negative integer
– The unit of citymper is % of days, so it should be a real number between
0 and 1
– To check this, we can add a proc means for cityreg and citymper
Questions
• Economic theories suggest quality be higher after the
regulation if regulation gives consumers better
information. Is that true in summary statistics?
– by summary statistics, I mean a simple mean comparison of
quality by different regulation environments. Which procedure
tells us that?
– Yes, the two proc means, class citym_g or ctym_g
• How about revenue? Do summary statistics suggest that
better information better revenue?
– Yes, the two proc means, class citym_g or ctym_g
– The two proc means only show us group-average. For formal
tests, we can use proc glm or proc reg as in our lecture notes on
mean comparison (do you know how to do it?)
Questions
Facts reported in summary statistics often reflect many
economic factors, not only the one in our mind. Summary
statistics are also crude for easy presentation. That is why
we explore regressions.
In the quality regression, does more city regulation associate
with higher quality?
– is the coefficient of city regulation positive and
significantly different from zero? (proc reg)
– is the coefficient of county regulation positive and
significantly different from zero? (proc reg)
– Do we omit other sensible explanations for quality
changes? What are they? (yes, they are ….)
Questions
• What does “proc means” do on page 3? Why does it
allow us to collapse the data of city-months into cityquarters?
• Why do we use lnsumtax as the dependent variable
instead of sumtax?
– Reg of lnsumtax tells us the effect of regulation on the percentage
change in sumtax
• In the first regression, what does the “cityqper”
coefficient mean? what does the “per_A” coefficient
mean? What does the “cityq_A” (which equals to
cityqper*per_A) coefficient mean?
Questions
• We report four sets of regression results for
the revenue regression. How do they differ
from each other?
• How does the coefficient of “cityqper”
change across these regressions? What
does the change suggest?
• Which set of results makes more sense?
Count duplicates (not required)
http://support.sas.com/ctx/samples/index.jsp?sid=67&tab=co
de
data dups nodups ;
set clasdata ;
by name class ;
/* If the combination of NAME and CLASS is in the data
set once, output NODUPS, else output DUPS. */
if first.class and last.class then output nodups ;
else output dups ;
run;
Some project 6 ideas
from previous students
• Does super-bowl advertising boost stock price?
• Does financial aid promote GDP growth?
• What are the most important determinants of
house price?
• Advice on smart investment: which stock(s)
should John Doe invest in given his preference
for return and risk?
• Does a chain store charge different prices in
different locations?
Idea I: Does super-bowl advertising
boost stock price?
• Theory
– advertising in super-bowl is extremely expensive and therefore
increase the cost of the advertiser
 negative effect on stock price
– advertising at the superbowl signals good finanical status of the
advertiser
 positive effect on stock price
• Data:
– list of superbowl advertisers from major newspapers
– list of non-advertisers who are major competitors with the
advertisers in the same industry
– New York Stock Exchange
• Timing: right before and after super-bowl (event study)
Idea II: Does financial aid promote
GDP growth?
• Theory:
– positive effect because aid means help
– negative effect because aids tend to go to the most
under-developed countries
• Data
– focus on Sub-Sahara Africa
– World Development Indicators (WDI 2002) database
from the World Bank
– 1990-2000: GDP per capita, aid, FDI, adjusted net
savings, health expenditures, adjult illiteracy rate
Idea III: determinants of house
price?
• Theory:
– house attributes
– location
– time of sale
• Data
– zillows.org
Idea IV: Smart Investment
• Theory: CAPM
• John Doe’s preference: risk and return
• Sample:
–
–
–
–
treasury bills
S&P 500 index
six individual stocks
monthly data in the past 60 months
Idea V: Does a chain store charge different
prices in different locations?
• Theory:
– Yes: A chain store will charge different prices because
different locations have different rent costs and
different consumer willingness to pay
– No: same pricing facilitates advertising from the chain
headquarter
• Data:
– Hand collected price for 10 CVS stores
– Choose a basket of goods that represent CVS sales
– Land price and demographic data from Census