Transcript Slide 1

Key Data Management Tasks in
Stata
FHSS Research Support Center
fhssrsc.byu.edu
115 and 116 SWKT
Investigate Duplicates in the Data
If you suspect that duplicates exist
in your data, as in this example…
. use http://fhssweb4:1019/duplicates.dta, clear
You can use duplicates report to
investigate…
Most
observations
are unique
.
. list in 1/15, noobs compress separator(15)
duplicates
Duplicates
id
female
ses
read
write
math
1
1
2
2
2
3
3
4
4
5
5
6
7
8
9
female
female
female
female
female
male
male
female
female
male
male
female
male
female
male
low
low
middle
middle
middle
low
low
low
low
low
low
low
middle
low
middle
34
34
39
39
39
63
63
44
44
47
47
47
57
39
48
44
44
41
41
41
65
65
50
50
40
40
41
54
44
49
84
40
33
33
33
48
48
41
41
43
43
46
59
52
52
. //Note different math score for 1st & 2nd obs
(1a.)
Observations
with 1, 2, or 3
copies
report
in
terms
of
all
variables
copies
observations
surplus
1
197
0
2
6
3
3
3
2
3 observations have 2 copies
1 observation has 3 copies
.
duplicates
Duplicates
When the report
is given in terms
of only some of
the variables,
there are more
duplicated obs.
in
report
id
female
terms
of
id
ses
female
ses
copies
observations
surplus
1
195
0
2
8
4
3
3
2
View the Duplicates in the Data
.
duplicates
Duplicates
in
list,
terms
sepby(id)
of
all
//new
line
when
id
changes
(1b.)
variables
group:
obs:
id
female
ses
read
write
math
1
3
2
female
middle
39
41
33
1
4
2
female
middle
39
41
33
1
5
2
female
middle
39
41
33
2
6
3
male
low
63
65
48
2
7
3
male
low
63
65
48
3
8
4
female
low
44
50
41
3
9
4
female
low
44
50
41
4
10
5
male
low
47
40
43
4
11
5
male
low
47
40
43
4 observations are completely duplicated in all
variables: the first one 3 times and the others
twice; Stata creates a different “group:” for
each observation that appears duplicated
.
duplicates
Duplicates
in
list
id
terms
female
of
id
ses,
female
sepby(id)
ses
group:
obs:
id
female
ses
1
1
1
female
low
1
2
1
female
low
2
3
2
female
middle
2
4
2
female
middle
2
5
2
female
middle
3
6
3
male
low
3
7
3
male
low
4
8
4
female
low
4
9
4
female
low
5
10
5
male
low
5
11
5
male
low
5 observations are duplicated in id,
female, and ses, because observations
1 and 2 only differ in math
Create a Variable to Tag Duplicates (1c.)
. duplicates tag id female ses, gen(dup_id)
Duplicates in terms of id female ses
. list if dup_id >=1, sepby(id)
id
female
ses
read
write
math
dup_id
1.
2.
1
1
female
female
low
low
34
34
44
44
84
40
1
1
3.
4.
5.
2
2
2
female
female
female
middle
middle
middle
39
39
39
41
41
41
33
33
33
2
2
2
6.
7.
3
3
male
male
low
low
63
63
65
65
48
48
1
1
8.
9.
4
4
female
female
low
low
44
44
50
50
41
41
1
1
10.
11.
5
5
male
male
low
low
47
47
40
40
43
43
1
1
New variable is 0 if the observation is
unique, 1 if there is one duplicate of
it, 2 if there are two duplicates of it,
etc.
We can see the difference in math
scores for observation 1 and 2, which
is why duplicates report and
duplicates report id female ses gave us
different outputs. Let’s set them both
equal to 84.
. replace math = 84 if id ==1
(1 real change made)
Drop the Duplicate Observations
(1d.)
The command duplicates drop drops
all observations that are duplicated,
leaving just the first observation in
each group.
. duplicates drop
Duplicates in terms of all variables
(6 observations deleted)
Now we run duplicates report to
check that all of the duplicate
observations have been deleted.
. duplicates report
Duplicates in terms of all variables
copies
observations
surplus
1
200
0
Label the Values of a Numeric Variable
. use val_labels.dta, clear
. use http://fhssweb4:1019/valuelabels.dta, clear
. list in 1/10, noobs
make
price
mpg
foreign
Merc. Zephyr
Chev. Chevette
Chev. Monza
Toyota Corolla
Subaru
3,291
3,299
3,667
3,748
3,798
20
29
24
31
35
0
0
0
1
1
AMC Spirit
Merc. Bobcat
Renault Le Car
Chev. Nova
Dodge Colt
3,799
3,829
3,895
3,955
3,984
22
22
26
19
30
0
0
1
0
0
Creates labeling scheme called
“foreign_lbl”, but nothing happens to
data yet
Applies labeling scheme
“foreign_lbl” to the variable foreign
(2a.)
Variable foreign currently displayed as
binary numeric variable.
. label define foreign_lbl 0 "domestic car" 1 "foreign car"
. label values foreign foreign_lbl
. list in 1/10, noobs
make
price
mpg
foreign
Merc. Zephyr
Chev. Chevette
Chev. Monza
Toyota Corolla
Subaru
3,291
3,299
3,667
3,748
3,798
20
29
24
31
35
domestic
domestic
domestic
foreign
foreign
car
car
car
car
car
AMC Spirit
Merc. Bobcat
Renault Le Car
Chev. Nova
Dodge Colt
3,799
3,829
3,895
3,955
3,984
22
22
26
19
30
domestic
domestic
foreign
domestic
domestic
car
car
car
car
car
The labels are now displayed for the Variable
foreign, which is more helpful, but the actual
values in the data are still 0 and 1.
Now Let’s Look at the Code In-Depth
Says we want to define
a labeling scheme that
will be stored in Stata’s
memory, and later
applied to variables
Name of the labeling
scheme that we want to
create
(2a.)
The actual labeling
scheme: which labels
go with which numbers
. label define foreign_lbl 0 "domestic car" 1 "foreign car"
. label values foreign foreign_lbl
. list in 1/10, noobs
Says we want to apply
make scheme to a
a labeling
specific variable
Merc. Zephyr
Chev. Chevette
Chev. Monza
Name of the
Name of the
price
foreign
variablempg
to which
labeling scheme we
we want to apply
want to apply
the labeling
3,291
20
domestic car
scheme 29
3,299
domestic car
3,667
24
domestic car
Create Variable Labels
(2b.)
Variable we want to label
. webuse hbp4
. webuse hbp4, clear
Label we want to give it
. label variable hbp "high blood pressure"
. describe
Contains data from http://www.stata-press.com/data/r12/hbp4.dta
obs:
1,130
vars:
7
22 Jan 2011 11:12
size:
19,210
storage
variable name
type
display
format
id
city
year
age_grp
race
hbp
female
%10s
%8.0g
%8.0g
%8.0g
%8.0g
%8.0g
%8.0g
str10
byte
int
byte
byte
byte
byte
value
label
variable label
Record identification number
high blood pressure
sexlbl
Note the difference
between variable label
and value label
Create a Labeled Categorical Variable from a Continuous
Numeric Variable (3.)
. use http://fhssweb4:1019/recode.dta, clear
. list in 1/8, noobs
make
mpg
We have a continuous
numeric variable (mpg)…
foreign
Cad. Seville
Cad. Eldorado
Linc. Mark V
Linc. Versailles
Peugeot 604
21
14
12
14
14
domestic
domestic
domestic
domestic
foreign
car
car
car
car
car
Volvo 260
Linc. Continental
Cad. Deville
17
12
14
foreign car
domestic car
domestic car
…but instead we want a
variable which groups
observations into 3
categories, based on mpg …
. recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient")
> (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)
(74 differences between mpg and efficiency)
. label variable efficiency "1-14 mpg=inefficient; 15-30 mpg=efficient;
> greater than 30 mpg=efficient"
…note that the
actual values of
the new variable
are numbers, but
it will display
value labels. This
is what we need
for analysis.
. list in 1/8, noobs
make
mpg
foreign
efficiency
. list in 1/8, noobs nolabel ab(10)
Cad. Seville
Cad. Eldorado
Linc. Mark V
Linc. Versailles
Peugeot 604
21
14
12
14
14
domestic
domestic
domestic
domestic
foreign
car
car
car
car
car
moderately efficient
inefficient
inefficient
inefficient
inefficient
make
mpg
foreign
efficiency
Cad. Seville
Cad. Eldorado
Linc. Mark V
Linc. Versailles
Peugeot 604
21
14
12
14
14
0
0
0
0
1
2
1
1
1
1
Volvo 260
Linc. Continental
Cad. Deville
17
12
14
foreign car
domestic car
domestic car
moderately efficient
inefficient
inefficient
Volvo 260
Linc. Continental
Cad. Deville
17
12
14
1
0
0
2
1
1
Now Let’s Look at the Code In-Depth
Change the
values of a
variable
based on
some
coding rules
1
Variable
who’s
values I
want to
change
4
2
First rule: If the
value is
between the
lowest number
and 14, make it
to a 1…
(3.)
…and give it a
value label of
“inefficient”
5
. recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient")
> (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)
3
6
This just means that
the command took
up more than one
line
Says that rather than alter
the values of mpg, we
want to just create a new
variable called efficiency
8
7
The set of value labels that
we are defining will be
saved as effcny_lbl in
Stata’s memory
. label variable efficiency "1-14 mpg=inefficient; 15-30 mpg=efficient;
> greater than 30 mpg=efficient"
Create a variable label (not to be confused with a value label) describing how the
coding rules work
Covert a String Variable Containing
Digits into a Numeric Variable
(4a.)
. use http://www.stata-press.com/data/r12/hbp2, clear
Use fixed format to display
Create numeric variable
. destring id, generate(numid)
id has all characters numeric; numid generated as double
. list id numid in 1/10
. format numid %10.0f
id
numid
1.
2.
3.
4.
5.
8008238923
8007143470
8000468015
8006167153
8006142590
8.008e+09
8.007e+09
8.000e+09
8.006e+09
8.006e+09
6.
7.
8.
9.
10.
8007340259
8004411604
8006962950
8005012348
8003187296
8.007e+09
8.004e+09
8.007e+09
8.005e+09
8.003e+09
. list id numid in 1/10
Notice the default
exponential format
. describe id numid
variable name
id
numid
storage
type
display
format
str10 %10s
double %10.0g
value
label
variable label
Record identification number
Record identification number
id
numid
1.
2.
3.
4.
5.
8008238923
8007143470
8000468015
8006167153
8006142590
8008238923
8007143470
8000468015
8006167153
8006142590
6.
7.
8.
9.
10.
8007340259
8004411604
8006962950
8005012348
8003187296
8007340259
8004411604
8006962950
8005012348
8003187296
Automatically Create a Labeled Numeric
Variable from a String Variable
(4b.)
. use http://www.stata-press.com/data/r12/hbp2, clear
. encode sex, generate(gender)
. describe sex gender
variable name
storage
type
sex
gender
str6
long
display
format
value
label
%9s
%8.0g
gender
Makes a new numeric variable,
with value labels containing the
text from the original variable
variable label
Original string variable
. tab sex gender, nolabel
Data
values
sex
gender
1
2
Total
female
male
433
0
0
695
433
695
Total
433
695
1,128
male
Total
. tab sex gender
Value
labels
sex
gender
female
female
male
433
0
0
695
433
695
Total
433
695
1,128
New labeled numeric variable
Note: The numeric values assigned
as integers beginning with 1 are
ordered by the alphabetized values
of the original string variable
Reshape Wide to Long
When you have a wide dataset …
but need a long one
wide
1.
(5a.1)
long
. webuse reshape1, clear
. list
1.
2.
3.
id
sex
inc80
inc81
inc82
ue80
ue81
ue82
1
2
3
0
1
0
5000
2000
3000
5500
2200
2000
6000
3300
1000
0
1
0
1
0
0
0
0
1
You can reshape the data from wide to long
. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)
Data
Number of obs.
Number of variables
j variable (3 values)
xij variables:
wide
->
3
8
->
->
->
year
->
->
inc
ue
inc80 inc81 inc82
ue80 ue81 ue82
. list
long
9
5
id
year
sex
inc
ue
2.
3.
4.
5.
1
1
1
2
2
80
81
82
80
81
0
0
0
1
1
5000
5500
6000
2000
2200
0
1
0
1
0
6.
7.
8.
9.
2
3
3
3
82
80
81
82
1
0
0
0
3300
3000
2000
1000
0
0
0
1
Why would you do this?
Some Stata statistical
procedures (e.g. xtreg for
panel data) require the
data to be in long form
Let’s Look at the Code In-Depth
(5a.1)
. webuse reshape1, clear
id
year
sex
inc
ue
1.
2.
3.
4.
5.
1
1
1
2
2
80
81
82
80
81
0
0
0
1
1
5000
5500
6000
2000
2200
0
1
0
1
0
6.
7.
8.
9.
2
3
3
3
82
80
81
82
1
0
0
0
3300
3000
2000
1000
0
0
0
1
. list
1.
2.
3.
id
sex
inc80
inc81
inc82
ue80
ue81
ue82
1
2
3
0
1
0
5000
2000
3000
5500
2200
2000
6000
3300
1000
0
1
0
1
0
0
0
0
1
We want our
data to end
up in long
form
The two vars that
currently have numbers
tacked on the end of their
names; the ones we want
to reshape. In Stata these
are called “stubs”.
This specifies a
unique
individual
. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)
Data
wide
Take the
numbers off the
end of the
reshape vars,
and put them in
a new var called
“year”
->
long
Reshape Wide to Long Without ID
. webuse reshape1, clear
. drop id
1.
2.
3.
. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)
What if there is no ID variable?
. list
sex
inc80
inc81
inc82
ue80
ue81
ue82
0
1
0
5000
2000
3000
5500
2200
2000
6000
3300
1000
0
1
0
1
0
0
0
0
1
Data
Number of obs.
Number of variables
j variable (3 values)
xij variables:
. list, sepby(id)
. list
1.
2.
3.
wide
->
3
8
->
->
->
year
->
->
inc
ue
inc80 inc81 inc82
ue80 ue81 ue82
Let’s create one
. generate id=_n
(5a.2)
sex
inc80
inc81
inc82
ue80
ue81
ue82
id
0
1
0
5000
2000
3000
5500
2200
2000
6000
3300
1000
0
1
0
1
0
0
0
0
1
1
2
3
id
year
sex
inc
ue
1.
2.
3.
1
1
1
80
81
82
0
0
0
5000
5500
6000
0
1
0
4.
5.
6.
2
2
2
80
81
82
1
1
1
2000
2200
3300
1
0
0
7.
8.
9.
3
3
3
80
81
82
0
0
0
3000
2000
1000
0
0
1
long
9
5
Reshape Long to Wide
(5b.)
wide
long
id
year
sex
inc
ue
1.
2.
3.
4.
5.
1
1
1
2
2
80
81
82
80
81
0
0
0
1
1
5000
5500
6000
2000
2200
0
1
0
1
0
6.
7.
8.
9.
2
3
3
3
82
80
81
82
1
0
0
0
3300
3000
2000
1000
0
0
0
1
1.
2.
3.
id
sex
inc80
inc81
inc82
ue80
ue81
ue82
1
2
3
0
1
0
5000
2000
3000
5500
2200
2000
6000
3300
1000
0
1
0
1
0
0
0
0
1
When you have a long dataset
… but need a wide dataset
You can reshape the data from long to wide
… and optionally reorder the variables
. reshape wide inc ue, i(id) j(year)
(note: j = 80 81 82)
Data
long
->
wide
Number of obs.
Number of variables
j variable (3 values)
xij variables:
9
5
year
->
->
->
3
8
(dropped)
inc
ue
->
->
inc80 inc81 inc82
ue80 ue81 ue82
. order id sex inc80 inc81 inc82 ue80 ue81 ue82
. list
The order command serves only
to rearrange the sequence of
the variables on the file
Let’s Look at the Code In-Depth
wide
long
id
year
sex
inc
ue
1.
2.
3.
4.
5.
1
1
1
2
2
80
81
82
80
81
0
0
0
1
1
5000
5500
6000
2000
2200
0
1
0
1
0
6.
7.
8.
9.
2
3
3
3
82
80
81
82
1
0
0
0
3300
3000
2000
1000
0
0
0
1
We want our
data to end
up in wide
form
(5b.)
The two vars that change
each year, that we want
to stick numbers on the
end of
1.
2.
3.
id
sex
inc80
inc81
inc82
ue80
ue81
ue82
1
2
3
0
1
0
5000
2000
3000
5500
2200
2000
6000
3300
1000
0
1
0
1
0
0
0
0
1
This specifies a
unique
individual
. reshape wide inc ue, i(id) j(year)
(note: j = 80 81 82)
Take the values
in the variable
“year”, and stick
them on the
end of inc and
ue
What We Will Cover After the Break
• Combining multiple datasets vertically (append and preserve/restore)
• Save subsets of observations to different datasets
• Combining multiple datasets horizontally (1:1 merge)
• Save subsets of variables to different datasets
• m:1 (many-to-one) merging of datasets
• Extract group and individual data from multilevel datasets (collapse)
• Execute commands by groups (bysort)
• Create new variables based on data summaries and functions (egen)
• Create standardized scores and deviation scores (sd and std)
• Automate the same tasks for multiple variables (foreach loops)
• Global and local macros and looping
(6.)
Append Multiple Datasets and Generate a
Labeled Source Identifier
(7a.)
capop
1.
2.
3.
ilpop
1.
2.
3.
txpop
1.
2.
3.
county
pop
Los Angeles
Orange
Ventura
9878554
2997033
798364
county
pop
Cook
DeKalb
Will
5285107
103729
673586
county
pop
Brazos
Johnson
Harris
152415
149797
4011475
Combine several datasets
with the same variables but
different observations …
into a single dataset,
while identifying the
source of the data
1.
2.
3.
4.
5.
6.
7.
8.
9.
county
pop
state
Los Angeles
Orange
Ventura
Cook
DeKalb
Will
Brazos
Johnson
Harris
9878554
2997033
798364
5285107
103729
673586
152415
149797
4011475
CA
CA
CA
IL
IL
IL
TX
TX
TX
. use capop, clear
. append using ilpop txpop, generate(state)
. label define statelab 0 "CA" 1 "IL" 2 "TX"
. label values state statelab
. list, sep(0)
Appending Datasets
Open the master
datasets
Append the other
datasets to the first one
. use capop, clear
(7a.)
Generate a variable identifying
the data source: Consecutive
integers beginning with 0
. append using ilpop txpop, generate(state)
. label define statelab 0 "CA" 1 "IL" 2 "TX"
. label values state statelab
. list, sep(0)
Define and name a label for the
new source identifier variable
Apply the label to the
source identifier variable
Save Subsets of Observations to
Separate Datasets
(7b.)
. use CA, clear
. list
. ***Save subsets of cases
. preserve
. list, nolabel sep(0)
. keep if (state==0)
(6 observations deleted)
county
1.
2.
3.
4.
5.
6.
7.
8.
9.
Los Angeles
Orange
Ventura
Cook
DeKalb
Will
Brazos
Johnson
Harris
pop
9878554
2997033
798364
5285107
103729
673586
152415
149797
4011475
state
0
0
0
1
1
1
2
2
2
. save CA, replace
file CA.dta saved
. restore
. preserve
1.
2.
3.
pop
state
Los Angeles
Orange
Ventura
9878554
2997033
798364
CA
CA
CA
. use IL, clear
. list
. keep if (state==1)
(6 observations deleted)
. save IL, replace
file IL.dta saved
county
1.
2.
3.
county
pop
state
Cook
DeKalb
Will
5285107
103729
673586
IL
IL
IL
. restore
. preserve
. keep if (state==2)
(6 observations deleted)
. save TX, replace
file TX.dta saved
. restore
. use TX, clear
. list
1.
2.
3.
county
pop
state
Brazos
Johnson
Harris
152415
149797
4011475
TX
TX
TX
Create Separate files Containing
Subsets of the Observations
(7b.)
Create a temporary
backup of the dataset
Save the
subset
dataset
. ***Save subsets of cases
. preserve
Keep only a subset
of the observations
. keep if (state==0)
(6 observations deleted)
. save CA, replace
file CA.dta saved
. restore
. preserve
. keep if (state==1)
Restore the dataset to its
original state from the
temporary backup
Merge Files Containing the Same
Observations but Different Variables
autoexpense (using)
autosize (master)
make
weight
length
1.
2.
3.
4.
5.
Toyota Celica
BMW 320i
Cad. Seville
Pont. Grand Prix
Datsun 210
2,410
2,650
4,290
3,210
2,020
174
177
204
201
165
6.
Plym. Arrow
3,260
170
key
. use autosize, clear
(1978 Automobile Data)
make
1.
2.
3.
4.
5.
price
mpg
Toyota Celica
BMW 320i
Cad. Seville
Pont. Grand Prix
Datsun 210
5,899
9,735
15,906
5,222
4,589
18
25
21
19
35
make
weight
length
price
mpg
merged
1.
2.
3.
4.
5.
BMW 320i
Cad. Seville
Datsun 210
Plym. Arrow
Pont. Grand Prix
2,650
4,290
2,020
3,260
3,210
177
204
165
170
201
9,735
15,906
4,589
.
5,222
25
21
35
.
19
6.
Toyota Celica
2,410
174
5,899
18
. merge 1:1 make using autoexpense
Result
Merge data from two
datasets with the same
observations, but
different variables
(except for the key)
# of obs.
not matched
from master
from using
1
1
0
(_merge==1)
(_merge==2)
matched
5
(_merge==3)
_merge
matched
matched
matched
master only
matched
(3)
(3)
(3)
(1)
(3)
matched (3)
(8a.)
1:1 (Match) Merging
Open one of
the datasets
Based on a common key variable
which uniquely identifies each
observation across both datasets
. use autosize, clear
(1978 Automobile Data)
Merge with the
other dataset
. merge 1:1 make using autoexpense
Result
Do a
match
merge
(8a.)
# of obs.
Observations
with data from
just one dataset
not matched
from master
from using
1
1
0
(_merge==1)
(_merge==2)
matched
5
(_merge==3)
Observations with data
from both datasets
Save Subsets of Variables to
Separate Datasets
(8b.)
make
weight
length
price
mpg
1.
2.
3.
4.
5.
BMW 320i
Cad. Seville
Datsun 210
Plym. Arrow
Pont. Grand Prix
2,650
4,290
2,020
3,260
3,210
177
204
165
170
201
9,735
15,906
4,589
.
5,222
25
21
35
.
19
6.
Toyota Celica
2,410
174
5,899
18
_merge
matched
matched
matched
master only
matched
(3)
(3)
(3)
(1)
(3)
matched (3)
.
. ***Save subsets of variables
. preserve
. use SIZE, clear
(1978 Automobile Data)
. list, sep(0)
make
1.
2.
3.
4.
5.
6.
BMW 320i
Cad. Seville
Datsun 210
Plym. Arrow
Pont. Grand Prix
Toyota Celica
weight
length
2,650
4,290
2,020
3,260
3,210
2,410
177
204
165
170
201
174
. keep make weight length
. save SIZE, replace
file SIZE.dta saved
. restore
. use EXPENSE, clear
(1978 Automobile Data)
. list, sep(0)
. preserve
. keep make price mpg
. save EXPENSE, replace
file EXPENSE.dta saved
. restore
make
1.
2.
3.
4.
5.
6.
BMW 320i
Cad. Seville
Datsun 210
Plym. Arrow
Pont. Grand Prix
Toyota Celica
price
mpg
9,735
15,906
4,589
.
5,222
5,899
25
21
35
.
19
18
Save Subsets of Variables to
Separate Datasets
Backup before
subsetting
variables
Save the first
subset as a
Stata data file
.
. ***Save subsets of variables
. preserve
. keep make weight length
Keep the first
variable subset
. save SIZE, replace
file SIZE.dta saved
. restore
Make sure the
key variable is
included in both
subsets
(8b.)
Restore the
backup dataset
. preserve
. keep make price mpg
. save EXPENSE, replace
file EXPENSE.dta saved
. restore
Distribute Group-level Information
Across Individual-level Observations
Look up the variable values
in “dollars” and attach them
to the records in “sforce”
sforce
region
1.
2.
3.
4.
5.
N Cntrl
N Cntrl
N Cntrl
NE
NE
6.
7.
8.
9.
10.
South
South
South
South
West
11.
12.
West
West
. use sforce, clear
(Sales Force)
name
Krantz
Phipps
Willis
Ecklund
Franks
. merge m:1 region using dollars
(label region already defined)
Result
1.
2.
3.
4.
N Cntrl
NE
South
West
0
12
(_merge==3)
merged
region
Cobb
Grant
1.
2.
3.
4.
5.
dollars
region
# of obs.
not matched
matched
key
Anderson
Dubnoff
Lee
McNeil
Charles
sales
cost
419,472
360,523
532,399
310,565
227,677
138,097
330,499
165,348
N Cntrl
N Cntrl
N Cntrl
NE
NE
6.
7.
8.
9.
10.
South
South
South
South
West
11.
12.
West
West
(9a.)
name
sales
cost
_merge
Krantz
Phipps
Willis
Ecklund
Franks
419,472
419,472
419,472
360,523
360,523
227,677
227,677
227,677
138,097
138,097
matched
matched
matched
matched
matched
(3)
(3)
(3)
(3)
(3)
Anderson
Dubnoff
Lee
McNeil
Charles
532,399
532,399
532,399
532,399
310,565
330,499
330,499
330,499
330,499
165,348
matched
matched
matched
matched
matched
(3)
(3)
(3)
(3)
(3)
Cobb
Grant
310,565
310,565
165,348
165,348
matched (3)
matched (3)
m:1 Many-to-One (Lookup) Merging
Level 1
dataset
Lookup
merging
. use sforce, clear
(Sales Force)
Key
Variable
. merge m:1 region using dollars
(label region already defined)
Result
not matched
matched
Level 2
dataset
# of obs.
0
12
(_merge==3)
(9a.)
Extract the Individual- and Group-Level
Data from a Multilevel Data Set
(9b.)
. ***Get and sort the multilevel data
. use http://www.ats.ucla.edu/stat/hlm/faq/hsball, clear
. save lev2, replace
file lev2.dta saved
. sort id
.
. ***Write out the individual-level data
. preserve
. codebook, compact
. keep id minority female ses mathach
. codebook, compact
Variable
id
minority
female
ses
mathach
Obs Unique
7185
7185
7185
7185
7185
160
2
2
373
6031
. save lev1, replace
file lev1.dta saved
. restore
. ***Write out the school-level data
. collapse (mean) meanses size sector pracad disclim himinty, by(id)
Mean
Min
Max
.
.274739
.5281837
.0001434
12.74785
.
0
0
-3.758
-2.832
.
1
1
2.692
24.993
Label
Number of students
Variable
Obs Unique
id
meanses
size
sector
pracad
disclim
himinty
160
160
160
160
160
160
160
160
150
149
2
73
159
2
Mean
Min
Max
.
-.0001875
1097.825
.4375
.5139375
-.015125
.275
.
-1.188
100
0
0
-2.416
0
.
.831
2713
1
1
2.756
1
Label
(mean)
(mean)
(mean)
(mean)
(mean)
(mean)
meanses
size
sector
pracad
disclim
himinty
Number of schools
Note: Requires that the school-level
variables in the original multilevel data
have the same (constant) values for
every student within a given school.
. sort id
.
. ***Write out the individual-level data
. preserve
Separating Level 1 and Level 2 Data
. keep id minority female ses mathach
. ***Get and sort the multilevel data
. use http://www.ats.ucla.edu/stat/hlm/faq/hsball, clear
. codebook, compact
. sort id
Variable
Sort by the
Mean
group identifier
Obs Unique
Min
.
id
7185
160
.
.
. ***Write out the individual-level data
minority
7185
2
.274739
0
. preserve
female
7185
2 .5281837
0
ses
7185
373 .0001434 -3.758
. keep id minority female ses mathach
mathach
7185
6031 12.74785 -2.832
. codebook, compact
. save lev1, replace
Variable
Obs Unique
file lev1.dta saved
Save the
level 1Min
data
Mean
Max
Label
Keep. the level
1
1 variables
1
2.692
24.993
Max
Get the group
means of the
level 2 variables
Label
id
7185
160
.
.
.
. restore
minority
7185
2
.274739
0
1
.
***Write
out
the
school-level
data
female
7185
2 .5281837
0
1
. collapse7185
(mean)373
meanses
size-3.758
sector 2.692
pracad disclim himinty, by(id)
ses
.0001434
mathach
7185
6031 12.74785 -2.832 24.993
. save lev2, replace
file lev2.dta saved
. save lev1, replace
file lev1.dta saved
. codebook, compact
Save the level
2 dataset
(9b.)
Aggregating Data by Subgroups [With
Frequency Weights]
(10.)
aggregated
college
gpa
hour
year
number
1.
2.
3.
4.
3.2
3.5
2.8
2.1
30
34
28
30
1
1
1
1
3
2
9
4
5.
6.
7.
3.8
2.5
2.9
29
30
35
2
2
2
3
4
5
8.
9.
10.
3.7
2.2
3.3
30
35
33
3
3
3
4
2
3
. use college, clear
11.
12.
3.4
2.9
32
31
4
4
5
2
. list
1.
2.
3.
4.
year
gpa
hour
medgpa
medhour
1
2
3
4
2.788889
2.991667
3.233333
3.257143
29.44444
31.83333
32.11111
31.71428
2.8
2.9
3.3
3.4
29
30
33
32
. collapse (mean) gpa hour
(median) medgpa=gpa medhour=hour [ fw = number ], by(year)
frequency weights
Produce a new file with a single observation for each
group of records in the original data set. This
example produces the group means and medians.
Execute Commands by Subgroups
•
- bysort runs a stata command separately for each value of a
for each value of a variable
consideration. bysort does that
(11a.)
• ‘bysort’ runs a command
separately for each
value of a variable
• Using just ‘by’ requires
the data to be sorted by
the variable in
consideration. ‘bysort’
does that for you
Runs separate regressions for
observations when
foreign=“domestic” and when
foreign=“foreign”
Summarizes the variables price &
mpg when foreign=“domestic” and
foreign=“foreign”
Using bysort to Identify Duplicates
(11b.)
It is important to note that
bysort cannot be used
with every stata
commands
eg- scatter, histogram etc.
4 groups of duplicates
Within-observation Across-variables
Data Summaries
(12a.)
. use http://www.stata-press.com/data/r12/egenxmpl4, clear
. egen rtot = rowtotal(a b c)
//row total
. egen rn = rownonmiss(a b c)
//row n
. egen rmean = rowmean(a b c)
//row mean
Create new variables that are statistical
functions of multiple original variables for
each observation
. egen rmed = rowmedian(a b c) //row median
. egen rmin = rowmin(a b c)
//row minimum
. egen rmax = rowmax(a b c)
//row maximum
Example statistical functions
. list
1.
2.
3.
4.
a
b
c
rtot
rn
rmean
rmed
rmin
rmax
.
4
7
10
2
.
8
11
3
6
.
12
5
10
15
33
2
2
2
3
2.5
5
7.5
11
2.5
5
7.5
11
2
4
7
10
3
6
8
12
Within-variable Across-observations
Data Summaries
(12b.)
Create new variables that are statistical
functions of individual original variables
across all, or groups of, the observations
. sysuse autornd.dta, clear
(1978 Automobile Data)
//get the data
. keep make mpg
//keep make and mpg
. keep in 1/10
(64 observations deleted)
//keep the first 10 observations
. generate mfg=word(make,1)
//extract manufacturer from make
. format mfg %-7s
//left align the mfg variable
. egen vm_mpg=mean(mpg)
//mpg dataset mean
Means for subgroups
. bysort mfg: egen gm_mpg=mean(mpg) //mpg group mean
. list, sepby(mfg)
make
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
//list by mfg
mpg
mfg
vm_mpg
gm_mpg
AMC Concord
AMC Pacer
AMC Spirit
20
15
20
AMC
AMC
AMC
19
19
19
18.33333
18.33333
18.33333
Buick
Buick
Buick
Buick
Buick
Buick
Buick
20
15
20
25
20
15
20
Buick
Buick
Buick
Buick
Buick
Buick
Buick
19
19
19
19
19
19
19
19.28572
19.28572
19.28572
19.28572
19.28572
19.28572
19.28572
Century
Electra
LeSabre
Opel
Regal
Riviera
Skylark
Means for the whole sample
Creating Standardized Scores and
Deviation Scores
(13.)
. sysuse autornd.dta, clear
(1978 Automobile Data)
//get the data
. keep make mpg
//keep make and mpg
. keep in 1/10
(64 observations deleted)
//keep the first 10 observations
. egen vm_mpg=mean(mpg)
//mpg dataset mean
. egen vs_mpg=sd(mpg)
//mpg standard deviation
. egen vz_mpg=std(mpg)
//mpg z-scores
. generate vd_mpg=mpg-vm_mpg
//mpg deviation scores
Deviations from the variable’s mean
AKA Grand mean centering
. list
make
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Standardized scores
mpg
vm_mpg
vs_mpg
vz_mpg
vd_mpg
AMC Concord
AMC Pacer
AMC Spirit
Buick Century
Buick Electra
20
15
20
20
15
19
19
19
19
19
3.162278
3.162278
3.162278
3.162278
3.162278
.3162278
-1.264911
.3162278
.3162278
-1.264911
1
-4
1
1
-4
Buick
Buick
Buick
Buick
Buick
20
25
20
15
20
19
19
19
19
19
3.162278
3.162278
3.162278
3.162278
3.162278
.3162278
1.897367
.3162278
-1.264911
.3162278
1
6
1
-4
1
LeSabre
Opel
Regal
Riviera
Skylark
Create and Format Multiple
Variables at Once
(14a.)
Stata puts these line
numbers in the output
even though they are
not in the do file
. sysuse auto.dta, clear
(1978 Automobile Data)
. keep make price mpg headroom
. list in 1/10
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
make
price
mpg
headroom
AMC Concord
AMC Pacer
AMC Spirit
Buick Century
Buick Electra
4,099
4,749
3,799
4,816
7,827
22
17
22
20
15
2.5
3.0
3.0
4.5
4.0
1.
2.
3.
4.
5.
5,788
4,453
5,189
10,372
4,082
18
26
20
16
19
4.0
3.0
2.0
3.5
3.5
6.
7.
8.
9.
10.
Buick
Buick
Buick
Buick
Buick
LeSabre
Opel
Regal
Riviera
Skylark
. foreach v in price mpg headroom {
2. egen z_`v'=std(`v')
3. format z_`v' %6.2f
4. }
. list in 1/10
make
price
mpg
headroom
z_price
z_mpg
z_head~m
AMC Concord
AMC Pacer
AMC Spirit
Buick Century
Buick Electra
4,099
4,749
3,799
4,816
7,827
22
17
22
20
15
2.5
3.0
3.0
4.5
4.0
-0.70
-0.48
-0.80
-0.46
0.56
0.12
-0.74
0.12
-0.22
-1.09
-0.58
0.01
0.01
1.78
1.19
5,788
4,453
5,189
10,372
4,082
18
26
20
16
19
4.0
3.0
2.0
3.5
3.5
-0.13
-0.58
-0.33
1.43
-0.71
-0.57
0.81
-0.22
-0.92
-0.40
1.19
0.01
-1.17
0.60
0.60
Buick
Buick
Buick
Buick
Buick
LeSabre
Opel
Regal
Riviera
Skylark
Create and Check Dummy
Variables
(14b.)
. *Dummy variables
. use http://www.stata-press.com/data/r12/nlswork.dta,clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. *Tabulate to verify
. foreach x of varlist yr1-yr15{
2. tab `x'
3. }
. tabulate year, generate(yr)
interview
year
Freq.
Percent
68
69
70
71
72
73
75
77
78
80
82
83
85
87
88
1,375
1,232
1,686
1,851
1,693
1,981
2,141
2,171
1,964
1,847
2,085
1,987
2,085
2,164
2,272
4.82
4.32
5.91
6.49
5.93
6.94
7.50
7.61
6.88
6.47
7.31
6.96
7.31
7.58
7.96
Total
28,534
100.00
year==
68.0000
Freq.
Percent
Cum.
0
1
27,159
1,375
95.18
4.82
95.18
100.00
Total
28,534
100.00
Cum.
4.82
9.14
15.05
21.53
27.47
34.41
41.91
49.52
56.40
62.88
70.18
77.15
84.45
92.04
100.00
--Some output omitted-year==
88.0000
Freq.
Percent
Cum.
0
1
26,262
2,272
92.04
7.96
92.04
100.00
Total
28,534
100.00
Macros
. // Lists
. global names "Ballav Nick ChongMing Joe David"
. local names2
"Jake Steven Jose Tyrell Martin"
(15.)
Global – Exists until STATA is closed, or
a “clear all” command is used.
. di "`names2'"
Jake Steven Jose Tyrell Martin
. foreach x in $names {
di "`x'"
2.
3. }
Ballav
Nick
ChongMing
Joe
David
. foreach x in `names2' {
di "`x'"
2.
3. }
Jake
Steven
Jose
Tyrell
Martin
Macros can be used for many
things. Two examples are:
1) Lists or other storage
2) Variables
Local – temporary macro, disappears
when do file finishes running
. // Macros can also be used to specify variables.
. global ind_vars "female read write math"
. reg ses $ind_vars
Source
SS
df
MS
Model
Residual
11.6241389
97.137997
4
201
2.90603473
.483273617
Total
108.762136
205
.530547004
ses
Coef.
female
read
write
math
_cons
-.17103
.0135099
.0051675
.0066498
.8044773
Std. Err.
.1049502
.0066418
.0073557
.0067761
.3080703
t
-1.63
2.03
0.70
0.98
2.61
Number of obs
F( 4,
201)
Prob > F
R-squared
Adj R-squared
Root MSE
P>|t|
0.105
0.043
0.483
0.328
0.010
=
=
=
=
=
=
206
6.01
0.0001
0.1069
0.0891
.69518
[95% Conf. Interval]
-.3779747
.0004133
-.0093367
-.0067116
.197013
.0359146
.0266065
.0196717
.0200113
1.411942