Dexter - University of Missouri

Transcript Dexter - University of Missouri

Dexter
The Missouri Census
Data Center’s
Data Extraction
Utility
John Blodgett: OSEDA, University of Missouri
Rev.14May2007, jgb
What Is Dexter?
A web utility for performing simple data
queries, or extracts.
An integral part of the MCDC’s Uexplorebased data exploration/access system.
Written in SAS© to access data stored in
SAS datasets but requires no knowledge
of nor access to SAS.
Who Uses Dexter?
Anyone interested in accessing the MCDC data
archive, especially anyone who wants to directly
access and manipulate the data.
Not (directly) intended for the very casual data
user. Has a small but non-trivial learning curve.
Understanding the mechanics of Dexter is easy
compared to understanding the data to be
extracted.
Dexter’s Role Within Uexplore
Dexter accepts parameters that identify a
database file/table from which data are to be
extracted.
Uexplore provides the navigation tools to help
locate and understand the content of datasets.
Uexplore hyperlinks actually invoke uex2dex,
the dexter preprocessor, which in turn invokes
dexter.
Uexplore Page With Hyperlinks
The URL Used to Invoke Dexter
On the previous screen the dataset name
(usccflows.sas7bdat) is a hyperlink. The URL
associated with it is:
http://mcdc2.missouri.edu/cgibin/broker?_PROGRAM=websas.uex2dex.sa
s&_SERVICE=appdev9&path=/pub/data/mig2
000&dset=usccflows&view=0
It calls a program named uex2dex, written in SAS, and
passes parms to ID the data table to be queried.
Dexter and Census Data
Dexter doesn’t really know much about the
datasets from which it extracts data.
It is not American FactFinder . It is just a
generic extraction tool.
It uses only very basic metadata tools.
Other tools must be used to assist users in
navigating the database.
Dexter and the MCDC Data
Archive
Technically, there is nothing inherent in Dexter
that ties it to this archive.
In practice, however, the collection of public data
files that we call the “MCDC Data Archive” is
what Dexter was created for.
It is very probable the only reason you’re
reading this is because you want to access
something in that archive.
How Do You Invoke Dexter?
Most people will start at the uexplore home
page http://mcdc.missouri.edu/applications/uexplore.shtml
You navigate the data collection by choosing
“filetype” directories and at some point (…yada
yada yada) you wind up selecting (clicking on) a
file that is a data table.
Clicking on the data table invokes the uex2dex
preprocessor. You fill out the form which uex2dex
generates and click on an “Extract Data” button to
actually invoke Dexter.
Accessing Uexplore
(Home Page)
From the MCDC home page (or any page with the navy
blue navigation bar) click on “MCDC Data Archive”.
Or enter the URL:
http://mcdc2.missouri.edu/applications/uexplore.shtml
Choose Major Category
(from the links in teal box)
Scroll Within the Filetype
Descriptions to Find the Type
(mig2000)
Click on the Filetype Name
(links to uexplore for that directory/filetype)
In this case we want to click on the mig2000
filetype. The text tells us what kind of data we
can expect to find in this directory.
Uexplore Page - mig2000 Filetype
This page is all about hyperlinks (all the blue text). Before
proceeding to the the Dexter-invocation links we want to
back up and look at the data archive structure.
(Back to)
The Uexplore/Dexter Home Page
The Archive Directory
(on the Uexplore/Dexter home page)
The teal box contains links to 8 major data
categories (2000 Census thru Compendia)
The rest of the page consists mostly of
descriptions of and hyperlinks to the
archive’s data categories (which we refer
to as filetypes.)
Filetypes within the major categories are
sorted in descending order of what we
think will be their popularity.
Sf32000x is our most popular filetype.
What’s In the Archive?
Very important question. But not the focus
of this tutorial. Some day we’ll do a
separate, long tutorial just on that topic.
Not all filetypes are created equal. We
spend 90% of our resources on maybe
10% of our data directories.
Filetypes that are in bold are the MCDC
“house specialties”.
The Data Archive – General Info
We keep the data table files (the things Dexter
accesses) in the same directories along with other
related files (metadata, spreadsheets, csv files,
Readme.html files, etc.)
Each filetype directory has a special Tools subdirectory
where we keep program code and other tool modules
related to the data.
Subdirectories & Files starting with uppercase letters are
listed first and are usually worth looking at.
Dexter-accessible table files (“SAS datasets”) have
extensions of sas7bdat or sas7bvew.
Exercise
The Bureau of Economic Analysis
disseminates its REIS data with key
economic indictors for US geography
down to the county level.
Locate the filetype corresponding to this
data collection and navigate to the
directory page.
What’s the major category?
Uexplore Data Directory Page
What you see when you click on the beareis link on the Uexplore
home page. It displays a list of files within the directory. The
“File” column entries are hyperlinks. With a few exceptions the
files are displayed in alphabetical order.
Datasets.html is a special file providing enhanced navigation of
the data files in this dir. It displays just the data-table files, but
in a more logical order and with additional metadata.
Datasets.html page
Datasets.html Columns
The Name column is also a link to uex2dex /
dexter.
Label is a short description of the dataset.
#Rows (# of observations) and #Cols (# of
columns/variables) are taken from the datasets
metadata set. As are the Geographic Universe
and Units.
Link to Details is the most important column.
Universe and Units
The majority of datasets in the archive contain
summary data for geographic areas. For
example, a dataset in the popests directory
might contain the latest estimates for all counties
in the state of Missouri. The geographic
universe is Missouri, and the units are counties.
When we have many datasets in a directory it’s
usually because we have many different
combinations of universe and units.
Common Universes
Missouri (the state of) is by far the most common
universe for the MCDC archive.
United States is second – we have quite a
number of national datasets.
Illinois and Kansas are also very common since
we routinely download and convert census files
for these key neighbor states.
A common sort order for files on Datasets.html
pages is Missouri files first, then US, then IL/KS
and then other states.
Rows & Columns
The rows of the data tables are typically
geographic entities: a state, a county, a city, etc
Most of the columns in the data tables are
summary stats for the entity: e.g. the 2000 pop
count, the latest estimated pop, the change and
percent change, etc.
Other columns (“variables”) are identifiers with
names such as sumlev, geocode and areaname
.
Numeric vs. Character Variables
SAS© stores data as character strings or as
numerics.
We store all identifiers (geographic codes, etc)
as character strings even if they are made up of
numeric digits.
So the value of the state code for CT is “09”, not
9. The leading “0” matters.
Unfortunately, Excel ignores the distinction when
importing csv files.
Dataset Naming Conventions
All filetype names are 8 characters or less.
Dataset names were limited to 8 characters by the
software until recently.
The first characters of the dataset name often
correspond to the universe – e.g. “mo”, “il”, “us”.
The geo units are often part of the ds-name – e.g.
“motracts”, “uszips”.
For time series data the name usually ends with a
time indicator – e.g. “uscom03” contains data thru
2003.
Variable Naming Conventions
Not as rigorously applied as we might like, esp.
for older datasets (conventions used for 1980
datasets differ a little from 2K and 1990 sets, for
example)
Certain names appear on many datasets and
are consistent. These are mostly identifier
variables, the ones used in creating filters and
for merging data from different files.
Consistency With Census Bureau
Data Dictionary Names
The Bureau often distributes data dictionary files
with their data that include suggested names for
the fields.
Their name for the field (variable) containing the
name of the geographic area being summarized
is ANPSADPI. We decided to go with
AreaName instead.
But in most cases we try to use the same name
as in the data dictionary.
Common ID Variables
SumLev: Geographic summary level
codes as used in 2K census. (3-char)
State: 2-char state FIPS code.
County: 5-char county FIPS code, incl. the
state.
Geocode: A composite code to id a
geographic area. E.g. the value for a
census tract might be “29019-0010.00”.
AreaName: Name of the area.
Common ID Variables (cont)
Tract: census tract in tttt.ss format, always
7 characters with leading 0s and 00
suffixes. E.g. “0012.00” .
Esriid: Similar to geocode but intended to
use as a key for linking to shape files from
ESRI (the ArcInfo people). When
geocode=“29019-0010.00” the value of
esriid=“29019001000”.
SAS Formats
Some variables have custom formats
associated with them, which cause them to
display a name instead of their actual value.
E.g. the variable County may have a value of
“29019” but displays as “Boone MO” using the
format. Most Dexter output has the formatted
values.
Click the “View qmeta Metadata report” option at
the end of Section II on the Dexter form to see
which variables have formats associated.
More About the MCDC Data
Archive
http://mcdc2.missouri.edu/tutorials/
mcdc_data_archive.ppt
Details Page
We get here by
clicking on the
Details link on
Datasets.html
page.
Lots of content
here – but will
vary.
Key variables
is often
extremely
useful when
doing filters.
Note the direct
link to Dexter
under Access
the dataset near
the bottom.
Increase Text Size to Read Fine Print
Exercise – Navigate to Dataset
Earlier we were looking at datasets in the 2000
Census category, filetype mig2000.
Go to the Uexplore home page and navigate to
this filetype.
Use the Datasets.html page to display the
datasets within the directory.
Find the row for the usccflows data table and
click on the Details link for this table.
From the Details page click on the keyvals link
for the variable State.
Key Variables Report: State
Tells you that
the variable
State has a
value of 01 (for
“Alabama”) in
22137 rows of
this dataset.
This can be
very helpful
when doing a
data filter in
Dexter.
Finally…
Time to See Dexter
Dexter Input Page (Top)
Sec. I. Output
Format(s): csv
file (into Excel)
most common.
Sec. II is where
the work is. Only
2 of 5 rows
shown here.
User fills out the
entire form before
using Extract Data
button to invoke
Dexter.
Dexter Section II
Filters
“A filter is a logical condition that references values of columns
within a row. For each row, the condition is evaluated and, if
true, the row is selected for output. (If not, the row is omitted,
or "filtered".) To keep all rows, just skip this section. The filter
being created here can consist of up to 5 logical segments,
each referencing a data set Variable, a relational Operator,
and a data Value (or values) -- constants that the user
must type in. The segments are evaluated as true or false.
Logical operators (which default to And and appear between
the segment specification rows) relate the segments when
more than one is specified, creating a compound logical
condition.”
If this explanation makes sense to you then you are going to
have an easy time with Dexter. If not, follow through the
examples and then try reading it again.
Example of Defining a Filter:
What We Want
Assuming we are running dexter to access the
mig2000.usccflows dataset we want to select
only those rows that:
– have Missouri as the anchor state, and
– have at least 100 gross flows.
We’ll just assume you’ve read the descriptions
and have some clue regarding what an anchor
state and a gross flow are. (People interested
in population migration would be likely to know
this.)
Select Variable for Filter
Click on the Variable/Column drop-down
menu in the 1st row and select State.
Select Comparison Operator
Select “Equal to” as the Operator from drop down
menu in the middle column.
Enter Value to Complete Row
Remember the Key Values report showing all the values for the variable
State? If you did not know the code for Missouri you could find it there.
What We Have So Far
We have created a logical condition that
can be evaluated for each row of the
dataset:
State = ’29’
According the key values report for
State we know that this condition is true
for 38,316 rows in the dataset. The
filter we are building will select just
those 38,316 rows out of the 1.1+
million in the full dataset.
Adding a Second Condition
But we do not want all the cases pertaining to
Missouri as the anchor state. We only want
those where we have at least 100 gross flows
(whatever those are).
So we need to fill out a second row, adding
this condition. We select GrossMig as the
variable, Greater Than or Equal To as the
Operator and enter 100 in the Value field.
We leave the logical operator radio button set
to “And” to indicate that this is an additional
necessary condition.
The Completed Filter
You are now ready to scroll down to Section III.
Section III: Choose Variables
Conceptually simple section:
just select the variables you
want on your output from
scrollable (if needed) menu
lists.
Identifiers (character type
variables) are listed separate
from numerics. Important
MCDC Data Archive
convention.
Typing names instead of
selecting is possible but not
recommended.
Here we select all variables
except State.
Section IV: Title & Sort Order
Entirely optional,
typically not used
section.
Title value is used as
report title if you asked
for one, which we did
not in this example.
Sort specs are handy.
Note use of minus sign
(hyphen) to indicate a
descending sort.
Another Extract Data
button to use to run
query.
Dexter Output Page
The first output you
see is this results
“index” page.
Always a link to a
Summary Log page
Additional links
depend on output
formats requested.
Dexter Summary Log
This file always
generated. Important
for documenting the
query.
Indicates what file(s),
when run, as well as
any filter and the
variables kept.
Output directory
details can usually be
ignored.
Select Output File(s)
Click on Delimited File Link
What happens when
you click on this file
depends on how your
browser is configured.
The file referenced
has a .csv extension
which IE usually
associates with the
Excel plugin.
Clicking this link will
typically invoke Excel.
Viewing .csv Output in Excel
The csv file is
read into Excel.
Rows 1 & 2
have names &
labels.
Other rows
contain the
selected data.
Note sort order.
Some Key Points So Far
Navigation tools such as the uexplore home
page, the uexplore directory navigator and
Datasets.html reference pages are used to
make accessing data with Dexter easier.
You get to select rows (“filter”) and columns
as well as the format(s) of your extracted
data.
Filtering often requires knowledge of code
values. These can sometimes be accessed
from the Key Values reports on the Details
page referenced by a Datasets.html page.
The query generated is summarized on a
Summary.log page.
Pop Quiz
1.
2.
3.
4.
5.
6.
Can Dexter be used to access an xls file?
How are the files sorted on a directory page
displayed by uexplore?
What does the uex2dex interface app do?
What is the fastest way to tell how many rows
were selected by your query?
Which of the 5 sections of the Dexter query
form must be filled out to have a valid
request?
What’s a filetype? What does it mean when
one is displayed in bold on the Uexplore home
(Archive Directory) page?
Sample Query 2:
What We Want
We want data from the 2000 Census, Summary File 3
regarding poverty in Missouri – in cities and counties.
We want the number and the % of poor persons, as
well as the median household income.
We only want the data for cities of at least 5000
persons, but for all counties and for the state as a
whole.
We want output as an HTML file sorted by the type of
geography (state, county, city) and then by descending
poverty rate.
What You Need to Know
You need to know where these kinds of data
are stored. It is 2000 census data, but where
among all those different summary files?
Read the brief descriptions on the uexplore
home page. The sf32000 filetype looks good,
but it turns out that it is too big. The standard
extract version, sf32000x, has what we need.
An alternate way by which users may arrive
here is via links on the MCDC Demographic
Profile reports.
A Demographic Profile Report
A link at the bottom of this report page invokes Dexter with the appropriate
dataset selected. Follow the link (in title of this page) and try it.
(Back on uexplore home page)
Click on sf32000x to Start
Descriptions with links from the uexplore home
page.
The sf32000x Directory
(As seen by uexplore)
Subdirectories &
files with upcased
first letters are
shown first.
Index.html,
Readme.html and,
of course,
Datasets.html are
required reading
(browsing).
Files are in
alphabetical (not
logical) order.
(sf32000x) Readme.html
The Datasets.html Page
(for the sf32000x filetype)
Details Page -- sf32000x.moi
Lots of info here. Most important is perhaps the Key variables
link for variable SumLev (geographic summary level).
Key Variables Report for SumLev
(stf32000x.moi)
Filters Based on SumLev
Var
SumLev
Operator Value
Equals
040
Results
State Level
Summary (only 1
row selected)
140
Census Tract
Summaries – 1320
rows selected.
040:050:160 1 State level , 115
County level & 972
Place level rows.
SumLev
Equals
SumLev
In List
Sample Query 2: What We Want
(Repeated in case you forgot)
We want data from the 2000 Census, Summary File
3 regarding poverty in Missouri cities and counties.
We want the number and the % of poor persons, as
well as the median household income.
We only want the data for cities of at least 5000
persons, but for all counties and for the state.
We want output as an HTML file sorted by the type
of geography (state, county, city) and then by
descending poverty rate.
A Complex Filter
The Filter Explained
There are 2 logical parts to the filter:
1. SumLev In (‘040’,’050’)
2. Sumlev = ‘160’ and TotPop >= 5000
The parentheses checkboxes are used
to group the 2nd & 3rd lines. The and
between lines 2 and 3 is executed before
the or between lines 1 and 2.
The Filter Explained, cont.
The SAS© code generatd by these menu
choices :
where sumlev in (‘040’,’050’) or
(sumlev=‘160’ and totpop >=5000);
The “in” operator (called “In List” on Operator
pull-down menu) allows specifying that the value
of a variable should be one of a list of values.
Those values are entered separated by :’s in
the Value column of the filter specs form.
Completing the Query: Parts 3 & 4
HTML Output
We see that
Pemiscot has
the highest
poverty rate of
any county.
How do we
know this?
Why don’t we
see any data
for cities?
Exercise
Access the same dataset as in the example:
sf32000x.moi
Select census tract summaries in Greene co…
… with a poverty rate of at least 10%.
Keep all identifiers necessary to identify the tract,
and all variables related to poverty.
Generate a csv file and load it into a spreadsheet
(probably Excel).
Exercise 2
Repeat the previous exercise except do it
for all counties (instead of census tracts) in
the states of Arkansas and Oklahoma.
Sort the results by descending poverty
rate and generate output in pdf format as
well as a csv file.
Hint: A good place is start is with the
Datasets.html page.
End of Show
Questions and Comments to:
[email protected]