NW Startups CUDE Parser

Transcript NW Startups CUDE Parser

Andrew Ferlitsch
17-Jun 2013
OpenGeoCode.Org
“Open Data Project”
Paper: www.opengeocode.org/articles/BigData - Using USGov Data.docx
1




US government agencies have been mandated to make data collected
and compiled by US tax dollars accessible to the public.
In 2009, the Obama administration launched the Open Government
Initiative (OGI) to provide a centralized repository for “public access to
high value, machine readable datasets generated by the Executive Branch of
the Federal Government.” (data.gov).
In May 2013, President Obama launched an initiative “to build a 21st
century platform for government. It requires federal agencies to make key
services available to mobile users and to release important government data
in ways that will allow citizen developers and entrepreneurs to turn them
into new products, services, and jobs.” (whitehouse.gov)
What we’ve found is that there is no common convention
within the US government on defining public access to data
collected.
2
Federal Agencies with Large Public Dataset Repositories









US Census Bureau – Census Data
US Geological Survey (USGS) – Domestic Geographic Features
National Agency of Geospatial Intelligence (NGA) – Foreign Features
US Department of Education – CPI and Employment
United States Department of Agriculture (USDA) – Ag Census
National Weather Service (NWS) – Weather & Climate
Bureau of Land Management (BLM) – GIS & Map data
US Energy Information Agency (EIA) – Energy Production/Usage
Environmental Protection Agency
3
Interactive Forms

Some datasets are only accessible through an interactive form. Interactive
forms generally occur on sites which do not allow you access to the entire
dataset in a single query, but require you to request subsets of the
datasets through a form interaction. Form selections typically for
narrowing your selection typically include one or more of:
 Data Category
 Geographic (state, county, city, zip)
 Time Period

Obtaining the entire dataset therefore is very tedious and labor
consuming.
4
5
6
Tabular

Some datasets are only available as data displayed in tabular format on
the website. In this case, one must either copy/paste the data or hand
copy it. Below is an example from the BLM:
7
FTP Download

Other government websites allow you to download the data using
anonymous FTP. Below are FTP instructions for obtaining NCEP datasets
from NOAA.
8
HTTP Download

One of the most common access methods to datasets is as a HTTP download.
Typically a page will be displayed with one or more selectable datasets and an
accompanying link. Upon clicking on the link, an HTTP download of the dataset
(as a file) is performed. Below is an example of downloading the Public Use Micro
Sample (PUMS) of the Survery of Business Owners (SBO) from the 2007 US
Economic Survey:
9
REST API
 Dataset (portions of) accessed by HTTP call
10

In recent years, local, state and federal governmental agencies have been
migrating to standardizing access interface to public datasets based on
the open data solution from Socrata, Inc (socrata.com). The agencies that
have adopted using this interface use “data” as the subdomain, such as:
data.gov
data.wa.gov
data.seattle.gov
centralized interface to federal government datasets
centralized interface to State of Washington government datasets
centralized interface to City of Seattle government datasets
11
Below is a snapshot of the access/search interface for Seattle:
12
Flat Files/CSV

The most common format for downloading government datasets are in flat file
formats. Flat file formats can be generally defined as:



In older formats (such as still found in US Census Gazetteer datasets), fields are
typically delimited by:




one entry (record) per line
ordered list of delimited fields
fixed column length
Pipe (|) symbols
Tabs
These formats are giving away to standardizing to CSV formats. CSV
formats are identical to older flat file formats, but standardize on
delimitation:


Use commas (North America) or semi-colons (Europe) to delimit fields
Use double quotes around field value if value contains a delimiter.
13
Flat file formats continue to be popular because:




Require only minimal computer skills to be parsed.
Are human readable .
CSV files can be imported into Excel Spreadsheets .
Are exchangeable between common databases (e.., MySQL, MSSQL,
Postgres, etc).
Open Data initiatives (e.g., data.gov) commonly identify
older flat file formats and CSV formats as TXT and CSV
respectively.
14
XML
XML is an alternate format to flat files for representing datasets. While these formats
require more sophistication in parsing, they have the advantage of being able to represent
hierarchical relationships in data, that otherwise would be difficult in a flat file.
Below is an example XML representation for location information which is hierarchically
related to a city, which is hierarchically related to a state.
<state locale=”Washington”>
<city locale=”Seattle”>
<feature location=”lat,lng”> … </feature>
</city>
<city locale=”Tacoma”>
</city>
</state>
RDF/XML
RDF/XML is a W3C standard for self-describing the representation of information
(datasets) that are exchanged via the Internet. The format is in XML with the layout
specified in a “subject-predicate-object” data model.
15
Below is an example of building permit data for the District of Columbia (from
data.dc.gov):
16
Below is an example of EPA dataset that’s been converted to an RDF format by
the Data-gov Wiki, a project by Tetherless World Constellation at Rensselaer
Polytechnic Institute.
17
JSON
JSON is another alternate to flat files for representing datasets. This format
(JavaScript Object Notation) has the same advantages in hierarchical
representation as XML. The format is standardized and widely used for
exchanging data across the web with browser client applications written in
JavaScript.
18
Some datasets are still provided only in a document format, which makes it
difficult for end-users to extract the information without manual
copy/paste.
Excel
Microsoft Excel is one document format which is commonly used for representing
datasets in a tabular format. An advantage of this document format is that Excel has
built-in a ‘Save As’ file option that will convert and save (export) the data into a CSV
format. Once the data has been exported into a CSV format, the dataset can then be
imported into databases in more conventional methods.
PDF
The Adobe Portable Document Format (PDF) is another commonly used
document format for representing datasets; particularly if they come with
accompanying textual information. There is no standard way of handling
datasets in this form. Typically, an enduser has to browse the document locating
the dataset, separate from the accompanying text, and copy/paste for extraction.
19
Below is an example PDF publication from the US Census from the 2012
Statistical Abstract:
20


Government datasets use a variety of identifiers to
uniquely identify one record from another (primary
keys).
Many of these identifiers are specific to the dataset which
make it difficult to cross correlate records from one
dataset from one agency to another. A variety of
approaches can be used to correlate records relating to
the same item/feature but do not share the same
identification system.
21
Below are some examples I have found useful for this purpose:

Business Records:



Matching by business name, DBAs and name on business license.
Matching by phone numbers.
Matching by reducing street address to standard abbreviated form.
130 N.E. Fourth Avenue => 130 ne 4th ave

Public Buildings:

Matching by reducing names to standard abbreviations.
Mount Vernon Junior High School => Mt Vernon Jr High

Geographic Locations

Matching by short, formal , common nickname, and FIPS codes.
State of Vermont => Vermont
Vermont, State of => Vermont

Matching within a radius of a geographic coordinate.
22
Datasets are usually delineated by some hierarchical grouping of the
records. One of the most common delineations is by geographic region.
For example, records may be delineated by state, then county and then
by city. In the case of a flat file, the delineation typically immediately
follows the record’s unique identifier, such as:
Identifier , state , county, city , data ….
The state, county and city identifiers may appear in short or formal
name, in gazetteer or proper reading order, formal abbreviation or FIPS
code. The following all refer to the State of Vermont
Vermont
State of Vermont
Vermont, State of
VT
50
23
In the case of an XML file, the delineation typically encapsulates all the
records within the geographic boundary, such as:
<state name=”Vermont”>
<county name=”Franklin”>
<city name=”Bakersfield”>
<record 1 />
<record 2 />
…
</city>
<city name=”Berkshire”>
…
</city>
…
</county>
<county name=”Essex”>
…
</county>
…
</state>
24

A variety of methods for delimiting fields are found
within government datasets. For flat files, fields are
generally delimited by either:
Fixed Column Width
 Tab
 Special Character: pipe symbol (|), comma (,) or semi-colon
(;)


In the latter case, if the field value contains a
delimiter, then the value is generally encapsulated
with double quotes or preceded with a backslash to
turn off its meaning as a delimiter.
25
Below is a screenshot of the US Census 2000 Places gazetteer dataset. In this
example, the fields are delimited by a fixed column width. The first column (2
chars) is the state abbreviation, followed by the state FIPS code (2 chars), which
is then followed by the place FIPs code (5 chars) which is used as the primary
key. Following this is a combination of the place name and locality type. Note
the locality type is part of the place name and is inconsistently delimited. In this
case, it is the last word, separated by a space, in the field (e.g., city ).
26
Other issues that need to be considered when using government datasets are that
many of these datasets are a continuum of data collected over a long period of
time, and may contain records that are historical or otherwise have out-of-date
information.
Class
County
State
Latitude
Longitude
Ele(ft)*
Map**
BGN Date
Entry Date
San Francisco
For example,
the USGS
search form
for records
containing
“San
San Francisco using
CA
374642N GNIS
1222633W
256
11-MAR-2008
North
CivilFrancisco”
San Mateo
373910N “Civil”
1222036W
0
Hunters
Point
11-MAR-2008
in theCAfeature class
will produce
the
following
results
which
Civil
Los Angeles
CA
342425N
1183643W
1440
Newhall
19-JAN-1981
include historical records, such as record 3 which is aSanreference
to an old land
Francisco
Civil
San Francisco
CA
374642N
1222633W
256
01-FEB-1993
North
grant (1842) in Los Angeles County, whose formal name
is Rancho San
Civil
Santa Clara
CA
370427N
1213722W
315
Gilroy
19-JAN-1981
Francisco.
San Francisco
Civil
San Francisco
CA
374642N
1222633W
256
26-SEP-2001
Civil
North
Civil
White Pine
NV
392430N
1145133W
7034
McGill
-
01-MAY-1990
Civil
Humacao
PR
180911N
0654921W
59
Humacao
-
21-MAR-2008
Civil
San Juan
PR
182757N
0660647W
36
San Juan
-
21-MAR-2008
-
26-SEP-2001
Civil
San Mateo
CA
373842N
1222443W
13
San Francisco
South
Civil
Carver
MN
444133N
0934235W
889
Jordan West
-
01-SEP-1995
Civil
Wake Island
UM
191655N
1663901E
30
Unknown
-
06-OCT-1998
27

Another issue to consider is the handling of non-ASCII
characters. Since the character set for the English
language is ASCII-only, many datasets store the dataset
in ASCII format (7-bit chars). This can be problematic if
the dataset contains place names with Latin diacritic
characters, such as place names from the US territory of
Puerto Rico or other names whose official spelling is nonUS native.
28

NW Startups CUDE Parser

Transcript NW Startups CUDE Parser

Directory