
Cleaning, Validating, and Enhancing Data with OpenRefine
iDigBio – University of Florida
2nd Train-the-Trainers Georeferencing Workshop
Deborah Paul, iDigInfo and iDigBio
August 12 - 16, 2013
Pre & Post-Digitization
• Exposing Data to Outside Curation – Yippee! Feedback
• Data Discovery
• dupes, grey literature, more complete records,
annotations of many kinds, georeferenced records
• Filtered PUSH Project
• Scatter, Gather, Reconcile – Specify
• iDigBio
• Planning for Ingestion of Feedback – Policy Decisions
• re-determinations & the annotation dilemma
• to re-image or not to re-image
• “annotated after imaged”
• to attach a physical annotation label to the
specimen from a digital annotation or not
Data curation, management, and enhancement
• querying dataset to find & fix errors
• kinds of errors
• filename errors
• typos
• georeferencing errors
• taxonomic errors
• identifier and GUID errors
• format errors (dates)
• mapping
• enhancing data
• repetitive data cleaning, enhancement using scripts
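Checking for one of the error kinds above, bad date formats, can be scripted. A minimal sketch (the `eventDate` column name is an assumption, not from the slides):

```python
from datetime import datetime

def find_bad_dates(rows, column="eventDate", fmt="%Y-%m-%d"):
    """Return the indexes of rows whose date field does not parse as ISO (YYYY-MM-DD)."""
    bad = []
    for i, row in enumerate(rows):
        try:
            datetime.strptime(row[column], fmt)
        except ValueError:
            bad.append(i)
    return bad

records = [
    {"eventDate": "2013-08-12"},
    {"eventDate": "12/08/2013"},   # format error: not ISO
    {"eventDate": "2013-02-30"},   # impossible date
]
print(find_bad_dates(records))  # [1, 2]
```

The same check catches both wrong layouts and impossible dates, since `strptime` rejects February 30th as well as non-ISO strings.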
Clean & Enhance Data with Tools
• Query / Report / Update features of Databases
• Learn how to query your databases effectively
• Learn SQL (MySQL, it’s not hard – really!)
• Gives you a communication tool with your IT staff
• Using new tools
• Kepler Kurator – Data Cleaning and Enhancement
• Open Refine, desktop app
• from messy to marvelous
• http://code.google.com/p/google-refine/
• http://openrefine.org/
• remove leading / trailing white spaces
• standardize values
• call services for more data
• just what is a “service” anyway?
• the magic of undo
• using scripts for repetitive data manipulation tasks
• Google Fusion Tables
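Two of the Refine operations listed above, trimming whitespace and standardizing values, can be sketched in plain Python (this is an analogue for illustration, not OpenRefine itself; the county values and canonical map are assumptions):

```python
def trim(value):
    # OpenRefine's "remove leading / trailing whitespace" (GREL: value.trim())
    return value.strip()

def standardize(value, canonical):
    # Map case/spacing variants onto one agreed spelling, similar in spirit
    # to accepting a cluster in Refine's "Cluster & edit" dialog.
    key = " ".join(value.split()).casefold()
    return canonical.get(key, value)

canonical = {"idaho co.": "Idaho County", "idaho county": "Idaho County"}
raw = ["  Idaho  Co. ", "IDAHO COUNTY", "Latah County"]
clean = [standardize(trim(v), canonical) for v in raw]
print(clean)  # ['Idaho County', 'Idaho County', 'Latah County']
```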
OpenRefine
• A power tool for working with messy data.
• Got Data in a Spreadsheet or other format…
• TSV, CSV, *SV, Excel (.xls and .xlsx),
• JSON,
• XML,
• RDF as XML,
• Wiki markup, and
• Google Data documents are all supported.
• the software tool formerly known as Google Refine
http://openrefine.org/
• Install
Configure parsing options
then click Create Project
Facet, text facet
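A text facet is essentially a count of each distinct value in a column. This sketch (column data assumed) mirrors what Refine shows in its facet sidebar:

```python
from collections import Counter

column = ["Idaho", "Idaho", "idaho ", "Clearwater", "Idaho"]
facet = Counter(column)
for value, count in facet.most_common():
    print(f"{value!r}: {count}")
# 'idaho ' shows up as its own facet choice -- a clue that the
# column needs trimming and standardizing before analysis.
```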
Enhance Data
• Call “web services”
• GEOLocate example
• your data has locality, county, state, country fields
• limit data to a given state, county
• build query
• "http://www.museum.tulane.edu/webservices/geolocatesvcv2/glcwrap.aspx?Country=USA&state=ID&county=Idaho&fmt=json&Locality=" + escape(value,'url')
• service returns json output
• latitude, longitude values now in your dataset.
• Google Fusion tables
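The GREL expression above can be mirrored in Python, with `urllib.parse.quote` playing the role of GREL's `escape(value, 'url')` (a sketch; the example locality is an assumption):

```python
from urllib.parse import quote

# Same URL the slide builds in GREL:
# "...glcwrap.aspx?Country=USA&state=ID&county=Idaho&fmt=json&Locality=" + escape(value, 'url')
BASE = ("http://www.museum.tulane.edu/webservices/geolocatesvcv2/"
        "glcwrap.aspx?Country=USA&state=ID&county=Idaho&fmt=json&Locality=")

def geolocate_url(locality):
    # URL-encode the free-text locality before appending it
    return BASE + quote(locality, safe="")

print(geolocate_url("Eagle Mountain"))
```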
Add column by fetching URLs…
Data back from GEOLocate in JSON format
Parsing json
• How do we get our longitude and latitude
out of the json?
• Parsing (it’s not hard – don’t panic)!
Parsing json
• Copy and paste the text below into
• http://jsonformatter.curiousconcept.com/
{
  "engineVersion": "GLC:4.92|U:1.01374|eng:1.0",
  "numResults": 2,
  "executionTimems": 187.2048,
  "resultSet": {
    "type": "FeatureCollection",
    "features": [
      {
        "type": "Feature",
        "geometry": { "type": "Point", "coordinates": [-115.11639, 46.33778] },
        "properties": {
          "parsePattern": "EAGLE MOUNTAIN",
          "precision": "Low",
          "score": 29,
          "uncertaintyRadiusMeters": 1807,
          "uncertaintyPolygon": "Unavailable",
          "displacedDistanceMiles": 0,
          "displacedHeadingDegrees": 0,
          "debug": ":GazPartMatch=False|:inAdm=True|:Adm=IDAHO|:NPExtent=3000|:NP=EAGLE MOUNTAIN|:KFID=|EAGLE MOUNTAIN"
        }
      },
      {
        "type": "Feature",
        "geometry": { "type": "Point", "coordinates": [-115.89, 46.0225] },
        "properties": {
          "parsePattern": "CLEARWATER",
          "precision": "Low",
          "score": 25,
          "uncertaintyRadiusMeters": 3036,
          "uncertaintyPolygon": "Unavailable",
          "displacedDistanceMiles": 0,
          "displacedHeadingDegrees": 0,
          "debug": ":GazPartMatch=False|:inAdm=True|:Adm=IDAHO|:NPExtent=5040|:NP=CLEARWATER|:KFID=|CLEARWATER"
        }
      }
    ],
    "crs": { "type": "EPSG", "properties": { "code": 4326 } }
  }
}
http://jsonformatter.curiousconcept.com/
Copy the JSON output from the spreadsheet and paste it here.
Click the Process button (lower right of the screen).
Parsing json
Parsing latitude
Parsing longitude
The Results!
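The extraction shown in the screenshots can be sketched in Python against a truncated copy of the response above. Note that GeoJSON coordinates come back as [longitude, latitude]:

```python
import json

# In GREL the equivalent idea is roughly:
#   value.parseJson().resultSet.features[0].geometry.coordinates[1]  (latitude)
#   value.parseJson().resultSet.features[0].geometry.coordinates[0]  (longitude)
response = '''{"resultSet": {"type": "FeatureCollection", "features": [
  {"type": "Feature", "geometry": {"type": "Point",
   "coordinates": [-115.11639, 46.33778]}}]}}'''

data = json.loads(response)
lon, lat = data["resultSet"]["features"][0]["geometry"]["coordinates"]
print(lat, lon)  # 46.33778 -115.11639
```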
Reconciliation, an example
• What if you’d like to compare your taxon
names (or author names, perhaps) to
those from a standardized source?
• Try reconciliation using Refine to call some
taxonomic name resolution service.
• What follows is only an example. There are
other services against which to compare
and enhance your data.
• The Name Validation Example that follows
was set up by Gaurav Vaidya (University of
Colorado). Click link to read more.
• Next, see screen shots for steps.
• All links subject to changing over time.
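Under the hood, Refine batches names into a `queries` JSON object and sends it to the reconciliation endpoint. A sketch of the request shape (the endpoint is the one from the slide and, per the note above, may no longer be online):

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://refine.taxonomics.org/gbifchecklists/reconcile"

def build_reconcile_request(names):
    # Refine sends {"q0": {"query": ...}, "q1": {"query": ...}, ...}
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return ENDPOINT + "?" + urlencode({"queries": json.dumps(queries)})

url = build_reconcile_request(["Abrus precatorius", "Quercus alba"])
print(url[:60] + "...")
```

The service answers each key with candidate matches and scores, which Refine renders as the match choices shown in the next screenshots.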
1st, duplicate and rename the scientificName column
2nd add a Reconciliation Service
• from drop-down of
reconcileScientificName
click Start reconciling
• in next pop-up click Add
Standard Service
• in pop-up enter url:
http://refine.taxonomics.org/gbifchecklists/reconcile
• click Add Service
• cells showing Create new topic found no match
• click ✓ to accept a match for this cell only
• click ✓✓ to accept and apply that match to all cells that had the original value
• choosing the 1st option with ✓✓ creates a cell with the value Abrus precatorius L. (Plantae)
• to undo, click Choose new match
Refine data using and re-using scripts
• Refine records each step done and
creates a reusable script.
• to undo step 5 (reconcile), just
click on 5.
• to undo create new column (step
4) click on step 4.
• Extract all steps as a script, to
reuse with the next dataset that
has identical issues and is
formatted the same way.
• Save script as a .txt file
• Paste script in new dataset* and
click Apply
• dataset must be like this one
(same original headers and
column order)
click Extract
copy and save output as a .txt file
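The extracted script is plain JSON. A trimmed-down example of what one saved operation looks like (the column name here is assumed for illustration):

```json
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column locality using expression value.trim()",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "locality",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
```

Because the operations name columns explicitly, the script only replays cleanly on a dataset with the same headers, which is why the slide stresses identical formatting.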
In a new dataset, easily repeat the same tasks
• got a dataset just like the one before?
• different values, same columns?
• open the dataset in Refine
• click on Undo / Redo
• click Apply…
• in the Apply Operation History
• paste the same script in here
• click Perform Operations
• all the steps run
• Voilà, you are done and the data is refined!
Recipes for enhancing your data!
• Someone’s already written the same script
to do the same task.
• What scripts have others written?
• You can use them, you don’t have to write
your own.
• Where are they?
• See https://github.com/OpenRefine/OpenRefine/wiki/Recipes
How to get started?
• This PowerPoint and accompanying CSV
• Open Refine videos and tutorials and recipes
• see http://openrefine.org
• Join Google+ Open Refine Community
• Read about Google Fusion Tables
• Please teach others about these power tools
• Pay-it-forward
• Get data that is “fit-for-research-use”
• and have fun with your data
Thanks for trying out this new tool!
This work would not be possible without prior input, assistance, contributions, and kindness from
many including: Nelson Rios, Gaurav Vaidya, Guillaume Jimenez, Liath Appleton, Katja Seltmann,
Laura Russell, Greg Riccardi, Stephen Flager, and Kevin Love. To each of you who contributed to
demystifying data cleaning and enhancement, who provided new how-to’s, and stepped in with new
explanations and examples, thank you!
Deborah Paul, iDigBio
iDigBio is funded by a grant from the National Science Foundation's Advancing
Digitization of Biodiversity Collections Program (#EF1115210). Views and
opinions expressed are those of the author, not necessarily those of the NSF.
Have fun with the data, no matter where you find it!