Transcript Document
GCE Data Toolbox -- metadata-based tools for automated data processing and analysis
Wade Sheldon
University of Georgia
GCE-LTER

Rationale
- Data processing, quality control, data analysis and metadata generation traditionally carried out as separate activities, often in different time frames using different technologies
- Problems:
  - Metadata may not reflect all processing steps
  - Much routine data analysis done w/o Q/C, metadata
  - No economy of scale – leads to “one-off” solutions
- Metadata generation should ideally occur throughout the data cycle and “inform” data analysis

Design Goals
- Develop Integrated Storage Standard
  - Tabular Data
  - QA/QC Information
  - Metadata (overall data set & columns/attributes)
- Develop Software to Support Standard
  - Code Library/API
  - User Interfaces
- Apply Technology to Acquire, Manage, Distribute GCE-LTER Data
- Explore Use as Prototype Technology for Metadata-based Data Processing, Synthesis

Storage Standard
- Developed Using MATLAB®
  - Local expertise, large scientific user base
  - Cross-platform (Win32, Solaris, *nix, Mac OS X)
  - Rapid development environment
  - Supports multiple interfaces (interactive command line, batch-mode scripts, GUI, WWW)
  - Good interoperability with other technologies (Java, Perl, SQL)
- Defined “GCE Data Structure” Spec. (based on MATLAB/C structures)
  - Structure with 17 named fields
  - Specific content rules for each field (software validation)
  - Combines data, metadata, QA/QC, processing history

Storage Standard
GCE Data Structure Specification (v1.1)
Field categories: Structure Info, Metadata, Data Table, QA/QC Info

  Field          Description
  -------------  ------------------------------------------------
  title          Title of the Overall Data Set
  version        List of Toolbox Versions Used
  datafile       List of Data Files Processed
  createdate     Date of Creation
  editdate       Date of Last Edit
  history        Processing History
  metadata       General Metadata (parseable array)
  name           Column Names
  description    Column Descriptions
  units          Column Units
  datatype       Physical Data Types (Storage types)
  variabletype   Logical Data Types (Variable types)
  numbertype     Numerical Types
  precision      Decimal Places to Display
  values         Table of Data Values (numerical, text)
  criteria       QA/QC Criteria
  flags          QA/QC Flags Assigned

Software – GCE Data Toolbox
- Core Function Library
  - Create, Validate Structures
  - Import Data, Metadata (ASCII, MATLAB, SQL)
  - Manipulate Data, Metadata (unit conversions, add/delete/update)
  - Export Data, Metadata (various formats)
  - Dynamic, Rule-based QA/QC Flagging
- Self-documenting
  - Processing Operation Logging (Processing History)
  - Transparent Metadata Creation/Updating
  - Dynamic (JIT) Metadata Generation for Columns
- Support for Metadata “Templating”
  - Application of Boilerplate Metadata based on Parameter Matching
  - Supports Rapid Documentation of Routine Data Sources

Software – GCE Data Toolbox
- Support for Analysis
  - Descriptive Statistics, Reports
  - Visualization, Mapping
- Support for Synthesis
  - Composite Data Set Creation
    - Multiple Data Set Merge/Concatenation
    - Relational Join
    - Metadata Content Meshing
  - Data Set Summarization
    - Statistical Data Reduction/Re-sampling
  - Data Set Standardization
    - Unit Conversions (automatic, interactive)
    - Template-based Semantic Mapping
    - Automatic Semantic Mediation (prototype stage)

Software – User Interfaces
- Unattended Batch Mode Processing
- Interactive Command Line Processing (conventional MATLAB UI)
- GUI Applications
- Full help text for each function
- Well-defined input/output arguments
- Standard Forms, Dialogs, Controls
- No MATLAB Experience Required
- WWW – MATLAB Web Server
- HTML Forms, Querystring Input
- HTML Pages and/or Static File Output

[Figure panels: Command-Line Interface, GUI Applications, WWW Interface]

Current Applications
- Automated Data Processing
  - Direct data import from data logger files, WWW data sources (USGS), SQL queries
  - Automatic metadata creation (templates, data mining)
  - Rule-based QA/QC flagging
- Data Set Packaging
  - Batch processing to create/update data, metadata products
  - On-demand generation of data, metadata, stat reports in custom formats (end-user scripts, GUI applications, WWW forms)

Current Applications
- Data Exploration/Analysis by PIs
  - Descriptive Statistics based on attribute metadata
  - Visualization with Interactive Filtering (Frequency Histograms, 2D Plots, Map Plots)
- Data Reduction/Re-sampling to Provide Customized Data at Various “Scales”
  - Aggregated Statistics
  - Binned Statistics
  - Query/Filtering (sub-selection)

Current Applications
- Data Harvesting (GCE)
  - USGS Data (WWW real-time, daily, finalized data)
  - Campbell Scientific Data Arrays (post-processing triggered after LoggerNet Retrieval)
  - Sea-Bird Hydrographic Data
- USGS Data Harvesting Service for HydroDB
  - Weekly harvest for 31 stations/7 LTER Sites
  - Automatic Resampling, Unit Conversions, Q/C

Availability
- Description, Screen-shots, Fully-functional Toolbox Available on WWW: http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
- Requires MATLAB 5.3, 6.0, 6.5 (any platform)
- “Public” Version Compiled
- Source Code Requests Considered on Case-by-Case Basis

Future Development Plans
- EML 2.0 Support
- Metadata-mediated Data Set Integration
  - Unit conversions
  - Re-sampling
- More WWW Interface Development
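
Illustrative sketch (not part of the original slides)
The central idea of the storage standard described above is one structure that carries the data table, column metadata, QA/QC flags and a processing history, so that every operation documents itself. The Python sketch below illustrates that idea only. It is not the toolbox's MATLAB code or API: the class names, method names, flag codes ("I", "Q") and example values are all hypothetical, and only a few of the 17 fields are modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class Column:
    """One column of the data table plus its attribute metadata (hypothetical)."""
    name: str
    units: str
    values: List[float]
    flags: List[str] = field(default_factory=list)  # one flag string per value

    def __post_init__(self) -> None:
        if not self.flags:
            self.flags = [""] * len(self.values)  # "" means no flag assigned


@dataclass
class DataSet:
    """Data, metadata, QA/QC flags and processing history in one structure."""
    title: str
    columns: Dict[str, Column] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        # Every operation appends a timestamped entry to the history.
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        self.history.append(f"{stamp}: {message}")

    def add_column(self, column: Column) -> None:
        # Simple content rule: all columns must have the same number of rows.
        for existing in self.columns.values():
            if len(existing.values) != len(column.values):
                raise ValueError("column length does not match existing table")
        self.columns[column.name] = column
        self.log(f"added column '{column.name}' ({column.units})")

    def apply_rule(self, name: str, flag: str,
                   rule: Callable[[float], bool], description: str) -> int:
        # Rule-based flagging: assign `flag` wherever `rule` is true.
        col = self.columns[name]
        hits = 0
        for i, value in enumerate(col.values):
            if rule(value) and flag not in col.flags[i]:
                col.flags[i] += flag
                hits += 1
        self.log(f"flagged {hits} value(s) in '{name}' as '{flag}' where {description}")
        return hits

    def convert_units(self, name: str, new_units: str,
                      factor: float, offset: float = 0.0) -> None:
        # Linear unit conversion that updates values and metadata together.
        col = self.columns[name]
        old_units = col.units
        col.values = [v * factor + offset for v in col.values]
        col.units = new_units
        self.log(f"converted '{name}' from {old_units} to {new_units}")


# Example with made-up values and made-up QA/QC limits.
ds = DataSet(title="Example hydrographic record")
ds.add_column(Column("salinity", "PSU", [31.2, -1.0, 35.8, 52.0]))
ds.add_column(Column("depth", "ft", [3.0, 10.0, 12.5, 8.0]))
ds.apply_rule("salinity", "I", lambda v: v < 0, "value < 0")
ds.apply_rule("salinity", "Q", lambda v: v > 40, "value > 40")
ds.convert_units("depth", "m", factor=0.3048)

for entry in ds.history:
    print(entry)
```

Because flagging and unit conversion both go through methods that write to the history, the printed log reflects every processing step without any separate documentation effort, which is the behavior the "Self-documenting" bullets describe.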