Transcript Slide 1
Dynamic, Rule-based Quality Control
Framework for Real-time Sensor Data
Wade Sheldon
Georgia Coastal Ecosystems LTER
University of Georgia
Introduction
Quality control of high-volume, real-time data from automated sensors is an emerging challenge
Traditional techniques (plotting, stats) often don’t scale well
Data validation and Q/C can be a limiting factor in getting data “online”
Difficulties lead to release delays or posting of provisional data
Software developed at Georgia Coastal Ecosystems LTER has proven useful for Q/C of real-time data
Designed to automate GCE data processing and metadata generation, but very generalized and supports any tabular data
Provides a dynamic, rule-based Q/C framework for data processing, analysis and synthesis
Framework Components
Comprehensive data model
Implemented as hierarchical MATLAB ‘structure’ arrays
Package dataset & attribute metadata, data, Q/C rules, qualifier flags
Metadata-based MATLAB software (GCE Data Toolbox)
Automatic (rule-based) and manual assignment of Q/C qualifier flags
Transparent management of flags throughout all data manipulation
Q/C-aware data management and analysis tools
Q/C-aware data integration and synthesis tools
Modular implementation supports many scenarios
Interactive (command-line API and GUI forms)
Automated workflows (timed or triggered)
End-to-end (logger-to-scientist) or part of larger workflow
Runs natively on multiple platforms (PC, *nix, MacOS)
GCE Data Toolbox Data Model
Quality Control Rules
Basic syntax: [logical expression]='[flag code]'
Logical Expressions:
Any conditional statement or call to a MATLAB function that returns a logical array (0 = false, 1 = true)
Dataset columns referenced in statements as:
“x” – alias for current column (e.g. x<0)
“col_[name]” – any dataset column by name (e.g. “col_Depth<0”)
Flag Codes:
Single character (alphanumeric or symbol) to assign when expression is true (e.g. I, q, 9, *)
Codes defined in the dataset metadata (I = invalid value, …)
Unlimited rules per attribute, multiple flags per value
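The rule syntax above can be illustrated with a small parser. This is a Python sketch, not the toolbox's MATLAB code: `parse_rules` is a hypothetical helper, and it assumes rules are separated by semicolons, flag codes are single quoted characters, and expressions contain no semicolons of their own.

```python
import re

# One rule looks like  <expression>='<flag>'  where the flag is one character.
# Assumption: the greedy expression match leaves the final ='<char>' as the flag.
RULE_PATTERN = re.compile(r"^(?P<expr>.+)=\s*'(?P<flag>.)'$")


def parse_rules(criteria):
    """Split a criteria string into (expression, flag_code) pairs."""
    rules = []
    for part in criteria.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        match = RULE_PATTERN.match(part)
        if match is None:
            raise ValueError(f"not a valid rule: {part!r}")
        rules.append((match.group("expr").strip(), match.group("flag")))
    return rules
```

For example, `parse_rules("x<0='I';x>100='I'")` returns `[("x<0", "I"), ("x>100", "I")]`. Comparison operators such as `<=` inside the expression are left alone because only the final `='c'` is treated as the flag assignment.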
Quality Control Rule Examples
Numeric Comparisons:
Simple:
x<0='I' (flags negative values)
x<0='I';x>100='I';x<20='Q';x>80='Q' (overlapping bounds checks)
Statistical:
x>(mean(x)+3*std(x))='Q';x<(mean(x)-3*std(x))='Q' (flags values more than 3 standard deviations from column mean)
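The statistical rule can be sketched in Python as follows. This is an illustrative analogue rather than toolbox code: `flag_outliers` is a hypothetical name, and an empty string stands in for "no flag".

```python
from statistics import mean, stdev


def flag_outliers(values, n_sigma=3.0, flag="Q"):
    """Flag values more than n_sigma standard deviations from the column mean.

    Requires at least two values, because the sample standard deviation
    is undefined for a single point.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (divides by n - 1)
    upper = center + n_sigma * spread
    lower = center - n_sigma * spread
    return [flag if (v > upper or v < lower) else "" for v in values]
```

Because the mean and standard deviation are computed from the whole column, an extreme value inflates both, so a rule like this is most useful on reasonably long series.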
Multi-column:
col_DOC>col_TOC='I' (in column DOC; flags DOC exceeding TOC)
col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.90='I'
(flags dry weights below 90% of (wet weight - ash weight))
col_Depth<0='I' (in column Salinity; flags Salinity when Depth < 0)
Compound (Boolean operators):
col_RH_Percent>100&col_Precip<=0.1='Q' (flags humidity > 100% except during significant precipitation events)
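The multi-column and compound rules above evaluate a condition row by row across several columns. A minimal Python sketch of the humidity rule, assuming rows are represented as dictionaries (`flag_column` and the sample readings are hypothetical, for illustration only):

```python
def flag_column(rows, condition, flag):
    """Return one flag string per row: `flag` where condition(row) is true."""
    return [flag if condition(row) else "" for row in rows]


# Hypothetical sample readings; column names follow the slide's example.
rows = [
    {"RH_Percent": 98.0, "Precip": 0.0},
    {"RH_Percent": 101.5, "Precip": 0.0},
    {"RH_Percent": 102.0, "Precip": 2.4},
]

# Analogue of: col_RH_Percent>100&col_Precip<=0.1='Q'
humidity_flags = flag_column(
    rows,
    lambda r: r["RH_Percent"] > 100 and r["Precip"] <= 0.1,
    "Q",
)
```

Only the second row is flagged: the third also exceeds 100% humidity, but its precipitation is above the 0.1 threshold.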
Quality Control Rule Examples (cont.)
Text Comparisons:
“IS”, “NOT” for string literals, “IN”, “NOT IN” for lists
flag_notinlist(x,'Spartina,Juncus,Zizaniopsis')='Q'
Algorithmic Criteria (custom functions):
fn(columns,parameters)='Q'
Various included Q/C functions
pattern checks, geographic checks, specialized algorithms (O2 saturation, etc)
User-defined functions:
Any MATLAB code or “wrapped” calls to FORTRAN, Java, Python, etc
Unlimited scope
Full suite of MATLAB numeric analysis capabilities supported, and
extensible to use other technology
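A function-based criterion only needs to return one true/false value per record. The Python sketch below mimics the list check shown above. It is not the toolbox's `flag_notinlist` implementation: the names `flag_not_in_list` and `apply_rule` are hypothetical, case-sensitive matching is an assumption, and "Typha" is made-up sample data.

```python
def flag_not_in_list(values, allowed_csv):
    """Return True for each value that is NOT in the comma-separated list."""
    allowed = {item.strip() for item in allowed_csv.split(",")}
    return [value.strip() not in allowed for value in values]


def apply_rule(values, test, flag):
    """Turn a logical array into flag codes ('' means unflagged)."""
    return [flag if hit else "" for hit in test(values)]


species = ["Spartina", "Juncus", "Typha", "Zizaniopsis"]
species_flags = apply_rule(
    species,
    lambda x: flag_not_in_list(x, "Spartina,Juncus,Zizaniopsis"),
    "Q",
)
```

Any other check with the same shape (values in, logical array out) could be plugged into `apply_rule` in the same way.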
Q/C Rule Management
Rules can be defined in metadata “templates” and applied automatically to attributes when raw data are imported
Rules can also be created and managed using a GUI form
Q/C Flag Assignment
Q/C criteria evaluated to assign/clear flags when:
Metadata template applied or Q/C criteria edited
New data records, columns added
Values edited (GUI) or columns updated (CLI)
Evaluation function (dataflag) invoked directly
Flags can also be assigned/cleared manually by:
Clicking/dragging on plots with the mouse
Using a spreadsheet-like grid
Importing from text attributes (e.g. 3rd party codes)
Propagating flags from source column(s) to dependent column(s)
Manual assignment locks flags by inserting a “manual” token in the criteria; removing “manual” restores automatic evaluation
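The locking behaviour can be sketched as follows. This is a Python illustration under stated assumptions, not the toolbox's implementation: it assumes the "manual" token is stored as its own semicolon-separated entry in the criteria string, and all function names are hypothetical.

```python
MANUAL_TOKEN = "manual"


def is_locked(criteria):
    """True when the criteria string carries the manual token."""
    return MANUAL_TOKEN in [part.strip() for part in criteria.split(";")]


def lock(criteria):
    """Append the manual token (once) so automatic evaluation is skipped."""
    if is_locked(criteria):
        return criteria
    return criteria + ";" + MANUAL_TOKEN if criteria else MANUAL_TOKEN


def unlock(criteria):
    """Remove the manual token, restoring automatic evaluation."""
    parts = [part.strip() for part in criteria.split(";")]
    return ";".join(p for p in parts if p and p != MANUAL_TOKEN)


def refresh_flags(values, criteria, current_flags, evaluate):
    """Keep manually assigned flags when locked; otherwise re-evaluate."""
    if is_locked(criteria):
        return list(current_flags)
    return evaluate(values)
```

With this arrangement the rules themselves stay in the criteria while locked, so removing the token is enough to bring automatic flagging back.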
Q/C-Aware Data Management & Analysis
Q/C flags can be visualized in data editor grid and plots
Flagged values can be selectively removed from data sets
Statistics can be generated with/without flagged values
Flags can be instantiated as coded text columns for export
Flagged, missing values can be summarized by parameter and date for
metadata
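As an illustration of Q/C-aware statistics, the Python sketch below computes a column mean with or without flagged values. `column_mean` is a hypothetical helper and the readings are made-up sample data; an empty string means "no flag".

```python
from statistics import mean


def column_mean(values, flags, exclude_flagged=True):
    """Mean of a column, optionally skipping values that carry any flag."""
    kept = [
        value
        for value, flag in zip(values, flags)
        if not (exclude_flagged and flag)
    ]
    return mean(kept) if kept else None


readings = [10.0, 12.0, -999.0, 11.0]
reading_flags = ["", "", "I", ""]

clean_mean = column_mean(readings, reading_flags)                       # 11.0
raw_mean = column_mean(readings, reading_flags, exclude_flagged=False)  # -241.5
```

The single invalid value dominates the unfiltered mean, which is why reporting statistics both ways is useful when reviewing flag assignments.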
Q/C-Aware Data Synthesis
Flagged, missing values summarized in re-sampled data (aggregated, binned, date-time resampled), with automatic Q/C rule creation
Flags automatically “locked” when merging multiple data sets (i.e. unions)
All Q/C operations logged to processing history, reported in metadata to document lineage
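The summarizing of flagged and missing values during resampling can be sketched in Python as below. This is only an illustration: `resample_daily` is a hypothetical function, records are assumed to be (date, value, flag) tuples with `None` for missing values, and excluding flagged values from the daily mean is an assumption rather than documented toolbox behaviour.

```python
from collections import defaultdict


def resample_daily(records):
    """Aggregate (date_string, value_or_None, flag_string) records by day."""
    bins = defaultdict(list)
    for day, value, flag in records:
        bins[day].append((value, flag))

    summary = {}
    for day, items in sorted(bins.items()):
        # Only present, unflagged values contribute to the aggregate.
        good = [v for v, f in items if v is not None and not f]
        summary[day] = {
            "mean": sum(good) / len(good) if good else None,
            "n_missing": sum(1 for v, _ in items if v is None),
            "n_flagged": sum(1 for v, f in items if v is not None and f),
        }
    return summary
```

Carrying the missing and flagged counts alongside each aggregate lets a later rule judge the resampled value, for example by flagging days where too many inputs were rejected.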
Implementation Scenarios
End-to-End (logger-to-scientist)
Acquire raw data from logger or file system (standard or custom import filters)
Assign metadata from template or using forms to validate and flag data
Review data and fine-tune flag assignments
Generate distribution files & plots, archive data, index for searching
Desktop data management solution
Data Pre-processing
Acquire, validate and flag raw data (on demand or timed/triggered)
Upload processed data files (e.g. csv) or value & flag arrays to RDBMS
Workflow Step
Call toolbox functions as part of another workflow process, custom program
Kepler MATLAB actor?
Suitability for Real-Time Sensor Data
Good Scalability
Data volumes only limited by computer memory (tested >2 GB data sets)
Multiple instances can be run on high-end, 64-bit, clustered workstations
Good flag evaluation performance in use, testing with diverse rule sets
Good scope for automation
Timed and triggered workflow implementations easy to deploy
Support for multiple I/O formats, transport protocols
Formats: ASCII, MATLAB, SQL, XML (partially implemented)
Transport: local file system, UNC paths, HTTP, FTP, SOAP
Already used for real-time GCE data, USGS data harvesting service (LTER
HydroDB, CWT)
Concluding Remarks
Benefits
Flexible, modular design
No qualifier vocabulary or semantics assumed – supports many purposes, standards
Many operations on flagged values – supports different strategies for archiving
and distributing data at different processing levels
Limitations
Requires MATLAB
Rule syntax environment-specific – a more open standard would be ideal
Support for XML metadata immature (but more development planned)
More information and downloads at:
http://gce-lter.marsci.uga.edu/public/im/tools/data_toolbox.htm
This work was supported by the National Science Foundation under grant numbers
OCE-9982133 and OCE-0620959