Transcript Slide 1
Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data
Wade Sheldon
Georgia Coastal Ecosystems LTER
University of Georgia

Introduction
- Quality Control of high volume, real-time data from automated sensors is an emerging challenge
  - Traditional techniques (plotting, stats) often don’t scale well
  - Data validation and Q/C can be limiting factor in getting data “online”
  - Difficulties lead to release delays or posting provisional data
- Software developed at Georgia Coastal Ecosystems LTER has proven useful for Q/C of real-time data
  - Designed to automate GCE data processing and metadata generation, but very generalized and supports any tabular data
  - Provides dynamic, rule-based Q/C framework for data processing, analysis and synthesis

Framework Components
- Comprehensive data model
  - Implemented as hierarchical MATLAB ‘structure’ arrays
  - Package dataset & attribute metadata, data, Q/C rules, qualifier flags
- Metadata-based MATLAB software (GCE Data Toolbox)
  - Automatic (rule-based) and manual assignment of Q/C qualifier flags
  - Transparent management of flags throughout all data manipulation
  - Q/C-aware data management and analysis tools
  - Q/C-aware data integration and synthesis tools
- Modular implementation supports many scenarios
  - Interactive (command-line API and GUI forms)
  - Automated workflows (timed or triggered)
  - End-to-end (logger-to-scientist) or part of larger workflow
  - Runs natively on multiple platforms (PC, *nix, MacOS)

GCE Data Toolbox Data Model

Quality Control Rules
- Basic syntax: [logical expression]='[flag code]'
- Logical Expressions:
  - Any conditional statement or call to MATLAB function that returns logical array (0 = false, 1 = true)
  - Dataset columns referenced in statements as:
    - “x”: alias for current column (e.g. x<0)
    - “col_[name]”: any dataset column by name (e.g. col_Depth<0)
- Flag Codes:
  - Alphanumeric character to assign when expression true (I, q, 9, *)
  - Codes defined in the dataset metadata (I = invalid value, …)
- Unlimited rules per attribute, multiple flags per value

Quality Control Rule Examples
- Numeric Comparisons:
  - Simple:
      x<0='I' (flags negative values)
      x<0='I';x>100='I';x<20='Q';x>80='Q' (overlapping bounds checks)
  - Statistical:
      x>(mean(x)+3*std(x))='Q';x<(mean(x)-3*std(x))='Q' (flags values more than 3 standard deviations from column mean)
  - Multi-column:
      col_DOC>col_TOC='I' (in column DOC; flags DOC exceeding TOC)
      col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.90='I' (flags dry weights below 90% of wet weight minus ash weight)
      col_Depth<0='I' (in column Salinity; flags Salinity when Depth < 0)
  - Compound (Boolean operators):
      col_RH_Percent>100&col_Precip<=0.1='Q' (flags humidity > 100% except during significant precipitation events)

Quality Control Rule Examples (cont.)
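As an illustration of how rules of the form [logical expression]='[flag code]' behave, here is a minimal Python sketch. It is not the GCE Data Toolbox implementation, which is MATLAB. The function names parse_rules and evaluate_rules are hypothetical, and storing multiple flags per value as a concatenated string is an assumption made for the example. The sketch evaluates rules one row at a time, so it handles only simple, multi-column and compound comparisons, not column-wide statistical rules such as mean(x).

```python
import re

# A rule is "<expression>='<single flag character>'"; rules are separated by ";".
RULE_PATTERN = re.compile(r"^(?P<expr>.+)='(?P<code>.)'$")


def parse_rules(rule_text):
    """Split a rule string into (expression, flag code) pairs."""
    rules = []
    for part in rule_text.split(";"):
        part = part.strip()
        if not part:
            continue
        match = RULE_PATTERN.match(part)
        if match is None:
            raise ValueError(f"not a valid rule: {part!r}")
        rules.append((match.group("expr").strip(), match.group("code")))
    return rules


def evaluate_rules(rule_text, columns, current):
    """Return one flag string per row for the column named `current`.

    `columns` maps column names to equal-length lists of values.
    """
    rules = parse_rules(rule_text)
    flags = []
    for i in range(len(columns[current])):
        # "x" aliases the current column; "col_<name>" refers to any column.
        names = {f"col_{name}": values[i] for name, values in columns.items()}
        names["x"] = columns[current][i]
        row_flags = ""
        for expr, code in rules:
            # MATLAB-style & and | become Python's lower-precedence and/or.
            py_expr = expr.replace("&", " and ").replace("|", " or ")
            hit = eval(py_expr, {"__builtins__": {}}, names)
            if hit and code not in row_flags:
                row_flags += code  # several rules may flag the same value
        flags.append(row_flags)
    return flags


bounds = "x<0='I';x>100='I';x<20='Q';x>80='Q'"
print(evaluate_rules(bounds, {"Temp": [-5, 10, 50, 90, 120]}, "Temp"))
# ['IQ', 'Q', '', 'Q', 'IQ']

cross = "col_Depth<0='I'"
data = {"Salinity": [30, 31, 29], "Depth": [1.2, -0.3, 0.8]}
print(evaluate_rules(cross, data, "Salinity"))
# ['', 'I', '']
```

The overlapping bounds checks show why multiple flags per value matter: -5 and 120 receive both I and Q, because each rule is evaluated independently of the others.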
- Text Comparisons:
  - “IS”, “NOT” for string literals; “IN”, “NOT IN” for lists
      flag_notinlist(x,'Spartina,Juncus,Zizaniopsis')='Q'
- Algorithmic Criteria (custom functions):
      fn(columns,parameters)='Q'
  - Various included Q/C functions: pattern checks, geographic checks, specialized algorithms (O2 saturation, etc.)
  - User-defined functions: any MATLAB code or “wrapped” calls to FORTRAN, Java, Python, etc.
  - Unlimited scope
- Full suite of MATLAB numeric analysis capabilities supported, and extensible to use other technology

Q/C Rule Management
- Rule definitions can be stored in metadata “templates” and applied automatically to attributes when raw data are imported
- Rules can also be created and managed using a GUI form

Q/C Flag Assignment
- Q/C criteria are evaluated to assign/clear flags when:
  - Metadata template applied or Q/C criteria edited
  - New data records or columns added
  - Values edited (GUI) or columns updated (CLI)
  - Evaluation function (dataflag) invoked directly
- Flags can also be assigned/cleared manually by:
  - Clicking/dragging on plots with the mouse
  - Using a spreadsheet-like grid
  - Importing from text attributes (e.g. 3rd party codes)
  - Propagating flags from source column(s) to dependent column(s)
- Manual assignment locks flags by inserting a “manual” token in the criteria; removing “manual” restores automatic evaluation

Q/C-Aware Data Management & Analysis
- Q/C flags can be visualized in the data editor grid and in plots
- Flagged values can be selectively removed from data sets
- Statistics can be generated with or without flagged values
- Flags can be instantiated as coded text columns for export
- Flagged and missing values can be summarized by parameter and date for metadata

Q/C-Aware Data Synthesis
- Flagged and missing values are summarized in re-sampled data (aggregated, binned, date-time resampled), with automatic Q/C rule creation
- Flags are automatically “locked” when merging multiple data sets (i.e. unions)
- All Q/C operations are logged to the processing history and reported in metadata to document lineage

Implementation Scenarios
- End-to-End (logger-to-scientist)
  - Acquire raw data from logger or file system (standard or custom import filters)
  - Assign metadata from template or using forms to validate and flag data
  - Review data and fine-tune flag assignments
  - Generate distribution files & plots, archive data, index for searching
  - Desktop data management solution
- Data Pre-processing
  - Acquire, validate and flag raw data (on demand or timed/triggered)
  - Upload processed data files (e.g. csv) or value & flag arrays to RDBMS
- Workflow Step
  - Call toolbox functions as part of another workflow process or custom program
  - Kepler MATLAB actor?
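Two of the Q/C-aware operations above, statistics with or without flagged values and flags exported as a coded text column, can be sketched in a few lines. This is a hedged Python illustration only, not toolbox code: the helper names drop_flagged and to_csv, the Flag_ column prefix and the sample values are all hypothetical.

```python
import csv
import io
from statistics import mean


def drop_flagged(values, flags, codes="I"):
    """Keep only values whose flag string contains none of the given codes."""
    return [v for v, f in zip(values, flags) if not any(c in f for c in codes)]


def to_csv(name, values, flags):
    """Serialize a data column next to a companion coded-text flag column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([name, f"Flag_{name}"])  # hypothetical naming convention
    for value, flag in zip(values, flags):
        writer.writerow([value, flag])
    return buffer.getvalue()


salinity = [30.0, 32.0, -2.0, 28.0]
flags = ["", "Q", "I", ""]  # one flag string per value, e.g. from rule evaluation

print(mean(salinity))                       # statistics over all values
print(mean(drop_flagged(salinity, flags)))  # statistics excluding 'I' values
print(to_csv("Salinity", salinity, flags))
```

Keeping the flags in a parallel array rather than deleting flagged values is what lets the same data set be distributed at different processing levels: the consumer decides which codes to exclude.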
Suitability for Real-Time Sensor Data
- Good Scalability
  - Data volumes limited only by computer memory (tested with data sets > 2 GB)
  - Multiple instances can be run on high-end, 64-bit, clustered workstations
  - Good flag evaluation performance in use and in testing with diverse rule sets
- Good scope for automation
  - Timed and triggered workflow implementations are easy to deploy
  - Support for multiple I/O formats and transport protocols
    - Formats: ASCII, MATLAB, SQL, XML (partially implemented)
    - Transport: local file system, UNC paths, HTTP, FTP, SOAP
- Already used for real-time GCE data and for a USGS data harvesting service (LTER HydroDB, CWT)

Concluding Remarks
- Benefits
  - Flexible, modular design
  - No qualifier vocabulary or semantics assumed: supports many purposes and standards
  - Many operations on flagged values: supports different strategies for archiving and distributing data at different processing levels
- Limitations
  - Requires MATLAB
  - Rule syntax is environment-specific; a more open standard would be ideal
  - Support for XML metadata is immature (but more development is planned)

More information and downloads at: http://gce-lter.marsci.uga.edu/public/im/tools/data_toolbox.htm

This work was supported by the National Science Foundation under grant numbers OCE-9982133 and OCE-0620959