Transcript: PowerPoint - CUAHSI-HIS

Automated Anomaly Detection, Data Validation and
Correction for Environmental Sensors using Statistical
Machine Learning Techniques
Touraj Farahmand - Aquatic Informatics Inc.
Kevin Swersky - Aquatic Informatics Inc.
Nando de Freitas - Department of Computer Science (Machine Learning), University of British Columbia (UBC)
www.aquaticinformatics.com | 1
Automated data validation and QA/QC is becoming increasingly
important
Growing number of real-time monitoring sites with huge amounts of high-sampling-rate data
Ensuring quality-controlled, clean real-time data is continuously available for:
 Publishing services
 Online data mining and analysis tools
 Online warning and alert systems, minimizing false-positive alerts
 Mission critical modeling systems such as flood forecasting
and event detection
www.aquaticinformatics.com | 2
[Diagram: signal path from the natural environment to the data management system. The real parameter from the natural environment is measured by a data logger; the sensor signal before comm. transmission (the logger signal) can already contain sensor outliers, calibration errors, fouling errors, and sensor drift. The observed telemetry signal, after comm. reception and decoding over the comm. link, additionally shows comm. outliers and comm. gaps alongside real abnormal events. Site visits contribute field measurements and logger data files to the data management system.]
www.aquaticinformatics.com | 3
www.aquaticinformatics.com | 4
Environmental time series in general
are complex and hard to model
Problems:
Highly non-stationary
Highly non-linear
Many changes in dynamics
Can contain outliers, anomalies, gaps, etc.
Our models need to be:
General
Flexible
Robust
Interpretable
Fast and efficient for real-time application
Easy to setup and use
Able to provide the uncertainty of the results
www.aquaticinformatics.com | 5
Data Modeling Approaches
The (traditional) frequentist approach
Examples:
• Linear regression
• Hypothesis testing
• Confidence intervals
In the frequentist paradigm, probability is defined in terms of the frequencies of random
repeatable events
Here, we create a model with parameters Θ and fit the model to data X. This yields a
probability distribution P(X|Θ), the likelihood of the data given the parameters
We can create very flexible models by adding more parameters
With enough parameters we can fit almost anything!
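A minimal sketch of this recipe (illustrative, not from the slides): with Gaussian noise, fitting the parameters Θ by maximizing the likelihood P(X|Θ) reduces to ordinary least squares. The synthetic trend below is an assumption standing in for a sensor signal.

```python
import numpy as np

# Synthetic data: a noisy linear trend standing in for a sensor signal.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)
x = 2.0 * t + 1.0 + rng.normal(scale=0.5, size=t.size)

# Maximum-likelihood fit: under Gaussian noise, maximizing P(X | theta)
# over theta = (slope, intercept) is equivalent to least squares.
A = np.column_stack([t, np.ones_like(t)])
theta, *_ = np.linalg.lstsq(A, x, rcond=None)
print("ML estimate of (slope, intercept):", theta)
```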
www.aquaticinformatics.com | 6
Data Modeling Approaches
“With enough parameters we can fit
almost anything!”
This sounds nice, but adding too many
parameters means we will overfit
Overfitting means we can get very low error
on the training data, yet the model will be useless
in practice
But a model that is too simple will also do a
poor job
We need some sort of tradeoff between
model complexity and model generalization
This is difficult and tedious with frequentist
methods
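To make the tradeoff concrete, a small illustrative sketch (the data and polynomial degrees are assumptions, not from the slides): increasing model complexity keeps lowering the training error, while the error on held-out data eventually grows again.

```python
import numpy as np

# Two noisy draws of the same underlying signal: one to fit, one to test.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
signal = np.sin(2.0 * np.pi * t)
train = signal + rng.normal(scale=0.2, size=t.size)
test = signal + rng.normal(scale=0.2, size=t.size)

rmse = lambda pred, target: np.sqrt(np.mean((pred - target) ** 2))
for degree in (1, 3, 9):
    coeffs = np.polyfit(t, train, degree)   # fit on training data only
    pred = np.polyval(coeffs, t)
    print("degree %2d  train RMSE %.3f  test RMSE %.3f"
          % (degree, rmse(pred, train), rmse(pred, test)))
```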
www.aquaticinformatics.com | 7
Data Modeling Approaches
Bayesian methods solve these issues
In the Bayesian paradigm, probability quantifies uncertainty and
enables precise revision of that uncertainty in light of new observations
Highly flexible, very general, interpretable and easy to work with
Automatically finds an appropriate model complexity
Bonus: naturally incorporates uncertainty and prior knowledge about the
problem
Some applications of statistical machine learning:
 Financial prediction
 Fraud detection (e.g. credit cards)
 Spam detection
 Search and recommendation (e.g. Google, Amazon)
 Automatic speech recognition & speaker verification
 Face location and identification
 Troubleshooting and fault detection/correction
 Printed and handwritten text parsing
 Much more…
www.aquaticinformatics.com | 8
Data Modeling Approaches
The Bayesian approach
Rather than assuming there is one true Θ that generates our data, we assume
there is a distribution over possible Θ's
Our goal is now to find P(Θ|X), and we use Bayes' rule: P(Θ|X) = P(X|Θ) P(Θ) / P(X)
P(Θ) is called the prior; it is used to express prior knowledge
Although simple, this idea provides a powerful modeling framework, and naturally
guards against overfitting
We can now use infinitely many parameters! P(Θ|X) will only be high when Θ
appropriately models the data
This gives us very flexible and very powerful models
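A minimal sketch of this update, using a conjugate Beta-Bernoulli model chosen purely for illustration (the slides do not specify a model): the prior P(Θ) over a sensor's fault rate is revised into the posterior P(Θ|X) as each observation arrives.

```python
# Beta-Bernoulli update (illustrative): theta is a sensor's fault rate.
a, b = 1.0, 1.0                     # Beta(1, 1) = uniform prior P(theta)
observations = [0, 0, 1, 0, 0, 0]   # 1 = fault seen on that check (made up)

for x in observations:
    # Conjugacy: the posterior after each Bernoulli observation is again
    # a Beta, with fault/no-fault counts added to the prior pseudo-counts.
    a += x
    b += 1 - x

mean = a / (a + b)                             # posterior mean of theta
var = a * b / ((a + b) ** 2 * (a + b + 1.0))   # posterior variance
print("posterior Beta(%g, %g): mean %.3f, sd %.3f" % (a, b, mean, var ** 0.5))
```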
www.aquaticinformatics.com | 9
R&D Status and Results
A generic Bayesian inference framework has been developed and compiled into
the AQUARIUS scripting toolbox for alpha tests
A fast and efficient (real-time) linear and piecewise (switching) linear dynamical
machine learning model has been developed and compiled into the AQUARIUS scripting
toolbox (a minimal sketch follows the list below):
 Sensor fault/anomaly detection, e.g. outlier, stuck sensor, offset, …
 Data correction and estimation, e.g. gap filling
 Short-term prediction
 Smoothing
 Minimal user interaction, since it learns all parameters from data
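The slides do not show the model internals; as a hedged illustration, a linear dynamical model of this kind can be sketched as a scalar Kalman filter that predicts each sample from the dynamics, fills gaps with the prediction, and flags points falling far outside the predictive distribution. All parameter values below are assumptions.

```python
import math

def kalman_step(mu, var, y, a=1.0, q=0.01, r=0.1):
    """One step of a scalar Kalman filter (a, q, r are assumed values).

    mu, var -- posterior mean/variance of the latent level so far
    y       -- new observation, or None for a gap
    """
    # Predict the next latent state from the linear dynamics.
    mu_pred, var_pred = a * mu, a * a * var + q
    if y is None:                        # gap: fill with the prediction
        return mu_pred, var_pred, mu_pred, False
    innovation = y - mu_pred
    s = var_pred + r                     # predictive variance of y
    if abs(innovation) > 3.0 * math.sqrt(s):
        # Outlier: flag it and do not let it corrupt the state estimate.
        return mu_pred, var_pred, mu_pred, True
    k = var_pred / s                     # Kalman gain
    return mu_pred + k * innovation, (1.0 - k) * var_pred, mu_pred, False

mu, var = 0.0, 1.0
for y in [0.1, 0.2, None, 0.25, 5.0, 0.3]:   # None = gap, 5.0 = spike
    mu, var, pred, flagged = kalman_step(mu, var, y)
    print("obs=%s  predicted=%.2f  flagged=%s" % (y, pred, flagged))
```

In practice the coefficients a, q, r would be learned from the data, as the slide notes, rather than fixed by hand.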
Nonlinear dynamical machine learning models are under research:
 They are more accurate for modeling highly chaotic signals
 The big challenge is the computational complexity and speed of training and
inference
A framework for suggested corrections/flagging and an audit trail has already been
added to the Data Correction toolbox for automated processes
No UI or front end is available for modeling yet; it is coming soon…
We have started a pilot project with one of our clients
www.aquaticinformatics.com | 10
AQUARIUS Whiteboard for training/testing models
We can run this on the server
as part of the data pre-processing
workflow
www.aquaticinformatics.com | 11
Univariate Model Results: Gap Filling/Prediction
www.aquaticinformatics.com | 12
Multivariate Model Results: Gap Filling/Prediction
www.aquaticinformatics.com | 13
Multivariate Model Results: Gap Filling/Prediction
www.aquaticinformatics.com | 14
Univariate Model Results: Sensor fault detection
[Plot: detected sensor faults marked with flags]
www.aquaticinformatics.com | 15
Univariate Model Results: Anomaly detection
www.aquaticinformatics.com | 16
Univariate Model Results: Spike detection
www.aquaticinformatics.com | 17
Univariate Model Results: Offset detection
[Plot: detected offsets marked with flags]
www.aquaticinformatics.com | 18
Summary
Automated anomaly detection, data validation and QA/QC is becoming
increasingly important
Bayesian techniques and probabilistic models give us a very flexible and
powerful framework for modeling sequential data and time series
They naturally incorporate uncertainty and prior knowledge, which other
techniques do not support
They naturally guard against overfitting, a serious problem for
traditional methods
They provide the distribution of the model parameters given the observations
In most use cases they learn the required parameters from data and
metadata with minimal user interaction
They can be used for:
 Anomaly detection
 Data correction (estimation)
 Prediction
 Smoothing
 Sensor fault detection and diagnosis
 Uncertainty propagation for derived data
www.aquaticinformatics.com | 19
Questions?
www.aquaticinformatics.com | 20