A Unified Approach To Detecting Spatial Outliers Presentation By Date: 03/06/2006


A Unified Approach To Detecting Spatial Outliers
Shashi Shekhar, Chang Tien Lu, Pusheng Zhang
GeoInformatica 7:2, pp. 139-166, 2003

Presentation by Antony Philip (Group 7)
http://webpages.charter.net/antonyp/csci8715.htm

Date: 03/06/2006

Definition

- A spatial outlier (S-outlier) is a spatially referenced object x whose non-spatial attribute values f(x) differ significantly from those of the other spatially referenced objects in its spatial neighborhood N(x).
- Spatial neighborhood N(x)
    - Domain specific
    - Example: in the traffic sensor data set, sensor A is a neighbor of sensor B if they are adjacent and traffic flows from A to B.
    - Possible representation: N x N sparse Boolean matrix
- Non-spatial attributes f(x)
    - Single-valued or multi-valued function
    - Example: in the traffic sensor data set, the sensor reading (say, the number of vehicles that crossed the sensor) is a non-spatial attribute.
- Global outliers are different: they are found by looking for inconsistency across the entire data set.
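The definition can be made concrete with a small sketch. It assumes a hypothetical road with four sensors in a row, treats the neighbor relation as symmetric for simplicity (the slide's traffic example is directional), and stores only the True entries of the N x N Boolean matrix. The sensor counts and the function name `neighborhood_difference` are made up for illustration.

```python
# Hypothetical example: four traffic sensors in a row, 0 - 1 - 2 - 3.
# The N x N Boolean neighborhood matrix is stored sparsely: only the True
# entries are kept, as a mapping from each location to its neighbors.
neighbors = {
    0: [1],
    1: [0, 2],
    2: [1, 3],
    3: [2],
}

# Non-spatial attribute f(x): e.g. number of vehicles counted at each sensor.
f = {0: 40, 1: 42, 2: 95, 3: 41}


def neighborhood_difference(x, f, neighbors):
    """Return f(x) minus the average of f over the neighborhood N(x)."""
    nbrs = neighbors[x]
    avg = sum(f[y] for y in nbrs) / len(nbrs)
    return f[x] - avg


for x in sorted(f):
    print(x, neighborhood_difference(x, f, neighbors))
```

A large absolute difference, as for sensor 2 here, is what makes a location a candidate spatial outlier; whether it counts as "significant" is decided by the outlier test.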

Motivation

- Example application domains
    - Transportation
    - Ecology
    - Public safety
    - Public health
    - Climatology
    - Location-based services
- Relevance to the Spatial Databases course
    - The data considered is spatial in nature.
    - The outlier test is based on location and its spatial relation to other data sources.

Problem Statement

- Inputs
    - Spatial framework consisting of N data locations
    - Neighborhood relations (N x N) between the data locations
    - M sets of clean training data for the N locations, used to build the model
    - Test data set in which outliers are to be identified
- Outputs
    - Outliers identified in the test data set
- Objectives
    - Minimize computation time
- Constraints
    - The data set is much larger than main memory

Key Concepts - I

- Spatial outlier detection algorithms
    - Spatial statistics
    - Scatterplot
    - Moran scatterplot
- Unified algorithm steps
    - I. Model building: build the distribution model (mean, standard deviation, parameters of the least-squares line) from the training data set.
    - II. Outlier test: compare the deviation of the test data from the model against a preset threshold to detect outliers.
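The two steps above can be sketched as follows. This is a minimal sketch of the spatial-statistics variant only (the scatterplot variants, which fit a least-squares line, are not shown). The function names, the five-location chain, the sample readings, and the choice of the population standard deviation are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

# Hypothetical neighborhood relation: five locations in a chain.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}


def s_values(f, neighbors):
    """S(x) = f(x) minus the average of f over N(x), for every location x."""
    return {
        x: f[x] - sum(f[y] for y in nbrs) / len(nbrs)
        for x, nbrs in sorted(neighbors.items())
    }


def build_model(training_sets, neighbors):
    """Step I: pool S(x) over all clean training sets, return (mu, sigma)."""
    pooled = []
    for f in training_sets:
        pooled.extend(s_values(f, neighbors).values())
    return mean(pooled), pstdev(pooled)


def outlier_test(f_test, neighbors, mu, sigma, theta=2.0):
    """Step II: flag locations where |S(x) - mu| / sigma exceeds theta."""
    return [
        x
        for x, s in s_values(f_test, neighbors).items()
        if abs((s - mu) / sigma) > theta
    ]


# Made-up clean training data (M = 2 sets) and one test set.
training_sets = [[10, 11, 10, 12, 11], [11, 10, 12, 11, 10]]
mu, sigma = build_model(training_sets, neighbors)
flagged = outlier_test([10, 11, 30, 11, 10], neighbors, mu, sigma)
print(mu, sigma, flagged)
```

With these made-up numbers the neighbors of the extreme reading are flagged as well, because the extreme value pulls their neighborhood averages away from their own readings.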

Key Concepts - II

- Outlier test strategies
    - Route Outlier Detection (ROD)
    - Random Node Verification (RNV)
- ROD detects spatial outliers along a user-specified route.
- RNV finds the outlier nodes in a given set of arbitrary nodes.
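A rough sketch of how the two strategies differ from the caller's point of view. It assumes the model parameters `mu` and `sigma` have already been built, and it does not model disk layout or I/O cost, which is where the paper's analysis of the two strategies lies. All names and numbers are hypothetical.

```python
def is_outlier(x, f, neighbors, mu, sigma, theta=2.0):
    """Apply the threshold test to a single node x."""
    nbrs = neighbors[x]
    s = f[x] - sum(f[y] for y in nbrs) / len(nbrs)
    return abs((s - mu) / sigma) > theta


def route_outlier_detection(route, f, neighbors, mu, sigma, theta=2.0):
    """ROD: test every node along a user-specified route, in route order."""
    return [x for x in route if is_outlier(x, f, neighbors, mu, sigma, theta)]


def random_node_verification(nodes, f, neighbors, mu, sigma, theta=2.0):
    """RNV: test an arbitrary collection of nodes, in no particular order."""
    return {x for x in nodes if is_outlier(x, f, neighbors, mu, sigma, theta)}


# Hypothetical six-node chain with made-up readings and model parameters.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
f = [10, 11, 10, 40, 11, 10]
mu, sigma = 0.0, 1.0

rod_result = route_outlier_detection([0, 1, 2, 3], f, neighbors, mu, sigma)
rnv_result = random_node_verification({0, 3, 5}, f, neighbors, mu, sigma)
print(rod_result, rnv_result)
```

In this sketch both strategies run the same per-node test; the route in ROD consists of consecutive neighbors, while the node set in RNV has no such structure.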

Exercise – Outlier in 1D

- Given
    - 10 data locations [1..10] with values [2,2,2,2,2,2,2,2,2,2], i.e. f(i) = 2 for i = 1..10
    - Assume this is the only training data
    - The neighbors of the i-th location are the (i-1)-th and (i+1)-th locations, if they exist
    - Confidence level 95%, theta = 2
    - Assume the standard deviation is made non-zero to avoid division by zero: sd = sd + error (0.00001)
    - Use spatial statistics
- Find
    - Test set: f(4) = 5, f(3) = 2 and f(5) = 2. Is the data at location 4 an outlier?

- Solution steps
    - Find the mean and standard deviation of S(x) = [f(x) - average of f over N(x)]
    - Test the outlierness of location 4 using ABS((S(x) - mean) / std.dev) > 2
- Answer: ?
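One way to check the exercise numerically is to follow the solution steps literally. The sketch below assumes the population standard deviation (either choice gives zero for this training data) and uses hypothetical helper names; the numbers are those given in the exercise.

```python
from statistics import mean, pstdev

THETA = 2.0    # threshold from the exercise
EPS = 0.00001  # added to the standard deviation to avoid division by zero
N = 10         # locations are numbered 1..10


def neighbors_1d(i, n):
    """Neighbors of location i are i-1 and i+1, when they exist."""
    return [j for j in (i - 1, i + 1) if 1 <= j <= n]


def s_value(i, f, n):
    """S(i) = f(i) minus the average of f over the neighbors of i."""
    nbrs = neighbors_1d(i, n)
    return f[i] - sum(f[j] for j in nbrs) / len(nbrs)


# Model building: the only training set has f(i) = 2 everywhere.
train = {i: 2 for i in range(1, N + 1)}
s_train = [s_value(i, train, N) for i in range(1, N + 1)]
mu = mean(s_train)
sd = pstdev(s_train) + EPS

# Outlier test for location 4; only f(3), f(4), f(5) are needed.
test = {3: 2, 4: 5, 5: 2}
s4 = s_value(4, test, N)
z4 = abs((s4 - mu) / sd)
is_outlier_4 = z4 > THETA
print(s4, is_outlier_4)
```

Every training S(x) is 0, so the mean is 0 and the standard deviation is just the added error term. For the test data, S(4) = 5 - (2 + 2)/2 = 3, and 3 / 0.00001 is far above theta = 2, so under these assumptions location 4 is flagged as an outlier.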

Validation Methodology

- Experimental verification using real traffic data
    - Verification with real data is always preferable.
    - The paper explains in detail how I/O time can be optimized using spatial clustering techniques.
- Proofs and lemmas
    - Formal proofs and lemmas are scattered throughout the paper.
    - It is not clear whether the proofs and lemmas add value to this paper. [I would prefer fewer of them in an application-oriented paper.]

Contributions

- The authors provide a unified framework for model building and outlier testing.
- The algorithms exploit computational structure to find the model parameters in one scan through the data set.
- Experimental verification using real traffic data
- Experimental demonstration of I/O time minimization using spatial properties
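To illustrate what "one scan" can mean for the model parameters, the sketch below computes the mean and standard deviation of a stream of S(x) values in a single pass using running sums. This is a generic single-pass technique offered as an assumption about the kind of structure involved; the slides do not spell out the paper's exact formulation, and the function name is hypothetical.

```python
import math


def one_scan_model(s_stream):
    """Return (mean, population std) of the values in one pass.

    Only three running aggregates are kept in memory: the count,
    the sum, and the sum of squares.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for s in s_stream:
        n += 1
        total += s
        total_sq += s * s
    mu = total / n
    # Var = E[S^2] - (E[S])^2; clamp tiny negative values from rounding.
    var = max(total_sq / n - mu * mu, 0.0)
    return mu, math.sqrt(var)


# The input is an iterator, so the values are never held in a list here.
mu, sigma = one_scan_model(iter([1.0, 2.0, 3.0, 4.0]))
print(mu, sigma)
```

The sum-of-squares form is simple but can lose precision when the mean is large relative to the spread; a more careful implementation might use a different update rule.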

Assumptions

- The paper focuses on one non-spatial attribute.
- Temporal and spatio-temporal outliers are not detected directly.
- The outlier test assumes a normal distribution for the S functions.

Rewrite Today..

- The following are some options:
    - Generalize the non-spatial attribute f(x) to a multi-valued function
        - Use a distance metric to compare f(x) with its neighborhood average
    - Generalize the spatial (2-D) framework to an N-dimensional framework
        - Definition of neighborhood in N-D space
        - 3-D framework = spatio-temporal (time + space)
    - Explain the importance of a clean data set for model building
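The first option could look like the sketch below, where f(x) is a vector and the scalar difference is replaced by a distance to the component-wise neighborhood average. Euclidean distance is used only as one possible choice; the slide does not name a metric, and the function names and data are hypothetical.

```python
import math


def neighborhood_average(x, f, neighbors):
    """Component-wise average of the attribute vectors over N(x)."""
    nbrs = neighbors[x]
    dims = len(f[x])
    return tuple(sum(f[y][d] for y in nbrs) / len(nbrs) for d in range(dims))


def vector_difference(x, f, neighbors):
    """Euclidean distance between f(x) and its neighborhood average."""
    return math.dist(f[x], neighborhood_average(x, f, neighbors))


# Hypothetical two-valued attribute (e.g. vehicle count, average speed).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
f = {
    0: (10.0, 50.0),
    1: (11.0, 52.0),
    2: (30.0, 20.0),
    3: (10.0, 51.0),
}

d2 = vector_difference(2, f, neighbors)
print(d2)
```

Because a distance is never negative, a threshold test built on it would be one-sided, and attributes on different scales would likely need normalizing before the distance is meaningful.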