Transcript Document
Detection Algorithms for Biosurveillance: A tutorial
RODS: http://www.health.pitt.edu/rods
Auton Lab: http://www.autonlab.org
Copyright © 2002, 2003, 2004 Andrew Moore
Biosurveillance Detection Algorithms: Slide 1

The Basic Task: Analyze a time series data stream to find outbreaks without sounding too many false alarms
[Plot: Signal vs. Time]
Biosurveillance Detection Algorithms: Slide 2

Many Methods!
Table columns: Method; Has Pitt/CMU tried it?; Multivariate signal tracking?; Spatial?
Methods:
• Time-weighted averaging
• Serfling
• ARIMA
• SARIMA + External Factors
• Univariate HMM
• Kalman Filter
• Recursive Least Squares
• Support Vector Machine
• Neural Nets
• Randomization
• Spatial Scan Statistics
• Bayesian Networks
• Contingency Tables
• Scalar Outlier (SQC)
• Multivariate Anomalies
• Change-point statistics
• FDR Tests
• WSARE (Recent patterns)
• PANDA (Causal Model)
• FLUMOD (space/Time HMM)
Entries recorded in the table: “Yes” (most cells), “Tried but little used”, “Tried and used”, “Under development”, “Yes (w/ Howard Burkom)”. Which entry belongs to which method is not recoverable from this transcript.
Details of these methods and bibliography available from “Summary of Biosurveillance-relevant statistical and data mining technologies” by Moore, Cooper, Tsui and Wagner. Downloadable (PDF format) from www.cs.cmu.edu/~awm/biosurv-methods.pdf
Biosurveillance Detection Algorithms: Slide 3

What you’ll learn about
• Noticing events in bioevent time series: Univariate Anomaly Detection
• Tracking many series at once: Multivariate Anomaly Detection
• Detecting geographic hotspots: Spatial Scan Statistics
• Finding emerging new patterns: WSARE
These are all powerful statistical methods, which means they all have to have one thing in common… Boring Names.
Biosurveillance Detection Algorithms: Slides 4-8

Univariate Time Series
[Plot: Signal vs. Time]
Example Signals:
• Number of ED visits today
• Number of ED visits this hour
• Number of Respiratory Cases Today
• School absenteeism today
• Nyquil Sales today
Biosurveillance Detection Algorithms: Slide 9

(When) is there an anomaly?
This is a time series of counts of primary-physician visits in data from Norfolk in December 2001. I added a fake outbreak, starting at a certain date. Can you guess when?
Here (much too high for a Friday) (injected outbreak)
Biosurveillance Detection Algorithms: Slides 10-12

An easy case
[Plot: Signal vs. Time]
Dealt with by Statistical Quality Control
Record the mean and standard deviation up to the current time. Signal an alarm if we go outside 3 sigmas
Biosurveillance Detection Algorithms: Slide 13

An easy case: Control Charts
[Plot: Signal vs. Time, with the Mean and the Upper Safe Range marked]
Dealt with by Statistical Quality Control
Record the mean and standard deviation up to the current time.
Signal an alarm if we go outside 3 sigmas
Biosurveillance Detection Algorithms: Slide 14

Control Charts on the Norfolk Data
[Plot: the Norfolk series with the Alarm Level and the injected outbreak marked]
Biosurveillance Detection Algorithms: Slides 15-17

Looking at changes from yesterday
[Plot: day-to-day changes with the Alarm Level marked]
Biosurveillance Detection Algorithms: Slides 18-20

We need a happy medium:
• Control Chart: Too insensitive to recent changes
• Change from yesterday: Too sensitive to recent changes
Biosurveillance Detection Algorithms: Slide 21

Moving Average
[Plots]
Biosurveillance Detection Algorithms: Slides 22-25

Algorithm Performance
                                       Allowing one False Alarm    Allowing one False Alarm
                                       per TWO weeks…              per SIX weeks…
standard control chart                 0.39    3.47                0.22    4.13
using yesterday                        0.14    3.83                0.1     4.7
Moving Average 3                       0.36    3.45                0.33    3.79
Moving Average 7                       0.58    2.79                0.51    3.31
Moving Average 56                      0.54    2.72                0.44    3.54
hours_of_daylight                      0.58    2.73                0.43    3.9
hours_of_daylight is_mon               0.7     2.25                0.57    3.12
hours_of_daylight is_mon ... is_tue    0.72    1.83                0.57    3.16
hours_of_daylight is_mon ... is_sat    0.77    2.11                0.59    3.26
CUSUM                                  0.45    2.03                0.15    3.55
sa-mav-1                               0.86    1.88                0.74    2.73
sa-mav-7                               0.87    1.28                0.83    1.87
sa-mav-14                              0.86    1.27                0.82    1.62
sa-regress                             0.73    1.76                0.67    2.21
Cough with denominator                 0.78    2.15                0.59    2.41
Cough with MA                          0.65    2.78                0.57    3.24
Biosurveillance Detection Algorithms: Slides 26-28

Seasonal Effects
[Plot: Signal vs. Time]
Fit a periodic function (e.g. sine wave) to previous data. Predict today’s signal and 3-sigma confidence intervals. Signal an alarm if we’re off.
Reduces False alarms from Natural outbreaks.
Different times of year deserve different thresholds.
Biosurveillance Detection Algorithms: Slide 29

Algorithm Performance (table as on Slide 26)
Biosurveillance Detection Algorithms: Slide 30

Day-of-week effects
Fit a day-of-week component
E[Signal] = a + delta_day
E.g.: delta_mon = +5.42, delta_tue = +2.20, delta_wed = +3.33, delta_thu = +3.10, delta_fri = +4.02, delta_sat = -12.2, delta_sun = -23.42
A simple form of ANOVA
Biosurveillance Detection Algorithms: Slide 31

Regression using Hours-in-day & IsMonday
[Plots]
Biosurveillance Detection Algorithms: Slides 32-33

Algorithm Performance (table as on Slide 26)
Biosurveillance Detection Algorithms: Slide 34

Regression using Mon-Tue
[Plot]
Biosurveillance Detection Algorithms: Slide 35

Algorithm Performance (table as on Slide 26)
Biosurveillance Detection Algorithms: Slide 36

CUSUM
• CUmulative SUM Statistics
• Keep a running sum of “surprises”: a sum of excesses each day over the prediction
• When this sum exceeds threshold, signal alarm and reset sum
[Plots]
Biosurveillance Detection Algorithms: Slides 37-39

Algorithm Performance (table as on Slide 26)
Biosurveillance Detection Algorithms: Slide 40

The Sickness/Availability Model
Counts = sickness * availability
Sickness = counts / availability (plot this)
Sick people may seek care more often on certain days due to availability of medical services or time in their schedules, so adjust for that phenomenon
[Plots]
Biosurveillance Detection Algorithms: Slides 41-48

Algorithm Performance (table as on Slide 26)
Biosurveillance Detection Algorithms: Slides 49-50

Exploiting Denominator Data
Normalize (divide) by total visits
[Plots]
Biosurveillance Detection Algorithms: Slides 51-53

Exploiting Denominator Data and Smoothing
[Plot]
Biosurveillance Detection Algorithms: Slide 54

Algorithm Performance (table as on Slide 26)
Biosurveillance Detection Algorithms: Slide 55

Other state-of-the-art methods
• Wavelets
• Change-point detection
• Kalman filters
• Hidden Markov Models
Biosurveillance Detection Algorithms: Slide 56

What you’ll learn about
• Noticing events in bioevent time series: Univariate Anomaly Detection
• Tracking many series at once: Multivariate Anomaly Detection
• Detecting geographic hotspots: Spatial Scan Statistics
• Finding emerging new patterns: WSARE
Biosurveillance Detection Algorithms: Slide 57

Multiple Signals
Biosurveillance Detection Algorithms: Slide 58

Multivariate Signals (relevant to inhalational diseases)
[Plot: daily sales (0 to 2000) by date, 7/1/99 to 1/1/01, for the series cough.syr.liq.dec, tabs.caps, throat.cough and nasal]
Biosurveillance Detection Algorithms: Slide 59

Multi Source Signals
Footprint of Influenza in Routinely Collected Data
[Plot: series by week number for Lab, Flu, WebMD, School, Cough & Cold, Cough Syrup, Throat, Resp, Viral and Death]
Biosurveillance Detection Algorithms: Slide 60

What if you’ve got multiple signals?
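Before turning to multiple signals, here is a minimal sketch of two of the univariate detectors described above: the control chart from Slides 13-14 and CUSUM from Slide 37, with a moving average standing in for "the prediction". The function names, the `warmup`, `window`, `slack` and `threshold` parameters, and the sample counts are all illustrative assumptions; the slides do not say how the tutorial's own implementations were parameterized.

```python
import statistics


def control_chart_alarms(series, n_sigmas=3.0, warmup=7):
    """Return the days whose value lies more than n_sigmas standard
    deviations from the mean of all earlier days."""
    alarms = []
    for t in range(warmup, len(series)):
        history = series[:t]
        mean = statistics.fmean(history)
        sd = statistics.stdev(history)
        if abs(series[t] - mean) > n_sigmas * sd:
            alarms.append(t)
    return alarms


def cusum_alarms(series, window=3, slack=0.5, threshold=5.0):
    """Return the days on which a running sum of excesses over the
    moving-average prediction exceeds threshold; the sum is reset
    after each alarm. `slack` is an assumed per-day allowance."""
    alarms, running = [], 0.0
    for t in range(window, len(series)):
        predicted = statistics.fmean(series[t - window:t])
        running = max(0.0, running + (series[t] - predicted) - slack)
        if running > threshold:
            alarms.append(t)
            running = 0.0
    return alarms


# Made-up daily counts: a single spike, and a slow ramp.
spike = [10, 11, 9, 10, 12, 10, 11, 10, 9, 11, 10, 25]
ramp = [10] * 10 + [12, 14, 16, 18]
spike_alarms = control_chart_alarms(spike)
ramp_alarms = cusum_alarms(ramp)
print(spike_alarms, ramp_alarms)
```

The control chart compares each day with the whole history, which is why it reacts slowly to recent changes; the CUSUM accumulates several days of modest excess, which suits a ramp better than a single-day threshold would.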
Red: Cough Sales
Blue: ED Respiratory Visits
[Plot: both signals vs. Time]
Idea One: Simply treat it as two separate alarm-from-signal problems.
…Question: why might that not be the best we can do?
Biosurveillance Detection Algorithms: Slide 61

Another View
Red: Cough Sales
Blue: ED Respiratory Visits
[Scatter plot: Cough Sales vs. ED Respiratory Visits, with one point marked “This should be an anomaly”]
Question: why might that not be the best we can do?
Biosurveillance Detection Algorithms: Slides 62-63

N-dimensional Gaussian
Red: Cough Sales
Blue: ED Respiratory Visits
[Scatter plot: Cough Sales vs. ED Respiratory Visits, with One Sigma and 2 Sigma contours]
Good Practical Idea: Model the joint with a Gaussian
This is a sensible N-dimensional SQC
…But you can also do N-dimensional modeling of dynamics (leads to the idea of Kalman Filter model)
Biosurveillance Detection Algorithms: Slide 64

What you’ll learn about
• Noticing events in bioevent time series: Univariate Anomaly Detection
• Tracking many series at once: Multivariate Anomaly Detection
• Detecting geographic hotspots: Spatial Scan Statistics
• Finding emerging new patterns: WSARE
Biosurveillance Detection Algorithms: Slide 65

One Step of Spatial Scan
[Map: entire area being scanned (Philadelphia Metro), with the current region being considered]
Current region: I have a population of 5300 of whom 53 are sick (1%)
Everywhere else has a population of 2,200,000 of whom 20,000 are sick (0.9%)
So... is that a big deal? Evaluated with Score function (e.g. Kulldorff’s score)
[Score = 1.4]
Biosurveillance Detection Algorithms: Slides 66-70

Many Steps of Spatial Scan
[Map: entire area being scanned, the current region being considered, and the highest scoring region in search so far]
Highest scoring region in search so far: [Score = 9.3]
Current region: I have a population of 5300 of whom 53 are sick (1%) [Score = 1.4]
Everywhere else has a population of 2,200,000 of whom 20,000 are sick (0.9%)
So... is that a big deal? Evaluated with Score function (e.g. Kulldorff’s score)
Biosurveillance Detection Algorithms: Slide 71

Scan Statistics
Standard scan statistic question: Given the geographical locations of occurrences of a phenomenon, is there a region with an unusually high (low) rate of these occurrences?
Standard approach:
1. Compute the likelihood of the data given the hypothesis that the rate of occurrence is uniform everywhere, L0
2. For some geographical region, W, compute the likelihood that the rate of occurrence is uniform at one level inside the region and uniform at another level outside the region, L(W).
3. Compute the likelihood ratio, L(W)/L0
4. Repeat for all regions, and find the largest likelihood ratio. This is the scan statistic, S*_W
5. Report the region, W, which yielded the max, S*_W
See [Glaz and Balakrishnan, 99] for details
Biosurveillance Detection Algorithms: Slide 72

Significance testing
Given that region W is the most likely to be abnormal, is it significantly abnormal?
Standard approach:
1. Generate many randomized versions of the data set by shuffling the labels (positive instance of the phenomenon or not).
2. Compute S*_W for each randomized data set. This forms a baseline distribution for S*_W if the null hypothesis holds.
3. Compare the observed value of S*_W against the baseline distribution to determine a p-value.
Biosurveillance Detection Algorithms: Slide 73

Fast squares speedup
[Figure: N x N grid]
• Theoretical complexity of fast squares: O(N^2) (as opposed to naïve N^3), if maximum density region sufficiently dense. If not, we can use several other speedup tricks.
• In practice: 10-200x speedups on real and artificially generated datasets. Emergency Dept. dataset (600K records): 20 minutes, versus 66 hours with naïve approach.
Biosurveillance Detection Algorithms: Slide 74

Fast rectangles speedup
[Figure: N x N grid]
• Theoretical complexity of fast rectangles: O(N^2 log N) (as opposed to naïve N^4)
Biosurveillance Detection Algorithms: Slide 75

Fast oriented rectangles speedup
[Figure: N x N grid]
• Theoretical complexity of fast rectangles: 18 N^2 log N (as opposed to naïve 18 N^4) (Angles discretized to 5 degree buckets)
Biosurveillance Detection Algorithms: Slide 76

Why the Scan Statistic speed obsession?
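One way to see where the cost comes from is to write the naive procedure out. The sketch below follows the scan steps of Slide 72 and the randomization test of Slide 73 on a small grid of cells. It assumes a Bernoulli-model log-likelihood ratio as the score, which is one common Kulldorff-style choice; the slides do not say which score variant produced the numbers they quote, so this is not expected to reproduce them. All function names and the 4 x 4 test grid are hypothetical.

```python
import math
import random


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """log(L(W) / L0) for a region with c cases among n people, when the
    whole area has C cases among N people. Regions whose rate is not
    elevated score 0."""
    if n == 0 or n == N:
        return 0.0
    q_in, q_out = c / n, (C - c) / (N - n)
    if q_in <= q_out:
        return 0.0
    log_lw = (xlogy(c, q_in) + xlogy(n - c, 1 - q_in)
              + xlogy(C - c, q_out) + xlogy(N - n - (C - c), 1 - q_out))
    log_l0 = xlogy(C, C / N) + xlogy(N - C, 1 - C / N)
    return log_lw - log_l0


def best_rectangle(pop, cases):
    """Naive scan over every axis-aligned rectangle of grid cells.
    Returns (score, (row0, row1, col0, col1)) with inclusive bounds."""
    rows, cols = len(pop), len(pop[0])
    N, C = sum(map(sum, pop)), sum(map(sum, cases))
    best = (0.0, None)
    for r0 in range(rows):
        for r1 in range(r0, rows):
            for c0 in range(cols):
                for c1 in range(c0, cols):
                    cells = [(r, c) for r in range(r0, r1 + 1)
                             for c in range(c0, c1 + 1)]
                    n = sum(pop[r][c] for r, c in cells)
                    k = sum(cases[r][c] for r, c in cells)
                    score = bernoulli_llr(k, n, C, N)
                    if score > best[0]:
                        best = (score, (r0, r1, c0, c1))
    return best


def randomization_p_value(pop, cases, replicas=99, seed=0):
    """Shuffle the sick/not-sick labels among all individuals, rescan each
    replica, and rank the observed best score against the replicas."""
    rng = random.Random(seed)
    rows, cols = len(pop), len(pop[0])
    # One entry per individual, recording which cell that person lives in.
    owner = [(r, c) for r in range(rows) for c in range(cols)
             for _ in range(pop[r][c])]
    C = sum(map(sum, cases))
    observed, region = best_rectangle(pop, cases)
    beaten = 0
    for _ in range(replicas):
        shuffled = [[0] * cols for _ in range(rows)]
        for r, c in rng.sample(owner, C):
            shuffled[r][c] += 1
        if best_rectangle(pop, shuffled)[0] >= observed:
            beaten += 1
    return observed, region, (beaten + 1) / (replicas + 1)


# Made-up grid: 100 people per cell, 2 cases per cell, one planted hotspot.
pop = [[100] * 4 for _ in range(4)]
cases = [[2] * 4 for _ in range(4)]
cases[1][2] = 30
score, region, p_value = randomization_p_value(pop, cases)
print(region, round(score, 1), p_value)
```

Every replica repeats the full search over rectangles, so the total work is the cost of one scan multiplied by the number of replicas.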
• Traditional Scan Statistics very expensive, especially with Randomization tests Copyright © 2002, 2003, Andrew Moore Biosurveillance Detection Algorithms: Slide 77 Rectangular SS on Electrolyte Sales Copyright © 2002, 2003, Andrew Moore Biosurveillance Detection Algorithms: Slide 78 Rectangular SS on Cough/cold Sales Copyright © 2002, 2003, Andrew Moore Biosurveillance Detection Algorithms: Slide 79 Proposed new WSARE/Scan Statistic hybrid This is the strangest region because the age distribution of respiratory cases has changed dramatically for no reason that can be explained by known background changes Copyright © 2002, 2003, Andrew Moore Biosurveillance Detection Algorithms: Slide 80 What you’ll learn about • Noticing events in bioevent time series • Tracking many series at once • Detecting geographic hotspots • Finding emerging new patterns WSARE Copyright © 2002, 2003, Andrew Moore Univariate Anomaly Detection Multivariate Anomaly Detection Spatial Scan Statistics Biosurveillance Detection Algorithms: Slide 81 A Limitation of Univariate Analysis REPRESENTATIVE SURVEILLANCE DATA : Date Time Hospital ICD9 Prodrome Gender Age Home Location Many more… 6/1/03 9:12 1 781 Fever M 20s NE … 6/1/03 9:45 1 787 Diarrhea F 40s SE … : : : : Standard Approach Select in advance which subpopulations to monitor (e.g., each county, zip) Do not pay close attention to effect of multiple testing Copyright © 2002, 2003, Andrew Moore : : : : WSARE Approach Monitor hundreds of thousands of subpopulations Pay close attention to effect of multiple testing Biosurveillance Detection Algorithms: Slide 82 WSARE v2.0 • What’s Strange About Recent Events? • Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream. Copyright © 2002, 2003, Andrew Moore Biosurveillance Detection Algorithms: Slide 83 WSARE v2.0 • Inputs: 1. Date/time-indexed biosurveillancerelevant data stream Copyright © 2002, 2003, Andrew Moore 2. Time Window Length 3. 
Which attributes to use? Biosurveillance Detection Algorithms: Slide 84 WSARE v2.0 • Input s: 1. Date/time-indexed biosurveillancerelevant data stream 2. Time Window Length Primary Date Key Time “ignore key and weather” “last 24 hours” Example Hospital ICD9 Prodrome Gender Age Home 3. Which attributes to use? Work Large Medium Fine Large Medium Fine Scale Scale Scale Scale Scale Scale h6r32 6/2/2 14:12 Down- 781 Fever town Recent Recent (Many Flu Weather more…) Levels M 20s NE 15217 A5 NW 15213 B8 2% 70R … t3q15 6/2/2 14:15 River- 717 Respirat M side ory 60s NE 15222 J3 NE 15222 J3 2% 70R … t5hh5 6/2/2 14:15 Smith- 622 Respirat F field ory 80s SE 15210 K9 SE 15210 K9 2% 70R … : : : : : : Copyright © 2002, 2003, Andrew Moore : : : : : : : : : : : Biosurveillance Detection Algorithms: Slide 85 WSARE v2.0 • Inputs: 1. Date/time-indexed biosurveillancerelevant data stream • Outputs: 1. Here are the 2. Time Window Length 3. Which attributes to use? 2. Here’s why 3. And here’s how seriously you should take it records that most surprise me Primary Date Key Time Hospital ICD9 Prodrome Gender Age Home Work Large Medium Fine Large Medium Fine Scale Scale Scale Scale Scale Scale h6r32 6/2/2 14:12 Down- 781 Fever town Recent Recent (Many Flu Weather more…) Levels M 20s NE 15217 A5 NW 15213 B8 2% 70R … t3q15 6/2/2 14:15 River- 717 Respirat M side ory 60s NE 15222 J3 NE 15222 J3 2% 70R … t5hh5 6/2/2 14:15 Smith- 622 Respirat F field ory 80s SE 15210 K9 SE 15210 K9 2% 70R … : : : : : : Copyright © 2002, 2003, Andrew Moore : : : : : : : : : : : Biosurveillance Detection Algorithms: Slide 86 WSARE v2.0 • Given 500 day’s worth of ER cases at 15 hospitals… Date Cases Thu 5/22/2000 C1, C2, C3, C4 … Fri 5/23/2000 C1, C2, C3, C4 … : : : : Sat 12/9/2000 C1, C2, C3, C4 … Sun 12/10/2000 C1, C2, C3, C4 … Copyright © 2002, 2003, Andrew Moore : : Sat 12/16/2000 C1, C2, C3, C4 … : : Sat 12/23/2000 C1, C2, C3, C4 … : : : : Fri 9/14/2001 C1, C2, C3, C4 … Biosurveillance Detection Algorithms: 
Slide 87

WSARE v2.0
• Given 500 days’ worth of ER cases at 15 hospitals…
• For each day…
• Take today’s cases

Date              Cases
Thu 5/22/2000     C1, C2, C3, C4 …
Fri 5/23/2000     C1, C2, C3, C4 …
:                 :
Sat 12/9/2000     C1, C2, C3, C4 …
Sun 12/10/2000    C1, C2, C3, C4 …
:                 :
Sat 12/16/2000    C1, C2, C3, C4 …
:                 :
Sat 12/23/2000    C1, C2, C3, C4 …
:                 :
Fri 9/14/2001     C1, C2, C3, C4 …
Biosurveillance Detection Algorithms: Slide 88

WSARE v2.0
• Given 500 days’ worth of ER cases at 15 hospitals…
• For each day…
• Take today’s cases
• The cases one week ago
• The cases two weeks ago

Date              Cases
Thu 5/22/2000     C1, C2, C3, C4 …
Fri 5/23/2000     C1, C2, C3, C4 …
:                 :
Sat 12/9/2000     C1, C2, C3, C4 …
Sun 12/10/2000    C1, C2, C3, C4 …
:                 :
Sat 12/16/2000    C1, C2, C3, C4 …
:                 :
Sat 12/23/2000    C1, C2, C3, C4 …
:                 :
Fri 9/14/2001     C1, C2, C3, C4 …
Biosurveillance Detection Algorithms: Slide 89

WSARE v2.0
• Given 500 days’ worth of ER cases at 15 hospitals…
• For each day…
• Take today’s cases
• The cases one week ago
• The cases two weeks ago
• Ask: “What’s different about today?”

DATE_ADMITTED  ICD9    PRODROME  GENDER  place2  …
12/9/00        786.05  3         F       s-e     …
12/9/00        789     1         F       s-e     …
12/9/00        789     1         M       n-w     …
12/9/00        786.05  3         M       s-e     …
:              :       :         :       :
12/16/00       787.02  2         M       n-e     …
12/16/00       782.1   4         F       s-w     …
12/16/00       789     1         M       s-e     …
12/16/00       786.09  3         M       n-w     …
12/23/00       789.09  1         M       s-w     …
12/23/00       789.09  1         F       s-w     …
12/23/00       782.1   4         M       n-w     …
:              :       :         :       :
12/23/00       786.09  3         M       s-e     …
12/23/00       786.09  3         M       s-e     …
12/23/00       780.9   2         F       n-w     …
12/23/00       V40.9   7         M       s-w     …
Biosurveillance Detection Algorithms: Slide 90

WSARE v2.0
• Given 500 days’ worth of ER cases at 15 hospitals…
• For each day…
• Take today’s cases
• The cases one week ago
• The cases two weeks ago
• Ask: “What’s different about today?”

DATE_ADMITTED  ICD9    PRODROME  GENDER  place2  …
12/9/00        786.05  3         F       s-e     …
12/9/00        789     1         F       s-e     …
12/9/00        789     1         M       n-w     …
12/9/00        786.05  3         M       s-e     …
:              :       :         :       :
12/16/00       787.02  2         M       n-e     …
12/16/00       782.1   4         F       s-w     …
12/16/00       789     1         M       s-e     …
12/16/00       786.09  3         M       n-w     …
12/23/00       789.09  1         M       s-w     …
12/23/00       789.09  1         F       s-w     …
12/23/00       782.1   4         M       n-w     …
:              :       :         :       :
12/23/00       786.09  3         M       s-e     …
12/23/00       786.09  3         M       s-e     …
12/23/00       780.9   2         F       n-w     …
12/23/00       V40.9   7         M       s-w     …

Fields we use: Date, Time of Day, Prodrome, ICD9, Symptoms, Age, Gender, Coarse Location, Fine Location, ICD9 Derived Features, Census Block Derived Features, Work Details, Colocation Details
Biosurveillance Detection Algorithms: Slide 91

Example of Output
Sat 12-23-2001 (daynum 36882, dayindex 239)
35.8% ( 48/134) of today's cases have 30 <= age < 40
17.0% ( 45/265) of other cases have 30 <= age < 40
Biosurveillance Detection Algorithms: Slide 92

Example of Output
Sat 12-23-2001 (daynum 36882, dayindex 239)
FISHER_PVALUE = 0.000051
35.8% ( 48/134) of today's cases have 30 <= age < 40
17.0% ( 45/265) of other cases have 30 <= age < 40
Biosurveillance Detection Algorithms: Slide 93

Searching for the best score…
• Try ICD9 = x for each value of x
• Try Gender=M, Gender=F
• Try CoarseRegion=NE, =NW, SE, SW..
• Try FineRegion=AA,AB,AC, … DD (4x4 Grid)
• Try Hospital=x, TimeofDay=x, Prodrome=X, …
• [In future… features of census blocks]
Biosurveillance Detection Algorithms: Slide 94

Corrected P value
Sat 12-23-2001 (daynum 36882, dayindex 239)
FISHER_PVALUE = 0.000051
RANDOMIZATION_PVALUE = 0.031
35.8% ( 48/134) of today's cases have 30 <= age < 40
17.0% ( 45/265) of other cases have 30 <= age < 40
Biosurveillance Detection Algorithms: Slide 95

WSARE v2.0
• Inputs:
1. Date/time-indexed biosurveillance-relevant data stream
2. Time Window Length
3. Which attributes to use?
• Outputs:
1. Here are the records that most surprise me
2. Here’s why
3. And here’s how seriously you should take it

Primary                                                        Home                  Work                  Recent      Recent   (Many
Key      Date   Time   Hospital    ICD9  Prodrome     Gender  Age  Large  Medium  Fine  Large  Medium  Fine  Flu Levels  Weather  more…)
h6r32    6/2/2  14:12  Downtown    781   Fever        M       20s  NE     15217   A5    NW     15213   B8    2%          70R      …
t3q15    6/2/2  14:15  Riverside   717   Respiratory  M       60s  NE     15222   J3    NE     15222   J3    2%          70R      …
t5hh5    6/2/2  14:15  Smithfield  622   Respiratory  F       80s  SE     15210   K9    SE     15210   K9    2%          70R      …
:        :      :      :           :     :            :       :    :      :       :     :      :       :     :           :        :
Biosurveillance Detection Algorithms: Slide 96

WSARE v2.0
• Inputs:
1. Date/time-indexed biosurveillance-relevant data stream
2. Time Window Length
3. Which attributes to use?
• Outputs:
1. Here are the records that most surprise me
2. Here’s why
3. And here’s how seriously you should take it

Primary                                                        Home                  Work                  Recent      Recent   (Many
Key      Date   Time   Hospital    ICD9  Prodrome     Gender  Age  Large  Medium  Fine  Large  Medium  Fine  Flu Levels  Weather  more…)
h6r32    6/2/2  14:12  Downtown    781   Fever        M       20s  NE     15217   A5    NW     15213   B8    2%          70R      …
t3q15    6/2/2  14:15  Riverside   717   Respiratory  M       60s  NE     15222   J3    NE     15222   J3    2%          70R      …
t5hh5    6/2/2  14:15  Smithfield  622   Respiratory  F       80s  SE     15210   K9    SE     15210   K9    2%          70R      …

Normally, 8% of cases in the East are over-50s with respiratory problems. But today it’s been 15%.

Don’t be too impressed! Taking into account all the patterns I’ve been searching over, there’s a 20% chance I’d have found a rule this dramatic just by chance.
Biosurveillance Detection Algorithms: Slide 97

WSARE on recent Utah Data
Saturday June 1st in Utah:
The most surprising thing about recent records is:
Normally: 0.8% of records (50/6205) have time before 2pm and prodrome = Hemorrhagic
But recently: 2.1% of records (19/907) have time before 2pm and prodrome = Hemorrhagic
Pvalue = 0.0484042
Which means that in a world where nothing changes we'd expect to have a result this significant about once every 20 times we ran the program
Biosurveillance Detection Algorithms: Slide 98

WSARE 3.0
• “Taking into account recent flu levels…”
• “Taking into account that today is a public holiday…”
• “Taking into account that this is Spring…”
• “Taking into account a recent heatwave…”
• “Taking into account that there’s a known natural food-borne outbreak in progress…”
Bonus: More efficient use of historical data
Biosurveillance Detection Algorithms: Slide 99

Idea: Bayesian Networks
“Patients from West Park Hospital are less likely to be young”
“On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems”
“The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”
“On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
Biosurveillance Detection Algorithms: Slide 100

WSARE 3.0
All historical data
Biosurveillance Detection Algorithms: Slide 101

WSARE 3.0
All historical data
Biosurveillance Detection Algorithms: Slide 102

WSARE 3.0
All historical data
Today’s Environment
What should be happening today?
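
The WSARE v2.0 scoring shown on the "Example of Output" and "Corrected P value" slides above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names (`fisher_exact_two_sided`, `best_rule`, `randomization_pvalue`), the dictionary record format, and the choice of a two-sided test are all assumptions, since the slides do not say whether FISHER_PVALUE is one-sided or two-sided. The idea is to score each candidate rule with Fisher's exact test on a 2x2 table, keep the best rule, and then shuffle which records count as "today" to estimate how often a search this broad finds something equally striking by chance.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: today's cases / other cases. Columns: matches rule / does not.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x rule-matching cases in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def best_rule(today, baseline):
    """Return (p_value, attribute, value) for the best single-attribute rule.

    Records are dicts with string values (a hypothetical format).
    """
    best = (1.0, None, None)
    candidates = sorted({(k, v) for rec in today + baseline for k, v in rec.items()})
    for attr, val in candidates:
        a = sum(rec[attr] == val for rec in today)
        c = sum(rec[attr] == val for rec in baseline)
        p = fisher_exact_two_sided(a, len(today) - a, c, len(baseline) - c)
        if p < best[0]:
            best = (p, attr, val)
    return best


def randomization_pvalue(today, baseline, n_shuffles=200, seed=0):
    """Estimate how often shuffled data yields a best rule at least as extreme."""
    rng = random.Random(seed)
    observed = best_rule(today, baseline)[0]
    pooled = today + baseline
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign which records are labelled "today"
        fake_today, fake_base = pooled[:len(today)], pooled[len(today):]
        if best_rule(fake_today, fake_base)[0] <= observed:
            hits += 1
    # Add-one estimate so the result is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Counts from the "Example of Output" slide: 48/134 today vs 45/265 other.
    print(fisher_exact_two_sided(48, 134 - 48, 45, 265 - 45))
    # Counts from the Utah slide: 19/907 recent vs 50/6205 normal.
    print(fisher_exact_two_sided(19, 907 - 19, 50, 6205 - 50))
```

The uncorrected score only asks how surprising one rule is. The randomization step repeats the whole search on shuffled data, which is why the corrected value on the slides (0.031) is so much larger than the raw Fisher value (0.000051).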
Biosurveillance Detection Algorithms: Slide 103

WSARE 3.0
All historical data
Today’s Environment
What should be happening today?
Today’s Cases
What’s strange about today, considering its environment?
Biosurveillance Detection Algorithms: Slide 104

WSARE 3.0
All historical data
Today’s Environment
What should be happening today?
Today’s Cases
What’s strange about today, considering its environment?
And how big a deal is this, considering how much search I’ve done?
Biosurveillance Detection Algorithms: Slide 105

WSARE 3.0
All historical data
Today’s Environment
Today’s Cases
Cheap
What should be happening today?
Expensive
What’s strange about today, considering its environment?
And how big a deal is this, considering how much search I’ve done?
Biosurveillance Detection Algorithms: Slide 106

WSARE 3.0
All historical data
Today’s Environment
Today’s Cases
• All-dimensions Trees
• Racing Randomization
• Differential Randomization
• RADSEARCH
Cheap
Expensive
What should be happening today?
What’s strange about today, considering its environment?
And how big a deal is this, considering how much search I’ve done?
Biosurveillance Detection Algorithms: Slide 107

Results on Simulation
[Figure: Standard, WSARE2.0, WSARE2.5, WSARE3.0]
Biosurveillance Detection Algorithms: Slide 108

BARD (Bayesian Aerosol Release Detector)
Key Points
Meselson et al, 1994 Science
Goal: detect aerosol release of B. anthracis spores
Automates the analysis done by Meselson et al.
Alarms when an increase in disease activity is spatially and temporally consistent with aerosol anthrax
Makes use of an inverted atmospheric dispersion model and meteorological data
In preliminary evaluation, no false positives in 6.5 months
- By simply analyzing existing surveillance data more thoroughly (without additional data collection), BARD has the potential to improve the earliness and specificity of detection
More info: BARD Tech report
Biosurveillance Detection Algorithms: Slide 109

For further info
• Papers on these and other anti-terror applications: www.cs.cmu.edu/~awm/antiterror
• Papers on scaling up many of these analysis methods: www.cs.cmu.edu/~awm/papers.html
• Software implementing the above: www.autonlab.org
• Copies of 18 lectures on 25 statistical data mining topics: www.cs.cmu.edu/~awm/781
• CD-ROM, PowerPoint-synchronized video/audio recordings of the above lectures: [email protected]

Information Gain, Decision Trees
Probabilistic Reasoning, Bayes Classifiers, Density Estimation
Probability Densities in Data Mining
Gaussians in Data Mining
Maximum Likelihood Estimation
Gaussian Bayes Classifiers
Regression, Neural Nets
Overfitting: detection and avoidance
The many approaches to cross-validation
Locally Weighted Learning
Bayes Net, Bayes Net Structure Learning, Anomaly Detection
Andrew's Top 8 Favorite Regression Algorithms (Regression Trees, Cascade Correlation, Group Method Data Handling (GMDH), Multivariate Adaptive Regression Splines (MARS), Multilinear Interpolation, Radial Basis Functions, Robust Regression, Cascade Correlation + Projection Pursuit)
Clustering, Mixture Models, Model Selection
K-means clustering and hierarchical clustering
Vapnik-Chervonenkis (VC) Dimensionality and Structural Risk Minimization
PAC Learning
Support Vector Machines
Time Series Analysis with Hidden Markov Models
Biosurveillance Detection Algorithms: Slide 110

References
1.
WSARE 3.0: Bayesian Network based Anomaly Pattern Detection. Wong, Moore, Cooper and Wagner [ICML/KDD 2003]
2. Fast Grid Based Computation of Spatial Scan Statistics. Neill and Moore [NIPS 2003]

These and other Biosurveillance algorithms papers and free software available from http://www.autonlab.org/
See also: http://www.health.pitt.edu/rods
Biosurveillance Detection Algorithms: Slide 111