Please note this is a draft under revision.
The Numenta Anomaly Benchmark (NAB)
Introduction

Much of real-world data is streaming: the data not only changes over time, it has meaning over time – the order of the data points matters. Detecting anomalies in streaming data is a difficult task, because the detector must process data in real time rather than making many passes through a large batch file, and it must learn the high-level patterns and sequences as it goes. Detecting anomalies in streaming data can nevertheless be extremely valuable in many domains, such as IT security, finance, vehicle tracking, health care, energy grid monitoring, and e-commerce – essentially any application where sensors produce important data that changes over time.

Measuring and comparing the efficacy of streaming-data anomaly detectors is also a difficult task. First, since most anomaly detectors are focused on a specific domain, they do not share a common, more generalized data set, which makes it hard to compare detectors. Second, most data sets are synthetic, which does not provide a realistic measurement of how a detector performs in the real world [reference "Systematic Construction of Anomaly Detection Benchmarks from Real Data", Aug 2013].

The Numenta Anomaly Benchmark (NAB) attempts to provide a controlled and repeatable environment, and the tools, to test and measure different anomaly detection algorithms on streaming data. To our knowledge there are no existing benchmarks that adequately test the efficacy of online anomaly detectors. The motivation of NAB is to provide a framework with which we can compare and evaluate different algorithms for detecting anomalies in streaming data.

NAB is designed to test the features of streaming-data anomaly detectors that are valuable in practical applications. Thus, in order to score well on NAB, an anomaly detection algorithm should:
- Run in unsupervised mode.
- Not rely on any dataset-specific tuning.
- Perform continuous or online learning.
- Process real-time data only (and not depend on look-ahead).

This paper describes the main processes, inputs and outputs of NAB, and the motivation behind the choices made in the benchmark design. More details on NAB and on anomaly detection can be found in a TBD journal publication and in Numenta's Science of Anomaly Detection whitepaper, respectively.
Anomaly Detection

What is an anomaly? An "anomaly" is a deviation from what is standard, normal, or expected. In streaming data, "normal" can be described as a high-order pattern or sequence that can be recognized over time. When this pattern changes, i.e. the data behaves in unexpected ways, the behavior can be characterized as an "anomaly". If the data then stabilizes into a new pattern, that behavior soon becomes "normal", since over time the new pattern can be discerned and learned.

There are many types of anomalies, and an almost infinite variety of normal behavior, which makes anomaly detection difficult. In static data there are spatial anomalies, or deviations from normal in space. In streaming data there are temporal anomalies, or deviations from normal in time. There are anomalies in seemingly random data, and there are sudden-change anomalies that define a new normal. Anomalies can be both positive and negative, i.e. an increase or decrease, respectively, in the data metric of interest.

An anomaly detector accepts data as input and outputs the items, events, or observations that do not conform to the identified pattern(s). These patterns can be either global or local in the dataset. As discussed above, we are specifically concerned with detecting anomalies in real-time, streaming data; i.e. an online anomaly detector.

What is the Numenta Anomaly Benchmark?

NAB consists of an open source repository, available under the GNU GPL v3 license, containing the labeled benchmark data, code, file format descriptions, examples and documentation for the Numenta Anomaly Benchmark.

The ideal anomaly detector is both accurate and fast. The ideal detector:
- Detects all anomalies present in the streaming data.
- Detects anomalies as soon as possible, ideally before the anomaly becomes visible to a human.
- Triggers no false alarms (no false positives).
- Works with real-world data.

One important aspect of NAB is a scoring methodology designed for streaming applications. The NAB scoring system quantifies the degree to which the Detector Under Test (DUT) meets the above ideal standards. NAB compares the DUT's "detections" against the Ground Truth File, which is a combination of the individual Labeling Files (see the description of Ground Truth creation below), using a programmatic scoring function (also described below). In addition, we introduce the concept of "application profiles", which enables us to vary the relative cost of true positives, false positives and false negatives.
A second important aspect of NAB is that the Benchmark Dataset includes both artificial and real-world data. The Data Files in the initial Benchmark Dataset include server metrics, machine sensor readings and temperature readings. The goal is to add other real-world Data Files to the Benchmark Dataset over time: financial data, security settings, positional data, etc.

NAB Details

Descriptions of Algorithms Included With NAB

At the alpha release, three different detector algorithms are included in the NAB repository. The user can experiment with these detectors, or create new detectors to run through NAB.
- HTM-based detector: The Numenta detector, based on Hierarchical Temporal Memory (HTM), is included in the NAB code repository. The Numenta detector does not need training sets; it automatically builds models for any number of metrics and performs continuous learning on streaming datasets. Refer to https://github.com/numenta/nupic/wiki/Anomaly-Detection-and-Anomaly-Scores for more information. The NAB repository documentation includes instructions on how a user can do NAB test runs with the Numenta detector and replicate the results.
- Etsy/Skyline detector: Skyline is an open source, real-time anomaly detection system, built to enable passive monitoring of hundreds of thousands of metrics without the need to configure a model or thresholds for each one. It is designed to be used wherever there is a large quantity of high-resolution time series data that needs constant monitoring. Refer to https://github.com/etsy/skyline for more information. The NAB repository documentation includes instructions on how a user can do NAB test runs with the Skyline detector and replicate the results.
- Random detector: provided as a trivial baseline for comparison.

Benchmark Dataset Overview

The NAB Benchmark Dataset contains the streaming data that detectors use as input during a benchmark test run. The Benchmark Dataset consists of a number of individual Data Files; each Data File represents a sequence of data points over time that is interesting and potentially challenging for an anomaly detector. At the alpha release the Benchmark Dataset contains thirty-two (32) individual Data Files. Five (5) of these Data Files contain no anomalies, and instead represent patterns of data over time; anomaly detectors should not find anomalies in these files. The other twenty-seven (27) Data Files each contain one or more anomalies. Some of the Data Files are simulated, to create simple anomalies under clear conditions, and others are taken from real-world situations.
The NAB repository includes documentation on all of these Data Files; you can also refer to Appendix B, which describes the types of anomalies contained in the Benchmark Dataset.

NAB Ground Truth Creation

In order to score a DUT's anomaly detections, NAB needs a reference, or "Ground Truth", to score against. The Ground Truth for the Benchmark Dataset has been created in the following three steps:

1. A number of human Labelers [reference to list of names] have labeled the Benchmark Dataset. Each individual Labeler read through the published labeling guidelines [reference these here], then looked through the Data Files in the Benchmark Dataset and recorded, in the specified file format, the time windows that s/he thought contained anomalies.

2. These multiple Labeling Files are then combined into one Labeling File, using the following algorithm:
a. A parameter is passed into the label-combining function that describes the level of agreement needed between individual Labelers before a particular point in a Data File is labeled as anomalous in the combined file. An Agreement Parameter of 1 indicates that 100% of the Labelers must agree on a data point being labeled an anomaly in the combined label file; lower values make the final label less reliant on unanimous agreement. For instance, an Agreement Parameter of 0.5 means that only 50% of the Labelers need to agree in order for a data point within the Data File to be labeled as an anomaly.
b. The combined labels are then converted to "Combined Anomaly Windows", each indicated by a beginning and ending timestamp. This is necessary because the majority of anomalies are not point anomalies; rather, they occur over multiple time steps. Anomaly Windows are a less verbose representation of the anomaly labels for individual points; the NAB code uses both representations for different functions, so for brevity we refer only to Anomaly Windows in the rest of this paper.

3. The human labeling may be imprecise, even with our formalized process and guidelines, so NAB also introduces the concept of a "Relaxed Anomaly Window". The Relaxed Anomaly Window allows the DUT not to be penalized during scoring if its anomaly detections fall slightly before or after where the Ground Truth indicates the anomaly. We want these windows to be large enough to allow early detection of anomalies, but not so large as to artificially boost the score of inaccurate detectors. The Relaxed Anomaly Windows are calculated using the following algorithm:
a. The total amount of "relaxation" in a single Data File is 10% of the Data File length. For instance, if a Data File contains 4000 data points, then the total amount of relaxation shared by all anomaly windows in that file is 0.10 * 4000, or 400 data points.
b. The total relaxation amount for a Data File is then divided by the total number of anomaly windows in the Data File, and the resulting number of data points is added to each anomaly window, half before the start of the window and half after the end of the window. The resulting, larger window is called the "Relaxed Anomaly Window". For instance, if the Data File above has four anomalies in it (after the label combination is done), then each anomaly window is relaxed by 400/4, or 100 data points (50 data points in each direction).

Because we assume anomalies are relatively sparse and exist as windows, we can relax the windows without creating overlap. The relaxation applied to each window is inversely proportional to the number of anomalies in a given Data File. The Relaxed Window sizing method is intended to give the benefit of the doubt to the detector, but only to a limited extent, so that false positives are not counted as true positives. The method works well because anomalies are rare. The 10% parameter is validated by the observation that, at this level, we see the largest separation between the random detector and the real detectors.

This phase does not happen during every NAB test run; the programs in this phase were run once, before the release of NAB. The code that created the Ground Truth, the individually labeled files, and all the Data Files are available in the NAB repository.

Ground Truth Creation Inputs

Ground Truth Creation uses the Benchmark Dataset and the individual Labeling Files as inputs.

Ground Truth Creation Outputs

The relaxed window timestamps resulting from step 3 of this phase are stored in the ground_truth_labels.json file, located in the /labels directory of the NAB repository. This file is used by the NAB optimization and scoring phases as the Ground Truth.
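To make the window-relaxation arithmetic in steps 3a and 3b concrete, the sketch below implements it for a single Data File. It is a minimal illustration under stated assumptions, not the actual NAB code: windows are represented here as (start_index, end_index) pairs of data-point indices, whereas the repository's label-generation scripts operate on timestamps.

```python
# Minimal sketch of the Relaxed Anomaly Window calculation (steps 3a/3b above).
# Assumption: anomaly windows are (start_index, end_index) pairs of data-point
# indices; the real NAB label code uses timestamps.

def relax_windows(windows, num_points, relaxation_fraction=0.10):
    """Spread `relaxation_fraction` of the file length across all windows."""
    if not windows:
        return []
    total_relaxation = int(relaxation_fraction * num_points)   # e.g. 0.10 * 4000 = 400
    per_window = total_relaxation // len(windows)               # e.g. 400 / 4 = 100
    half = per_window // 2                                      # 50 points in each direction
    relaxed = []
    for start, end in windows:
        relaxed.append((max(0, start - half), min(num_points - 1, end + half)))
    return relaxed

# Example: a 4000-point file with four labeled anomaly windows.
print(relax_windows([(500, 520), (1500, 1540), (2600, 2610), (3700, 3720)], 4000))
```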
NAB Test Process

The NAB test process consists of three phases:
- Phase 1 – Detector Phase: The DUT processes the Data Files and records the Raw Anomaly Score for each data point. This is a real-valued number between 0 and 1.
- Phase 2 – Threshold Optimization Phase: The DUT's Raw Anomaly Scores are thresholded, using the highest-scoring threshold, to decide which data points are tagged as anomalies; these optimized "Detections" are recorded for each data point in the Data Files.
- Phase 3 – Scoring Phase: The DUT's Detections are scored with respect to the Ground Truth. Final scores are recorded for each Data File.

A user can enter the NAB test process at the beginning of any one of the three phases, as long as the user has input data in the correct format. Following is a more detailed description of each phase.

1. Detector Phase

During the Detector Phase, the DUT processes each Data File in the Benchmark Dataset, taking the data points in sequence order. The first 15% of each Data File is called the "probationary period", and no scoring is done in this time window. The minimum and maximum values of each Data File are available to the detector before processing begins; in most real-world applications the minimum and maximum values are known. No other information about the Data File is given, and no look-ahead is allowed.

For each point in the Data File, the DUT must calculate a "Raw Anomaly Score", a floating point number between 0 and 1; the closer the number is to 1, the more likely the point is to be an anomaly. These Raw Anomaly Scores are stored, each with its appropriate timestamp, in a separate output file for each Data File, in the results folder. The NAB repository includes instructions on how to do a test run with the detectors included in the repository, and instructions on how to integrate a custom detector into NAB.

Detector Phase Inputs: The inputs into this phase are the Detector (DUT) and the Benchmark Dataset. The user can feed an optional, custom detector into the Detector Phase by creating a subclass of the base class in the nab/detectors directory (base.py). To help, skeleton code is included in the detectors directory as detector_skeleton.py.

Detector Phase Outputs: The DUT's Raw Anomaly Scores are written into files named detectorName_dataFileName.csv; these files are located in the /results directory.
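As an illustration of the contract a detector must satisfy in Phase 1, the sketch below shows a trivial standalone detector that emits a Raw Anomaly Score in [0, 1] for each record using only past data. The class and method names here are assumptions chosen for this sketch; the authoritative interface is the base class and skeleton code in nab/detectors.

```python
# Standalone sketch of a Phase 1 detector: score each value by its deviation
# from the running mean, using no look-ahead. This is not the NAB base class;
# it only mirrors the shape of the contract (one score per record, in [0, 1]).

class RunningMeanDetector:
    def __init__(self, input_min, input_max):
        # NAB exposes each Data File's min and max before processing begins.
        self.value_range = max(input_max - input_min, 1e-9)
        self.count = 0
        self.mean = 0.0

    def handle_record(self, value):
        if self.count == 0:
            score = 0.0
        else:
            deviation = abs(value - self.mean) / self.value_range
            score = min(1.0, deviation)        # clamp into [0, 1]
        # Update the running mean *after* scoring, so no future data is used.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return score

detector = RunningMeanDetector(input_min=0.0, input_max=100.0)
for v in [10.0, 11.0, 9.5, 10.2, 55.0, 10.1]:
    print(round(detector.handle_record(v), 3))
```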
Phase 2: Threshold Optimization Phase

The outputs of the Detector Phase are files (one for each Data File) that contain a floating point Raw Anomaly Score (between 0 and 1) for every data point in the Benchmark Dataset. The next step in the NAB process is to threshold these Raw Anomaly Scores into discrete values indicating the presence or absence of an anomaly. We constrain each detector to use a single detection threshold for the entire Benchmark Dataset. For convenience, NAB includes a simple hill-climbing routine for finding the optimal threshold value given the scoring rules. The chosen threshold (the one that results in the optimal score) is then applied to the Raw Anomaly Scores, producing a "Detection Label" for each timestamp, where a binary 1 represents the presence of an anomaly and a 0 represents the absence of an anomaly. These Detection Labels are written into the same output files produced in Phase 1.

Threshold Optimization Phase Inputs: The inputs into this phase are the DUT's Raw Anomaly Scores, the Ground Truth file, and the Application Profile. Note that a user can enter the NAB test run at the beginning of the Threshold Optimization Phase (Phase 2) by supplying the detectorName_dataFileName.csv files populated with Raw Anomaly Scores. File formats are described in detail in the NAB repository.

Threshold Optimization Phase Outputs: The DUT's Detection Labels are written into files named detectorName_dataFileName.csv; these files are located in the /results directory.
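The sketch below shows one way to search for a single global threshold. For clarity it uses a plain grid search rather than the hill-climbing routine that NAB actually ships, and the score_all_files/score_fn helpers are stand-ins (assumptions of this sketch) for the real NAB scorer.

```python
# Sketch of Phase 2: choose one detection threshold for the whole Benchmark
# Dataset. NAB uses a hill-climbing routine; a grid search is shown here only
# to illustrate the idea. `score_fn` stands in for the NAB scoring function.

def score_all_files(raw_scores_per_file, threshold, score_fn):
    """Apply one threshold to every file and sum the resulting scores."""
    total = 0.0
    for raw_scores in raw_scores_per_file:
        detections = [1 if s >= threshold else 0 for s in raw_scores]
        total += score_fn(detections)
    return total

def optimize_threshold(raw_scores_per_file, score_fn, steps=100):
    best_threshold, best_score = 0.0, float("-inf")
    for i in range(steps + 1):
        threshold = i / steps
        score = score_all_files(raw_scores_per_file, threshold, score_fn)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy usage: reward a detection at index 5, penalize every other positive label.
toy_score = lambda det: sum((1 if i == 5 else -0.1) * d for i, d in enumerate(det))
print(optimize_threshold([[0.1, 0.2, 0.1, 0.3, 0.4, 0.9, 0.2]], toy_score))
```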
Phase 3: Scoring Phase

In the Scoring Phase, NAB assigns a final score to the DUT's Detection Labels in the results files; each Data File is scored separately. The Scoring Phase uses a Scoring Function and three Scoring Weights to calculate the DUT's final score. This is discussed in further detail in the Scoring section below.

Scoring Phase Inputs: The inputs into this phase are the Anomaly Windows file (after the Threshold Optimization Phase is complete, containing the DUT's Detection Windows), the Ground Truth file, and the Application Profile. Note that a user can enter the NAB test run at the beginning of the Scoring Phase (Phase 3) by supplying an Anomaly Windows file, in the correct format, populated with Detection Windows. File formats are described in detail in the NAB repository.

Scoring Phase Outputs: The scores are stored in a file called detectorName_scores.csv, in the /results directory.

Scoring

Scoring Function

Final scores for a DUT are based on a Scoring Function, S, that quantifies how good or bad it is for the DUT to label a given data point in the Benchmark Dataset as an anomaly (where the presence of an anomaly is represented by the binary value 1). Given a data point at time t, S(t) = 1 indicates an optimal location to label a data point as an anomaly, and S(t) = -1 indicates the opposite. S(t) ranges from -1 to 1:

S(t) = 2 / (1 + e^(5t)) - 1
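As a concrete illustration of the shape of S, the sketch below evaluates the sigmoid form given above at a few values of t. Two points are assumptions of this sketch rather than statements of the NAB implementation: the interpretation of t as the position of a detection relative to an anomaly window (negative inside the window, positive after it), and the steepness constant 5, which follows the formula as reconstructed above.

```python
# Sketch of the scoring function S(t) = 2 / (1 + e^(5t)) - 1.
# Assumption: t is the position of a detection relative to an anomaly window
# (t < 0 inside the window, t > 0 after it); the constant 5 follows the
# formula as written above.
import math

def scoring_function(t):
    return 2.0 / (1.0 + math.exp(5.0 * t)) - 1.0

# Early detections inside the window score close to +1; detections just after
# the window are penalized only mildly, and late false positives approach -1.
for t in [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]:
    print(t, round(scoring_function(t), 3))
```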
Figure 1. NAB scoring function – needs some revision to simplify

In Figure 1, the shaded areas represent the Ground Truth anomaly windows – the time windows in which the DUT needs to detect an anomaly in order to get a good score. The dark line represents the Scoring Function; the value of the Detection Label at a data point (either 1 or 0) is multiplied by this Scoring Function. Note that the Scoring Function weights early detection of anomalies (within the anomaly window) higher than later detection, which provides two benefits: it rewards earlier detection of anomalies with a higher true positive score, and it ramps down the punishment of false positives right after a Ground Truth anomaly window, so that a slightly late detection of an anomaly is less harmful.

Scoring Weights

The Scoring Weights, stored in the Application Profile (config/profiles.json), are used in the scoring function to customize the values of correct and incorrect anomaly detections. The value of each Scoring Weight in the default Application Profile is 1. The user can vary the Scoring Weights by using different Application Profiles, which assign values to the binary classification outcomes:
- True positive: correctly identified
- True negative: correctly rejected
- False positive: incorrectly identified (type I error)
- False negative: incorrectly rejected (type II error)
True Positive weight (TP): the theoretical maximum number of points given for each anomaly detected. When an anomaly is correctly labeled, the first positive data point (d) within the anomaly window is used to calculate the change in score, as follows:

Score = Score + TP * S(d)

After the first positive data point in an anomaly window, further positive data points within the Ground Truth anomaly window are ignored and neither add to nor detract from the score – i.e. each correctly identified anomaly is counted only once. The value of the True Positive weight (TP) in the default Application Profile is 1. If a user wants to emphasize the importance of correct anomaly detection during the benchmark run, s/he can increase the value of TP; this increases the amount by which the Score is incremented every time an anomaly is detected correctly. Accordingly, if the user wants to deemphasize the importance of correct anomaly detection, decreasing the value of TP decreases the amount by which the Score is incremented every time an anomaly is detected correctly.

False Positive weight (FP): the theoretical maximum number of points taken away for each false positive label. Whenever any data point is labeled as positive outside of a Ground Truth anomaly window, the score is reduced as follows (note that S(d) is negative in this case):

Score = Score + FP * S(d)

The score is reduced by this amount for each positive data point outside of a Ground Truth anomaly window. The value of the False Positive weight (FP) in the default Application Profile is 1. If a user doesn't care as much about false positives, s/he can decrease the value of FP; this decrements the Score by a lesser amount every time an anomaly is incorrectly identified. Accordingly, if the user wants to emphasize a low false positive rate (i.e. a detector that produces very few false positives), increasing the value of FP decrements the Score by a greater amount every time an anomaly is incorrectly identified.
False Negative (undetected positive) weight (FN): the fixed number of points taken away whenever an anomaly is not detected at all. If no data point inside a Ground Truth anomaly window is labeled as positive, the score is reduced as follows:

Score = Score - FN

The score is reduced by this amount once for each anomaly that is not detected at all. The value of the False Negative weight (FN) in the default Application Profile is 1. If a user wants to make it more important that the detector never miss a real anomaly, s/he can increase the value of FN, which decrements the Score by a greater amount when an anomaly is missed. If the user cares less about true anomalies being missed, decreasing the value of FN decrements the Score by a lesser amount when an anomaly is missed.

More details on the derivation of the scoring function and its relation to standard scoring metrics for algorithms are discussed in Appendix A.
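Putting the three weights together, the sketch below scores one Data File's Detection Labels against a set of anomaly windows. It is a simplified reading of the rules above – only the first positive point per window counts, every positive point outside a window is penalized, and each missed window subtracts FN – and the helper names and the position argument passed to S() are assumptions, not the NAB implementation.

```python
# Simplified sketch of scoring one Data File with the TP/FP/FN rules above.
# Windows are (start, end) index pairs; `detections` is the list of Detection
# Labels (0/1). S() is the scoring function sketched earlier.
import math

def S(t):
    return 2.0 / (1.0 + math.exp(5.0 * t)) - 1.0

def score_file(detections, windows, tp_weight=1.0, fp_weight=1.0, fn_weight=1.0):
    score = 0.0
    credited = set()                       # windows that already earned a TP
    for i, label in enumerate(detections):
        if label != 1:
            continue
        window = next(((s, e) for (s, e) in windows if s <= i <= e), None)
        if window is not None:
            if window not in credited:     # only the first positive point counts
                credited.add(window)
                # position relative to the window end, scaled to window length
                t = (i - window[1]) / max(window[1] - window[0], 1)
                score += tp_weight * S(t)
        else:
            # false positive: S() is negative after a window; as a simplification
            # we measure distance to the end of the nearest preceding window
            nearest_end = max((e for (_, e) in windows if e < i), default=None)
            t = 1.0 if nearest_end is None else (i - nearest_end) / 10.0
            score += fp_weight * S(t)
    score -= fn_weight * sum(1 for w in windows if w not in credited)
    return score

# Toy example: one window caught early, one window missed, one false positive.
print(round(score_file([0, 1, 0, 0, 0, 0, 1, 0, 0, 0], [(1, 3), (8, 9)]), 3))
```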
Application Profiles

As described above, the three Scoring Weights can be varied by the user for a particular NAB test run; these Scoring Weights are stored in the Application Profile (config/profiles.json). Along with a default Application Profile (in which all the Scoring Weights are set to 1), several example Application Profiles are provided in the NAB repository.

Application Profile #1: This application needs a detector with a very low false positive rate; it would rather miss a few "real" anomalies than get multiple false positives. That is, the motivation is to minimize type I error. The Scoring Weights in this profile are set as follows:
TP = 1 [give full credit for properly detected anomalies]
FP = 2 [decrement the Score more for any false positives]
FN = .5 [decrement the Score less for any missed anomalies]

Application Profile #2: This application needs a detector that doesn't miss any real anomalies; it would rather accept a few false positives than miss any true anomalies. That is, the motivation is to minimize type II error. The Scoring Weights in this profile are set as follows:
TP = 1 [give full credit for properly detected anomalies]
FP = .5 [decrement the Score less for any false positives]
FN = 2 [decrement the Score more for any missed anomalies]
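For illustration only, the default profile and the two example profiles above could be represented along the lines of the sketch below. The key names and the exact structure are assumptions of this sketch; the authoritative schema is the profiles.json shipped in the repository's config directory.

```python
# Hypothetical illustration of the Scoring Weights for the default profile and
# the two example Application Profiles. Key names and structure are assumptions;
# see config/profiles.json in the NAB repository for the real schema.
import json

application_profiles = {
    "default":          {"TP": 1.0, "FP": 1.0, "FN": 1.0},
    "profile_1_low_fp": {"TP": 1.0, "FP": 2.0, "FN": 0.5},   # minimize type I error
    "profile_2_low_fn": {"TP": 1.0, "FP": 0.5, "FN": 2.0},   # minimize type II error
}

print(json.dumps(application_profiles, indent=2))
```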
NAB Results

Interpreting Results

[needs descriptive text]

Reporting Results

If an NAB user wants to report the results of their detector's NAB runs and have the results posted in the NAB repository, the user should send an email to [need an email address], attaching their final score results file, their name and/or company affiliation, and a link to the custom detector and any other NAB inputs provided. The submission file should be a CSV with column headers "__", with the respective data in rows 2+. We should be able to run your detector and produce the same results file. All results will be reviewed before posting. It is possible to have results posted without including links to the NAB inputs, but this fact will be noted on the leaderboard.

Score Leaderboard
The NAB Score Leaderboard is published in the NAB repository, as leaderboard.yml, in the ?? directory.
- Need a "committee" to approve new labelers & labeling results, and new results – who is this? Ideally we can recruit a couple of other organizations. Agreed.
Glossary
- Benchmark Dataset: Consisting of a number of Data Files, this is the fixed dataset used to test the anomaly detection algorithms.
- Data File: Any one of a set of time-sequence data files in a specific format, chosen to be part of the Benchmark Dataset. While some of these Data Files contain simulated data, many are taken from real-life situations.
- Detector Under Test (DUT): The anomaly detector that is being tested by the benchmark.
- Detector Phase: First phase of a NAB test run.
- Threshold Optimization Phase:
- Scoring Phase:
- Raw Anomaly Score: Floating point value between 0 and 1 that is the output of the DUT for each data point in the Benchmark Dataset. A score closer to 1 indicates that the DUT considers the data point more likely to be an anomaly.
- Anomaly Windows:
- Detection Labels: Data points in a results file that show where the DUT detected an anomaly (binary 1 or 0).
- Detection Windows:
- Final Scores: Is this a good name?
- Results File: File used by NAB to record the results of a benchmark run. NAB stores anomaly scores, detections and scores in this file at different points in the benchmark test run. This file can also be used as input if the user wants to enter the benchmark test at different points.
- Application Profile: A set of variables that can be set by the user for any particular NAB test run; the Threshold Optimizer uses these variables to identify the actual Detections from the Anomaly Windows in the results file. A default profile is used for runs where the user doesn't set these variables.
- Labeling File: Human-generated labels for anomalies in the Benchmark Dataset, used to compute the Ground Truth File, which is the reference the DUT is tested against.
- Combined Labeling File:
- Labeler: Person who labels the Data Files in the Benchmark Dataset by hand, and provides a Labeling File.
- Ground Truth File: File containing the Anomaly Windows indicating the presence of anomalies in the Benchmark Dataset, against which the DUT is scored.
- Data Visualizer: Visualizer for the Data Files in the Benchmark Dataset, used by Labelers (or anyone else who wants to view the input data).
Appendices
A. Scoring metrics for algorithms
B. Overview of anomalies represented in the Benchmark Dataset
C. Labeling process description
D. Dataset visualizer
E. Approval process description
   i. New labelers/labeling files
   ii. Result Leaderboard postings
F. Scoring Examples
Appendix A: Scoring metrics for algorithms

Validating the Scoring Function

An anomaly detector is a binary classifier, where each step in time is labeled anomalous or not. Metrics to evaluate such classifiers arise from the resulting counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

                          Ground Truth
                          TRUE      FALSE
Classifier   POSITIVE     TP        FP
Result       NEGATIVE     FN        TN

These values are used to calculate "precision" and "recall". Precision is defined as the number of true positives divided by the total number of samples labeled positive (i.e. the sum of true positives and false positives). Recall is defined as the number of true positives divided by the total number of relevant samples (i.e. the sum of true positives and false negatives). Perfect precision means every result retrieved was relevant (i.e. a correct anomaly), while perfect recall means every relevant sample (i.e. every correct anomaly in the data) was retrieved. High precision guards against type I errors; a precise anomaly detection algorithm will perform well under NAB Application Profile #1. High recall, on the other hand, guards against type II errors; an anomaly detector with high recall will perform well under NAB Application Profile #2.

A more robust evaluation metric considers both precision and recall. Both of these metrics are valuable in the binary classification task, so the "F1 score" takes the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)

How does the NAB scoring function relate to the F1 score? [needs descriptive text]
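As a quick sanity check of these definitions, the small sketch below computes precision, recall and the F1 score from raw TP/FP/FN counts; the counts used in the example are invented purely for illustration.

```python
# Precision, recall and F1 from raw classification counts.
# The example counts below are made up for illustration only.

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# e.g. 8 anomalies correctly found, 2 false alarms, 4 anomalies missed
print(f1_score(tp=8, fp=2, fn=4))   # -> 0.727...
```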
Appendix B: Overview of anomalies represented in the Benchmark Dataset

Figures 1A, 1B and 1C below show artificially generated data streams, each containing an example of a simple anomaly. Each of these anomalies is imposed, between 11 Apr and 12 Apr, on a regular, normal streaming data pattern of a slightly noisy square wave; you can easily see what the anomaly is in each of these cases. An anomaly detector must recognize the deviation from the normal pattern as soon as possible, and then recognize the return to the normal square wave pattern – i.e. by not labeling the return to normal as another anomaly.
Figure 1A
Figure 1B
Figure 1C
Figure 2 shows another artificially generated data stream, but with a slightly more subtle anomaly. The normal pattern is two spikes followed by a long, quiet period. Around 08 Apr there are four spikes; the anomaly detector must be able to recognize that this is different from normal as the third spike occurs, and then recognize a return to normal after the fourth spike.
Figure 2

Figure 3 is a real-world data sequence, representing CPU utilization on a server cluster. The data is fairly noisy, but you can see two apparent anomalies: one is a single downward spike just before the 16 Apr mark, and the second is a more dramatic drop right after the 16 Apr mark. Since these two changes in the data are relatively close to each other in time, they are labeled as a single anomaly in the Ground Truth.
Figure 3
Figure 4 shows an anomaly in seemingly random data. On 22 Feb there is a very tall spike in the data, followed by a long period of near-zero data. The near-zero data pattern should soon be learned as a new normal, and no longer flagged as an anomaly. When the pattern of spikes starts up again on 25 Feb, this pattern should be recognized as the previous normal, and not flagged as an anomaly.
Figure 4

Figure 5 shows anomalies in a very noisy data stream. At the 19 Feb mark, a much higher data spike is seen, followed by a lower overall noisy pattern, which should be seen as a new normal within a small window. Then, right before 25 Feb, several lower spikes are seen, followed by a very high spike, and finally a new, much lower normal pattern. These two sets of events are far enough apart in time that they are considered to be two separate anomalies.
Figure 5
Appendix F: Scoring Examples

[these still need to be edited]
Example 1
Figure 2. Here, the detector labels are marked in red (false positive) and green (true positive). Notice how only the relaxed windows remain in this diagram; the original windows no longer matter once the relaxed windows are calculated.

Within the window of the first anomaly, there are two records labeled as anomalous. In our scoring system the second label is ignored, and the score is increased only by the first positive record within the window. This increases the score by:

S(r) * TP

Each false positive receives a negative score. This decreases the score by:

Σr S(r) * FP

The second anomaly does not have a single true positive within its window. This decreases the score by:

FN * length(anomaly 2) / length(dataset)

Score = S(r_TP[1]) * TP - Σr S(r_FP) * FP - FN * ( length(anomaly[2]) / length(dataset) )
Example 2
Figure 3. Again, the detector labels are marked in red (false positive) and green (true positive). Here is an example output of a very sensitive detector.
If a sensitive detector labels many records as anomalous, the score will be quite negative, because every false positive decreases the score. The score looks like this:

Score = S(r_TP[1]) * TP + S(r_TP[2]) * TP - Σr S(r_FP) * FP
Example 3
Figure 4. Here we compare two sets of true positives: those created by detector 1 and those created by detector 2. Because detector 1 consistently catches anomalies earlier than detector 2, its score will be higher:

S(r_detector 1) > S(r_detector 2)