Please note this is a draft under revision.

The Numenta Anomaly Benchmark (NAB)

Introduction

Much of real world data is streaming; this means that the data not only changes over time, but it has meaning over time: the order of the data points matters. Detecting anomalies in streaming data is a difficult task, because the detector must process data in real time rather than making many passes through a large batch file, and it must learn the high level patterns and sequences as it goes. Detecting anomalies in streaming data can nevertheless be extremely valuable in many domains, such as IT security, finance, vehicle tracking, health care, energy grid monitoring, and e-commerce; essentially in any application where sensors produce important data that changes over time.

Measuring and comparing the efficacy of these streaming data anomaly detectors is also a difficult task. First, since most anomaly detectors are focused on a specific domain, they don't share a common, more generalized data set, which makes it hard to compare detectors. Second, most data sets are synthetic, which doesn't provide realistic measurements of how a detector performs in the real world [reference "Systematic Construction of Anomaly Detection Benchmarks from Real Data", Aug 2013]. The Numenta Anomaly Benchmark (NAB) attempts to provide a controlled and repeatable environment, and tools, to test and measure different anomaly detection algorithms on streaming data. To our knowledge there are no existing benchmarks that adequately test the efficacy of online anomaly detectors. The motivation of NAB is to provide a framework with which we can compare and evaluate different algorithms for detecting anomalies in streaming data.

NAB is designed to test features of streaming data anomaly detectors that will be valuable in practical applications. Thus, in order to score well on NAB, anomaly detection algorithms should:
- Run in unsupervised mode.
- Not have any dataset-specific tuning.
- Perform continuous or online learning.
- Process real time data only (and not depend on look-ahead).

This paper describes the main processes, inputs and outputs of NAB, and the motivation behind the choices made in the benchmark design. More details on NAB and on anomaly detection can be found in a TBD journal publication and in Numenta's Science of Anomaly Detection whitepaper, respectively.

Anomaly Detection

What is an anomaly?
An "anomaly" is defined as a deviation from what is standard, normal, or expected. In streaming data, "normal" can be described as a high order pattern or sequence which can be recognized over time. When this pattern changes, i.e. the data behaves in unexpected ways, this can be characterized as an "anomaly". If the data then stabilizes in a new pattern, this behavior soon becomes "normal", since over time the new patterns can be discerned and learned.

There are many types of anomalies, and an almost infinite variety of normal behavior, which makes anomaly detection difficult. In static data there are spatial anomalies, or deviations from normal in space. In streaming data there are temporal anomalies, or deviations from normal in time. There are anomalies in seemingly random data, and there are sudden change anomalies that define a new normal. Anomalies can be both positive and negative, i.e. an increase or decrease, respectively, in the data metric of interest.

An anomaly detector accepts data input and outputs items, events, or observations which do not conform to the identified pattern(s). These patterns can be either global or local in the dataset. As discussed above, we are specifically concerned with detecting anomalies in real time, streaming data; i.e. an online anomaly detector.

What is the Numenta Anomaly Benchmark?
NAB consists of an open source repository, available under the GNU GPL v3 license, containing labeled benchmark data, code, file format descriptions, examples and documentation for the Numenta Anomaly Benchmark.

The ideal anomaly detector is both accurate and fast. The ideal detector:
- Detects all anomalies present in the streaming data.
- Detects anomalies as soon as possible, ideally before the anomaly becomes visible to a human.
- Triggers no false alarms (no false positives).
- Works with real world data.

One important aspect of NAB is a scoring methodology that is designed for streaming applications. The NAB scoring system quantifies the degree to which the Detector Under Test (DUT) meets the above ideal standards. NAB compares the DUT's detections against the Ground Truth File, which is a combination of the individual Labeling Files (see the description of Ground Truth creation below), using a programmatic scoring function (also described below). In addition, we introduce the concept of "application profiles", which enables us to vary the relative costs of true positives, false positives and false negatives.

A second important aspect of NAB is that the Benchmark Dataset includes both artificial and real world data. The Data Files in the initial Benchmark Dataset include server metrics, machine sensor readings and temperature readings. The goal is to add other real world Data Files to the Benchmark Dataset: financial data, security settings, positional data, etc.

NAB Details

Descriptions of Algorithms Included With NAB
At alpha release, three different detector algorithms are included in the NAB repository. The user can experiment with these detectors, or create new detectors to run through NAB.

- HTM-based detector: The Numenta detector, based on Hierarchical Temporal Memory (HTM), is included in the NAB code repository. The Numenta detector doesn't need training sets, automatically builds models for any number of metrics, and performs continuous learning on streaming datasets. Refer to https://github.com/numenta/nupic/wiki/Anomaly-Detection-and-Anomaly-Scores for more information. The NAB repository documentation includes instructions on how a user can do NAB test runs with the Numenta detector and replicate the results.

- Etsy/Skyline detector: Skyline is an open source, real-time anomaly detection system, built to enable passive monitoring of hundreds of thousands of metrics, without the need to configure a model or thresholds for each one. It is designed to be used wherever there is a large quantity of high-resolution time series data which needs constant monitoring. Refer to https://github.com/etsy/skyline for more information. The NAB repository documentation includes instructions on how a user can do NAB test runs with the Skyline detector and replicate the results.

- Random detector: provided as a trivial baseline for comparison.

Benchmark Dataset Overview
The NAB Benchmark Dataset contains the streaming data that detectors use as input during a benchmark test run. The Benchmark Dataset consists of a number of individual Data Files; each Data File represents a sequence of data points over time that is interesting and potentially challenging for an anomaly detector. At the alpha release the Benchmark Dataset contains thirty-two (32) individual Data Files. Five (5) of these Data Files contain no anomalies, and instead represent patterns of data over time; anomaly detectors should not find anomalies in these files. The other twenty-seven (27) Data Files each contain one or more anomalies. Some of the Data Files are simulated, to create simple anomalies in clear conditions, and others are taken from real world situations. The NAB repository documentation covers all of these Data Files, and Appendix B describes the types of anomalies contained in the Benchmark Dataset.

NAB Ground Truth Creation
In order to score a DUT's anomaly detections, NAB needs a reference, or "Ground Truth", to score against. The Ground Truth for the Benchmark Dataset has been created in the following three steps (an illustrative sketch of the combining and relaxation steps appears at the end of this section):

1. A number of human Labelers [reference to list of names] have labeled the Benchmark Dataset. Each individual Labeler read through the published labeling guidelines [reference these here], then looked through the Data Files in the Benchmark Dataset and recorded, in the specified file format, the time windows that s/he thought contained anomalies.

2. These multiple Labeling Files are then combined into one Labeling File, using the following algorithm:
   a. A parameter is passed into the label combining function which describes the level of agreement needed between individual Labelers before a particular point in a Data File is labeled as anomalous in the combined file. An Agreement Parameter of 1 indicates that 100% of all Labelers must agree on a data point being labeled an anomaly in the combined label file; lower values make the final label less reliant on unanimous agreement. For instance, an Agreement Parameter of 0.5 means that only 50% of the Labelers would need to agree in order for the data point within the Data File to be labeled as an anomaly.
   b. The combined labels are then converted to "Combined Anomaly Windows", each indicated by a beginning and ending timestamp. This is necessary because the majority of anomalies are not point anomalies; rather, they occur over multiple time steps. Anomaly Windows are a less verbose representation of the anomaly labels for individual points; the NAB code uses both representations for different functions, so for brevity we only refer to the Anomaly Windows in the rest of this paper.

3. The human labeling may be imprecise, even with our formalized process and guidelines, thus NAB also introduces the concept of a "Relaxed Anomaly Window". The Relaxed Anomaly Window allows the DUT to not be penalized during scoring if its anomaly detections are slightly before or after where the Ground Truth indicates the anomaly. We want these windows to be large enough to allow early detection of anomalies, but not so large as to artificially boost the score of inaccurate detectors. The Relaxed Anomaly Windows are calculated using the following algorithm:
   a. The total amount of "relaxation" in a single Data File was chosen to be 10% of the Data File length. For instance, if a Data File contains 4000 data points, then the total amount of relaxation shared by all anomaly windows in the Data File is 0.10 * 4000, or 400 data points.
   b. The total relaxation amount for a Data File is then divided by the total number of anomaly windows in the Data File, and the resulting number of data points is added to each anomaly window: half before the start of the window, and half at the end of the window. The resulting, larger window is called the "Relaxed Anomaly Window". For instance, if the Data File above has four anomalies in it (after the label combination is done), then each anomaly window is relaxed by 400/4, or 100 data points (50 data points in each direction).

Because we assume anomalies are relatively sparse and exist as windows, we can relax the windows without the issue of overlap. The relaxation is directly proportional to the number of anomalies in a given dataset. The Relaxed Window sizing method intends to give the benefit of the doubt to the detector, but only to a limited extent, so that false positives do not count as true positives. The method works well because anomalies are rare. The 10% parameter is validated because, at this level, we see the largest deviation of the real detectors from the random detector.

This phase doesn't happen during every NAB test run; the programs in this phase were run once before the release of NAB. The code that created the Ground Truth, the individually labeled files, and all the Data Files are available in the NAB repository.

Ground Truth Creation Inputs
Ground Truth Creation uses the Benchmark Dataset and the individual Labeling Files as inputs.

Ground Truth Creation Outputs
The relaxed window timestamps resulting from step 3 in this phase are stored in the ground_truth_labels.json file, located in the /labels directory of the NAB repository. This file is used by the NAB optimization and scoring phases as the Ground Truth.
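To make the combining and relaxation steps concrete, the Python sketch below implements them under simple assumptions (per-labeler binary flags and integer indices instead of timestamps). The function names and data structures are illustrative only; they are not the NAB repository's actual Ground Truth creation code.

```python
# Illustrative sketch of the label combining and window relaxation steps
# described above. Names and data structures are hypothetical; see the NAB
# repository for the actual implementation.

def combine_labels(labeler_flags, agreement=0.5):
    """Combine per-labeler binary flags (one list per Labeler, one flag per
    data point) into a single combined label per data point."""
    num_labelers = len(labeler_flags)
    num_points = len(labeler_flags[0])
    combined = []
    for i in range(num_points):
        votes = sum(flags[i] for flags in labeler_flags)
        combined.append(1 if votes / float(num_labelers) >= agreement else 0)
    return combined


def to_windows(combined):
    """Convert point labels into (start_index, end_index) anomaly windows."""
    windows, start = [], None
    for i, flag in enumerate(combined):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            windows.append((start, i - 1))
            start = None
    if start is not None:
        windows.append((start, len(combined) - 1))
    return windows


def relax_windows(windows, num_points, relaxation_fraction=0.10):
    """Grow each window by an equal share of the total relaxation budget,
    half before the start and half after the end."""
    if not windows:
        return []
    total_relaxation = int(relaxation_fraction * num_points)
    half = (total_relaxation // len(windows)) // 2
    return [(max(0, start - half), min(num_points - 1, end + half))
            for start, end in windows]


# Example: 4000 data points, two Labelers who both flag points 1000-1019,
# and an Agreement Parameter of 0.5.
labeler_flags = [[0] * 4000 for _ in range(2)]
for flags in labeler_flags:
    for i in range(1000, 1020):
        flags[i] = 1
windows = to_windows(combine_labels(labeler_flags, agreement=0.5))
print(relax_windows(windows, num_points=4000))  # [(800, 1219)]: 200 points added on each side
```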

NAB Test Process
The NAB test process consists of three phases:
- Phase 1 - Detector Phase: The DUT processes the Data Files, and records the Raw Anomaly Score for each data point in the Data Files. This is a real valued number between 0 and 1.
- Phase 2 - Threshold Optimization Phase: The DUT's Raw Anomaly Scores are optimized to find the highest scoring threshold that causes a data point to be tagged as an anomaly; these optimized "Detections" are recorded for each data point in the Data Files.
- Phase 3 - Scoring Phase: The DUT's Detections are scored with respect to the Ground Truth. Final scores are recorded for each Data File.

A user can enter the NAB test process at the beginning of any one of the three phases, as long as the user has the correct input data format. Following is a more detailed description of each phase.

Phase 1: Detector Phase
During the Detector Phase, the DUT processes each Data File in the Benchmark Dataset, taking the data points in sequence order. The first 15% of each Data File is called the "probationary period", and no scoring is done in this time window. The minimum and maximum value of each Data File is available to the detector before processing begins; in most real-world applications the minimum and maximum values are known. No other information about the Data File is given, and no look-ahead is allowed. For each point in the Data File, the DUT must calculate a "Raw Anomaly Score", which is a floating point number between 0 and 1; the closer the number is to 1, the more likely this point is to be an anomaly. These Raw Anomaly Scores are stored, each with its appropriate time stamp, in a specific output file for each Data File, in the results folder.

The NAB repository includes instructions on how to do a test run with the detectors included in the repository, and instructions on how to integrate a custom detector into NAB.

Detector Phase Inputs:
The inputs into this phase are the detector (DUT) and the Benchmark Dataset. The user can feed an optional, custom detector into the Detector Phase by creating a subclass of the base class in base.py in the nab/detectors directory. To help, skeleton code is included in the detectors directory as detector_skeleton.py; a minimal illustrative example is also sketched below, after the phase outputs.

Detector Phase Outputs:
The DUT's Raw Anomaly Scores are written into files named detectorName_dataFileName.csv; these files are located in the /results directory.
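As a concrete illustration of the custom detector integration described above, here is a minimal toy detector in Python. The class and method names are assumptions made for illustration; the actual interface a custom detector must implement is defined by base.py (and demonstrated by detector_skeleton.py) in the repository.

```python
# Toy detector sketch. The real base class and its required methods live in
# nab/detectors/base.py; the names used here are illustrative assumptions.

class RunningMeanDetector:
    """Emits a Raw Anomaly Score in [0, 1] for each data point, based on how
    far the value deviates from an online running mean, normalized by the
    known min/max range of the Data File."""

    def __init__(self, input_min, input_max, probationary_period):
        self.input_min = input_min              # provided by NAB up front
        self.input_max = input_max              # provided by NAB up front
        self.probationary_period = probationary_period
        self.records_seen = 0
        self.running_mean = 0.0

    def handle_record(self, value):
        """Process one data point (no look-ahead) and return its raw score."""
        self.records_seen += 1
        # Continuous (online) learning: update the running mean incrementally.
        self.running_mean += (value - self.running_mean) / self.records_seen
        if self.records_seen <= self.probationary_period:
            return 0.0  # output during the probationary period is not scored
        value_range = max(self.input_max - self.input_min, 1e-9)
        deviation = abs(value - self.running_mean) / value_range
        return min(1.0, deviation)
```

Each returned score would then be written, together with its timestamp, into the corresponding detectorName_dataFileName.csv results file.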

Phase 2: Threshold Optimization Phase
The outputs of the Detector Phase are files (one for each Data File) that contain floating point Raw Anomaly Scores (between 0 and 1) for every data point in the Benchmark Dataset. The next step in the NAB process is to threshold these Raw Anomaly Scores into discrete values indicating the presence or absence of an anomaly. We constrain the detectors to use a single detection threshold for the entire Dataset. For convenience, NAB includes a simple hill-climbing routine for finding the optimal threshold value given the scoring rules. The chosen threshold (the one that resulted in the optimal score) is then applied to the Raw Anomaly Scores, resulting in "Detection Labels" for each timestamp, where a binary 1 represents the presence of an anomaly and a 0 represents the absence of an anomaly. These Detection Labels are written into the same output files from Phase 1.

Threshold Optimization Phase Inputs:
The inputs into this phase are the DUT's Raw Anomaly Scores, the Ground Truth files, and the Application Profile. Note that a user can enter the NAB test run at the beginning of the Threshold Optimization Phase (Phase 2) by supplying the detectorName_dataFileName.csv files, populated with Raw Anomaly Scores. File formats are described in detail in the NAB repository.

Threshold Optimization Phase Outputs:
The DUT's Detection Labels are written into files named detectorName_dataFileName.csv; these files are located in the /results directory.
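The sketch below conveys the idea behind this phase: a single threshold is applied to every Raw Anomaly Score, and the threshold is chosen to maximize the benchmark score. It uses a naive grid search purely for illustration (NAB itself uses a hill-climbing routine), and the function names and the score_fn callback are assumptions rather than the repository's actual code.

```python
# Illustrative sketch: converting Raw Anomaly Scores into binary Detection
# Labels with a single threshold shared across all Data Files.

def apply_threshold(raw_scores, threshold):
    """Label a data point anomalous (1) when its raw score meets the threshold."""
    return [1 if score >= threshold else 0 for score in raw_scores]


def find_best_threshold(raw_scores_per_file, score_fn, candidates=None):
    """Pick the single threshold that maximizes the benchmark score returned
    by score_fn(labels_per_file), a callable supplied by the scoring phase."""
    if candidates is None:
        candidates = [i / 100.0 for i in range(101)]  # 0.00, 0.01, ..., 1.00
    best_threshold, best_score = None, float("-inf")
    for threshold in candidates:
        labels_per_file = {
            name: apply_threshold(scores, threshold)
            for name, scores in raw_scores_per_file.items()
        }
        total = score_fn(labels_per_file)
        if total > best_score:
            best_threshold, best_score = threshold, total
    return best_threshold, best_score
```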

Phase 3: Scoring Phase
In the Scoring Phase, NAB assigns a final score to the DUT's Detection Labels in the results file; each Data File is scored separately. The Scoring Phase uses a Scoring Function and three different Scoring Weights to calculate the DUT's final score. This is discussed in further detail in the Scoring section below.

Scoring Phase Inputs:
The inputs into this phase are the Anomaly Windows file (produced once the Threshold Optimization Phase is complete, containing the DUT's Detection Windows), the Ground Truth file, and the Application Profile. Note that a user can enter the NAB test run at the beginning of the Scoring Phase (Phase 3) by supplying an Anomaly Windows file, in the correct format, that is populated with Detection Windows. File formats are described in detail in the NAB repository.

Scoring Phase Outputs:
The scores are stored in a file called detectorname_scores.csv, in the results directory.

Scoring

Scoring Function
Final scores for a DUT are based on a Scoring Function, S, that quantifies how good or bad it is for the DUT to label a given data point in the Benchmark Dataset as an anomaly (where the presence of an anomaly is represented by the binary value 1). Given a data point at time t, S(t) = 1 indicates an optimal location to label a data point as an anomaly, and S(t) = -1 indicates the opposite. S(t) ranges from -1 to 1:

S(t) = 2 / (1 + e^(5t)) - 1
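The scoring function above can be transcribed directly into Python; the function name below is illustrative only.

```python
import math

def scoring_function(t):
    """Scaled sigmoid S(t) = 2 / (1 + e^(5t)) - 1, ranging from -1 to 1.
    Early positions (negative t) approach +1; late positions approach -1."""
    return 2.0 / (1.0 + math.exp(5.0 * t)) - 1.0

# Earlier detections relative to an anomaly window earn higher values.
for t in (-1.0, -0.2, 0.0, 0.2, 1.0):
    print(t, round(scoring_function(t), 3))
```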


Figure 1. NAB scoring function [needs some revision to simplify]

In Figure 1, the shaded areas represent the Ground Truth anomaly windows; these are the time windows in which the DUT needs to detect an anomaly in order to get a good score. The dark line represents the Scoring Function; the value of the Detection Label at a data point (either 1 or 0) is multiplied by this Scoring Function. Note that the Scoring Function weighs early detection of anomalies (within the anomaly window) higher than later detection, which provides two benefits: it rewards earlier detection of anomalies with a higher true positive score, and it ramps down the punishment of false positives right after a Ground Truth anomaly window, so that a slightly late detection of an anomaly is less harmful.

Scoring Weights. The Scoring Weights, stored in the Application Profile (config/profiles.json), are used in the scoring function to customize the values of correct and incorrect anomaly detections. The value of each Scoring Weight in the default Application Profile is 1. The user can vary the Scoring Weights by using different Application Profiles, which assign values to the binary classification outcomes:
- True positive: correctly identified
- True negative: correctly rejected
- False positive: incorrectly identified (type I error)
- False negative: incorrectly rejected (type II error)

True Positive weight (TP): the theoretical maximum number of points given for each anomaly detected. When an anomaly is correctly labeled, the first positive data point (d) within the anomaly window is used to calculate the change in score, as follows:

Score = Score + TP · S(d)

After the first positive data point in an anomaly window, further data points within the Ground Truth anomaly window are ignored and do not add to or detract from the score; i.e. each correctly identified anomaly is only counted once.

The value of the True Positive weight (TP) in the default Application Profile is 1. If a user wants to emphasize the importance of correct anomaly detection during the benchmark run, then s/he can increase the value of TP. This increases the amount by which the Score is incremented every time an anomaly is detected correctly. Accordingly, if the user wants to deemphasize the importance of correct anomaly detection during the benchmark run, then decreasing the value of TP decreases the amount by which the Score is incremented every time an anomaly is detected correctly.

False Positive weight (FP): the theoretical maximum number of points taken away for each false positive label. Whenever any data point is labeled as positive outside of a Ground Truth anomaly window, the score is reduced as follows (note that S(d) will be negative in this case):

Score = Score + FP · S(d)

The score is reduced by this amount for each positive data point outside of a Ground Truth anomaly window.

The value of the False Positive weight (FP) in the default Application Profile is 1. If a user doesn't care as much about false positives, then s/he can decrease the value of FP; this decrements the Score by a lesser amount every time an anomaly is incorrectly identified. Accordingly, if the user wants to emphasize a low false positive rate (i.e. they want a detector that produces very few false positives), then increasing the value of FP decrements the Score by a greater amount every time an anomaly is incorrectly identified.

False Negative (undetected positive) weight (FN): the fixed number of points taken away whenever an anomaly is not detected at all. If no data point inside a Ground Truth anomaly window is labeled as positive, then the score is reduced as follows:

Score = Score − FN

The score is reduced by this amount once for each anomaly which is not detected at all.

The value of the False Negative weight (FN) in the default Application Profile is 1. If a user wants to make it more important that the detector never misses a real anomaly, then s/he can increase the value of FN, which decrements the Score by a greater amount when an anomaly is missed. If the user cares less about true anomalies being missed, then decreasing the value of FN decrements the Score by a lesser amount when an anomaly is missed.

More details on the derivation of the scoring function and its relation to standard algorithm scoring metrics are discussed in Appendix A.
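The following sketch pulls the three weight rules above together into a single per-file scoring loop. The profile dictionary stands in for the Scoring Weights read from config/profiles.json; its structure here, and the way the scoring function S is passed in, are illustrative assumptions rather than the repository's actual code.

```python
# Illustrative per-Data-File scoring loop using the TP/FP/FN rules above.

DEFAULT_PROFILE = {"TP": 1.0, "FP": 1.0, "FN": 1.0}

def score_data_file(detection_indices, relaxed_windows, S, profile=DEFAULT_PROFILE):
    """detection_indices: indices the DUT labeled anomalous (Detection Label = 1).
    relaxed_windows: Ground Truth Relaxed Anomaly Windows as (start, end) pairs.
    S: scoring function evaluated at a data point index; positive inside a
    window (higher near its start) and negative outside."""
    score = 0.0
    credited = set()
    for d in sorted(detection_indices):
        window = next(((s, e) for s, e in relaxed_windows if s <= d <= e), None)
        if window is None:
            score += profile["FP"] * S(d)   # S(d) < 0 outside a window, so this subtracts
        elif window not in credited:
            score += profile["TP"] * S(d)   # only the first hit per window counts
            credited.add(window)
    # Fixed penalty for every window with no detection at all.
    score -= profile["FN"] * sum(1 for w in relaxed_windows if w not in credited)
    return score
```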

Application Profiles. As described above, the three Scoring Weights can be varied by the user for a particular NAB test run; these Scoring Weights are stored in the Application Profile (config/profiles.json). Along with a default Application Profile (in which all the Scoring Weights are set to 1), several example Application Profiles are provided in the NAB repository.

Application Profile #1: This application needs a detector that has a very low false positive rate; it would rather miss a few "real" anomalies than get multiple false positives. That is, the motivation is to minimize type I error. The Scoring Weights in this profile are set as follows:
  TP = 1 [give full credit for properly detected anomalies]
  FP = 2 [decrement the Score more for any false positives]
  FN = 0.5 [decrement the Score less for any missed anomalies]

Application Profile #2: This application needs a detector that doesn't miss any real anomalies; it would rather accept a few false positives than miss any true anomalies. That is, the motivation is to minimize type II error. The Scoring Weights in this profile are set as follows:
  TP = 1 [give full credit for properly detected anomalies]
  FP = 0.5 [decrement the Score less for any false positives]
  FN = 2 [decrement the Score more for any missed anomalies]

NAB Results

Interpreting Results
[needs descriptive text]

Reporting Results
If an NAB user wants to report the results of their detector's NAB runs and have these results posted in the NAB repository, the user should send an email to [need a [email protected] email address], attaching their final score results file, their name and/or company affiliation, and a link to the custom detector and any other NAB inputs provided.

The submission file should be a CSV with column headers "__", with the respective data in rows 2+. We should be able to run your detector and produce the same results file. All results will be reviewed before posting. It is possible to have results posted without including links to the NAB inputs, but this fact will be noted on the leaderboard.

Score Leaderboard

The NAB Score Leaderboard is published in the NAB repository, as leaderboard.yml, in the ?? directory.

- [Open question: we need a "committee" to approve new labelers, labeling results, and new results. Who is this? Ideally we can recruit a couple of other organizations. Agreed.]

Glossary
- Benchmark Dataset: Consisting of a number of Data Files, this is the fixed dataset used to test the anomaly detection algorithms.
- Data File: Any one of a series of time sequence data files in a specific format, chosen to be part of the Benchmark Dataset. While some of these Data Files contain simulated data, many are taken from real life situations.
- Detector Under Test (DUT): The anomaly detector that is being tested by the benchmark.
- Detector Phase: First phase of a NAB test run, in which the DUT produces Raw Anomaly Scores for the Benchmark Dataset.
- Threshold Optimization Phase: Second phase of a NAB test run, in which a single detection threshold is chosen and applied to the Raw Anomaly Scores to produce Detection Labels.
- Scoring Phase: Third phase of a NAB test run, in which the Detection Labels are scored against the Ground Truth.
- Raw Anomaly Score: Floating point value between 0 and 1 that is the output of the DUT for each data point in the Benchmark Dataset. A score closer to 1 indicates that the DUT considers the point more likely to be an anomaly.
- Anomaly Windows: Time windows, each defined by a beginning and ending timestamp, that mark where anomalies occur in a Data File.
- Detection Labels: Data points in a results file that show where the DUT detected an anomaly (binary 1 or 0).
- Detection Windows: Anomaly windows derived from the DUT's Detection Labels, used as input to the Scoring Phase.
- Final Scores: [Is this a good name?]
- Results File: File that is used by NAB to write the results of a benchmark run. NAB stores anomaly scores, detections and scores in this file at different points in the benchmark test run. This file can also be used as input if the user wants to enter the benchmark test at different points.
- Application Profile: A set of variables that can be set by the user for any particular NAB test run; the Threshold Optimizer uses these variables to identify the actual Detections from the Anomaly Windows in the results file. A default profile is used for runs where the user doesn't set these variables.
- Labeling File: Human generated labels for anomalies in the Benchmark Dataset, used to compute the Ground Truth File, which is the measure the DUT is tested against.
- Combined Labeling File: The single Labeling File produced by combining the individual Labeling Files according to the Agreement Parameter.
- Labeler: Person who labels the Data Files in the Benchmark Dataset by hand, and provides a Labeling File.
- Ground Truth File: File containing the Anomaly Windows, indicating the presence of anomalies in the Benchmark Dataset, that the DUT will be scored against.
- Data Visualizer: Visualizer for the Data Files in the Benchmark Dataset, used by Labelers (or anyone else who wants to view the input data).

Appendices
A. Scoring metrics for algorithms
B. Overview of anomalies represented in the Benchmark Dataset
C. Labeling process description
D. Dataset visualizer
E. Approval process description
   i. New labelers / labeling files
   ii. Result Leaderboard postings
F. Scoring examples

 

Appendix A: Scoring algorithms

Validating the Scoring Function
An anomaly detector is a binary classifier, where each step in time is labeled anomalous or not. Metrics to evaluate such classifiers arise from the resulting counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

                              Ground Truth
                              TRUE      FALSE
Classifier      POSITIVE      TP        FP
Result          NEGATIVE      FN        TN

These values are used to calculate "precision" and "recall". Precision is defined as the number of true positives divided by the total number of samples labeled positive by the classifier (i.e. true positives plus false positives). Recall is defined as the number of true positives divided by the total number of actual positive samples (i.e. true positives plus false negatives). Perfect precision means every result retrieved was relevant (i.e. a correct anomaly), while perfect recall means every relevant sample (i.e. every correct anomaly in the data) was retrieved.

High precision guards against type I errors; a precise anomaly detection algorithm will perform well under NAB Application Profile #1. High recall, on the other hand, guards against type II errors; an anomaly detector with high recall will perform well under NAB Application Profile #2.

A more robust evaluation metric considers both precision and recall: the "F1 score". Both of these metrics are valuable in the binary classification task, so the F1 score is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall)

How does the NAB scoring function relate to the F1 score?
[needs descriptive text]
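For reference, the definitions above translate directly into code; the example counts below are arbitrary.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and the F1 score from raw counts."""
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 8 anomalies correctly detected, 2 false alarms, 2 anomalies missed.
print(precision_recall_f1(tp=8, fp=2, fn=2))  # (0.8, 0.8, 0.8)
```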

Appendix B: Overview of anomalies represented in the Benchmark Dataset

Figures 1A, 1B and 1C below show artificially generated data streams, each containing an example of a simple anomaly. Each of these anomalies is imposed between 11Apr and 12Apr on a regular, normal streaming data pattern of a slightly noisy square wave; you can easily see what the anomaly is in each of these cases. An anomaly detector must recognize the deviation from the normal pattern as soon as possible, and then recognize the return to the normal square wave pattern, i.e. by not labeling the return to normal as another anomaly.

Figure 1A

Figure 1B

Figure 1C

Figure 2 shows another artificially generated data stream, but with a slightly more subtle anomaly. The normal pattern is two spikes followed by a long, quiet period. Around 08Apr there are four spikes; the anomaly detector must be able to recognize that this is different from normal as the third spike occurs, and then recognize a return to normal after the fourth spike.

Figure 2

Figure 3 is a real world data sequence, representing CPU utilization on a server cluster. The data is fairly noisy, but you can see two apparent anomalies: one is a single downward spike just before the 16Apr mark, and the second is a more dramatic drop right after the 16Apr mark. Since these two changes in the data are relatively close to each other in time, they are labeled as a single anomaly in the Ground Truth.

Figure 3

Figure 4 shows an anomaly in seemingly random data. On 22Feb there is a very tall spike in the data, followed by a long period of near-zero data. The near-zero data pattern should soon be learned as a new normal, and no longer flagged as an anomaly. When the pattern of spikes starts up again on 25Feb, this pattern should be recognized as the previous normal, and not flagged as an anomaly.

Figure 4

Figure 5 shows anomalies in a very noisy data stream. At the 19Feb mark, a much higher data spike is seen, followed by a lower overall noisy pattern, which should be seen as a new normal within a small window. Then, right before 25Feb, several lower spikes are seen, followed by a very high spike, and finally a new, much lower normal pattern. These two sets of events are far enough apart in time that they are considered to be two separate anomalies.

Figure 5


Appendix F: Scoring Examples [these still need to be edited]

Example 1

Figure 2. Here, the detector labels are marked in red (false positive) and green (true positive).

Notice how only the relaxed windows remain in this diagram. This is because the actual windows no longer matter after the relaxed windows are calculated. Within the window of the first anomaly, there are two records labeled as anomalous. In our scoring system the second label will be ignored and the score will only be increased by the first positive record within the window. This will increase the score by:

S(r) · TP

Each false positive receives a negative score. This will decrease the score by:

Σ_r S(r_FP) · FP

The second anomaly does not have a single true positive within its window. This will decrease the score by:

FN · ( length(anomaly[2]) / length(dataset) )

Putting these together:

Score = S(r_TP[1]) · TP − Σ_r S(r_FP) · FP − FN · ( length(anomaly[2]) / length(dataset) )
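As a purely hypothetical numeric instance of this formula: suppose all three weights are 1, the first true positive earns S(r_TP[1]) = 0.9, the two false positives contribute a combined Σ_r S(r_FP) of 0.9, and the missed second anomaly window covers 200 of the Data File's 4000 points. Then Score = 0.9 − 0.9 − 1 · (200 / 4000) = −0.05.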


Example 2

Figure 3. Again, the detector labels are marked in red (false positive) and green (true positive). Here is an example of the output of a very sensitive detector.

If a sensitive detector labeled many records as anomalous, the score would be quite negative, because every false positive results in a decrease in the score. The score would look like this:

Score = S(r_TP[1]) · TP + S(r_TP[2]) · TP − Σ_r S(r_FP) · FP


Example 3

Figure 4. Here we compare two different kinds of true positives: those created by detector 1 and those created by detector 2. Because detector 1 consistently caught anomalies earlier than detector 2, its score will be higher:

S(r_detector1) > S(r_detector2)

 
