BLPA: Bayesian Learn-Predict-Adjust Method for Online Detection of Recurrent Changepoints

Alexandr Maslov∗†, Mykola Pechenizkiy∗, Yulong Pei∗, Indrė Žliobaitė§¶, Alexander Shklyaev‡, Tommi Kärkkäinen† and Jaakko Hollmén§

∗ Dept. of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands. Email: {a.maslov, m.pechenizkiy, y.pei.1}@tue.nl
† Faculty of Information Technology, University of Jyväskylä, Finland. Email: [email protected]
‡ Dept. of Mechanics and Mathematics, Lomonosov Moscow State University, Russia. Email: [email protected]
§ Dept. of Computer Science, Aalto University, Finland. Email: [email protected], [email protected]
¶ Dept. of Geosciences and Geography; Dept. of Computer Science, University of Helsinki, Finland

Abstract—Online changepoint detection is an important task for machine learning in changing environments, as it signals when the learning model needs to be updated. The presence of noise that can be mistaken for real changes makes it difficult to develop an effective approach that has a low false-alarm rate while being able to detect all changes with minimal delay. In this paper we study how the performance of popular Bayesian online detectors can be improved in the case of recurrent changes. Modelling recurrence allows us to anticipate future changepoints and predict their locations in time. We propose an approach for inducing and integrating recurrence information in the streaming setting, and demonstrate its effectiveness on synthetic and real-world human activity datasets.

I. INTRODUCTION

Online change detection is practically relevant in many domains, such as medicine, energy production, and industrial process monitoring [1]. In machine learning and data mining, change detection is often studied in the context of concept drift, which happens due to changes in the underlying data distribution over time [2]. A popular approach to handling concept drift is to monitor the data or the model performance for changes and to adapt the model using the most recent data collected after the last detected change [3].

In this paper, we consider the change detection task in a univariate time series recorded as a data stream. Further in the text we denote a univariate vector of observations either as ⟨x_i⟩_{i=1}^n or as x_{1:n}, i.e.

x_{1:n} ≡ ⟨x_i⟩_{i=1}^n ≡ ⟨x_1, …, x_n⟩.   (1)

Input to the change detector is a vector of observations ⟨x_t⟩ indexed by timestamps t ∈ T. A collection of timestamps is an ordered vector of time moments T ≡ ⟨t_1, …, t_T⟩ at which the observations were taken. We assume a constant sampling rate. A changepoint is a time moment when the statistical properties of the data stream change significantly according to predefined criteria. A changepoint is identified by the moment of time when it happened (further referred to as the time location of the change). The sequence of changes is denoted as ⟨c_i⟩_{i=1}^k ∈ T and an individual change from this sequence as c_i. The task is to detect changes online, using only the data observed until the current moment in time.

The top plot (A) in Fig. 1 illustrates an example of an input signal with three changepoints in the mean value at the moments c_{1:3} = ⟨5, 10, 14⟩. A change is usually detected with some time delay δ. The change detection task is to detect the changes c_{1:3} with as small a delay δ as possible while at the same time minimizing the number of false alarms. An event when a change is alarmed by the detector while there is actually no change is called a False Positive (FP). False positives are often caused by outliers or noise in the input signal.

While the majority of existing change detection techniques focus on individual changepoint detection and assume that changepoints are not predictable, Fig. 1 illustrates use cases in which changes are expected to reappear over time after approximately equal time intervals. In this paper we focus on such a setting, addressing the problem of detecting changes in noisy signals with recurrent changes. Our approach (called the BLPA method) is based on the hypothesis that if the probability distribution of the time intervals between changepoints differs from the probability distribution of the time intervals between outliers, then we can use this information to predict the time locations of the changes, skip outliers, and therefore achieve better true-detection vs. false-positive alarm rates.

BLPA is a novel online detection method. It extends the Bayesian Online Changepoint Detector (BD) proposed in [4] by embedding into it a Predictive Change Confidence Function (PCCF), introduced recently in [5], in order to predict future changepoints in the input data stream, adjust the detector's settings dynamically, and reduce the FP rate. The BD detector works by recursively estimating the posterior probability distribution P(r_t | x_{1:t}, θ) of the run-length variable r_t, which is the time since the last changepoint. A changepoint is an event when

arg max_{r_t} P(r_t | x_{1:t}, θ) = 0.   (2)

Every time a new measurement x_t is observed, the posterior distribution is recalculated using Bayes' theorem, to update the parameters of the distributions used to model the data, and the law of total probability

P(r_t | ·) = Σ_{r_{t−1}} P(r_t | r_{t−1}, ·) P(r_{t−1} | ·)   (3)

to take into account values from all runs in the past. The prior probability of a change, P(r_t = 0 | t), in the BD detector is specified using a constant hazard rate h, which is the prior probability of observing a change and which is supposed to be known before the change detection process starts. This uniform, non-informative prior does not hold enough information to distinguish outliers and noisy changes from the changepoints.

We improve the performance of the BD detector by using an informative prior distribution in the form of the PCCF, whose parameters are the average time interval µ between consecutive changepoints ⟨c_i − c_{i−1}⟩ and the standard deviation σ. Given the current estimates of µ and σ, the PCCF gives the prior probability P(t | µ, σ) of observing a recurrent changepoint at time t. During the change detection process the parameters of the BD detector are adjusted dynamically according to the predictions, in order to skip possible noisy changes between changepoints. When a new changepoint is detected (or its location is provided by an external source), the parameters (µ, σ) are updated using the Bayesian rule and a new prediction P(t | µ_new, σ_new) is made.

The paper is organized as follows. In Section II we review related work. In Section III we describe in detail how the Bayesian Change Detector proposed in [4] works. In Section IV we describe the PCCF used to predict recurrent changes. In Section V we describe the data model, common to the input signal of observations ⟨x_i⟩ and to the time intervals between changepoints ⟨c_i − c_{i−1}⟩. In Section VI we describe the BLPA method, which is the BD detector integrated with the PCCF. In Section VII we describe experimental results demonstrating the improved performance of the BD detector when integrated with the PCCF.

II. RELATED WORK

While many change detection methods have been developed for offline and online settings [1], [6], they typically assume that changes occur at random points in time and are independent of each other. In practice, however, in many industrial applications changes occur with some regularity, which may be due, for instance, to seasonality of the conditions or to semi-regular wear rates of the equipment. Our BLPA approach captures this information from data and utilizes it to improve the accuracy of a Bayesian online change detector.

In the Bayesian online change detector proposed in [4] and extended in [7], the authors model the time intervals between changepoints (run lengths) using the hazard rate. This approach allows recurrence to be taken into account by tuning a single parameter, but it does not allow distinguishing outliers from real changes. In [8], data stream volatility, defined as the rate of detected changes, is used to make the detector more reactive. We concentrate on the problem of improving change detection by predicting the time locations of future changes, in order to better distinguish outliers from real changes.

In BD [4] the hazard rate is a constant value assumed to be known in advance. This is not a very realistic assumption, and the problem has been addressed in [9], where the authors proposed an online inference procedure to estimate the parameter h in the case when the hazard rate is unknown and can itself undergo changes as new data arrives. In [10] the authors proposed an algorithm that can detect and locate changepoints simultaneously using a Bayesian statistics approach. In [11] the authors use a Gaussian process model to compute the predictive distribution p(x_new | x_old).

Our method is different in that we combine the change detection and prediction tasks. We add a second layer (the PCCF) on top of the change detection algorithm, allowing future recurrent changes to be predicted and the settings of the detector to be adjusted dynamically. This second layer is a change detector itself, as it automatically incorporates changes into the underlying distribution of the time intervals between recurrent changes. In our previous work [5] we demonstrated how to combine the PCCF with a naive threshold. The BLPA method we propose here is more advanced in the sense that it integrates the PCCF natively into the BD detector within the Bayesian statistics framework. BLPA sequentially updates the parameters of both BD and PCCF, detects changes, predicts future changes, and adjusts the parameters of the BD detector according to the predictions, in order to skip noisy changes and outliers while detecting the changes of interest.

A few other lines of work conceptually relate to our approach via attention to recurrent concept drift [12], [13], [14], predictability of concept drift [15], or change detection with delayed labeling [16]. Yet these approaches are specific to handling concept drift, while our focus is on generic online change detection and its accuracy.

III. ONLINE BAYESIAN CHANGE DETECTOR (BD)

In this section we describe the Bayesian Online Changepoint Detector proposed in [4]. As previously mentioned, to model the time occurrences of the changes the authors introduce a latent run-length variable r_t, which is the number of time steps since the most recent change. Plot (A) in Fig. 1 shows an illustrative example of the input signal, and plot (B) the corresponding run-length values. At each time step there are two possibilities: either the run length increases, r_t = r_{t−1} + 1, or a changepoint occurs, r_t = 0. The conditional prior P(r_t | r_{t−1}) of a change is given by a constant hazard rate h:

p(r_t | r_{t−1}) = { 1 − h   if r_t = r_{t−1} + 1
                  { h       if r_t = 0             (4)

Plot (C) in Fig. 1 illustrates the message-passing algorithm used to compute the prior probabilities of a changepoint at any time moment, given the boundary condition P(r_1 = 0) = 1.0, i.e., that a change occurred at the moment t = 1. Each node (circle) represents a hypothesis about the current run-length value. From each node, a solid line going upwards depicts the probability that the run increases at the next time step (no change), and a dashed line going downwards depicts the probability of a change. At each time step the probability of a changepoint is estimated by calculating the posterior probability distribution of the run-length value given the observed data:

P(r_t | x_{1:t}) = P(r_t, x_{1:t}) / P(x_{1:t})   (5)

The joint probability of the run-length values and the observed data can be computed sequentially using the recursive procedure in Equation (6), as described in [4]:

P(r_t, x_{1:t}) = Σ_{r_{t−1}} P(r_t, r_{t−1}, x_{1:t})
              = Σ_{r_{t−1}} P(r_t, x_t | r_{t−1}, x_{1:t−1}) P(r_{t−1}, x_{1:t−1})
              = Σ_{r_{t−1}} P(r_t | r_{t−1}) P(x_t | r_{t−1}, x_t^{(r)}) P(r_{t−1}, x_{1:t−1})   (6)

where x_t^{(r)} ≡ ⟨x_{t−r+1}, …, x_t⟩ is the sub-interval of the input data associated with the run length r. The marginal predictive distribution of a new observation x_t is computed using the sum rule:

P(x_t | x_{1:t−1}) = Σ_{r_t} P(x_t | r_t, x_t^{(r)}) P(r_t | x_{1:t−1}).   (7)
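To make this recursion concrete, the following minimal sketch (our illustration, not the authors' implementation) tracks the run-length distribution of Equations (4)-(7) online for Gaussian data with a known variance; the per-run mean estimate with a unit pseudo-count is a simplification of the conjugate normal-gamma model described in Section V, and the function name is ours.

import numpy as np

def bd_gaussian(xs, h=0.01, var=1.0, mu0=0.0):
    """Minimal BD run-length filter, Gaussian data with known variance `var`."""
    r_post = np.array([1.0])                  # boundary condition P(r_1 = 0) = 1
    sums = np.array([0.0])                    # per-run sums of observations
    counts = np.array([0.0])                  # per-run observation counts
    map_runs = []
    for x in xs:
        mu_pred = (mu0 + sums) / (1.0 + counts)   # crude posterior mean per run
        pred = np.exp(-0.5 * (x - mu_pred) ** 2 / var) / np.sqrt(2 * np.pi * var)
        growth = r_post * pred * (1 - h)          # r_t = r_{t-1} + 1 (Eq. 4)
        cp = (r_post * pred * h).sum()            # r_t = 0: changepoint mass
        r_post = np.append(cp, growth)
        r_post /= r_post.sum()                    # normalize, giving Eq. (5)
        sums = np.append(0.0, sums + x)           # extend per-run statistics
        counts = np.append(0.0, counts + 1.0)
        map_runs.append(int(r_post.argmax()))     # MAP run length (cf. Eq. 2)
    return map_runs

A drop of the MAP run length back to zero signals a detected change; in the sketch a change is alarmed whenever the changepoint mass dominates all growth hypotheses.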

IV. PCCF FUNCTION

In this section we show how to compute the PCCF used to predict the time locations of recurrent changes in the future. We consider the discrete case, in which observations are obtained at discrete time moments ⟨t⟩_{t=1}^T with a constant sampling rate. The probability distribution function (Pdf) for discrete sets is defined using the probability mass function (Pmf).

[Figure 1 appears here: three panels plotting r_t against t = 1, …, 15.]
Fig. 1. The changepoint detection problem. (A): Input signal. (B): A particular realization of the run-length path corresponding to the actual changepoint locations in the input signal. (C): Directed graph representing all possible run-length paths. The figure is replicated from the illustration in [4].

As mentioned earlier, we assume that changes reoccur at approximately equal time intervals. To model the time intervals between consecutive changes ⟨c_i − c_{i−1}⟩ we use the Gaussian distribution, assuming that the standard deviation is small enough that the probability of observing a change c_i before c_{i−1} is extremely small.

Definition 1: Changes ⟨c_i⟩_{i=1}^k are recurrent if

p(c_{i+1} = t | θ_C) = p(c_1 = t − c_i | θ_C),   (8)

where θ_C = (µ_C, σ_C), c_1 is the time location of the 1st change, and c_i is the time location of the i-th change. This definition corresponds to the generative model defined by Equation (9), in which every next change c_{i+1} happens after a time interval ∆ that follows the Gaussian distribution N(µ_C, σ_C):

c_{i+1} = c_i + ∆, where ∆ ∼ N(µ_C, σ_C).   (9)
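As an illustration, the generative model (9) can be sampled directly by accumulating Gaussian inter-arrival times. The short sketch below is ours, with the parameter defaults chosen arbitrarily to match the simulation setup of Section VII.

import numpy as np

def simulate_changepoints(mu_c=100.0, sigma_c=10.0, k=10, rng=None):
    """Draw k recurrent changepoints per Eq. (9): c_{i+1} = c_i + Delta."""
    rng = rng or np.random.default_rng(0)
    deltas = rng.normal(mu_c, sigma_c, size=k)    # Delta ~ N(mu_C, sigma_C)
    return np.cumsum(deltas).round().astype(int)  # discrete time locations

print(simulate_changepoints())  # ~10 roughly equally spaced change times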

To predict future changes we introduce the notion of the Predictive Change Confidence Function (PCCF) [5].

Definition 2: The PCCF is a Pmf defined on a discrete set of time moments ⟨t⟩_{t=1}^T, giving the probability of observing a recurrent change ∀c ∈ ⟨c_i⟩_{i=1}^k at the time moment t:

P(c = t | µ_C, σ_C) = Σ_{i=1}^k p(c_i = t | µ_C, σ_C)   (10)

where p(c_i = t | µ_C, σ_C) is the Pmf of an individual change c_i. It is important to note that the change events ⟨c_i⟩ are independent: every c_i can happen at any moment of time according to its individual Pmf p(c_i = t | µ_C, σ_C). Following the sum rule for total probability¹, in order to compute the Pmf of c_{i+1} we need to consider all possible time locations of c_i:

p(c_{i+1} = t) = Σ_{τ=1}^{t−1} p(c_{i+1} = t | c_i = τ) p(c_i = τ).   (11)

The right side of Equation (11) is a convolution of the Pmf p(c_1) of the 1st recurrent change with the Pmf of the change c_i computed in the previous step:

p(c_{i+1}) = (p(c_1) ∗ p(c_i))[τ] = Σ_{τ=1}^{t−1} p(c_1 = t − τ) p(c_i = τ).

According to Definition 2, the PCCF is the sum of the individual Pmf's of the changes that might have happened up to the current moment of time:

P(t) = Σ_{i=1}^{t−1} Σ_{τ=i}^{t} p(c_{i+1} = t | c_i = τ) p(c_i = τ)   (12)
     = Σ_{i=1}^{t−1} Σ_{τ=i}^{t} p(c_1 = t − τ) p(c_i = τ).   (13)

The convolution of two Gaussian distributions is also a Gaussian distribution:

p(x | µ_1, σ_1) ∗ p(x | µ_2, σ_2) = p(x | µ_1 + µ_2, √(σ_1² + σ_2²)).   (14)

The PCCF (Eq. 12) can therefore be written as a t-fold convolution

P(t) = p(c_1) ∗ p(c_1) ∗ ⋯ ∗ p(c_1)   (t times)   (15)

which is equivalent to the sum

P(t) = Σ_{l=1}^{t} (1 / (σ√(2πl))) exp(−(t − lµ)² / (2lσ²)).   (16)

The sum in Equation (16) describes a renewal-reward process [17], [18]. Using the renewal theorem [17] we can calculate the limit of P(t) when t → ∞:

L = lim_{t→∞} Σ_{l=1}^{∞} (1 / (σ√(2πl))) exp(−(t − µl)² / (2lσ²)) = 1/µ.   (17)

From Equation (17) it follows that the PCCF converges to a constant-value uniform distribution for large t. Fig. 2 illustrates two PCCF functions (Equation 16) with parameters (µ = 10, σ = 2) and (µ = 15, σ = 3). The prior and posteriors of the PCCF's parameters are estimated and updated using the procedure described in Section V (the data model).

[Figure 2 appears here.]
Fig. 2. An example of two Gaussian PCCF functions. The limits are 1/µ_C.

¹P(x) = Σ_y P(x | y) p(y)
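Numerically, the finite sum in Equation (16) is direct to evaluate. The sketch below (our code; the function name pccf is ours) computes the PCCF on a discrete grid, and for large t the values can be checked against the 1/µ limit of Equation (17).

import numpy as np

def pccf(t_max, mu, sigma):
    """PCCF of Eq. (16): P(t) = sum_l N(t | l*mu, sqrt(l)*sigma), t = 1..t_max."""
    t = np.arange(1, t_max + 1)[:, None]      # time grid (column)
    l = np.arange(1, t_max + 1)[None, :]      # number of elapsed changes (row)
    dens = (np.exp(-((t - l * mu) ** 2) / (2 * l * sigma**2))
            / (sigma * np.sqrt(2 * np.pi * l)))
    return dens.sum(axis=1)

H = pccf(500, mu=10.0, sigma=2.0)
print(H[-1], 1 / 10.0)  # for large t the PCCF approaches the 1/mu limit (Eq. 17)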

V. DATA MODEL

In this section we describe the data model used in BLPA. There are two data streams to be analyzed: the stream of input observations ⟨x_t⟩ and the stream of time intervals between changepoints ⟨c_i − c_{i−1}⟩. The first stream is used to detect changes and thereby to produce the second stream. The stream of changes may also be updated by external processes providing additional information about the time locations of the changes; e.g., additional change detection processes may run in parallel with the main detector, on collected data, to locate past changes more precisely. The stream of changepoints is used to predict future changepoints, in order to adjust the detector's settings and achieve better performance.

Further, data D denotes either the input data stream of observations ⟨x_t⟩ or the data stream of time intervals between consecutive changepoints ⟨c_i − c_{i−1}⟩; the data model for D described in this section is common to both. Data D is assumed to be generated by a Gaussian distribution with unknown mean and variance. We denote elements of D by x̃_i ∈ D, with mean and variance (µ̃, σ̃).

A. Prior distributions

Following the notation in [19], we use a normal-gamma prior for µ̃ and σ̃:

x̃_i ∼ N(µ̃, τ̃), τ̃ = (1/σ̃)²   (18)
µ̃ ∼ N(µ̃_0, κ̃_0 τ̃)   (19)
τ̃ ∼ Gamma(α̃_0, β̃_0)   (20)

where (α̃_0, β̃_0, µ̃_0, κ̃_0) are hyperparameters. The value τ̃ is also called the precision². The likelihood of the data D = ⟨x̃_i⟩ is

P(D | µ̃, τ̃) = (τ̃/2π)^{n/2} exp(−(τ̃/2) Σ_{i=1}^n (x̃_i − µ̃)²).   (21)

The joint conjugate prior for the parameters (µ̃, τ̃) is the normal-gamma (NG) distribution:

P(µ̃, τ̃ | µ̃_0, κ̃_0, α̃_0, β̃_0) = N(µ̃_0, κ̃_0 τ̃) Gamma(α̃_0, β̃_0)   (22)
  = (1/Z) τ̃^{1/2} exp(−(κ̃_0 τ̃/2)(µ̃ − µ̃_0)²) τ̃^{α̃_0−1} e^{−τ̃ β̃_0}   (23)
  = (1/Z) τ̃^{α̃_0−1/2} exp(−(τ̃/2) [κ̃_0(µ̃ − µ̃_0)² + 2β̃_0])   (24)

²Further we use the σ and τ parameters interchangeably.

where Z = (Γ(α̃_0)/β̃_0^{α̃_0}) (2π/κ̃_0)^{1/2} is the normalization factor.

B. Posterior distributions

The posterior can be derived as

P(µ̃, τ̃ | D) ∝ P(µ̃, τ̃ | µ̃_0, κ̃_0, α̃_0, β̃_0) P(D | µ̃, τ̃)   (25)
            ∝ N(µ̃_n, κ̃_n τ̃) Gamma(α̃_0 + n/2, β̃_n)   (26)

which is also a normal-gamma distribution:

P(µ̃, τ̃ | D) = NG(µ̃, τ̃ | µ̃_n, κ̃_n, α̃_n, β̃_n)   (27)

with the parameters

µ̃_n = (κ̃_0 µ̃_0 + n x̄) / (κ̃_0 + n)   (28)
κ̃_n = κ̃_0 + n   (29)
α̃_n = α̃_0 + n/2   (30)
β̃_n = β̃_0 + (1/2) Σ_{i=1}^n (x̃_i − x̄)² + κ̃_0 n (x̄ − µ̃_0)² / (2(κ̃_0 + n))   (31)

where x̄ = (1/n) Σ_{i=1}^n x̃_i is the mean of the sampled data. The posterior distribution for τ̃ is obtained by integrating Equation (27) over µ̃ (see [19]):

p(τ̃ | D, µ̃_0, κ̃_0, α̃, β̃) ∝ Gamma(α̃ + n/2, β̃ + (1/2) Σ_{i=1}^n (x̃_i − x̄)² + n κ̃_0 (x̄ − µ̃_0)² / (2(n + κ̃_0))).   (32)
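As a concrete illustration of the update rules (28)-(31), a minimal sketch in our own code (variable names mirror the text):

import numpy as np

def ng_update(data, mu0, kappa0, alpha0, beta0):
    """Normal-gamma posterior parameters after observing `data` (Eqs. 28-31)."""
    x = np.asarray(data, dtype=float)
    n, xbar = len(x), x.mean()
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)             # Eq. (28)
    kappa_n = kappa0 + n                                         # Eq. (29)
    alpha_n = alpha0 + n / 2.0                                   # Eq. (30)
    beta_n = (beta0 + 0.5 * ((x - xbar) ** 2).sum()              # Eq. (31)
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * (kappa0 + n)))
    return mu_n, kappa_n, alpha_n, beta_n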

Given the updated parameters θ = (α_0, β_0, µ_0, κ_0) obtained via rules (28)-(31), the predictive distribution for a new observation x_new is

p(x_new | x, µ, κ, α, β) = ∫ p(x_new | µ, τ) p(τ | x, µ_0, κ_0, α, β) dτ   (33)

where

p(x_new | µ, τ) = (τ/2π)^{1/2} e^{−τ(x−µ)²/2}   (34)

and p(τ | x, µ_0, κ_0, α, β) is given by (32). The integral (33) is a Pearson type VII distribution (Equation 35), which is equivalent to a non-standardized Student's t-distribution:

p(x_new) = (1 / (α B(m − 1/2, 1/2))) (1 + ((x_{n+1} − λ)/α)²)^{−m}   (35)

where

m = α_0 + (n + 1)/2   (36)
α = A √(Σ_{i=1}^n x_i² + κ_0 µ_0² − (Σ_{i=1}^n x_i + µ_0 κ_0)² / (n + κ_0) + 2β_0)   (37)
A = √((n + 1 + κ_0) / (n + κ_0))   (38)
λ = (Σ_{i=1}^n x_i + µ_0 κ_0) / (n + κ_0)   (39)

The predictive distribution (7) in the case of this data model is thus given by Equation (35). Detailed calculations are given in the Appendix.
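The predictive density (35) can be evaluated directly from Equations (36)-(39); the following sketch (ours) does so:

import numpy as np
from scipy.special import betaln

def predictive_density(x_new, data, mu0, kappa0, alpha0, beta0):
    """Pearson type VII predictive density of Eqs. (35)-(39)."""
    x = np.asarray(data, dtype=float)
    n, s = len(x), x.sum()
    m = alpha0 + (n + 1) / 2.0                                   # Eq. (36)
    A = np.sqrt((n + 1 + kappa0) / (n + kappa0))                 # Eq. (38)
    a = A * np.sqrt((x**2).sum() + kappa0 * mu0**2               # Eq. (37)
                    - (s + mu0 * kappa0) ** 2 / (n + kappa0) + 2 * beta0)
    lam = (s + mu0 * kappa0) / (n + kappa0)                      # Eq. (39)
    norm = a * np.exp(betaln(m - 0.5, 0.5))                      # alpha * B(m-1/2, 1/2)
    return (1 + ((x_new - lam) / a) ** 2) ** (-m) / norm         # Eq. (35)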

VI. BLPA CHANGE DETECTOR

The BLPA method is a combination of the BD detector and the PCCF predictive function. In particular, when we compute the joint probability P(r_t, ⟨x_j⟩_{j=1}^t), and subsequently the run-length distribution P(r_t | ⟨x_j⟩_{j=1}^t), we multiply these probabilities by the prior probability of a change given by the PCCF for the moment t. The BLPA method is depicted in Algorithm 1, in which:
• Lines 1-4: set the initial parameter values for the probability distribution of the data D.
• Line 5: compute the PCCF using the initial parameter values (Equation 12).
• Line 7: collect a new measurement.
• Line 8: compute the predictive distribution using Equation 35.
• Lines 9-10: compute the change probabilities and the growth probabilities of the run length.
• Line 11: compute the posterior probabilities of the run lengths (changes).
• Line 12: update the parameters of the probability distributions of the data D using Equations 28-31.
• Lines 13-16: find the most likely position of the last changepoint, update the PCCF parameters, and recalculate the PCCF.
A sketch of this loop in code is given after the algorithm.

Algorithm 1 BLPA detector pseudocode
 1: θ ← (µ_0, κ_0, α_0, β_0)
 2: θ_C ← (µ_0^C, κ_0^C, α_0^C, β_0^C)
 3: θ = θ_0                                   ▷ Init signal params
 4: θ_C = θ_0^C                               ▷ Init PCCF params
 5: ⟨H_j⟩_{j=1}^T = Pccf(θ_0^C)               ▷ Predict changes (initial)
 6: for t = 1:T do
 7:   x ← [x, x_t]                            ▷ Observe new datum
 8:   π_t = P(x_t | θ)                        ▷ Predictive distribution
 9:   P(r_t = r_{t−1} + 1, x) = P(r_{t−1}, x_{1:t−1}) π_t (1 − H_{t−1})
10:   P(r_t = 0, x) = H_{t−1} Σ_{r_{t−1}} P(r_{t−1}, x_{1:t−1}) π_t
11:   P(r_t | x) = P(r_t, x) / P(x)           ▷ Run-length distribution
12:   θ ← Update(θ)                           ▷ Update parameters
13:   if arg max_{r_t} p(r_t | x, θ) = 0 then
14:     θ_C ← Update(θ_C)
15:     ⟨H_j⟩_{j=t}^T = Pccf(θ_C)
16:   end if
17: end for
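Our reading of Algorithm 1 as executable code is sketched below, under simplifying assumptions: Gaussian data with known variance, the pccf helper from the Section IV sketch, and a crude smoothing update standing in for the Bayesian update of the PCCF parameters on lines 13-16. This is an illustration, not the authors' implementation; the essential point is that the constant hazard h of BD is replaced by the time-varying PCCF value H.

import numpy as np

def blpa(xs, theta_c=(100.0, 10.0), var=1.0, mu0=0.0):
    """Sketch of Algorithm 1: BD recursion with a PCCF-driven hazard."""
    mu_c, sigma_c = theta_c                   # lines 1-4: init parameters
    H = pccf(len(xs), mu_c, sigma_c)          # line 5: initial prediction
    r_post = np.array([1.0])
    sums, counts = np.array([0.0]), np.array([0.0])
    detections, last_cp = [], 0
    for t, x in enumerate(xs):                # line 7: observe new datum
        mu_pred = (mu0 + sums) / (1.0 + counts)
        pi_t = np.exp(-0.5 * (x - mu_pred) ** 2 / var) / np.sqrt(2 * np.pi * var)
        growth = r_post * pi_t * (1 - H[t])   # line 9: growth probabilities
        cp = (r_post * pi_t * H[t]).sum()     # line 10: change probability
        r_post = np.append(cp, growth)
        r_post /= r_post.sum()                # line 11: run-length distribution
        sums = np.append(0.0, sums + x)       # line 12: update data parameters
        counts = np.append(0.0, counts + 1.0)
        if r_post.argmax() == 0 and t > 0:    # lines 13-16: change detected
            interval = t - last_cp
            mu_c = 0.9 * mu_c + 0.1 * interval   # crude stand-in for the
            last_cp = t                          #   Bayesian PCCF update
            H[t:] = pccf(len(xs) - t, mu_c, sigma_c)
            detections.append(t)
    return detections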

VII. EXPERIMENTS

We performed experiments with artificially generated and real-world human activity data sets. To measure the performance of the change detector we consider it as a binary classifier assigning labels 'change'/'not change' to the incoming observations x_t. If e_t^+ is the 'change' label assigned at the moment t and e_t^− is the 'not change' label assigned at t, then True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) events can be defined as follows:

• e_t^+ is TP if ∃c_i : t − c_i < δ, and FP if ∄c_i : t − c_i < δ
• e_t^− is FN if ∃c_i : t − c_i < δ, and TN if ∄c_i : t − c_i < δ

The performance of the change detector is characterized by the TP/FP rates and by the average delay δ of the detection.
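Under these definitions, the TP/FP/FN/TN counts can be computed with a small helper; the sketch below is ours, and it reads "t − c_i < δ" as 0 ≤ t − c_i < δ, i.e., a detection within delay δ after the change.

def confusion_counts(alarms, changes, delta, T):
    """Count TP/FP/FN/TN over t = 0..T-1 for alarm times vs. true changepoints."""
    alarms = set(alarms)
    tp = fp = fn = tn = 0
    for t in range(T):
        near = any(0 <= t - c < delta for c in changes)  # a change within delta
        if t in alarms:              # detector emitted e_t^+ ('change')
            if near:
                tp += 1
            else:
                fp += 1
        else:                        # detector emitted e_t^- ('not change')
            if near:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn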

A. Artificial data

In the simulation we generated 200 signals, each with 10 recurrent changes in the mean value, for each hazard-rate parameter value h varied in the interval from 50 to 300 with step 15. The average distance between changes was set to µ = 100 with standard deviation σ = 10. Results are depicted in Fig. 3: the FP rate is decreased while the TP rate is not reduced, and in the worst cases the performance of the two detectors is similar.

[Figure 3 appears here.]
Fig. 3. Experimental results for simulated data streams with recurrent changes. On the top plot: an example of the generated input signal; vertical solid lines depict the changepoints to be detected, and dashed lines on the plot with the signal depict the moments when the detector without PCCF alarmed changes. Dashed lines on the plot with the PCCF show the moments when the detector with PCCF alarmed changes. Bottom plot: ROC curves; blue triangles depict the performance of the detector equipped with PCCF, black dots the performance without PCCF.

B. Human Activity (HA) signal

In the second experiment we used the Human Activity data set [20], which contains sensor measurements from people performing six types of activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs and walking upstairs). We detected the changes caused by transitions from one set of activities to another in one of the sensor signals. Results are depicted in Fig. 4: the FP rate is decreased when the BD detector is used with the PCCF, while the TP rate stays the same.

[Figure 4 appears here.]
Fig. 4. Experimental results for the 'Activity recognition' signal. On the top plot: an illustrating example of the signal and the corresponding PCCF; vertical solid lines depict the changepoints to be detected, and dashed lines on the plot with the signal depict the moments when the detector without PCCF alarmed changes. Bottom plot: ROC curves; blue triangles depict the performance of the BD with PCCF.

VIII. CONCLUSIONS AND FUTURE WORK

We proposed BLPA, a new Bayesian method that improves the accuracy of the Bayesian Online Changepoint Detector on data streams with recurrent changes by embedding the Predictive Change Confidence Function (PCCF). As new data arrives in a stream, the BD detector's and the PCCF's parameters are adjusted to the changing conditions using the same Bayesian update procedures, constituting the two-layer adaptive change detection/prediction method BLPA. In experiments with real and artificial data sets we demonstrated that the Bayesian detector equipped with the PCCF performs better in terms of TP/FP rates than the detector without it.

In BLPA the run lengths ⟨c_i − c_{i−1}⟩ are modelled by a Gaussian distribution N(µ, σ) with a small standard deviation σ, to ensure that the probability of the event c_{i+1} < c_i is close to zero. Initial values for µ and σ can easily be calculated from historical data using well-known maximum likelihood estimators. The choice of the Gaussian distribution also allows uniform modelling of both input data streams, observations and changepoints. However, in some real-life applications this assumption may not be realistic, and it would make sense to consider distributions with positive support. In the BD detector [4] run lengths are modelled by a constant hazard rate h, which is equivalent to a Poisson process in which the time intervals between sequential events are i.i.d. from the exponential distribution h exp(−ht), which in turn is a special case of the distribution Gamma(α = 1, β = 1/h). In future work we plan to consider the Gamma distribution with α > 1 for modelling run-length values.

In the current version of BLPA a user should test, on historical data, the hypothesis that the changepoints in the input data stream are recurrent, and should perform an initial estimation of the prior probability distribution of the PCCF's parameters. We plan to address this issue by considering more real-life data streams, in order to develop procedures that automate this process.

IX. ACKNOWLEDGEMENTS

This research is partly supported by the COMMIT Big Data Veracity project and by the STW CAPA project. We would like to thank Alexander Semenov for useful comments on this work.

REFERENCES

[1] I. V. Nikiforov and M. Basseville, "Detection of Abrupt Changes," 1993.
[2] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Mach. Learn., vol. 23, no. 1, pp. 69–101, 1996.
[3] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys, vol. 46, no. 4, pp. 44:1–44:37, 2014.
[4] R. P. Adams and D. J. MacKay, "Bayesian online changepoint detection," arXiv:0710.3742, 2007.
[5] A. Maslov, M. Pechenizkiy, I. Žliobaitė, and T. Kärkkäinen, "Modelling recurrent events for improving online change detection," in SIAM International Conference on Data Mining (SDM16), 2016.
[6] A. S. Polunchenko and A. G. Tartakovsky, "State-of-the-art in sequential change-point detection," Methodology and Computing in Applied Probability, vol. 14, no. 3, pp. 649–684, Oct. 2011.
[7] R. C. Wilson, M. R. Nassar, and J. I. Gold, "Bayesian online learning of the hazard rate in change-point problems," Neural Computation, vol. 22, no. 9, pp. 2452–2476, 2010.
[8] D. Huang, Y. S. Koh, G. Dobbie, and R. Pears, "Detecting volatility shift in data streams," in ICDM 2014, pp. 863–868.
[9] R. C. Wilson, M. R. Nassar, and J. I. Gold, "Bayesian online learning of the hazard rate in change-point problems," Neural Comput., vol. 22, no. 9, pp. 2452–2476, Sep. 2010.
[10] A. B. Downey, "A novel changepoint detection algorithm," 2008.
[11] Y. Saatçi, R. D. Turner, and C. E. Rasmussen, "Gaussian process change point models," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 927–934.
[12] J. Gama and P. Kosina, "Learning about the learning process," in Proc. of IDA'11, pp. 162–172.
[13] J. B. Gomes, M. M. Gaber, P. A. C. Sousa, and E. M. Ruiz, "Mining recurring concepts in a dynamic feature space," IEEE Trans. Neural Netw. Learning Syst., vol. 25, no. 1, pp. 95–110, 2014.
[14] J. B. Gomes, P. A. C. Sousa, and E. M. Ruiz, "Tracking recurrent concepts using context," Intell. Data Anal., vol. 16, no. 5, pp. 803–825.
[15] H. Ang, V. Gopalkrishnan, I. Žliobaitė, M. Pechenizkiy, and S. Hoi, "Predictive handling of asynchronous concept drifts in distributed environments," IEEE Trans. on Knowl. and Data Eng., vol. 25, pp. 2343–2355, 2013.
[16] I. Žliobaitė, "Change with delayed labeling: When is it detectable?" in ICDM 2010 Workshops, pp. 843–850.
[17] D. R. Cox, Renewal Theory. Methuen, 1962, vol. 58.
[18] W. Feller, An Introduction to Probability Theory and Its Applications: Volume I. John Wiley & Sons, 1968, vol. 3.
[19] M. I. Jordan, "Chapter 9. The exponential family: Conjugate priors."
[20] J.-L. Reyes-Ortiz, L. Oneto, A. Samà, X. Parra, and D. Anguita, "Transition-aware human activity recognition using smartphones," Neurocomputing, vol. 171, pp. 754–767, 2016.

APPENDIX

The marginal predictive distribution p(x_{n+1} | x_1, …, x_n) can be found as

p(x_{n+1} | x_1, …, x_n) = [∫_{τ∈R+} ∫_{µ∈R} p(x_1, …, x_{n+1} | µ, τ) p(µ, τ) dµ dτ] / [∫_{τ∈R+} ∫_{µ∈R} p(x_1, …, x_n | µ, τ) p(µ, τ) dµ dτ].   (40)

Assuming the Gaussian distribution

p(x | µ, τ) = (√τ/√(2π)) e^{−(τ/2)(x−µ)²},   (41)

the probability p(x_1, …, x_n | µ, τ) p(µ, τ) is

(√τ/√(2π))^n exp(−(τ/2) Σ_{i=1}^n (x_i − µ)²) × τ^{α_0−1/2} exp(−(τ/2) [κ_0(µ − µ_0)² + 2β_0]),   (42)

where the expression in the exponent is

−(τ(n + κ_0)/2) (µ − (Σ_{i=1}^n x_i + µ_0κ_0)/(n + κ_0))² − (τ/2) (Σ_{i=1}^n x_i² + κ_0µ_0² − (Σ_{i=1}^n x_i + µ_0κ_0)²/(n + κ_0) + 2β_0).   (43)

Therefore

p(x_1, …, x_n | µ, τ) p(µ, τ) = (1/√(2π))^{n−1} (Γ(α_0 + n/2)/√(n + κ_0)) (â_n/2)^{−(α_0+n/2)} Gamma(τ | α_0 + n/2, â_n/2) N(µ | µ̂_n, σ̂_n²),   (44)

where the posterior parameter estimates â_n, µ̂_n, σ̂_n² are

â_n = Σ_{i=1}^n x_i² + κ_0µ_0² − (Σ_{i=1}^n x_i + µ_0κ_0)²/(n + κ_0) + 2β_0,
µ̂_n = (Σ_{i=1}^n x_i + µ_0κ_0)/(n + κ_0),  σ̂_n² = 1/(τ(n + κ_0)).   (45)

Therefore the integral in the numerator of Equation (40) is

∫_{τ∈R+} ∫_{µ∈R} p(x_1, …, x_{n+1} | µ, τ) p(µ, τ) dµ dτ = (1/√(2π))^n (Γ(α_0 + (n+1)/2)/√(n + 1 + κ_0)) (â_{n+1}/2)^{−(α_0+(n+1)/2)},   (46)

and the integral (40) can be expressed as

(√(n + κ_0)/(√(n + 1 + κ_0) B(α_0 + n/2, 1/2))) â_n^{α_0+n/2}/â_{n+1}^{α_0+(n+1)/2}
= (√(n + κ_0)/(√(n + 1 + κ_0) B(α_0 + n/2, 1/2))) (â_{n+1}/â_n)^{−(α_0+(n+1)/2)} â_n^{−1/2}.   (47)

Noticing that

â_{n+1}/â_n = 1 + ((n + κ_0)/(â_n(n + 1 + κ_0))) (x_{n+1}² − 2x_{n+1}(Σ_{i=1}^n x_i + µ_0κ_0)/(n + κ_0) + (Σ_{i=1}^n x_i + µ_0κ_0)²/(n + κ_0)²),   (48)

the marginal predictive distribution is

p(x_{n+1} | x_1, …, x_n) = (1/(b̂_n B(α_0 + n/2, 1/2))) (1 + (x_{n+1} − λ)²/b̂_n²)^{−(α_0+(n+1)/2)},   (49)

where the coefficients are

b̂_n = √((n + 1 + κ_0) â_n/(n + κ_0)),  λ = (Σ_{i=1}^n x_i + µ_0κ_0)/(n + κ_0).   (50)
