Noname manuscript No. (will be inserted by the editor)

Controlled Permutations for Testing Adaptive Learning Models

Indrė Žliobaitė

Received: Nov 10, 2011; Revised: Dec 15, 2012; Accepted: Feb 18, 2013

Abstract We study the evaluation of supervised learning models that adapt to changing data distributions over time (concept drift). The standard testing procedure that simulates online arrival of data (test-then-train) may not be sufficient to generalize about the performance, since a single test shows how well a model adapts to one fixed configuration of changes, while the ultimate goal is to assess adaptation to changes that happen unexpectedly. We propose a methodology for obtaining datasets for multiple tests by permuting the order of the original data. A random permutation is not suitable, as it makes the data distribution uniform over time and destroys the adaptive learning task. Therefore, we propose three controlled permutation techniques that make it possible to acquire new datasets by introducing restricted variations in the order of examples. The control mechanisms, with theoretical guarantees of preserving distributions, ensure that the new sets represent close variations of the original learning task. Complementary tests on such sets allow analyzing the sensitivity of the performance to variations in how changes happen, and this way enrich the assessment of adaptive supervised learning models.

Keywords concept drift · evaluation · data streams · permutations

I. Žliobaitė
Aalto University, Department of Information and Computer Science, Espoo, Finland
Helsinki Institute for Information Technology (HIIT), Espoo, Finland
Bournemouth University, Poole, Dorset, UK
E-mail: [email protected]

1 Introduction

Changing distribution of data over time (concept drift [24]) is one of the major challenges for data mining applications, including marketing, financial analysis, recommender systems, spam categorization and more. As data arrives and evolves over time, constant manual adjustment of models is inefficient and

with increasing amounts of data quickly becomes infeasible. In such situations decision models need mechanisms to update or retrain themselves using recent data, otherwise their accuracy will degrade. Research attention to such supervised learning scenarios has been rapidly increasing in the last decade, and many adaptive learning models for massive data streams and smaller-scale sequential learning problems have been developed (e.g. [6,14-16,26,28]).

Evaluation of adaptive models requires specific testing procedures. Since the data is not uniformly distributed over time, testing needs to take the sequential order of data into account. A standard procedure for that is test-then-train (or prequential) evaluation [5], which mimics online learning: given a sequential dataset, every example is first used for testing and then for updating the model. Suppose (x1, x2, x3, x4) is our dataset. We train a model with x1. Next, the model is tested on x2, and the training set is augmented with x2. Next, we test on x3, and update the model with x3. Finally, we test on x4.

The limitation of this evaluation is that it allows processing a dataset only once, in the fixed sequential order. The positions where and the way in which changes happen remain fixed; thus, a single test shows how well a model adapts to this fixed configuration of changes. In contrast, the ultimate goal is to assess the performance on a given task online while data evolves unexpectedly. Thus, the results of a single test may be insufficient to conclude about the generalization performance of an adaptive model. The problem is particularly serious when several predictive models are compared, since, as we will see in the next section, the winning strategy may depend on the order of the data. Multiple tests with close variations of the original dataset could make the evaluation more confident.
The results from multiple tests could be used for assessing the stability of the model performance by estimating the variance of the accuracies. Multiple tests could also be used as validation sets for tuning model parameters. However, for such an evaluation to be reliable, we need to ensure that the multiple test sets do not deviate much from the original learning problem. This study presents a methodology for generating such sets.

One way to obtain multiple test sets would be to create synthetic data. For that we would need to build a statistical model of the sequence and use that model to generate data. However, this would require knowing or correctly estimating the underlying data generating process, which is hardly feasible for complex processes. Hence, we resort to permutations of the original dataset, as they provide a simple model-free way of obtaining new sequences: the data examples themselves do not need to be modified, while the way changes happen in the data can still be varied. A random permutation is not suitable, since it would make the data distribution uniform over time and the changing distributions would be destroyed (see Figure 1). Such data would represent a different learning problem that does not need adaptation. We propose to construct multiple test sets by permuting the data order in a controlled way that preserves local distributions, which means that examples that were originally near to each other in time remain close after a permutation. We present three permutation techniques that are theoretically restricted to keep examples together. The permuted datasets can be considered

Fig. 1 One variable of the Chess data (Section 5): the original order (Dec. 2007 - Mar. 2010) exhibits different distributions, while a random order does not.

as semi-synthetic: the order of the data is gently modified, while the data itself is not altered. Each permutation technique creates a different type of change: sudden, reoccurring, or gradual. Our study is novel in identifying the risks of a single test, forming a measure between the original test set and its variations, and proposing a theoretically founded way to form multiple test sets. The study contributes to the methodology of evaluating the performance of adaptive supervised learning models in online scenarios. As a result, it becomes possible to complement evaluation with an assessment of the sensitivity of the performance, and also to obtain validation sets for parameter tuning. A short version of this study was published as a conference paper [29] in Discovery Science 2011.

The paper is organized as follows. In Section 2 we discuss evaluation of adaptive models. In Section 3 we present our permutations. Section 4 presents the theoretical guarantees for controlling the permutations. Section 5 experimentally demonstrates how our permutations aid in evaluating adaptive models. Section 6 discusses related work. Section 7 concludes the study.

2 Evaluation of adaptive learning models

The problem setting of adaptive supervised learning is as follows. Our datasets are composed of examples ordered in time; each example is described by the values of a fixed set of input variables and a target variable (the label), which the model needs to predict. The data distribution is expected to change over time. Learning models have mechanisms to adapt online by taking into account the new incoming examples. Prediction accuracy is the primary measure of performance.

Test-then-train [12] is a standard procedure for evaluating the performance of adaptive learning models. This procedure restricts evaluation to a single run with a fixed configuration of changes in the data. While different learning models have different adaptation rates, the results on a fixed test snapshot with a few changes may not be sufficient to generalize how an adaptive model would perform online on a given problem.

Evaluating adaptive models differs from evaluating learning models in the stationary setting in several key aspects. Firstly, in the stationary setting all data is assumed to originate from the same underlying distribution, and thus examples are exchangeable. Testing sets can be formed by combining randomly chosen examples from a dataset; we can use ten-fold or leave-one-out cross-validation or bootstrapping [25]. Adaptive models need to be tested on sequential data where a sequence includes different distributions and examples generally are not exchangeable; hence, when forming test sets we need to respect the order of the data. Moreover, in the stationary setting we aim at testing the predictive power of a model, which we measure by the testing accuracy. In evolving settings we aim at testing the predictive power as well as the ability to adapt, and we measure both by the same testing accuracy. One change in the data distribution over time can be considered as one testing instance for the ability to adapt. Data stream snapshots at hand may contain only a few changes. In such a case the evaluation of the ability to adapt will be biased towards the order and timing of changes in this data. This problem could be alleviated if we had a really long sequence with a few distributions and many changes, or a sequence in continuous (perhaps gradual) change. Unfortunately, real sequential datasets often include only a few distribution changes, even though they contain plenty of examples (in particular data streams). For example, changes in sales due to seasonality, changes in suppliers or production processes, or new adversary activities in credit card usage do not happen very frequently. Changes in the economic situation or in personal interests can be even slower than yearly. Thus, the validity of conclusions from a single-run evaluation may be limited to this particular order of data, while the goal of evaluation is to generalize about adaptivity when changes happen unexpectedly, and in fact for any change pattern in a given application.

3 Proposed permutations

We aim at permuting a dataset so that the permuted sets closely resemble the original learning problem, yet provide meaningful additional test data. We require that examples that were near in time in the original sequence remain close to each other in the permuted sequence. We propose three controlled permutation techniques that aim at modifying the configuration of changes in a dataset. Our permutations are non-intrusive to the original data: they do not need to detect or model changes in the instance space, they only manipulate the positions of examples in a sequence. Permutations will not help in situations where the available data snapshot misses some of the distributions that may happen in reality for a given application; however, they will help to model various transitions and changes across the existing distributions.

The proposed permutations differ in which examples in a dataset they assume to be exchangeable, and each highlights a particular type of change. The time permutation forces sudden drifts by shifting blocks of data. This permutation assumes that blocks of data are exchangeable with each other, as they represent different distributions of the original data and these distributions may occur in any order in an arbitrary data snapshot. The speed permutation introduces reoccurring concepts by shifting a set of examples to the end of the sequence,

Fig. 2 The proposed permutations: the time permutation (left), the speed permutation (center), and the shape permutation (right).

aiming at varying the speed of changes and modeling recurring concepts. It assumes that the distributions observed in the past may reoccur. The shape permutation forces gradual drifts by perturbing examples within their close neighborhood. It assumes that examples that arrive at a similar time are exchangeable with each other.

We study permutations in the following formal setting. Given is a sequential dataset consisting of n examples (an n-sequence). Each example has an index, which indicates its position in time; we will permute these indices. Let Ω_n be the space of all permutations of the integers {1, 2, . . . , n}. The identity sequence I(n) is an n-sequence where the indices are in order (1, 2, . . . , n). Let J = (j_1, . . . , j_n) be a permutation from Ω_n. Consider a permutation function π such that π(m) = j_m and π^{-1}(j_m) = m. Here j_m is the original index of the example which is now in position m. For example, let the original sequence be I(3) = (1, 2, 3). If π(1) = 2, π(2) = 3 and π(3) = 1, then J = (2, 3, 1). In this case π^{-1}(1) = 3, π^{-1}(2) = 1 and π^{-1}(3) = 2.

The time permutation. First we randomly determine where to split: a split can occur after each example with a chosen probability p, so after processing the whole sequence we obtain a number of splits. Next we reverse the order of the blocks which resulted from splitting, as illustrated in Figure 2 (left). The procedure is given in Algorithm 1. The parameter p varies the extent of the permutation.

The speed permutation. For each example we determine at random, with probability p, whether or not it will be lifted. The lifted examples are then moved to the end of the sequence, keeping their original order, as illustrated in Figure 2 (center). The procedure is given in Algorithm 2. The parameter p varies the extent of the permutation.

The shape permutation. An example is selected uniformly at random and swapped with its right neighbor. The permutation is illustrated in Figure 2 (right).
The procedure is given in Algorithm 3. To keep the permutation local, the number of iterations k must be controlled, otherwise we end up with a random order. Obviously, the larger k is, the further away from the original distribution we get. Therefore we constrain the number of iterations to k < 2n; this constraint is justified in Section 4. The proposed permutation procedures are inspired by physical analogs in card shuffling: we model the time permutation as the overhand card shuffle [18], the speed permutation as the inverse riffle shuffle [1], and the shape permutation as a transposition shuffle [1].

Algorithm 1: The time permutation.
input : data length n, probability of split p
output: permutation π(m) = j
assign k = 0; s_0 = 0
for i = 1 to n − 1 do
    if p > ξ ∼ U[0, 1] then split: k = k + 1; s_k = i
assign the last split: k = k + 1; s_k = n
for j = 1 to k do
    reverse a block: π*(s_{j−1} + 1 . . . s_j) = (s_j . . . s_{j−1} + 1)
reverse the full sequence back: π(1 . . . n) = π*(n . . . 1)
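A compact sketch of Algorithm 1 (an illustration with 0-based indices; the function name and `rng` argument are ours):

```python
import random

def time_permutation(n, p, rng=None):
    """Split the sequence at each gap with probability p, then reverse the
    order of the resulting blocks (within-block order is preserved)."""
    rng = rng or random.Random(0)
    splits = [0] + [i for i in range(1, n) if rng.random() < p] + [n]
    blocks = [list(range(s, e)) for s, e in zip(splits, splits[1:])]
    # order[m] = original index of the example now at position m
    return [i for block in reversed(blocks) for i in block]
```

With p = 0 the identity order is returned; with p = 1 every example forms its own block, so the sequence is fully reversed.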

Algorithm 2: The speed permutation.
input : data length n, probability of lift p
output: permutation π(m) = j
assign Π_0 = (); Π_1 = ()
for i = 1 to n do
    if p > ξ ∼ U[0, 1] then add i to sequence Π_1 = (Π_1, i)
    else add i to sequence Π_0 = (Π_0, i)
concatenate the two sequences: π(1 . . . n) = (Π_0, Π_1)
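Algorithm 2 translates similarly (a sketch with 0-based indices; names are ours):

```python
import random

def speed_permutation(n, p, rng=None):
    """Lift each example with probability p; lifted examples move to the end
    of the sequence, both subsequences keeping their original order."""
    rng = rng or random.Random(0)
    kept, lifted = [], []
    for i in range(n):
        (lifted if rng.random() < p else kept).append(i)
    return kept + lifted
```

Because both subsequences keep their original (rising) order, the result contains at most one descent, at the junction of the two subsequences.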

Algorithm 3: The shape permutation.
input : data length n, number of swaps k < 2n
output: permutation π_k(m) = j
start with the identity order: π_0(1 . . . n) = (1 . . . n)
for i = 1 to k do
    randomly select s ∈ {1, . . . , n − 1}
    swap: π_i(s) = π_{i−1}(s + 1); π_i(s + 1) = π_{i−1}(s)
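And Algorithm 3 (a sketch; names are ours):

```python
import random

def shape_permutation(n, k, rng=None):
    """Perform k random adjacent swaps; k < 2n keeps the permutation local."""
    assert k < 2 * n
    rng = rng or random.Random(0)
    order = list(range(n))
    for _ in range(k):
        s = rng.randrange(n - 1)  # position s in {0, ..., n-2}
        order[s], order[s + 1] = order[s + 1], order[s]
    return order
```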

4 Controlling the permutations

Our next step is to define a measure capturing to what extent a permutation preserves the original distributions. With this measure we can theoretically justify that our permutations perturb the original order to a controlled extent, staying far from random permutations. This ensures that we do not lose varying data distributions, yet significantly perturb the data, as desired.

4.1 Measuring the extent of permutations

A number of distance measures between two permutations exist [8,21,22]. They count the editing operations (e.g. insert, swap, reverse) needed to get from one permutation to the other. Such distances are not suitable for our purpose, as

they measure absolute change in the position of an example, while we need to measure relative change. We are interested in measuring how well local distributions of data are preserved after a permutation. Preserving local distributions means that examples that were originally near to each other in time need to remain close after a permutation. Thus, instead of measuring how far the examples have shifted, we need to measure how far they have moved from each other. If the examples move together, then local distributions are preserved. To illustrate the requirement, consider a sequence of eight examples (1, 2, 3, 4, 5, 6, 7, 8). The measure should treat the permutation (5, 6, 7, 8, 1, 2, 3, 4) as being close to the original order: the local distributions (1, 2, 3, 4) and (5, 6, 7, 8) are preserved while the blocks are swapped. The permutation (8, 7, 6, 5, 4, 3, 2, 1) needs to be treated as very close to the original: although the global order has changed completely, every example locally has the same neighbors. In contrast, the permutation (1, 8, 3, 6, 5, 4, 7, 2) needs to be treated as very distant from the original: although half of the examples did not move globally, the neighbors are mixed and the local distributions are completely destroyed. To capture these local distribution aspects we introduce the neighbor distance measure.

Definition 1 The total neighbor distance (TND) between the original sequence and its permutation is defined as D = Σ_{i=1}^{n−1} |j_i − j_{i+1}|, where j_i is the original position of the example that is now in position i, that is, j_i = π^{-1}(i).

For example, the total neighbor distance between the identity permutation and J = (1, 3, 2, 4) is D(J) = |1 − 3| + |3 − 2| + |2 − 4| = 5.

Definition 2 The average neighbor distance (AND) between the original sequence and its permutation is the total neighbor distance divided by the number of adjacent pairs, d = D/(n − 1).

In our example D(J) = 5 and n = 4, hence d(J) = 5/3.
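Definitions 1 and 2 translate directly into code (a sketch; the function names are ours):

```python
def total_neighbor_distance(j):
    """TND (Definition 1): sum of |j_i - j_{i+1}| over adjacent positions of
    the permuted sequence of original indices."""
    return sum(abs(a - b) for a, b in zip(j, j[1:]))

def average_neighbor_distance(j):
    """AND (Definition 2): TND divided by the number of adjacent pairs."""
    return total_neighbor_distance(j) / (len(j) - 1)

# The examples from the text:
d_swap = total_neighbor_distance([1, 3, 2, 4])                   # = 5
d_reverse = average_neighbor_distance([8, 7, 6, 5, 4, 3, 2, 1])  # minimal: close
d_mixed = average_neighbor_distance([1, 8, 3, 6, 5, 4, 7, 2])    # large: distant
```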
In our definition neither TND nor AND is a metric; they are aggregations of the distances between adjacent examples (which are metrics). Although these measures could be normalized to the interval [0, 1], we keep them in this simple form, since we are interested in differences rather than absolute values. We have investigated the following nine distance measures between permutations, presented up to normalizing denominators. Assume that the original sequence is in the order (1, 2, 3, . . . , n).

– Kendall distance counts the number of swaps of neighboring elements needed to get from one permutation to the other: d_K ∼ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 1(π(j) < π(i)).
– Precede distance counts the number of times the elements precede each other; up to a constant it is the same as Kendall distance: d_C ∼ d_K.
– Spearman rank correlation aggregates the squared differences between the positions of the same element in two permutations: d_S ∼ Σ_{i=1}^{n} (π(i) − i)².

– Position distance sums the differences between the positions of the elements: d_P ∼ Σ_{i=1}^{n} |π(i) − i|.
– Adjacent distance counts the number of elements which neighbor each other in the two permutations: d_A ∼ −Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 1(|π(j) − π(i)| = 1).
– Exchange distance counts the exchange operations of two elements needed to get from one permutation to the other: d_E ∼ Σ_{i=1}^{n} 1(π*(i) ≠ i), where π* changes as the exchange operations proceed.
– Hamming distance counts the number of elements in the same positions: d_H ∼ Σ_{i=1}^{n} 1(π(i) = i).
– Rising sequences count the number of increasing subsequences: d_R ∼ −Σ_{i=1}^{n−1} 1(π^{-1}(i) > π^{-1}(i + 1)).
– Runs count the number of increasing index subsequences: d_U ∼ −Σ_{i=1}^{n−1} 1(π(i) > π(i + 1)).

None of these distances captures the same properties of permutations as our measure. Edit-based distance measures treat the reverse permutation (8, 7, 6, 5, 4, 3, 2, 1) as the most distant, while our measure treats it as very close in terms of the local distributions, which is a desired property. The adjacent distance appears to be the closest to our measure TND, but the relation is not that strong. The adjacent distance requires preserving the exact neighbors, while our measure captures how far the examples are from each other in the new permutation. The adjacent distance does not quantify how strongly a permutation destroys neighborhoods, while our measure does.
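To illustrate the contrast, we can compare Kendall distance (an edit-based measure) with TND on the examples above (a sketch with simple O(n²) implementations; names are ours):

```python
def kendall(perm):
    """Kendall distance from the identity: number of inverted pairs."""
    return sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
               if perm[j] < perm[i])

def tnd(perm):
    """Total neighbor distance (Definition 1)."""
    return sum(abs(a - b) for a, b in zip(perm, perm[1:]))

reverse = [8, 7, 6, 5, 4, 3, 2, 1]  # maximal for Kendall, minimal for TND
mixed = [1, 8, 3, 6, 5, 4, 7, 2]    # moderate for Kendall, large for TND
```

Here kendall(reverse) = 28, the maximum for n = 8, while tnd(reverse) = 7, the minimum; conversely kendall(mixed) = 14 but tnd(mixed) = 25, matching the discussion above.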

4.2 The theoretical extent of our permutations

Having defined how to measure the extent of permutations, we can now derive theoretical expressions of that measure for our permutations.

Proposition 3 The expected average neighbor distance d after the time permutation of an n-sequence with the probability of split p is E(d) = 1 + 2np(n − np − 1)/((np + 1)(n − 1)).

For fixed p ∈ (0, 1) the time permutation tends to lim_{n→∞} E(d) = 3 − 2p < 3. We prove Proposition 3 in two stages: first we assume that the number of splits is fixed, and then we generalize to a random number of splits.

Proposition 4 The expected total neighbor distance D after the time permutation of an n-sequence with exactly k splits is E(D_k) = n − 1 + 2(n − k) − 2n/(k + 1).

Proof (of Proposition 4) E(D_k) = D(I) − k + E(H): starting from the identity permutation D(I), subtract k due to the splits happening and add H due to the concatenation after the splits. Denote the positions of the splits in ascending order as k_1, k_2, . . . , k_k. The total neighbor distance of the identity sequence is D(I(n)) = |1 − 2| + |2 − 3| + . . . + |n − 1 − n| = n − 1. H can be expressed as H = k_n − (k_{k−1} + 1) + k_k − (k_{k−2} + 1) + . . . + k_2 − (k_0 + 1), where k_n = n and k_0 = 0. After canceling out we get H = n + k_k − k_1 − k, and E(H) = n + E(k_k − k_1) − k. There are n − 1 possible positions for the k splits. It is easy to show that the expected difference between the minimum and the maximum element in a combination of k out of n − 1 positions is E(k_k − k_1) = n(k − 1)/(k + 1). Thus, E(H) = n − k + n(k − 1)/(k + 1), and E(D_k) = n − 1 − k + n − k + n(k − 1)/(k + 1) = n − 1 + 2(n − k) − 2n/(k + 1). □

Proof (of Proposition 3) Now k is a random variable following the binomial distribution k ∼ B(n, p), thus the expected number of splits is E(k) = np. From Proposition 4 we get E(D) = n − 1 + 2(n − E(k)) − 2n/(E(k) + 1) = n − 1 + 2np(n − np − 1)/(np + 1), and E(d) = E(D)/(n − 1) = 1 + 2np(n − np − 1)/((np + 1)(n − 1)). □
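Proposition 3 can be checked empirically by simulating the time permutation and averaging the AND over many trials (a sketch; the tolerance is chosen loosely, since the closed form is itself an approximation for finite n):

```python
import random

def expected_and_time(n, p):
    """Closed form of Proposition 3."""
    return 1 + 2 * n * p * (n - n * p - 1) / ((n * p + 1) * (n - 1))

def simulated_and_time(n, p, trials, rng=None):
    """Monte Carlo estimate of the AND after the time permutation."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        # one time permutation: split at each gap with prob. p,
        # then reverse the block order
        splits = [0] + [i for i in range(1, n) if rng.random() < p] + [n]
        blocks = [list(range(s, e)) for s, e in zip(splits, splits[1:])]
        order = [i for b in reversed(blocks) for i in b]
        total += sum(abs(a - b) for a, b in zip(order, order[1:])) / (n - 1)
    return total / trials
```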

Proposition 5 The expected average neighbor distance d after the speed permutation of an n-sequence with the probability p of lifting each example is E(d) = 3(n + 1)/(n − 1) − 3/(p(1 − p)(n − 1)).

The speed permutation tends to lim_{n→∞} E(d) = 3. To prove Proposition 5 we need the TND of a rising sequence.

Definition 6 A rising sequence is a sequence M = (j_1, j_2, . . . , j_m), where j_1 < j_2 < . . . < j_m.

Proposition 7 The total neighbor distance D of a rising sequence M is D(M) = j_m − j_1.

Proof (of Proposition 7) D(M) = Σ_{i=1}^{m−1} |j_{i+1} − j_i| = Σ_{i=1}^{m−1} (j_{i+1} − j_i) = j_2 − j_1 + j_3 − j_2 + . . . + j_m − j_{m−1} = j_m − j_1. □

Proof (of Proposition 5) A lift forms two rising sequences. We denote the lifted subsequence l_1, l_2, . . . , l_L and the remaining subsequence z_1, z_2, . . . , z_Z. Since the starting sequence is in the identity order, both the lifted and the remaining subsequences are rising. The total neighbor distance after the permutation is E(D) = E(z_Z − z_1 + l_L − l_1 + z_Z − l_1) = 2E(z_Z) − E(z_1) + E(l_L) − 2E(l_1), which is the sum of the neighbor distances within the two subsequences plus their concatenation. The lifted subsequence can start with '1' with probability p, or with '2' with probability (1 − p)p, given that '1' was not lifted, and so on: E(l_1) = 1p + 2(1 − p)p + 3(1 − p)²p + . . . + n(1 − p)^{n−1}p ≈ p/p² = 1/p, and E(z_1) ≈ 1/(1 − p). The sums use the identity Σ_{j=1}^{n} jp^j ≈ p/(1 − p)², which is straightforward to verify by decomposing the sum into geometric progressions. Similarly, E(l_L) = np + (n − 1)(1 − p)p + (n − 2)(1 − p)²p + . . . + (1 − p)^{n−1}p ≈ (np − 1 + p)/p = n + 1 − 1/p, and E(z_Z) ≈ n + 1 − 1/(1 − p). These sums use the identity Σ_{j=0}^{n} (n − j)p^j ≈ (n(1 − p) − p)/(1 − p)². With the terms in place we get E(D) ≈ 3(n + 1) − 3/(p(1 − p)), and E(d) = E(D)/(n − 1) = 3(n + 1)/(n − 1) − 3/(p(1 − p)(n − 1)). □
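Proposition 5 can be checked by simulation in the same way (a sketch; names are ours, and the closed form is an approximation for finite n):

```python
import random

def expected_and_speed(n, p):
    """Closed form of Proposition 5 (approximate for finite n)."""
    return 3 * (n + 1) / (n - 1) - 3 / (p * (1 - p) * (n - 1))

def simulated_and_speed(n, p, trials, rng=None):
    """Monte Carlo estimate of the AND after the speed permutation."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        kept, lifted = [], []
        for i in range(n):
            (lifted if rng.random() < p else kept).append(i)
        order = kept + lifted
        total += sum(abs(a - b) for a, b in zip(order, order[1:])) / (n - 1)
    return total / trials
```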

For the time and the speed permutations one iteration results in many edit operations in a sequence; examples mix fast. In contrast, one iteration of the shape permutation makes a single edit operation, so examples mix slowly, and we need more than one iteration to perturb the order. Thus, the AND of the shape permutation is also a function of the number of iterations k.

Proposition 8 The expected average neighbor distance d after k iterations of the shape permutation of an n-sequence is E(d) < 1 + 2k/(n − 1) − (1/2)(k/n)².

In the prudent theoretical limit (ignoring the minus term) the shape permutation gives lim_{n→∞} E(d) < 5 for k < 2n. Note that for k < 2n we have E(d) < 3, and for fixed k we have lim_{n→∞} E(d) < 1.

Proof (sketch of Proposition 8) Denote by E(D_k) the expected total neighbor distance after k iterations of the shape permutation. Since we start from the identity sequence, E(d_0) = (n − 1)/(n − 1) = 1. A single adjacent swap leaves the distance of the swapped pair at 1 and changes the distances of the two surrounding pairs, increasing the total neighbor distance by at most 2 in expectation; hence E(d_k) < E(d_{k−1}) + 2/(n − 1). The inequality appears since at the start and at the end of the sequence the examples have one neighbor instead of two, while we treat them as having two neighbors. Iterating over the k steps, E(d_k) < E(d_0) + 2k/(n − 1) = 1 + 2k/(n − 1). □

In order to assess how far a permutation is from random, we also need the minimum AND and the expected AND of a random permutation.

Proposition 9 The minimum average neighbor distance d_min of a permutation of an n-sequence is d_min = 1.

Proof (of Proposition 9) A sequence of length n contains n − 1 adjacent pairs. Since there are no equal indices in the sequence, the distance between any two adjacent indices cannot be less than 1. In the identity sequence (1, 2, 3, . . . , n) the distances between all adjacent neighbors are equal to 1. Thus, d_min = (n − 1) · 1/(n − 1) = 1. □

Proposition 10 The expected average neighbor distance d of a random permutation of an n-sequence is E(d) = (n + 1)/3.
Proof (of Proposition 10) When permuting at random, any permutation J ∈ Ω_n is equally likely. Let J = (j_1, j_2, . . . , j_n) be a permutation. Since TND is a simple sum of pairwise distances, the expected neighbor distance of a random permutation reduces to the expected distance between two randomly chosen indices in a sequence: E(d) = E(|j_i − j_{i+1}|). We find the expected value as an average over all possible pairs: E(|j_i − j_{i+1}|) = Σ_{u=1}^{n} Σ_{v=u+1}^{n} (v − u) / C(n, 2). The components of the numerator can be expressed as a triangular matrix T of size (n − 1) × (n − 1), where the elements t_ij = i for j ≤ n − i and 0 otherwise. It can be shown that the sum of all the elements of such an m × m matrix is S_m^T = m(m + 1)(m + 2)/6. With m = n − 1 we get E(|j_i − j_{i+1}|) = S_{n−1}^T / C(n, 2) = (n + 1)/3. □

In summary, our permutations double to triple the average neighbor distance, which is a substantial variation for generating multiple test sets. The expected AND of a random permutation is linear in n (Proposition 10), while the maximum AND of our permutations is constant in n. Since our permutations are that far from random and close to the original order, we are not losing variations in data distributions: the order of the data is perturbed to a controlled extent.

5 Experiments

We explore our permutations experimentally in two parts. First, we visualize the permutations on real data; our goal is to analyze the behavior of the permutations when changes in the original sequence happen in different ways. Second, we test a set of adaptive learning models on real evolving datasets and their permutations; our goal is to demonstrate what additional information becomes available as a result of testing with our permutations, and how using this information reduces the risk of a single-evaluation bias.

5.1 Visual inspection

We visualize data over time so that the effects of permutations on changes in the data distribution can be observed, to give an intuitive perspective on what effect the proposed permutations achieve. For simplicity of interpretation we limit this analysis to a single input variable. We plot four numeric input variables from evolving datasets from [13,28] as time series, so that different types of changes are represented (s1: a sudden change in concepts, s2: gradual (incremental) change, s3: reoccurring concepts, s4: no concept drift). Figure 3 presents the original sequences; the time, the speed and the shape permutations of the sequences; as well as a random permutation. We see that all datasets permuted using the time, the speed and the shape permutation techniques closely resemble the original sequences and preserve distinct distributions of data. In contrast, a random permutation produces sequences that are uniformly distributed over time.
As a result, the different distributions, and with them the need for adaptivity, are lost. Randomly permuted sequences represent different online learning problems that do not require adaptation over time.

Fig. 3 The original and permuted variables (s1–s4; original, time, speed, shape, and random orders).

5.2 Testing adaptive models with multiple test sets

Next we present computational experiments to demonstrate how our permutations can aid in evaluating adaptive models. We use three real datasets whose original time order covers a two- to three-year period. All datasets present binary classification problems where concept drift is expected. The Chess dataset (size: 503 × 8) presents the task of predicting the outcome of a chess game; the skills of a player and the types of tournaments evolve over time. The Luxembourg dataset (size: 1 901 × 31) asks to predict how much time a person spends on the internet, given demographic information; the task is relevant for marketing purposes, and internet usage is expected to change over time. The Electricity dataset [13] (size: 45 312 × 8) is the same as in Section 2. For each dataset we generate ten permutations of each type (time, speed and shape) with the parameters fixed to p = 0.5 and k = n. The code for our permutations is available online (see the footnote below).

We test five adaptive classifiers: OzaBagAdwin [6] with 1 and 10 ensemble members (Oza1, Oza10), DDM [11], EDDM [4] and HoeffdingOptionTreeNBAdaptive (Hoeff) [19]. We use the MOA implementations [5] of these classifiers with Naive Bayes as the base classifier. Hoeff on the Electricity data is not reported, as it runs out of memory.

Multiple tests with our permutations enable assessing three more aspects of the performance in addition to the testing accuracy: volatility, reaction to different types of changes, and stability of the classifier ranking. Figure 4 plots the accuracies and the standard deviations of the tests. Robustness of the performance can be assessed by calculating the standard deviation of the accuracy. For instance, on the Chess data Hoeff shows the most robust performance. On the other two datasets we can conclude that the ensemble techniques (Oza1 and Oza10) are quite robust, while the detectors (DDM and EDDM) are pretty resilient.
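The multiple-test protocol used here (several permutations per dataset, test-then-train on each, mean and standard deviation of the accuracies) can be sketched end-to-end (a toy illustration; the model, data, and function names are ours, not the paper's code):

```python
import random
import statistics

def prequential_accuracy(labels):
    """Toy test-then-train run: a majority-class model over a label sequence."""
    counts, correct, tested = {}, 0, 0
    for i, y in enumerate(labels):
        if i > 0:
            correct += (max(counts, key=counts.get) == y)
            tested += 1
        counts[y] = counts.get(y, 0) + 1
    return correct / tested

def speed_perm(n, p, rng):
    """Speed permutation: lifted examples move to the end, order preserved."""
    kept, lifted = [], []
    for i in range(n):
        (lifted if rng.random() < p else kept).append(i)
    return kept + lifted

def evaluate_with_permutations(labels, n_perms=10, p=0.5, seed=0):
    """Accuracy on the original order, plus mean/std over permuted orders."""
    rng = random.Random(seed)
    base = prequential_accuracy(labels)
    accs = []
    for _ in range(n_perms):
        order = speed_perm(len(labels), p, rng)
        accs.append(prequential_accuracy([labels[i] for i in order]))
    return base, statistics.mean(accs), statistics.stdev(accs)

# A drifting toy stream: label 0 dominates, then label 1.
stream = [0] * 100 + [1] * 100
base, mean_acc, std_acc = evaluate_with_permutations(stream)
```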
The permutations allow assessing reactions to variations in changes within the vicinity of the original data. For example, Hoeff is the least accurate

1 Available at https://sites.google.com/site/zliobaite/permutations

Fig. 4 Testing accuracies and one standard deviation computed on permutations, for the Chess, LU and Elec datasets (original, time, speed and shape orders; classifiers DDM, EDDM, Oza1, Oza10 and Hoeff).

on the original Chess dataset and on the shape permutations, while it is the most accurate on the time permutations. The results suggest that Hoeff is better at handling sudden changes. When testing with the time permutation on the Chess data the accuracies of Oza1, Oza10 and Hoeff are notably higher than on the original. The observation suggests that the listed classifiers are better in handling more sudden drifts than incremental ones. Finally, our permutations allow to assess the stability of the ranking of classifiers as well as comparing them pairwise. We see on the LU data that the ranking of accuracies is very stable, it does not vary. Drawing the conclusions in the paragraph above would have been risky or impossible on the evidence of a single test run, but is reasonable to nominate Hoeff as the most accurate classifier with the use of our controlled permutations. Our final recommendations for testing practice are as follows. When evaluating a set of adaptive learning models we suggest first to run the test-thentrain test on the original data and use these results as a baseline. We suggest to run multiple tests with permuted datasets, that would inform about robustness of the classifier to variations in changes. We advise to use this information for complementary qualitative assessment. Before applying a permutation one needs to critically think if a particular type of permutations is sensible from the domain perspective for a given application. 6 Related Work Our study relates to three lines of research: comparing supervised learning models, randomization in card shuffling and measuring distance between permutations. Comparing the performance of classifiers received a great deal of attention in the last decade, e.g. [7, 9, 17]; however, these discussions assume a classical classification scenario, where the data is static. A recent contribution [12]


addresses issues of evaluating classifiers in the online setting. The authors present a collection of tools for comparing classifiers on streaming data (not necessarily changing data) and provide means to present and analyze the results after the test-then-train procedure. Our work concerns the test-then-train procedure itself, and thus can be seen as complementary. We are not aware of any research addressing the problem of evaluation bias for adaptive classifiers or studying how to generate multiple tests for such classifiers.

The second line of research relates to measuring distance between permutations in general [8, 21, 22] or with specific applications in bioinformatics (e.g. [10]). In Section 4 we reviewed and experimentally investigated the main existing distance measures. As we discussed, these measures quantify absolute changes in the positions of examples, while our problem requires evaluating relative changes; thus we have introduced a new measure.

A large body of literature studies randomization in card shuffling (e.g. [1, 18]). These works theoretically analyze shuffling strategies to determine how many iterations are needed to mix a deck of cards to a random order. Although our datasets can be seen as decks of cards, we cannot directly reuse the theory of shuffling times, as it focuses on different aspects of the problem: to adapt those theoretical results for our purpose we would need to model the probability distribution of the relations between cards. In the light of this option, we argue that our choice of the average neighbor distance is much simpler and more straightforward.

A few further areas are related only via terminology. Restricted permutations [3] avoid subsequences ordered in a prescribed way; such requirements are not relevant for our permutations. Permutation tests [23] assess the statistical significance of a relation between two variables (e.g. an input variable and the label), while we assess the effects of the data order.
Block permutations [2] detect change points in time series; we do not aim to analyze the data content, we perturb the data order. Discovering periodicity in time series [27] is likewise based on analyzing the content of the data, while we operate on the indices of a sequence; notably, that study uses permutations of the elements of a time series as a baseline of no periodicity. Time series bootstrap methods [20] aim to estimate the distribution of data by resampling. Our time permutation is similar as a technique; however, the problem setting is different, and thus these methods are generally not directly reusable: they are designed for identically distributed dependent data, while our setting implies that the data is independent but not identically distributed.
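To illustrate the distinction between absolute and relative displacement drawn above, here is a small sketch of an average-neighbor-distance-style measure. This is an assumption about the form of the measure (its exact definition is given in Section 4 of the paper): we average, over all pairs of indices that are adjacent in the original order, how far apart they end up in the permuted order.

```python
def average_neighbor_distance(perm):
    """Sketch of a relative-displacement measure over a permutation of
    range(n): for every pair (i, i+1) adjacent in the original order,
    measure their separation in the permuted order, and average."""
    n = len(perm)
    pos = {v: i for i, v in enumerate(perm)}  # value -> position in perm
    return sum(abs(pos[i] - pos[i + 1]) for i in range(n - 1)) / (n - 1)

print(average_neighbor_distance([0, 1, 2, 3, 4]))  # identity -> 1.0
print(average_neighbor_distance([4, 3, 2, 1, 0]))  # reversal -> 1.0
```

Note that a full reversal, which maximizes absolute displacement, leaves this measure at its minimum: every example keeps its original neighbors, which is exactly the relative, neighborhood-based behavior the text argues for.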

7 Conclusion

We proposed a methodology for generating multiple test sets from real sequential data for evaluating adaptive models designed for online operation. We pointed out that the standard test-then-train procedure, which runs a single test per dataset, risks producing results biased towards the fixed positions of changes in the dataset. Thus, we propose to run multiple tests with


randomized copies of a data stream. We develop three permutation techniques that are theoretically restricted so that the distributions present in the original data are not destroyed by a permutation. Our experiments demonstrate that such multiple tests provide good means for a qualitative analysis of the performance of adaptive models. In addition to the accuracy from a single test, they make it possible to assess three more characteristics of the performance: volatility, reactivity to different ways changes can happen, and stability of the model ranking. Our permutations make it possible to pinpoint specific properties of the performance and to explore the sensitivity of the results to the data order. Such an analysis complements evaluation and can make the assessment more reliable. Our permutations can be viewed as a form of cross-validation for evolving data.

This research opens several follow-up research directions. It would be relevant to find which statistical tests are suitable for assessing the statistical significance of the resulting accuracies; the problem is challenging, since the results from multiple tests cannot be considered independent. Another interesting direction is to develop mechanisms that, instead of restricting permutations with an upper bound, produce permutations of a specified extent, or, going further, sample an extent from a probabilistic model (e.g. the Mallows model) and then generate a permutation accordingly.

Acknowledgements The research leading to these results has received funding from the European Commission within the Marie Curie Industry and Academia Partnerships and Pathways (IAPP) programme under grant agreement no 251617.
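The Mallows-model direction mentioned in the conclusion can be illustrated with a small sketch. This is not part of the proposed methodology; it is a minimal sampler for the Mallows model over permutations (Kendall tau form) built with the standard repeated-insertion construction, under a hypothetical function name `sample_mallows` and dispersion parameter q.

```python
import random

def sample_mallows(n, q, rng):
    """Sample a permutation of range(n) from the Mallows model (Kendall tau)
    by repeated insertion: item i goes into slot j (0..i) with probability
    proportional to q**(i - j). As q -> 0 the reference order is preserved;
    q = 1 yields a uniformly random permutation."""
    perm = []
    for i in range(n):
        weights = [q ** (i - j) for j in range(i + 1)]
        r = rng.random() * sum(weights)
        for j, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        perm.insert(j, i)
    return perm

rng = random.Random(42)
print(sample_mallows(8, 0.001, rng))  # nearly the identity for tiny q
```

A permutation generator driven by such a sampler would produce variations whose expected extent is controlled by q, instead of being bounded from above.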

References

1. D. Aldous and P. Diaconis. Shuffling cards and stopping times. The American Mathematical Monthly, 93(5):333–348, 1986.
2. J. Antoch and M. Huskova. Permutation tests in change point analysis. Statistics and Probability Letters, 53:37–46, 2001.
3. M. Atkinson. Restricted permutations. Discrete Mathematics, 195:27–38, 1999.
4. M. Baena-Garcia, J. del Campo-Avila, R. Fidalgo, A. Bifet, R. Gavalda, and R. Morales-Bueno. Early drift detection method. In Proceedings of the ECML PKDD Workshop on Knowledge Discovery from Data Streams, pages 77–86, 2006.
5. A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis. Journal of Machine Learning Research, 11:1601–1604, 2010.
6. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavalda. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 139–148, 2009.
7. J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
8. P. Diaconis. Group Representations in Probability and Statistics, volume 11 of Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1988.
9. T. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
10. R. Durrett. Shuffling chromosomes. Journal of Theoretical Probability, 16(3):725–750, 2003.
11. J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In Proceedings of the Brazilian Symposium on Artificial Intelligence (SBIA), pages 286–295, 2004.
12. J. Gama, R. Sebastiao, and P. P. Rodrigues. Issues in evaluation of stream learning algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 329–338, 2009.
13. M. Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.
14. E. Ikonomovska, J. Gama, and S. Dzeroski. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23(1):128–168, 2011.
15. I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, 22:371–391, 2010.
16. J. Kolter and M. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8:2755–2790, 2007.
17. M. Ojala and G. Garriga. Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11:1833–1863, 2010.
18. R. Pemantle. Randomization time for the overhand shuffle. Journal of Theoretical Probability, 2(1):37–49, 1989.
19. B. Pfahringer, G. Holmes, and R. Kirkby. New options for Hoeffding trees. In Proceedings of the 20th Australian Joint Conference on Advances in Artificial Intelligence (AJCAAI), pages 90–99, 2007.
20. D. Politis. The impact of bootstrap methods on time series analysis. Statistical Science, 18(2):219–230, 2003.
21. T. Schiavinotto and T. Stutzle. A review of metrics on permutations for search landscape analysis. Computers and Operations Research, 34(10):3143–3153, 2007.
22. K. Sorensen. Distance measures based on the edit distance for permutation-type representations. Journal of Heuristics, 13(1):35–47, 2007.
23. W. Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.
24. G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.
25. I. Witten, E. Frank, and M. Hall. Data Mining: Practical Machine Learning Tools and Techniques (3rd Ed.). Morgan Kaufmann, 2011.
26. M. Wozniak. A hybrid decision tree training method using data streams. Knowledge and Information Systems, 29(2):335–347, 2011.
27. M. Vlachos, P. Yu, V. Castelli, and Ch. Meek. Structural periodic measures for time-series data. Data Mining and Knowledge Discovery, 12:1–28, 2006.
28. I. Zliobaite. Combining similarity in time and space for training set formation under concept drift. Intelligent Data Analysis, 15(4):589–611, 2011.
29. I. Zliobaite. Controlled permutations for testing adaptive classifiers. In Proceedings of the 14th International Conference on Discovery Science (DS), pages 365–379, 2011.
