ICIC Express Letters
Volume 6, Number 2, February 2012, pp. 455–460
ISSN 1881-803X    © 2012 ICIC International

MINING IMBALANCED AND CONCEPT-DRIFTING DATA STREAMS USING SUPPORT VECTOR MACHINES

Hien M. Nguyen1, Eric W. Cooper2 and Katsuari Kamei2

1 Graduate School of Science and Engineering
2 College of Information Science and Engineering
Ritsumeikan University
1-1-1 Noji-Higashi, Kusatsu, Shiga 525-8577, Japan
[email protected]; [email protected]; [email protected]

Received April 2011; accepted July 2011

Abstract. This paper presents a new method for mining imbalanced and concept-drifting data streams using support vector machines (SVM). Our proposed method is based on the ensemble learning scheme, in which the AUC evaluation measure is used instead of the overall accuracy. The minority class is learned incrementally based on support vectors to improve the representation of this class. The majority class is under-sampled if needed to further balance the training set. In addition, we use a cross-validation technique to prevent out-of-date instances from being involved in the training of classification models. Experimental results on a synthetic data stream show that our method achieves better performance than other methods.

Keywords: Class imbalance, Concept drift, Data streams, Ensemble learning, Incremental learning, Support vector machines

1. Introduction. Data stream mining is currently an active research field in the data mining and machine learning community. Some examples include credit card fraud detection [1], web filtering [2], and click-stream clustering [3]. The biggest challenge facing data stream mining is dealing with “concept drift”, which may make a classifier out-of-date if it is not updated appropriately. A popular approach to learning from data streams is ensemble learning, in which a series of base models is built on consecutive training data chunks and forms an ensemble to classify unknown instances. To deal with the problem of concept drift, the base models can be weighted using their classification accuracies on the newest data chunk [1], under the assumption that new data best reflects the distribution of forthcoming unseen instances.

Although many methods are available for learning from data streams, most are not designed specifically for imbalanced data streams, which are very often observed in practice. We can see class imbalance when considering intrusions in computer network traffic, frauds in credit card transactions, and fakes in online product review streams. Existing learning methods perform poorly on imbalanced data streams because they optimize the overall accuracy of the model, which is dominated by the majority class. In other words, it may be very easy for a learner to ignore the minority class. Therefore, the overall accuracy is not a suitable measure for evaluating learning methods on imbalanced data.

In this paper, we propose a new method for mining imbalanced and concept-drifting data streams using support vector machines (SVM). The proposed method is based on the ensemble learning scheme. However, our method differs in the following points: (1) the AUC evaluation measure is used as the weight of base models because it is more suitable than the overall accuracy for imbalanced data, (2) the minority class is learned


incrementally based on support vectors to improve the representation of this class, and (3) the majority class is under-sampled if needed to further balance the training set. In addition, we use a cross-validation technique to prevent out-of-date instances from being involved in the training of classification models. Experimental results on a synthetic data stream show that our proposed method achieves better performance than other methods under various experimental conditions.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes our proposed method. Section 4 presents experimental results. Section 5 concludes the paper.

2. Related Works. So far, very little work has been conducted on data streams with imbalanced class distributions. Some studies approach the problem of class imbalance for each data chunk in isolation. In other words, each data chunk of a stream is treated as a conventional imbalanced data set and can thus be handled by an existing technique, such as over-sampling by generating synthetic minority class instances [4] or under-sampling by clustering the majority class [5]. One drawback of such methods is that old data is not taken into account for improving the representation of the minority class.

Other methods have been proposed to exploit past data. Gao et al. [6] proposed a method in which all the minority class instances from previous data chunks are incorporated into the current data chunk. This method is problematic because past data may become out-of-date due to concept drifts. Furthermore, training on all past data is time-consuming. Another method, proposed in [7], selects only past minority instances with high rank, where the rank is based on the number of minority nearest neighbors in the current data chunk. However, the actual number of instances selected depends on the desired degree of class imbalance, not on the validity of the past data. Therefore, this method may not be effective for dealing with concept drifts. Spyromitros et al. [8] used two separate sliding windows, one for the minority instances and the other for the majority instances. The window sizes are determined so as to obtain a certain class distribution ratio, such as 40/60. This approach suffers from the same problem as the method in [7].

3. Mining Imbalanced and Concept-Drifting Data Streams Using SVMs.

3.1. Evaluation measure. As discussed in Section 1, the overall accuracy is not a suitable measure for evaluating learning methods on imbalanced data streams. Therefore, we use the AUC measure instead. AUC is the area under the receiver operating characteristic (ROC) curve, as shown in Figure 1. The ROC curve relates the true positive rate to the false positive rate. Compared with the overall accuracy, the AUC does not depend on prior class probabilities. It is also a good evaluation measure for changing environments such as data streams, where the distribution of future data may differ from that of current data [9].

To calculate the AUC, we can vary the decision threshold of an SVM to generate a ROC curve and then apply the trapezoidal method [10]. However, the output f(x) of a standard SVM is not probabilistic, which makes it difficult to calculate the AUC. Thus, we use Platt's method [11] to estimate a probabilistic output for SVMs. This method fits a sigmoid function to the outputs of a standard SVM: P(+|x) = 1/[1 + exp(a f(x) + b)].
The default decision threshold for such a probabilistic-output SVM is 0.5, i.e., an instance x is classified as positive if P (+|x) ≥ 0.5, and negative otherwise. To generate a ROC curve, we vary the decision threshold from 0 to 1 with an incremental step of 0.05.
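As a rough illustration of this calculation, the sketch below trains a probabilistic-output RBF SVM and estimates the AUC by sweeping the decision threshold in steps of 0.05 and applying the trapezoidal rule. This is a minimal sketch, not the authors' implementation: it assumes Python with scikit-learn (whose SVC with probability=True applies Platt scaling on top of LIBSVM) and binary labels in {0, 1}; the function name and variables are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def roc_auc_by_threshold_sweep(model, X_val, y_val, step=0.05):
    """Estimate the AUC by sweeping the decision threshold over the
    probabilistic SVM output P(+|x) and applying the trapezoidal rule."""
    # P(+|x) from Platt scaling (probability=True fits Platt's sigmoid).
    p_pos = model.predict_proba(X_val)[:, list(model.classes_).index(1)]
    pos, neg = (y_val == 1), (y_val != 1)
    tprs, fprs = [], []
    for theta in np.arange(0.0, 1.0 + 1e-9, step):
        pred_pos = p_pos >= theta
        tprs.append(np.mean(pred_pos[pos]) if pos.any() else 0.0)
        fprs.append(np.mean(pred_pos[neg]) if neg.any() else 0.0)
    # Sort the operating points by false positive rate before integrating.
    order = np.argsort(fprs)
    return float(np.trapz(np.array(tprs)[order], np.array(fprs)[order]))

# Example usage (X_train, y_train, X_val, y_val assumed to be NumPy arrays):
# svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
# auc = roc_auc_by_threshold_sweep(svm, X_val, y_val)
```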


Figure 1. The area under the ROC curve

3.2. Weighting of base models. Our proposed method is based on the ensemble learning scheme, in which base models are weighted by their AUC performance on the newest data chunk. In addition, a base model is used only if its performance is better than that of a random guess, which classifies an instance as positive according to a fixed probability p+. The AUC value of such a random guess is obtained as follows: the true positive rate is tpr = (p+ × n+)/n+ = p+ and the false positive rate is fpr = (p+ × n−)/n− = p+, where n+ and n− are the numbers of positive and negative instances, respectively. Thus, we have tpr = fpr, i.e., the ROC curve of the random guess is the upward diagonal of the unit box in Figure 1, which yields an AUC of 0.5 for the random guess. The output of an ensemble E is calculated as

E(x) = Σ_{i | wi > 0.5} wi Pi(+|x) / Σ_{i | wi > 0.5} wi,

where wi = AUCi and Pi(+|x) are the weight and the probabilistic output of the i-th base model, respectively. Let θ ∈ [0, 1] denote the decision threshold. If E(x) ≥ θ, x is assigned to the minority/positive class; otherwise, it is assigned to the majority/negative class. We keep the ensemble at a fixed size. If the ensemble is full, we remove the base model with the lowest weight to make room for a new base model.
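A minimal sketch of this AUC-weighted ensemble is given below. It assumes Python, base models that expose a scikit-learn-style predict_proba, and binary labels in {0, 1}; the class name AUCWeightedEnsemble and its methods are hypothetical, not the authors' code. Only models whose weight exceeds 0.5 take part in the weighted vote, as described above.

```python
import numpy as np

class AUCWeightedEnsemble:
    """AUC-weighted ensemble of probabilistic-output base models (a sketch).

    Base models whose AUC weight is not better than a random guess (0.5)
    are excluded from the weighted vote."""

    def __init__(self, max_size=10):
        self.max_size = max_size
        self.models = []   # base models with a predict_proba-like interface
        self.weights = []  # w_i = AUC_i on the newest data chunk

    def add(self, model, auc_weight):
        # Keep the ensemble at a fixed size: evict the lowest-weight model.
        if len(self.models) >= self.max_size:
            worst = int(np.argmin(self.weights))
            del self.models[worst], self.weights[worst]
        self.models.append(model)
        self.weights.append(auc_weight)

    def predict_proba_positive(self, X):
        # E(x) = sum_{i: w_i > 0.5} w_i P_i(+|x) / sum_{i: w_i > 0.5} w_i
        num, den = np.zeros(len(X)), 0.0
        for model, w in zip(self.models, self.weights):
            if w > 0.5:
                p_pos = model.predict_proba(X)[:, list(model.classes_).index(1)]
                num += w * p_pos
                den += w
        return num / den if den > 0 else np.full(len(X), 0.5)

    def predict(self, X, theta=0.5):
        # Assign x to the minority/positive class when E(x) >= theta.
        return (self.predict_proba_positive(X) >= theta).astype(int)
```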

3.3. Dealing with class imbalance. We propose an incremental learning strategy for the minority class to improve the representation of this class, in which past data is combined with current data to create a better training set. However, two problems of this approach need to be solved: (1) past data may be abundant and require much training time, and (2) past data may be out-of-date due to concept drifts.

To solve the first problem, we select only a small subset containing the most useful data, namely the support vectors of the minority class, from base model Mi−1 built on the last data chunk Di−1, and then combine them with the current data chunk Di for training the new base model Mi. Incremental learning with support vectors was investigated by Syed et al. [12]. However, our method differs in that (1) it considers the support vectors of the minority class only and (2) it deals with concept drifts, which was not done in [12] despite the phrase "handling concept drifts" in the title.

Next, we consider the second problem, in which past data may be out-of-date. Although support vectors are the most useful instances, they may easily become out-of-date because they lie closest to the decision boundary. Among the support vectors of the minority class, those located on the other side of the decision boundary are most prone to drift out of the scope of a new concept. They are identified by Pi−1(+|sv+) < 0.5, where sv+ is a minority class support vector of base model Mi−1. We do not involve such support vectors in incremental learning. For the set SV+i−1 of the remaining support vectors, we check whether they help to improve the classification performance. If they do, they are combined with data chunk Di. Otherwise, we discard them and start a new incremental learning session from data chunk Di.

To find out whether the support vectors help, we apply a k-fold cross-validation technique to data chunk Di (k was set to 5 in our experiments). Data chunk Di is randomly split into k folds. Each fold in turn is taken as the validation set V, and the remaining folds are combined into the training set T. We train two models, one on T and the other on T ∪ SV+i−1, and evaluate them on V. We call the corresponding two cross-validated AUC values AUC1 and AUC2. If AUC1 < AUC2, i.e., the support vectors do help, we combine them with data chunk Di. Otherwise, we use only data chunk Di to train the new base model Mi. The maximum of AUC1 and AUC2 is used as the weight of Mi.

To further balance the training set, we randomly split the majority class into subsets having a size similar to that of the minority class. Then we train a sub-ensemble of SVMs on balanced training subsets, each composed of the minority class and a different majority class subset. The output of this sub-ensemble is simply the unweighted average of the probabilistic outputs of the component SVMs.
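The sketch below illustrates the two ingredients of this subsection under stated assumptions: a 5-fold cross-validated check of whether the retained minority-class support vectors improve the AUC, and the construction of a balanced sub-ensemble by splitting the majority class into minority-sized subsets. It reuses the roc_auc_by_threshold_sweep helper sketched in Section 3.1; the function names and the use of scikit-learn's SVC and StratifiedKFold are assumptions, not the authors' code. Minority/positive instances are labeled 1 and majority/negative instances 0.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def keep_old_support_vectors(X_chunk, y_chunk, X_sv_old, k=5):
    """Cross-validated check: do the retained minority-class support vectors
    from the previous base model improve the AUC on the current chunk?"""
    skf = StratifiedKFold(n_splits=k, shuffle=True)
    y_sv_old = np.ones(len(X_sv_old), dtype=int)  # old SVs are all minority/positive
    auc1, auc2 = [], []
    for train_idx, val_idx in skf.split(X_chunk, y_chunk):
        T_X, T_y = X_chunk[train_idx], y_chunk[train_idx]
        V_X, V_y = X_chunk[val_idx], y_chunk[val_idx]
        m1 = SVC(kernel="rbf", probability=True).fit(T_X, T_y)
        m2 = SVC(kernel="rbf", probability=True).fit(
            np.vstack([T_X, X_sv_old]), np.concatenate([T_y, y_sv_old]))
        auc1.append(roc_auc_by_threshold_sweep(m1, V_X, V_y))
        auc2.append(roc_auc_by_threshold_sweep(m2, V_X, V_y))
    auc1, auc2 = float(np.mean(auc1)), float(np.mean(auc2))
    # Keep the old support vectors only if they improve the cross-validated AUC;
    # the larger value becomes the weight of the new base model.
    return auc1 < auc2, max(auc1, auc2)

def train_balanced_sub_ensemble(X_min, X_maj):
    """Under-sample the majority class into minority-sized subsets and train
    one RBF SVM per balanced subset (the sub-ensemble described above)."""
    rng = np.random.default_rng()
    idx = rng.permutation(len(X_maj))
    n_subsets = max(1, len(X_maj) // max(1, len(X_min)))
    models = []
    for sub in np.array_split(idx, n_subsets):
        X = np.vstack([X_min, X_maj[sub]])
        y = np.concatenate([np.ones(len(X_min), int), np.zeros(len(sub), int)])
        models.append(SVC(kernel="rbf", probability=True).fit(X, y))
    return models  # their P(+|x) outputs are averaged at prediction time
```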
4. Experimental Study. We studied the performance of our method on a synthetic data stream, as described in the following subsection.

4.1. Rotating hyperplane synthetic data. The rotating hyperplane synthetic data has been used in a number of studies of data streams [1, 5, 7]. In this data, a hyperplane rotates around the center of a unit hypercube [0, 1]^d (d is the number of dimensions) in order to simulate a smooth concept drift. The hyperplane is defined by Σ_{i=1}^{d} ai xi = a0, where ai ∈ [0, 1] is the coefficient corresponding to the i-th dimension xi ∈ [0, 1] of a data point, and the value of a0 is set so that the hyperplane contains the center of the unit hypercube, i.e., a0 = (1/2) Σ_{i=1}^{d} ai. The points satisfying Σ_{i=1}^{d} ai xi ≥ a0 are labeled positive; otherwise, they are labeled negative. We rotate the hyperplane by changing the values of the coefficients ai, which are randomly initialized. However, to create an imbalanced data stream, we used two parallel hyperplanes, H1: Σ_{i=1}^{d} ai xi = a01 and H2: Σ_{i=1}^{d} ai xi = a02, instead of a single hyperplane. The points located between them, i.e., satisfying a01 ≤ Σ_{i=1}^{d} ai xi ≤ a02, are labeled minority; otherwise, they are labeled majority. The distance δ between H1 and H2 determines the degree of class imbalance. To make H1 and H2 equally distant from the center of the unit hypercube, we set the coefficients a01 and a02 as follows: a01 = a0 − (1/2) δ ‖a‖ and a02 = a0 + (1/2) δ ‖a‖, where a = (a1, a2, . . . , ad)^T.

We randomly generated data points in [0, 1]^d, where d was set to 10 in all experiments. For every 1000 instances generated, we selected d′ coefficients ai at random and changed their values using the formula ai_new = ai + σi τ, where σi ∈ {−1, +1} is the direction of change, which is randomly initialized, and τ ∈ [0, 1] is the amount of change. The direction of change σi is given a chance of p% (set to 10% in our experiments) to be reversed. The default parameter settings in our experiments are as follows: d′ = 4, τ = 0.1, and δ = 0.05.

Data streams are split into 100 chunks, each of which contains 500 instances. With δ = 0.05, the region bounded by the two hyperplanes H1 and H2 within the unit hypercube has a hypervolume of at least 0.05. In other words, the minority class makes up approximately 5% or more of the data. For example, we generated 100 data streams and observed that, on average, the minority class in a data chunk ranged from 4.1% to 9.6% of the data.
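A sketch of such a generator, following the formulas above, is shown below. It is an illustrative reconstruction rather than the authors' generator: the function name and parameters are hypothetical, NumPy is assumed, and the coefficients are clipped to [0, 1] after each drift step (an assumption, since the paper only states that ai ∈ [0, 1]).

```python
import numpy as np

def rotating_hyperplane_stream(n_chunks=100, chunk_size=500, d=10,
                               d_changing=4, tau=0.1, delta=0.05,
                               p_reverse=0.1, drift_every=1000, seed=0):
    """Generate an imbalanced rotating-hyperplane stream.

    Points between the two parallel hyperplanes H1 and H2 (separated by delta
    and centered on the middle hyperplane) form the minority class (label 1)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.0, 1.0, size=d)          # hyperplane coefficients a_i
    sigma = rng.choice([-1.0, 1.0], size=d)    # per-dimension drift direction
    chunks, generated = [], 0
    for _ in range(n_chunks):
        X = rng.uniform(0.0, 1.0, size=(chunk_size, d))
        a0 = 0.5 * a.sum()                     # middle hyperplane through the center
        half_gap = 0.5 * delta * np.linalg.norm(a)
        s = X @ a
        y = ((s >= a0 - half_gap) & (s <= a0 + half_gap)).astype(int)
        chunks.append((X, y))
        generated += chunk_size
        # Drift: every `drift_every` instances, perturb d_changing random coefficients.
        while generated >= drift_every:
            generated -= drift_every
            idx = rng.choice(d, size=d_changing, replace=False)
            flip = rng.random(d_changing) < p_reverse
            sigma[idx] = np.where(flip, -sigma[idx], sigma[idx])
            a[idx] = np.clip(a[idx] + sigma[idx] * tau, 0.0, 1.0)
    return chunks
```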

[Figure 2 panels: AUC (%) of BAL, IMB, INC and INS versus (a) the number of changing dimensions, (b) the change amount τ, (c) the distance δ between H1 and H2, (d) the chunk size and (e) the ensemble size; panel (f) shows G-mean (%) versus the number of changing dimensions.]

Figure 2. Classification performance when different parameters are varied

4.2. Learning methods. We compared the following four learning methods:
• Balanced data stream learning (BAL): This method weights base models by their accuracy on the newest data chunk. It is similar to the method proposed in [1].
• Imbalanced data stream learning that does not consider past data (IMB): This method weights base models by their AUC on the newest data chunk. Furthermore, it learns a sub-ensemble on each data chunk by splitting the data into balanced training subsets.
• IMB + incremental learning with the minority class support vectors (INC): In addition to the IMB method, this method always combines the minority class support vectors from the last base model with the current data chunk.
• IMB + selective incremental learning with the minority class support vectors (INS): This is our proposed method, as described in Section 3.
The ensemble size in all the learning methods was set to 10. Each base model is a single SVM for BAL or a sub-ensemble of SVMs for IMB, INC, and INS. To train an SVM, we used the LIBSVM software package [13] with the Gaussian RBF kernel. We evaluated the learning methods by alternating the training of a base model on a data chunk with the classification of the next data chunk, as sketched below. We repeated the experiments five times and took the average results.
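A rough sketch of this evaluation loop is given below, reusing the AUCWeightedEnsemble class and the chunk format from the earlier sketches. The build_base_model callable, which returns a trained base model together with its AUC weight (e.g., obtained from the cross-validated check of Section 3.3), is a hypothetical stand-in for whichever of the four methods is being evaluated.

```python
import numpy as np

def auc_from_scores(p_pos, y, step=0.05):
    # Threshold sweep + trapezoidal rule, as in Section 3.1.
    pos, neg = (y == 1), (y != 1)
    pts = []
    for theta in np.arange(0.0, 1.0 + 1e-9, step):
        pred = p_pos >= theta
        pts.append((np.mean(pred[neg]) if neg.any() else 0.0,
                    np.mean(pred[pos]) if pos.any() else 0.0))
    pts.sort()
    fpr, tpr = zip(*pts)
    return float(np.trapz(tpr, fpr))

def evaluate_stream(chunks, build_base_model, max_ensemble_size=10):
    """Alternate training a base model on chunk i with classifying chunk i + 1."""
    ensemble = AUCWeightedEnsemble(max_size=max_ensemble_size)
    aucs = []
    for i in range(len(chunks) - 1):
        X_train, y_train = chunks[i]
        X_test, y_test = chunks[i + 1]
        model, weight = build_base_model(X_train, y_train)  # e.g., an INS base model
        ensemble.add(model, weight)
        aucs.append(auc_from_scores(ensemble.predict_proba_positive(X_test), y_test))
    return float(np.mean(aucs))
```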


4.3. Experimental results. Figure 2 shows the results when the following parameters are varied: the number of changing dimensions d′, the amount of change τ on each dimension, the distance δ between hyperplanes H1 and H2, the chunk size, and the ensemble size. Based on the AUC measure, the classification performance of the methods is in general in the following order: BAL < IMB < INC < INS. The results in Figures 2(a) and 2(b) show that a stronger degree of concept drift was accompanied by a decrease in performance. Meanwhile, Figure 2(c) shows an increase in performance when the degree of class imbalance is reduced (i.e., δ increases). All of these results are consistent with our expectations.

In Figure 2(d), the performance of the BAL and IMB methods increased with the chunk size. This result could be because more training data usually leads to better models. However, it does not apply to the INC and INS methods, which learn from both current and past data. It is possible that the increase of training data in each data chunk reduces the usefulness of past data, especially when the past data contains out-of-date instances. We also studied the effect of the ensemble size on the classification performance, as shown in Figure 2(e). The ensemble size did not appear to have a significant impact on the learning methods.

Due to limited space, we do not report detailed results for other evaluation measures for imbalanced data, such as G-mean and F1 [14], but only summarize them. Our proposed method, INS, achieved the best G-mean and F1 performance among the compared methods. However, an obvious point is that the INC method was significantly inferior to the IMB method. This result differs from that in AUC, where INC seemed better than IMB. The poor performance of INC may be due to the fact that it does not consider whether a support vector from the last base model is out-of-date. For reference, Figure 2(f) shows the G-mean results when varying the number of changing dimensions.

5. Conclusions. This paper proposed a new method using SVMs to mine imbalanced and concept-drifting data streams. Our method combines AUC-weighted ensemble learning with incremental learning on the support vectors of the minority class. Furthermore, out-of-date training instances can be excluded by a cross-validation technique on the newest data chunk. Experimental results on a synthetic data stream showed the effectiveness of the proposed method in comparison with other methods. In future work, we would like to confirm our method on real-world application domains.

REFERENCES

[1] H. Wang, W. Fan, P. S. Yu and J. Han, Mining concept-drifting data streams using ensemble classifiers, Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp.226-235, 2003.
[2] A. Du and B. Fang, Novel approach for web filtering based on user interest focusing degree, Int. J. Innovative Computing, Information and Control, vol.4, no.6, pp.1325-1334, 2008.
[3] J. Ren, C. Hu and R. Ma, HCluWin: An algorithm for clustering heterogeneous data streams over sliding windows, Int. J. Innovative Computing, Information and Control, vol.5, no.8, pp.2171-2179, 2009.
[4] G. Ditzler, R. Polikar and N. Chawla, An incremental learning algorithm for non-stationary environments and class imbalance, Proc. 20th Int. Conf. Pattern Recognition, pp.2997-3000, 2010.
[5] Y. Wang, Y. Zhang and Y. Wang, Mining data streams with skewed distribution by static classifier ensemble, Studies in Computational Intelligence, vol.214, pp.65-71, 2009.
[6] J. Gao, B. Ding, W. Fan, J. Han and P. S. Yu, Classifying data streams with skewed class distributions and concept drifts, IEEE Internet Computing, vol.12, no.6, pp.37-49, 2008.
[7] S. Chen and H. He, Towards incremental learning of nonstationary imbalanced data stream: A multiple selectively recursive approach, Evolving Systems, vol.2, no.1, pp.35-50, 2011.
[8] E. Spyromitros, M. Spiliopoulou, G. Tsoumakas and I. Vlahavas, Dealing with concept drift and class imbalance in multi-label stream classification, Proc. 22nd Int. Joint Conf. Artificial Intelligence, pp.1583-1588, 2011.
[9] F. Provost and T. Fawcett, Robust classification for imprecise environments, Machine Learning, vol.42, no.3, pp.203-231, 2001.
[10] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, vol.30, no.7, pp.1145-1159, 1997.
[11] J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Advances in Large Margin Classifiers, MIT Press, pp.61-74, 1999.
[12] N. A. Syed, H. Liu and K. K. Sung, Handling concept drifts in incremental learning with support vector machines, Proc. 5th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp.317-321, 1999.
[13] C. C. Chang and C. J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intelligent Systems and Technology, vol.2, no.3, pp.27:1-27:27, 2011.
[14] H. He and E. A. Garcia, Learning from imbalanced data, IEEE Trans. Knowledge and Data Engineering, vol.21, no.9, pp.1263-1284, 2009.
