
SVStream: A Support Vector Based Algorithm for Clustering Data Streams

Chang-Dong Wang, Student Member, IEEE, Jian-Huang Lai, Member, IEEE, Dong Huang, and Wei-Shi Zheng, Member, IEEE

Abstract—In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which is based on support vector domain description and support vector clustering. In the proposed algorithm, the data elements of a stream are mapped into a kernel space, and the support vectors are used as the summary information of the historical elements to construct cluster boundaries of arbitrary shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained, each describing the corresponding data domain presented in the data stream. By allowing for bounded support vectors (BSVs), the proposed SVStream algorithm is capable of identifying overlapping clusters. A BSV decaying mechanism is designed to automatically detect and remove outliers (noise). We perform experiments over synthetic and real data streams, with the overlapping, evolving and noise situations taken into consideration. Comparison results with state-of-the-art data stream clustering methods demonstrate the effectiveness and efficiency of the proposed method.

Index Terms—Data stream clustering, support vector, clusters of arbitrary shape, overlapping, evolving, noise.

• The authors are with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P. R. China. E-mail: [email protected], [email protected], [email protected], [email protected].

1 INTRODUCTION

In the past few years, a huge amount of streaming data, such as network flows, phone records, sensor data and web click streams, has been generated due to the progress in hardware technology. Analyzing these data has been a hot research topic [1], [2], [3], [4], [5]. The goal of data stream analysis is to make decisions based on the summary information gathered over the past observed data elements. Common techniques include classification [6], [7], clustering [1], [8], model ensemble [9], [10], multiple or distributed data stream clustering [11], [12], change diagnosis [13], [14], query processing [15], etc. Among them, clustering is one of the most effective means of summarizing data streams and building a model for visualization and analysis [1], [8], [16], [17], [18]. For example, clustering sales log streams helps advertisers discover market segments, and clustering telephone-call records exposes fraudulent telephone use in real time. In this paper, we propose an effective and efficient data stream clustering algorithm, which is based on support vector domain description [19] and support vector clustering [20].

1.1 Related Work

One of the earliest and best known clustering algorithms for very large datasets is the BIRCH approach [21], [22], which relies on the CF-tree [21] to perform hierarchical clustering. However, it is not designed for clustering data streams and cannot address the concept drift problem. The first well-known algorithm aiming at performing clustering over entire


data streams is the STREAM algorithm proposed by Guha et al. [16], [23]. The STREAM algorithm extends the classical k-median algorithm in a divide-and-conquer fashion to cluster data streams in a single pass. Such methods simply view data stream clustering as a variant of the one-pass clustering algorithm, and also cannot deal with concept drift. Babcock et al. [24] proposed to extend the STREAM algorithm from one-pass clustering to the sliding window model, where data elements arrive in a stream and only the last N elements are considered relevant at any moment. However, as pointed out in [18], [25], this model only partially addresses the evolving characteristics of stream data.

The CluStream framework proposed in [1] is effective in handling evolving data streams. It divides the clustering process into an online component, which periodically uses micro-clusters to store detailed summary statistics, and an offline component, which uses these summary statistics in conjunction with other user input to produce clusters. For high-dimensional data stream clustering, Aggarwal et al. [17], [26] proposed HPStream, which reduces the dimensionality of the data stream via data projection before clustering. For automatically estimating the cluster number within the data stream, Zhang et al. [27] extended the affinity propagation algorithm [28] to a data stream clustering version called StrAP.

One common limitation of the above algorithms is that they cannot find clusters of arbitrary shape, since most of them are based on k-means-like algorithms. However, nonlinearly separable clusters of arbitrary shape arise in many applications [8], [18], [25], [29], [30], [31]. To discover clusters of arbitrary shape, Cao et al. [8] proposed the DenStream algorithm, which extends DBSCAN [29] by introducing micro-clusters into the density-based connectivity search. Independently, Chen and Tu [18], [25] also proposed a density-based method termed D-Stream. Rather than using micro-clusters, D-Stream partitions



the data space into grids and maps each new data point into the corresponding grid to store density information, which is further clustered based on the density. RepStream [32] is another algorithm capable of identifying clusters of arbitrary shape. It is a sparse-graph-based approach that employs representative cluster points to incrementally process incoming data and uses a graph-based description to model the spatio-temporal relationships. Compared with the linear clustering methods, these nonlinear approaches can identify clusters of arbitrary shape; however, they are computationally more expensive, requiring more memory to store nonlinear cluster structures and more time to update them.

1.2 The Proposed Work

In this paper, inspired by support vector domain description (SVDD) [19] and support vector clustering (SVC) [20], we propose a novel data stream clustering algorithm termed SVStream (Support Vector based Stream clustering). According to the SVDD theory [19], support vectors (SVs) can provide flexible and accurate data descriptions by mapping the data into a kernel space. These SVs are used to construct cluster boundaries of arbitrary shape in SVC [20]. Compared to the original SVC algorithm, one major challenge of extending SVC to data stream clustering is how to maintain SVs so as to adapt to dramatic and gradual changes in streaming. To this end, we propose a multi-sphere representation, where multiple spheres are dynamically maintained in a sphere set. Each sphere describes the corresponding data domain presented in a data stream. When a new data chunk arrives, if a dramatic change occurs, a new sphere is created; otherwise, only the existing spheres are updated to take the new chunk into account. The data elements of this new chunk are assigned cluster labels according to the cluster boundaries constructed by the sphere set.

Another challenge of applying SVC to data stream clustering lies in properly discarding bounded support vectors (BSVs). As new data arrive, more and more BSVs are generated, causing a dramatic demand for storage and computation; however, they cannot be directly discarded because some of them may become new SVs in later steps. To tackle this problem, a BSV decaying mechanism is proposed, which assigns each BSV a BSV age, through which appropriate BSVs are discarded as outliers.

Compared with the existing data stream clustering algorithms, which fail to identify arbitrary-shaped clusters, cannot adapt to dynamic data streams, or require expensive computational resources to achieve these goals, the proposed SVStream algorithm has the following advantages:
• SVStream accurately discovers clusters of arbitrary shape by constructing cluster boundaries using SVs.
• It can adapt to dramatic and gradual changes in an evolving stream by dynamically maintaining multiple spheres.
• It can discover overlapping clusters by allowing for BSVs. It is also effective in detecting and removing outliers (noise) via the BSV decaying mechanism.
• It is very efficient and requires very little memory space due to the compact representation of the summary information by multiple spheres.


The remainder of this paper is organized as follows. In Section 2, we briefly review the background, including SVDD and SVC, which form the basis of our approach. Section 3 describes the proposed SVStream algorithm in detail. The experimental results are reported in Section 4. We conclude this paper in Section 5.

2 BACKGROUND

2.1 Support Vector Domain Description

In domain description, the task is to give a description of a set of objects, which should cover the positive objects and reject the negative ones in the object space [19]. Support vector domain description (SVDD) is a sphere-shaped data description. By using a nonlinear transformation, SVDD can obtain a very flexible and accurate data description relying on only a small number of SVs. Let $X = \{x_i \in \mathbb{R}^d \mid i = 1, \dots, N\}$ be a given data set of N points. Using the nonlinear transformation φ from the input space to a Gaussian kernel feature space¹, we look for the smallest sphere that encloses most of the data points in the feature space, described by its center μ and radius R. That is, we minimize

$$F(R, \mu, \xi_i) = R^2 + C \sum_{i=1}^{N} \xi_i, \qquad (1)$$

under the constraints $\|\phi(x_i) - \mu\|^2 \le R^2 + \xi_i$, $\forall i = 1, \dots, N$, where the parameter C gives the trade-off between the sphere volume and the accuracy of the data description, and $\xi_i \ge 0$ are slack variables. By introducing the Lagrangian, we have

$$L(R, \mu, \xi_i, \beta_i, \alpha_i) = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \beta_i \left( R^2 + \xi_i - \|\phi(x_i) - \mu\|^2 \right) - \sum_{i=1}^{N} \alpha_i \xi_i, \qquad (2)$$

where $\beta_i \ge 0$ and $\alpha_i \ge 0$ are Lagrange multipliers. Setting the derivatives of L w.r.t. R, μ and $\xi_i$ to 0 respectively leads to

$$\sum_{i=1}^{N} \beta_i = 1, \qquad \mu = \sum_{i=1}^{N} \beta_i \phi(x_i), \qquad \beta_i = C - \alpha_i. \qquad (3)$$

The KKT complementarity conditions result in

$$\alpha_i \xi_i = 0, \qquad \beta_i \left( R^2 + \xi_i - \|\phi(x_i) - \mu\|^2 \right) = 0. \qquad (4)$$

By eliminating the variables R, μ, $\xi_i$ and $\alpha_i$, the Lagrangian can be turned into the Wolfe dual form

$$\max_{\beta_i} W = \sum_{i=1}^{N} \beta_i K(x_i, x_i) - \sum_{i,j=1}^{N} \beta_i \beta_j K(x_i, x_j)$$

$$\text{subject to} \quad \sum_{i=1}^{N} \beta_i = 1, \quad 0 \le \beta_i \le C, \; \forall i = 1, \dots, N, \qquad (5)$$

1. Although any kernel function works here, as discussed in [19], [20], Gaussian kernels provide tighter contour representations. Therefore, Gaussian kernels are used.


where the dot products $\phi(x_i) \cdot \phi(x_j)$ are replaced by an appropriate Gaussian kernel $K(x_i, x_j) = \exp(-q\|x_i - x_j\|^2)$ with the width parameter q. In [33], a dynamic dissimilarity measure was proposed to eliminate the sensitivity to the choice of this kernel parameter. According to the values of the Lagrange multipliers $\beta_i$, $i = 1, \dots, N$, the data points are classified into three types:
• Inner points (IPs): $\beta_i = 0$, which lie either inside or on the sphere surface.
• Support vectors (SVs): $0 < \beta_i < C$, which lie on the sphere surface.
• Bounded support vectors (BSVs): $\beta_i = C$, which lie either outside or on the sphere surface.
It is obvious that setting $C \ge 1$ results in no BSVs. The kernel radius function, defined by the Euclidean distance of $\phi(x)$ from μ, i.e., $\|\phi(x) - \mu\|$, is given by

$$R(x) = \sqrt{1 - 2\sum_{i=1}^{N} \beta_i K(x_i, x) + \sum_{i,j=1}^{N} \beta_i \beta_j K(x_i, x_j)}. \qquad (6)$$

The radius of the sphere is defined as²

$$R = \max\{R(x_i) \mid x_i \text{ is a SV, i.e., } 0 < \beta_i < C\}. \qquad (7)$$

2. Ideally, according to (3) and (4), all the SVs should have the same $R(x_i)$. However, due to numerical issues, they may differ slightly. A practical strategy is to use their maximum value as the radius.

However, in an extreme case where the parameter C is set so small that all data points are BSVs, there is no SV. For instance, if C = 0.1 and we have N = 10 data points, then since $\sum_{i=1}^{N} \beta_i$ must equal 1, all $\beta_i$, $i = 1, \dots, N$, must equal C (which means all the data points are BSVs), so the above definition of the sphere radius fails; it is undefined in this case. To handle this problem, we provide an alternative definition of the sphere radius as follows:

$$R = \begin{cases} \max\{R(x_i) \mid 0 < \beta_i < C\} & \text{if } \{x_i \mid 0 < \beta_i < C\} \neq \emptyset \\ \min\{R(x_i)\} & \text{otherwise.} \end{cases} \qquad (8)$$

The reason for this definition is that although there is no data point satisfying $0 < \beta_i < C$ (since $\beta_i = C$ for all i), according to the definition of BSVs, some BSVs may lie on the sphere surface, giving a proper definition of the sphere radius. The contours enclosing most of the points in the data space are defined by the set $\{x \mid R(x) = R\}$.

2.2 Cluster Labeling

To achieve cluster labeling, an adjacency matrix $A = [A_{ij}]_{N \times N}$ is computed as

$$A_{ij} = \begin{cases} 1 & \text{if } \forall y \text{ on the line segment connecting } x_i \text{ and } x_j, \; R(y) \le R, \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$

Clusters are defined as the connected components of the graph induced by A. Checking the line segment is implemented by sampling a number of points (usually 10 points are used). This is called the complete graph (CG) labeling method [20]. BSVs are unclassified by this procedure since their feature space images always lie outside the sphere. One may either leave them unclassified, or assign them to the closest cluster, as we do in our method. To speed up cluster labeling, several cluster labeling methods have been developed [34], [35], [36]. For instance, in [34], three cluster labeling methods were introduced, including the Delaunay diagram (DD), minimum spanning tree (MST) and K-nearest neighbor (KNN). In [35], a SEP-based complete graph (SEP-CG) labeling method was developed based on some invariant topological properties of the kernel radius function. Another similar cluster labeling method, termed ESVC, is based on the equilibrium vector [36]. In our approach, the CG cluster labeling method is used, since it produces the most accurate clustering results.
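To make the preceding description concrete, the following is a minimal sketch of SVDD training via the Wolfe dual (5) and of CG labeling via (9). It assumes NumPy and SciPy are available; the SLSQP-based solver and all function names are illustrative choices, not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def train_svdd(X, q, C):
    """Solve the Wolfe dual (5) for a Gaussian-kernel SVDD sphere."""
    N = len(X)
    K = np.exp(-q * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    # Maximize sum_i beta_i K_ii - sum_ij beta_i beta_j K_ij (minimize the negative)
    obj = lambda b: -(b @ np.diag(K) - b @ K @ b)
    res = minimize(obj, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    beta = res.x
    mu_sq = beta @ K @ beta                           # ||mu||^2
    def R(x):                                         # kernel radius function (6)
        k = np.exp(-q * np.sum((X - x) ** 2, axis=1))
        return np.sqrt(max(0.0, 1.0 - 2.0 * (beta @ k) + mu_sq))
    eps = 1e-6
    sv = np.where((beta > eps) & (beta < C - eps))[0]  # SVs: 0 < beta_i < C
    bsv = np.where(beta >= C - eps)[0]                 # BSVs: beta_i = C
    # Sphere radius, Eq. (7), with the fallback (8) when no SV exists
    radius = max(R(X[i]) for i in sv) if len(sv) else min(R(X[i]) for i in bsv)
    return beta, R, radius, sv, bsv

def cg_labels(X, R, radius, n_samples=10):
    """Complete-graph labeling (9): x_i and x_j are adjacent when the whole
    segment between them stays inside the sphere, i.e., R(y) <= R."""
    N, ts = len(X), np.linspace(0.0, 1.0, n_samples)
    adj = np.zeros((N, N), dtype=bool)
    for i in range(N):
        for j in range(i + 1, N):
            adj[i, j] = adj[j, i] = all(
                R(X[i] + t * (X[j] - X[i])) <= radius for t in ts)
    labels = -np.ones(N, dtype=int)                   # connected components
    cur = 0
    for s in range(N):
        if labels[s] < 0:
            stack = [s]
            while stack:
                v = stack.pop()
                if labels[v] < 0:
                    labels[v] = cur
                    stack.extend(np.where(adj[v])[0].tolist())
            cur += 1
    return labels
```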

3 THE PROPOSED SVSTREAM ALGORITHM

In this paper, we consider the problem where the data elements of a stream arrive over time in chunks [37], [38], [39]. Without loss of generality, the chunks are of equal size, each containing M data elements. Let $X^t$ denote the t-th chunk, $X^t \triangleq \{x^t_1, \dots, x^t_M\}$, where $x^t_i$ is the i-th data element of $X^t$. The data stream has the form

$$x^1_1, \dots, x^1_M, \; \dots, \; x^{t-1}_1, \dots, x^{t-1}_M, \; x^t_1, \dots, x^t_M, \; \dots \qquad (10)$$
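As a small illustration of the chunked stream model (10), the helper below groups an element-wise stream into fixed-size chunks $X^t$ of M elements; the function name is ours, not the paper's.

```python
def chunks(stream, M):
    """Yield consecutive chunks X^t of M data elements from a stream."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == M:
            yield buf
            buf = []
```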

3.1 Multi-Sphere Definition

To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained in a sphere set SS. We begin by defining the concept of the sphere structure S.

Definition 1. Given a set of M data elements, the Gaussian kernel parameter q and the trade-off parameter C, the sphere structure S is defined as

$$S = \{SV, BSV, \|\mu\|^2, R_{SV}, R_{BSV}\}. \qquad (11)$$

The definition of each entry is as follows:
• SV is a support vector set

$$SV = \{(x_i, \beta_i, L_i, T_i) \mid 0 < \beta_i < C, \; i = 1, \dots, M\}. \qquad (12)$$

• BSV is a bounded support vector set

$$BSV = \{(x_i, \beta_i, L_i, T_i) \mid \beta_i = C, \; i = 1, \dots, M\}. \qquad (13)$$

• $\|\mu\|^2$ is the squared length of the sphere center μ

$$\|\mu\|^2 = \sum_{(x_i, \beta_i, \dots), (x_j, \beta_j, \dots) \in SV \cup BSV} \beta_i \beta_j K(x_i, x_j). \qquad (14)$$

• $R_{SV}$ is the radius of the sphere

$$R_{SV} = \begin{cases} \max\{R(x_i) \mid (x_i, \dots) \in SV\} & \text{if } SV \neq \emptyset \\ \min\{R(x_i) \mid (x_i, \dots) \in BSV\} & \text{otherwise.} \end{cases} \qquad (15)$$

• $R_{BSV}$ is the maximum Euclidean distance of the bounded support vectors from the sphere center μ

$$R_{BSV} = \max\{R(x_i) \mid (x_i, \dots) \in BSV\}. \qquad (16)$$

For convenience, $R_{SV}$ and $R_{BSV}$ are termed S-radius and B-radius respectively, with their corresponding sphere surfaces called S-surface and B-surface.


In the above definition, a (d + 3)-tuple $(x_i, \beta_i, L_i, T_i)$ is used to represent a SV or BSV, where $x_i$ is a d-dimensional vector, $\beta_i$ is the Lagrange multiplier, $L_i$ is the cluster label, and $T_i$ is the BSV age, defined as the accumulated number of times $x_i$ has been a BSV. When some entries of this tuple are unknown, or when the computation is independent of them, they are denoted as "$\dots$", as in equations (14), (15) and (16).

Definition 2. The multi-sphere set SS is defined as a set consisting of multiple spheres, that is, $SS = \{S^1, \dots, S^{|SS|}\}$, where the superscript denotes the index of a sphere.

In the clustering process, if two spheres S and $\bar{S}$ are too close to each other, they should be merged. The problem is how to measure the "closeness" between two spheres. According to Fig. 1(a), a natural measurement of the similarity between two spheres $S = \{SV, BSV, \|\mu\|^2, R_{SV}, R_{BSV}\}$ and $\bar{S} = \{\overline{SV}, \overline{BSV}, \|\bar{\mu}\|^2, \bar{R}_{SV}, \bar{R}_{BSV}\}$ should take into account both their sphere center distance and their S-radiuses. The sphere distance $SDist(S, \bar{S})$ is defined as the ratio of the sphere center distance $\|\mu - \bar{\mu}\|$ to the sum of the S-radiuses $R_{SV} + \bar{R}_{SV}$.

Definition 3. The sphere distance between two spheres S and $\bar{S}$ is defined as $SDist(S, \bar{S}) = \|\mu - \bar{\mu}\| / (R_{SV} + \bar{R}_{SV})$, where $\|\mu - \bar{\mu}\|$ is computed by

$$\|\mu - \bar{\mu}\|^2 = \|\mu\|^2 + \|\bar{\mu}\|^2 - 2 \sum_{\substack{(x_i, \beta_i, \dots) \in SV \cup BSV \\ (\bar{x}_j, \bar{\beta}_j, \dots) \in \overline{SV} \cup \overline{BSV}}} \beta_i \bar{\beta}_j K(x_i, \bar{x}_j). \qquad (17)$$
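The following is a minimal sketch of the sphere structure of Definition 1 and of the sphere distance of Definition 3 via the kernel expansion (17); all field and function names are illustrative.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sphere:
    sv_x: np.ndarray       # SV coordinates, shape (n_sv, d)
    sv_beta: np.ndarray    # Lagrange multipliers of the SVs
    bsv_x: np.ndarray      # BSV coordinates, shape (n_bsv, d)
    bsv_beta: np.ndarray   # Lagrange multipliers of the BSVs
    mu_sq: float           # ||mu||^2, Eq. (14)
    r_sv: float            # S-radius, Eq. (15)
    r_bsv: float           # B-radius, Eq. (16)

    def points(self):
        return np.vstack([self.sv_x, self.bsv_x])

    def betas(self):
        return np.concatenate([self.sv_beta, self.bsv_beta])

def sdist(s1, s2, q):
    """Sphere distance (Definition 3): ||mu1 - mu2|| / (R_SV1 + R_SV2)."""
    X1, b1, X2, b2 = s1.points(), s1.betas(), s2.points(), s2.betas()
    K12 = np.exp(-q * np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=2))
    # Eq. (17): ||mu1 - mu2||^2 = ||mu1||^2 + ||mu2||^2 - 2 sum beta_i beta_j K
    center_sq = max(0.0, s1.mu_sq + s2.mu_sq - 2.0 * (b1 @ K12 @ b2))
    return np.sqrt(center_sq) / (s1.r_sv + s2.r_sv)
```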

3.2 Multi-Sphere Evolution

3.2.1 Overall Procedure

For the first chunk $X^1$, the first sphere S is generated by Definition 1 and the data elements of $X^1$ are directly assigned cluster labels by the CG cluster labeling method [20]. When a new data chunk $X^t$ arrives, it is necessary to determine whether a new sphere should be created, according to the number of data elements of $X^t$ lying outside all the existing B-surfaces. A data element lies outside the B-surface of S if its distance from the sphere center is larger than the B-radius. Let $OX^t$ denote the subset of $X^t$ consisting of data elements lying outside all the existing B-surfaces. That is,

$$OX^t = \{x^t_i \in X^t \mid \forall l = 1, \dots, |SS|, \; R^l(x^t_i) > R^l_{BSV}\}. \qquad (18)$$

According to the ratio of $|OX^t|$ to $|X^t|$, there are two cases in updating the sphere set SS:

I. If the ratio of $|OX^t|$ to $|X^t|$ is larger than the outer-sphere threshold δ, the existing spheres are incapable of describing the major part of this new chunk. A new sphere is created for $OX^t$, and each data element of $X^t \setminus OX^t$ is incorporated into the nearest data domain by updating the corresponding sphere. To create a new sphere for $OX^t$, the Lagrange multipliers are obtained by solving (5). The sphere structure S is obtained by Definition 1. The BSV age of each SV and BSV is set to 0 and 1 respectively. The cluster labels of SVs and BSVs are assigned by the CG cluster labeling [20], and these labels start from the current maximum label plus one³. For each of the remaining data elements, i.e., $\forall x^t_i \in X^t \setminus OX^t$, the closest sphere $S^m$ is found by $m = \arg\min_{l=1,\dots,|SS|} R^l(x^t_i)/R^l_{BSV}$. Let $I^{t,m}$, $\forall m = 1, \dots, |SS|$, denote the subset of $X^t \setminus OX^t$ consisting of the data elements whose closest sphere is $S^m$,

$$I^{t,m} = \left\{x^t_i \in X^t \setminus OX^t \;\middle|\; m = \arg\min_{l=1,\dots,|SS|} \frac{R^l(x^t_i)}{R^l_{BSV}}\right\}, \qquad X^t \setminus OX^t = I^{t,1} \cup \dots \cup I^{t,|SS|}. \qquad (19)$$

The data elements of $I^{t,m}$ are used to update $S^m$, as will be discussed in subsubsection 3.2.2.

II. If the ratio of $|OX^t|$ to $|X^t|$ is not larger than the outer-sphere threshold δ, there is no need to create a new sphere. Instead, each data element is directly incorporated into the nearest data domain. Let $\tilde{I}^{t,m}$, $\forall m = 1, \dots, |SS|$, denote the subset of $X^t$ consisting of the data elements whose closest sphere is $S^m$,

$$\tilde{I}^{t,m} = \left\{x^t_i \in X^t \;\middle|\; m = \arg\min_{l=1,\dots,|SS|} \frac{R^l(x^t_i)}{R^l_{BSV}}\right\}, \qquad X^t = \tilde{I}^{t,1} \cup \dots \cup \tilde{I}^{t,|SS|}. \qquad (20)$$

The data elements of $\tilde{I}^{t,m}$ are used to update $S^m$.

After updating the sphere set SS to take $X^t$ into account, some spheres may become so close to each other that the data domains described by them strongly overlap. For any two spheres S and $\bar{S}$, if $SDist(S, \bar{S})$ is less than the sphere merging threshold η, S and $\bar{S}$ are merged into a new sphere. We will describe the sphere merging process in subsubsection 3.2.3. In this way, we have successfully updated the summary information represented by the multi-sphere set SS, and the cluster structure has also been formed by constructing the cluster boundaries according to the SVs. As a result, each data element of this new chunk $X^t$ can be directly assigned to the closest cluster.

In the above sphere updating and merging processes, some BSVs may become SVs, and new BSVs may be generated. Keeping all the BSVs forever after they appear may cause a dramatic demand for storage and computation. On the other hand, directly deleting a BSV once it is detected when creating a new sphere may result in the loss of some potential SVs in later steps. Therefore, a rational choice is to eliminate an old BSV if its BSV age exceeds the BSV decaying threshold ζ. That is, for each sphere S in SS, $\forall (x_i, \beta_i, L_i, T_i) \in BSV$, if $T_i \ge \zeta$, the corresponding $(x_i, \beta_i, L_i, T_i)$ is deleted from BSV and $x_i$ is labeled as an outlier by re-assigning its cluster label to 0. Two advantages of this BSV decaying mechanism are that: 1) eliminating old BSVs helps generate more accurate cluster boundaries and detect the trend for a cluster to shrink or split, which is often the case in data stream clustering; 2) it provides a natural way of detecting and handling outliers.

3. For example, if the current maximum label is 10 (the SVStream algorithm has detected 10 clusters so far) and the new sphere S is partitioned into 3 clusters, the first of these 3 clusters is labeled as cluster 11, the second as cluster 12 and the third as cluster 13.
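A minimal sketch of the per-chunk routing of Section 3.2.1 and of the BSV decaying rule is given below, reusing the Sphere sketch above; R_funcs[l] stands for the kernel radius function $R^l(x)$ of the l-th sphere, and all names are illustrative.

```python
def route_chunk(chunk, spheres, R_funcs, delta=0.6):
    """Return (new_sphere_pts, {sphere index: pts}) following cases I/II."""
    outside = [all(R_funcs[l](x) > spheres[l].r_bsv
                   for l in range(len(spheres))) for x in chunk]   # Eq. (18)
    out_pts = [x for x, o in zip(chunk, outside) if o]
    if len(out_pts) / len(chunk) > delta:     # case I: dramatic change
        rest, new_pts = [x for x, o in zip(chunk, outside) if not o], out_pts
    else:                                     # case II: gradual change
        rest, new_pts = list(chunk), []
    assign = {m: [] for m in range(len(spheres))}
    for x in rest:                            # Eqs. (19)/(20): nearest domain
        m = min(range(len(spheres)),
                key=lambda l: R_funcs[l](x) / spheres[l].r_bsv)
        assign[m].append(x)
    return new_pts, assign

def decay_bsvs(bsv_tuples, zeta=2):
    """Split BSV tuples (x_i, beta_i, L_i, T_i) into kept ones and outliers."""
    kept = [t for t in bsv_tuples if t[3] < zeta]
    outliers = [t for t in bsv_tuples if t[3] >= zeta]   # re-labeled as 0
    return kept, outliers
```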



Fig. 1. Illustration of SVStream. (a) Sphere distance: the sphere distance between S and $\bar{S}$. (b) Sphere update: the old and updated S-surfaces are plotted in solid and long dash respectively; the old sphere center and S-radius are marked as μ and $R_{SV}$, and the updated ones as $\mu'$ and $R'_{SV}$. (c) Sphere merging: the S-surfaces of S and $\bar{S}$ are plotted in dash and solid respectively, and the S-surface of the resulting $S'$ is plotted in long dash.

3.2.2 Sphere Update

The goal of sphere update is to update the sphere S such that the resulting sphere describes the data domain comprising the original one and the one formed by the data elements of $\tilde{X}$. The set $\tilde{X}$ is first partitioned into two subsets I and O, such that the subset I consists of the data elements lying inside the S-surface of S, and the subset O is the complementary set consisting of the data elements lying on or outside the S-surface. That is,

$$I = \{x_i \in \tilde{X} \mid R(x_i) < R_{SV}\}, \qquad O = \tilde{X} \setminus I. \qquad (21)$$

According to the SVDD theory [19], inner points have no influence over the resulting SVs or BSVs. Therefore, the data elements of I do not contribute to the update of the sphere S; only the subset O needs to be considered. Since the data elements of O lie on or outside the S-surface of S, the S-surface of the updated sphere taking O into account must be larger than its original counterpart, as shown in Fig. 1(b). Consequently, it is impossible for the old inner points enclosed by the old S-surface to become SVs or BSVs. This implies that the SVs and BSVs of the updated sphere $S'$ come only from the data elements of SV, BSV and O. For simplicity, we collect the overall data elements of SV, BSV and O in a temporary data set T. That is, $T = \{x_i \mid (x_i, \dots) \in SV \cup BSV\} \cup O = \{x_i \mid i = 1, \dots, |T|\}$, where $|T| = |SV| + |BSV| + |O|$ is the number of data elements of T. The Lagrange multipliers are obtained for T by solving (5). The sphere structure $S'$ is generated by Definition 1, with the BSV ages and the cluster labels determined as follows. $\forall (x_i, \beta_i, L_i, T_i) \in SV'$, if $x_i \in O$, i.e., $x_i$ is a new data element, $T_i$ is set to 0; otherwise, $T_i$ is kept unchanged. $\forall (x_i, \beta_i, L_i, T_i) \in BSV'$, if $x_i \in O$, $T_i$ is set to 1; otherwise, $T_i$ is increased by one. To assign cluster labels to the data elements of $SV'$ and $BSV'$, an adjacency matrix $A'$ is constructed for the data elements lying inside or on the S-surface of $S'$ (i.e., $\beta_i < C$) based on the kernel radius function $R'(x)$ and the S-radius $R'_{SV}$ via (9). The connected components are obtained. For each connected component, there are four cases:

I. If all data elements of this component are from the same old cluster, implying that this old cluster (or part of it) is kept unchanged, the data elements of this component are assigned the same cluster label as the old cluster.
II. If some data elements of this component are from the same old cluster and the remaining are from O, implying that this old cluster (or part of it) is enlarged, the data elements of this component are assigned the same cluster label as the old cluster.
III. If all data elements of this component are from O, implying that a new cluster emerges, the data elements of this component are assigned a new cluster label, i.e., the current maximum label plus one.
IV. If this component contains data elements from different old clusters, implying that some old clusters (or parts of them) are merged to generate a new cluster, the data elements of this component are assigned a new cluster label, i.e., the current maximum label plus one.

For the overall connected components, another possible phenomenon is that some old cluster may be split into two or more clusters. However, from case I and case II, these split clusters are assigned the same cluster label as the old one. Therefore, it is necessary to check the uniqueness of the above assigned cluster labels of these connected components. If there exists more than one component with the same cluster label, which implies that the old cluster with this label has been split into more than one new cluster, each of these components should be re-assigned a distinct new cluster label, i.e., the current maximum label plus one. In this way, the data elements of $SV'$ are assigned meaningful cluster labels. The data elements of $BSV'$ are assigned to the closest clusters.
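Following Eq. (21) and the observation that inner points cannot become SVs or BSVs, the sphere-update step only needs to retrain on $T = SV \cup BSV \cup O$. A minimal sketch, assuming the Sphere and train_svdd helpers sketched above:

```python
import numpy as np

def update_sphere_points(sphere, R, new_pts):
    """Eq. (21): discard I = {x : R(x) < R_SV}; build T = SV + BSV + O."""
    outer = [x for x in new_pts if R(x) >= sphere.r_sv]   # O; I is dropped
    old = np.vstack([sphere.sv_x, sphere.bsv_x])
    T = np.vstack([old, np.asarray(outer)]) if outer else old
    # Mark which rows of T are new, so that after re-solving (5) on T the
    # BSV ages T_i can be set to 0/1 for new points and kept/incremented
    # for old ones, as described above.
    is_new = np.array([False] * len(old) + [True] * len(outer))
    return T, is_new
```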

3.2.3 Sphere Merging

The goal of sphere merging is to merge S with $\bar{S}$, such that the resulting sphere gives an accurate data description of the union of the two data domains described by S and $\bar{S}$ respectively. Figure 1(c) illustrates the merging of two spheres. Similar to the sphere update, we collect the overall data elements of SV, BSV and $\overline{SV}$, $\overline{BSV}$ in a temporary


data set T. That is, $T = \{x_i \mid (x_i, \dots) \in SV \cup BSV\} \cup \{x_i \mid (x_i, \dots) \in \overline{SV} \cup \overline{BSV}\} = \{x_i \mid i = 1, \dots, |T|\}$, where $|T| = |SV| + |BSV| + |\overline{SV}| + |\overline{BSV}|$. The Lagrange multipliers are obtained for T by solving (5). The sphere structure $S'$ is generated by Definition 1, with the BSV ages and the cluster labels determined as follows. $\forall (x_i, \beta_i, L_i, T_i) \in SV'$, $T_i$ is kept unchanged. $\forall (x_i, \beta_i, L_i, T_i) \in BSV'$, $T_i$ is increased by one. To assign cluster labels to the data elements of $SV'$ and $BSV'$, an adjacency matrix $A'$ is constructed as in the sphere update. However, for each connected component, only case I and case IV of the sphere update will occur. Moreover, we should check the uniqueness of the assigned cluster labels and re-assign cluster labels when necessary. In this way, the data elements of $SV'$ are assigned meaningful cluster labels. The data elements of $BSV'$ are assigned to the closest clusters.

3.3 Multi-Sphere Based Algorithm

Algorithm 1 The SVStream Algorithm
1: Input: DS: data stream, M: chunk size, q: Gaussian kernel parameter, C: trade-off parameter, δ: outer-sphere threshold, η: sphere merging threshold, ζ: BSV decaying threshold.
2: Initialize an empty sphere set SS = ∅, t = 1.
3: Obtain X^1 of M elements from DS.
4: Create a sphere S for X^1 by Definition 1.
5: Assign cluster labels to the data elements of X^1 by the CG labeling method.
6: Add S to SS, i.e., SS ← {S}.
7: repeat
8:   t ← t + 1.
9:   Obtain X^t of M elements from DS.
10:  Compute OX^t via (18).
11:  if |OX^t|/|X^t| > δ then
12:    Create a new sphere S for OX^t.
13:    Incorporate each element of X^t \ OX^t into the nearest data domain by updating the corresponding sphere.
14:    Add the new sphere S to SS, i.e., SS ← SS ∪ {S}.
15:  else
16:    Incorporate each element of X^t into the nearest data domain by updating the corresponding sphere.
17:  end if
18:  Merge any two spheres S and S̄ if SDist(S, S̄) < η.
19:  Assign the data elements of X^t to the closest clusters.
20:  Delete any old BSV whose BSV age exceeds ζ.
21: until no data chunk arrives

Algorithm 1 summarizes the proposed SVStream algorithm. The algorithm dynamically maintains multiple spheres in a sphere set SS. These spheres are used to store the most useful summary information of the historical data and to generate the cluster structure. The sphere set is updated in one of the following two ways when a new data chunk arrives.
• If a dramatic change occurs, a new sphere is created for the data elements lying outside the existing B-surfaces, and each of the remaining data elements of this chunk is incorporated into the nearest data domain.
• If a gradual change occurs or the stream remains stable, each data element of this chunk is directly incorporated into the nearest data domain.
In both cases, some spheres may need to be merged with each other as a result of enlarging or creating spheres. In addition, eliminating old BSVs by the BSV decaying mechanism helps detect the trend for a cluster to shrink or split. In this way, our algorithm is capable of detecting both dramatic and gradual changes in a data stream.

Fig. 2. SVC clustering without and with BSVs [20]. (a) The rings cannot be separated when no BSV is allowed. (b) The separation gap between two overlapping rings is clear when BSVs are allowed.

There may be overlap between two clusters whose separation gap is disturbed by the data points lying between them. According to the SVC theory [20], if no BSV is allowed (i.e., setting C ≥ 1), the SVs will lie in the overlapping region such that the overlapping clusters cannot be separated, as shown in Fig. 2(a). However, if the data points in the overlapping region are taken as BSVs, the separation gap becomes very clear, as shown in Fig. 2(b). Therefore, by setting the input trade-off parameter C < 1, the proposed SVStream algorithm is effective in dealing with data streams consisting of overlapping clusters.

A by-product of allowing for BSVs is the ability to handle outliers (noise) [20]. When BSVs are allowed, the kernel sphere only contains the data points of relatively high density, leaving the BSVs (i.e., those far away from the core of the clusters) outside the S-surfaces of spheres. By introducing the BSV decaying mechanism, old BSVs are taken as outliers and automatically eliminated. The ability to handle outliers is especially beneficial in data stream clustering, where random outliers appear occasionally due to various factors in the generation source of a stream.
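A compact driver mirroring Algorithm 1 can be assembled from the sketches above (train_svdd, Sphere, sdist, route_chunk, update_sphere_points); sphere merging and the per-element label bookkeeping are abbreviated to comments. This is an illustrative outline under those assumptions, not the authors' code.

```python
import numpy as np

def make_sphere(pts, q, C):
    """Build a Sphere (Definition 1) from raw points via the dual (5)."""
    pts = np.asarray(pts)
    beta, R, radius, sv, bsv = train_svdd(pts, q, C)
    K = np.exp(-q * np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=2))
    r_bsv = max((R(pts[i]) for i in bsv), default=radius)   # B-radius (16)
    return Sphere(pts[sv], beta[sv], pts[bsv], beta[bsv],
                  float(beta @ K @ beta), radius, r_bsv), R

def svstream(stream, q, C, delta=0.6, eta=1.0, zeta=2):
    spheres, R_funcs = [], []
    for chunk in stream:                         # lines 3-9 of Algorithm 1
        chunk = np.asarray(chunk)
        if not spheres:                          # first chunk: first sphere
            s, R = make_sphere(chunk, q, C)
            spheres.append(s); R_funcs.append(R)
            continue
        new_pts, assign = route_chunk(chunk, spheres, R_funcs, delta)
        if new_pts:                              # lines 11-14: new sphere
            s, R = make_sphere(new_pts, q, C)
            spheres.append(s); R_funcs.append(R)
        for m, pts in assign.items():            # lines 13/16: sphere updates
            if pts:
                T, _ = update_sphere_points(spheres[m], R_funcs[m], pts)
                spheres[m], R_funcs[m] = make_sphere(T, q, C)
        # line 18: merge S, S' whenever sdist(S, S', q) < eta, retraining on
        # the union of their SVs and BSVs; line 20: apply decay_bsvs(., zeta)
    return spheres
```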

4 EXPERIMENTAL RESULTS

In this section, we evaluate the effectiveness and efficiency of SVStream and compare it with several well-known data stream clustering algorithms, including StreamKM (STREAM K-Means) [16], CluStream [1], DenStream [8], RepStream [32] and StrAP [27]. All the experiments are conducted on a PC with a Core(TM)2 Duo 2.4 GHz processor and 2 GB memory, running the Windows XP Professional operating system. Unless otherwise mentioned, the optimal values of the outer-sphere threshold δ and the sphere merging threshold η are set to 0.6 and 1 respectively. The reasons are that setting δ to 0.6 implies that a new sphere is created only if more than half of the data elements in the new chunk lie outside the existing B-surfaces, and setting η to 1 means that two spheres are merged only if their S-surfaces overlap, which will be experimentally validated in the Parameter Analysis subsection.

Fig. 3. The distribution of the two synthetic data streams. (a) Ring-Ball. (b) Smileface-Twomoons.

4.1 Data Sets and Evaluation

Two synthetic data streams are first generated to test the effectiveness of SVStream on overlapping and evolving streams respectively. The first synthetic stream, called Ring-Ball, consists of a series of 2000 data elements belonging to two overlapping classes. The separation gap between the two classes is not clear in streaming due to the disturbance caused by the data points lying between them. A randomized order that does not change during the experiment is taken as the order of streaming. The overall distribution of the stream is shown in Fig. 3(a). The second synthetic stream, called Smileface-Twomoons, is a data stream characterized by both dramatic and gradual changes with noise. It consists of a series of 7000 data elements with 3.4% noise, as shown in Fig. 3(b). In order to test the performance with respect to both dramatic and gradual changes, these data elements are arranged in such an order that the three classes belonging to the smile face are presented first (Fig. 7(a)), then the two classes belonging to the two moons emerge (Fig. 8(a)), and finally the upper two classes belonging to the smile face gradually merge (Fig. 9(a)).

Two real-world data streams from the UCI KDD Archive [40] are used: the KDD-CUP'99 Network Intrusion Detection stream data set (KDDCUP99) and the Forest CoverType data set (Forest-CoverType). The KDDCUP99 data set is a real dataset that evolves significantly over time and has been widely used to evaluate data stream clustering algorithms [1], [8], [16], [17], [32]. It consists of a series of TCP connection records of LAN network traffic managed by MIT Lincoln Labs. The complete data set contains approximately 4.9 million records, and as in the previous work [1], [17], [26], [32], a sub-sampled subset of length 494020 is used. Each connection is classified into either a normal connection or an intrusion (attack). The attacks fall into four main categories: DOS (denial-of-service, e.g., syn flood), R2L (unauthorized access from a remote machine, e.g., guessing passwords), U2R (unauthorized access to local superuser privileges, e.g., various "buffer overflow" attacks), and PROBING (surveillance and other probing, e.g., port scanning). Each connection record in this dataset contains 42 attributes, and as in [1], [8], [16], [17], [32], all 34 continuous attributes are used for clustering and one outlier point has


been removed. The Forest-CoverType data set contains a total of 581012 observations belonging to seven forest cover types. Each observation consists of 54 geological and geographical features that describe the environment in which trees are observed, including 10 quantitative variables, 4 binary wilderness areas and 40 binary soil type variables. As in [17], [26], all the 10 quantitative variables are used. These two real-world datasets are converted into data streams by taking the data input order as the order of streaming.

The clustering quality is evaluated by the Rand index (RI) [41], which measures how accurately a clusterer can classify data elements by comparing cluster labels with the underlying class labels. Given N data points, the total $\binom{N}{2}$ distinct pairs of data points can be categorized into four categories: (1) pairs having the same cluster label and the same class label, with their number denoted as $N^{11}$; (2) pairs having different cluster labels and different class labels, with their number denoted as $N^{00}$; (3) pairs having the same cluster label but different class labels, with their number denoted as $N^{10}$; (4) pairs having different cluster labels but the same class label, with their number denoted as $N^{01}$. The well-known Rand index (RI) is defined as [41]

$$RI = (N^{11} + N^{00}) \Big/ \binom{N}{2}. \qquad (22)$$

The value of the Rand index lies between 0 and 1; a higher Rand index indicates better clustering results. In data stream clustering, we report both the Rand index for each chunk and the average Rand index for the whole data stream.
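For reference, the Rand index (22) can be computed directly from its definition; a small, unoptimized sketch:

```python
from itertools import combinations

def rand_index(cluster_labels, class_labels):
    """Rand index (22): fraction of pairs on which the clustering and the
    ground-truth classes agree (the N^11 and N^00 pairs)."""
    n_agree = n_pairs = 0
    for i, j in combinations(range(len(class_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        n_agree += int(same_cluster == same_class)
        n_pairs += 1
    return n_agree / n_pairs

# e.g. rand_index([1, 1, 2, 2], [0, 0, 5, 5]) == 1.0
```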

4.2 Data Streams of Overlapping Clusters

To demonstrate the effectiveness of SVStream in discovering overlapping clusters in data streams, we first compare the performances of SVStream on the Ring-Ball data stream with C set to different values. The parameters are set as follows: the chunk size M = 100, the Gaussian kernel parameter q = 9 and the BSV decaying threshold ζ = 2.

Figure 4 compares the performances of SVStream on Ring-Ball with and without BSVs, by setting C = 0.25 and C = 1 respectively. By taking the data elements in the overlapping region as BSVs, the separation gap between the two overlapping clusters becomes very clear and an accurate cluster structure is obtained, i.e., SVs, BSVs and the corresponding contours which directly form the cluster boundaries (Fig. 4(b)). Therefore, it achieves a good classification of the data elements of the current chunk (Fig. 4(c)). On the other hand, when no BSV is allowed by setting C = 1, the overlapping clusters are mistakenly merged into one cluster due to the disturbance caused by the data points lying between them. Please note that, in addition to the disturbing data points of the current chunk, the maintained information (i.e., SVs) of the historical data points may aggravate the failure to separate the overlapping clusters, since by allowing no BSVs, the maintained SVs may contain some data points lying between clusters. However, by allowing for BSVs, this problem can be effectively tackled.

Fig. 4. Comparing the performances of SVStream on Ring-Ball with (C = 0.25) and without (C = 1) BSVs. (a) The first 1100 elements. (b) X^11 spheres, C = 0.25. (c) X^11 clustering, C = 0.25. (d) X^11 spheres, C = 1. (e) X^11 clustering, C = 1.

Figure 5 plots the Rand index as a function of the chunk step with C set to 0.25.

Fig. 5. The Rand index as a function of the chunk step by SVStream on the Ring-Ball data stream.

Amongst the 20 chunks, low values of the Rand index are obtained only in the first two chunks, where the two clusters are mistakenly split. The main reason is that the relatively sparse distribution of these two chunks leads to great gaps lying within these two clusters, as shown in Figures 6(a) and 6(b). Amongst the remaining 18 chunks, a high Rand index is obtained for each chunk, and the average Rand index is 0.9648. The average Rand index over all 20 chunks is as high as 0.9267.

Fig. 6. The two erroneously clustered chunks on Ring-Ball with M = 100, q = 9, C = 0.25 and ζ = 2. (a) X^1 clustering. (b) X^2 clustering.

Fig. 7. SVStream on Smileface-Twomoons: step 10. (a) X^1 ∪ X^2 ∪ ... ∪ X^10. (b) Spheres till X^10. (c) X^10 clustering.

Fig. 8. SVStream on Smileface-Twomoons: step 12. (a) X^1 ∪ X^2 ∪ ... ∪ X^12. (b) Spheres till X^12. (c) X^12 clustering.

Fig. 9. SVStream on Smileface-Twomoons: step 51. (a) X^1 ∪ X^2 ∪ ... ∪ X^51. (b) Spheres till X^51. (c) X^51 clustering.

4.3 Evolving Data Streams with Noise

In this subsection, we run SVStream on the Smileface-Twomoons data stream to demonstrate its effectiveness in clustering evolving data streams with noise. The parameters are set as M = 100, q = 11, C = 0.3 and ζ = 2. As aforementioned, the data elements of Smileface-Twomoons are arranged in such an order that both dramatic and gradual changes occur in streaming. Figures 7 to 10 plot the results in four key steps (10, 12, 51 and 70) respectively. Figure 7 shows that the proposed algorithm is capable of discovering clusters of arbitrary shape by constructing accurate cluster boundaries from SVs (Fig. 7(b)). Figure 8 shows that when a dramatic change occurs, i.e., two new classes emerge (Fig. 8(a)), the proposed algorithm is also effective in detecting this change (Fig. 8(b)) and identifying the new classes (Fig. 8(c)). In step 51, the data chunk X^51 has brought a new trend that two old classes are merging (Fig. 9(a)). Figure 9 shows that SVStream can effectively discover this change, forming adaptive cluster boundaries (Fig. 9(b)) and making accurate cluster assignments to the

Fig. 10. SVStream on Smileface-Twomoons: step 70. (a) All deleted BSVs. (b) Spheres till X^70. (c) X^70 clustering.

data elements of this chunk (Fig. 9(c)). Figure 10 shows the result of the last chunk, as well as all the deleted BSVs. In particular, by comparing the overall deleted BSVs in Fig. 10(a) with the distribution of all data elements in Fig. 3(b), it is obvious that most of the outliers scattered in the data space have been correctly detected and removed. As will be discussed soon, amongst the 70 chunks, SVStream has correctly classified 65 chunks, achieving a Rand index higher than 0.95, which is a very satisfactory result.

Figure 11 plots the Rand index as a function of the chunk step. From this figure, amongst the 70 chunks, only 5 chunks have been mistakenly clustered by either merging or splitting

Fig. 11. The Rand index as a function of the chunk step on Smileface-Twomoons with M = 100, q = 11, C = 0.3 and ζ = 2.

Fig. 12. All the five mistakenly clustered chunks on Smileface-Twomoons with M = 100, q = 11, C = 0.3 and ζ = 2. (a) X^1 clustering. (b) X^11 clustering. (c) X^47 clustering. (d) X^52 clustering. (e) X^65 clustering.

Fig. 13. The detection rate and false detection rate of outliers on Smileface-Twomoons with different ζ, C and q.

some clusters. These 5 mistakenly classified chunks are shown in Fig. 12. In the first step (Fig. 12(a)), two clusters have been mistakenly split, which is caused by the relatively sparse distribution of X^1, i.e., the great gaps lying within these two clusters. For the same reason, in step 11, the two-moons clusters have been mistakenly split. In steps 47 and 52, however, some clusters belonging to the smile face have been mistakenly merged by the outliers lying between them, as shown in Figures 12(c) and 12(d) respectively. And in step 65, one cluster has been split due to the loss of some key SVs (Fig. 12(e)). Nevertheless, amongst the remaining 65 correctly clustered chunks, the lowest Rand index is 0.9655, and the average Rand index is as high as 0.9928. Over all 70 chunks, SVStream achieves an average Rand index of up to 0.9857. The key reasons for achieving such a good clustering result on this dynamic data stream with noise are the multi-sphere representation, which is capable of capturing dramatic and gradual changes, and the BSV decaying mechanism, which is effective in eliminating the disturbance caused by outliers.

Moreover, Fig. 13 plots the detection rate and the false detection rate of outliers with different ζ, C and q respectively⁴, which are the three key parameters affecting the performance of outlier detection. The detection rate is computed as the ratio of the number of correctly detected outliers to the overall number of outliers, and the false detection rate is computed as the ratio of the number of non-outliers that are detected as outliers to the number of overall non-outliers. It can be seen that, as the BSV decaying threshold ζ decreases from 5 to 1, the detection rate increases (i.e., more outliers are correctly detected and removed); however, the false detection rate also increases (i.e., more non-outliers are falsely detected as outliers). The reason is that, by decreasing ζ, the BSVs are maintained in the spheres for a much shorter time, and are more likely to be detected as outliers. Similarly, as C decreases from 0.34 to 0.26, the detection rate and the false detection rate increase. This phenomenon follows from the SVDD theory [19] itself: as the trade-off parameter C decreases, more and more β_i approach the value of C, i.e., become BSVs, so there are more potential outliers. Another SVDD-theory-based parameter affecting the detection of outliers is the Gaussian kernel width parameter q. As shown in this figure, the detection rate as well as the false detection rate increase as q increases. According to the SVDD theory [19], the increase of q directly decreases the smoothness of the kernel sphere, and hence more BSVs are generated. Overall, there exists a common inflection point for the three parameters (i.e., ζ = 2, C = 0.3, q = 11), at which a proper trade-off between the detection rate and the false detection rate is obtained. It should be pointed out that, unlike C and q, the parameter ζ is independent of the feature space of datasets, i.e., the attribute type and range, since it only controls when to transfer BSVs to outliers. Therefore, there exists a common value for this parameter. According to the inflection point in this analysis, we set ζ to 2 as default in all experiments.

4. When analyzing one of the three parameters, the default values of the other two parameters are ζ = 2, C = 0.3 and q = 11.
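The two outlier metrics of Fig. 13 reduce to simple ratios over boolean masks; a small sketch with illustrative names:

```python
import numpy as np

def outlier_rates(detected, true_outlier):
    """Detection rate and false detection rate as defined above."""
    detected = np.asarray(detected, dtype=bool)
    true_outlier = np.asarray(true_outlier, dtype=bool)
    detection_rate = (detected & true_outlier).sum() / true_outlier.sum()
    false_rate = (detected & ~true_outlier).sum() / (~true_outlier).sum()
    return detection_rate, false_rate
```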

4.4 Parameter Analysis

In this subsection, we report and compare the clustering results of SVStream on the four testing streams when using different values of the chunk size M, the Gaussian kernel parameter q, and the trade-off parameter C. When analyzing one of the three parameters, the default values of the other two parameters listed in Table 1 are used. Moreover, this subsection also analyzes the connection between the performance and the parameters involved in the multi-sphere representation, i.e., the outer-sphere threshold δ and the sphere merging threshold η.

TABLE 1
The default values of chunk size M, Gaussian kernel parameter q, and trade-off parameter C.

Data streams          M    q      C
Ring-Ball             100  9      0.25
Smileface-Twomoons    100  11     0.3
KDDCUP99              100  0.011  0.2
Forest-CoverType      100  16     0.2

Fig. 14. Average Rand index, execution time, number of SVs and BSVs, and memory usages vs. chunk size M. (a) Average Rand index. (b) Execution time. (c) Number of SVs and BSVs. (d) Memory usages.

Fig. 15. Average Rand index vs. Gaussian kernel parameter q on the four testing streams. (a) Ring-Ball. (b) Smileface-Twomoons. (c) KDDCUP99. (d) Forest-CoverType.

4.4.1 Chunk Size M

We first analyze the performance in terms of average Rand index, execution time and memory usage when using different chunk sizes M on the four testing data streams. Figure 14 reports the results. From Fig. 14(a), it is clear that on all testing streams except Ring-Ball, the average Rand index remains stable when different M are used, with the highest average Rand index achieved at M = 100 in most cases. The execution time, however, strongly depends on the chunk size M, as shown in Fig. 14(b). The memory usage is mainly due to the storage of the data elements of the current chunk, the SVs and the BSVs. Figure 14(c) shows the maximum number of SVs plus BSVs in clustering the whole data streams, which is always less than 200. By comparing Fig. 14(c) and Fig. 14(d), we can see that the memory usage depends on both the number of SVs plus BSVs and the data dimensionality. On KDDCUP99, the number of SVs plus BSVs is less than half of that on Ring-Ball and Smileface-Twomoons. However, the memory usage on KDDCUP99 is about four times higher, due to the fact that the data dimensionality of KDDCUP99 is 33, much higher than that of Ring-Ball and Smileface-Twomoons. Based on the above analysis, considering both effectiveness and efficiency, a rational choice for M is 100, which is used as the default parameter in this paper.

4.4.2 Gaussian Kernel Parameter q

We analyze how the average Rand index and the detected cluster number depend on the Gaussian kernel parameter q. To this end, we report the cluster detection rate, that is, the ratio of the number of chunks in which the correct cluster number is detected to the overall number of chunks. As discussed in [20], the Gaussian kernel width parameter q controls the smoothness of the contour generated by SVDD, and its value should be chosen larger than the multiplicative inverse of the maximum squared Euclidean distance between all data points. In the data stream model, this can be selected in the first chunk by setting q slightly larger than $1/\max_{i,j} \|x^1_i - x^1_j\|^2$.
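In code, this first-chunk heuristic is a one-liner; the slack factor below is an illustrative choice for "slightly larger":

```python
import numpy as np

def pick_q(first_chunk, slack=1.1):
    """q slightly above 1 / max_{i,j} ||x_i^1 - x_j^1||^2 (Section 4.4.2)."""
    X = np.asarray(first_chunk)
    max_sq = np.max(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    return slack / max_sq
```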

Figure 15 shows the average Rand index on the four testing streams using different kernel parameters q. As in SVC, there exists a relatively wide range of q that generates good clustering. On the Ring-Ball data stream, by setting q to values ranging between 7 and 13, SVStream generates acceptable clustering results, achieving a minimum average Rand index as high as 0.85 (Fig. 15(a)) and an average cluster detection rate of 0.83. In particular, it achieves the highest average Rand index of 0.9267 when q is set to 9. On Smileface-Twomoons, by setting q to values ranging from 10 to 15, a minimum average Rand index of 0.94 (Fig. 15(b)) and an average cluster detection rate of 0.68 are obtained. It obtains the highest average Rand index of 0.98 when q is set to 11. On the real KDDCUP99 data stream, a minimum average Rand index of 0.658 is achieved when q is set in the range [0.006, 0.096], as shown in Fig. 15(c), and in most cases it detects the right cluster number, with the average cluster detection rate higher than 0.7. From the figure, we can see that the most suitable kernel parameter on this data stream is 0.011. Similarly, on the real Forest-CoverType, setting q in the range [11, 19] generates a minimum average Rand index higher than 0.705 (Fig. 15(d)) with an average cluster detection rate as high as 0.75. The most suitable kernel parameter, achieving the highest average Rand index, is 16.

Fig. 16. Average Rand index vs. trade-off parameter C on the four testing streams. (a) Two synthetic streams. (b) Two real streams.

4.4.3 Trade-off Parameter C

This subsubsection analyzes the connections between the trade-off parameter C and the clustering results in terms of the average Rand index and the cluster number detection rate. As discussed in [20], the trade-off parameter C mainly controls the size of the sphere, as well as the number of BSVs. If it is too small, a very large number of data elements may become

BSVs; on the other hand, if it is larger than 1, no BSVs will be generated. According to the previous work on SVC [20], [33], [34], [35], [36], the value of C should be chosen from [0.1, 0.3], such that a moderate number of BSVs can be generated. The average Rand indices as a function of C on the four testing streams validate this empirical conclusion, as shown in Fig. 16. From the figure, it is clear that on both synthetic and real data streams, there exist wide ranges of C where satisfactory clustering results are obtained. On Ring-Ball and Smileface-Twomoons, when setting C in the range [0.22, 0.3], SVStream obtains average Rand indices higher than 0.8 and 0.9 respectively, as shown in Fig. 16(a). Moreover, in this range, the average cluster detection rates are 0.77 and 0.57 respectively. On Ring-Ball, the most suitable value of C is 0.25, where the average Rand index obtained is 0.9267 and the cluster detection rate is 0.85. On Smileface-Twomoons, the most suitable value of C is 0.3, where SVStream obtains an average Rand index as high as 0.98. On the two real data streams, setting C in the range from 0.1 to 0.3 generates stable results, with the average Rand indices being at least 0.65 and 0.68 respectively. The cluster detection rates on both real streams with C set between 0.1 and 0.3 are at least 0.7, which is a relatively good result. On the two real streams, the most suitable value of C is 0.2, as shown in Fig. 16(b).

Fig. 17. Average Rand index vs. multi-sphere parameters, i.e., δ and η. (a) Ring-Ball. (b) Smileface-Twomoons. (c) KDDCUP99. (d) Forest-CoverType.

4.4.4 Multi-sphere Parameters

In this subsubsection, we analyze the connection between the clustering performance of SVStream and the parameters involved in the multi-sphere representation, i.e., the outer-sphere threshold δ and the sphere merging threshold η. Figure 17 plots the average Rand indices vs. δ and η on the four data streams. In general, the clustering performance is stable over the pair ranges [δ, η] with δ and η within [0.5, 0.6] and [0.9, 1.0] respectively. In particular, on Smileface-Twomoons and the two real data streams, a pair (δ, η) with δ smaller than 0.5 and η smaller than 0.9 results in less satisfactory clustering results. Please note that a new sphere is created only if more than a δ fraction of the data elements in the new chunk lie outside the existing B-surfaces. If δ is set too small, i.e., smaller than 0.5, the condition $|OX^t|/|X^t| > \delta$ can be easily satisfied, and therefore too many spheres would be generated. In conjunction with a small η, i.e., smaller than 0.9, which means that these spheres are merged only if they are very close (with their sphere distance less than 0.9), the consequence is that too many spheres would persist for many chunks, leading to incorrect clustering results on these chunks. From this analysis, a good choice for these two parameters is to take the two bounds respectively, i.e., δ = 0.6 and η = 1.0, which are used as defaults in all experiments.

4.5 Clustering Quality Comparison

In this subsection, we compare the effectiveness and efficiency of SVStream with well-known data stream clustering algorithms in terms of average Rand index, execution time and memory usage.

4.5.1 Compared Algorithms and Experimental Setting

The optimal parameters analyzed in the aforementioned subsection (Table 1) are used for the SVStream algorithm. The five compared algorithms and their parameter settings are summarized below.

1. StreamKM (STREAM K-Means) [16], which processes the data stream in chunks of M data points. It summarizes each chunk by clustering its data into k centers, each weighted by the number of assigned data points. Once M centers have been collected, they are clustered into a smaller set of 2k centers, each weighted by the sum of the weights of its assigned centers. After one pass over the stream, these intermediate centers are further clustered into k final centers, and each data point is assigned to the nearest center. The chunk size M is set to 500 on all four streams.

2. CluStream [1], which separates the clustering process into an online micro-clustering component and an offline macro-clustering component. At the beginning, the first InitNumber data points are collected and clustered by k-means to create q initial micro-clusters. When a new data point arrives, it is added to the nearest micro-cluster if it lies within that micro-cluster's maximum boundary, defined as a factor t of the root-mean-square deviation of the micro-cluster's data points from its centroid; otherwise, a new micro-cluster is created. When a new micro-cluster is created, one old micro-cluster is deleted if its least relevance stamp is below a user-defined threshold δ; otherwise, the two closest micro-clusters are merged. At each clock time divisible by α^i for any integer i, the current set of micro-clusters is stored on disk as a snapshot using the pyramidal time frame technique. In the offline component, the desired k macro-clusters are generated at each clock time from the micro-clusters within the time horizon h. As in [1], the parameters are set as follows: the number of micro-clusters q = 10 × k, InitNumber = 1000, the maximum boundary factor t = 2, the user-defined threshold δ = 512, the pyramidal time frame parameters α = 2 and l = 10, and the time horizon h = 100.

3. DenStream [8], which divides the clustering process into an online part and an offline part. Each data point is weighted by a fading function f(t) = 2^{−λt}, with t its elapsed time and λ the fading factor (a small numerical illustration follows this list). Micro-clusters are characterized by the sum w of the weights of their data points and by their radius r (which cannot be larger than the maximum radius threshold ε), and are classified into p-micro-clusters (w ≥ β·μ) and o-micro-clusters (w < β·μ). The first InitNumber data points are collected and clustered by DBSCAN to create the initial p-micro-clusters. When a new data point arrives, DenStream tries to merge it into the nearest p-micro-cluster; failing that, it merges it into the nearest o-micro-cluster and checks whether this o-micro-cluster has become a p-micro-cluster; otherwise, it creates a new o-micro-cluster from this data point. In the offline part, clusters are generated by applying a variant of DBSCAN [29] to the p-micro-clusters. As in [8], the parameters are set as follows: InitNumber = 1000, the fading factor λ = 0.25, the maximum radius threshold ε = 16, the weighting threshold μ = 10 and the outlier threshold β = 0.2.

4. RepStream [32], which maintains two sparse k-NN graphs. The first sparse graph (SG) is used to select new representative vertices, and the second, the representative sparse graph (RSG), is used to track the connectivity between all representative vertices. When a new data point arrives, it is first added to SG. The point joins an existing cluster if it is reciprocally connected to the corresponding representative vertex; otherwise, it becomes a predictor vertex, which may later become a representative vertex and form a new cluster. Unused representative vertices are discarded in a first-in-first-out manner according to a user-specified decay rate λ. Larger clusters are formed when two or more representative vertices are density-related, which depends on a user-specified density scaler α. According to [32], the parameters are set as follows: the k-NN graph parameter k = 9, the decay rate λ = 0.99 and the density scaler α = 2.0.

5. StrAP [27], which extends affinity propagation (AP) [28] to data streams. A hierarchical AP (Hi-AP) is introduced, which first randomly splits the dataset into subsets and then performs AP on each subset to generate exemplars, each weighted by the number of assigned data points. These weighted exemplars are clustered using weighted AP (WAP) [27]. The stream clustering model is initialized by applying Hi-AP to the first InitNumber data points. When a new data point arrives, if its distance to the closest exemplar is less than the threshold ε, the point joins the cluster containing that exemplar; otherwise, the point is considered an outlier and put into the reservoir, which gathers the last M outliers. At every consecutive period of length Δ, any cluster that has received no data point is taken as an old cluster and deleted. At each step, a Page-Hinkley (PH) statistical test [42], [43] is applied to the outlier rate; if the PH value is larger than a user-specified parameter λ, the stream clustering model is rebuilt using WAP from the weighted exemplars and the outliers in the reservoir. According to [27], the parameters are set as follows: InitNumber = 1000, the outlier threshold ε set to the average distance between data points in the initialization step, the maximum number of outliers stored in the reservoir M = 100, the period length Δ = 50 and the PH test threshold λ = 0.01.
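The fading scheme in DenStream is the most formula-like element of the list above, so a small numerical illustration may help (our sketch with the parameter values quoted above, not the authors' code): each point contributes f(t) = 2^{−λt} after t time units, and a micro-cluster becomes a p-micro-cluster once its faded weight reaches β·μ.

```python
# Faded weight of a micro-cluster and its p/o classification in the
# DenStream setting quoted above.
LAMBDA, BETA, MU = 0.25, 0.2, 10.0

def fading(elapsed_time):
    # f(t) = 2^(-lambda * t)
    return 2.0 ** (-LAMBDA * elapsed_time)

def micro_cluster_kind(arrival_times, now):
    weight = sum(fading(now - t) for t in arrival_times)
    return "p-micro-cluster" if weight >= BETA * MU else "o-micro-cluster"

# Five points that arrived over the last four time units: the faded weight
# is about 3.64 >= beta * mu = 2, so this is a p-micro-cluster.
print(micro_cluster_kind([0, 1, 2, 3, 4], now=4))
```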


4.5.2 Rand index

We first compare the average Rand indices of the six algorithms, as shown in Fig. 18(a). In general, SVStream obtains the highest Rand indices on the four testing streams. On the two synthetic data streams, SVStream significantly outperforms the linear clustering methods, i.e., StrAP, CluStream and StreamKM, and it is still better than the state-of-the-art nonlinear clustering methods DenStream and RepStream. For instance, on the Ring-Ball data stream, our approach obtains an average Rand index as high as 0.9267, which is 0.0604 higher than the second winner RepStream and 0.2826 higher than the classical StreamKM algorithm (a 43.8% improvement). Similarly, on the Smileface-Twomoons data stream, our approach obtains an average Rand index of 0.9857, which is 0.0565 higher than the second winner RepStream and 0.3145 higher than StrAP (a 46.8% improvement). On the two real data streams, SVStream also outperforms the other five methods, by 5% on average. On the KDDCUP99 data stream, the proposed SVStream obtains an average Rand index of 0.6664, which is slightly better than the second winner DenStream by 0.0237 and a significant improvement over the three linear clustering methods (i.e., StrAP, CluStream and StreamKM) by at least 12.5%. On the other real data stream, Forest-CoverType, a relatively larger improvement is obtained, i.e., 5.8% over the second winner and 32.1% over StreamKM. Although SVStream does not improve as much on the two real data streams as on the two synthetic ones, the improvements are still encouraging, especially compared with the linear clustering methods. The main reason is that, compared with the linear clustering methods StrAP, CluStream and StreamKM, the nonlinear SVStream forms accurate cluster boundaries of arbitrary shape; meanwhile, compared with the existing nonlinear clustering methods DenStream and RepStream, the proposed SVStream is more effective in dealing with both dramatic and gradual changes as well as with outliers. Additionally, thanks to the generation of BSVs, the cluster boundaries constructed from the support vectors are more accurate than those produced by the compared algorithms.
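As a reminder of the evaluation metric, a minimal implementation of the Rand index [41] is sketched below: it is the fraction of point pairs on which the ground-truth partition and the produced clustering agree (both place the pair in one cluster, or both separate it).

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Count pairs that are treated the same way by both labelings.
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Identical partitions up to label renaming score 1.0:
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```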

4.5.3 Execution Time

This subsubsection compares the execution time consumed by the six algorithms. The execution time includes the time used to update the summary information (or cluster structure) and the time used to perform cluster labeling, over the whole data stream. For instance, in StreamKM, the execution time consists of two parts: the time used to generate the specified number of cluster centers in the first pass over the stream, and the time used to assign each data point to the nearest cluster center in the second pass. In CluStream and DenStream, the computational time includes both the online micro-clustering component and the offline macro-clustering component. In RepStream, the execution time includes the time used to construct the graphs and the time used for cluster labeling. In StrAP, the execution time is measured as the time elapsed from the beginning of the stream to the end, including the time for the PH statistical test [42], [43].

Figure 18(b) plots the execution time in seconds of the six algorithms.

[Fig. 18: Comparing the performance of SVStream with the five algorithms: (a) average Rand index, (b) execution time, (c) memory usage.]

From the figure, the proposed SVStream algorithm is comparable to its counterparts in terms of execution time: it is faster than RepStream, StrAP and StreamKM, while roughly equal to or slower than CluStream and DenStream. For instance, although it is slightly slower than DenStream, by 3% on average, SVStream is faster than the other nonlinear clustering method, RepStream, by at least 10%. The component that consumes the major part of SVStream's execution time is cluster labeling, which involves constructing an adjacency matrix and finding connected components. However, as introduced in the Background section, there exist recent algorithms aimed at speeding up the labeling process, the fastest of which are almost linear, e.g. [35]. Incorporating these techniques could make SVStream still more efficient in computational time.
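For concreteness, here is a minimal sketch of this labeling strategy in the style of standard SVC cluster labeling [20], [35] (our illustration, not SVStream's implementation): pairs of points are declared adjacent when the line segment between them stays inside the learned domain, and clusters are the connected components of the resulting graph. The `inside` predicate is a placeholder standing in for the trained SVDD domain test.

```python
import numpy as np

def cluster_labels(points, inside, n_samples=10):
    n = len(points)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            # Sample the segment between points i and j; adjacency requires
            # every sample to lie inside the learned domain.
            segment = (points[i] + s * (points[j] - points[i])
                       for s in np.linspace(0.0, 1.0, n_samples))
            adj[i, j] = adj[j, i] = all(inside(p) for p in segment)
    # Connected components via depth-first search: O(n^2) pairs overall,
    # which is why labeling dominates the running time.
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            v = stack.pop()
            if labels[v] == -1:
                labels[v] = current
                stack.extend(u for u in range(n)
                             if adj[v, u] and labels[u] == -1)
        current += 1
    return labels
```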


4.5.4 Memory Usage

We also compare the memory used by the six algorithms, measured directly as the peak memory usage of each algorithm during the stream clustering process. Figure 18(c) shows that the proposed SVStream algorithm has a significant advantage over its competitors in memory usage. For instance, on the two synthetic 2-D data streams, SVStream requires 62.4 and 67.6 KB of memory respectively, only about 10% of the memory used by the second winner. On KDDCUP99, the proposed SVStream approach consumes only 9% of the memory used by RepStream and no more than 5% of the memory used by StreamKM. The comparison result on the Forest-CoverType data stream is similar to that on the two synthetic data streams, which demonstrates that SVStream outperforms the existing data stream clustering algorithms in terms of memory usage. The main reason is that the multi-sphere set provides a very compact representation of the summary information of a data stream. For instance, as shown in Fig. 14(c), in all cases the number of SVs and BSVs is bounded by 200; that is, no more than 200 data elements are stored as the summary information in SVStream. The other algorithms, in contrast, take more memory to store their summary information, requiring more than 1000 intermediate points/micro-clusters.


5 CONCLUSIONS

In this paper, we have proposed an effective and efficient algorithm, called SVStream, for clustering evolving data streams consisting of overlapping clusters with noise. The algorithm dynamically maintains multiple spheres in a multi-sphere set, which represents the summary information of the historical data elements. According to the SVDD theory, the multi-sphere representation provides a very compact and accurate description of the historical data chunks, so it has limited memory consumption with guaranteed precision. The support vectors are used to construct cluster boundaries of arbitrary shape. To adapt to both dramatic and gradual changes of a stream, the multi-sphere set is updated whenever a new data chunk arrives (e.g., a new sphere is created when a dramatic change occurs). By allowing for BSVs, the proposed algorithm is capable of partitioning overlapping clusters. In addition, outliers (noise) can be effectively detected and removed via the BSV decaying mechanism. Experimental results on synthetic and real data streams demonstrate the effectiveness and efficiency of the proposed algorithm.

ACKNOWLEDGMENTS

This project was supported by the NSFC-GuangDong joint fund (U0835005) and the NSFC (60803083). The authors would like to thank the associate editor and the reviewers for their comments, which were very helpful in improving the manuscript.

REFERENCES

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proc. of the 29th VLDB Conf., 2003.
[2] A. Zhou, F. Cao, Y. Yan, C. Sha, and X. He, "Distributed data stream clustering: A fast EM-based approach," in Proc. of the 23rd Int. Conf. on Data Eng., 2007.
[3] H. Kargupta and B.-H. Park, "A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments," IEEE Trans. Knowl. Data Eng., vol. 16, no. 2, pp. 216–229, Feb. 2004.
[4] P. Zhang, X. Zhu, and Y. Shi, "Categorizing and mining concept drifting data streams," in Proc. of the 14th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 2008.
[5] P. Wang, H. Wang, X. Wu, W. Wang, and B. Shi, "A low-granularity classifier for data streams with concept drifts and biased class distribution," IEEE Trans. Knowl. Data Eng., vol. 19, no. 9, pp. 1202–1213, Sept. 2007.
[6] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for on-demand classification of evolving data streams," IEEE Trans. Knowl. Data Eng., vol. 18, no. 5, pp. 577–589, May 2006.
[7] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari, "Adapted One-versus-All decision trees for data stream classification," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 624–637, May 2009.
[8] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proc. of the 6th SIAM Int. Conf. on Data Mining, 2006.
[9] P. Zhang, X. Zhu, J. Tan, and L. Guo, "Classifier and cluster ensembles for mining concept drifting data streams," in Proc. of the 10th Int. Conf. on Data Mining, 2010.
[10] X. Zhu, P. Zhang, X. Lin, and Y. Shi, "Active learning from stream data using optimal weight classifier ensemble," IEEE Trans. Syst., Man, Cybern. Part B, Cybern., vol. 40, no. 6, pp. 1607–1621, Dec. 2010.


[11] Q. Zhang, J. Liu, and W. Wang, "Incremental subspace clustering over multiple data streams," in Proc. of the 7th Int. Conf. on Data Mining, 2007.
[12] ——, "Approximate clustering on distributed data streams," in Proc. of the 24th Int. Conf. on Data Eng., 2008.
[13] C. C. Aggarwal, "On change diagnosis in evolving data streams," IEEE Trans. Knowl. Data Eng., vol. 17, no. 5, pp. 587–600, May 2005.
[14] Y. Yang, X. Wu, and X. Zhu, "Combining proactive and reactive predictions for data streams," in Proc. of the 11th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 2005.
[15] X. Lin, J. Xu, Q. Zhang, H. Lu, J. X. Yu, X. Zhou, and Y. Yuan, "Approximate processing of massive continuous quantile queries over high-speed data streams," IEEE Trans. Knowl. Data Eng., vol. 18, no. 5, pp. 683–698, May 2006.
[16] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams: Theory and practice," IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 515–528, May 2003.
[17] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "On high dimensional projected clustering of data streams," Data Mining and Knowledge Discovery, vol. 10, pp. 251–273, 2005.
[18] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proc. of the 13th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 2007.
[19] D. M. Tax and R. P. Duin, "Support vector domain description," Pattern Recognition Letters, vol. 20, pp. 1191–1199, 1999.
[20] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," JMLR, vol. 2, pp. 125–137, 2001.
[21] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. SIGMOD, 1996.
[22] ——, "BIRCH: A new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.
[23] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proc. of the 41st Annual IEEE Symposium on Foundations of Computer Science, 2000.
[24] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proc. of the 22nd ACM Symposium on Principles of Database Systems, 2003.
[25] L. Tu and Y. Chen, "Stream data clustering based on grid density and attraction," ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 3, pp. 1–27, July 2009.
[26] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for projected clustering of high dimensional data streams," in Proc. of the 30th VLDB Conf., 2004.
[27] X. Zhang, C. Furtlehner, J. Perez, C. Germain-Renaud, and M. Sebag, "Toward autonomic grids: Analyzing the job flow with affinity streaming," in Proc. of the 15th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 2009.
[28] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, pp. 972–976, 2007.
[29] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996.
[30] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proc. of the 10th Int. Conf. on Data Mining, 2010, pp. 531–540.
[31] ——, "Conscience on-line learning (COLL): An efficient approach for robust kernel-based clustering," Knowledge and Information Systems, in press, 2011.
[32] S. Lühr and M. Lazarescu, "Incremental clustering of dynamic data streams using connectivity based representative points," Data & Knowledge Engineering, vol. 68, pp. 1–27, 2009.
[33] D. Lee and J. Lee, "Dynamic dissimilarity measure for support-based clustering," IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 900–905, June 2010.
[34] J. Yang, V. Estivill-Castro, and S. K. Chalup, "Support vector clustering through proximity graph modelling," in Proc. of the 9th Int. Conf. on Neural Inf. Processing, 2002.
[35] J. Lee and D. Lee, "An improved cluster labeling method for support vector clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 461–464, March 2005.
[36] ——, "Dynamic characterization of cluster structures for robust and inductive support vector clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1869–1874, Nov. 2006.
[37] N. A. Syed, H. Liu, and K. K. Sung, "Handling concept drifts in incremental learning with support vector machines," in Proc. of the 5th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 1999.


[38] R. Klinkenberg and T. Joachims, "Detecting concept drift with support vector machines," in Proc. of the 17th Int. Conf. on Machine Learning, 2000.
[39] C. Domeniconi and D. Gunopulos, "Incremental support vector machine construction," in Proc. of the 1st Int. Conf. on Data Mining, 2001.
[40] S. Hettich and S. D. Bay, "The UCI KDD Archive," [http://kdd.ics.uci.edu], Irvine, CA: University of California, Department of Information and Computer Science, 1999.
[41] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, Dec. 1971.
[42] E. Page, "Continuous inspection schemes," Biometrika, vol. 41, pp. 100–115, 1954.
[43] D. Hinkley, "Inference about the change-point from cumulative sum tests," Biometrika, vol. 58, pp. 509–523, 1971.

Chang-Dong Wang received the B.S. degree in applied mathematics in 2008 and the M.Sc. degree in computer science in 2010 from Sun Yat-sen University, Guangzhou, P. R. China, where he began pursuing the Ph.D. degree in September 2010. His current research interests include machine learning and data mining, with a focus on data clustering and its applications. He has published several scientific papers in international journals and conferences such as Knowledge and Information Systems (KAIS) and ICDM. At ICDM 2010, he received the IEEE TCII Student Travel Award, and his paper won an Honorable Mention for the Best Research Paper Awards.

Jian-Huang Lai received his M.Sc. degree in applied mathematics in 1989 and his Ph.D. in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an Assistant Professor, where he is currently a Professor with the Department of Automation of the School of Information Science and Technology and vice dean of the school. Dr. Lai organized the International Conference on Advances in Biometric Personal Authentication 2004, which was also the Fifth Chinese Conference on Biometric Recognition (Sinobiometrics'04), held in Guangzhou in December 2004. He has published over 80 scientific papers in international journals and conferences on image processing and pattern recognition. His current research interests are in digital image processing, pattern recognition, multimedia communication, and wavelets and their applications. Prof. Lai serves as a standing member of the Image and Graphics Association of China and as a standing director of the Image and Graphics Association of Guangdong.

Dong Huang received the B.S. degree in computer science in 2009 from South China University of Technology and the M.Sc. degree in computer science in 2011 from Sun Yat-sen University, Guangzhou, P. R. China. Since 2011, he has been working toward the Ph.D. degree under the supervision of Jian-Huang Lai at Sun Yat-sen University. His current research focuses on streaming data clustering and video analysis.

Wei-Shi Zheng received his Ph.D. degree in applied mathematics from Sun Yat-sen University, China, in 2008. He was subsequently a Postdoctoral Researcher on the European SAMURAI Research Project at the Department of Computer Science, Queen Mary University of London, UK. He joined Sun Yat-sen University as a faculty member under the university's one-hundred-people program in 2011. He has published widely in IEEE TPAMI, IEEE TNN, IEEE TIP, IEEE TSMC-B, Pattern Recognition, ICCV, CVPR and AAAI. His current research interests are in object association and categorization for visual surveillance. He is also interested in discriminant/sparse feature extraction, dimension reduction, kernel methods in machine learning, transfer learning, and face image analysis.
