2011 11th IEEE International Conference on Data Mining Workshops

Incremental Support Vector Clustering

Chang-Dong Wang, Jian-Huang Lai, Dong Huang
School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P. R. China.
Email: [email protected], [email protected], [email protected]

Abstract—Support vector clustering (SVC) is a flexible clustering method inspired by support vector machines (SVM). Due to its advantage in discovering clusters of arbitrary shapes, it has been widely used in many applications. However, one bottleneck that restricts the scalability of the method is its significantly high time complexity. Both of its two main stages, namely, sphere construction and cluster labeling, are quite time-consuming. Although some methods have been developed to speed up cluster labeling, constructing a sphere for a large-scale dataset remains an intractable task. To this end, we propose a novel incremental support vector clustering (ISVC) algorithm, which constructs a sphere incrementally and efficiently. In our approach, by taking the data as arriving over time in chunks, the support vectors of the historical data and the data points of the new chunk are used to learn an updated sphere. Theoretical analysis has shown that the proposed ISVC algorithm can generate completely the same clustering results as SVC with much lower time and memory consumption. Experimental results on large-scale datasets have validated the theoretical analysis.

Keywords—data clustering; support vector clustering; incremental learning; kernel method

I. INTRODUCTION

Data clustering has received a significant amount of attention in various research fields, and many approaches have been developed from different viewpoints [1]–[6]. Support vector clustering (SVC) [1] is a flexible clustering method inspired by support vector machines (SVM) [7]. By mapping data points from the input data space to a high-dimensional kernel space, a spherically shaped boundary enclosing most of the mapped data points is constructed. When mapped back to the input data space, this sphere is separated into several components, each of which encloses a separate cluster of data points. Constructing connected components is accomplished by computing an adjacency matrix of the dataset according to the sphere [1]. By using a nonlinear mapping, the cluster boundaries discovered by SVC can be of arbitrary shapes, which is one of the major advantages of SVC over other clustering methods.

Despite its success [5], [8]–[13], one bottleneck restricting its scalability is the high time complexity. Both of its two main stages, namely sphere construction and cluster labeling (i.e. computing the adjacency matrix), are of high computational complexity. For instance, it takes at least 3.09 × 10^5 seconds to construct a sphere on a dataset consisting of 3.5 × 10^3 data points, and it runs out of memory when constructing a sphere on a dataset of size 1.0 × 10^4 (Matlab R2009a 64 bit edition on Windows 64 bit, 8 Intel 2.00GHz processors, 8GB of RAM). Some methods have been proposed to accelerate the cluster labeling process [5], [8]–[10]. In [5], a proximity graph modeling was used to reduce cluster labeling time, where three cluster labeling methods were introduced, namely, Delaunay diagram (DD), minimum spanning tree (MST) and K-nearest neighbor (KNN). In [8], the topological property of the trained kernel radius function was utilized to efficiently and robustly assign cluster labels, where an SEP (Stable Equilibrium Point)-based complete graph cluster labeling method was developed. Similarly, an equilibrium vector-based cluster labeling method was proposed in [9]. Another cluster labeling method relies only on computing a support vector graph rather than a complete graph, and is termed the SVG (Support Vector Graph) labeling method [1]. These efforts have made cluster labeling very efficient in some real-world applications. However, the very high time complexity of sphere construction remains a bottleneck which restricts the scalability of SVC.

Efficient versions (e.g., incremental learning) of support vector machines (SVM), on the other hand, have been widely studied [14]–[19], and can efficiently construct decision boundaries in large-scale SVM problems. In [14], the data are taken as arriving over time in chunks. At each step, only the support vectors (SVs) of the previous data are preserved and added to the new training data, which generates an SVM model the same as or similar to that obtained by using all the data together. Two improved approaches were presented in [15] and [16] respectively, which are effective in dealing with concept changes in incremental learning. In [19], a fast gradient method was proposed for SVM to solve large-scale problems. Most of these efficient SVM approaches are based on the same basic idea presented in [14]: only the SVs are preserved at each step and added to the new training data so as to learn a new model.

Incremental (unsupervised) clustering is more challenging than supervised incremental learning [4], [20], which hinders the development of incremental clustering. In incremental clustering, without any cluster label or other prior knowledge about the class distribution, it is difficult to decide what summary information of the historical data should be used for accurately learning an updated model. Inspired by the previous work on incremental SVM, this paper for the first time proposes an incremental support

vector clustering (ISVC) algorithm. The basic idea is to use the SVs as the representative information obtained so far to learn a new sphere. At each step, only the SVs are preserved, and they are added to the data points of the new chunk to form the current data so as to learn an updated sphere. Theoretical analysis has shown that the proposed ISVC algorithm can generate a cluster structure (i.e. sphere) the same as SVC, assuming that no outlier is present in the data; meanwhile it dramatically reduces time consumption. Experimental results have validated the theoretical analysis.

The remainder of this paper is organized as follows. Section II reviews support vector clustering and presents the motivation for this work. In Section III, we describe the proposed incremental support vector clustering algorithm and provide a theoretical analysis. Experimental results are reported in Section IV. We conclude our paper and present the future work in Section V.

II. BACKGROUND AND MOTIVATION

A. Support Vector Clustering

This section reviews support vector clustering (SVC) derived in [1]. Let {x_i} ⊂ X be a given dataset consisting of N data points in R^d. Using a nonlinear transformation φ from the input data space to a high-dimensional feature space, we first look for the smallest enclosing sphere of radius R, which is described by the constraints

\| \phi(x_i) - \mu \|^2 \le R^2 + \xi_i, \quad \forall i = 1, \dots, N,   (1)

where μ is the sphere center and ξ_i ≥ 0, i = 1, …, N are slack variables allowing for soft boundaries. To solve this problem, we introduce the Lagrangian

L = R^2 - \sum_{i=1}^{N} \left( R^2 + \xi_i - \| \phi(x_i) - \mu \|^2 \right) \beta_i - \sum_{i=1}^{N} \xi_i \alpha_i + C \sum_{i=1}^{N} \xi_i,   (2)

where β_i ≥ 0 and α_i ≥ 0 are Lagrange multipliers, C is a constant, and C Σ_{i=1}^{N} ξ_i is the penalty term. Setting to zero the derivative of L with respect to R, μ and ξ_i respectively leads to

\sum_{i=1}^{N} \beta_i = 1, \qquad \mu = \sum_{i=1}^{N} \beta_i \phi(x_i), \qquad \beta_i = C - \alpha_i.   (3)

The KKT complementarity conditions result in

\xi_i \alpha_i = 0, \qquad \left( R^2 + \xi_i - \| \phi(x_i) - \mu \|^2 \right) \beta_i = 0.   (4)

By eliminating the variables R, μ, ξ_i and α_i, the Lagrangian can be turned into the Wolfe dual form

\max_{\beta_i} W = \sum_{i=1}^{N} \beta_i K(x_i, x_i) - \sum_{i,j=1}^{N} \beta_i \beta_j K(x_i, x_j)
\text{subject to } \sum_{i=1}^{N} \beta_i = 1, \quad 0 \le \beta_i \le C, \ \forall i = 1, \dots, N,   (5)

where the inner product φ(x_i) · φ(x_j) is replaced by an appropriate Gaussian kernel K(x_i, x_j) = exp(−q ‖x_i − x_j‖^2) with the width parameter q. According to the values of the Lagrange multipliers β_i, i = 1, …, N, there are two types of representative data points:
• Support vector (SV): 0 < β_i < C, which lies on the sphere surface. We use SV to denote the SV set, that is, SV = {x_i | 0 < β_i < C, i = 1, …, N}.
• Bounded support vector (BSV): β_i = C, which lies outside the sphere surface. We use BSV to denote the BSV set, that is, BSV = {x_i | β_i = C, i = 1, …, N}.
It is obvious that setting C ≥ 1 will result in no BSV. The kernel radius function R(x) is defined by the Euclidean distance of φ(x) from μ,

R(x) = \| \phi(x) - \mu \| = \sqrt{ 1 - 2 \sum_{i=1}^{N} \beta_i K(x_i, x) + \sum_{i,j=1}^{N} \beta_i \beta_j K(x_i, x_j) }.   (6)

The radius of the sphere is defined as

R = \max \{ R(x_i) \mid x_i \in \mathcal{SV} \}.   (7)

In theory, all the SVs should have the same R(x_i). However, due to numerical issues, they may be slightly different. A practical strategy is to use their maximum value as the radius. The contours enclosing most of the data points in the input data space are defined by the set

\{ x \in R^d \mid R(x) = R \},   (8)

which are used to construct cluster boundaries. Figure 1 illustrates SVs, BSVs and contours in the data space.

[Figure 1. Illustration of the SVC algorithm: SVs, BSVs, contours and the clustering results. Contours are plotted in solid curves. Different clusters are plotted in different colors.]

To achieve cluster labeling, an adjacency matrix A = [A_ij]_{N×N} of the entire dataset is computed as

A_{ij} = \begin{cases} 1 & \text{if } R(y) \le R \ \text{for all } y \text{ on the line segment connecting } x_i \text{ and } x_j, \\ 0 & \text{otherwise.} \end{cases}   (9)

Clusters are now defined as the connected components of the graph induced by A. Checking the line segment is implemented by sampling a number of data points (usually 10) on it. This is called the complete graph (CG) labeling method [1]. Another similar yet more efficient cluster labeling method is the support vector graph (SVG) labeling method, where the adjacency matrix Ã = [Ã_ij]_{|SV|×|SV|} is constructed only on SV [1].
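To make the quantities in (6)–(9) concrete, the following Python sketch evaluates the kernel radius function, the sphere radius, and the CG adjacency test by sampling points on each line segment. The function names, the use of NumPy, and the default of 10 samples per segment are our own choices for illustration; the original experiments were implemented in Matlab, so this is a sketch of the formulas rather than the authors' code.

```python
import numpy as np


def gaussian_kernel(X, Y, q):
    """K(x, y) = exp(-q * ||x - y||^2), evaluated for all pairs of rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)


def kernel_radius(points, X, beta, q):
    """Eq. (6): R(x) = sqrt(1 - 2 sum_i beta_i K(x_i, x) + sum_ij beta_i beta_j K(x_i, x_j))."""
    const = beta @ gaussian_kernel(X, X, q) @ beta        # term independent of x
    k_ix = gaussian_kernel(X, points, q)                  # shape (N, m)
    return np.sqrt(np.maximum(1.0 - 2.0 * beta @ k_ix + const, 0.0))


def sphere_radius(X, beta, q, C=1.0, tol=1e-6):
    """Eq. (7): R = max R(x_i) over the support vectors, i.e. points with 0 < beta_i < C."""
    sv = np.where((beta > tol) & (beta < C - tol))[0]
    return kernel_radius(X[sv], X, beta, q).max()


def cg_adjacency(X, beta, q, R, n_samples=10):
    """Eq. (9): A_ij = 1 iff R(y) <= R for sampled points y on the segment x_i -- x_j."""
    N = X.shape[0]
    A = np.eye(N, dtype=int)
    ts = np.linspace(0.0, 1.0, n_samples)
    for i in range(N):
        for j in range(i + 1, N):
            segment = np.outer(1.0 - ts, X[i]) + np.outer(ts, X[j])
            A[i, j] = A[j, i] = int(np.all(kernel_radius(segment, X, beta, q) <= R))
    return A
```

Restricting the same segment test to the SV set alone gives the cheaper SVG variant mentioned above.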

B. Motivation

Despite its theoretical soundness, one drawback of SVC is that it takes a significant amount of time in large-scale data clustering. Its two main stages, namely, sphere construction and cluster labeling, are of high computational complexity. In sphere construction, it takes approximately O(N^2) kernel evaluations to achieve convergence [1], with each kernel evaluation depending at least quadratically on the kernel size (i.e., N). Therefore, the time complexity of sphere construction in SVC is O(N^4). Also, the time complexity of CG cluster labeling is O(N^2 d) [1]. Table I lists the time usage (in seconds) of SVC on datasets of different sizes.¹ From the viewpoint of computational time complexity, it is impractical to apply SVC even to a dataset of size 2.0 × 10^3, and it is impossible to apply SVC to a dataset of size 1.0 × 10^4 because it would run out of memory. Although some methods have been proposed to speed up cluster labeling [5], [8], [9], there is still a lack of methods for efficiently constructing spheres. This paper focuses on addressing this issue by incrementally constructing spheres.

¹ Throughout the paper, all the experiments are implemented in Matlab R2009a 64 bit edition on a workstation running Windows 64 bit, with 8 Intel 2.00GHz processors and 8GB of RAM.

Table I. Time usage (in seconds) of SVC on datasets of various sizes.

Dataset size | SVC sphere construction | CG cluster labeling
100  | 7.45 × 10^-1 | 8.79 × 10^-2
500  | 5.50 × 10^1  | 8.97 × 10^-1
1000 | 2.00 × 10^3  | 8.42 × 10^0
1500 | 1.30 × 10^4  | 2.88 × 10^1
2000 | 3.91 × 10^4  | 7.23 × 10^1
2500 | 1.17 × 10^5  | 1.45 × 10^2
3000 | 1.57 × 10^5  | 2.54 × 10^2
3500 | 3.09 × 10^5  | 4.69 × 10^2

III. THE PROPOSED ISVC ALGORITHM

This section first describes the proposed incremental support vector clustering (ISVC) algorithm and theoretically proves that ISVC generates the same cluster structure as SVC. Then the computational complexity of ISVC is analyzed and compared with that of SVC.

A. Incremental Support Vector Clustering

In this paper, we propose an incremental support vector clustering (ISVC) algorithm, which constructs a sphere in an incremental manner. The N data points of a large-scale dataset are taken as arriving over time in chunks [17]. Without loss of generality, the chunks are of equal size, each containing M data points, and the dataset is divided into T = N/M chunks,

x_1^1, \dots, x_M^1, \ \dots, \ x_1^t, \dots, x_M^t, \ \dots, \ x_1^T, \dots, x_M^T,   (10)

where x_i^t is the i-th data point of the t-th chunk, which is denoted as X^t. We aim to sequentially learn the updated cluster structure by using only a very small subset or summary statistics of the data points from the previous chunks X^1 ∪ · · · ∪ X^t. The major challenge is to determine which subset or summary statistics of the historical data should be used to generate a cluster structure the same as or similar to what is learned non-incrementally. From the previous work on incremental support vector machines (ISVM) [14], [15], one suitable choice is to train an updated ISVM based on the new data chunk and the SVs of the previous step. The underlying rationale is that the SVs provide a sufficient description of the decision boundary [7], [21], as shown in Figure 2. Our theoretical analysis will give a sound proof of this intuitive idea. Throughout the paper, we will denote the SV set at the t-th step as SV^t.

In this paper, we adopt the common assumption that there is no outlier present in the data, which is often the case in real-world applications [22], [23]. Therefore, the parameter C has to be set as C ≥ 1, e.g., C = 1 as in [1]. Under this assumption, we will show that, by using only the SVs and the new data, the proposed ISVC algorithm is guaranteed to generate completely the same cluster structure as SVC. That is, by using only SV^{t−1} ∪ X^t, ISVC can generate the same SV set as that generated by SVC on X^1 ∪ · · · ∪ X^t.

Lemma 1. At the t-th step, ∀t = 2, . . . , T, the non-support vectors (non-SVs) of the previous t − 1 steps, i.e. any x ∈ X^1 ∪ · · · ∪ X^{t−1} \ SV^{t−1}, would not turn into SVs; but some SVs of the previous t − 1 steps, i.e. some x ∈ SV^{t−1}, may turn into non-SVs.

[Figure 2: (a) 100 data points in data space; (b) feature space; (c) mapped back to data space; a chunk of 100 data points is then added; (d) 200 data points in data space; (e) feature space; (f) mapped back to data space.]
Figure 2. Relation of the SVs (as well as contours) in two consecutive steps in the data space and the feature space. In (c), the SVs are designated by black squares. In (f), the old SVs (i.e. those that are the same as in the previous step) are plotted as black squares, and the new SVs are plotted as black circles.

Proof: According to the assumption of no outlier, the parameter C should be set as C ≥ 1, so there exists no BSV by definition, which implies that the only outer data points surrounding all the mapped data points are the SVs that lie on the sphere surface. As the new data chunk arrives, some of the new data points may lie inside or on the current surface, while the rest would form a larger surface than the current one. The data points that lie inside the current surface would not affect the generation of the new SVs. However, according to (1) and the definition of the smallest enclosing sphere, the new data points that lie outside the current surface may form a larger surface. In both cases, it is impossible for the non-SVs of the previous steps to move from inside the sphere to outside, which implies that the non-SVs cannot turn into SVs. On the other hand, if some of the new data points form a larger surface, some of the current SVs may be driven inside the new sphere due to the enlargement of the sphere, becoming non-SVs. Figure 2 intuitively illustrates this Lemma.

Theorem 1. At the t-th step, ∀t = 1, . . . , T, the SV^t computed on SV^{t−1} ∪ X^t are completely the same as those computed on X^1 ∪ · · · ∪ X^t in a non-incremental manner.

Proof: To prove this theorem, we compute the SVs on X^1 ∪ · · · ∪ X^t and show that they are equal to the SV^t computed on SV^{t−1} ∪ X^t. To this end, we use the principle of mathematical induction. When t = 1, it is obviously true. Assume that it is true for t = n, 1 < n < T. That is, the SVs computed on X^1 ∪ · · · ∪ X^n can be denoted as SV^n. At the next step, we compute the SVs on X^1 ∪ · · · ∪ X^n ∪ X^{n+1} and show that they are the same as the SV^{n+1} computed on SV^n ∪ X^{n+1}. According to the Wolfe dual form (5), for X^1 ∪ · · · ∪ X^n ∪ X^{n+1}, we have

\max_{\beta_i^\tau} W = \sum_{\tau=1}^{n+1} \sum_{i=1}^{M} K(x_i^\tau, x_i^\tau) \beta_i^\tau - \sum_{\tau,\sigma=1}^{n+1} \sum_{i,j=1}^{M} \beta_i^\tau \beta_j^\sigma K(x_i^\tau, x_j^\sigma)
\text{subject to } \sum_{\tau=1}^{n+1} \sum_{i=1}^{M} \beta_i^\tau = 1, \quad 0 \le \beta_i^\tau \le C, \ \forall \tau = 1, \dots, n+1, \ i = 1, \dots, M.   (11)

From Lemma 1, the non-SVs of X^1 ∪ · · · ∪ X^n will not turn into SVs at step n + 1. That is, the Lagrange multiplier β_i^τ of x_i^τ ∈ X^1 ∪ · · · ∪ X^n \ SV^n will remain unchanged such that β_i^τ = 0 at step n + 1. Therefore, we can replace X^1 ∪ · · · ∪ X^n with SV^n in (11), resulting in the same problem restricted to Z^{n+1} = SV^n ∪ X^{n+1} (here we use Z^{n+1} to denote the dataset used at step n + 1 in ISVC),

\max_{\beta_i} W = \sum_{x_i \in Z^{n+1}} K(x_i, x_i) \beta_i - \sum_{x_i, x_j \in Z^{n+1}} \beta_i \beta_j K(x_i, x_j)
\text{subject to } \sum_{i: x_i \in Z^{n+1}} \beta_i = 1, \quad 0 \le \beta_i \le C, \ \forall i: x_i \in Z^{n+1}.   (12)

The solution of this problem leads to SV^{n+1}. This implies that the claim is also true for t = n + 1. According to the principle of mathematical induction, it is true ∀t = 1, . . . , T. The proof ends.

Consequently, we have the following theorem.

Theorem 2. ISVC generates the same sphere, i.e. the same SVs and contours, as SVC.

Proof: From Theorem 1, the SV^t computed by ISVC at the t-th step, ∀t = 1, . . . , T, are the same as those computed on X^1 ∪ · · · ∪ X^t in batch mode. So the SV^T computed by ISVC at the last step are the same as the SV computed on X = X^1 ∪ · · · ∪ X^T. Additionally, the kernel radius functions R(x) are equal, leading to the same radius of the sphere according to (7). Thus they have the same contours by the definition given in (8).

For assigning cluster labels to data points, any aforementioned cluster labeling method can be used, such as the equilibrium vector-based cluster labeling method [9].
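As a concrete illustration, the restricted dual (12) that ISVC solves at each step can be handed to any generic constrained optimizer. The sketch below uses SciPy's SLSQP routine; the solver choice, the function names, and the tolerance are our assumptions for illustration, not the Matlab routine used in the paper.

```python
import numpy as np
from scipy.optimize import minimize


def solve_wolfe_dual(K, C=1.0):
    """Maximize W = sum_i beta_i K_ii - sum_ij beta_i beta_j K_ij
    subject to sum_i beta_i = 1 and 0 <= beta_i <= C, as in Eq. (12)."""
    n = K.shape[0]
    diag = np.diag(K)
    fun = lambda b: -(diag @ b - b @ K @ b)          # SLSQP minimizes, so negate W
    jac = lambda b: -(diag - 2.0 * K @ b)
    cons = [{"type": "eq", "fun": lambda b: b.sum() - 1.0, "jac": lambda b: np.ones(n)}]
    res = minimize(fun, np.full(n, 1.0 / n), jac=jac, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=cons)
    return res.x


def support_vector_indices(beta, C=1.0, tol=1e-6):
    """SV = {x_i | 0 < beta_i < C}; a small tolerance is needed in floating point."""
    return np.where((beta > tol) & (beta < C - tol))[0]
```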

[Figure 3: panels (a) Step 1, (b) Step 2, (c) Step 3, (d) Step 4.]
Figure 3. SVs and contours at the first four successive steps by ISVC. In the first step, i.e. (a), the SVs are designated by black squares. In (b), (c) and (d), the old SVs (i.e. those that are the same as in the previous step) are plotted as black squares, and the new SVs are plotted as black circles.

Algorithm 1 Incremental Support Vector Clustering
1: Input: X, q, C, M.
2: Initialize SV^0 = ∅, t = 1.
3: repeat
4:    Obtain the chunk X^t of M data points from X.
5:    Form the dataset Z^t = SV^{t-1} ∪ X^t.
6:    Compute a Gaussian kernel matrix K = [K(x_i, x_j)]_{|Z^t| × |Z^t|} with the parameter q.
7:    Solve (12) to obtain SV^t = {x_i ∈ Z^t | 0 < β_i < C}.
8: until no data chunk arrives
9: Make the assignment by some cluster labeling method, e.g., the equilibrium vector-based cluster labeling method.
10: Output: Clustering results.
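A minimal Python sketch of Algorithm 1 follows. It reuses the hypothetical gaussian_kernel and solve_wolfe_dual helpers from the earlier sketches; the function name isvc_sphere, the slicing-based chunk iterator, and the tolerance are our own choices rather than the authors' Matlab implementation.

```python
import numpy as np


def isvc_sphere(X, q, C=1.0, M=100, tol=1e-6):
    """Incrementally construct the sphere: at step t, solve the dual on
    Z^t = SV^{t-1} U X^t and keep only the resulting support vectors."""
    sv_points = np.empty((0, X.shape[1]))            # SV^0 is the empty set
    sv_beta = np.empty(0)
    for start in range(0, X.shape[0], M):
        chunk = X[start:start + M]                   # the t-th chunk X^t
        Z = np.vstack([sv_points, chunk])            # Z^t = SV^{t-1} U X^t
        K = gaussian_kernel(Z, Z, q)                 # kernel matrix on Z^t (step 6)
        beta = solve_wolfe_dual(K, C)                # solve Eq. (12)       (step 7)
        keep = np.where((beta > tol) & (beta < C - tol))[0]
        sv_points, sv_beta = Z[keep], beta[keep]     # SV^t and its multipliers
    return sv_points, sv_beta
```

The returned SVs and their multipliers define R(x), the radius R and hence the contours; cluster labels can then be assigned with any of the labeling methods of Section II, e.g. by building the adjacency matrix only on the returned SVs.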

[Figure 4.]
Figure 4. The two moons dataset consisting of 1.0 × 10^4 data points.

Figure 3 demonstrates the SVs and contours obtained by ISVC at the first four consecutive steps. Algorithm 1 summarizes the proposed ISVC method.

B. Computational Complexity

By constructing spheres incrementally, the proposed ISVC algorithm is quite efficient in time and memory consumption. For the SVC algorithm, it takes about O(N^2) kernel evaluations (of size N) to construct a sphere [1]. For the ISVC algorithm, since the chunk size is M and the number of SVs at each step is |SV|, it takes O((M + |SV|)^2) kernel evaluations (of size M + |SV|) at each step. There are N/M chunks, so it takes ISVC overall about O((M + |SV|)^2 × N/M) kernel evaluations to accomplish the incremental sphere construction. Additionally, the time complexity of a kernel evaluation strongly (at least quadratically) depends on the kernel size [1]. Therefore, ISVC reduces the time complexity of sphere construction from O(N^4) to O((M + |SV|)^4 × N/M). Since |SV| is independent of N, O((M + |SV|)^4 × N/M) ≪ O(N^4). Likewise, ISVC dramatically reduces memory consumption, since it only operates on datasets of size M + |SV| at each step, rather than loading the entire dataset of size N.
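To give a feel for the size of this reduction, a back-of-the-envelope calculation with N = 10^4 and M = 100, and with a hypothetical support vector count |SV| = 50 (the actual value depends on q and the data), gives

\frac{O(N^4)}{O\big((M + |\mathcal{SV}|)^4 \cdot N/M\big)} = \frac{(10^4)^4}{(100 + 50)^4 \cdot 10^2} \approx \frac{10^{16}}{5.06 \times 10^{10}} \approx 2 \times 10^{5},

i.e. roughly five orders of magnitude fewer kernel operations under these assumed values.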

IV. EXPERIMENTAL RESULTS

In this section, we first demonstrate that the proposed ISVC algorithm can generate the same sphere as SVC. That is, the cluster structure generated incrementally is completely the same as that generated in a non-incremental manner. Then we compare ISVC and SVC in terms of the time consumption used to construct spheres. Experimental results demonstrate that, without loss of clustering accuracy, the proposed ISVC algorithm dramatically reduces the time complexity of SVC.

To demonstrate the efficiency of ISVC in large-scale data clustering, the widely tested two moons dataset is used as the testing set, which consists of 1.0 × 10^4 data points. The classic two moons dataset is generated as two half-circles, each consisting of 5000 data points in R^2. Figure 4 shows the two moons dataset. For both ISVC and SVC, the parameters C and q are set to the most appropriate values, C = 1 and q = 12.5 respectively, by trial and error. We test the ISVC algorithm with several chunk sizes, e.g., M ∈ {50, 100, 150, 200}.
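For readers who wish to reproduce a comparable testing set, the two half-circles described above can be generated as in the Python sketch below; the radii, vertical offset, and noise level are our assumptions, since they are not reported in the paper.

```python
import numpy as np


def two_moons(n_per_moon=5000, noise=0.05, seed=0):
    """Two interleaving half-circles ('moons') in R^2, n_per_moon points each."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=n_per_moon)
    upper = np.column_stack([np.cos(t), np.sin(t)])              # upper half-circle
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])  # shifted lower half-circle
    X = np.vstack([upper, lower])
    return X + rng.normal(scale=noise, size=X.shape)
```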

A. Comparing Cluster Structures

This subsection compares the spheres generated by ISVC and SVC to show that they generate completely the same sphere. Although the dataset is of size 1.0 × 10^4, due to the very large computational time required by SVC, we only perform the sphere construction on a subset consisting of 3.5 × 10^3 data points, as shown in Figure 5(a). Figure 5(b) plots the sphere generated by SVC, and Figures 5(c) to 5(f) plot the spheres generated by ISVC when M is set to 50, 100, 150 and 200 respectively. By comparison, we can see that the proposed ISVC algorithm with different M can generate the same sphere as SVC.

[Figure 5: (a) 3.5 × 10^3 data points; (b) the sphere by SVC; (c) the sphere by ISVC with M = 50; (d) the sphere by ISVC with M = 100; (e) the sphere by ISVC with M = 150; (f) the sphere by ISVC with M = 200.]
Figure 5. Comparing the spheres (i.e. SVs and contours) generated by SVC and ISVC.

B. Comparing Time Usage

In this subsection, we compare the time usage of SVC and ISVC in sphere construction. First we test the computational time of incremental sphere construction when using different chunk sizes M ∈ {10, 50, 100, 150, 200}. Figure 6 plots the computational time in seconds used to incrementally construct spheres with different M. From the figure, in all cases, the computational time grows almost linearly w.r.t. the dataset size. The results coincide very well with the time complexity O((M + |SV|)^4 × N/M) analyzed before.

[Figure 6: computational time (in seconds) of incremental sphere construction vs. dataset size (600 to 10000), with one curve for each of M = 200, 150, 100, 50 and 10.]
Figure 6. The computational time in incremental sphere construction using different M.

By comparing the computational time when using different M, we can also see that, when M is not too small, e.g., M ≥ 50, the computational time monotonically decreases as the chunk size M decreases. The main reason is that the computational complexity O((M + |SV|)^4 × N/M) of incremental sphere construction is positively correlated with M. However, when M is set to a very small value, e.g., M = 10, it takes more time than setting M = 50, because although the time for each chunk decreases, the total time used over all consecutive chunks increases since the data are divided into too many chunks. As aforementioned, the same sphere can be generated when different chunk sizes are used. Therefore, in real-world applications, it is better to set M to a moderate value, say, M = 50 or M = 100.

Table II lists the computational time (in seconds) used by incremental and non-incremental sphere construction on datasets of various sizes. By comparison, we can see that the proposed ISVC is several orders of magnitude faster than its counterpart SVC in constructing spheres. Additionally, in our experiment, it runs out of memory when constructing a sphere non-incrementally on a dataset of size 1.0 × 10^4; meanwhile, as shown in Figure 6, the proposed ISVC algorithm can efficiently accomplish sphere construction while consuming less than 100 megabytes of memory.

As an overall clustering method, we use the equilibrium vector-based cluster labeling method to assign cluster labels [9], which is one of the most recently developed cluster labeling methods. Therefore, the proposed incremental sphere construction plus the equilibrium vector-based cluster labeling method is termed E-ISVC, and the non-incremental sphere construction plus the equilibrium vector-based cluster labeling method is termed E-SVC as in [9]. Figure 7 plots the logarithm of the computational time used by E-ISVC (setting M = 100) and E-SVC respectively. Please note that the logarithm is used, which implies that a small difference between the plotted values is actually quite large. E-ISVC and E-SVC generate the same clustering results, since they construct the same sphere as discussed before and the same cluster labeling method is used. However, from the figure, we can see that the proposed E-ISVC algorithm is several orders of magnitude faster than E-SVC. The comparative result demonstrates the significant improvement obtained by ISVC in reducing the computational complexity.
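Timing curves of the kind shown in Figure 6 can be approximated with the sketch below, which reuses the hypothetical isvc_sphere routine from the Algorithm 1 sketch; being a re-implementation in a different environment, the measured numbers will of course differ from the Matlab figures reported here.

```python
import time


def time_incremental_construction(X, q, chunk_sizes=(10, 50, 100, 150, 200),
                                  sizes=range(600, 10001, 1200)):
    """Wall-clock time of incremental sphere construction for several chunk
    sizes M on growing prefixes of the dataset, as in Figure 6."""
    results = {}
    for M in chunk_sizes:
        for n in sizes:
            start = time.perf_counter()
            isvc_sphere(X[:n], q=q, C=1.0, M=M)
            results[(M, n)] = time.perf_counter() - start
    return results
```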

Table II. Computational time (in seconds) used by incremental and non-incremental sphere construction on datasets of various sizes.

Dataset size | ISVC, M = 50 | ISVC, M = 100 | ISVC, M = 150 | ISVC, M = 200 | SVC (non-incremental)
100  | 0.75  | 0.75  | 0.75  | 0.75  | 7.45 × 10^-1
200  | 1.05  | 1.22  | 1.58  | 1.74  | 1.25 × 10^0
300  | 1.42  | 1.87  | 2.43  | 2.99  | 3.59 × 10^0
400  | 1.80  | 2.54  | 2.87  | 4.49  | 8.05 × 10^0
500  | 2.25  | 3.26  | 4.12  | 5.73  | 5.50 × 10^1
600  | 2.71  | 3.94  | 5.61  | 7.59  | 1.56 × 10^2
700  | 3.26  | 4.72  | 6.43  | 8.91  | 3.36 × 10^2
800  | 3.80  | 5.50  | 8.01  | 10.56 | 6.96 × 10^2
900  | 4.30  | 6.35  | 9.33  | 11.79 | 1.28 × 10^3
1000 | 4.85  | 7.27  | 10.37 | 13.70 | 2.00 × 10^3
1100 | 5.42  | 8.08  | 11.42 | 14.79 | 3.27 × 10^3
1200 | 5.96  | 8.97  | 12.89 | 16.92 | 4.05 × 10^3
1300 | 6.46  | 9.79  | 14.03 | 17.99 | 6.21 × 10^3
1400 | 6.95  | 10.52 | 15.24 | 20.07 | 1.01 × 10^4
1500 | 7.44  | 11.32 | 16.43 | 21.46 | 1.30 × 10^4
1600 | 7.95  | 12.11 | 17.13 | 23.29 | 1.69 × 10^4
1700 | 8.50  | 12.92 | 18.72 | 24.31 | 2.03 × 10^4
1800 | 9.12  | 13.84 | 19.96 | 26.57 | 2.47 × 10^4
1900 | 9.70  | 14.77 | 20.47 | 28.00 | 3.00 × 10^4
2000 | 10.35 | 15.73 | 21.65 | 30.19 | 3.91 × 10^4
2100 | 10.99 | 16.67 | 23.81 | 31.34 | 6.09 × 10^4
2200 | 11.61 | 17.61 | 24.33 | 33.80 | 6.92 × 10^4
2300 | 12.19 | 18.56 | 26.19 | 35.04 | 7.11 × 10^4
2400 | 12.80 | 19.49 | 27.52 | 37.35 | 7.90 × 10^4
2500 | 13.41 | 20.42 | 28.43 | 39.03 | 1.17 × 10^5
2600 | 14.02 | 21.41 | 30.16 | 41.10 | 1.28 × 10^5
2700 | 14.61 | 22.35 | 31.49 | 42.87 | 1.38 × 10^5
2800 | 15.27 | 23.31 | 32.72 | 44.72 | 1.49 × 10^5
2900 | 15.86 | 24.23 | 34.51 | 45.92 | 1.50 × 10^5
3000 | 16.50 | 25.15 | 35.56 | 48.20 | 1.57 × 10^5
3100 | 17.18 | 26.16 | 36.84 | 50.07 | 1.98 × 10^5
3200 | 17.83 | 27.19 | 37.49 | 51.56 | 2.04 × 10^5
3300 | 18.47 | 28.20 | 39.43 | 53.41 | 2.39 × 10^5
3400 | 19.09 | 29.18 | 40.77 | 55.08 | 2.77 × 10^5
3500 | 19.92 | 30.14 | 42.15 | 57.32 | 3.09 × 10^5

[Figure 7: logarithm of computational time (in seconds) vs. dataset size (500 to 3500), with one curve for E-SVC and one for E-ISVC.]
Figure 7. E-ISVC (setting M = 100) vs. E-SVC. Notice that the logarithm of the computational time in seconds is plotted.

V. CONCLUSION AND FUTURE WORK

This paper has presented a novel incremental support vector clustering (ISVC) algorithm for addressing the computational bottleneck suffered by support vector clustering (SVC) in sphere construction. In the proposed approach, the sphere is constructed incrementally. By regarding the data as arriving over time in chunks, only the support vectors of the historical data and the data points of the new data chunk are used to learn an updated sphere. We have theoretically shown that the proposed ISVC approach can generate the same cluster structure as SVC. Computational complexity analysis has also revealed that our approach is several orders of magnitude faster and requires much lower memory consumption in sphere construction than the conventional SVC method. Experimental results have validated the theoretical analysis.

In our future work, we plan to try ISVC in more complex settings. For example, we plan to develop an extension to deal with outliers, which requires more theoretical work to achieve equivalent clustering performance. We also plan to extend ISVC to semi-supervised clustering [24]–[26], where the must/cannot-link constraints or other prior knowledge from users can be incrementally incorporated to produce clusterings according to specific needs.

ACKNOWLEDGMENT

This project was supported by the NSFC-GuangDong (U0835005) and the NSFC (61173084).

REFERENCES

[1] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support vector clustering," Journal of Machine Learning Research, vol. 2, pp. 125–137, 2001.
[2] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "A conscience on-line learning approach for kernel-based clustering," in Proc. of the 10th Int. Conf. on Data Mining, 2010, pp. 531–540.
[3] C.-D. Wang, J.-H. Lai, and J.-Y. Zhu, "Conscience on-line learning (COLL): An efficient approach for robust kernel-based clustering," Knowledge and Information Systems, in press, 2011.
[4] J. Lin, M. Vlachos, E. J. Keogh, and D. Gunopulos, "Iterative incremental clustering of time series," in Proc. of EDBT, 2004, pp. 106–122.

[5] J. Yang, V. Estivill-Castro, and S. K. Chalup, "Support vector clustering through proximity graph modelling," in Proc. of the 9th Int. Conf. on Neural Inf. Processing, 2002, pp. 898–903.
[6] C.-D. Wang and J.-H. Lai, "Energy based competitive learning," Neurocomputing, vol. 74, pp. 2265–2275, 2011.
[7] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[8] J. Lee and D. Lee, "An improved cluster labeling method for support vector clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 461–464, March 2005.
[9] J. Lee and D. Lee, "Dynamic characterization of cluster structures for robust and inductive support vector clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1869–1874, Nov. 2006.
[10] K.-H. Jung, D. Lee, and J. Lee, "Fast support-based clustering method for large-scale problems," Pattern Recognition, vol. 43, pp. 1975–1983, 2010.
[11] S. Asharaf, S. Shevade, and M. N. Murty, "Rough support vector clustering," Pattern Recognition, vol. 38, pp. 1779–1783, 2005.
[12] D. Yankov, E. Keogh, and K. F. Kan, "Locally constrained support vector clustering," in Proc. of the 7th Int. Conf. on Data Mining, 2007, pp. 715–720.
[13] D. Lee and J. Lee, "Dynamic dissimilarity measure for support-based clustering," IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 900–905, June 2010.
[14] N. A. Syed, H. Liu, and K. K. Sung, "Handling concept drifts in incremental learning with support vector machines," in Proc. of the 5th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 1999, pp. 317–321.
[15] S. Rüping, "Incremental learning with support vector machines," in Proc. of the 1st Int. Conf. on Data Mining, 2001, pp. 641–642.
[16] C. Domeniconi and D. Gunopulos, "Incremental support vector machine construction," in Proc. of the 1st Int. Conf. on Data Mining, 2001, pp. 589–592.
[17] R. Klinkenberg and T. Joachims, "Detecting concept drift with support vector machines," in Proc. of the 17th Int. Conf. on Machine Learning, 2000, pp. 487–494.
[18] Y.-M. Wen and B.-L. Lu, "Incremental learning of support vector machines by classifier combining," in Proc. of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007, pp. 904–911.
[19] T. Zhou, D. Tao, and X. Wu, "NESVM: A fast gradient method for support vector machines," in Proc. of the 10th Int. Conf. on Data Mining, 2010, pp. 679–688.
[20] B. Liu, Y. Shi, Z. Wang, W. Wang, and B. Shi, "Dynamic incremental data summarization for hierarchical clustering," in Proc. of WAIM, 2006, pp. 410–421.
[21] D. M. Tax and R. P. Duin, "Support vector domain description," Pattern Recognition Letters, vol. 20, pp. 1191–1199, 1999.
[22] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne, "On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms," Data Mining and Knowledge Discovery, vol. 8, pp. 275–300, 2004.
[23] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, pp. 264–323, 1999.
[24] K. Wagstaff and C. Cardie, "Clustering with instance-level constraints," in Proc. of the 17th Int. Conf. on Machine Learning, 2000, pp. 1103–1110.
[25] X. He, "Incremental semi-supervised subspace learning for image retrieval," in ACM Multimedia, 2004, pp. 2–8.
[26] I. Davidson, S. S. Ravi, and M. Ester, "Efficient incremental constrained clustering," in Proc. of the 13th ACM SIGKDD Int. Conf. on Knowl. Discovery and Data Mining, 2007, pp. 240–249.

