2012 IEEE 12th International Conference on Data Mining

Nested Subtree Hash Kernels for Large-scale Graph Classification over Streams

Bin Li, Xingquan Zhu, Lianhua Chi, Chengqi Zhang
Centre for Quantum Computation & Intelligent Systems (QCIS)
Faculty of Eng. & Info. Tech., University of Technology, Sydney, NSW 2007, Australia
{bin.li-1, xingquan.zhu, lianhua.chi, chengqi.zhang}@uts.edu.au

Abstract—Most studies on graph classification focus on designing fast and effective kernels. Several fast subtree kernels have achieved linear time complexity w.r.t. the number of edges, under the condition that a common feature space (e.g., a subtree pattern list) is maintained to represent all graphs. This becomes infeasible when graphs are presented in a stream with rapidly emerging subtree patterns: computing a kernel matrix for graphs over the entire stream is difficult, since the graphs in expired chunks cannot be projected onto the unlimitedly expanding feature space again. This creates a serious obstacle for graph classification over streams: different portions of the stream lie in different feature spaces. In this paper, we aim to enable large-scale graph classification over streams using the classical ensemble learning framework, which requires the data in different chunks to lie in the same feature space. To this end, we propose a Nested Subtree Hashing (NSH) algorithm to recursively project the multi-resolution subtree patterns of different chunks onto a set of common low-dimensional feature spaces. We theoretically analyze the derived NSH kernel and obtain a number of favorable properties: 1) the NSH kernel is an unbiased and highly concentrated estimator of the fast subtree kernel; 2) the bound on the convergence rate tends to be tighter as the NSH algorithm steps into a higher resolution; 3) the NSH kernel is robust in tolerating concept drift between chunks over a stream. We also empirically test the NSH kernel on both a large-scale synthetic graph data set and a real-world chemical compound data set for anticancer activity prediction. The experimental results validate that the NSH kernel is indeed efficient and robust for graph classification over streams.

Keywords-Graph classification; data stream mining; graph hash kernels; nested subtree hashing

I. INTRODUCTION

Graphs are a natural representation of structured data in a broad range of real-world applications, such as chemical compounds, XML documents, program flows, and social networks. Graph classification thus becomes an important research issue for better understanding the graph data generated from these applications. Due to the arbitrary structure of graphs, computing graph similarities is not easy, since graphs do not lie in the same intrinsic space. Most approaches have to enumerate a set of substructures (e.g., subgraphs, paths/walks, and subtrees) to construct a common feature space. For this reason, existing graph classification studies mainly focus on designing fast and effective graph kernels [1], [2], [3].

Although a family of fast subtree kernels [4], [5] has achieved linear time complexity w.r.t. the number of edges, they need to maintain a global subtree pattern list as a common feature space for all the graphs during the entire learning process. This approach may be feasible for thousands of graphs, from which all the subtree patterns can be extracted through a pre-scan to span a complete feature space. However, when massive graphs are presented in a stream with rapidly emerging subtree patterns (see Figure 1), two problems emerge immediately. First, at least two rounds of data scanning are required to obtain the kernel matrix for a chunk of graphs: one round for extracting subtree patterns to construct a common feature space and another for computing kernels in the constructed feature space. Second, as graphs are fed in, the dimensionality of the common feature space increases rapidly, and one can hardly compute kernels between graphs in different chunks over the stream, since historical graphs cannot be fully projected onto the unlimitedly expanding feature space again. In many real-world scenarios, data streams are continuously generated and can be read only once, so constructing a common feature space through a pre-scan and expanding the common feature space over the stream are both infeasible.

A common approach to data stream mining is to divide a data stream into chunks, learn an individual classifier for each chunk, and construct a classifier ensemble on the most recent chunks. In contrast to the traditional data stream mining problem, where the underlying instances in the stream are represented in the same feature space, in graph streams the classifiers of different chunks have different feature spaces, so we can hardly construct a classifier ensemble on them. Thus, to enable large-scale graph classification over streams using the classical ensemble learning framework [6], we need to find a way to project the emerging subtree patterns of arriving graph chunks onto a common feature space compatible with historical graph chunks, with good estimation quality as well as computational and spatial efficiency.

In this paper, we propose a Nested Subtree Hashing (NSH) algorithm to recursively project multi-resolution subtree patterns of different graph chunks onto a set of common low-dimensional feature spaces, which facilitates learning a classifier ensemble over graph chunks with different subtree pattern sets. We theoretically analyze the derived NSH kernel and obtain the following favorable properties: 1) The NSH kernel is an unbiased and highly concentrated estimator of the Weisfeiler-Lehman graph kernel [5]


is represented as a graph: its atoms form the vertex set and the bonds between atoms form the edge set.

In our problem setting, we assume that graphs arrive chunk by chunk. For the graphs in S_t, a feature extraction step must be performed beforehand to transform the graph structures into a set of feature vectors {x_n^(t)}_{n=1}^N ⊂ X^(t), where each dimension of X^(t) corresponds to a substructure of the graphs (e.g., the subtrees used in this paper). The class labels of the graphs in S_t are denoted by {y_n^(t)}_{n=1}^N ∈ {±1}. The goal of graph classification over streams is to learn a classifier ensemble from the latest K chunks S_{T−K+1}, …, S_T to classify the graphs in S_{T+1} with maximum accuracy.

Due to the different substructure distributions of the graphs, the feature spaces of different graph chunks are usually not identical but overlapping, i.e., X^(t) ≠ X^(t−k) and X^(t) ∩ X^(t−k) ≠ ∅. This poses a major difficulty for learning a classifier ensemble over streams: how can we project the unlimitedly increasing substructures from different chunks onto a common low-dimensional feature space? The heterogeneity of the feature spaces also leads to different marginal distributions, i.e., P(x^(t)) ≠ P(x^(t−k)), which further introduces the concept-drift problem. We aim to address these problems in this paper.

[Figure 1: log-log plot of the number of subtree patterns (y-axis, 10^0 to 10^6) versus the number of graphs (x-axis, 10^0 to 10^4), with one curve per subtree height r = 1, ..., 6.]

Figure 1. Rapid increase in the number of subtree patterns of different heights r as a large number of graphs are sequentially fed in (the graphs are randomly sampled from the real-world NCI data set used in Section IV).

(also known as the fast subtree kernel [4]); 2) the bound on the convergence rate tends to be tighter as the NSH algorithm steps into a higher resolution; 3) compared to the frequent-pattern-based graph kernel, the NSH kernel is more robust in tolerating concept drift between graph chunks over a stream. Besides the above theoretical contributions, we also empirically test the NSH kernel on both a large-scale synthetic graph data set and a real-world chemical compound data set for anticancer activity prediction. The experimental results show that the NSH kernel not only has much lower computational and spatial complexities, but also outperforms the frequent-pattern-based graph kernel due to its strong capability for concept-drift tolerance. To the best of our knowledge, this is the first endeavor to study large-scale graph classification over streams.

The rest of the paper is organized as follows: We first describe our problem and introduce the preliminaries in Section II. In Section III, the NSH algorithm and the NSH kernel are proposed, followed by the theoretical analysis and the application of the NSH kernel to graph classification over streams. We empirically test the NSH kernel in Section IV. Related work is reviewed in Section V and the paper is concluded in Section VI.

B. Weisfeiler-Lehman Graph Kernels

The Weisfeiler-Lehman (WL) graph kernel [5] (or fast subtree kernel [4]) belongs to one of the fastest graph kernel families, which scales linearly in the number of edges of the graphs, |E|, and the number of iterations, R, of the Weisfeiler-Lehman (WL) isomorphism test [7]. WL graph kernels have the form

\kappa(g, g') = \sum_{r=1}^{R} \kappa(g_{(r)}, g'_{(r)}) = \sum_{r=1}^{R} \langle x_{(r)}, x'_{(r)} \rangle = \langle x, x' \rangle    (1)

where g_(1) = (V, E, ℓ_(1)) = (V, E, ℓ) = g (i.e., g_(1) is the same as the input graph g), g_(r) = (V, E, ℓ_(r)), and x = [x_(1)^⊤ … x_(R)^⊤]^⊤. At the first WL isomorphism test iteration, each node v in g_(1) is assigned an ID (a unique integer) according to its node pattern (nodes with the same pattern receive the same ID); at the rth iteration, for r > 1, each node v in g_(r) is assigned a new ID according to the subtree pattern rooted at v. Each unique subtree is encoded as a string that starts with v's ID, followed by the ordered IDs of v's one-step leaf nodes. At each iteration, a histogram x_(r) counts the numbers of subtree patterns in g_(r), where each dimension of x_(r) corresponds to a particular subtree and is indexed by the subtree ID. The intuition of WL graph kernels is to compare the "bags of subtrees" at multiple resolutions. Here, we can view deeper iterations of the WL isomorphism test as higher resolutions for describing graphs, since r steps of structural information (a subtree of height r) are encoded. The proposed NSH kernel is built on the WL isomorphism test.
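For intuition, the following Python sketch illustrates the WL relabeling iterations on one graph and the kernel of Eq. (1); it is a simplified illustration only, assuming graphs are given as a node-label dict plus an adjacency list, with a shared dict playing the role of the global subtree pattern list.

```python
from collections import Counter

def wl_histograms(labels, adj, R, pattern_index):
    """labels: {node: label}, adj: {node: [neighbor nodes]}.
    pattern_index: shared dict mapping subtree strings to integer IDs
    (this plays the role of the global subtree pattern list).
    Returns [x_(1), ..., x_(R)], each a Counter over subtree-pattern IDs."""
    # Iteration 1: IDs come from the node patterns themselves.
    ids = {v: pattern_index.setdefault(str(labels[v]), len(pattern_index)) for v in labels}
    histograms = [Counter(ids.values())]
    for _ in range(2, R + 1):
        new_ids = {}
        for v in labels:
            # Encode the subtree rooted at v: own ID followed by the sorted neighbor IDs.
            s = str(ids[v]) + "|" + ",".join(map(str, sorted(ids[u] for u in adj[v])))
            new_ids[v] = pattern_index.setdefault(s, len(pattern_index))
        ids = new_ids
        histograms.append(Counter(ids.values()))
    return histograms

def wl_kernel(h_g, h_gp):
    """Eq. (1): sum over iterations of the inner products of the two histogram lists."""
    return sum(sum(c * h_gp[r][k] for k, c in h_g[r].items()) for r in range(len(h_g)))
```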

II. PRELIMINARIES

A. Problem Description

A stream of graphs is partitioned into sequential chunks, S_1, S_2, …, S_T, where S_T denotes the most recent chunk. Each chunk comprises N graphs {g_n^(t)}_{n=1}^N. A graph g_n^(t) (t = 1, …, T; n = 1, …, N) is represented as a triplet (V, E, ℓ), where V denotes the vertex set, E denotes the undirected edge set, and ℓ : V → L is a function that assigns patterns from a pattern set L to the nodes¹ in g_n^(t). Take chemical compound classification for example: a molecule

¹ Edge patterns can also be considered in a similar way. For simplicity, we only consider node patterns in this paper.


Although the time complexity of computing a WL graph kernel matrix is only O(NR|E| + N²R|V|), the WL isomorphism test needs to maintain a global subtree pattern list to span the feature space. For all graph chunks to share a common feature space, this global subtree pattern list must be cached in memory. As numerous chunks of graphs are fed in, the subtree pattern list expands dramatically, and the search, insertion, and storage of subtree patterns quickly become infeasible in both time and space.

Algorithm 1 Nested Subtree Hashing
Input: g = (V, E, ℓ), R, {M_(r)}_{r=1}^R
Output: {x̄_(r)}_{r=1}^R
 1: for r = 1 : R do
 2:   x̄_(r) ← zeros(M_(r), 1)
 3:   for v ∈ V do
 4:     if r = 1 then
 5:       str_(r)(v) ← ℓ(v)
 6:     else
 7:       temp ← sort({hid_(r−1)(σ_j(v))}_{j=1}^{J_v}, ascending)
 8:       str_(r)(v) ← strcat(hid_(r−1)(v), temp)
 9:     end if
10:     hid_(r)(v) ← h(str_(r)(v), {1, …, M_(r)})
11:     sgn_(r)(v) ← δ(str_(r)(v), {±1})
12:     i ← hid_(r)(v)
13:     b ← sgn_(r)(v)
14:     x̄_(r),i ← x̄_(r),i + b
15:   end for
16: end for
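To make Algorithm 1 concrete, here is a minimal Python sketch of the per-graph procedure; the data structures and the two salted string hashes (standing in for the hash-ID function h and the sign function δ) are our own assumptions, not the authors' implementation.

```python
import hashlib

def _hash(s, salt, mod):
    """Deterministic hash of string s into {0, ..., mod-1} (stand-in for h / delta)."""
    digest = hashlib.md5((salt + s).encode("utf-8")).hexdigest()
    return int(digest, 16) % mod

def nested_subtree_hashing(labels, adj, R, M):
    """labels: {node: label}, adj: {node: [neighbor nodes]}, M: list of M_(r) for r = 1..R.
    Returns the list of hashed feature vectors [x_bar_(1), ..., x_bar_(R)]."""
    x_bars, hid = [], {}
    for r in range(1, R + 1):
        x_bar = [0] * M[r - 1]                            # line 2: zero vector of length M_(r)
        strings = {}
        for v in labels:                                  # lines 3-9: build subtree strings
            if r == 1:
                strings[v] = str(labels[v])               # line 5: initial string = node pattern
            else:
                temp = sorted(hid[u] for u in adj[v])     # line 7: sorted neighbor hash-IDs
                strings[v] = str(hid[v]) + "|" + ",".join(map(str, temp))  # line 8
        new_hid = {}
        for v, s in strings.items():
            i = _hash(s, "id", M[r - 1])                  # line 10: hash-ID (0-based here)
            b = 1 if _hash(s, "sign", 2) == 1 else -1     # line 11: bias-sign
            x_bar[i] += b                                 # line 14: update feature vector
            new_hid[v] = i
        hid = new_hid
        x_bars.append(x_bar)
    return x_bars
```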

C. Classifier Ensemble over Data Streams

A classical accuracy-weighted ensemble learning framework for streaming data classification was introduced in [6]. It selects the top K classifiers, i.e., those with the best accuracy on S_T, out of the T classifiers learned from all historical chunks to construct the ensemble. To apply this framework to our problem setting, we slightly adapt it and only use the K classifiers learned from the most recent graph chunks to construct a weighted classifier ensemble

f_E(x) = \sum_{t=T-K+1}^{T} w_t f_t(x)    (2)

where f_t : X → {±1} is the single classifier learned from S_t,

w_t = \sum_{y \in \{\pm 1\}} P_T(y) (1 - P_T(y))^2 - \frac{1}{|S_T|} \sum_{n=1}^{N} \Big( \tfrac{1}{2} \big( 1 - y_n^{(T)} f_t(x_n^{(T)}) \big) \Big)^2

is the weight of classifier f_t, and P_T(y) denotes the distribution of class label y in S_T. If f_t performs better on S_T, w_t tends to be larger. From Eq. (2), we notice that, to predict a test example x using the classifiers in the ensemble, all the classifiers {f_t} as well as the test examples must lie in the same feature space. This requirement is naturally satisfied in most cases. However, due to the different substructure sets of {S_t}, the classifiers learned on {S_t} will be in different feature spaces if no other treatment is provided.
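As an illustration of the weighting in Eq. (2), here is a short Python sketch (our own paraphrase of the accuracy-weighted scheme of [6], under assumed interfaces) that computes w_t for one classifier f_t on the newest chunk S_T.

```python
from collections import Counter

def ensemble_weight(f_t, chunk_X, chunk_y):
    """w_t = sum_y P_T(y)(1 - P_T(y))^2  -  (1/|S_T|) * sum_n (0.5 * (1 - y_n * f_t(x_n)))^2.
    f_t returns a prediction in {+1, -1}; chunk_y contains labels in {+1, -1}."""
    n = len(chunk_y)
    label_freq = Counter(chunk_y)
    mse_random = sum((cnt / n) * (1 - cnt / n) ** 2 for cnt in label_freq.values())
    mse_model = sum((0.5 * (1 - y * f_t(x))) ** 2 for x, y in zip(chunk_X, chunk_y)) / n
    return mse_random - mse_model  # larger when f_t does better on the newest chunk
```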

III. NESTED SUBTREE HASHING

As mentioned above, to classify large-scale graphs over streams, the major challenge is to project unlimited graph substructures from numerous chunks onto a common low-dimensional feature space. Random projection (RP) [8], [9] is a popular technique for sketching data streams. In RP, however, each dimension of the original feature space is explicitly associated with a projection vector, which is infeasible in graph stream scenarios where the feature spaces keep expanding. The subtree encoding method used in the WL isomorphism test (see Section II-B), which identifies a subtree pattern with a unique string, inspires us to employ hashing techniques to map unlimited subtree patterns into a small set of buckets (i.e., a common low-dimensional feature space). In this way, we no longer need to know how many dimensions the original feature spaces contain.

D. Baseline Approaches

Before introducing the proposed algorithm, we first list two simple approaches as baselines for the considered problem:
• Single Classifiers in Individual Feature Spaces. This is the simplest approach: one learns a single classifier f_t from each chunk S_t, where the feature space X^(t) is spanned by all the subtrees observed in S_t. To predict a graph g^(T+1) ∈ S_{T+1}, we project g^(T+1) onto X^(T) and predict its class label using f_T. The disadvantage is that each chunk contains only a small portion of the data, so the obtained classifier has limited generalization capability and can hardly tolerate concept drift.
• Classifier Ensemble in a Predefined Feature Space. Feature selection is widely used in graph mining. To apply it to graph classification over streams, we can predefine a common feature space X^∀ for all the chunks {S_t} using some substructure mining techniques. Then, the graphs in all observed and future chunks can be projected onto X^∀, followed by the classifier ensemble construction introduced above. The disadvantage is that the predefined feature space is based on only a small portion of the graphs; as numerous graphs are fed in, new graphs may have feature spaces very different from the predefined one.

A. The Algorithm

Algorithm 1 lists the pseudocode of the proposed Nested Subtree Hashing (NSH) algorithm. The input includes a graph g = (V, E, ℓ), the number of iterations R, and the dimensionalities of the common feature spaces {M_(r)}_{r=1}^R (i.e., to hash all the observed subtree patterns at the rth


iteration into M_(r) buckets). The output is a set of feature vectors, {x̄_(r)}_{r=1}^R, one for each iteration.

At the beginning of each iteration, an M_(r)-dimensional feature vector x̄_(r) is initialized with zeros (line 2 of Algorithm 1). We then adopt the idea of the WL isomorphism test [7] to encode subtree patterns as unique strings (lines 4–9). For r = 1, each node is assigned its node pattern as the initial string (line 5); for r > 1, the subtree pattern of each node v is updated based on the hash-IDs {hid_(r−1)(v)} as follows: first, the hash-IDs of node v's neighboring nodes {σ_j(v)}_{j=1}^{J_v} are sorted in ascending order into an array temp (line 7); then v's own hash-ID is concatenated with the ordered array temp into a string (line 8), which is the new subtree pattern of node v used in the subsequent hashing step. At the rth iteration, the string str_(r)(v) of each node v (i.e., the subtree pattern rooted at v) is hashed to two values, a hash-ID hid_(r)(v) ∈ {1, …, M_(r)} (line 10) and a bias-sign sgn_(r)(v) ∈ {±1} (line 11), using two random hash functions h : str → ℕ and δ : str → {±1}, respectively. The hash-ID hid_(r)(v) allocates a dimension for the subtree pattern in the common feature space, and the bias-sign sgn_(r)(v) updates the value² of that dimension. Based on the hash-ID and the bias-sign, the feature vector x̄_(r) is updated (line 14) and can later be used for computing kernels.

It is worth noting that the proposed NSH algorithm projects all graphs onto a set of common feature spaces without knowing in advance how many subtree patterns exist in the original feature spaces. This advantage plays a pivotal role in enabling large-scale graph mining over streams. In contrast, WL graph kernels [4], [5] require a global subtree pattern list to be maintained in memory during the entire feature extraction process, so they can only be computed in batch mode.

² Signs are used for eliminating the bias of the estimator (see Section III-C). If signs are not used, the feature vector x̄_(r) can be interpreted as a histogram that counts the number of subtrees with the same hash-IDs.

B. Nested Subtree Hash Kernels

A kernel between two graphs, g and g', can be naturally derived from the NSH iterations by computing the inner products of their corresponding sets of feature vectors {x̄_(r)} and {x̄'_(r)}, that is,

\bar{\kappa}(g, g') = \sum_{r=1}^{R} \bar{\kappa}(g_{(r)}, g'_{(r)}) = \sum_{r=1}^{R} \langle \phi(x_{(r)}), \phi(x'_{(r)}) \rangle    (3)
                    = \sum_{r=1}^{R} \langle \bar{x}_{(r)}, \bar{x}'_{(r)} \rangle = \langle \bar{x}, \bar{x}' \rangle    (4)

where κ̄ denotes the NSH kernel, and x_(r) and x'_(r) denote the feature vectors in the complete feature spaces, which are spanned by all possible subtree patterns (without hashing) at the rth iteration. Obviously, the NSH kernel is a valid kernel and, in practical use, it can be computed directly as ⟨x̄, x̄'⟩ based on Algorithm 1.

A key step in the above kernel derivation is φ(x_(r)) = x̄_(r). To theoretically analyze the NSH kernel, we define an explicit projection for this operation:

\phi(x_{(r)}) = R_{(r)}^{\top} x_{(r)} = \bar{x}_{(r)}    (5)

where R_(r) ∈ {±1, 0}^{D_(r) × M_(r)}, D_(r) denotes the dimensionality of the complete feature space at the rth iteration, and M_(r) is defined above. At the first iteration, where x_(1) is a histogram of node patterns only, D_(1) is the number of all node patterns, which span the complete feature space X_(1). At the second iteration, the complete feature space becomes X_(2) = ⊗^B X_(1), where ⊗ denotes the tensor (Kronecker) product and B is the maximum number of leaf nodes over all subtree patterns; the dimensionality of X_(2) thus becomes D_(2) = D_(1)^B. This is because, for r > 1, a subtree pattern at the rth iteration is a combination of B patterns of the (r−1)th iteration (dummy nodes are assumed for subtrees with fewer than B leaf nodes). In this way, we can define the complete feature spaces recursively:

Definition 1 (Complete Feature Spaces): Let {X_(r)}_{r=1}^R denote the complete feature spaces of subtree patterns for R iterations. For r > 1, X_(r) = ⊗^B X_(r−1), where X_(1) is the feature space spanned by all the node patterns.

Given Definition 1, we have the following nested relationships between {x_(r) ∈ X_(r)} and {x̄_(r) ∈ X̄_(r)}, where {X̄_(r)} denote the hashed low-dimensional feature spaces:

\bar{x}_{(1)} = U_{(1)}^{\top} V_{(1)} \, x_{(1)}    (6)
\bar{x}_{(2)} = U_{(2)}^{\top} V_{(2)} \Big( \textstyle\bigotimes^{B} U_{(1)} \Big)^{\top} x_{(2)}    (7)
\bar{x}_{(3)} = U_{(3)}^{\top} V_{(3)} \Big( \textstyle\bigotimes^{B} \big( \bigotimes^{B} U_{(1)} \big) U_{(2)} \Big)^{\top} x_{(3)}    (8)
\bar{x}_{(r)} = U_{(r)}^{\top} V_{(r)} \Big( \textstyle\bigotimes^{B} \cdots \big( \bigotimes^{B} U_{(1)} \big) U_{(2)} \cdots U_{(r-1)} \Big)^{\top} x_{(r)}    (9)

where U_(1) ∈ {0,1}^{D_(1) × M_(1)} and V_(1) ∈ {±1}^{D_(1) × D_(1)}, and, for r > 1, U_(r) ∈ {0,1}^{M_(r−1)^B × M_(r)} and V_(r) ∈ {±1}^{M_(r−1)^B × M_(r−1)^B}. Specifically,

[U_{(r)}]_{ij} = 1 if h(str_{(r)}(i), \{1, \ldots, M_{(r)}\}) = j, and 0 otherwise;    (10)
[V_{(r)}]_{ij} = \delta(str_{(r)}(i), \{\pm 1\}) if i = j, and 0 if i \neq j.    (11)

Note that each row in U_(r) and V_(r) corresponds to a subtree pattern; thus str_(r)(i) returns the string that corresponds to the

subtree pattern of the ith row. The two hash functions h and δ are those used in Algorithm 1. Eqs. (9), (10) and (11) define the explicit projection in Eq. (5). Now we can write

R_{(r)}^{\top} = U_{(r)}^{\top} V_{(r)} \Big( \textstyle\bigotimes^{B} \cdots \big( \bigotimes^{B} U_{(1)} \big) \cdots U_{(r-1)} \Big)^{\top}    (12)

which has the following property.

Lemma 1: There is only one nonzero entry, with value in {±1}, in each row of R_(r).

Proof: According to Eq. (10), each row of U_(r), for r = 1, …, R, has exactly one nonzero entry, which equals 1. By the definition of the Kronecker product, each row of U_⊗ = ⊗^B ⋯ (⊗^B U_(1)) ⋯ U_(r−1) also has exactly one nonzero entry equal to 1. Since V_(r) is a diagonal matrix with values in {±1}, it is easy to verify that each row of R_(r) = U_⊗ V_(r) U_(r) has only one nonzero entry, with value in {±1}.

Lemma 1 implies a nice property of the proposed NSH kernel: although a series of nested hashing operations is performed over the iterations, at each iteration there still exists an explicit mapping between the complete feature space and the hashed low-dimensional feature space, namely R_(r)^⊤ x_(r) = x̄_(r), where R_(r) is a random projection matrix. This projection is similar to the database-friendly random projections [9] and the hash kernels [10], but our projection matrix is generated in a recursive way.

Definition 2 (Resolutions of the Kernel): Let κ̄(g_(r), g'_(r)) be an r-resolution kernel, in which the string of a node, str_(r)(v), encodes a subtree of height r rooted at v. A high-resolution kernel is constructed on tall subtree patterns.

From Eq. (4), we can see that the NSH kernel is a sum of sub-kernels over R iterations. At each iteration, one more step of neighboring-node information is taken into account to encode subtree patterns. In particular, at the rth iteration, the subtree patterns {str_(r)(v)} encode all the neighboring nodes within r steps. Obviously, as r increases, more comprehensive substructures can be encoded.

C. Analysis

In practical use, we wish the NSH kernel κ̄(g, g') = ⟨x̄, x̄'⟩ to perform as well as its unhashed counterpart κ(g, g') = ⟨x, x'⟩. In this subsection, we show that ⟨x̄, x̄'⟩ is an unbiased and highly concentrated estimator of ⟨x, x'⟩.

1) Bias & Variance:

Theorem 1: The NSH kernel is unbiased at all resolutions, that is, E_φ[⟨x̄_(r), x̄'_(r)⟩] = ⟨x_(r), x'_(r)⟩ for r = 1, …, R; and

\mathrm{Var}_\phi[\langle \bar{x}_{(r)}, \bar{x}'_{(r)} \rangle] = \frac{1}{M_{(r)}} \sum_{i \neq j} \big( x_{(r),i}^2 x_{(r),j}'^2 + x_{(r),i} x_{(r),i}' x_{(r),j} x_{(r),j}' \big), \quad r = 1, \ldots, R.

Proof: We have ⟨x̄_(r), x̄'_(r)⟩ = x_(r)^⊤ R_(r) R_(r)^⊤ x'_(r) according to Eq. (5). Since each row of R_(r) contains exactly one 1 or −1 (Lemma 1), the diagonal entries of R_(r) R_(r)^⊤ are all 1. The expectation of any off-diagonal entry is $E_\phi[R_{(r)} R_{(r)}^{\top}]_{ij} = \frac{1}{M_{(r)}} E_\delta[\delta(str_{(r)}(i)) \delta(str_{(r)}(j))] = 0$ for i ≠ j. Thus, E_φ[⟨x̄_(r), x̄'_(r)⟩] = x_(r)^⊤ E_φ[R_(r) R_(r)^⊤] x'_(r) = x_(r)^⊤ I x'_(r) = ⟨x_(r), x'_(r)⟩.

We also have ⟨x̄_(r), x̄'_(r)⟩² = (x_(r)^⊤ R_(r) R_(r)^⊤ x'_(r))² according to Eq. (5). Let B = R_(r) R_(r)^⊤; then

E_\phi[\langle \bar{x}_{(r)}, \bar{x}'_{(r)} \rangle^2] = E_\phi\big[ (x_{(r)}^{\top} B x_{(r)}')^2 \big]
  = E_\phi\Big[ \Big( \sum_i x_{(r),i} B_{ii} x_{(r),i}' + \sum_{i \neq j} x_{(r),i} B_{ij} x_{(r),j}' \Big)^2 \Big]    (13)
  = E_\phi\Big[ \Big( \langle x_{(r)}, x_{(r)}' \rangle + \sum_{i \neq j} x_{(r),i} B_{ij} x_{(r),j}' \Big)^2 \Big]    (14)
  = \langle x_{(r)}, x_{(r)}' \rangle^2 + E_\phi\Big[ \Big( \sum_{i \neq j} x_{(r),i} B_{ij} x_{(r),j}' \Big)^2 \Big]    (15)

where the results of the first part of the proof are used to pass from Eq. (13) to Eq. (15). According to the definition of variance, $\mathrm{Var}_\phi[\langle \bar{x}_{(r)}, \bar{x}'_{(r)} \rangle] = E_\phi[\langle \bar{x}_{(r)}, \bar{x}'_{(r)} \rangle^2] - E_\phi[\langle \bar{x}_{(r)}, \bar{x}'_{(r)} \rangle]^2 = \text{Eq. (15)} - \langle x_{(r)}, x_{(r)}' \rangle^2 = \sum_{i \neq j} \big( x_{(r),i}^2 x_{(r),j}'^2 E_\phi[B_{ij}^2] + x_{(r),i} x_{(r),i}' x_{(r),j} x_{(r),j}' E_\phi[B_{ij} B_{ji}] \big)$. Since $E_\phi[B_{ij}^2] = E_\phi[B_{ij} B_{ji}] = \frac{1}{M_{(r)}}$, the result is proved.

2) Rate of Convergence: In the following, we analyze a tail bound of the NSH kernel to show its convergence rate. The bound can further be used to show an interesting phenomenon: as the nested subtree hashing goes deeper (i.e., the kernel resolution becomes higher), the bound on the convergence rate tends to become tighter.

Theorem 2: For ε > 0, the probability of deviation between the NSH kernel ⟨x̄_(r), x̄'_(r)⟩ and its unhashed counterpart ⟨x_(r), x'_(r)⟩ is bounded by the following inequality:

P\big( \big| \langle \bar{x}_{(r)}, \bar{x}_{(r)}' \rangle - \langle x_{(r)}, x_{(r)}' \rangle \big| > \epsilon \big) \le 2 \exp\Big( - \frac{\epsilon^2 / 2}{\mathrm{Var}_\phi[\langle \bar{x}_{(r)}, \bar{x}_{(r)}' \rangle] + |\mathcal{V}||\mathcal{V}'| \epsilon / 3} \Big).    (16)

Proof: Let $z_m = x_{(r)}^{\top} r_{(r),m} r_{(r),m}^{\top} x_{(r)}' - \frac{1}{M_{(r)}} \langle x_{(r)}, x_{(r)}' \rangle$, where r_(r),m denotes the mth column of R_(r). Because {r_(r),m}_{m=1}^{M_(r)} are independently generated by random hash functions and x_(r), x'_(r), and M_(r) are constants, {z_m}_{m=1}^{M_(r)} are independent zero-mean random variables. We then apply Bernstein's inequality [11] to {z_m}_{m=1}^{M_(r)}: given E_φ[z_m] = 0 and |z_m| ≤ C, for ε > 0 we have

P\Big( \Big| \sum_{m=1}^{M_{(r)}} z_m \Big| > \epsilon \Big) \le 2 \exp\Big( - \frac{\epsilon^2 / 2}{\sum_{m=1}^{M_{(r)}} E_\phi[z_m^2] + C \epsilon / 3} \Big).

It is easy to verify that $\sum_{m=1}^{M_{(r)}} z_m = \langle \bar{x}_{(r)}, \bar{x}_{(r)}' \rangle - \langle x_{(r)}, x_{(r)}' \rangle$, so we only need to compute E_φ[z_m²] and C. We have $E_\phi[z_m^2] = E_\phi[(x_{(r)}^{\top} r_{(r),m} r_{(r),m}^{\top} x_{(r)}')^2] - E_\phi[x_{(r)}^{\top} r_{(r),m} r_{(r),m}^{\top} x_{(r)}']^2$, where the second term equals $\frac{1}{M_{(r)}^2} \langle x_{(r)}, x_{(r)}' \rangle^2$, which we denote by A. Some steps of deduction give $E_\phi[(x_{(r)}^{\top} r_{(r),m} r_{(r),m}^{\top} x_{(r)}')^2] = \frac{1}{M_{(r)}^2} \big( \sum_{i,j} x_{(r),i}^2 x_{(r),j}'^2 + 2 \sum_{i \neq j} x_{(r),i} x_{(r),i}' x_{(r),j} x_{(r),j}' \big)$, which we denote by


B. Then $E_\phi[z_m^2] = B - A = \frac{1}{M_{(r)}^2} \sum_{i \neq j} \big( x_{(r),i}^2 x_{(r),j}'^2 + x_{(r),i} x_{(r),i}' x_{(r),j} x_{(r),j}' \big)$. Since the {E_φ[z_m²]}_{m=1}^{M_(r)} are all equal, $\sum_{m=1}^{M_{(r)}} E_\phi[z_m^2] = M_{(r)} E_\phi[z_m^2] = \mathrm{Var}_\phi[\langle \bar{x}_{(r)}, \bar{x}_{(r)}' \rangle]$. Finally, |z_m| ≤ |V||V'| = C, which is attained in the worst case where all the nodes of the two graphs are hashed to the same dimension.

Corollary 1: As the resolution of the NSH kernel goes higher (i.e., r becomes larger for ⟨x̄_(r), x̄'_(r)⟩), the bound on the convergence rate in Eq. (16) tends to be tighter.

Proof: In Eq. (16), since |V||V'| is unchanged over iterations, we only need to prove that Var_φ[⟨x̄_(r), x̄'_(r)⟩] ≤ Var_φ[⟨x̄_(r−1), x̄'_(r−1)⟩] for r = 2, …, R. According to Definition 1, the complete feature space X_(r) keeps splitting over iterations, without merging. Recall that a feature vector x_(r) ∈ X_(r) is a histogram counting subtree patterns. At the rth iteration, each dimension in x_(r−1) reallocates its subtree patterns to a set of dimensions in x_(r); for example, $x_{(r-1),1} = \sum_{p=1}^{P} x_{(r),p}$ and $x_{(r-1),1}' = \sum_{p=1}^{P} x_{(r),p}'$. Based on the facts that $\sum_i a_i^2 \le (\sum_i a_i)^2$ and $\sum_i a_i b_i \le (\sum_i a_i)(\sum_i b_i)$ for all a_i ≥ 0, b_i ≥ 0, we obtain $\sum_{i \neq j} \big( x_{(r),i}^2 x_{(r),j}'^2 + x_{(r),i} x_{(r),i}' x_{(r),j} x_{(r),j}' \big) \le \sum_{i \neq j} \big( x_{(r-1),i}^2 x_{(r-1),j}'^2 + x_{(r-1),i} x_{(r-1),i}' x_{(r-1),j} x_{(r-1),j}' \big)$. Letting M_(r) ≥ M_(r−1) for r = 2, …, R, the claim is proved.

3) Complexity Analysis: The computational complexity of the NSH kernel is O(NR|E| + N²R|V|), the same as that of the WL graph kernel [5]. In particular, O(NR|E|) is for computing the feature vectors of N graphs over R iterations, where each graph requires O(|E|) to compute the strings for hashing; and O(N²R|V|) is for computing an N × N kernel matrix based on the obtained R sets of feature vectors, where each feature vector has at most |V| nonzero entries. In practical implementation, the proposed NSH kernel can be even more computationally efficient than the WL graph kernel, since it directly assigns a dimension of the feature space to a subtree pattern via hashing, while the WL graph kernel needs to search the global subtree pattern list. In terms of spatial complexity, an advantage of the NSH kernel is that no additional cost is required for storing subtree patterns. In contrast, the WL graph kernel requires O(NR|E|) space in the worst case to maintain a global subtree pattern list in memory, which keeps growing as new graphs are fed in. Moreover, the NSH algorithm further reduces the spatial cost by caching significantly shortened feature vectors before computing a kernel.
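For concreteness, a minimal sketch of computing the chunk kernel matrix of Eq. (4) is given below; it is our own illustration and assumes the per-graph hashed feature vectors come from a routine such as the nested_subtree_hashing sketch above.

```python
def nsh_kernel_matrix(chunk_features):
    """chunk_features: list over graphs; each entry is the list [x_bar_(1), ..., x_bar_(R)]
    of hashed feature vectors for one graph. Returns the N x N kernel matrix of Eq. (4)."""
    N = len(chunk_features)
    K = [[0.0] * N for _ in range(N)]
    for a in range(N):
        for b in range(a, N):
            k = sum(
                sum(u * v for u, v in zip(xa, xb))   # <x_bar_(r), x_bar'_(r)> at resolution r
                for xa, xb in zip(chunk_features[a], chunk_features[b])
            )                                        # summed over r = 1..R
            K[a][b] = K[b][a] = k
    return K
```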

D. Applied to Streams

As highlighted above, the NSH kernel is able to project graphs with different subtree patterns onto a set of common low-dimensional feature spaces, efficiently in terms of both computation and space. These characteristics pave the way for directly plugging the NSH kernel into the classical ensemble learning framework for data stream mining [6]. Note that we omit the subscript (r) in this subsection for notational concision, since the superscript is used to index data chunks.

In particular, as a chunk of graphs S_t arrives in memory, we first compute an N × N kernel matrix K̄^(t) on S_t using Algorithm 1 and Eq. (4). Then we can train a kernel machine on K̄^(t). To classify a graph g, we have

f_t(g) = \sum_{n=1}^{N} \alpha_n^{(t)} \, \bar{\kappa}(g_n^{(t)}, g) + \beta^{(t)}    (17)

where κ̄ is the NSH kernel defined in Eq. (3), g_n^(t) ∈ S_t, and {α_n^(t)}_{n=1}^N and β^(t) are the coefficients of the model. By substituting Eq. (17) into Eq. (2), we obtain the classifier ensemble, based on the NSH kernel, for graph classification over streams.

1) Concept-Drift Tolerance: A major concern in data stream mining is the concept-drift problem caused by changes of data distributions over the stream. In graph stream mining scenarios, new subtree patterns continuously emerge as new graphs arrive while old patterns may fade away, so the marginal distributions of subtree patterns of different chunks, {P(x^(t))}, differ from one another. The concept-drift problem can thus be even more severe in graph streams, since subtree patterns may change more rapidly than the gradually changing vectorial data distributions in traditional data stream mining. In the following, we show that the low-dimensional feature space X̄ induced by the NSH kernel is more robust in tolerating concept drift between chunks of graphs than the low-dimensional feature space X̂ induced by an explicit frequent-pattern selection. Here, X̂ denotes the subspace corresponding to the dimensions with the most subtree patterns in the complete space X.

Theorem 3: Given X̄ induced by the NSH kernel and X̂ induced by a frequent-pattern selection, X̄ is more robust than X̂ in tolerating concept drift between chunks of graphs, that is, ρ(P(x̄^(t)), P(x̄^(s))) ≤ ρ(P(x̂^(t)), P(x̂^(s))) for t ≠ s, where ρ denotes a function empirically measuring the discrepancy of two distributions based on the observed data.

Proof: We adopt the Maximum Mean Discrepancy (MMD) [12] to empirically measure the discrepancy of two distributions based on the observed data:

\mathrm{MMD}\big(\{x_n^{(t)}\}, \{x_n^{(s)}\}\big) = \Big\| \frac{1}{N} \sum_{n=1}^{N} \phi(x_n^{(t)}) - \frac{1}{N} \sum_{n=1}^{N} \phi(x_n^{(s)}) \Big\|_{\mathcal{H}}^2

where ‖·‖_H denotes the L²-norm in the reproducing kernel Hilbert space. Applying MMD to the two data sets projected into X̄, we have $\| \frac{1}{N} \sum_{n=1}^{N} R^{\top} x_n^{(t)} - \frac{1}{N} \sum_{n=1}^{N} R^{\top} x_n^{(s)} \|_{\mathcal{H}}^2 = x_\Delta^{\top} R R^{\top} x_\Delta$, where $x_\Delta = \frac{1}{N} ( \sum_{n=1}^{N} x_n^{(t)} - \sum_{n=1}^{N} x_n^{(s)} )$. Similarly, for X̂, we have $\| \frac{1}{N} \sum_{n=1}^{N} S^{\top} x_n^{(t)} - \frac{1}{N} \sum_{n=1}^{N} S^{\top} x_n^{(s)} \|_{\mathcal{H}}^2 = x_\Delta^{\top} S S^{\top} x_\Delta$, where S has the same size as R but has only one nonzero entry, equal to 1, in each column, corresponding to a preselected dimension. Then, we only need to compare $E_\phi\big[ x_\Delta^{\top} R R^{\top} x_\Delta / (\mathbf{1}^{\top} R R^{\top} \mathbf{1}) \big]$ and $x_\Delta^{\top} S S^{\top} x_\Delta / (\mathbf{1}^{\top} S S^{\top} \mathbf{1})$ (no expectation is taken over S


since it is manually selected). Using the results in the proof of Theorem 1, we have $E_\phi\big[ x_\Delta^{\top} R R^{\top} x_\Delta / (\mathbf{1}^{\top} R R^{\top} \mathbf{1}) \big] = \frac{1}{D} \sum_{d=1}^{D} x_{\Delta,d}^2$; it is also easy to see that $x_\Delta^{\top} S S^{\top} x_\Delta / (\mathbf{1}^{\top} S S^{\top} \mathbf{1}) = \frac{1}{|P|} \sum_{p \in P} x_{\Delta,p}^2$, where P denotes the selected frequent-pattern set and |P| < D. Since the selected |P| dimensions correspond to the most frequent patterns, $x_{\Delta,p}^2$ for p ∈ P always takes larger values than on the other dimensions. Thus, we have $\frac{1}{D} \sum_{d=1}^{D} x_{\Delta,d}^2 \le \frac{1}{|P|} \sum_{p \in P} x_{\Delta,p}^2$.
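For intuition, a minimal sketch of the kind of discrepancy used above is given below; it is our own simplified illustration, assuming both chunks have already been hashed into the same M-dimensional space by Algorithm 1, and it averages the squared difference of the chunk means over dimensions, mirroring the normalized quantities compared in the proof.

```python
def avg_squared_mean_discrepancy(chunk_t, chunk_s):
    """chunk_t, chunk_s: lists of hashed feature vectors of equal length M.
    Returns (1/M) * sum_d (mean_t[d] - mean_s[d])^2, an empirical per-dimension
    discrepancy between the two chunks in the hashed space."""
    M = len(chunk_t[0])
    mean_t = [sum(x[d] for x in chunk_t) / len(chunk_t) for d in range(M)]
    mean_s = [sum(x[d] for x in chunk_s) / len(chunk_s) for d in range(M)]
    return sum((a - b) ** 2 for a, b in zip(mean_t, mean_s)) / M
```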

Table I
SUMMARY OF THE NCI DATA SETS USED IN THE EXPERIMENTS.

Bioassay-ID (Data sets)   Compounds (Graphs)   Active (Pos)   Description
NCI1                      42161                2232           Lung cancer
NCI33                     41860                1806           Melanoma
NCI41                     28547                1690           Prostate cancer
NCI47                     42133                2181           Nervous sys. tumor
NCI81                     42401                2620           Colon cancer
NCI83                     28958                2437           Breast cancer
NCI109                    42382                2252           Ovarian tumor
NCI123                    41806                3388           Leukemia
NCI145                    41850                2120           Renal cancer

IV. EXPERIMENTS

In this section, we test the proposed NSH kernel on both synthetic and real-world graph data sets. Complementary to the theoretical proofs in Sections III-C and III-D, this section aims to empirically validate the effectiveness of the NSH kernel for large-scale graph classification over streams, where most state-of-the-art graph kernels can hardly be applied because of the unlimitedly increasing features (e.g., subtrees [5]). Moreover, we will show that the NSH kernel is also superior in computational efficiency. The classification performance of the different graph kernels is evaluated using LIBSVM [13]. All kernels are precomputed and then plugged into the LIBSVM executable for training, where the cost parameter C is set to its default value. In our experiments, all reported results are the average performance over ten random data splits. The experiments are conducted on a PC with a 2.67 GHz CPU.
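For reference, a roughly equivalent setup with scikit-learn's LIBSVM-based SVC and a precomputed kernel matrix (an assumed tooling substitute for the LIBSVM executable used here, not the exact pipeline) looks like this:

```python
from sklearn.svm import SVC

def train_and_score_precomputed(K_train, y_train, K_test_train, y_test):
    """K_train: n_train x n_train precomputed NSH kernel matrix;
    K_test_train: n_test x n_train kernel values between test and training graphs."""
    clf = SVC(kernel="precomputed", C=1.0)   # default cost parameter, as in the experiments
    clf.fit(K_train, y_train)
    return clf.score(K_test_train, y_test)   # classification accuracy on the test chunk
```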

A. Data Description

(1) Synthetic: We use a synthetic graph data generator³ to generate 100,000 labeled, undirected and connected graphs. We set the number of unique node labels to 30 and the average number of edges per graph to 20. To simulate a graph classification task over streams with potential concept drift, we perform the following steps: 1) find 7811 frequent subgraph patterns (support > 0.01) using gSpan⁴ [14]; 2) represent each graph as a 7811-dimensional feature vector, with a dimension being 1 if the graph contains the corresponding subgraph and 0 otherwise; 3) split all the graphs into 100 chunks of equal size; 4) perform 2-means clustering on each data chunk and let one cluster be positive and the other negative. The clustering is performed chunk by chunk, and the initial cluster centers of the current chunk are set to the resulting centers of the previous chunk. In this way, the binary classification boundary drifts slightly over chunks.

(2) NCI: The NCI cancer screening data sets are widely used for graph classification evaluation [15], [16]. We download nine NCI data sets from the PubChem database⁵. Each data set contains 28,000 – 42,000 chemical compounds, each of which is represented as a graph, with atoms as nodes and bonds as edges. Each data set corresponds to a bioassay task for anticancer

activity prediction: if a chemical compound in a data set is active against the corresponding cancer, its label is positive. Table I summarizes the nine data sets. Since the labels of the original data are unbalanced (about 5% positive samples), a preprocessing step is performed in advance: from each data set, we randomly select a negative subset of the same size as the positive one (e.g., 2232 positive and 2232 negative samples in NCI1). As a result, we obtain nine balanced data sets comprising 41,452 samples in total. We combine them sequentially and use the combined data set to simulate a stream of graphs. We split the entire stream into 100 data chunks, each comprising 400 graphs (the rest are discarded), and assume that the graphs are fed into memory chunk by chunk. The resulting stream of graphs exhibits concept drift, caused by the changes of bioassay tasks along the stream.
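A minimal sketch of this preprocessing (balancing each bioassay set and splitting the concatenated stream into fixed-size chunks); the helper name, the random seed, and the within-task shuffling are our own assumptions:

```python
import random

def balance_and_chunk(datasets, chunk_size=400, seed=0):
    """datasets: list of (positives, negatives) graph lists, one pair per bioassay task.
    Returns a list of chunks, each a list of chunk_size (graph, label) pairs."""
    rng = random.Random(seed)
    stream = []
    for positives, negatives in datasets:                     # tasks concatenated sequentially
        sampled_neg = rng.sample(negatives, len(positives))   # one negative per positive
        task = [(g, +1) for g in positives] + [(g, -1) for g in sampled_neg]
        rng.shuffle(task)                                     # mix labels within each task
        stream += task
    n_chunks = len(stream) // chunk_size                      # leftover graphs are discarded
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]
```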

B. Classification Performance over Streams

In this experiment, we aim to show that, given a proper dimensionality setting of the hashed feature spaces, the NSH kernel can achieve state-of-the-art performance in graph classification over streams, without a pre-scan of the data or prior knowledge of the subtree patterns. In Table II, six dimensionality settings are defined for the hashed low-dimensional feature spaces of the NSH kernels. From NSH (1) to NSH (6), the dimensionality of the hashed feature space increases. The last two rows in Table II list the numbers of observed subtree patterns, which span the feature spaces of the WL kernels for the Synthetic and NCI data sets, respectively. Note that, at each resolution, the dimensionality of a hashed space is no higher than that of the corresponding feature space spanned by the observed subtree patterns. We compare the performance of the NSH kernels with the two baseline approaches introduced in Section II-D: for the NSH kernels (1)–(6) listed in Table II, we directly plug them into the ensemble learning framework [6] introduced in Section II-C to learn a classifier ensemble. For the first baseline approach, we use the WL kernel to train a single classifier on each chunk. For the second baseline approach, we perform a pre-scan on the entire stream of graphs and

³ http://www.cais.ntu.edu.sg/~jamescheng/graphgen1.0.zip
⁴ http://www.cs.ucsb.edu/~xyan/software/gSpan.htm
⁵ http://pubchem.ncbi.nlm.nih.gov/


Table II
DIMENSIONALITY SETTINGS FOR THE NSH KERNELS AT DIFFERENT RESOLUTIONS (r = 1, ..., 6), COMPARED TO THE DIMENSIONALITIES OF THE WL KERNELS FOR THE TWO DATA SETS. (1k = 10³)

Kernels    r=1    r=2      r=3     r=4      r=5      r=6
NSH (1)    30     50       50      50       50       50
NSH (2)    30     100      100     100      100      100
NSH (3)    30     500      500     500      500      500
NSH (4)    30     500      1000    1000     1000     1000
NSH (5)    30     500      5000    5000     5000     5000
NSH (6)    30     500      5000    10000    10000    10000
WL-Syn     30     49781    817k    988k     1004k    1005k
WL-NCI     44     583      5947    28814    52734    67935

collect the most frequent subtree patterns at each iteration to construct the feature spaces, from which a set of kernels based on counting these frequent patterns can be derived; we call these the FPS kernels. To make a fair comparison, we let the dimensionality settings for the feature spaces of the FPS kernels be the same as those for the NSH kernels (1)–(6), respectively, and denote the corresponding kernels by FPS kernels (1)–(6). For the single-classifier method (WL), the performance is evaluated on the tth chunk using the classifier learned from the (t − 1)th chunk. For the two ensemble learning methods (NSH and FPS), we set K = 10 as the number of classifiers in an ensemble, and the performance is evaluated on the tth chunk using the classifier ensemble learned from the K chunks prior to the tth chunk.

The comparison results on the two streams of graph chunks are plotted in Figures 2 and 3, respectively. In each figure, each subplot corresponds to one dimensionality setting for the NSH kernel and the FPS kernel. Since the WL kernel spans its feature spaces from the subtree patterns observed over the chunks, we duplicate its performance curve in all the subplots for reference. From all the subplots in each figure, we observe a clear trend that the performance of both the NSH kernels and the FPS kernels improves as the dimensionality increases from setting (1) to setting (6). A more interesting phenomenon observed from Figures 2 and 3 is that the NSH kernel shows the largest improvement, moving from the bottom position at NSH (1) to a clearly leading position at NSH (6). This result implies that, at an extremely low dimensionality, a few most-frequent patterns (FPS) can be more useful than all the patterns crowded into a few dimensions (NSH). However, at the moderate dimensionality settings (4)–(6), the NSH kernel begins to exhibit its capability for high-quality estimation. It is worth noting that, although the WL and FPS kernels outperform the NSH kernels at the two lowest dimensionality settings, the WL kernel has a much higher dimensionality while the FPS kernel needs a pre-scan of the data stream, both of which may be infeasible in real-world applications. Finally, consider the marked potential concept-drift positions in Figures 2 and 3. We observe that the

[Figure 4: CPU time (seconds) versus r = 1, ..., 6 for NSH (1)–(6) and WL; upper panel: Synthetic, lower panel: NCI.]

Figure 4. CPU time for computing WL and NSH kernel matrices at different resolutions (r = 1, . . . , 6) on the Synthetic (upper panel) and NCI (bottom panel) data sets.

performance curves of the WL kernel and the FPS kernel exhibit several drops roughly corresponding to the marked concept-drift positions. At the last two dimensionality settings, with relatively more dimensions, the NSH kernels (5) and (6) have curve shapes very different from those of the other two kernels. This phenomenon is especially evident in Figure 3, since the real-world NCI data set may have clearer concept drift than the Synthetic graph stream. This observation empirically validates the concept-drift tolerance capability of the proposed NSH kernel.

C. Computational Efficiency

Our second experiment shows that the NSH kernel is more computationally efficient than its unhashed counterpart, the WL kernel [5]. We compare the computation time of the NSH kernels with that of the WL kernel in batch mode. Due to memory limitations, the WL kernel can only be applied to a set of about 5000 graphs. Thus, this experiment is performed on subsets of the Synthetic and NCI data sets obtained in Section IV-A; each subset comprises 5000 graphs randomly sampled from the corresponding full data set. Figure 4 plots the CPU time for computing kernel matrices on the Synthetic and NCI data sets. On both data sets, the WL kernel takes almost twice the CPU time of the NSH kernels to compute a kernel matrix on the same data. As mentioned in the complexity analysis in Section III-C, this is because the WL kernel needs to search a global subtree pattern list, while the NSH kernel directly assigns a dimension of the feature space to a subtree pattern via hashing. The superiority of the NSH kernel in computational efficiency will be even more significant on larger data sets.


[Figure 2: six subplots, one per dimensionality setting (1)–(6); each shows classification accuracy (y-axis) over the 100 chunks (x-axis) for NSH (k), FPS (k), and WL.]
Figure 2. Performance comparison on the Synthetic graph stream. The dashed vertical lines indicate potential concept-drift positions.

[Figure 3: six subplots, one per dimensionality setting (1)–(6); each shows classification accuracy (y-axis) over the 100 chunks (x-axis) for NSH (k), FPS (k), and WL.]
Figure 3. Performance comparison on the NCI graph stream. The dashed vertical lines indicate potential concept-drift positions.

Mining high-speed data streams was first considered in [21], and the idea of using an ensemble learning method on data streams was proposed soon after [22]. In our implementation, we simply adopt the widely used accuracy-weighted ensemble learning framework [6] for classifying graphs over a stream. Recently, a streaming graph classification problem was studied in [23]. However, the "graph streams" in [23] refer to data streams on a single graph (e.g., traffic flows in a network), which is substantially different from our problem setting of classifying a stream of graph objects.

V. RELATED WORK

The proposed NSH kernel is an estimator of the Weisfeiler-Lehman graph kernel [4], [5]. A large number of graph kernels have been proposed in the last decade, a majority of which share the idea of extracting substructures from graphs and comparing their co-occurrences. Typical substructures for describing graphs include walks [1], [2], [3] and paths [17], subtrees (node neighborhoods) [18], [4], [19], and subgraphs (usually obtained by a frequent subgraph mining technique, e.g., [14]). These graph kernels can be further traced back to the convolution kernel [20]. Most existing graph kernels focus on designing fast yet effective kernels for comparing complex graphs. In this paper, we take a different perspective and aim to apply graph kernels to large-scale graph classification over streams, focusing on the expanding feature space problem as well as on time and space budgets.

Finally, the proposed NSH algorithm builds on hashing techniques. The hashing method adopted at each individual iteration of the NSH algorithm is related to random projections [8], in particular the database-friendly random projections [9] and the hash kernels [24], [10]; our projection matrix, however, is generated in a recursive way. In [25], a multi-resolution graph encoding method based on the idea of wavelet transforms was proposed, but it is not applicable


to streams of graphs. To the best of our knowledge, the proposed NSH kernel is the first endeavor to recursively encode graphs into multi-resolution representations using random hash functions.

VI. CONCLUSION

This paper introduces a Nested Subtree Hashing (NSH) algorithm and its derived NSH kernel to recursively project multi-resolution subtree patterns of different graph chunks onto a set of common low-dimensional feature spaces. The NSH kernel enables large-scale graph classification over streams, where graph chunks have different subtree pattern sets, by allowing the classifiers trained on successive graph chunks to be combined into an ensemble. The theoretical analysis shows that the NSH kernel has a number of favorable properties, especially its concept-drift tolerance, which is particularly beneficial for data stream mining scenarios. Moreover, the NSH kernel is highly efficient in both computation and space. We conduct a case study of anticancer activity prediction on a large-scale chemical compound data set collected from the PubChem database and empirically test the proposed algorithm. The experimental results show that the NSH kernel is indeed effective for graph classification over streams.

ACKNOWLEDGMENT

This work was supported in part by Australian Research Council (ARC) Discovery Project DP1093762, ARC Future Fellowship FT100100971, and a UTS Early Career Researcher Grant.

REFERENCES

[1] H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs," in ICML, 2003, pp. 321–328.
[2] T. Gärtner, P. Flach, and S. Wrobel, "On graph kernels: Hardness results and efficient alternatives," in COLT, 2003, pp. 129–143.
[3] S. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, "Graph kernels," Journal of Machine Learning Research, vol. 11, pp. 1201–1242, 2010.
[4] N. Shervashidze and K. Borgwardt, "Fast subtree kernels on graphs," in NIPS, 2009, pp. 1660–1668.
[5] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, pp. 2539–2561, 2011.
[6] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in KDD, 2003, pp. 226–235.
[7] B. J. Weisfeiler and A. A. Leman, "A reduction of a graph to a canonical form and an algebra arising during this reduction," Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968.
[8] P. Indyk, "Stable distributions, pseudorandom generators, embeddings and data stream computation," Journal of the ACM, vol. 53, no. 3, pp. 307–323, 2006.
[9] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[10] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, "Feature hashing for large scale multitask learning," in ICML, 2009, pp. 1113–1120.
[11] S. N. Bernstein, The Theory of Probabilities. Moscow: Gastehizdat Publishing House, 1946.
[12] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, "A kernel method for the two-sample problem," in NIPS, 2007, pp. 513–520.
[13] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[14] X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining," in ICDM, 2002, pp. 721–724.
[15] N. Wale, I. A. Watson, and G. Karypis, "Comparison of descriptor spaces for chemical compound retrieval and classification," Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
[16] X. Kong, W. Fan, and P. S. Yu, "Dual active feature and sample selection for graph classification," in KDD, 2011, pp. 654–662.
[17] K. M. Borgwardt and H.-P. Kriegel, "Shortest-path kernels on graphs," in ICDM, 2005, pp. 74–81.
[18] P. Mahé and J.-P. Vert, "Graph kernels based on tree patterns for molecules," Machine Learning, vol. 75, no. 1, pp. 3–35, 2009.
[19] S. Hido and H. Kashima, "A linear-time graph kernel," in ICDM, 2009, pp. 179–188.
[20] D. Haussler, "Convolution kernels on discrete structures," UC Santa Cruz, Tech. Rep. UCSC-CRL-99-10, 1999.
[21] P. Domingos and G. Hulten, "Mining high-speed data streams," in KDD, 2000, pp. 71–80.
[22] W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," in KDD, 2001, pp. 377–382.
[23] C. C. Aggarwal, "On classification of graph streams," in SDM, 2011, pp. 652–663.
[24] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan, "Hash kernels for structured data," Journal of Machine Learning Research, vol. 10, pp. 2615–2637, 2009.
[25] X. Wang, A. Smalter, J. Huan, and G. H. Lushington, "G-Hash: Towards fast kernel-based similarity search in large graph databases," in EDBT, 2009, pp. 472–480.

