Community Structure Identification: A Probabilistic Approach

Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles
Institut de Recherche en Informatique de Toulouse
Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex
{chikhi,rothenburger,aussenac}@irit.fr

Abstract A large variety of techniques has been developed for community structure identification (CSI) including modularity optimization, graph partitioning, and hierarchical clustering. In this paper, we argue that generative models are a promising approach for community structure identification, although these models have received very little attention from CSI researchers. Following the work of Cohn and Chang on link analysis, we propose a new probabilistic model for community structure detection. The originality of our model is the use of smoothing in order to overcome the sparsity of network data. A method based on the modularity criterion is also proposed for the estimation of smoothing parameters. Experiments carried out on three real datasets show that our new model SPCE (Smoothed Probabilistic Community Explorer) significantly outperforms PHITS (Probabilistic HITS).

1. Introduction In recent years, network data has become ubiquitous in many fields. For example, in biology, networks are extensively used to model interactions between proteins, metabolites and genes. In social sciences, networks are a convenient way to represent social relationships (such as co-authorship or friendship) between actors (such as researchers). On the World Wide Web, social networking websites have allowed many people to create their own social networks. With this large availability of network data, it becomes essential to have tools that allow one to extract meaningful and useful information from these networks. A popular type of such information which arises in many networks is community structure. Basically, a community is a set of nodes (vertices, actors, etc.) that have more links to members of the same community than to members of other communities. Automatic detection of community structure is a challenging problem for which various solutions have been developed [8].

In this paper, we argue that generative models constitute a promising approach for community structure identification (CSI), although these models have received very little attention from CSI researchers. Indeed, generative models allow the identification of multiple communities and, moreover, they are able to discover overlapping communities. They can also analyze both directed and undirected networks, in contrast to other methods which are adapted to undirected networks only. Generative models use probability theory to model the data generation process. More precisely, they assume an underlying probability distribution which explains the observed data. Starting from the PHITS (Probabilistic HITS) model [5], we propose the SPCE (Smoothed Probabilistic Community Explorer) model, which uses smoothing to overcome the sparsity of network data. Since SPCE is very sensitive to smoothing parameters, we also propose a method to estimate these hyperparameters. We compare our model to PHITS on three real datasets. Experimental results show that SPCE significantly outperforms PHITS. The rest of the paper is organized as follows. In Section 2, we briefly review existing methods for community detection. In Section 3, we present our probabilistic model for community structure identification. Experimental results are presented in Section 4 and discussed in Section 5. Section 6 concludes the paper.

2. Related work In the literature, there exists a wide variety of techniques for community structure identification. They include graph partitioning methods [16], random walk approaches [15], modularity optimization [4], etc. Fortunato and Castellano [8] give an exhaustive review of existing methods. In this paper, we focus on a specific family of techniques for CSI, namely generative models. Generative models have been successfully used in many data mining tasks such as text clustering, recommender systems and image analysis. While these models have become very popular among knowledge discovery practitioners, they have not been actively explored by CSI researchers. A notable exception is the PHITS model of Cohn and Chang [5], which has been proposed for the analysis of links between documents (citations or hyperlinks). Conceptually, PHITS is based on the PLSA model [10] for co-occurrence data analysis. The basic principle of PHITS is that links between nodes in a network can be explained by a small set of latent variables. These variables correspond to the notion of community. PHITS has several interesting properties that make it a tempting choice for CSI. For instance, PHITS is able to discover overlapping communities where each node of the network has membership degrees in communities. These membership scores are expressed by a probability distribution over a set of communities. Overlapping is an important characteristic of many real networks such as social networks, where actors can belong to several communities. Furthermore, PHITS performs a co-clustering of in-links and out-links when the network is directed. This simultaneous clustering gives two different views of the community structure. While this information is irrelevant in the case of undirected networks, it turns out to be very useful in directed networks. Indeed, in some networks such as webpage networks or citation networks, hubs and authorities can be identified [11]. Other advantages of PHITS include simplicity and the ability to analyze both directed and undirected networks.

3. Smoothed Probabilistic Community Explorer (SPCE)

At first glance, PHITS seems to be a perfect solution to the CSI problem. Unfortunately, in practice, we found that PHITS gives unsatisfactory results and fails to identify communities in networks. This poor behavior is due to its parameter learning technique, namely Maximum Likelihood Estimation (MLE). In the following, we describe our probabilistic approach to community structure identification. The model we propose, SPCE, is a generative model. Like any other probabilistic model, SPCE is defined by a structure and a parameter learning procedure. The generative process of SPCE is similar to that of PHITS; the main difference lies in parameter learning.

3.1. Generative process Let’s first define our notation: given a directed network G=(N,L) where N is the set of nodes and L the set of links (or edges), we denote by NN=|N| the number of nodes and NL=|L| the number of links. Moreover, we make a distinction between source nodes (S) and destination nodes (D). Source nodes are vertices that have at least one out-link while destination nodes are vertices that have at least one in-link. The S and D sets are subsets of N. In the particular case where every node has at least one in-link and one out-link we have S=D=N. SPCE assumes that links from nodes of S to nodes of D are generated by the following procedure:

For i = 1 to NL
(1) select a node si ~ Mult(1, P(S))
(2) pick a community ci ~ Mult(1, P(C|S=si))
(3) generate a link from si to di, where node di is selected ~ Mult(1, P(D|C=ci))

where Mult(n,p) denotes a multinomial distribution with parameters n and p. The graphical model corresponding to this generative process is shown in Figure 1. Variable c in the graphical model is shaded because it is an unobserved variable. More information on this representation can be found in [2].
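The three sampling steps of the generative process can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name sample_links, the list-based parameter layout (p_s[s] = P(S=s), gamma[s][k] = P(C=k|S=s), phi[k][d] = P(D=d|C=k)) and the seed argument are hypothetical choices made here, not part of the paper.

```python
import random


def sample_links(p_s, gamma, phi, n_links, seed=0):
    """Draw n_links (source, destination) pairs following steps (1)-(3).

    p_s[s]      : P(S = s), probability of picking source node s
    gamma[s][k] : P(C = k | S = s)
    phi[k][d]   : P(D = d | C = k)
    All three layouts are illustrative assumptions.
    """
    rng = random.Random(seed)
    sources = list(range(len(p_s)))
    communities = list(range(len(phi)))
    destinations = list(range(len(phi[0])))
    links = []
    for _ in range(n_links):
        s = rng.choices(sources, weights=p_s)[0]           # step (1)
        c = rng.choices(communities, weights=gamma[s])[0]  # step (2)
        d = rng.choices(destinations, weights=phi[c])[0]   # step (3)
        links.append((s, d))
    return links
```

The community c drawn in step (2) is discarded after the link is generated, which mirrors the fact that it is an unobserved variable in the model.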

Figure 1 – SPCE graphical model

3.2. Parameter learning The main drawback of PHITS is the use of MLE for parameter learning. MLE is known to give reliable results when the sample size is sufficiently large compared to the number of parameters [1]. In the case of network data, this condition is not always satisfied. For example, the Cora network used in our experiments contains 5429 links. If we fix the number of communities (NC) to 7, the total number of parameters will be 1565 (7 − 1) + 7 (2222 − 1) = 24937. This number is considerably larger than the number of links. Thus, in SPCE, we replace MLE by Maximum a Posteriori (MAP) estimation. Our idea is to use priors over the parameters that will have a smoothing effect. Assuming that links are i.i.d., the joint probability expressed by SPCE is:

$$L(G) = \prod_{i=1}^{N_L} P(s_i, d_i) = \prod_{i=1}^{N_L} \sum_{k=1}^{N_C} P(s_i, d_i, c_i = k) = \prod_{i=1}^{N_L} P(s_i) \sum_{k=1}^{N_C} P(c_i = k \mid s_i)\, P(d_i \mid c_i = k) \qquad (1)$$
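As a quick check of the parameter count quoted above and of the likelihood in Eq. (1), the following Python sketch may help. The function names and the list-based parameter layout are assumptions made for illustration; the logarithm of the likelihood is computed because the raw product underflows quickly on real networks.

```python
import math


def free_parameter_count(n_gamma_rows, n_phi_columns, n_communities):
    """Free parameters of gamma = P(C|S) plus those of phi = P(D|C).

    Each row of gamma has (n_communities - 1) free values and each of the
    n_communities rows of phi has (n_phi_columns - 1) free values.
    """
    return (n_gamma_rows * (n_communities - 1)
            + n_communities * (n_phi_columns - 1))


def log_likelihood(links, p_s, gamma, phi):
    """log L(G) from Eq. (1) for a list of (s, d) links.

    p_s[s] = P(S=s), gamma[s][k] = P(C=k|S=s), phi[k][d] = P(D=d|C=k).
    """
    total = 0.0
    for s, d in links:
        mixture = sum(gamma[s][k] * phi[k][d] for k in range(len(phi)))
        total += math.log(p_s[s] * mixture)
    return total


# Figures from the Cora example in the text: 1565 (7 - 1) + 7 (2222 - 1).
print(free_parameter_count(1565, 2222, 7))  # 24937
```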

Rather than maximizing the above likelihood L(G) to obtain γ=P(C|S) and ϕ=P(D|C), we use Dirichlet priors over these parameters and maximize the posterior distribution:

$$P(\gamma, \varphi \mid G) \propto L(G)\, P(\gamma \mid \alpha)\, P(\varphi \mid \beta) \qquad (2)$$

where P(γ|α) and P(ϕ|β) are Dirichlet priors over γ and ϕ respectively; α=(α1, …, αNC) and β=(β1, …, β|D|) are the parameter vectors of the Dirichlet priors. Here, we consider symmetric Dirichlet distributions, i.e. α1=α2=…=αNC and β1=β2=…=β|D|. We use Dirichlet priors over the parameters because they are conjugate to the multinomial, and consequently they greatly simplify the parameter estimation procedure. Due to the presence of hidden variables, the maximization of (2) has no closed-form solution. It is, however, still possible to perform this maximization by means of an iterative algorithm such as EM (Expectation Maximization) [6]. In its basic formulation, the EM algorithm iterates between two steps: the Expectation (E) step and the Maximization (M) step. In the E step, a lower bound on the objective function is computed. This bound is then maximized in the M step. These two steps are repeated until a convergence criterion is met. Below, we give the EM steps for SPCE (we omit further details due to space limitations):

E-step:

$$P(c_i = k \mid s_i, d_i, \gamma', \varphi') = \frac{\gamma'_{k s_i}\, \varphi'_{d_i k}}{\sum_{t=1}^{N_C} \gamma'_{t s_i}\, \varphi'_{d_i t}}$$

M-step:

$$\gamma_{k s_i} = \frac{\alpha - 1 + \sum_{j} A_{s_i d_j}\, P(c = k \mid s_i, d_j, \gamma', \varphi')}{N_C(\alpha - 1) + \sum_{t=1}^{N_C} \sum_{j} A_{s_i d_j}\, P(c = t \mid s_i, d_j, \gamma', \varphi')}$$

$$\varphi_{d_j k} = \frac{\beta - 1 + \sum_{i} A_{s_i d_j}\, P(c = k \mid s_i, d_j, \gamma', \varphi')}{|D|(\beta - 1) + \sum_{i} \sum_{l} A_{s_i d_l}\, P(c = k \mid s_i, d_l, \gamma', \varphi')}$$

where A is the adjacency matrix of the network, the sums over i run over the source nodes and the sums over j and l run over the destination nodes, $\gamma_{k s_i} = P(C = k \mid S = s_i)$, $\varphi_{d_j k} = P(D = d_j \mid C = k)$, γ' and φ' are the current parameter estimates, and γ and φ are the new parameter estimates.

Starting from an initialization of γ' and φ', the EM steps are applied iteratively until the relative change between two successive iterations is below a threshold. We can observe from the M step that α and β play the role of pseudo-counts. These pseudo-counts help SPCE to deal with the sparsity of network data. Moreover, we notice that when α = β = 1, SPCE reduces exactly to PHITS. This is because, in this case, the Dirichlet priors are uniform and the posteriors depend on the likelihoods only. To avoid invalid probability values, we require that α ≥ 1 and β ≥ 1. A particular case is α = β = 2, which corresponds to the add-one smoothing used in information retrieval. An interesting property of the model is that it gives communities according to two possible views, i.e. “in-links” and “out-links”. In a directed network, it is important to make a distinction between source nodes and destination nodes. The choice of view depends on whether we consider a community as a set of nodes that have more links from members of their community than from members of other communities, or as a set of nodes having more links to members of the same community than to members of other communities. Note that the model we have presented so far applies to directed networks. Its extension to undirected networks is straightforward: the variables S and D become the same, and bidirectional links are generated in step (3) of the generative process instead of unidirectional ones.

3.3. Smoothing parameters estimation In the next section, we show that the smoothing parameters (α and β) have a strong effect on SPCE. This is why an estimation method for these hyperparameters turns out to be essential. A widely used technique for hyperparameter estimation is cross-validation [18]. This method consists in splitting a dataset into two sets: one set is used for training and the other is used as a validation set to estimate the best values of the hyperparameters. Unfortunately, cross-validation cannot be applied to our model. Indeed, SPCE inherits a drawback of the PLSA (and PHITS) model, namely that the model is not fully generative [3]. More precisely, the PLSA and SPCE models are generative of the training data only and are unable to generate new data. Thus, they are not able to assign probabilities and likelihoods to unseen data. Although Brants [3] proposed a method to compute a partial test data likelihood for the PLSA model, his method might be useful only for convergence assessment. It is not correct to use it to compare likelihoods of different test datasets. Here, we propose a hyperparameter estimation method for SPCE based on the modularity criterion.
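To make the role of the pseudo-counts α − 1 and β − 1 concrete, the EM updates of Section 3.2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name spce_em, the dictionary representation of the non-zero entries of A, the random initialization and the fixed iteration count (used here in place of the relative-change stopping rule) are all assumptions. Every source node is assumed to have at least one out-link, as in the definition of S.

```python
import random


def spce_em(links, n_sources, n_dests, n_comms,
            alpha=1.5, beta=1.5, n_iter=100, seed=0):
    """MAP-EM sketch; links maps (s, d) to the weight A[s][d] > 0."""
    rng = random.Random(seed)

    def normalized(row):
        total = sum(row)
        return [x / total for x in row]

    # Initial estimates: gamma[s][k] = P(C=k|S=s), phi[k][d] = P(D=d|C=k).
    gamma = [normalized([rng.random() + 0.1 for _ in range(n_comms)])
             for _ in range(n_sources)]
    phi = [normalized([rng.random() + 0.1 for _ in range(n_dests)])
           for _ in range(n_comms)]

    for _ in range(n_iter):
        # Accumulators start at the pseudo-counts of the Dirichlet priors.
        g_num = [[alpha - 1.0] * n_comms for _ in range(n_sources)]
        p_num = [[beta - 1.0] * n_dests for _ in range(n_comms)]
        for (s, d), weight in links.items():
            # E step: posterior over communities for the link (s, d).
            joint = [gamma[s][k] * phi[k][d] for k in range(n_comms)]
            z = sum(joint)
            for k in range(n_comms):
                resp = weight * joint[k] / z
                g_num[s][k] += resp
                p_num[k][d] += resp
        # M step: dividing by the row sums reproduces the denominators
        # N_C (alpha - 1) + ... and |D| (beta - 1) + ... of the updates.
        gamma = [normalized(row) for row in g_num]
        phi = [normalized(row) for row in p_num]
    return gamma, phi
```

Calling the function with alpha = beta = 1 removes the pseudo-counts, which corresponds to the PHITS (maximum likelihood) updates.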

Network | Number of nodes | Number of communities | Number of links | Nodes with inlinks | Nodes with outlinks | Directed
--- | --- | --- | --- | --- | --- | ---
Cora | 2708 | 7 | 5429 | 1565 | 2222 | yes
Citeseer | 2994 | 5 | 4277 | 1760 | 2099 | yes
Wikipedia | 5360 | 7 | 41978 | 5360 | 5360 | no

Table 1 - Datasets

Modularity is a quality function that has been recently proposed by Newman [14]. It has been recognized as a good community structure evaluation measure, and has moreover been used in optimization frameworks as an objective function in order to identify community structures [8]. Modularity indicates how modular a community structure is. It is defined as the divergence from a random network model where each node has the same degree as in the original network. Formally, it is defined as [14]:

$$Q = \sum_{i,j \in V} \left[ \frac{A_{ij}}{2m} - \frac{d_i d_j}{4m^2} \right] \delta(c_i, c_j)$$

where A is the adjacency matrix, di is the degree of node i, ci is the community of node i, m is the total number of links, and δ is the Kronecker delta function. The above modularity definition applies when the network is undirected. In the directed case, the modularity is defined as [13]:

$$Q = \sum_{i,j \in V} \left[ \frac{A_{ij}}{m} - \frac{d_i^{in}\, d_j^{out}}{m^2} \right] \delta(c_i, c_j)$$

where $d_i^{in}$ is the indegree of node i, and $d_i^{out}$ is the outdegree of node i. Modularity takes values in the interval ]-1, 1[. A negative or zero value indicates the absence of community structure, whereas a value greater than or equal to 0.3 indicates the presence of a modular community structure [14]. In SPCE, we consider modularity as our model performance criterion. Therefore, we reduce the hyperparameter estimation task to a search for the values of α and β for which SPCE achieves the best modularity value. Although this method can be implemented using a two-dimensional grid search, we make a further simplification by considering α = β. Our experiments show that even with this simplification, SPCE gives very satisfactory results. Empirically, we found that the best value for α lies in the interval ]1, 2].
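The two modularity definitions and the one-dimensional search over α = β can be sketched in Python as follows. The function names, the dense list-of-lists adjacency matrix and the fit/score callables passed to select_alpha are hypothetical choices made for illustration; in particular, fit(alpha) stands for "run SPCE with α = β = alpha and return a hard community assignment", which is not implemented here.

```python
def modularity_undirected(adj, communities):
    """Q = sum_ij [A_ij / (2m) - d_i d_j / (4 m^2)] delta(c_i, c_j)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    m = sum(degree) / 2.0  # each undirected link appears twice in adj
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adj[i][j] / (2 * m) - degree[i] * degree[j] / (4 * m * m)
    return q


def modularity_directed(adj, communities):
    """Q = sum_ij [A_ij / m - d_i^in d_j^out / m^2] delta(c_i, c_j)."""
    n = len(adj)
    out_deg = [sum(adj[i]) for i in range(n)]
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    m = float(sum(out_deg))  # total number of directed links
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adj[i][j] / m - in_deg[i] * out_deg[j] / (m * m)
    return q


def select_alpha(fit, score, alphas):
    """Return (best_alpha, best_score) over a grid of candidate values.

    fit(alpha) -> community assignment; score(assignment) -> modularity.
    Both callables are placeholders supplied by the caller.
    """
    best_alpha, best_q = None, float("-inf")
    for alpha in alphas:
        q = score(fit(alpha))
        if q > best_q:
            best_alpha, best_q = alpha, q
    return best_alpha, best_q
```

Because the sum runs over all ordered pairs inside each community, the directed value does not depend on whether adj[i][j] is read as a link from i to j or from j to i.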

4. Experiments In this section, we present the experimental results we have obtained by comparing SPCE to PHITS.

4.1. Datasets Modularity is a widely accepted evaluation measure for community structure identification. However, since it is an internal measure, we believe that it is not fair to compare CSI algorithms using modularity only. Thus, in order to conduct an “objective” evaluation, we use three real datasets for which the community structure is known. The first dataset we used is Wikipedia [19]. It is a collection of web pages crawled from the free encyclopedia Wikipedia. Each webpage belongs to one of the following classes: Biology, Chemistry, Computer science, Geography, Mathematics, Physics, and Politics. Note that this network is available in the undirected version, i.e. the direction of links has been ignored. The other two datasets we used are the Citeseer and Cora networks. Nodes in these networks correspond to scientific publications, and the links correspond to citations. Each publication in Cora belongs to one of the following categories: Neural networks, genetic algorithms, reinforcement learning, learning theory, rule learning, probabilistic learning methods, and case based reasoning. The Citeseer network contains five communities: Agents, databases, information retrieval, machine learning, and human computer interaction. Since the Cora and Citeseer networks are directed, we additionally conducted experiments using two variants of them, namely the transpose and the undirected versions. Statistics on the three datasets are shown in Table 1.
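The transpose and undirected variants mentioned above can be derived from a directed link list as in the following Python sketch. Representing a network as a set of (source, destination) pairs, and the two function names, are assumptions made for illustration.

```python
def transpose(links):
    """Reverse the direction of every link."""
    return {(d, s) for s, d in links}


def undirected(links):
    """Ignore link direction by keeping both orientations of every link."""
    return set(links) | transpose(links)
```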

4.2. Clustering evaluation Here we evaluate the ability of SPCE and PHITS to identify the community structure of a network. The community structure found by each method is compared to the ground truth classification. We use three evaluation measures for comparison: the normalized mutual information (NMI) [17], F-measure, and modularity.
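For readers unfamiliar with NMI, a minimal Python sketch is given below. It assumes the geometric-mean normalization I(A;B) / sqrt(H(A) H(B)); the text does not state which normalization was used, so this choice, like the function name nmi, is an assumption.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two hard partitions."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label frequencies.
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in count_ab.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a single-cluster partition carries no information
    return mi / math.sqrt(h_a * h_b)
```

The measure is invariant to a relabeling of the communities, which is why it can compare a detected partition with the ground truth classes directly.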

Network | NMI (KM) | NMI (PHITS) | NMI (SPCE) | F-measure (KM) | F-measure (PHITS) | F-measure (SPCE) | Modularity (KM) | Modularity (PHITS) | Modularity (SPCE)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Cora (O) | 0.21 | 0.04 | 0.25 | 0.42 | 0.24 | 0.46 | 0.15 | 0.13 | 0.26
Cora (T) | 0.27 | 0.07 | 0.32 | 0.47 | 0.29 | 0.51 | 0.28 | 0.20 | 0.39
Cora (U) | 0.31 | 0.09 | 0.31 | 0.49 | 0.32 | 0.50 | 0.52 | 0.42 | 0.72
Citeseer (O) | 0.12 | 0.01 | 0.12 | 0.41 | 0.26 | 0.37 | 0.10 | 0.13 | 0.25
Citeseer (T) | 0.12 | 0.03 | 0.18 | 0.38 | 0.29 | 0.41 | 0.14 | 0.16 | 0.33
Citeseer (U) | 0.18 | 0.02 | 0.16 | 0.43 | 0.28 | 0.41 | 0.38 | 0.34 | 0.71
Wikipedia (U) | 0.54 | 0.52 | 0.63 | 0.67 | 0.68 | 0.77 | 0.60 | 0.62 | 0.67

Table 2 – Clustering results (O: Original, T: Transpose, U: Undirected)

Since both PHITS and SPCE return soft clusters, it is necessary to transform their partitions into hard ones because, in the ground truth classification, each node belongs to only one community. This transformation is made by assigning each node to its most probable community, i.e.

$$\mathrm{Community}(X) = \arg\max_{C} P(C \mid X)$$
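The conversion from soft to hard communities amounts to a row-wise argmax, as in this small Python sketch. The function name and the layout membership[x][k] = P(C=k | X=x) are assumptions; ties are broken in favor of the lowest community index.

```python
def hard_assignment(membership):
    """Return the most probable community index for every node.

    membership[x][k] = P(C = k | X = x) (illustrative layout).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in membership]
```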

We report in Table 2 the results averaged over ten runs when applying PHITS and SPCE to the benchmark datasets. We also include the results of K-means (KM), which is a baseline algorithm for community structure identification. For each run, the number of communities is fixed to the number of classes in the dataset. As Table 2 shows, SPCE significantly outperforms PHITS on all datasets. This difference is more pronounced on the Cora and Citeseer networks, where the improvement is up to 100% in terms of NMI and modularity. Results of PHITS are quite poor on Citeseer and Cora, while they are satisfactory on Wikipedia. According to modularity, SPCE is always better than K-means. However, the latter achieves slightly better results than SPCE in terms of NMI and F-measure on the original and undirected versions of Citeseer.

4.3. Smoothing parameters impact To study how smoothing parameters affect SPCE, we plot in Figure 2 the modularity obtained on each dataset for different values of α (with α = β). Recall that SPCE with α = 1 is equivalent to PHITS. From Figure 2, we observe that smoothing always increases modularity compared to the case where no smoothing is performed (i.e. α = 1). This increase is more pronounced on Cora and Citeseer than on Wikipedia. On Cora and Citeseer, the best modularity value is achieved for α ∈ ]1, 2]. For Wikipedia, SPCE does not seem to be very sensitive to α in the interval ]1, 6.5]. Modularity decreases, however, when α > 6.5.

5. Discussion It is quite surprising to see the poor performance of PHITS. In our early experiments, we suspected a problem in our implementation of PHITS. To rule this out, we tried an equivalent formulation of PLSA (i.e. of PHITS) called Non-negative Matrix Factorization (NMF) [7],[9],[12]. However, NMF also achieved poor results, similar to those of PHITS.

[Figure 2: three panels plotting Modularity (y-axis) against Alpha (x-axis). Panel (a): Cora (O), Cora (T), Cora (U), for Alpha from 1 to 3. Panel (b): Citeseer (O), Citeseer (T), Citeseer (U), for Alpha from 1 to 3. Panel (c): Wikipedia (U), for Alpha from 1 to 10.]

Figure 2 – Modularity of SPCE with regard to smoothing parameters on Cora (a), Citeseer (b), and Wikipedia (c)

Whereas the results of K-means are close to those of SPCE in terms of NMI and F-measure, SPCE significantly outperforms K-means according to modularity (cf. Table 2). This is an interesting finding with two implications. On the one hand, it shows that K-means and SPCE find different community structures. On the other hand, it suggests that it is not sufficient to rely only on modularity to compare CSI techniques. Results in Figure 2 might indicate that the necessary amount of smoothing depends directly on the number of links in the network (i.e. on the network connectivity degree). Indeed, the Wikipedia network contains a large number of links, which makes smoothing less crucial than in Cora and Citeseer, which are very sparse networks.

6. Conclusion In this paper, we proposed SPCE, a generative model for community structure analysis. The main idea of SPCE is the use of smoothing to overcome the sparsity of network data. Furthermore, we showed that the smoothing parameters of our model can be estimated using the modularity criterion. Experiments conducted on three real networks indicate that SPCE significantly outperforms PHITS. We think that SPCE is an interesting solution to the problem of community structure identification for the following reasons: it finds coherent community structures, it identifies overlapping communities, it takes as input only the number of communities to identify and not their size, it detects communities in directed and undirected networks, it provides a two-view community structure in directed networks, and it is able to analyze weighted and unweighted networks. As future work, we plan to perform simulations by applying SPCE to generated artificial networks having different properties. These simulations may help us to understand the behavior of our model, and to determine situations in which smoothing is crucial. Furthermore, we plan to test our model on very large networks in order to evaluate its scalability. Last but not least, we believe that it is necessary to compare SPCE not only to PHITS and K-means, as we did in this paper, but also to other approaches for community structure identification such as normalized cut, random-walk techniques, modularity optimization, hierarchical methods, etc.

7. Acknowledgment This work was supported in part by the INRIA under Grant 200601075.

8. References
[1] A. Agresti. An Introduction to Categorical Data Analysis, 2nd Edition. Wiley, New York, 2007.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[3] T. Brants. Test Data Likelihood for PLSA Models. Information Retrieval, 8(2): 181–196, 2005.
[4] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70:066111, 2004.
[5] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the 17th ICML, 2000.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39: 1–38, 1977.
[7] C. Ding, T. Li, and W. Peng. On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing. Comput. Stat. Data Anal., 52(8): 3913–3927, 2008.
[8] S. Fortunato and C. Castellano. Community Structure in Graphs. Encyclopedia of Complexity and System Science. Springer, 2008.
[9] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In Proc. of the 28th annual intl. ACM SIGIR conf., Brazil, ACM, 2005.
[10] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of the 15th UAI Conference, 1999.
[11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604–632, 1999.
[12] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In Proc. of NIPS, pages 556–562, 2000.
[13] E. A. Leicht and M. E. J. Newman. Community structure in directed networks. Physical Review Letters, 100:118703, 2008.
[14] M. E. J. Newman. Modularity and community structure in networks. PNAS, 103:8577, 2006.
[15] P. Pons and M. Latapy. Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 10(2): 191–218, 2006.
[16] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. In Proc. of CVPR '97, IEEE Computer Society, 1997.
[17] A. Strehl. Relationship-based clustering and cluster ensembles for high-dimensional data mining. Ph.D. Thesis, The University of Texas at Austin, USA, 2002.
[18] A. Utsugi. Hyperparameter selection for self-organizing maps. Neural Comput., 9(3): 623–635, 1997.
[19] http://www.mpi-inf.mpg.de/~angelova/DataSets/
