Web Data Clustering using FCM and Proximity Hints

Deepak Agrawal∗, Chandan Singh∗, Abhishek Dayal, B. Biswas∗, K.K. Shukla∗



November 24, 2007

Abstract

Clustering is one of the major tasks in web mining. Due to the high dimensionality of web pages, clustering them is always a challenging task. In this study, Fuzzy C-means clustering with proximity hints (P-FCM) is applied to web data (pages) for clustering. Proximity hints are provided on the basis of hyperlink structure and textual similarity. The results are compared with FCM clustering, and both are correlated with human clustering.

Keywords: Search engines; Fuzzy logic; Fuzzy C-means algorithm; Similarity; Human-computer interaction

1 Introduction

With the Web evolving at an astonishing pace, standard web search services play an important role as tools for the Internet community, even though they suffer from a certain difficulty: search engines fail to make a clear distinction between items of varying relevance when presenting search results to users. The intelligent web will need new tools and infrastructure components in order to create an environment that serves its users wisely. The use of soft computing tools, including fuzzy logic, in data mining has been adequately reported in [1]. On the web, objects and data are used by a variety of different users, and the clustering done by the system may not suit the requirements of each of them. So the participation of the users is needed to obtain the "correct" results. The participation of the users is done through proximity hints. The concept of proximity between two objects (patterns) is one of the fundamental notions of high practical relevance. Formally, given two patterns a and b, their proximity p(a, b) is a mapping to the unit interval that satisfies the following two conditions [3]:

1. Symmetry: p(a, b) = p(b, a)
2. Reflexivity: p(a, a) = 1

We define an extension of the fuzzy C-means algorithm, namely proximity fuzzy C-means (P-FCM), incorporating a measure of similarity or dissimilarity as proximity hints on the clusters. We present the theoretical framework of this extension and then observe, through a suite of web-based experiments, how significant the impact of proximity hints is during P-FCM operation.

∗ Department of Computer Science and Engineering, Institute of Technology, Banaras Hindu University, Varanasi, India. [email protected]
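As a minimal illustration, the two conditions can be checked mechanically for any candidate proximity measure; the Jaccard-style measure on binary keyword vectors below is a hypothetical example, not the measure used later in the paper:

```python
# Sketch: a proximity function on binary feature vectors, plus a check of
# the two axioms (symmetry and reflexivity). The Jaccard-style measure is
# an illustrative assumption, not the paper's measure.

def proximity(a, b):
    """Proximity of two binary feature vectors, mapped into [0, 1]."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

a = [1, 0, 1, 1]
b = [1, 1, 0, 1]

assert proximity(a, b) == proximity(b, a)   # symmetry: p(a, b) = p(b, a)
assert proximity(a, a) == 1.0               # reflexivity: p(a, a) = 1
```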

2 P-FCM

The algorithm consists of two main phases realized in an interleaved manner. The first phase is data driven and is primarily the standard FCM applied to the patterns. The second concerns an accommodation of the proximity-based hints and involves some gradient-oriented learning.

2.1 FCM

The detailed algorithm of FCM was proposed in [2]. The aim of FCM is to find cluster centers (centroids) that minimize a dissimilarity function. The formulation of FCM is as follows [2]. To accommodate fuzzy partitioning, the membership matrix U is randomly initialized subject to the constraint

\sum_{i=1}^{c} u_{ik} = 1, \quad \forall k    (1)

The dissimilarity function used in FCM is

\min_{(U,V)} J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m D_{ik}^2    (2)

where u_{ik} lies between 0 and 1 and D_{ik} is the Euclidean distance between the i-th centroid (v_i) and the k-th data point:

D_{ik}^2 = \|X_k - V_i\|_A^2, \qquad \|X\|_A = \sqrt{\langle X, X \rangle_A} = \sqrt{X^T A X}    (3)

The centroids and memberships are updated according to

V_i = \left( \sum_{k=1}^{n} u_{ik}^m X_k \right) \Big/ \left( \sum_{k=1}^{n} u_{ik}^m \right), \quad \forall i    (4)

u_{ik} = 1 \Big/ \sum_{j=1}^{c} \left( \frac{D_{ik}}{D_{jk}} \right)^{2/(m-1)}, \quad \forall i, k    (5)

Here the degree of fuzzification satisfies m ≥ 1. The algorithm proceeds in the following steps:

1. Randomly initialize the membership matrix U subject to the constraint in Equation 1.
2. Calculate the centroids v_i using Equation 4.
3. Compute the dissimilarity between centroids and data points using Equation 3 and evaluate the objective in Equation 2; stop if its improvement over the previous iteration is below a threshold.
4. Compute a new U using Equation 5 and go to Step 2.
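The four steps above can be sketched as a short NumPy implementation (a minimal sketch assuming the Euclidean norm, i.e. A taken as the identity in Equation 3; the tolerance, iteration cap, and random seed are our own choices):

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Fuzzy C-means. X: (n, d) data, c: number of clusters, m: fuzzifier."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: random membership matrix U (c, n) whose columns sum to 1 (Eq. 1).
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    J_prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: centroids V_i = sum_k u_ik^m X_k / sum_k u_ik^m  (Eq. 4)
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 3: squared distances D_ik^2 (Eq. 3) and objective J_m (Eq. 2)
        D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        J = (Um * D2).sum()
        if abs(J_prev - J) < tol:
            break
        J_prev = J
        # Step 4: u_ik = 1 / sum_j (D_ik/D_jk)^(2/(m-1))  (Eq. 5); with squared
        # distances the exponent becomes 1/(m-1).
        D2 = np.fmax(D2, 1e-12)  # guard against zero distance
        U = 1.0 / (D2 ** (1.0 / (m - 1)) * (1.0 / D2 ** (1.0 / (m - 1))).sum(axis=0))
    return U, V
```

Each column of the returned U sums to 1, so a hard assignment is obtained by taking the argmax over the cluster index.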

2.2 Proximity-based optimization

The inner loop is based on the concept of proximity, explained in detail in [3]. In the inner loop, the partition produced by the outer loop is transformed into its proximity counterpart, governed by the expression

\hat{P}[k_1, k_2] = \sum_{i=1}^{c} \min(u_{ik_1}, u_{ik_2})
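In code, the induced proximity matrix follows directly from the partition matrix (a sketch; U is the c × N membership matrix produced by the outer FCM loop):

```python
import numpy as np

def induced_proximity(U):
    """P_hat[k1, k2] = sum_i min(u_ik1, u_ik2) for a partition matrix U (c, N)."""
    # pairwise elementwise minima, summed over the cluster index i
    return np.minimum(U[:, :, None], U[:, None, :]).sum(axis=0)

U = np.array([[0.7, 0.2, 0.5],
              [0.3, 0.8, 0.5]])
P = induced_proximity(U)
assert np.allclose(np.diag(P), 1.0)  # columns of U sum to 1, so P_hat[k, k] = 1
assert np.allclose(P, P.T)           # symmetry
```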

Owing to the well-known properties of the proximity matrix, we observe that for k_1 = k_2 the value of \hat{P}[k_1, k_2] equals 1. After initializing the number of clusters, the fuzzification coefficient, etc., and taking care of other factors [3], the performance index is formulated as

V = \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} \left( \hat{P}[k_1, k_2] - p[k_1, k_2] \right)^2 b[k_1, k_2] \, \|X_{k_1} - X_{k_2}\|

where p[k_1, k_2] are the user-supplied proximity hints and b[k_1, k_2] is a Boolean flag equal to 1 when a hint has been provided for the pair (k_1, k_2) and 0 otherwise.


The optimization of V with respect to the partition matrix does not lend itself to a closed-form expression and requires some iterative optimization. The gradient-based scheme comes in a well-known format [3]:

U_{st}(\text{iter} + 1) = \left[ U_{st}(\text{iter}) - \alpha \frac{\partial V}{\partial U_{st}(\text{iter})} \right]_{[0,1]}, \quad s = 1, 2, \ldots, c; \; t = 1, 2, \ldots, N

where [\,\cdot\,]_{[0,1]} indicates that the results are clipped to the unit interval, \alpha stands for a positive learning rate, and successive iterations are denoted by "iter". The detailed computation of the above derivative is straightforward. Taking the derivative with respect to U_{st}, s = 1, 2, \ldots, c, t = 1, 2, \ldots, N, one has

\frac{\partial V}{\partial U_{st}} = \frac{\partial}{\partial U_{st}} \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} \left( \sum_{i=1}^{c} (U_{ik_1} \wedge U_{ik_2}) - p[k_1, k_2] \right)^2 b[k_1, k_2] \, \|X_{k_1} - X_{k_2}\|

The inner derivative assumes binary values, depending on whether U_{st} attains the minimum in the corresponding \wedge (min) operation.
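A single gradient step of this inner loop might be sketched as follows (for clarity the sketch omits the distance weighting ‖X_{k1} − X_{k2}‖ from the performance index; the learning rate and the explicit double loop are our own choices):

```python
import numpy as np

def proximity_gradient_step(U, p, b, alpha=0.01):
    """One clipped gradient step on V = sum_{k1,k2} (P_hat - p)^2 * b.

    U: (c, N) partition matrix; p: (N, N) proximity hints; b: (N, N) 0/1
    mask marking the pairs for which a hint was actually provided.
    Simplified sketch: the distance weighting ||X_k1 - X_k2|| is omitted.
    """
    c, N = U.shape
    # induced proximity P_hat[k1, k2] = sum_i min(U_ik1, U_ik2)
    P_hat = np.minimum(U[:, :, None], U[:, None, :]).sum(axis=0)
    E = 2.0 * (P_hat - p) * b            # outer derivative of the squared error
    grad = np.zeros_like(U)
    for s in range(c):
        for t in range(N):
            # d min(U_st, U_sk)/dU_st is 1 exactly when U_st attains the minimum
            active = (U[s, t] <= U[s, :]).astype(float)
            # pairs (t, k) and (k, t) both contribute; E is symmetric
            grad[s, t] = 2.0 * (E[t, :] * active).sum()
    return np.clip(U - alpha * grad, 0.0, 1.0)   # clip to the unit interval
```

With a small positive learning rate, one such step moves the induced proximities toward the hinted values and so decreases V.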

3 Web Data

Textual information, content, tags, meta tags, etc. are common features in web space that are used for classification/clustering. Search engines like http://search.yahoo.com/ and http://www.altravista.com/ are based on this approach. Other dimensions of Web data through which a web document can be characterized are the layout structure of the page [11] and the hyperlink structure [9]. Hyperlink structure is one of the prime focuses of research, and contemporary search engines exploit it in one form or another; the SALSA [12] and page ranking [13] algorithms are examples. Usually, the analysis of selected Web pages comprises the extraction and examination of words that describe each page according to content and context relevance. In this approach, the feature space is built from a collection of keywords. These keywords are fixed and represent domain knowledge. The first step of the process is to parse a set of Web pages and extract the knowledge conveyed by the prefixed keywords and links. Each Web page is translated into a sequence of data representing the presence of certain characteristics in the document: keywords, hyperlinks, and images (in short, the features of the given data set). Formally, we build a normalized vector of data which represents the probability that each selected feature appears in that Web page. Fig. 1 explains the procedure.
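As an illustration of this translation step, a page can be mapped to a normalized feature vector roughly as follows (the keyword list, the crude tag stripping, and the counting scheme are our own illustrative assumptions, not the paper's exact procedure):

```python
import re

# Hypothetical fixed keyword list encoding domain knowledge (illustrative only).
KEYWORDS = ["software", "image", "gift", "photo", "music"]

def feature_vector(html, keywords=KEYWORDS):
    """Normalized vector of keyword/link/image counts for one page."""
    text = re.sub(r"<[^>]+>", " ", html).lower()   # crude tag stripping
    counts = [len(re.findall(r"\b%s\b" % k, text)) for k in keywords]
    counts.append(html.lower().count("<a "))       # hyperlinks
    counts.append(html.lower().count("<img"))      # images
    total = sum(counts)
    # normalize so entries behave like appearance probabilities
    return [c / total for c in counts] if total else counts

page = "<html><a href='x'>photo gift</a> <img src='y'> photo software</html>"
v = feature_vector(page)
assert abs(sum(v) - 1.0) < 1e-9
```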


Figure 1: Extraction of keywords

4 Proximity-based knowledge

One popular way of clustering data objects into subgroups is based on a proximity metric between objects, with the goal that objects within a subgroup are very similar and objects in different subgroups are less similar. In the web document clustering problem, textual information can be included to cluster the web documents better. Moreover, compared to printed literature, web documents reference each other more randomly; this is another reason to incorporate text information in order to regulate the influence of each document. We have experimented with an approach that (a) utilizes the entire text of a web document, not just the anchor text; (b) measures the textual similarity S_ij between two web documents i, j, instead of between the user query and the web document; and (c) uses S_ij as the strength of the hyperlink between web documents i, j. The key observation here is that if two web documents have very little text similarity, it is unlikely that they belong to the same topic, even if they are connected by a hyperlink; therefore S_ij properly gauges the importance of an individual hyperlink. We represent each web document as a vector in the vector space model of IR (Information Retrieval) and then compute the similarity between the vectors. The higher the similarity, the more likely the two documents deal with the same topic. For each element of the vector we use the standard tf.idf weighting, tf(i, j) * idf(i), where tf(i, j) is the term frequency of word i in document j (the number of occurrences of word i in document j) and idf(i) is the inverse document frequency of word i, defined as


Figure 2: Extraction of keywords

idf(i) = log( no. of total docs / no. of docs containing word i )

Some words appear too frequently in many documents; we assume these words are not very useful for identifying the documents, and the inverse document frequency effectively decreases their influence. Since the term-vector lengths of the documents vary, we use cosine normalization in computing similarity. That is, if x and y are the vectors of two documents d_1 and d_2, then the similarity between d_1 and d_2 is

S(d_1, d_2) = S(d_2, d_1) = \frac{\sum_i x_i y_i}{\|x\|_2 \|y\|_2}, \qquad \text{where } \|x\|_2 = \sqrt{\sum_i x_i^2}

The similarities between documents form the similarity matrix S. This similarity matrix can be used as the proximity matrix.
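Putting together the tf.idf weighting and the cosine-normalized similarity above, the similarity matrix S can be computed as follows (a minimal sketch with naive whitespace tokenization):

```python
import math
from collections import Counter

def similarity_matrix(docs):
    """tf.idf vectors + cosine similarity; returns S with S[i][j] = S(d_i, d_j)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(w for toks in tokenized for w in toks))
    # idf(i) = log(no. of total docs / no. of docs containing word i)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[w] * idf[w] for w in vocab])   # tf(i, j) * idf(i)
    def cos(x, y):
        # cosine normalization: sum_i x_i y_i / (||x||_2 * ||y||_2)
        num = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return num / (nx * ny) if nx and ny else 0.0
    return [[cos(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

docs = ["photo editing software", "free photo software", "music concert photography"]
S = similarity_matrix(docs)
# symmetric, and documents sharing terms score higher than unrelated ones
assert S[0][1] == S[1][0] and S[0][1] > S[0][2]
```

The resulting matrix S is symmetric with unit diagonal, exactly the properties required of a proximity matrix, so it can feed the P-FCM inner loop directly.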


5 Experiments and Results

The web pages were collected from the ODP (Open Directory Project, http://dmoz.org, informally known as Dmoz), the most widely distributed database of Web content, classified by a volunteer force of more than 8000 editors. Pages were taken from three categories:

1. Top: Computers: Software: Graphics: Image Manipulation (www.dmoz.org/Computers/Software/Graphics/Image Manipulation)
2. Top: Shopping: Gifts: Personalized: Photo Transfers (www.dmoz.org/Shopping/Gifts/Personalized/Photo Transfers)
3. Top: News: Media: Journalism: Photo journalism: Music photography (www.dmoz.org/News/Media/Journalism/Photojounalism/Music Photography)

Our test was performed on 30 web pages per category, 90 pages in total. For reference, we applied the standard FCM algorithm and partitioned these 90 pages into three clusters; the result is shown in Fig. 4. The result of the experiment with P-FCM is shown in Fig. 5. Further, for evaluation of the results, a different clustering was also performed, based on human observation: humans simply looked at each web page and decided, according to their requirements, which cluster the page should go to. The resulting membership function, based on the observations of 50 humans, is shown in Fig. 6.

This can be explained as follows. If the first page is considered, it is obvious that it should go to the first cluster (software), but it goes to the second cluster (photo) in both cases: visiting this page, we notice that many of its keywords are related to gifts and software, so after applying FCM it falls in the second cluster. When we apply P-FCM to this page, its membership in the second cluster decreases due to its low proximity to some pages belonging to that cluster. The second page, likewise, is not in the first cluster but in the third. Fig. 7 explains that it has high similarity with the 65th page (which is in the third cluster) and dissimilarity with the 4th page. For instance, the proximity value for pages 1 and 41 is equal to 0.9, which underlines that the pages are very similar. On the other hand, page 1 is very different from page 4, and this is reflected by a very low proximity value (e.g., 0.1) associated with this pair of pages (see Fig. 7). This user feedback, conveyed in terms of proximity values, has an impact on the previous clustering results (refer to Fig. 4).

Fig. 5 illustrates the results of the P-FCM. It is evident that some pages improve their membership in the right cluster; for instance, pages 5 and 37 are now in the right cluster with higher values. A similar explanation holds for the fourth page, and pages 5 to 30 fall in the first cluster. Pages 30 to 38 should fall in the second cluster, but they fall in the first because their proximity to the first cluster's pages is very high relative to the other pages. For the 33rd page, Fig. 8 shows that the page is software about playing cards, so it should come in the first cluster; its high proximity with the software pages clearly explains its presence there. The 38th page is in the second cluster under FCM clustering, but due to its higher proximity with the software pages, its membership in the first cluster becomes highest under P-FCM. This is a very good example of how proximity can change the membership of a web page across clusters. The same explanation applies to pages 52 to 60. Finally, the overall membership graphs differ to some extent between the two cases (FCM, P-FCM), which shows the importance of proximity in clustering web pages.

Figure 3: Results based on proximity alone

6 Conclusion

Taking the human clustering (Fig. 6) as the basis, the results of P-FCM showed a correlation value of 0.80, while FCM showed a value of 0.75, with respect to the human clustering. Even setting the statistical results aside, the clusters formed by the last method were closer to the real-life situation. So we can conclude, based on this study, that P-FCM performs better than FCM or proximity alone. The role of proximity-based clustering becomes crucial in cases where the original feature space does not fully capture the essence of the clustering problem; the case study of Web pages is an excellent example in this regard. While the proposed feature space addresses the textual content of the pages (in the form of a collection of keywords), the hypertext nature of the pages (including relevant information about layout, graphical content, density of associated links, etc.) is not included directly but comes in through the user's hints about degrees of proximity between the pages. Further, the results could be improved if the system could provide better proximity hints.

Figure 4: Results based on FCM alone

Figure 5: Results based on the P-FCM method

Figure 6: Results based on human observations

Figure 7: Similarity and dissimilarity between pages

Figure 8: The 33rd web page

References

[1] S. Mitra, S.K. Pal, P. Mitra, Data mining in soft computing framework: a survey, IEEE Trans. Neural Networks 13 (2002) 3-14.
[2] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[3] W. Pedrycz, V. Loia, S. Senatore, P-FCM: a proximity-based fuzzy clustering, Fuzzy Sets and Systems 148 (2004) 21-41.
[4] D. Arotaritei, S. Mitra, Web mining: a survey in the fuzzy framework, Fuzzy Sets and Systems 148 (2004) 5-19.
[5] X. He, H. Zha, C.H.Q. Ding, H.D. Simon, Web document clustering using hyperlink structures, Computational Statistics & Data Analysis 41 (2002) 19-45.
[6] K.-J. Kim, S.-B. Cho, Personalized mining of web documents using link structures and fuzzy concept networks, Applied Soft Computing 7 (1) (2007) 398-410.
[7] G. Peters, Some refinements of rough k-means clustering, Pattern Recognition 39 (8) (2006) 1481-1491.
[8] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[9] J. Kleinberg, Authoritative sources in a hyperlinked environment, IBM Research Report RJ 10076, 1997.
[10] L.A. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1) (1978) 3-28.
[11] V. Crescenzi, P. Merialdo, P. Missier, Clustering Web pages based on their structure, Data & Knowledge Engineering 54 (2005) 279-299.
[12] R. Lempel, S. Moran, The stochastic approach for link-structure analysis (SALSA) and the TKC effect, Computer Networks 33 (2000) 387-401.
[13] D. Dhyani, W.K. Ng, S.S. Bhowmick, A survey of Web metrics, ACM Computing Surveys 34 (4) (2002).
[14] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, retrieved from http://google.stanford.edu/, 2004.
