A Comparison of Dimensionality Reduction Techniques for Web Structure Mining
Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles
Institut de Recherche en Informatique de Toulouse
Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex
{chikhi,rothenburger,aussenac}@irit.fr
Abstract
In many domains, dimensionality reduction techniques have been shown to be very effective for elucidating the underlying semantics of data. In this paper we investigate the use of several dimensionality reduction techniques (DRTs) to extract the implicit structures hidden in the web hyperlink connectivity. We apply and compare four DRTs, namely Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA) and Random Projection (RP). Experiments conducted on three datasets support the following conclusions: NMF outperforms PCA and ICA in terms of stability and interpretability of the discovered structures; and the well-known WebKb dataset, used in a large number of works on hyperlink connectivity analysis, appears ill-suited for this task, so we suggest using the more recent Wikipedia dataset instead.
1. Introduction Web mining technology provides techniques to extract knowledge from web data. Researchers on web mining have already distinguished three main areas, namely web content mining, web usage mining and web structure mining [1]. In this paper we focus on web structure mining (WSM). WSM deals with the discovery of structures from the web topology. It offers valuable information for various applications. HITS [2] is one of the most influential algorithms for WSM. In this paper we show the equivalence between HITS and Principal Component Analysis, a well-known technique for dimensionality reduction. In many fields, it has been established that dimensionality reduction techniques (DRTs) are not only useful for lowering the size of data but are also very effective for revealing the hidden structures underlying the data. However, according to the state of the art on web structure mining, there is no work investigating the effectiveness of different DRTs in the WSM context. Therefore, in this paper we give experimental results on using four DRTs, namely, Principal Component Analysis (PCA), Independent
Component Analysis (ICA), Nonnegative Matrix Factorization (NMF) and Random Projection (RP) for web structure mining. Experiments were conducted on three datasets. The remainder of this paper is organized as follows. In section 2 we review the HITS algorithm and show its relationship to dimensionality reduction. In section 3 we present the four dimensionality reduction techniques used in our experiments. In section 4, we describe the experimental methodology including datasets. In section 5 we report the obtained results that we discuss in section 6 before drawing a conclusion and some perspectives in section 7.
2. Related Work
Through an original algorithm for hyperlink analysis called HITS (Hypertext Induced Topic Search), Kleinberg introduced the concepts of hubs (pages that refer to many pages) and authorities (pages that are referred to by many pages). Unlike the PageRank [3] algorithm, HITS assigns two scores to every page in the web graph formed by web pages and their hyperlinks: an authority score (x) and a hub score (y). Scores are computed by applying the two following rules until convergence:
x(u) ← Σ_{v ∈ In(u)} y(v)

y(u) ← Σ_{v ∈ Out(u)} x(v)
Mathematically, the HITS algorithm can be rewritten in terms of matrices as [2]:
x ← AᵀAx
y ← AAᵀy
where A is the adjacency matrix of the hyperlink graph. Running multiple iterations of HITS is thus equivalent to applying the power iteration method to the matrices AᵀA and AAᵀ. The power iteration method is a well-known technique for computing the dominant eigenvector of a matrix. Hence, HITS is nothing but a non-centered principal component analysis of the matrices A and Aᵀ, computing respectively the authority and hub vectors [4].
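As a concrete illustration, the power-iteration view of HITS sketched above can be written in a few lines of NumPy. This is a minimal sketch; the toy adjacency matrix below is illustrative and not drawn from the paper's datasets.

```python
import numpy as np

def hits(A, iters=100):
    """HITS by power iteration, where A[u, v] = 1 iff page u links to page v."""
    n = A.shape[0]
    x = np.ones(n)  # authority scores
    y = np.ones(n)  # hub scores
    for _ in range(iters):
        x = A.T @ y  # x(u) <- sum of hub scores of pages pointing to u
        y = A @ x    # y(u) <- sum of authority scores of pages u points to
        x /= np.linalg.norm(x)  # normalization keeps the iteration bounded
        y /= np.linalg.norm(y)
    return x, y

# Toy graph: pages 0 and 1 both link to page 2.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
x, y = hits(A)
# x converges to the dominant eigenvector of A.T @ A (page 2 is the authority);
# y converges to that of A @ A.T (pages 0 and 1 are the hubs).
```

The two normalized iterates are exactly the power method applied to AᵀA and AAᵀ, which is why HITS coincides with a non-centered PCA of A.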
3. Web Structure Analysis by Dimensionality Reduction Techniques

In many data mining applications such as text mining, interesting results have been established on the use of DRTs. It has been shown that DRTs are not only useful for lowering the size of the data, but are also able to extract the underlying semantics of the data [5]. For instance, the LSI (Latent Semantic Indexing) method, which is based on a technique similar to PCA, has been shown to be very effective for grouping words and documents semantically according to a small number of latent components. By analogy with these findings, we believe the same techniques are appropriate for analyzing the web structure. Thus, we propose to generalize the HITS formulation to the more general case where any DRT, including PCA, is employed. To the best of our knowledge, the ICA, NMF and RP methods have never been applied to the WSM task.

3.1. Principal Component Analysis
Principal component analysis is certainly the most widely used method for multivariate statistical analysis. PCA is computed by determining the eigenvectors and eigenvalues of the covariance matrix [6]. The eigenvectors of the covariance matrix represent the axes of maximum variance.

3.2. Independent Component Analysis
The purpose of ICA is to linearly transform the original data into components which are as statistically independent as possible [7]. This is the main difference between ICA and its cousin PCA: while PCA seeks uncorrelated (i.e. orthogonal) components, ICA seeks independent components. Formally, given a data matrix A of size m × n, ICA estimates two matrices S and C such that A = SC, where S ∈ ℝ^{m×k} is called the mixing matrix and C ∈ ℝ^{k×n} is the independent components matrix.

3.3. Nonnegative Matrix Factorization
In their seminal work, Lee and Seung [8] proposed a useful technique for the approximation of high-dimensional data composed of nonnegative components. Given a nonnegative matrix A of size m × n, NMF finds two nonnegative matrices W and H such that A ≈ WHᵀ, where W ∈ ℝ^{m×k}, H ∈ ℝ^{n×k} and k ≪ min(m, n). Basically, W contains the basis vectors used for the approximation of the original matrix A, whereas H contains the coefficients used to add up combinations of the basis vectors in W.

3.4. Random Projection
Random Projection (RP) is a theoretically well-founded method for dimensionality reduction. Its core idea is that by multiplying a data matrix A of size m × n with a carefully chosen projection matrix R of size d × m, where d ≪ min(m, n), the resulting matrix D = RA approximately preserves the original distances between the data points [9].

4. Experimental Setup
In our study, we conducted two types of experiments. On the one hand, we used two collections of pre-classified datasets in order to assess the clustering quality of the different DRTs. On the other hand, we performed an intuitive evaluation of the interpretability of the communities (i.e. groups of authoritative pages) identified by each DRT. For this latter test, we constructed a dataset by querying a search engine with a keyword. Our first dataset is a subset of the old WebKb [10] collection. It is composed of 4200 pages grouped into six classes. The second dataset consists of a recent collection of web pages crawled from the online encyclopedia Wikipedia [11]. It contains 5360 web pages covering seven topics. The last dataset was constructed in a fashion similar to the HITS approach. Precisely, we queried the Yahoo search engine with the keyword "Armstrong" to build the root set, composed of the first 200 returned results. Then, we computed the vicinity graph by including pages linking to or linked from pages in the root set. In the end, we obtained a dataset composed of 3270 pages interconnected by about 5000 links. For the WebKb and Wikipedia datasets, once a DRT is applied, the spherical k-means algorithm is run on the authority matrix to construct the final clusters.

5. Results
In Figure 1a we report the clustering accuracy on the WebKb dataset. We observe that the four methods give essentially the same results: accuracy does not change significantly, ranging between 0.32 and 0.36, and none of the four DRTs is affected by the number of reduced dimensions. The same observations hold for Figure 1b, except that the NMI (Normalized Mutual Information [12]) values are very low and close to 0. From Figures 1c and 1d we first notice the poor results of RP compared to the three other methods, although the performance of RP tends to increase with the number of projections. We also observe that for a small number of factors (<10), PCA, ICA and NMF have approximately the same performances.

Figure 1 – Experimental results with the WebKb and Wikipedia datasets

The three methods reach their best performances with a number of factors equal to 7. Naturally, this number corresponds to the number of categories in the Wikipedia collection. ICA has a slight advantage over PCA, but the difference is not significant. It is also noticeable that once the number of natural categories in the dataset (i.e. 7) is exceeded, the performances of PCA and ICA deteriorate significantly, whereas NMF is very stable and is not affected by the number of factors. We reduced the Armstrong dataset down to 10 dimensions using the four dimensionality reduction techniques PCA, ICA, NMF and RP. We report in Table 1 the results of NMF and PCA; since the results of RP and ICA were poor, we do not report them.
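For reference, the NMI measure used above, with the geometric-mean normalization of Strehl [12], can be computed as follows. This is a minimal sketch; the function name is ours, not from the paper.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two label assignments:
    NMI(A, B) = I(A; B) / sqrt(H(A) * H(B))."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    ua, ub = np.unique(a), np.unique(b)
    # Joint distribution over (cluster, class) pairs.
    p = np.array([[np.mean((a == i) & (b == j)) for j in ub] for i in ua])
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    # Mutual information, skipping empty cells.
    mi = sum(p[i, j] * np.log(p[i, j] / (pa[i] * pb[j]))
             for i in range(len(ua)) for j in range(len(ub)) if p[i, j] > 0)
    ha = -sum(q * np.log(q) for q in pa if q > 0)  # entropy of clustering A
    hb = -sum(q * np.log(q) for q in pb if q > 0)  # entropy of clustering B
    return mi / np.sqrt(ha * hb)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical up to relabeling: ~1.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent clusterings: 0.0
```

As noted in the discussion below, a random clustering yields an NMI close to 0 under this definition.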
6. Discussion
The experimental results obtained with the WebKb dataset are somewhat surprising. Indeed, the four DRTs have the same accuracy (Figure 1a), so no interesting conclusion can be drawn. Moreover, in the NMI graphs (Figure 1b), the normalized mutual information values are very low and close to 0 for all four methods. As indicated by Strehl [12], the NMI of a random clustering is close to 0. This means that the structures obtained from WebKb carry no interesting information. From these observations, we deduce that WebKb is not adapted for comparing the four dimensionality reduction techniques, and more generally may not be suited for the evaluation of web structure mining approaches. We explain the poor quality of the WebKb dataset by two arguments. First, most of the links in the WebKb dataset serve navigational purposes, and this kind of link has been shown in many studies to be useless for the analysis of hyperlink connectivity [2][13]. Second, the dataset's categorization does not reflect the real distribution of the web pages. For instance, there is no reason for a student to put on his personal page a link to other students' personal pages. The same holds for the project category, where links generally point to pages of persons involved in the project rather than to pages of other projects. Note that we performed tests on the WebKb dataset because it is, up to now, the dataset most used by the web structure mining research community.

Compared to PCA, ICA and NMF, the performance of random projection is very poor (Figures 1c and 1d), which indicates its inadequacy for capturing semantic structures from the web topology. Experiments on the Wikipedia dataset show a slight advantage of ICA over PCA, but it is not significant. In terms of stability, NMF is clearly the best: once the number of natural categories in the dataset (i.e. 7) is exceeded, the performances of PCA and ICA deteriorate significantly, while NMF is very stable and is not affected by the number of factors. The interpretability test revealed the poor quality of the communities (i.e. sets of pages sharing the same topic) found by RP and ICA. ICA's results thus suggest that there is no need to make the semantic structures independent. This may be due to the nature of the data in the Armstrong dataset, which follows a Gaussian distribution; ICA is known to be effective only when the data is non-Gaussian, whereas PCA is suited to Gaussian data. The interpretability of the NMF results (Table 1) is clearly better than that of PCA. This is because NMF represents web pages as an additive combination of the semantic components, whereas PCA defines web pages using both positive and negative coefficients.
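As context for RP's behavior, here is a minimal sketch of the projection scheme described in Section 3.4, using a Gaussian projection matrix (one common choice; the paper does not specify which R it used). All sizes and the random data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 1000, 40, 200          # original dim, number of points, reduced dim
A = rng.random((m, n))           # data points stored as the columns of A
R = rng.standard_normal((d, m)) / np.sqrt(d)  # Gaussian projection matrix
D = R @ A                        # reduced representation, size d x n

# Johnson-Lindenstrauss: pairwise distances are approximately preserved.
orig = np.linalg.norm(A[:, 0] - A[:, 1])
proj = np.linalg.norm(D[:, 0] - D[:, 1])
print(round(proj / orig, 2))     # close to 1
```

RP preserves geometry on average but imposes no semantic structure on the reduced axes, which is consistent with its weak clustering and interpretability results here.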
7. Conclusion and Future Work
In this study, we have compared four dimensionality reduction techniques for the task of web structure mining. The results show Nonnegative Matrix Factorization to be a promising approach for web structure analysis, given its superiority over the other methods. We therefore plan to focus on this technique by studying two aspects more closely. The first concerns the initialization step of the NMF algorithm, where the matrices W and H are currently filled with random positive entries; we believe that a good seed for the NMF algorithm would enhance the web structure discovery process. The second concerns the procedure for computing the matrices W and H. In this paper we employed the multiplicative update approach, but the literature offers many other
(a)

Factor 2 (Louis Armstrong)
0.056345 http://www.satchmo.net
0.049517 http://www.redhotjazz.com/louie.html
0.030964 http://www.npg.si.edu/exh/armstrong
0.023606 http://www.time.com/time/time100/artists/profile/
0.021322 http://www.pbs.org/jazz/biography/
0.010829 http://www.satchography.com
0.0072496 http://www.pbs.org/wnet/americanmasters/database/
0.0053421 http://www.cosmopolis.ch/cosmo19/armstrong.htm
0.0047692 http://en.wikipedia.org/wiki/Louis_Armstrong
0.0025327 http://www.imdb.com/name/nm0001918

Factor 4 (Lance Armstrong)
0.056664 http://www.lancearmstrong.com
0.030787 http://www.thepaceline.com
0.030623 http://www.livestrong.org
0.021534 http://team.discovery.com
0.01398 http://www.laf.org
0.013838 http://www.livestrong.org/site/c.jvKZLbMRIsG/b.594849/
0.0085421 http://www.imdb.com/name/nm0035790
0.0058614 http://www.britannica.com/eb/article-9343162/
0.005794 http://www.store-laf.org
0.0051103 http://www.kidzworld.com/article/

Factor 8 (Neil Armstrong)
0.016848 http://www.jsc.nasa.gov/Bios/htmlbios/armstrong-na.html
0.013813 http://www.hq.nasa.gov/office/pao/History/alsj/a11/
0.013438 http://www.timesonline.co.uk/article/0,,3-2384628,00.html
0.013438 http://www.snopes.com/quotes/mrgorsky.htm
0.013392 http://www.cincinnati.com/visitorsguide/stories/
0.013253 http://www.maniacworld.com/Apollo_11_2.htm
0.013209 http://www.cosmosmagazine.com/node/717
0.013209 http://www.newyorker.com/critics/books/?051003crbo_books
0.013209 http://www.nasa.gov/vision/space/features/
0.013209 http://www.maniacworld.com/Apollo_11_3.htm

(b)

Factor 4 – Negative end (Louis Armstrong)
-0.25732 http://www.redhotjazz.com/louie.html
-0.16213 http://www.npg.si.edu/exh/armstrong
-0.16159 http://www.time.com/time/time100/artists/profile/
-0.13561 http://www.satchmo.net
-0.12362 http://www.pbs.org/jazz/biography/
-0.071576 http://www.mediawiki.org
-0.069185 http://wikimediafoundation.org
-0.064839 http://www.satchography.com
-0.060775 http://wikimediafoundation.org/wiki/Fundraising
-0.044552 http://es.wikipedia.org/wiki/Neil_Armstrong

Factor 5 – Positive end (Lance Armstrong)
0.24646 http://www.livestrong.org
0.2271 http://www.thepaceline.com
0.22081 http://www.lancearmstrong.com
0.20462 http://www.mediawiki.org
0.18791 http://wikimediafoundation.org
0.18233 http://team.discovery.com
0.1011 http://en.wikipedia.org/wiki/Louis_Armstrong
0.098928 http://wikimediafoundation.org/wiki/Fundraising
0.085588 http://www.livestrong.org/site/c.jvKZLbMRIsG/b.594849/
0.072773 http://www.satchmo.net

Factor 5 – Negative end (Neil Armstrong)
-0.47625 http://www.jsc.nasa.gov/Bios/htmlbios/armstrong-na.html
-0.43356 http://starchild.gsfc.nasa.gov/docs/StarChild/
-0.18776 http://en.wikipedia.org/wiki/Neil_Armstrong
-0.13732 http://www.nasa.gov/centers/glenn/about/bios/neilabio.html
-0.10533 http://www.armstrongair.com
-0.10291 http://www.armstrong.com
-0.089615 http://www.armstrongpumps.com
-0.083854 http://www.armstrong.org
-0.080983 http://history1900s.about.com/od/armstrongneil/
-0.076381 http://www.hq.nasa.gov/office/pao/History/alsj/a11/
Table 1 – Three communities discovered by (a) NMF and (b) PCA from the Armstrong collection; lines in bold denote noise pages, i.e. pages irrelevant to the extracted community.

methods such as the divergence algorithm, the alternating least squares method and the probabilistic approach. We think it would be useful to explore the effectiveness of these variants for web structure mining. Our experiments have also shown that the Wikipedia dataset is better suited than the WebKb dataset for WSM assessment.
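For completeness, the multiplicative update rules of Lee and Seung [8], with the random nonnegative initialization discussed above, can be sketched as follows. This is a minimal sketch of the standard rules for A ≈ WH (equivalent to the paper's A ≈ WHᵀ with H transposed); the toy matrix is illustrative.

```python
import numpy as np

def nmf(A, k, iters=500, eps=1e-9, seed=0):
    """NMF via Lee-Seung multiplicative updates, minimizing ||A - W @ H||_F."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps  # random positive initialization,
    H = rng.random((k, n)) + eps  # as in the setup criticized above
    for _ in range(iters):
        # Multiplicative updates keep W and H nonnegative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy nonnegative matrix with an obvious rank-2 block structure.
A = np.array([[2., 2., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 3., 3.],
              [0., 0., 3., 3.]])
W, H = nmf(A, k=2)
print(round(np.linalg.norm(A - W @ H), 4))  # reconstruction error, small
```

Because both updates only multiply by nonnegative ratios, the quality of the random seed directly shapes which local minimum the factorization reaches, which is why initialization is singled out above as future work.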
Acknowledgment
This work was supported in part by the INRIA under Grant 200601075. We also thank Sylvie Viguier-Pla and Sébastien Déjean from the CNRS-LSP laboratory for their valuable advice.
References
[1] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
[2] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[3] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.
[4] M. Saerens, F. Fouss. HITS is Principal Components Analysis. In Proc. of the 2005 IEEE/WIC/ACM Intl. Conf. on Web Intelligence, pp. 782-785, 2005.
[5] X. Luo, A. N. Zincir-Heywood. Evaluation of Three Dimensionality Reduction Techniques for Document Classification. Proceedings of the IEEE CCECE'04, pp. 181-184, Canada, 2004.
[6] I. Jolliffe. Principal Component Analysis. Springer, 2002.
[7] A. Hyvärinen, J. Karhunen, E. Oja. Independent Component Analysis. Wiley, New York, 2001.
[8] D. Lee, H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, pp. 788-791, 1999.
[9] S. Vempala. The Random Projection Method. American Mathematical Society, 2004.
[10] http://www.cs.cmu.edu/~webkb/
[11] http://www.mpi-inf.mpg.de/~angelova/DataSets/
[12] A. Strehl. Relationship-based clustering and cluster ensembles for high-dimensional data mining. Ph.D. dissertation, The University of Texas at Austin, 2002.
[13] M. Fisher, R. Everson. When Are Links Useful? Experiments in Text Classification. Proceedings of the 25th European Conference on IR Research (ECIR 2003), LNCS 2633, Pisa, Italy, pp. 41-56, 2003.