A Comparison of Dimensionality Reduction Techniques for Web Structure Mining

Nacim Fateh Chikhi, Bernard Rothenburger, Nathalie Aussenac-Gilles
Institut de Recherche en Informatique de Toulouse
Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex
{chikhi,rothenburger,aussenac}@irit.fr

Abstract

In many domains, dimensionality reduction techniques have been shown to be very effective for elucidating the underlying semantics of data. In this paper we therefore investigate the use of various dimensionality reduction techniques (DRTs) to extract the implicit structures hidden in web hyperlink connectivity. We apply and compare four DRTs, namely Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA) and Random Projection (RP). Experiments conducted on three datasets allow us to assert the following: NMF outperforms PCA and ICA in terms of stability and interpretability of the discovered structures; and the well-known WebKb dataset, used in a large number of works on hyperlink connectivity analysis, appears ill-suited to this task, so we suggest using the recent Wikipedia dataset instead.

1. Introduction

Web mining technology provides techniques to extract knowledge from web data. Researchers in web mining distinguish three main areas, namely web content mining, web usage mining and web structure mining [1]. In this paper we focus on web structure mining (WSM). WSM deals with the discovery of structures from the web topology and offers valuable information for various applications. HITS [2] is one of the most influential algorithms for WSM. In this paper we show the equivalence between HITS and Principal Component Analysis, a well-known technique for dimensionality reduction. In many fields, it has been established that dimensionality reduction techniques (DRTs) are not only useful for reducing the size of data but are also very effective at revealing the hidden structures underlying the data. However, to the best of our knowledge, no prior work has investigated the effectiveness of different DRTs in the WSM context. In this paper we therefore report experimental results on using four DRTs, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA), Nonnegative Matrix Factorization (NMF) and Random Projection (RP), for web structure mining. Experiments were conducted on three datasets.

The remainder of this paper is organized as follows. In section 2 we review the HITS algorithm and show its relationship to dimensionality reduction. In section 3 we present the four dimensionality reduction techniques used in our experiments. In section 4 we describe the experimental methodology, including the datasets. In section 5 we report the obtained results, which we discuss in section 6 before drawing conclusions and perspectives in section 7.

2. Related Work

With an original algorithm for hyperlink analysis called HITS (Hypertext Induced Topic Search), Kleinberg [2] introduced the concepts of hubs (pages that refer to many pages) and authorities (pages that are referred to by many pages). Unlike the PageRank algorithm [3], which assigns a single score, HITS assigns two scores to every page in the web graph formed by web pages and their hyperlinks: an authority score (x) and a hub score (y). Scores are computed by applying the two following rules until convergence:

x(u) ← Σ_{v ∈ In(u)} y(v)

y(u) ← Σ_{v ∈ Out(u)} x(v)

Mathematically, the HITS algorithm can be rewritten in terms of matrices as [2]:

x ← AᵀAx

y ← AAᵀy

where A is the adjacency matrix constructed from the hyperlink graph. Run over multiple iterations, HITS is thus equivalent to applying the power iteration method to the matrices AᵀA and AAᵀ. The power iteration method is a well-known technique for computing the dominant eigenvector of a matrix. HITS is therefore nothing but a non-centered principal component analysis of the matrices A and Aᵀ, computing the authority and hub vectors respectively [4].
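As an illustration, here is a minimal sketch (ours, not the original implementation) of HITS as power iteration, assuming a 0/1 adjacency matrix A with A[i, j] = 1 when page i links to page j:

```python
import numpy as np

def hits(A, n_iter=100, tol=1e-9):
    """Power-iteration HITS on a 0/1 adjacency matrix A."""
    n = A.shape[0]
    x = np.ones(n)                     # authority scores
    y = np.ones(n)                     # hub scores
    for _ in range(n_iter):
        x_new = A.T @ y                # authorities: sum of in-linking hub scores
        y_new = A @ x_new              # hubs: sum of linked-to authority scores
        x_new /= np.linalg.norm(x_new)
        y_new /= np.linalg.norm(y_new)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y
```

At convergence, x is the dominant eigenvector of AᵀA and y that of AAᵀ, which is exactly the non-centered PCA reading of HITS.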

3. Web Structure Analysis by Dimensionality Reduction Techniques

In many data mining applications, such as text mining, interesting results have been obtained with DRTs. It has been shown that DRTs are not only useful for reducing the size of the data, but are also able to extract its underlying semantics [5]. For instance, the LSI (Latent Semantic Indexing) method, which is based on a technique similar to PCA, has proved very effective at grouping words and documents semantically along a small number of latent components. By analogy with these findings, we expect the same techniques to be appropriate for analyzing the web structure. We therefore generalize the HITS formulation to the case where an arbitrary DRT, not only PCA, is employed. To the best of our knowledge, the ICA, NMF and RP methods have never before been applied to the WSM task.

3.1. Principal Component Analysis

Principal component analysis is arguably the most widely used method for multivariate statistical analysis. PCA is computed by determining the eigenvectors and eigenvalues of the data covariance matrix [6]; the eigenvectors represent the axes of maximum variance.

3.2. Independent Component Analysis

The purpose of ICA is to linearly transform the original data into components that are as statistically independent as possible [7]. Herein lies the main difference between ICA and its cousin PCA: while PCA looks for uncorrelated (i.e. orthogonal) components, ICA seeks independent ones. Formally, given a data matrix A of size m × n, ICA estimates two matrices S and C such that A = SC, where S ∈ ℝ^(m×k) is called the mixing matrix and C ∈ ℝ^(k×n) is the independent components matrix.

3.3. Nonnegative Matrix Factorization

In their seminal work, Lee and Seung [8] proposed a useful technique for approximating high-dimensional data made up of nonnegative components. Given a nonnegative matrix A of size m × n, NMF finds two nonnegative matrices W and H such that A ≈ WHᵀ, where W ∈ ℝ^(m×k), H ∈ ℝ^(n×k) and k ≪ min(m, n). Basically, W contains the basis vectors used to approximate the original matrix A, whereas H contains the coefficients used to form additive combinations of the basis vectors in W.

3.4. Random Projection

Random Projection (RP) is a theoretically well-founded method for dimensionality reduction. Its core idea is that by multiplying a data matrix A of size m × n by a carefully chosen projection matrix R of size d × m, where d ≪ min(m, n), the resulting matrix D = RA approximately preserves the original distances between the data points [9].
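To make the setup concrete, the following is a hedged sketch (ours, not the authors' code) of how the four DRTs could be applied to the authority representation of pages, i.e. pages described by their in-links (the columns of A). It assumes scikit-learn, with TruncatedSVD standing in for the non-centered PCA of the HITS equivalence:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, FastICA, NMF
from sklearn.random_projection import GaussianRandomProjection

k = 7                                                 # number of latent factors
A = (np.random.rand(500, 500) < 0.01).astype(float)  # stand-in adjacency matrix
X = A.T                                               # rows = pages described by in-links

pca_auth = TruncatedSVD(n_components=k).fit_transform(X)        # non-centered PCA
ica_auth = FastICA(n_components=k).fit_transform(X)             # independent factors
nmf_auth = NMF(n_components=k, init='random').fit_transform(X)  # additive parts
rp_auth = GaussianRandomProjection(n_components=k).fit_transform(X)  # random map
```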

4. Experimental Setup

In our study, we conducted two types of experiments. On the one hand, we used two collections of pre-classified datasets in order to assess the clustering quality of the different DRTs. On the other hand, we performed an intuitive evaluation of the interpretability of the communities (i.e. groups of authoritative pages) identified by each DRT. For this latter test, we constructed a dataset by querying a search engine with a keyword.

Our first dataset is a subset of the older WebKb collection [10]. It is composed of 4200 pages grouped into six classes. The second dataset consists of a recent collection of web pages crawled from the online encyclopedia Wikipedia [11]; it contains 5360 web pages covering seven topics. The last dataset was constructed in a fashion similar to the HITS approach: we queried the Yahoo search engine with the keyword "Armstrong" to build the root set, composed of the first 200 returned results, and then computed the vicinity graph by including pages linking to or linked from pages in the root set. In the end, we obtained a dataset of 3270 pages interconnected by about 5000 links.

For the WebKb and Wikipedia datasets, once a DRT has been applied, the spherical k-means algorithm is run on the authority matrix to construct the final clusters.
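As a sketch of this evaluation pipeline (under our assumptions, not the authors' exact code), spherical k-means can be approximated by L2-normalizing the rows of the reduced authority matrix before running standard k-means, and the clustering can then be scored against the reference classes:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(X_reduced, true_labels, n_clusters):
    # L2-normalize rows so Euclidean k-means approximates cosine-based
    # (spherical) k-means on the reduced authority matrix.
    Xn = normalize(X_reduced, norm='l2')
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Xn)
    return normalized_mutual_info_score(true_labels, pred)
```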

5. Results


In Figure 1a we report the accuracy obtained on the WebKb dataset. We observe that the four methods give the same results: accuracy does not change significantly and ranges between 0.32 and 0.36. We also notice that the four DRTs are not affected by the number of reduced dimensions. The same observations can be made from Figure 1b, except that the NMI (Normalized Mutual Information [12]) value is very low and close to 0.

[Figure 1 – Experimental results on the WebKb and Wikipedia datasets]

From Figures 1c and 1d we first notice the poor results of RP compared to the three other methods, although RP's performance tends to improve with the number of projections. We also observe that for a small number of factors (<10), PCA, ICA and NMF perform approximately the same. The three methods reach their best performance with 7 factors; naturally, this number corresponds to the number of categories in the Wikipedia collection. ICA has a slight edge over PCA, but the difference is not significant. It is also noticeable that once the number of natural categories in the dataset (i.e. 7) is exceeded, the performance of PCA and ICA deteriorates significantly, whereas NMF is very stable and unaffected by the number of factors.

We reduced the Armstrong dataset down to 10 dimensions using the four dimensionality reduction techniques PCA, ICA, NMF and RP. We report in Table 1 the results of NMF and PCA; since the results of RP and ICA were poor, we do not report them.

Table 1 – Three communities discovered by (a) NMF and (b) PCA from the Armstrong collection; lines in bold denote noise pages, i.e. pages irrelevant to the extracted community.

(a) NMF

Factor 2 (Louis Armstrong)
0.056345   http://www.satchmo.net
0.049517   http://www.redhotjazz.com/louie.html
0.030964   http://www.npg.si.edu/exh/armstrong
0.023606   http://www.time.com/time/time100/artists/profile/
0.021322   http://www.pbs.org/jazz/biography/
0.010829   http://www.satchography.com
0.0072496  http://www.pbs.org/wnet/americanmasters/database/
0.0053421  http://www.cosmopolis.ch/cosmo19/armstrong.htm
0.0047692  http://en.wikipedia.org/wiki/Louis_Armstrong
0.0025327  http://www.imdb.com/name/nm0001918

Factor 4 (Lance Armstrong)
0.056664   http://www.lancearmstrong.com
0.030787   http://www.thepaceline.com
0.030623   http://www.livestrong.org
0.021534   http://team.discovery.com
0.01398    http://www.laf.org
0.013838   http://www.livestrong.org/site/c.jvKZLbMRIsG/b.594849/
0.0085421  http://www.imdb.com/name/nm0035790
0.0058614  http://www.britannica.com/eb/article-9343162/
0.005794   http://www.store-laf.org
0.0051103  http://www.kidzworld.com/article/

Factor 8 (Neil Armstrong)
0.016848   http://www.jsc.nasa.gov/Bios/htmlbios/armstrong-na.html
0.013813   http://www.hq.nasa.gov/office/pao/History/alsj/a11/
0.013438   http://www.timesonline.co.uk/article/0,,3-2384628,00.html
0.013438   http://www.snopes.com/quotes/mrgorsky.htm
0.013392   http://www.cincinnati.com/visitorsguide/stories/
0.013253   http://www.maniacworld.com/Apollo_11_2.htm
0.013209   http://www.cosmosmagazine.com/node/717
0.013209   http://www.newyorker.com/critics/books/?051003crbo_books
0.013209   http://www.nasa.gov/vision/space/features/
0.013209   http://www.maniacworld.com/Apollo_11_3.htm

(b) PCA

Factor 4 – Negative end (Louis Armstrong)
-0.25732   http://www.redhotjazz.com/louie.html
-0.16213   http://www.npg.si.edu/exh/armstrong
-0.16159   http://www.time.com/time/time100/artists/profile/
-0.13561   http://www.satchmo.net
-0.12362   http://www.pbs.org/jazz/biography/
-0.071576  http://www.mediawiki.org
-0.069185  http://wikimediafoundation.org
-0.064839  http://www.satchography.com
-0.060775  http://wikimediafoundation.org/wiki/Fundraising
-0.044552  http://es.wikipedia.org/wiki/Neil_Armstrong

Factor 5 – Positive end (Lance Armstrong)
0.24646    http://www.livestrong.org
0.2271     http://www.thepaceline.com
0.22081    http://www.lancearmstrong.com
0.20462    http://www.mediawiki.org
0.18791    http://wikimediafoundation.org
0.18233    http://team.discovery.com
0.1011     http://en.wikipedia.org/wiki/Louis_Armstrong
0.098928   http://wikimediafoundation.org/wiki/Fundraising
0.085588   http://www.livestrong.org/site/c.jvKZLbMRIsG/b.594849/
0.072773   http://www.satchmo.net

Factor 5 – Negative end (Neil Armstrong)
-0.47625   http://www.jsc.nasa.gov/Bios/htmlbios/armstrong-na.html
-0.43356   http://starchild.gsfc.nasa.gov/docs/StarChild/
-0.18776   http://en.wikipedia.org/wiki/Neil_Armstrong
-0.13732   http://www.nasa.gov/centers/glenn/about/bios/neilabio.html
-0.10533   http://www.armstrongair.com
-0.10291   http://www.armstrong.com
-0.089615  http://www.armstrongpumps.com
-0.083854  http://www.armstrong.org
-0.080983  http://history1900s.about.com/od/armstrongneil/
-0.076381  http://www.hq.nasa.gov/office/pao/History/alsj/a11/

6. Discussion

The experimental results obtained on the WebKb dataset are somewhat surprising. The four DRTs have the same accuracy (Figure 1a), from which no interesting conclusion can be drawn. Moreover, in the NMI graphs (Figure 1b), normalized mutual information values are very low and close to 0 for all four methods. As indicated by Strehl [12], the NMI of a random clustering is close to 0; this means that the structures extracted from WebKb carry no interesting information. From these observations, we may deduce that WebKb is not adapted to comparing the four dimensionality reduction techniques and, more generally, may not be suited to the evaluation of web structure mining approaches. We explain the poor quality of the WebKb dataset with two arguments. First, most of the links in WebKb serve navigational purposes, and this kind of link has been shown in many studies to be useless for hyperlink connectivity analysis [2][13]. Second, the dataset's categorization does not reflect the real link distribution of the web pages: for instance, there is no reason for a student to put links to other students' personal pages on his own page, and the same holds for the project category, where links generally point to the pages of people involved in the project rather than to other projects. Note that we performed tests on the WebKb dataset because it is, to date, the dataset most used by the web structure mining research community.

Compared to PCA, ICA and NMF, the performance of random projection is very poor (Figures 1c and 1d), which indicates its inadequacy for capturing semantic structures from the web topology. Experiments on the Wikipedia dataset show a slight edge of ICA over PCA, but it is not significant. In terms of stability, NMF is clearly the best: once the number of natural categories in the dataset (i.e. 7) is exceeded, the performance of PCA and ICA deteriorates significantly, while NMF remains very stable and unaffected by the number of factors.

The interpretability test revealed the bad quality of the communities (i.e. sets of pages sharing the same topic) found by RP and ICA. ICA's results imply that there is no need to force the semantic structures to be independent. This may be due to the nature of the data in the Armstrong dataset, which follows a Gaussian distribution; ICA is known to be effective only on non-Gaussian data, whereas PCA is suited to Gaussian data. The interpretability of the NMF results (Table 1) is clearly better than that of PCA. This is because NMF describes web pages as an additive combination of the semantic components, whereas PCA defines web pages using both positive and negative coefficients.
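To illustrate how communities such as those in Table 1 can be read off an NMF factorization, here is a small sketch (our illustration; urls is a hypothetical list aligned with the rows of W):

```python
import numpy as np

def top_pages_per_factor(W, urls, top=10):
    """For each factor (column of W), print the pages with the largest
    authority coefficients; each factor is read as one community."""
    for f in range(W.shape[1]):
        print(f"Factor {f}")
        for i in np.argsort(W[:, f])[::-1][:top]:
            print(f"  {W[i, f]:.5f}  {urls[i]}")
```

Because W is nonnegative, each column is directly readable as a ranked community, while a PCA factor must be inspected at both its positive and negative ends.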

7. Conclusion and Future Work

In this study, we have compared four dimensionality reduction techniques on the task of web structure mining. The results show Nonnegative Matrix Factorization to be a promising approach for web structure analysis, given its superiority over the other methods. We therefore plan to focus on this technique by studying two further aspects more precisely. The first concerns the initialization step of the NMF algorithm, where the matrices W and H are currently filled with random positive entries; we believe that a good seed for the NMF algorithm would enhance the web structure discovery process. The second concerns the procedure for computing the matrices W and H. In this paper we employed the multiplicative update approach, but the literature offers many other methods, such as the divergence algorithm, the alternating least squares method and the probabilistic approach, and it may be useful to explore the effectiveness of these variants for web structure mining. Our experiments have also shown that the Wikipedia dataset is better adapted than the WebKb dataset for WSM assessment.
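For reference, a minimal sketch of the multiplicative updates (Lee and Seung's Frobenius-norm rules, written here in the common A ≈ WH convention, so this H is the transpose of the paper's):

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for A ≈ WH (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps   # random positive seeding, as in the paper
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update coefficients
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H
```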


Acknowledgment

This work was supported in part by INRIA under Grant 200601075. We also thank Sylvie Viguier-Pla and Sébastien Déjean from the CNRS-LSP laboratory for their valuable advice.

References

[1] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
[2] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[3] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.
[4] M. Saerens, F. Fouss. HITS is Principal Components Analysis. In Proc. of the 2005 IEEE/WIC/ACM Intl. Conf. on Web Intelligence, pp. 782-785, 2005.
[5] X. Luo, A. N. Zincir-Heywood. Evaluation of Three Dimensionality Reduction Techniques for Document Classification. In Proc. of the IEEE CCECE'04, pp. 181-184, Canada, 2004.
[6] I. Jolliffe. Principal Component Analysis. Springer, 2002.
[7] A. Hyvärinen, J. Karhunen, E. Oja. Independent Component Analysis. Wiley, New York, 2001.
[8] D. Lee, H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, pp. 788-791, 1999.
[9] S. Vempala. The Random Projection Method. American Mathematical Society, 2004.
[10] http://www.cs.cmu.edu/~webkb/
[11] http://www.mpi-inf.mpg.de/~angelova/DataSets/
[12] A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Ph.D. dissertation, The University of Texas at Austin, 2002.
[13] M. Fisher, R. Everson. When Are Links Useful? Experiments in Text Classification. In Proc. of the 25th European Conference on IR Research (ECIR'2003), LNCS 2633, Pisa, Italy, pp. 41-56, 2003.
