1 Dept. Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P.R. China (8621)62932564 [email protected]
2 MSPLAB, Dept. Electronic Engineering, Tsinghua University, Beijing, 100084, P.R. China (8610)62789944 {qinshitao99, fengg03}@mails.tsinghua.edu.cn [email protected]
3 Microsoft Research Asia, No. 49, Zhichun Road, Haidian District, Beijing, 100080, P.R. China (8610)62617711 {t-tyliu, wyma}@microsoft.com

Abstract. PageRank is one of the most popular link analysis algorithms and has proven effective in web search. However, PageRank considers only hyperlink information. In this paper, we propose several novel ranking algorithms that use both hyperlink and site structure information to measure the importance of each web page. Specifically, we adopt two methodologies to refine the PageRank algorithm: one fuses the hyperlink graph and the website structure graph before running PageRank, while the other re-ranks the pages within the same site by quadratic optimization based on the original PageRank values. Experiments show that both methodologies effectively improve retrieval performance.

1 Introduction

Typical search engines today use hyperlink information to measure the importance of a page in order to produce better search results. PageRank [7] and HITS [6] are two of the most popular link analysis algorithms for calculating the importance of web pages. Most link analysis algorithms, such as [3,5,6,7], focus mainly on hyperlink information. The underlying assumption is that the Web is a flat graph, in which all pages are of the same kind and their importance is determined only by the link connections.

However, the structure of the Web is clearly not that simple. Simon [10] argued that all systems are likely to be organized with a hierarchical structure, and the Web is no exception. At the macro level, domains, sites and pages form a hierarchy. Likewise, at the micro level, each website is itself organized hierarchically (as shown in Figure 1). In this paper, our goal is to study whether link analysis algorithms can be improved by introducing the site hierarchy.

* This work was performed at Microsoft Research Asia.

G.G. Lee et al. (Eds.): AIRS 2005, LNCS 3689, pp. 546 – 551, 2005. © Springer-Verlag Berlin Heidelberg 2005

Calculating Webpage Importance with Site Structure Constraints


Fig. 1. Hierarchical structure in a website

From the viewpoints of the website administrator and the web user, we can derive the following two rules:

Rule 1: The more important a parent page is, the more likely its children are to be important. In other words, a parent endorses its child pages. Similarly, the more important a child page is, the more likely its parent is to be important; that is, a child page also endorses its parent. This rule expresses the parent-child endorsement relationship.

Rule 2: Generally speaking, a parent page is more important than its child pages, since the website administrator uses child pages to support the parent page when constructing the website. Since a parent page has a smaller level number than its children, this rule expresses the level priority in the site structure constraints.

The above constraints reflect the goals of the website administrator when he or she constructs the site, and they are also consistent with users' browsing behavior. We therefore expect that computing page importance with site structure constraints will improve the effectiveness and precision of Web information retrieval systems.

The rest of this paper is organized as follows. In Section 2, we describe how to combine the hyperlink graph and the site structure graph to refine the PageRank algorithm. In Section 3, we describe the optimization-based PageRank algorithm with site structure constraints. Experimental results are reported in Section 4. Finally, we give concluding remarks and future work in Section 5.

2 PageRank with Site Structure Constraints

As is well known, the PageRank algorithm simulates a random walk on the hyperlink graph and assumes that hyperlinks represent human endorsement. According to Rule 1 in Section 1, parent and child pages also endorse each other. Analogously to the hyperlink graph, we can therefore construct a site structure graph, which gives us two graphs. Let A and A* denote the adjacency matrices of the hyperlink graph and the site structure graph, respectively. To integrate the site structure information into Web page ranking, we need to fuse these two graphs for the random walk model.


H.-M. Yan et al.

2.1 Additive Graph Fusion Algorithm

One of the simplest fusion methods is to add the two graphs directly. That is, we merge the adjacency matrices A and A* to get a new adjacency matrix B:

$$B = A + A^* \tag{1}$$

where the addition is element-wise and saturates at 1:

$$B_{ij} = A_{ij} + A^*_{ij} = \begin{cases} 1, & \text{if } A_{ij} = 1 \text{ or } A^*_{ij} = 1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

For the new graph with adjacency matrix B, we follow the standard PageRank algorithm to compute the PageRank of each page in the Web. We call this the additive graph fusion algorithm (AGF for short).

2.2 Multiplicative Graph Fusion Algorithm

In the additive graph fusion algorithm, we obtain a new graph by adding the hyperlink graph and the site structure graph together. In this subsection, we fuse the two graphs by multiplication instead. We call this the multiplicative graph fusion algorithm (MGF for short). Unlike in the additive graph fusion algorithm, we do not multiply the two adjacency matrices directly. Instead, we first convert the adjacency matrix of each graph into a probability transition matrix, and then multiply the two transition matrices to get a new graph for the random walk model. The details are as follows. As in the standard PageRank algorithm, we first normalize each row of the adjacency matrix A by its sum to obtain a row-stochastic probability matrix $\bar{A}$. Similarly, we obtain a row-stochastic matrix $\bar{A}^*$ from the adjacency matrix A*. We then obtain the matrix C of the new graph by multiplication:

$$C = \bar{A}^* \bar{A} \tag{3}$$

It is easy to verify that C is also a row-stochastic matrix, since the product of two row-stochastic matrices is row-stochastic. The stationary distribution of C is used to measure the importance of each web page.
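The two fusion schemes of this section can be illustrated with a minimal pure-Python sketch on dense matrices. Several details here are assumptions rather than part of the algorithms as stated: the damping factor of 0.85, the uniform treatment of rows without outgoing edges, the use of the same damped iteration on C, and the three-page example graph, which is hypothetical.

```python
def additive_fusion(A, A_star):
    """Eq. (2): B_ij = 1 if either graph has the edge i -> j, else 0."""
    n = len(A)
    return [[1 if (A[i][j] or A_star[i][j]) else 0 for j in range(n)]
            for i in range(n)]


def row_normalize(M):
    """Divide each row by its sum to obtain a row-stochastic matrix.

    Assumption: a row with no outgoing edges becomes a uniform row.
    """
    n = len(M)
    out = []
    for row in M:
        total = sum(row)
        out.append([x / total for x in row] if total else [1.0 / n] * n)
    return out


def multiplicative_fusion(A, A_star):
    """Eq. (3): C = (row-normalized A*) times (row-normalized A)."""
    P_star, P = row_normalize(A_star), row_normalize(A)
    n = len(A)
    return [[sum(P_star[i][k] * P[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pagerank(P, damping=0.85, iterations=200):
    """Power iteration on a row-stochastic matrix P with uniform teleportation.

    Assumption: damping and iteration count are illustrative defaults.
    """
    n = len(P)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [(1.0 - damping) / n
                + damping * sum(rank[i] * P[i][j] for i in range(n))
                for j in range(n)]
    return rank


# Hypothetical 3-page site: page 0 is the parent of pages 1 and 2.
A = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]        # hyperlinks 0->1, 1->2, 2->0
A_star = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # parent-child edges, both directions

B = additive_fusion(A, A_star)               # AGF graph
agf_scores = pagerank(row_normalize(B))
C = multiplicative_fusion(A, A_star)         # MGF transition matrix
mgf_scores = pagerank(C)
```

For Web-scale graphs a sparse representation would be needed; the dense lists above are only meant to make Eqs. (1) to (3) concrete.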

3 Optimization Based Algorithm

As mentioned in Rule 2 in Section 1, from the viewpoint of the website administrator, most parent pages should be more important than their child pages, because the children support their parent. However, the standard PageRank algorithm cannot guarantee this level priority. To make PageRank better suited to the site structure constraints, we refine the importance of pages within the same site by optimization. We call this algorithm OB for short.


Suppose there are k pages in a website, whose levels are $l_1, l_2, \ldots, l_k$ and whose importance scores calculated by the standard PageRank algorithm are $\pi_1, \pi_2, \ldots, \pi_k$. Let L denote the maximal level in the site, and let $\bar{\pi}$ denote the mean PageRank of all the pages in that site. Define $w = (w_1, w_2, \ldots, w_L)$ as an L-dimensional weight vector. We refine the importance scores by adding the same constant to all pages in the same level of the site. The refined importance scores are denoted by $\pi_1^*, \pi_2^*, \ldots, \pi_k^*$, where $\pi_i^* = \pi_i + w_{l_i} \bar{\pi}$. In this refinement, on the one hand, we try to make the pages in the same site consistent with the level priority; on the other hand, we do not want to change the original PageRank too much. Note that the change to the original PageRank depends on the weight vector w: the smaller the norm of w, the smaller the change. We can therefore formulate the optimization problem as follows:

$$\min \ \|w\|^2 \quad \text{s.t.} \quad \pi_i^* \ge \pi_j^* \ \text{ if } i \text{ is the parent of } j \tag{4}$$

The level priority in Rule 2 holds in general but may be violated in some special cases. We therefore introduce relaxation variables $\mu_{ij}$ into model (4):

$$\min \ \|w\|^2 + C \sum \mu_{ij} \quad \text{s.t.} \quad \pi_i^* + \mu_{ij} \ge \pi_j^* \ \text{ if } i \text{ is the parent of } j, \qquad \mu_{ij} \ge 0, \ i = 1, \ldots, k, \ j = 1, \ldots, k \tag{5}$$

where C controls the trade-off between the modification of the original PageRank and the violation of the level priority. If C = +∞, the model does not allow any violation of the level priority, and model (5) reduces to the simpler model (4). At the other extreme, if C = 0, the model ignores the level priority and the PageRank remains unchanged. Models (4) and (5) are clearly typical quadratic optimization problems, and we can use the algorithms in [1] to solve them. Owing to space limitations, we omit the derivation of the dual problem and the details of the solution.
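As an illustration of model (5), the following pure-Python sketch solves it by coordinate ascent on the dual, in which each parent-child constraint carries a multiplier restricted to the interval [0, C]. This solver is one simple choice made for the example and is not necessarily the procedure used in the experiments; the function name `refine_site_scores`, its parameters and the three-page example are hypothetical. The sketch assumes a positive mean PageRank and that a parent and its child lie at different levels.

```python
def refine_site_scores(pr, levels, parent_child, C=float("inf"),
                       sweeps=1000, tol=1e-12):
    """Solve min ||w||^2 + C * sum(mu) subject to
    pi*_i + mu_ij >= pi*_j for each (parent i, child j), mu_ij >= 0,
    where pi*_i = pi_i + w[level of i] * mean(pi).

    pr           -- original PageRank scores of the k pages of one site
    levels       -- level of each page, numbered 1..L
    parent_child -- list of (parent index, child index) pairs
    Returns (w, refined scores).
    """
    k = len(pr)
    mean_pr = sum(pr) / k
    w = [0.0] * max(levels)
    lam = [0.0] * len(parent_child)     # one dual multiplier per constraint
    curvature = mean_pr * mean_pr       # half the squared norm of a constraint row
    for _ in range(sweeps):
        largest_change = 0.0
        for m, (i, j) in enumerate(parent_child):
            li, lj = levels[i] - 1, levels[j] - 1
            if li == lj:
                continue                # constraint does not involve w
            # Amount by which the constraint is currently violated (if positive).
            residual = (pr[j] - pr[i]) - mean_pr * (w[li] - w[lj])
            new_lam = min(max(lam[m] + residual / curvature, 0.0), C)
            delta = new_lam - lam[m]
            lam[m] = new_lam
            # Primal update follows from w = 0.5 * sum_m lam_m * a_m.
            w[li] += 0.5 * delta * mean_pr
            w[lj] -= 0.5 * delta * mean_pr
            largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    refined = [pr[i] + w[levels[i] - 1] * mean_pr for i in range(k)]
    return w, refined


# Hypothetical site: a root page (level 1) whose two children (level 2)
# both received a higher PageRank than the root.
pr = [0.1, 0.3, 0.2]
levels = [1, 2, 2]
pairs = [(0, 1), (0, 2)]
w_hard, refined_hard = refine_site_scores(pr, levels, pairs)          # model (4)
w_soft, refined_soft = refine_site_scores(pr, levels, pairs, C=1.0)   # model (5)
```

With the default C = +∞ the multipliers are unbounded above and the sketch corresponds to model (4); with C = 0 every multiplier is clamped to zero, so w stays at zero and the scores are returned unchanged, matching the two extremes described above.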

4 Experiments

To compare our new PageRank algorithms with the standard PageRank (PR) [7], we chose the topic distillation task in the Web track of TREC 2003 as the benchmark. To generate the hierarchical structure of each website in the data corpus, we adopted the method in [4]. We used BM2500 [8] as the relevance weighting function, which gives a baseline precision at 10 (P@10) of 0.104.


For each query, we first use BM2500 to get a relevance list. We then choose the top 2000 pages from the relevance list and combine the relevance score with the importance score as follows:

$$score_{combination} = \alpha \times score_{relevance} + (1 - \alpha) \times score_{importance} \tag{6}$$
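The re-ranking step and the P@10 measure can be sketched as follows. The function names, the dictionary-based inputs and the toy scores are hypothetical, and the sketch applies Eq. (6) to the raw scores; whether the two scores are rescaled before being combined is not specified here.

```python
def combine_and_rank(relevance, importance, alpha, top_n=2000):
    """Re-rank the top_n most relevant pages by the combined score of Eq. (6).

    relevance, importance -- dicts mapping page id to score
    alpha                 -- combination parameter in [0, 1]
    """
    candidates = sorted(relevance, key=relevance.get, reverse=True)[:top_n]
    combined = {
        # Assumption: a page without an importance score counts as 0.
        page: alpha * relevance[page] + (1.0 - alpha) * importance.get(page, 0.0)
        for page in candidates
    }
    return sorted(candidates, key=combined.get, reverse=True)


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k ranked pages that are judged relevant."""
    return sum(1 for page in ranked[:k] if page in relevant) / k


# Toy data, for illustration only.
relevance = {"a": 3.0, "b": 2.0, "c": 1.0}
importance = {"a": 0.1, "b": 0.9, "c": 0.5}
ranking = combine_and_rank(relevance, importance, alpha=0.5)
```

Setting alpha to 1 reproduces the pure relevance ranking, which is why every importance measure yields the same result at that point.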

The P@10 of all algorithms under investigation is shown in Figure 2. All four curves converge to the baseline when α = 1. From this figure, we can see that all three of our algorithms outperform the standard PageRank algorithm, which shows the effectiveness of considering site structure information. In particular, OB boosts retrieval accuracy significantly more than any of the other algorithms. The success of OB directly supports the validity of the site structure constraints introduced in Section 1.

In both AGF and MGF, we integrate site structure information by modifying the random walk graph. However, it is not clear how much the site structure contributes to the final importance score, so we cannot be sure that the final importance score is consistent with the site structure constraints. In OB, by contrast, we treat the site structure constraints explicitly in the optimization formulation, so the final importance score will most likely follow them. As a result, OB makes the best use of site structure information among all the algorithms in Figure 2.

[Figure 2: P@10 (vertical axis, 0.08 to 0.14) plotted against the combination parameter α (horizontal axis, 0.3 to 1) for PR, AGF, MGF and OB.]

Fig. 2. Comparison of P@10 on TREC2003

5 Conclusions and Future Work

In this paper, we pointed out that site structure information, which is neglected by traditional link analysis algorithms, should be considered when ranking web pages. Based on this motivation, we modified the standard PageRank algorithm in two different ways: one modifies the transition graph of the random walk model, and the other post-optimizes the importance scores according to site structure constraints. We developed three new algorithms with site structure constraints for page importance analysis: the additive graph fusion algorithm, the multiplicative graph fusion algorithm and the optimization-based algorithm. Experiments on the topic distillation task of TREC 2003 showed that all the new algorithms outperform the standard PageRank algorithm. In particular, the optimization-based algorithm significantly boosted retrieval accuracy. HITS [6] is another popular link analysis algorithm. In future work, we would like to apply site structure constraints to the HITS algorithm.

References

1. Boyd, S., and Vandenberghe, L. Convex optimization. Course notes for EE364, Stanford University, 2003.
2. Brin, S., and Page, L. The anatomy of a large-scale hypertextual Web search engine. In The Seventh International World Wide Web Conference, 1998.
3. Chakrabarti, S., Joshi, M., and Tawde, V. Enhanced topic distillation using text, markup tags, and hyperlinks. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, 2001, pp. 208-216.
4. Feng, G., Liu, T. Y., Zhang, X. D., Qin, T., Gao, B., and Ma, W. Y. Level-Based Link Analysis. In the 7th APWeb, 2005.
5. Haveliwala, T. H. Topic-sensitive PageRank. In Proc. of the 11th Int. World Wide Web Conference, May 2002.
6. Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM, Vol. 46, No. 5, 1999, pp. 604-622.
7. Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford, CA, 1998.
8. Robertson, S. E. Overview of the Okapi projects. Journal of Documentation, Vol. 53, No. 1, 1997, pp. 3-7.
9. Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
10. Simon, H. A. The Sciences of the Artificial. MIT Press, Cambridge, MA, 3rd edition, 1981.