Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search
Xutao Li¹, Michael Ng², Yunming Ye¹
¹ Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, China
² Department of Mathematics, Hong Kong Baptist University, Hong Kong
SIAM International Conference on Data Mining, 2012
X.T. Li, et al.
Outline
• Motivation
• Related Work
• HAR (Idea + Theory + Algorithm)
• Experimental Results
• Concluding Remarks
Motivation
• Link analysis algorithms are critical to information retrieval tasks, especially Web-related retrieval applications: the Web contains much noise and low-quality information, so the link (hyperlink) structure is helpful for ranking, e.g., in Google.
• There are many applications where the links/hyperlinks can be characterized into different types.
Motivation - Examples of multi-relational data
[Figure: (a) a multi-relational citation network (citation through keywords); (b) a multi-semantic hyperlink network; (c) a multi-channel communication network (phone, MSN, email); (d) a multi-conditional gene interaction network (co-express/physical interactions under conditions)]
How to exploit such multi-relational link structures to facilitate the query search task is an important and open research problem.
Outline
• Motivation
• Related Work
• HAR (Idea + Theory + Algorithm)
• Experimental Results
• Concluding Remarks
Related Work
• The hyperlink structure is exploited by three of the most frequently cited Web IR methods: HITS (Hypertext Induced Topic Search), PageRank and SALSA.
• HITS was developed in 1997 by Jon Kleinberg. Soon after, Sergey Brin and Larry Page developed their now-famous PageRank method. SALSA was developed in 2000 in reaction to the pros and cons of HITS and PageRank. [See A. Langville and C. Meyer, A Survey of Eigenvector Methods for Web Information Retrieval, SIAM Review, 2005.]
• In 2006, Tamara Kolda and Brett Bader proposed the TOPHITS method to analyze multi-relational link structures by using tensor decomposition.
New Challenge
• PageRank: L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. 1998.
• HITS: J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46: 604-632, 1999.
• SALSA: R. Lempel and S. Moran. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect. The Ninth International WWW Conference, 2000.
  - These three methods handle only a single type of relation (hyperlink).
• TOPHITS: T. Kolda and B. Bader. The TOPHITS Model for Higher-Order Web Link Analysis. Workshop on Link Analysis, Counterterrorism and Security, 2006.
  - The decomposition may not be unique.
  - Negative hub and authority scores can be produced.
Outline
• Motivation
• Related Work
• HAR (Idea + Theory + Algorithm)
• Experimental Results
• Concluding Remarks
The Idea
• In order to differentiate relations, we introduce a relevance score for each relation, besides the hub and authority scores for objects.
[Figure: hub scores and authority scores for objects, relevance scores for relations]
• The hub, authority and relevance scores have a mutually-reinforcing relationship.
• Represent the data with a tensor → construct transition probability tensors w.r.t. hubs, authorities and relations → set up tensor equations based on a random walk → solve the tensor equations to obtain the hub, authority and relevance scores.
The Representation
Example: five objects and three relations (R1: green, R2: blue, R3: red) among them.
[Figure: (a) the multi-relational graph among the five objects; (b) the corresponding 5 x 5 x 3 tensor, one slice per relation]
In the following, we assume that there are m objects and n relations in the multi-relational data. The data are represented as a tensor T = (t_{i_1,i_2,j_1}), where (i_1, i_2) are the indices for objects and j_1 is the index for relations.
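As a concrete illustration of this representation, the sketch below stores a small multi-relational graph as a sparse third-order binary tensor. The edge list is hypothetical (the actual links of the figure are not recoverable here), and `build_tensor` with its dictionary layout is our own illustrative choice, not the paper's implementation.

```python
# Sketch: store a multi-relational graph as a sparse 3rd-order binary
# tensor T = (t_{i1,i2,j1}). The edge list is hypothetical, chosen only
# to illustrate the m = 5 objects / n = 3 relations setting.

def build_tensor(edges, m, n):
    """Map each link (i1, i2, j1) -- object i1 links to object i2 via
    relation j1 (0-based indices) -- to a tensor entry of value one."""
    T = {}
    for i1, i2, j1 in edges:
        if not (0 <= i1 < m and 0 <= i2 < m and 0 <= j1 < n):
            raise ValueError("index out of range")
        T[(i1, i2, j1)] = 1  # binary entries, as in the slides
    return T

# hypothetical links among 5 objects under relations R1, R2, R3
edges = [(0, 1, 0), (1, 2, 0), (0, 2, 1), (3, 4, 1), (2, 4, 2), (4, 0, 2)]
T = build_tensor(edges, m=5, n=3)
```

Entries not present in the dictionary are implicitly zero, which keeps the storage proportional to the number of links rather than to m × m × n.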
Transition Probability Tensors
We construct H = (h_{i_1,i_2,j_1}), A = (a_{i_1,i_2,j_1}) and R = (r_{i_1,i_2,j_1}) with respect to hubs, authorities and relations by normalizing the entries of T as follows:

h_{i_1,i_2,j_1} = t_{i_1,i_2,j_1} / \sum_{i_1=1}^{m} t_{i_1,i_2,j_1},   i_1 = 1, 2, ..., m,
a_{i_1,i_2,j_1} = t_{i_1,i_2,j_1} / \sum_{i_2=1}^{m} t_{i_1,i_2,j_1},   i_2 = 1, 2, ..., m,
r_{i_1,i_2,j_1} = t_{i_1,i_2,j_1} / \sum_{j_1=1}^{n} t_{i_1,i_2,j_1},   j_1 = 1, 2, ..., n.
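A minimal sketch of this normalization on dense nested lists (our own illustration, not the paper's code; fibres that sum to zero are left at zero, a dangling-case convention the slide does not specify):

```python
# Turn a nonnegative m x m x n tensor T (nested lists, T[i1][i2][j1])
# into the transition tensors H, A, R by normalizing along one index.

def normalize(T, m, n):
    H = [[[0.0] * n for _ in range(m)] for _ in range(m)]
    A = [[[0.0] * n for _ in range(m)] for _ in range(m)]
    R = [[[0.0] * n for _ in range(m)] for _ in range(m)]
    for i2 in range(m):
        for j1 in range(n):
            s = sum(T[i1][i2][j1] for i1 in range(m))  # sum over hubs i1
            if s > 0:
                for i1 in range(m):
                    H[i1][i2][j1] = T[i1][i2][j1] / s
    for i1 in range(m):
        for j1 in range(n):
            s = sum(T[i1][i2][j1] for i2 in range(m))  # sum over authorities i2
            if s > 0:
                for i2 in range(m):
                    A[i1][i2][j1] = T[i1][i2][j1] / s
    for i1 in range(m):
        for i2 in range(m):
            s = sum(T[i1][i2][j1] for j1 in range(n))  # sum over relations j1
            if s > 0:
                for j1 in range(n):
                    R[i1][i2][j1] = T[i1][i2][j1] / s
    return H, A, R

# toy 2 x 2 x 2 binary tensor, purely illustrative
T = [[[0, 1], [1, 0]],
     [[1, 0], [0, 1]]]
H, A, R = normalize(T, 2, 2)
```

After normalization, every nonzero fibre of H sums to one over i_1, of A over i_2, and of R over j_1, which is exactly the stochasticity needed for the random-walk interpretation on the next slide.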
Transition Probability Tensors
These numbers give estimates of the following conditional probabilities:

h_{i_1,i_2,j_1} = Prob[X_t = i_1 | Y_t = i_2, Z_t = j_1]
a_{i_1,i_2,j_1} = Prob[Y_t = i_2 | X_t = i_1, Z_t = j_1]
r_{i_1,i_2,j_1} = Prob[Z_t = j_1 | Y_t = i_2, X_t = i_1]

where X_t, Y_t and Z_t are random variables referring to visiting a particular object as a hub, visiting a particular object as an authority, and using a particular relation, respectively, at time t. Here t refers to the time step of the random walk.
HAR - Tensor Equations
hub score \bar{x}, authority score \bar{y}, relevance score \bar{z}:

H \bar{y} \bar{z} = \bar{x},   A \bar{x} \bar{z} = \bar{y},   R \bar{x} \bar{y} = \bar{z},

with \sum_{i_1=1}^{m} \bar{x}_{i_1} = 1, \sum_{i_2=1}^{m} \bar{y}_{i_2} = 1, \sum_{j_1=1}^{n} \bar{z}_{j_1} = 1.
HAR - Tensor Equations
Entry-wise, the equations for the hub score \bar{x}, authority score \bar{y} and relevance score \bar{z} read

\sum_{i_2=1}^{m} \sum_{j_1=1}^{n} h_{i_1,i_2,j_1} \bar{y}_{i_2} \bar{z}_{j_1} = \bar{x}_{i_1},   1 \le i_1 \le m,
\sum_{i_1=1}^{m} \sum_{j_1=1}^{n} a_{i_1,i_2,j_1} \bar{x}_{i_1} \bar{z}_{j_1} = \bar{y}_{i_2},   1 \le i_2 \le m,
\sum_{i_1=1}^{m} \sum_{i_2=1}^{m} r_{i_1,i_2,j_1} \bar{x}_{i_1} \bar{y}_{i_2} = \bar{z}_{j_1},   1 \le j_1 \le n,

with \sum_{i_1=1}^{m} \bar{x}_{i_1} = 1, \sum_{i_2=1}^{m} \bar{y}_{i_2} = 1, \sum_{j_1=1}^{n} \bar{z}_{j_1} = 1.
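The left-hand sides above are tensor-vector contractions. The sketch below (our illustration, with a hypothetical toy tensor) computes (H \bar{y} \bar{z})_{i_1} and checks that a tensor that is stochastic over i_1 maps probability vectors to a probability vector:

```python
# (H y z)_{i1} = sum_{i2, j1} h_{i1,i2,j1} * y_{i2} * z_{j1}

def contract_hub(H, y, z):
    m, n = len(y), len(z)
    return [sum(H[i1][i2][j1] * y[i2] * z[j1]
                for i2 in range(m) for j1 in range(n))
            for i1 in range(m)]

# toy 2 x 2 x 1 tensor whose fibres sum to one over i1
H = [[[0.3], [0.8]],
     [[0.7], [0.2]]]
x = contract_hub(H, y=[0.5, 0.5], z=[1.0])
# x stays on the probability simplex
```

This simplex-preserving property is what makes the fixed-point iteration on the algorithm slide well defined: each update keeps the scores as probability distributions.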
Generalization
When we consider a single relation type, we can set \bar{z} to be the vector e/n, where e is the vector of all ones. We then obtain two matrix equations:

H \bar{y} e/n = \bar{x},   A \bar{x} e/n = \bar{y}.

We remark that A can be viewed as the transpose of H. This is exactly the same as solving for the singular vectors to obtain the hub and authority scoring vectors in SALSA. In summary, the proposed HAR framework is a generalization of SALSA that can deal with multi-relational data.
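A small numeric check of this reduction (our own illustration, with a hypothetical 2 x 2 x 2 tensor): with \bar{z} fixed to the uniform vector e/n, the tensor product H \bar{y} \bar{z} coincides with a matrix-vector product using the relation-averaged matrix M = (1/n) \sum_{j_1} H[:, :, j_1].

```python
# Check that with z = e/n the tensor update reduces to a matrix update.

def contract(H, y, z):
    m, n = len(y), len(z)
    return [sum(H[i1][i2][j1] * y[i2] * z[j1]
                for i2 in range(m) for j1 in range(n))
            for i1 in range(m)]

H = [[[0.2, 0.4], [0.9, 0.6]],
     [[0.8, 0.6], [0.1, 0.4]]]   # 2 x 2 x 2, stochastic over i1
m, n = 2, 2
y = [0.3, 0.7]
z = [1.0 / n] * n                # z = e/n

tensor_result = contract(H, y, z)

# relation-averaged matrix M and the corresponding matrix-vector product
M = [[sum(H[i1][i2][j1] for j1 in range(n)) / n for i2 in range(m)]
     for i1 in range(m)]
matrix_result = [sum(M[i1][i2] * y[i2] for i2 in range(m)) for i1 in range(m)]
# tensor_result == matrix_result (up to rounding)
```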
HAR - Query Search
To deal with query processing, we compute hub and authority scores of objects and relevance scores of relations with respect to a query input (like topic-sensitive PageRank):

(1 - \alpha) H \bar{y} \bar{z} + \alpha o = \bar{x},
(1 - \beta) A \bar{x} \bar{z} + \beta o = \bar{y},
(1 - \gamma) R \bar{x} \bar{y} + \gamma r = \bar{z},

where o and r are two assigned probability distributions constructed from the query input, and 0 \le \alpha, \beta, \gamma < 1 are three parameters.
HAR - Theory
\Omega_m = \{ u = (u_1, u_2, \ldots, u_m) \in R^m \mid u_i \ge 0, 1 \le i \le m, \sum_{i=1}^{m} u_i = 1 \}
and
\Omega_n = \{ w = (w_1, w_2, \ldots, w_n) \in R^n \mid w_j \ge 0, 1 \le j \le n, \sum_{j=1}^{n} w_j = 1 \}.

Clearly, the solution of HAR lies in a convex set. We then derive the following two theorems based on the Brouwer Fixed Point Theorem.
HAR - Theory
Theorem 1. Suppose H, A and R are constructed as above, 0 \le \alpha, \beta, \gamma < 1, and o \in \Omega_m and r \in \Omega_n are given. If T is irreducible, then there exist \bar{x} > 0, \bar{y} > 0 and \bar{z} > 0 such that (1 - \alpha) H \bar{y} \bar{z} + \alpha o = \bar{x}, (1 - \beta) A \bar{x} \bar{z} + \beta o = \bar{y}, and (1 - \gamma) R \bar{x} \bar{y} + \gamma r = \bar{z}, with \bar{x}, \bar{y} \in \Omega_m and \bar{z} \in \Omega_n.

Theorem 2. Suppose T is irreducible, H, A and R are constructed as above, 0 \le \alpha, \beta, \gamma < 1, and o \in \Omega_m and r \in \Omega_n are given. If 1 is not an eigenvalue of the Jacobian matrix of the mapping defined by the tensors, then the solution vectors \bar{x}, \bar{y} and \bar{z} are unique.
The HAR Algorithm
Input: three tensors H, A and R; two initial probability distributions y_0 and z_0 (with \sum_{i=1}^{m} [y_0]_i = 1 and \sum_{j=1}^{n} [z_0]_j = 1); the assigned probability distributions of objects and/or relations o and r (with \sum_{i=1}^{m} [o]_i = 1 and \sum_{j=1}^{n} [r]_j = 1); three weighting parameters 0 \le \alpha, \beta, \gamma < 1; and the tolerance \epsilon.
Output: three stationary probability distributions \bar{x} (hub scores), \bar{y} (authority scores) and \bar{z} (relevance values).
Procedure:
1: Set t = 1;
2: Compute x_t = (1 - \alpha) H y_{t-1} z_{t-1} + \alpha o;
3: Compute y_t = (1 - \beta) A x_t z_{t-1} + \beta o;
4: Compute z_t = (1 - \gamma) R x_t y_t + \gamma r;
5: If ||x_t - x_{t-1}|| + ||y_t - y_{t-1}|| + ||z_t - z_{t-1}|| < \epsilon, stop; otherwise set t = t + 1 and go to Step 2.
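Steps 1-5 can be sketched as follows on dense nested-list tensors. This is our own minimal illustration, not the authors' code: the toy tensors and parameter values below are hypothetical, uniform starting vectors are used for y_0 and z_0, and the 1-norm is used for the stopping test.

```python
# A runnable sketch of the HAR iteration (Steps 1-5 above).

def har(H, A, R, o, r, alpha, beta, gamma, eps=1e-10, max_iter=1000):
    m, n = len(o), len(r)
    x = [1.0 / m] * m          # hub scores
    y = [1.0 / m] * m          # y0: uniform authority scores
    z = [1.0 / n] * n          # z0: uniform relevance scores
    for _ in range(max_iter):
        xn = [(1 - alpha) * sum(H[i1][i2][j1] * y[i2] * z[j1]
                                for i2 in range(m) for j1 in range(n))
              + alpha * o[i1] for i1 in range(m)]
        yn = [(1 - beta) * sum(A[i1][i2][j1] * xn[i1] * z[j1]
                               for i1 in range(m) for j1 in range(n))
              + beta * o[i2] for i2 in range(m)]
        zn = [(1 - gamma) * sum(R[i1][i2][j1] * xn[i1] * yn[i2]
                                for i1 in range(m) for i2 in range(m))
              + gamma * r[j1] for j1 in range(n)]
        diff = (sum(abs(a - b) for a, b in zip(xn, x))
                + sum(abs(a - b) for a, b in zip(yn, y))
                + sum(abs(a - b) for a, b in zip(zn, z)))
        x, y, z = xn, yn, zn
        if diff < eps:
            break
    return x, y, z

# hypothetical stochastic toy tensors: m = 2 objects, n = 1 relation
H = [[[0.3], [0.8]], [[0.7], [0.2]]]   # fibres sum to 1 over i1
A = [[[0.6], [0.4]], [[0.5], [0.5]]]   # fibres sum to 1 over i2
R = [[[1.0], [1.0]], [[1.0], [1.0]]]   # fibres sum to 1 over j1
x, y, z = har(H, A, R, o=[0.5, 0.5], r=[1.0], alpha=0.1, beta=0.1, gamma=0.1)
```

Because each tensor is stochastic along its own index and the updates are damped by o and r, every iterate stays on the probability simplex, matching the convex-set argument on the theory slides.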
Outline
• Motivation
• Related Work
• HAR (Idea + Theory + Algorithm)
• Experimental Results
• Concluding Remarks
Evaluation metrics
• P@k: given a particular query q, we compute the precision at position k as
  P@k = #{relevant documents in top k results} / k.
• NDCG@k: a normalized version of the DCG@k metric.
• MAP: given a query, the average precision is calculated by averaging the precision scores at each position in the search results where a relevant document is found; MAP is the mean of these average precisions over all queries.
• R-prec: given a query, R-prec is the precision after R documents are retrieved, i.e., R-prec = P@R, where R is the total number of relevant documents for that query.
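These per-query metrics can be sketched as below. The names `ranked` and `relevant` are illustrative, and the log2 discount in `dcg_at_k` is one common DCG convention; the slides do not specify which variant was used.

```python
import math

def precision_at_k(ranked, relevant, k):
    # P@k = #{relevant documents in top k results} / k
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for pos, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / pos      # precision at each relevant position
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    return precision_at_k(ranked, relevant, len(relevant))   # R-prec = P@R

def dcg_at_k(gains, k):
    # gains[i] is the relevance grade of the i-th retrieved document
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

MAP is then the mean of `average_precision` over all queries in the query set.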
Experiment 1
• 100,000 webpages from the .GOV Web collection in TREC 2002, with the 50 topic distillation topics in the TREC 2003 Web track as queries
• links among webpages via different anchor texts
• 39,255 anchor terms (multiple relations) and 479,122 links with these anchor terms among the 100,000 webpages
• If the i_1-th webpage links to the i_2-th webpage via the j_1-th anchor term, we set the entry t_{i_1,i_2,j_1} of T to be one. The size of T is 100,000 x 100,000 x 39,255.
Method                     P@10    P@20    NDCG@10  NDCG@20  MAP     R-prec
HITS                       0.0000  0.0000  0.0000   0.0000   0.0041  0.0000
SALSA                      0.0160  0.0140  0.0157   0.0203   0.0114  0.0084
TOPHITS (500-rank)         0.0020  0.0010  0.0044   0.0028   0.0008  0.0002
TOPHITS (1000-rank)        0.0040  0.0020  0.0088   0.0057   0.0016  0.0010
TOPHITS (1500-rank)        0.0040  0.0030  0.0063   0.0049   0.0011  0.0018
BM25+                      0.0280  0.0180  0.0419   0.0479   0.0370  0.0370
DepInOut                   0.0560  0.0410  0.0659   0.0747   0.0330  0.0552
HAR (rel. query)           -       -       -        -        -       -
HAR (rel. and obj. query)  0.1100  0.0800  0.1545   0.1765   0.1035  0.1051

The results of all comparison algorithms on the TREC data set.
Parameters
[Figure: the parameter tuning test - tuning \gamma with \alpha = \beta = 0; curves show P@10, NDCG@10, MAP and R-prec as \gamma varies from 0 to 1.]
Parameters
[Figure: the parameter tuning test - tuning \alpha and \beta with \gamma = 0.9; curves show P@10, NDCG@10, MAP and R-prec as \alpha = \beta varies from 0 to 1.]
Experiment 2
• five conferences: SIGKDD, WWW, SIGIR, SIGMOD, CIKM
• publication information includes the title, authors, reference list, and classification categories associated with each publication
• 6,848 publications and 617 different categories
• 100 category concepts used as query inputs to retrieve the relevant publications
• Tensor of size 6848 x 6848 x 617: if the i_1-th publication cites the i_2-th publication and the i_2-th publication has the j_1-th category concept, then we set the entry t_{i_1,i_2,j_1} of T to one; otherwise we set it to zero.
Method               P@10    P@20    NDCG@10  NDCG@20  MAP     R-prec
HITS                 0.2260  0.1815  0.3789   0.3792   0.2522  0.2751
SALSA                0.4100  0.3105  0.5606   0.5352   0.3462  0.3929
TOPHITS (50-rank)    0.1360  0.1145  0.1684   0.1557   0.0566  0.0617
TOPHITS (100-rank)   0.1640  0.1340  0.2012   0.1857   0.0646  0.0732
TOPHITS (150-rank)   0.1920  0.1410  0.2315   0.1998   0.0732  0.0765
BM25+                0.0170  0.0145  0.0147   0.0138   0.0162  0.0109
DepInOut             -       -       -        -        -       -
HAR (rel. query)     0.5880  0.4155  0.7472   0.6760   0.4731  0.4683

The results of all comparison algorithms on the DBLP data set.
Outline
• Motivation
• Related Work
• HAR (Idea + Theory + Algorithm)
• Experimental Results
• Concluding Remarks
Concluding Remarks
• Our framework is a general paradigm, and it can be further extended to data represented by higher-order tensors, with potential applications in the semantic web, image retrieval and community discovery.
• For example, we can consider the query search problem in the semantic web using a (1,1,1,1)th-order rectangular tensor to represent the subject, object, predicate and context relationship. After constructing four transition probability tensors S, O, P and R for subject, object, predicate and context respectively, based on the proposed framework we expect to solve the following set of tensor equations:
  S o p r = s,   O s p r = o,   P s o r = p,   R s o p = r.
Thank you!