
Collaborative Filtering Supporting Web Site Navigation

Gianluigi Greco, Sergio Greco and Ester Zumpano
DEIS, Univ. della Calabria and ICAR-CNR, 87030 Rende, Italy
E-mail: {gianluigi.greco,greco,zumpano}@deis.unical.it

Abstract. The proliferation of information available on the World Wide Web, together with new technologies that have lowered the barriers to organizing and publishing documents, has made support for web site navigation and personalization an appealing and promising task for the AI community. Making the retrieval of relevant documents easier is one of the most challenging activities in the design of modern sites, and goes beyond any particular domain. This paper proposes a new technique for Web navigation based on algorithms currently used in Recommendation Systems. Our approach identifies truly relevant documents by adopting methodologies similar to those successfully used in current search engines. The effectiveness and validity of our technique have been shown by several experiments.

Keywords: Adaptive Web sites, collaborative filtering, Web searching, browsing

1. Introduction

The proliferation of information available on the Web has significantly increased the possibility of sharing ideas and human knowledge on a scale never seen before. In particular, the research and development of new technologies, which have reduced the barriers to collecting, organizing and publishing huge amounts of information, have led many organizations to create, at relatively low cost, massive web sites consisting of several hundreds of pages. Such sites (educational, e-commerce, industrial) are generally difficult to navigate, since they try to attract a large and heterogeneous audience by constructing unspecialized pages. As a consequence, a new user often finds information of interest only after long and exhausting navigation, and the design of tools supporting site navigation and personalization has become a new, emerging task for the AI community [21]. Easy navigation of web sites and their personalization rely on a special treatment of the information provided by Web sites, aimed at matching the visitor's interests. The basic idea underlying all tools supporting easy navigation and personalization consists in accumulating a vast amount of historical data about site visitors, and in using these data to design adaptive web sites, that is, sites that automatically improve their organization and presentation by learning from visitor access

patterns. The personalization of the information is then performed by applying traditional data mining algorithms that try to cluster users with respect to some (given or induced) profiles. In this paper we follow a different idea. In particular, our main contribution is an effective approach to web site navigation that combines the knowledge developed in the field of Recommendation Systems with the emerging filtering technologies applied to the design of ranking mechanisms for web search engines. Specifically, we propose an algorithm which, given a web site S, represented by means of a rooted graph GS, computes a new rooted graph G′S representing a different view of the site. The computation of the rooted graph G′S is based on spectral techniques currently used for the identification of relevant Web pages. In particular, the process of computing authoritative pages (with respect to a given topic or to the user profile) is applied to a bipartite graph GA associated with a matrix A, called the user-page matrix. In more detail, this matrix corresponds to the classical user-item matrix used in Recommendation Systems, since it stores all the pages visited by each user. Moreover, since the process of computing authoritative pages is expensive, we also propose an approximated solution. Several experiments confirm that our approach is more effective than traditional recommendation algorithms.

0921-7126/04/$17.00 © 2004 – IOS Press and the authors. All rights reserved


1.1. Organization

The rest of the paper is organized as follows. Section 2 presents preliminary notions on collaborative filtering, Web personalization and ranking techniques. Section 3 presents a new algorithm for finding similar items and, moreover, proposes a new technique for computing approximated authoritative elements. On the basis of the theoretical results of Section 3, Section 4 presents an algorithm that, given a web site S, computes the structure of a new web site S′ representing a personalization of S. A system prototype and some experimental results are described in Section 5.

2. Preliminaries and related works

In this section we provide the general background needed to understand our technique for web site personalization. In particular, we introduce the collaborative filtering approach and overview several recommendation systems in which this technology has been successfully applied. Furthermore, we introduce the main concepts underlying the web personalization task and describe the ranking techniques (used by current search engines) that we have adapted to implement a new collaborative filtering algorithm, better suited for web site navigation tools.

2.1. Collaborative filtering

Collaborative filtering (CF) is a technique that helps people make choices on the basis of other people's choices. Formally, the problem is defined as follows. We are given a set of users U = {u1, . . . , um}, a set of items I = {i1, . . . , in} and a distinguished user ua, called the active user. Each user uj is associated with a list Iuj of items chosen from I, on which he/she has expressed an opinion, by rating or selecting elements of I.¹ For instance, in e-commerce applications, Iuj could be the list of items bought by uj. The rating of items can be described by a (sparse) (m × n) user-item matrix A, where each element A(h, j) contains the likeliness of item ij expressed by the user uh. The process of collaborative filtering may consist in:

¹ Such information can be obtained from the analysis of logs.

1. Prediction of a value Va,j, expressing the likeliness of the item ij, not belonging to the current Iua, for the user ua.
2. (Top-N) Recommendation, consisting in the selection of a list of N items which could be recommended to the active user ua.

Observe that the prediction task can be considered a special case of the Top-N recommendation task with N = 1. Thus, in the following we concentrate on the recommendation task. Several approaches for recommendation systems [2,13,16,22,25,26] have been recently developed, but the two main categories are:

1. user-based (also known as collaborative recommendation algorithms): these analyze historical information about a neighborhood of people whose tastes are similar to those of the active user, in order to recommend new items that they have liked and that, for this reason, will probably be of interest for the user. The idea is to find groups of users that are interested in common items, i.e., that have chosen exactly the same or similar items. Even if widely used, this approach suffers severe scalability limitations, as it requires a computation that grows linearly with the number of users and with the number of items.
2. item-based (also known as content-based recommendation algorithms): these analyze historical information in order to recommend items similar to those the user has liked in the past; here the attention is focused on the items purchased. In particular, these approaches look for a correlation between items by using some measure of similarity. It follows that, for each item, we can define a recommendation value by looking at the most similar ones.

In the user-based approach, the grouping of users is performed by computing clusters of users in the user-item matrix (see Fig. 1, where m is the number of users and n is the number of items). In the item-based approach (see Fig. 2), the similarity between two items ij and ik is computed by considering the "distance" between columns j and k in the user-item matrix. Observe that, even if item-based approaches are typically more efficient and effective, their performance is often dramatically limited by the choice of a similarity measure for items. Several similarity criteria have been proposed in the literature; we mention here the cosine-based, the correlation-based, and the probabilistic similarity.
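As an illustration of the item-based approach, the cosine similarity between columns of a user-item matrix can be sketched as follows. This is a minimal sketch on invented 0-1 data; the helper `top_n` is hypothetical and not part of the paper:

```python
import numpy as np

# Toy 4 x 5 user-item matrix A: A[h, j] = 1 if user u_h chose item i_j.
A = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
], dtype=float)

def cosine_item_similarity(A):
    """Item-item similarity = cosine of the angle between columns of A."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unrated items
    U = A / norms
    return U.T @ U                   # n x n item-item similarity matrix

def top_n(A, sim, user, n=2):
    """Top-N recommendation: the n items with the highest summed
    similarity to the user's chosen items, excluding those items."""
    scores = sim[A[user] > 0].sum(axis=0)
    scores[A[user] > 0] = -np.inf    # do not recommend already-chosen items
    return list(np.argsort(-scores)[:n])

sim = cosine_item_similarity(A)
print(top_n(A, sim, user=1))         # -> [1, 4]
```

User 1 has chosen items 0 and 2, and the items most similar to those (by co-selection with other users) are recommended first.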


Avanti [5]. The Avanti project takes care of the individual needs of heterogeneous visitors by adapting the content and presentation of web pages to each of them. Using information about user groups, the system attempts to predict the topic in which the user is interested and his/her next step in the navigation of the site. The system particularly cares for the special needs of elderly and handicapped users.

Fig. 1. Users clustering.

Fig. 2. Items similarity.

Both approaches present some limitations. In item-based recommendation algorithms, recommended items are just those scoring highly against a user's profile, so the user is restricted to seeing only items similar to those already rated. In user-based recommendation algorithms, the historical information concerning users and items is often limited, causing an inaccurate identification of the neighborhood and poor recommendations. Furthermore, the idea of rigidly assigning each user to a single cluster seems too restrictive, as the user could be interested in more than one topic. In the remaining part of this subsection we present some systems for collaborative filtering. After the pioneering system Tapestry [7], which introduced the term collaborative filtering, many different kinds of adaptive web sites, implementing different personalization techniques, have been proposed in the literature.

WebWatcher [12]. WebWatcher is one of the earliest systems for content-based recommendations. After entering a site, a visitor is asked what he/she is looking for, and before departing he/she is asked whether the desired information was found. WebWatcher learns to predict what links, on a particular page, the active user will probably follow, by analyzing previous successful paths. Different methods for calculating the degree of relevance of documents are used [24].

Ringo [28]. Ringo, also known as Firefly, is a networked system which, on the basis of the similarities between the interest profile of the active user and those of other users, makes personalized recommendations for music albums and artists. Ringo uses an individualized form of collaborative filtering, letting members rate hundreds of CDs or movies.

GroupLens [15]. The GroupLens system provides personalized recommendations for electronic forums, specifically Usenet news. The purpose of the system is to increase the quality of time spent reading electronic forums: a user does not want to miss the good articles, but does not want to waste his/her time on the bad ones. Personalization is based on the identification of a set of "neighbors" by examining previous patterns of agreement.

Fab [2]. Fab is a hybrid system applying content-based, collaborative recommendation. It maintains user profiles based on content analysis, and directly compares these profiles to determine similar users for collaborative recommendations. The user regularly receives a list of pages to evaluate, and this information is then used to update both a collection of agents and a user's selection agent. The agent applies TF-IDF techniques [24] to obtain a set of keywords associated with the document, and uses the cosine similarity measure to calculate how much, according to his/her user profile, the user is interested in the document. The best-evaluated documents are sent to other users with similar profiles.

Movielens [8]. Movielens is a system for film recommendations which uses both information from other users having similar video preferences, and information based on previous evaluations given by the active user. In order to identify which items a user will find worthwhile, the system combines information filtering and collaborative filtering techniques. Information filtering techniques focus on the items and develop a personal user interest profile, while collaborative filtering techniques identify other users with similar tastes in order to generate item recommendations.


Letizia [19]. Letizia is a content-based system that assists a user browsing the World Wide Web by recommending web documents. The agent automates a browsing strategy consisting of a best-first search algorithm augmented by heuristics inferring user interests from browsing behavior. In particular, by performing concurrent, autonomous explorations of links from the user's current position, the system attempts to anticipate items of interest for the user.

Footprints [29]. Footprints is a web browsing history visualization system which provides a way to exploit web site usage information to aid people browsing the site. The motivating metaphor is the following: a web site visitor leaves his/her "footprints" (for example, how often a link between adjacent pages is traversed), like a walker in the grass. Footprints are left automatically and can be useful to new visitors as indicators of the most interesting pages to visit.

2.2. Web personalization

Web personalization is an interesting and difficult task, whose aim is to automatically improve the organization and presentation of web sites on the basis of users' interests. The basic idea underlying all personalization systems consists in accumulating a vast amount of historical data about site visitors and analyzing visitor access patterns in order to reorganize the Web site so that it can be easily navigated. Web personalization has been used for different purposes: initially it was used to keep visitors on the site in order to advertise and promote products, while today it is prominently used to make the visit of a web site a personalized experience for each visitor, making the site useful and attractive. An adaptive Web site changes the information returned in response to a user request, predicting his/her preferences based on the preferences of a group of users. The key idea is that human preferences are correlated, so the active user will prefer the items that like-minded people prefer. Techniques for collecting user information are:

– Explicit profiling, i.e., each visitor is asked to provide some basic information, for example by filling out questionnaires or by giving a set of keywords indicating his/her interests.

– Implicit profiling, i.e., tracking the visitor's behaviour. This technique, transparent to the visitor, consists in saving behavioural information about each user and updating it at each visit. For example, the Amazon.com logs contain the buying history of each customer and, based on this historical information, specific items are recommended.
– Using legacy data, i.e., information such as credit applications and previous purchases is used as a valuable source of profile information.

By using the above techniques, possibly in combination, personalization systems produce comprehensive profiles that define site visitors' interests. Once the profiles are available, they have to be correctly analyzed in order to produce recommendations.

2.3. Ranking and authoritativeness

In this subsection we review the basic notions of graph theory that are useful for describing the most popular algorithms for ranking web pages, adopted in current search engines. A (directed) graph G is a pair (V, E), where V is the set of nodes and E ⊆ {(u, v) | u, v ∈ V} is the set of edges. Moreover, G is said to be weighted if E ⊆ {(u, σ, v) | u, v ∈ V, σ ∈ R+}, whereas it is said to be undirected if for each (u, σ, v) ∈ E there is an edge (v, σ, u) ∈ E. Given a node i of G, we denote by in(i) the set of arcs ending in i and by out(i) the set of arcs starting from i. We also let I(i) = |in(i)| and O(i) = |out(i)| denote the cardinalities of in(i) and out(i), respectively. A web graph is a directed graph where nodes denote page identifiers (URLs) and arcs denote links. A graph G = (V, E) is stored by means of a (|V| × |V|) matrix A, called the adjacency matrix, where A(i, j) = 1 if there is an arc from i to j, and 0 otherwise. For a weighted graph G, the value of A(i, j) denotes the weight of the arc i → j, and A(i, j) = 0 means that there is no arc from i to j. In the following, the graph G = (V, E) representing the web is not weighted, and A(i, j) = 1 means that page i has a link to page j; i.e., we do not consider heuristics assigning weights to arcs on the basis of page content. Given a matrix A, an eigenvalue of A is a number λ such that Aw = λw for some vector w, called an eigenvector; the number of linearly independent eigenvectors associated with λ defines its multiplicity. Traditional content-based search engines have proved not to be very effective, as they suffer from the problem of synonymies and


polysemies, causing inaccurate results. In order to improve the recall² and the precision³ of search engines, a new approach based on the analysis of the link structure of the Web has become widely used [9]. Techniques for ranking web pages have been successfully used in search engines such as Google; mathematically, the PageRank measure can be expressed by the following formula. Let p be a Web page, O(p) the number of links starting from p and L(p) the set of pages having a link to p; the PageRank of p is:

PR(p) = d + (1 − d) × Σ_{p_i ∈ L(p)} PR(p_i) / O(p_i)

where d is a real number such that 0 < d < 1 [3]. PageRank simulates the behaviour of a surfer randomly navigating the Web, who jumps with probability d to a randomly selected page and, with probability 1 − d, follows a randomly selected link. In some way it encapsulates a concept of generic popularity, but not of relevance to a particular topic. Kleinberg proposed an innovative approach based on the observation that each page has an authority rating (based on its incoming links) and a hub rating (based on its outgoing links) [14]. Kleinberg's algorithm, called the Mutual Reinforcement Approach,⁴ consists of two phases:

– Creation of the base set. A fixed-cardinality set of documents relevant to the query is constructed by considering a subset of the documents returned by a traditional search engine (called the root set) and a subset of the documents linked to documents in the root set.
– Computation of hub and authority scores. Let n be the number of pages in the base set B. The two vectors X and Y of size n, representing the authority and hub scores of each page in B, are computed iteratively as follows:

1. X = Y = [1, . . . , 1] ∈ R^n
2. For all the k iterations needed:
   (a) X ← A^T Y, Y ← A X
   (b) X ← X/‖X‖, Y ← Y/‖Y‖

² The number of relevant documents retrieved divided by the total number of relevant documents in the collection.
³ The number of relevant documents retrieved divided by the total number of documents retrieved.
⁴ Based on the mutual reinforcement relationship between hubs and authorities.
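The PageRank formula can be sketched as a fixed-point iteration over the adjacency matrix. This is a minimal sketch on an invented three-page graph, directly following the formula given in the text (the paper itself gives no implementation):

```python
import numpy as np

def pagerank(A, d=0.15, iters=50):
    """Iterate PR(p) = d + (1 - d) * sum_{p_i in L(p)} PR(p_i) / O(p_i),
    where d is the random-jump probability, as in the formula above."""
    n = A.shape[0]
    out = A.sum(axis=1)                   # O(i): out-degree of page i
    out[out == 0] = 1.0                   # guard against dangling pages
    pr = np.ones(n)
    for _ in range(iters):
        pr = d + (1 - d) * (A / out[:, None]).T @ pr
    return pr

# Toy web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(pagerank(A))
```

Page 2, having two in-links (one from the well-linked page 0), ends up with the highest score.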


At the end of the iteration process, the vectors X and Y contain, respectively, the authority and hub scores assigned to all pages in the base set. The vectors X and Y converge, respectively, to the principal eigenvectors of the matrices A^T A and A A^T.

3. A new technique for collaborative filtering

In this section we describe a new approach to collaborative filtering, based on spectral techniques currently used for the identification of relevant pages on the Web. The main steps involved in this technique are:

(a) the construction of the co-occurrence matrix from the user-item one;
(b) the identification of the most relevant items (pages); in this step we distinguish (i) the case of a new user, for which we compute an absolute measure of relevance for each page, and (ii) the case of a known user, for which the relevance is conditioned by his/her previous choices;
(c) the construction of a personalized view of the Web site, carried out by an ad hoc algorithm computing the relevance of a whole view of the site (and not only of each single page).

In more detail, the second step (the identification of the relevant pages) can be carried out in different ways, for instance with PageRank or with Kleinberg's algorithm, both described above. It should be pointed out, however, that PageRank suggests the most popular items without considering the different classes of users and the similarity between items: it can be useful for suggesting the most popular items, but not sets of related items. On the contrary, Kleinberg's algorithm suggests popular elements in the largest user community, and for this reason we adopt it in the following. Moreover, in the last subsection, we present a new ad hoc ranking strategy better suited to this particular task. Let us start by considering the simple case in which the active user is new and unknown to the system. The relevance is computed by looking at the co-occurrence graph, defined as follows.

Definition 3.1. Let U be a set of users and I a set of items; the user-item matrix A is associated with the graph GA = ⟨U, I, E⟩, where there is an arc (uk, ij) in E iff A(k, j) = 1. The symmetric matrix C = A^T A is called the co-occurrence matrix, and the associated graph GC the co-occurrence graph.
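The co-occurrence matrix of Definition 3.1 is a single matrix product; a toy sketch with invented data:

```python
import numpy as np

# User-item matrix: rows = users, columns = items (pages).
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

# Co-occurrence matrix: C[j, k] = number of users who chose both item j
# and item k (C[j, j] is the number of users who chose item j).
C = A.T @ A
print(C)
```

C is symmetric by construction, and its nonzero off-diagonal entries are exactly the edges of the co-occurrence graph GC.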


3.1. Suggestions for known users

Let us now consider the general case in which the active user is known, i.e., he/she has already expressed opinions about items. If the active user ua has already selected at least one item, say i1, then, instead of first computing the related items and then ranking them, we can carry out the two steps at the same time. Items similar to i1 are in fact those at a reasonable distance, say d, in the co-occurrence graph.

Definition 3.2. Let A be the user-item matrix, C = A^T A the co-occurrence matrix, GC the graph associated with C and i an item (node) of GC. The neighborhood of i at distance d > 0, denoted Ne(i, d), is the set of items in GC reachable from i with a path of length at most d.

Fig. 3 shows the neighborhood of an item i (the black node) at distance d = 1 (the gray nodes). In order to find the most authoritative items, conditioned on similarity to the item i1, we can simply apply Kleinberg's algorithm to the subgraph of GC containing only the nodes in Ne(i1, d). This phase outputs a set of items, each with an associated rank. In particular, the rank associated with an item i, denoted by rd(i|i1), can be considered a suggestion value for item i, calculated on the basis of the interest shown by the users in the item i1. In the following we assume that the distance d is fixed and, whenever no ambiguity arises, we denote rd by r. A justification of this approach lies in a theorem recently proved in [16], stating that a collaborative algorithm choosing among the neighbors of a given item achieves at least 0.704 of the optimum performance for the problem of recommending items. The theorem has been proved under particular assumptions, but it is a reasonable starting point for further derivations. We generalize this situation with the following definition.

Fig. 3. Ne(i, 1) can be used for calculating authoritative pages conditioned on the choice of item i.
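The neighborhood Ne(i, d) of Definition 3.2 amounts to a depth-bounded breadth-first search in the co-occurrence graph; a minimal sketch on an invented matrix:

```python
import numpy as np
from collections import deque

def neighborhood(C, i, d):
    """Ne(i, d): the items of the co-occurrence graph G_C reachable from
    item i with a path of length at most d (Definition 3.2); i itself is
    excluded, as in Fig. 3."""
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        j, dist = frontier.popleft()
        if dist == d:
            continue                          # do not expand beyond depth d
        for k in np.nonzero(C[j])[0]:
            if k != j and k not in seen:
                seen.add(int(k))
                frontier.append((int(k), dist + 1))
    seen.discard(i)
    return seen

# Co-occurrence matrix of a path-shaped graph 0 - 1 - 2 - 3.
C = np.array([[2, 1, 0, 0],
              [1, 2, 1, 0],
              [0, 1, 2, 1],
              [0, 0, 1, 2]])
print(neighborhood(C, 0, 1))   # -> {1}
print(neighborhood(C, 0, 2))   # -> {1, 2}
```

Kleinberg's algorithm is then applied only to the subgraph induced by the returned set (plus i itself).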

Definition 3.3. Let {i1, . . . , ih} be the set of currently selected items and d a fixed distance. The suggestion for an item i is

rd(i | {i1, . . . , ih}) = Σ_{j=1}^{h} rd(i|ij) × w(ij)

where

– rd(i|ij) is a ranking factor obtained by applying Kleinberg's algorithm to the neighborhood Ne(ij, d), for a fixed d > 0, and
– w(ij) is a weighting factor in [0, 1] assigned to each item.

As an example, the weighting factor could assign greater values to items chosen more frequently or, if the system runs for a long time, give more relevance to recently chosen items. In the latter case, relating item indexes to the (discretized) time at which they have been chosen, we obtain a factor of the form w(ij) = 1/2^j. Let us now consider the complexity of this approach. Given h previously selected items, we have to run Kleinberg's algorithm h times, at cost O(h × t × n²), where n is the number of items and O(t × n²) is the complexity of Kleinberg's algorithm. Note that this approach has the advantage of not needing an effective clustering of either users or items. On the other hand, there are at least two critical aspects: (i) h sets of neighbors have to be considered and, consequently, h eigenvectors have to be calculated; (ii) the combination of such eigenvectors is arbitrary. In particular, for the second aspect, note that Definition 3.3 is based on the assumption that interests can be linearly combined, even though there is no intuition behind this assumption. Since we nevertheless consider this approach interesting, in the following subsection we develop a ranking scheme that is not affected by the above problems.

3.2. A different ranking scheme

A new approach for finding relevant web pages has recently been introduced in [10]. Informally, the idea is to consider a random navigation in the co-occurrence graph in order to find the nodes which occur most frequently along the walks. In the following, P denotes the matrix obtained from C = A^T A by normalizing its rows (i.e., by dividing each element C(i, j) by Σ_{k=1}^{n} C(i, k)); P will be called the Transition Probability matrix.
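The row normalization producing the Transition Probability matrix P can be sketched as follows (toy co-occurrence matrix, not from the paper's experiments):

```python
import numpy as np

def transition_matrix(C):
    """Row-normalize the co-occurrence matrix C = A^T A so that each row
    of P sums to 1 (the Transition Probability matrix of Section 3.2)."""
    rowsum = C.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0] = 1.0        # leave all-zero rows at zero
    return C / rowsum

C = np.array([[2., 1., 1.],
              [1., 2., 1.],
              [1., 1., 2.]])
P = transition_matrix(C)
print(P)                             # each row sums to 1
```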


Definition 3.4. Being in a node i, the probability of going to a node j with a random walk of random length (composed of at most ω steps) is

Tω(i, j) = (f − 1) × [ (1/f) P(i, j) + (1/f²) P²(i, j) + · · · + (1/f^ω) P^ω(i, j) ]

where 1/f is a damping factor, with f > 0.

Theorem 3.5 [10]. Let P be a transition probability matrix and let f be a real number greater than the principal eigenvalue of P; then the sequence

Tω = (f − 1) × Σ_{i=1}^{ω} (P/f)^i

converges for ω → ∞ to the value W = (f − 1) × P × (f × I − P)^{−1}, where I is the identity matrix of the same size as P.

Theorem 3.5 is the real kernel of our technique: the columns of the matrix W can be used to obtain the relevance of the pages. In fact, Wj (column j of W) contains the probabilities of arriving in the corresponding node j with a random walk originating in nodes i = 1 . . . |V|, where |V| is the size of the base set (see Section 2). So, if we assume 1/|V| as the probability of being in node i, the row vector

r = (1/|V|) × [1, . . . , 1] × W = ((f − 1)/|V|) × [1, . . . , 1] × P × (f × I − P)^{−1}

is such that each term r(i) says how similar node i is to the other nodes. Thus r(i) is a measure of the relevance of page i. Note that here we are considering the global graph, i.e., the fixed distance coincides with the size of the base set, d = |V|. The idea which allows this technique to be applied to the collaborative filtering problem consists in modelling each set of items chosen by a user u, say Iu = {i1, . . . , ih}, as a 0-1 vector in the space of the n items, say Ī_u.

Definition 3.6. Given a user u and a vector of chosen items Ī_u, the ranking assigned to each item i is given by the formula

r = ( Ī_u^T / Σ_{1≤j≤n} Ī_u(j) ) × W,

where, letting P be the item-item matrix derived from the normalization of C = A^T A, W is equal to (f − 1) × P × (f × I − P)^{−1} and f > 0 is a constant factor.

In order to calculate the recommendation value of items, it is necessary to compute the matrix W; the vector r can then be computed by applying Definition 3.6.

Fact 3.7. The vector r can be computed in time O(n³).

Proof. O(n³) is the complexity of inverting an n × n matrix.

The computation of the recommendation values is expensive, as it requires the inversion of the matrix (f × I − P); moreover, inverting a matrix may be problematic if it is ill-conditioned. We next present an approximate solution for the computation of W. The value

W = (f − 1) × Σ_{i=1}^{∞} (P/f)^i

can be approximated by considering a fixed number of steps ω₀ < ∞ in the summation and an appropriate value for f.

Proposition 3.8. The approximated solution of r, say r̃, can be computed in time O(ω₀ × n²).

Proof. Consider the following algorithm, in which t_j and r_j denote, respectively, a temporary vector and the ranking vector at the j-th step:

t_0 = (Ī_u^T / f) × P,   r_0 = t_0;
. . .
t_j = (1/f) × t_{j−1} × P,   r_j = r_{j−1} + t_j.
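The iterative approximation of Proposition 3.8 can be checked against the closed form of Theorem 3.5. This is a sketch on an invented 3 × 3 matrix; the (f − 1) factor of Theorem 3.5 is applied at the end so that r̃ is directly comparable with the exact r:

```python
import numpy as np

def exact_W(P, f):
    """W = (f - 1) * P * (f*I - P)^{-1}  (Theorem 3.5)."""
    n = P.shape[0]
    return (f - 1) * P @ np.linalg.inv(f * np.eye(n) - P)

def approx_rank(Iu, P, f, omega0):
    """Iterate t_0 = (Iu/f) P, t_j = (1/f) t_{j-1} P, r_j = r_{j-1} + t_j
    for omega0 steps (Proposition 3.8), then scale by (f - 1)."""
    t = (Iu / f) @ P
    r = t.copy()
    for _ in range(omega0 - 1):
        t = (t / f) @ P
        r = r + t
    return (f - 1) * r

# Toy row-stochastic item-item matrix P.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
Iu = np.array([1.0, 0.0, 0.0])   # the user has chosen item 0 only
f = 2.0                          # must exceed the principal eigenvalue of P

r_exact = Iu @ exact_W(P, f)
r_tilde = approx_rank(Iu, P, f, omega0=30)
print(r_exact, r_tilde)
```

With f = 2 and 30 steps the truncated geometric series is already within about 2^−30 of the exact ranking, matching the convergence behaviour stated by Theorem 3.5.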


At each step we compute a row-vector-matrix product at cost O(n²); the global cost is, therefore, O(ω₀ × n²). Obviously, to improve on the computational complexity of the exact solution (Theorem 3.5), we have to fix ω₀ < n.

Theorem 3.9. Let f be the fixed constant used in the computation of the approximated value of r and ω₀ the number of steps; then the error δ = r − r̃ is

δ = 1 / (f^{ω₀} (f − 1)).

Proof. By fixing the number of iterations to ω₀, the value r̃ becomes an approximate computation of the vector of rankings r, as given in Definition 3.6; considering only the first ω₀ terms of the sequence, the error is bounded by

1/f^{ω₀+1} + · · · + 1/f^ω = (1/f^{ω₀+1}) × (1 − 1/f^{ω−ω₀}) / (1 − 1/f),

which for ω → ∞ gives

1 / (f^{ω₀} (f − 1)) = δ.

It follows that, by fixing the maximal error δ and the constant f, it is possible to determine the number of iterations k needed (i.e., k = log_f(1/(δ(f − 1)))). Observe that, in comparison with Kleinberg's algorithm, our approach is not affected by the spectral structure of the matrix W, as the formulas for calculating both r and r̃ still hold even if the underlying graph consists of several strongly connected components. On the contrary, Kleinberg's algorithm computes the principal eigenvector corresponding to the largest community, and its complexity is O(t × n²), where t is the number of steps necessary for convergence (t = O(n), although it can be considered constant in practical cases).
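The relation k = log_f(1/(δ(f − 1))) can be turned into a small helper (the function name is hypothetical):

```python
import math

def iterations_needed(delta, f):
    """k = log_f(1 / (delta * (f - 1))): the number of steps after which
    the truncation error 1 / (f^k (f - 1)) is at most delta (Theorem 3.9)."""
    return math.ceil(math.log(1.0 / (delta * (f - 1)), f))

k = iterations_needed(delta=1e-6, f=2.0)
print(k)   # -> 20, since 1 / (2^20 * (2 - 1)) <= 1e-6
```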

4. Web site personalization

The aim of this section is to apply the technique described above to the development of a recommendation system in the Web context. The first notion we need to formalize is that of a Web site.

Definition 4.1. A Web site S can be modeled by means of a rooted, directed graph G = ⟨h, N, E⟩, where the nodes in N represent pages of S, the arcs in E represent links among pages of S, and the root h of the graph represents the home page of S.

The root of the graph G will also be denoted by root(G). Every navigation of the site S defines a different view of S. In order to identify the portion of a web site S which is most relevant for a given user ua, we consider the associated rooted graph GS = ⟨h, N, E⟩ and compute a new rooted graph G′S = ⟨h, N′, E′⟩, where N′ ⊆ N is the set of nodes of G (pages of S) which are relevant for the user. The relevance of a page p is given by the function Support(p), which is defined on the basis of previous navigations of the site. In order to compute G′S, we first compute a tree TS = ⟨N′, Et⟩ with root h; the graph G′S is then derived from TS by adding to Et the arcs in E joining nodes in N′, i.e., E′ = Et ∪ {(x, y) | (x, y) ∈ E ∧ x, y ∈ N′}. The function TreeView, which takes the graph GS associated with a web site S and computes the tree TS, is reported in Fig. 4. The selection of the most authoritative page in the set of nodes N (corresponding to pages of S) is performed by assigning to every node of G a ranking value as defined in Definition 3.6. Thus, the vector r = Ī_u^T/(Σ_{1≤j≤n} Ī_u(j)) × W assigns a rank to every node of G on the basis of the user preferences (the value of Ī_u) and the log information (the value of W). Moreover, while for a new user all elements of Ī_u are equal to 1, for a known user the value of Ī_u can be computed in two different ways.

function TreeView(G: Rooted Graph): Tree;
var T : Tree := {(nil, root(G))};
    N : set of nodes := Adjacents(root(G), G);
    p' : node := root(G);
    p : node := MostAuthoritative(N);
begin
    while Support(p) > min_supp do begin
        T := T ∪ {(p', p)};
        N := (N − {p}) ∪ Adjacents(p, G);
        p' := p;
        p := MostAuthoritative(N)
    end;
    return T
end

Fig. 4. Function TreeView.
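The greedy construction performed by TreeView can be sketched as follows. This is a free interpretation, not the paper's implementation: `rank` and `support` are hypothetical stand-ins for the ranking of Definition 3.6 and the Support() function, and each candidate page remembers the tree node that reached it, so that the result is a tree rather than a single path:

```python
def tree_view(graph, root, rank, support, min_supp):
    """Grow a tree from the home page, at each step attaching the most
    authoritative page among those adjacent to the tree built so far,
    until its support drops to min_supp or below."""
    tree = []                                   # arcs (parent, child) of T_S
    in_tree = {root}
    frontier = {q: root for q in graph[root]}   # candidate page -> its parent
    while frontier:
        p = max(frontier, key=rank)             # MostAuthoritative(N)
        if support(p) <= min_supp:
            break
        tree.append((frontier.pop(p), p))
        in_tree.add(p)
        for q in graph.get(p, []):
            if q not in in_tree and q not in frontier:
                frontier[q] = p
    return tree

# Invented site graph and ranks.
graph = {"home": ["a", "b"], "a": ["c"], "b": [], "c": []}
rank = {"home": 1.0, "a": 0.9, "b": 0.5, "c": 0.8}.get
tree = tree_view(graph, "home", rank, rank, min_supp=0.6)
print(tree)   # -> [('home', 'a'), ('a', 'c')]
```

Page b, whose support falls below the threshold, is excluded from the personalized view, exactly as the while-condition of Fig. 4 prescribes.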


1. By looking at log files: in this case the site must provide a login for each user, in order to identify his/her operations without relying on the IP address. Each user is thus associated with a set of most frequently visited pages, which can be used to compute the most authoritative view of the site.
2. By an explicit query: in this case the user can query the Web asking for pages relevant to topics of interest; the system extracts a set R as the response to the given query.

Observe that the first method is quite simple, while the second one requires further explanation. If a user u submits a query q consisting of a string containing some terms, we need a technique for obtaining the starting set R. An effective idea is to consider the neighborhood of the site stored in a local DB, consisting of a set of pages obtained with a focused crawler. Then an algorithm can be applied to find the most authoritative pages with respect to q, keeping only the ones in N, which correspond to pages in the site S. Such pages define the set R.

The function MostAuthoritative implements one of the techniques presented in the previous section on the bipartite graph ⟨U ∪ N, E''⟩ connecting users and visited pages. In particular, this graph is obtained by considering the set U of users and the set N of pages of G_S, and by adding an arc e = (u_i, p_j) to E'' whenever the log file records that user u_i accessed page p_j.
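The bipartite user/page graph and the two ways of deriving I_u can be sketched as follows; the (user, page) log format and the dictionary encoding are illustrative assumptions, not the paper's actual data structures.

```python
def user_page_graph(log_records):
    """Builds the bipartite graph connecting users to the pages they
    visited, from hypothetical (user, page) log entries."""
    users, pages, edges = set(), set(), set()
    for user, page in log_records:
        users.add(user)
        pages.add(page)
        edges.add((user, page))          # arc (u_i, p_j): u_i accessed p_j
    return users, pages, edges

def interest_vector(user, pages, edges):
    """I_u: all 1s for a new (unknown) user; for a known user, 1 only on
    the pages the log shows he/she visited (one simple reading of the text)."""
    if user is None:                     # new user: uniform interest
        return {p: 1 for p in pages}
    visited = {p for u, p in edges if u == user}
    return {p: int(p in visited) for p in pages}
```

Given such a graph, a query-based session would instead initialize I_u from the set R of pages matching the query rather than from the log.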

5. Implementation of an intelligent agent

In this section a system prototype supporting web navigation is presented. The architecture is shown in Fig. 5 and consists of the following modules:

– User interface, which adds to the site a new page consisting of two frames: the one on the left shows the structure of the site, whereas the one on the right shows the current page.
– SP (Site Personalization), which produces the personalized view of the site shown in the left frame of the user interface; the personalization of the site is defined on the basis of the information stored in the log file and the structure of the web site stored in the local database.
– RA (Ranking Algorithm), which applies the ranking algorithm, presented in Section 3, to the pages of the site.
– Crawler, which downloads from the Web the pages relevant for the site (e.g., pages pointing to or referred to by pages in the site).

Every time a user accesses the site he/she can obtain a personalized view, either by logging into the system with a username and a password, or by explicitly supplying a query describing the topics of interest. The personalization is performed by the SP module, which is the core of the system, as it implements the Ranking Algorithm conditioned on the user behavior, described in the previous section. In particular, the user interests can be derived by looking at the log file (in the case of a logged user) or by analyzing the query explicitly supplied. In the latter case, the RA module applies the Ranking Algorithm to the neighborhood of the site, which has been collected by means of the Crawler and stored in a local database DB. Note that the user interface allows the personalized view of the site to be updated and supports the submission of queries searching for documents relevant to a given topic (see Fig. 8).

The prototype has been implemented in Java and has been tried out on the Web site of the DEIS Department of the University of Calabria, whose link structure is reported in Fig. 6 (where many pages with no outlinks have been removed). For a better understanding of its structure, a logical view of the site is shown in Fig. 7, where each page directly reachable from the home page has links to four pages, each of them representing a different area of the department: Computer Science, Electronics, Control Systems and Optimization. For the sake of simplicity, in Fig. 7 the four different areas have been represented by means of circles.
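As a rough sketch of what the SP and RA modules compute, the rank vector r = (I_u / ∑_j I_u(j))^T × W of Section 4 can be written as below; the dict-of-dicts encoding of the log-derived matrix W is a toy assumption made for the sake of the example.

```python
def rank_pages(interest, weights):
    """Toy sketch of r = (I_u / sum_j I_u(j))^T x W: normalize the user
    interest vector I_u, then weight it through the log-derived matrix W
    (here a hypothetical dict-of-dicts, rows and columns indexed by page)."""
    total = sum(interest.values())
    norm = {p: v / total for p, v in interest.items()}   # I_u / sum_j I_u(j)
    return {q: sum(norm[p] * weights.get(p, {}).get(q, 0.0) for p in norm)
            for q in interest}
```

A uniform interest vector (a new user) thus ranks pages purely by the log information in W, while a vector concentrated on a few pages biases the ranking toward the pages they point to.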


Fig. 5. A system for web personalization.


Fig. 6. Structure of the site “www.deis.unical.it”.

The main drawback of this organization is that a user interested in information regarding computer science must navigate a large portion of the site. By analyzing users' logs it is possible to compute, using the function TreeView, a personalized view of the site containing the paths frequently navigated by 'similar' users. A similar result can be obtained by supplying the query "computer science"; in this case the set of relevant pages is stored in the vector I_u, which is used for computing the most authoritative page in the function TreeView.

Fig. 7. Logical view of DEIS site.

The user interface consists of two frames: one supports user navigation by displaying the most authoritative tree, while the other is used to visualize pages (see Fig. 8). From the frame on the left a user can log into the system, submit queries and update the personalized view of the site. This frame is also used to visualize the personalized view of the site as a tree, shown in terms of nested hyperlinks labeled with the title of the corresponding page.

Fig. 8. Web user interface.

6. Conclusion

Online personalization and tools supporting web site navigation are an appealing and promising task for the AI community. In practice, the attractiveness of such technologies derives from the claim by consumer behaviorists that making a site useful and attractive, so that a visit becomes a personalized experience for each visitor, can increase the likelihood that e-shoppers make a purchase. The work reported in this paper is part of our ongoing research effort to implement novel algorithms for adaptive Web sites. In particular, in order to provide personalized recommendations, we propose an effective approach obtained by combining knowledge from the field of Recommendation Systems with the process of supporting navigation of web sites. The proposed technique has been implemented in a system prototype tested internally on the web site of the DEIS Department of the University of Calabria. A theoretical analysis of our algorithm makes our approach competitive with previously proposed systems, while the analysis of the experimental results shows that the system adds real benefit to the process of retrieving useful information. We are currently conducting further experiments to validate the suggestions the system returns in response to a given user query, and we are implementing a version of the prototype that can easily be adapted to a generic site.

References

[1] Y. Azar, A. Fiat, A.R. Karlin, F. McSherry and J. Saia, Spectral analysis of data, in: Proc. ACM Symposium on Theory of Computing, 2001, pp. 619–626.
[2] M. Balabanovic and Y. Shoham, Content-based, collaborative recommendation, Communications of the ACM 40(3) (1997), 66–72.
[3] S. Brin and L. Page, The PageRank citation ranking: bringing order to the Web, http://google.standford.edu/backrub/pageranksub.ps.
[4] W.W. Cohen and W. Fan, Web-collaborative filtering: recommending music by crawling the Web, Proc. Int. Conf. on World Wide Web, Computer Networks 33(1–6) (2000), 685–698.
[5] J. Fink, A. Kobsa and A. Nill, Adaptable and adaptive information provision for all users, including disabled and elderly people, The New Review of Hypermedia and Multimedia 4 (1998), 163–188.
[6] E.J. Glover, S. Lawrence, W.P. Birmingham and C. Lee Giles, Architecture of a metasearch engine that supports user information needs, in: Proc. Int. Conf. on Information and Knowledge Management, 1999, pp. 210–216.


[7] D. Goldberg, D. Nichols, B.M. Oki and D.B. Terry, Using collaborative filtering to weave an information tapestry, Communications of the ACM 35(12) (1992), 61–70.
[8] N. Good, B. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker and J. Riedl, Combining collaborative filtering with personal agents for better recommendations, in: Proc. of the AAAI'99 Conference, 1999, pp. 439–446.
[9] Google Corporation, Google Search Engine, http://www.google.com.
[10] G. Greco, S. Greco and E. Zumpano, A probabilistic approach for distillation and ranking of Web pages, WWW Journal 4(3) (2001), 189–207.
[11] S.J. Hong, R. Natarajan and I. Belitskaya, A new approach for item choice recommendations, in: Third Int. Conf. on Data Warehousing and Knowledge Discovery, 2001, pp. 131–140.
[12] T. Joachims, D. Freitag and T.M. Mitchell, WebWatcher: a tour guide for the World Wide Web, in: Proc. Int. Joint Conf. on Artificial Intelligence, Vol. 1, 1997, pp. 770–777.
[13] G. Karypis, Evaluation of item-based Top-N recommendation algorithms, in: Proc. Int. Conf. on Information and Knowledge Management, 2001, pp. 247–254.
[14] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46(5) (1999), 604–632.
[15] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon and J. Riedl, GroupLens: applying collaborative filtering to usenet news, Communications of the ACM 40(3) (1997), 77–87.
[16] R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Recommendation systems: a probabilistic analysis, Journal of Computer and System Sciences 63(1) (2001), 42–61.
[17] S. Lawrence, Context in Web search, IEEE Data Engineering Bulletin 23(3) (2000), 25–32.
[18] R.D. Lawrence, G. Almasi, V. Kotlyar, M. Viveros and S. Duri, Personalization of supermarket product recommendations, Data Mining and Knowledge Discovery 5 (2001), 11–32.
[19] H. Lieberman, Letizia: An agent that assists Web browsing, in: Proc. Int. Joint Conf. on Artificial Intelligence, Vol. 1, 1995, pp. 924–929.
[20] D.M. Pennock, E. Horvitz, S. Lawrence and C. Lee Giles, Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach, in: Proc. 6th Conf. on Uncertainty in Artificial Intelligence, 2000, pp. 473–480.
[21] M. Perkowitz and O. Etzioni, Towards adaptive Web sites: conceptual framework and case study, Artificial Intelligence 118(1–2) (2000), 245–275.
[22] N. Ramakrishnan, B.J. Keller, B.J. Mirza, A.Y. Grama and G. Karypis, When being weak is brave: privacy issues in recommender systems, IEEE Internet Computing (2002) (to appear).
[23] J. Riedl, Recommender systems in commerce and community, in: Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001, p. 15.
[24] J.J. Rocchio, Relevance feedback in information retrieval, in: The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, NJ, 1971, pp. 313–323.
[25] B.M. Sarwar, G. Karypis, J.A. Konstan and J. Riedl, Analysis of recommendation algorithms for e-commerce, in: ACM Conference on Electronic Commerce, 2000, pp. 158–167.


[26] B.M. Sarwar, G. Karypis, J.A. Konstan and J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proc. 10th Int. Conf. on World Wide Web, 2001, pp. 285–295.
[27] C. Shahabi, F.B. Kashani, Y. Chen and D. McLeod, Yoda: an accurate and scalable Web-based recommendation system, in: Proc. 9th Int. Conf. on Cooperative Information Systems, 2001, pp. 418–432.
[28] U. Shardanand and P. Maes, Social information filtering: Algorithms for automatic "word of mouth", in: Proc. Conf. on Human Factors in Computing Systems, 1995, pp. 210–217.
[29] A. Wexelblat and P. Maes, Footprints: history-rich tools for information foraging, in: Proc. Conf. on Human Factors in Computing Systems, 1999, pp. 270–277.