
Web Intelligence and Agent Systems: An international journal 1 (2003) 249–257 IOS Press

Context-sensitive filtering for the web

Noriyuki Matsuda a,∗, Tsukasa Hirashima b, Toyohiro Nomoto c, Hirokazu Taki a and Jun’ichi Toyoda d

a Wakayama University, Japan
b Kyushu Institute of Technology, Japan
c Hitachi Ltd., Japan
d Kinran College, Japan

Abstract. This paper proposes a filtering method for Web browsing. Several studies have reported that a user’s interests shift continuously during browsing. To facilitate browsing, a filtering system must adapt to these shifts. On the assumption that the user’s interests are reflected in the content and order of the pages browsed, we previously proposed Context-Sensitive Filtering (CSF) for a hypertext CD-ROM encyclopedia, in which the entire hypertext is controlled so that each page has one topic. We confirmed the effectiveness of CSF for the encyclopedia, but not for the Web. We inferred that the failure stems from the difference in page properties between the Web and the encyclopedia. This paper describes improvements to the model and to CSF. The main improvements are dividing a Web page so that each part approximates an encyclopedia page and augmenting the model calculation with TF-IDF. We confirmed the effectiveness of the improved CSF by comparing several filtering methods.

∗ Corresponding author: Sakaedani 930, Wakayama City, 6408510 Japan. Tel.: +81 73 457 8122; Fax: +81 73 457 8112; E-mail: [email protected].

1570-1263/03/$8.00 © 2003 – IOS Press. All rights reserved

1. Introduction

Browsing is a popular way to gather information from the Web. Many studies have proposed assistant functions for Web browser users, and most have adopted a machine learning approach. Balabanovic et al. observed that users’ interests shift during Web browsing and reported that their machine learning approach had difficulty following these shifts [1]. Lieberman emphasized the importance of an assistant that follows the shift of users’ interests [4]. He also pointed out that the influence of the browsing history on user interests decays over time.

The purpose of our research is to filter keyword search results so that they follow the shifts of a user’s interests while the user browses hypertext, e.g., an encyclopedia or the Web. Herein, we call the shifts of the user’s interests “context”. We make the following two assumptions: (Assumption-1) a user’s interests in browsing shift according to the contents of the pages that the user reads; (Assumption-2) the influence of browsed pages on the user’s interests decreases over time.

We previously proposed a model built from the user’s browsing history and Context-Sensitive Filtering (CSF), which uses the model to order the pages in the result list of a keyword search [2]. Because users usually read a result list from the top, CSF places the pages that the model judges closest to the user’s interests near the top of the list. We confirmed the effectiveness of CSF for a hypertext CD-ROM encyclopedia by comparing several filtering methods. All pages of that encyclopedia are kept to the size of one topic, and we consider that this uniform control of the whole hypertext makes it easier to model a user’s interests from the browsing history.

On the other hand, the effectiveness of CSF for the Web was not confirmed. We assumed that this failure was caused by the difference between the page properties of the Web and those of the encyclopedia. A Web page contains both items that influence a user’s interests and items that do not, and we infer that this mixture makes model construction inaccurate.

This paper describes improvements to the model and the filtering for general HTML pages. The main improvements are dividing a Web page so that each part approximates a CD-ROM encyclopedia page and adding TF-IDF to the model to improve its accuracy. We describe the following two methods: dividing a Web


N. Matsuda et al. / Context-sensitive filtering for the web

page based on HTML elements, which express document structure, and calculating the model with a recurrence formula whose cost is independent of the length of the browsing history. The model uses a vector-space method. The proposed dividing method applies only to HTML pages written according to the HTML specification. We confirmed the effectiveness of the improvements through comparison with several other filtering methods.

Our model has the following advantages. First, it follows shifts of the user’s interests in browsing. Widyantoro et al. proposed a model of a user’s interests in Web browsing that addresses decay over time [7]. Their model learns the dynamics of the user’s interests through relevance feedback and maintains a long-term and a short-term descriptor. Our model makes no explicit distinction between long-term and short-term interests; it represents them continuously as a decay curve. The source of the model also differs: our model uses the browsing history, whereas theirs uses relevance feedback.

Second, our model places no additional load on the user during browsing, such as rating the assistance or entering preferences. Kushmerick et al. proposed a user model that requires no explicit input other than browsing [3]. Their model is built from the referrer URL of a page in the browsing history and suggests pages similar to the user’s referrer page. Our model additionally takes into account that the influence of the history decays over time, so that it can follow shifts of the user’s interests.

Third, our dividing method for Web pages is simple. Stefani et al. measure similarity between words to reflect interests [6]. Compared with such semantic division, our dividing method may be less accurate. However, it may have wider applicability because it is independent of page contents. We consider that a simple method is an advantage when running a filtering system for general Web pages, because it responds quickly even under numerous accesses.

Section 2 describes the issues of CSF. In Section 3, we define a new CSF for the Web. In Section 4, we report the design of a filtering system. Section 5 describes our evaluation experiment.

2. Issues of CSF

Hirashima et al. verified the effectiveness of CSF in an experiment with actual users’ browsing [2]. However, we could not confirm the effectiveness of CSF for Web browsing. This section discusses the reason.

2.1. Overview of CSF

By Assumption-1 in Section 1, a user’s interests are modeled from the contents of the pages in the browsing history. The model is represented with a vector-space method. Each vector weight expresses how close a word is to the user’s interests. By Assumption-2 in Section 1, the weight decreases over time in the browsing history. The weight calculation must meet the following two conditions.

Condition-1: The weight decreases over time.
Condition-2: The time needed to update the model is practical and independent of the length of the browsing history.

Condition-2 is required to implement the filtering system. Users benefit from the filtering function that CSF provides: it uses the model to sort the result list of a keyword search [2].

2.2. WebFALCON-I

We call the filtering system for a hypertext CD-ROM encyclopedia ‘CD-ROM FALCON (Filtering Agent based on Local CONtext)’ and the filtering system for the Web ‘WebFALCON-I’. In CD-ROM FALCON, the user and the filtering system share the same computer. In WebFALCON-I, however, the system must gather the browsing history from the user’s computer. In a practical system, the gathering function would be a browser add-on; we used a proxy server to gather users’ browsing histories because it is easier to implement.

In CD-ROM FALCON, the system can build a complete database of the words and pages in the hypertext. That is impossible in WebFALCON-I. Here, we used Web robot software to build a database of words and pages on the Web. When a user browses a page that is not registered in the database, the page and the words in it are registered at that time. Figure 1 shows an overview of WebFALCON-I; it consists of the following four major functions.

Database process:
1. A Web software robot gathers Web pages.
2. Words contained in each page are extracted by morphological analysis.
3. The page URL and words are registered in the database.

Modeling process:
1. A user is identified by the access.


Fig. 1. Overview of WebFALCON-I. (Diagram components: User, Browser, proxy, Software Robot, Lexical Analysis, Retrieval, Filtering, Modeling, Page-DB, Model-DB; arrow labels: Search, Browsing, url, result.)

2. When the URL is not registered in the database, words are extracted by morphological analysis, and the URL and the words in the page are registered in the Page-DB.
3. The user’s model is updated in the Model-DB.

Retrieval process:
1. Pages that contain the specified keywords are found.
2. The result list is made; each entry consists of a page title and the page text (256 bytes).

Filtering process:
1. A user is identified by access.
2. A score is calculated for each page in the result list by the user’s model in the Model-DB. The result list is sorted by score.

2.3. WebFALCON-I issues

We performed an experiment to compare several filtering methods with WebFALCON-I. However, the experiment did not demonstrate the effectiveness of WebFALCON-I. We considered that the difference between the page properties of the Web and those of the encyclopedia inhibited the filter. Pages in the CD-ROM encyclopedia are arranged overall so that one page has one topic. In contrast, there is no overall arrangement for Web pages, which often include various topics. We infer that this difference causes the following problems.

Problem-1: Modeling disturbance. Parts of a Web page that have no influence on the user’s interests introduce noise into the model. This decreases modeling precision.

Problem-2: Filtering disturbance. Parts of a Web page that have no influence on the user’s interests introduce noise into the evaluation of candidate pages in the search results. This decreases filtering precision.

3. CSF for the web

This section describes the following three methods: dividing a Web page so that each part approximates a page of the encyclopedia; the revised CSF model calculation; and the filtering calculation.

3.1. Dividing web page

3.1.1. Dividing procedure

To improve accuracy, each divided part must contain fewer topics than the original page. We assume that document structure is strongly related to document topics. According to the HTML specification, some HTML elements are defined to express document structure [5]. The specification states that a heading element (H1, H2, H3, H4, H5, H6) ‘briefly describes the topic of the section it introduces’. Horizontal rule (HR) elements indicate a change of topic.


Here, we designed a division method for Web pages based on these HTML elements. In actual Web pages, authors sometimes use HTML elements incorrectly because they do not know the specification. In such cases, it is extremely difficult to divide a page exactly according to topic. However, model precision is expected to improve whenever division decreases the noise with respect to the user’s interests.

Document structure is represented as a hierarchy. A leaf of the hierarchy is a minimum unit of the document, and we regard a leaf as a topic. Figure 2 shows an example of a hierarchy. When an HR element separates two sentences, they are represented as siblings in the hierarchy; Sentence-B and Sentence-C in Fig. 2 are examples. A node of the hierarchy can inherit properties from its parent node. A leaf node together with everything it inherits down from the root node expresses a topic. We call such a leaf a Page-Block. The division procedure follows.

1. The content of the head element is regarded as the root node of the hierarchy. The following steps are applied to the body of the page.
2. All elements except headings and HR are eliminated.
3. HR elements with no sentence between them, i.e. consecutive HR elements, are deleted.
4. Each HR element is replaced by a heading element that is one level lower than the heading T above the HR element.
5. Nodes are made in order, starting from the highest heading level. Each node holds its heading sentence.
6. Step 5 is repeated until the lowest heading level is reached.
7. A Page-Block is made for each leaf by inheritance from all nodes on the path to the leaf.

3.1.2. Preliminary experiment

To evaluate the division method, we asked eight subjects to judge whether each Page-Block is a text with one topic. Twenty Web pages gathered randomly from Japanese Web sites were divided into 86 Page-Blocks. Subjects assigned each Page-Block to one of three categories: “too small”, “adequate size”, or “too large”. Table 1 shows the results of the experiment.

Most Page-Blocks that subjects judged “too small” or “too large” were advertisements and link pages, which are intended to introduce links. These are inferred to have only a slight influence on filtering precision. Most division failures were caused by deviant use

Table 1
Result of the preliminary experiment

Category         Mean proportion of Page-Blocks
Adequate size    58%
Too small        22%
Too large        20%

of HTML elements in ignorance of the specification. In such cases, heuristic division rules for the common failure cases might be effective. Given these considerations, we consider that the division method improves the precision of the user’s model.
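The division procedure of Section 3.1.1 can be illustrated with a short sketch. The code below is not the authors’ implementation: it is a minimal approximation that uses Python’s standard `html.parser`, and the names `PageBlockParser`, `divide_page` and `FIG2_PAGE` are hypothetical. It assumes that only the page title is taken from the head element and that an HR behaves like an untitled heading one level below the last real heading. The sample page is a guess at markup that would produce the hierarchy of Fig. 2.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class PageBlockParser(HTMLParser):
    """Collects Page-Blocks: one per leaf text, prefixed by title and headings."""

    def __init__(self):
        super().__init__()
        self.title = ""            # root node of the hierarchy
        self.path = []             # open nodes as (level, heading text)
        self.blocks = []           # finished Page-Blocks
        self.leaf_text = []        # text collected for the current leaf
        self.in_title = False
        self.open_heading = None   # level of the heading being read, if any
        self.heading_text = []
        self.last_heading_level = 0

    def _close_leaf(self):
        body = " ".join(self.leaf_text)
        self.leaf_text = []
        if body:  # leaves without text (e.g. consecutive HRs) yield no block
            inherited = [self.title] + [text for _, text in self.path]
            self.blocks.append([part for part in inherited if part] + [body])

    def _open_node(self, level, text):
        # Nodes at the same or a deeper level are siblings or closed branches.
        while self.path and self.path[-1][0] >= level:
            self.path.pop()
        self.path.append((level, text))

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in HEADING_LEVELS:
            self._close_leaf()
            self.open_heading = HEADING_LEVELS[tag]
            self.heading_text = []
        elif tag == "hr":
            # Assumption: HR acts as an untitled heading below the last heading.
            self._close_leaf()
            self._open_node(self.last_heading_level + 1, "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in HEADING_LEVELS and self.open_heading is not None:
            self._open_node(self.open_heading, " ".join(self.heading_text))
            self.last_heading_level = self.open_heading
            self.open_heading = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.open_heading is not None:
            self.heading_text.append(text)
        else:
            self.leaf_text.append(text)  # all other elements are ignored


def divide_page(html):
    parser = PageBlockParser()
    parser.feed(html)
    parser.close()
    parser._close_leaf()  # flush the last leaf
    return parser.blocks


FIG2_PAGE = """<html><head><title>Musical instrument</title></head><body>
<h1>instruments</h1>
<h2>Strings</h2><p>sentence-A</p>
<h2>Percussion</h2><p>sentence-B</p><hr><p>sentence-C</p>
</body></html>"""

for block in divide_page(FIG2_PAGE):
    print(", ".join(block))
```

A real page would also need handling for script and style content and for misused heading elements, which this sketch ignores.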

3.2. Calculation of weight

3.2.1. Condition of CSF for the web

To increase model accuracy, we added TF-IDF to the weight calculation of a word defined in [2]. The following condition is added to CSF.

Condition-3: The weight is directly proportional to the frequency of the word in a Page-Block and inversely proportional to the number of Page-Blocks that include the word.
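Condition-3 does not fix a particular TF-IDF formula, so the following is only one possible reading. The sketch assumes a relative term frequency, a smoothed inverse document frequency `log(1 + N/df)`, and a final division by the largest value so that every result lies in (0, 1]. The function name `tfidf_table` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_table(blocks):
    """Return {block_id: {word: tfidf}} with every value scaled into (0, 1].

    `blocks` maps a Page-Block id to its list of words.
    """
    n_blocks = len(blocks)
    # Document frequency: number of Page-Blocks that contain each word.
    df = Counter()
    for words in blocks.values():
        df.update(set(words))

    raw = {}
    for block_id, words in blocks.items():
        tf = Counter(words)
        raw[block_id] = {
            # Smoothed IDF (an assumption) keeps the value positive even
            # when a word occurs in every Page-Block.
            word: (count / len(words)) * math.log(1 + n_blocks / df[word])
            for word, count in tf.items()
        }

    # Scale by the largest value so that the maximum becomes 1.
    top = max((v for row in raw.values() for v in row.values()), default=1.0)
    return {b: {w: v / top for w, v in row.items()} for b, row in raw.items()}


sample_blocks = {
    "b1": ["violin", "string", "string"],
    "b2": ["drum", "string"],
    "b3": ["drum", "drum"],
}
table = tfidf_table(sample_blocks)
```

In this toy data, “string” scores higher in b1 than in b2 because it is more frequent there, which is the proportionality that Condition-3 asks for.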

3.2.2. Definition of a weight of an index

To satisfy Condition-3, the TF-IDF method is added to the calculation of our model. The new definition of the weight of an index follows. When a user has browsed n (n ≥ 1) pages, the weight Iw(i, n) of index i is the following.

\[
Iw(i, n) = \frac{\sum_{k=1}^{n} Hw(k) \cdot TFIDF(i, k)}{\sum_{k=1}^{n} Hw(k) \cdot TFIDF_{max}} = \frac{s(i, n)}{T(n)} \tag{1}
\]

Therein, s(i, n) is the numerator, and the denominator T(n) normalizes Iw(i, n); k denotes the kth page in the history. Hw(k) is the weight of the kth page and decays over time. TFIDF(i, k) (0 < TFIDF ≤ 1) is the TF-IDF value of index i in the kth page, and TFIDF_max is the maximum value of TFIDF(i, k) in the browsing history. TFIDF(i, k) and TFIDF_max are calculated when the Web robot gathers a page, before the model is calculated (see Section 4.1). When the model is updated, these values are read from the Page-DB. Therefore, they are regarded as constants.
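Eq. (1) can be evaluated without revisiting the whole history. The sketch below assumes the exponential page weight Hw(k) = r^(n−k) that the paper gives as Eq. (2) in Section 3.2.3 and a constant TFIDF_max; under those assumptions the sums reduce to s(i, n) = r·s(i, n−1) + TFIDF(i, n) and T(n) = r·T(n−1) + TFIDF_max. This derivation and the names `UserModel` and `direct_weight` are illustrative and are not taken from the paper, whose own recurrence is given in [2].

```python
def direct_weight(tfidf_history, r, tfidf_max):
    """Eq. (1) evaluated directly; tfidf_history lists TFIDF(i, k), oldest first."""
    n = len(tfidf_history)
    s = sum(r ** (n - k) * tfidf_history[k - 1] for k in range(1, n + 1))
    t = sum(r ** (n - k) * tfidf_max for k in range(1, n + 1))
    return s / t


class UserModel:
    """Incremental form of Eq. (1); cost per update does not depend on n."""

    def __init__(self, r=0.6, tfidf_max=1.0):
        self.r = r                  # rate of decline
        self.tfidf_max = tfidf_max  # treated as a constant (assumption)
        self.s = {}                 # numerator s(i, n) per index word
        self.t = 0.0                # normalizer T(n)

    def update(self, page_tfidf):
        """Fold one newly browsed page ({word: TFIDF}) into the model."""
        for word in self.s:
            self.s[word] *= self.r            # older pages decay by r
        for word, value in page_tfidf.items():
            self.s[word] = self.s.get(word, 0.0) + value
        self.t = self.r * self.t + self.tfidf_max

    def weight(self, word):
        return self.s.get(word, 0.0) / self.t if self.t else 0.0


# Five browsed pages, oldest first: 'a' only in the latest page, 'b' in the
# two pages before it, 'c' in the three pages before it.
history = [{}, {"c": 1.0}, {"b": 1.0, "c": 1.0}, {"b": 1.0, "c": 1.0}, {"a": 1.0}]
model = UserModel(r=0.6)
for page in history:
    model.update(page)

for word in "abc":
    print(word, round(model.weight(word), 4))
```

The update touches only the words already in the model plus those of the new page, which is one way to meet Condition-2.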


Page source:
  Title: Musical instrument
  H1: instruments
  H2: Strings
    sentence-A
  H2: Percussion
    sentence-B
    (HR)
    sentence-C

Resulting Page-Blocks:
  Page-Block 1: Musical instrument, instruments, Strings, sentence-A
  Page-Block 2: Musical instrument, instruments, Percussion, sentence-B
  Page-Block 3: Musical instrument, instruments, Percussion, sentence-C

Fig. 2. An example of a page hierarchical structure.

3.2.3. A weight of a page

We choose the following equation for calculating the weight of a page.

\[
Hw(k) = r^{n-k} \tag{2}
\]

Here, r (0 ≤ r ≤ 1) is the rate of decline. Eq. (1) is obtained by adding TFIDF(i, k) and TFIDF_max to the old CSF model. We previously proved that the old model satisfies Condition-1 and Condition-2 [2]. Iw(i, n) in Eq. (1) also satisfies these conditions because TFIDF(i, k) and TFIDF_max are constant when the model is updated. We omit the proof in this paper.

3.2.4. Determining a rate of decline

It is extremely difficult to determine an exact value of r. Here, we tentatively choose 0.6, which satisfies Eq. (3) [2]. Finding an adequate value of r is left for future work.

\[
Iw(\text{‘b’}, n) < Iw(\text{‘a’}, n) < Iw(\text{‘c’}, n) \tag{3}
\]

Here, word ‘a’ is included only in the latest page and in no other page of the history. Word ‘b’ is not included in the latest page but is included in the two most recent pages before it. Word ‘c’ is not included in the latest page but is included in the three most recent pages before it. For example, with r = 0.6 and equal TF-IDF values, the numerators of Eq. (1) are proportional to 0.6 + 0.36 = 0.96 for ‘b’, 1 for ‘a’, and 0.6 + 0.36 + 0.216 = 1.176 for ‘c’, so Eq. (3) holds.

3.3. Consideration of reading area

According to Assumption-1 in Section 1, a model should be made only from the area that the user actually reads. The filtering system can acquire the reading area in two ways: the user directly specifies the area in a page, or the system automatically estimates the area, that is, it regards the area displayed in the browser window as the reading area. The former is expected to give a more precise area, but it burdens the user with an extra load. The latter is more difficult to implement. We adopted the former. Estimating the reading area is left for future work because the purpose of the system in this paper is to evaluate the proposed method.

3.4. Filtering

To decide the ordering of Page-Blocks in the result list, each Page-Block must be evaluated by the user’s model. The evaluation is defined as follows. The feature vector Uv of the user’s model, the feature vector Dv of Page-Block d, and the inner product R (normalized by the vector lengths) are defined below.

\[
U_v = (Iw(i_1, n), Iw(i_2, n), \ldots, Iw(i_M, n)) \tag{4}
\]

\[
D_v = (w(i_1, d), w(i_2, d), \ldots, w(i_M, d)) \tag{5}
\]

\[
R(U_v, D_v) = \frac{\sum_{k=1}^{M} w(i_k, d) \cdot Iw(i_k, n)}{\sqrt{\sum_{k=1}^{M} w(i_k, d)^2} \cdot \sqrt{\sum_{k=1}^{M} Iw(i_k, n)^2}} \tag{6}
\]

Page-Blocks are sorted in descending order of R. A Page-Block with a large value of R is placed near the top of the list as information that closely matches the user’s interests.
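The ordering step can be sketched as follows, reading Eq. (6) as the cosine of the angle between the user vector and the Page-Block vector. That reading, the sparse dictionary representation, and the names `similarity` and `rank_blocks` are assumptions made for illustration; the weights in the example are invented.

```python
import math


def similarity(user_vec, block_vec):
    """R(Uv, Dv): inner product normalized by both vector lengths."""
    dot = sum(weight * block_vec.get(word, 0.0) for word, weight in user_vec.items())
    norm_u = math.sqrt(sum(v * v for v in user_vec.values()))
    norm_d = math.sqrt(sum(v * v for v in block_vec.values()))
    if norm_u == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_d)


def rank_blocks(user_vec, blocks):
    """Return Page-Block ids sorted in descending order of R."""
    return sorted(blocks, key=lambda b: similarity(user_vec, blocks[b]), reverse=True)


# Hypothetical user model (Iw values) and Page-Block vectors (w values).
user_model = {"drum": 0.8, "string": 0.2}
candidates = {
    "b1": {"string": 1.0},
    "b2": {"drum": 1.0},
    "b3": {"piano": 1.0},
}
print(rank_blocks(user_model, candidates))
```

Words missing from a vector are treated as weight zero, so the vectors do not need to share an explicit index list of length M.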


Fig. 3. Overview of WebFALCON-II. (Diagram components: User, Browser, httpd, Software Robot, Dividing pages, Lexical Analysis, Retrieval, Filtering, Updating Model, Modeling, Page-DB, Model-DB; arrow labels: Search, Browsing, url, result. The legend distinguishes existing functions from new functions of WebFALCON-II.)

4. Filtering system for the web

This section describes the design of WebFALCON-II, which is based on the CSF of Section 3.

4.1. Overview of WebFALCON-II

A dividing function and the model with TF-IDF are added to the system to realize WebFALCON-II. Figure 3 shows an overview of WebFALCON-II. Bold lines in Fig. 3 show the differences from WebFALCON-I. The functional procedures follow.

Database process:
1. A software robot gathers Web pages.
2. Each page is divided into Page-Blocks.
3. Words are extracted from each Page-Block by morphological analysis.
4. A TF-IDF value is calculated for each word.
5. The URL, words and TF-IDF values are registered in the Page-DB.

Modeling process:
1. A user is identified by access.
2. Words are extracted from the input text by morphological analysis.
3. The user’s model is updated in the Model-DB.

Retrieval process:
1. Page-Blocks are searched by the specified keywords.
2. A result list is made; each entry comprises the title and the top text (256 bytes) of a Page-Block.

Filtering process:
1. A user is identified by access.
2. A score is calculated for each entry of the result list by the user’s model in the Model-DB. The result list is sorted by score.

When a user chooses a link in the result list of a keyword search, the filtering system displays the whole page at once and then scrolls the page to the top of the Page-Block. The system needs to give the page position to the user’s browser. For this purpose, the system embeds a name element into the page.

4.2. Interface

The filtering system interface comprises three windows: the browsing window, the keyword searching window, and the modeling window. A user selects the text that he or she has read with the mouse, drags it to the modeling window and clicks the “update” button. To search, the user inputs keywords in the keyword searching window and clicks the “search” button. The result list is then displayed in the browsing window, together with the top text (256 bytes) of each page. When the user clicks a page in the list, the page appears in the same window.

Fig. 4. A picture of WebFALCON-II.

5. Evaluation

To evaluate CSF, we conducted an experiment that compared several filtering methods in WebFALCON-II using real users’ browsing.

5.1. System environment

WebFALCON-II was implemented on a UNIX machine (UltraSPARC-II CPU, 300 MHz, Solaris 2.6 OS, 128 MB memory). The database was mSQL (Hughes Technologies, Australia). Morphological analysis was performed by ChaSen (NAIST, Japan) for the Japanese and English languages. The system randomly gathered 11,720 Web pages written in Japanese because the mother language of all subjects is Japanese. The gathered pages are from university Web sites, product sites, advertisements, newspapers, personal diaries, etc. In the Page-DB, 199,652 words were registered. The dividing program generated 50,633 Page-Blocks.

5.2. Experiment

We asked 30 Japanese university students to browse the Web freely and to look for pages of interest rather than a specific page. Keyword searching was available to them whenever they needed it. All Web pages that they browsed were written in Japanese.

5.2.1. Procedure

A subject used three kinds of ordering of the result list of the keyword search (Order-1, Order-2 and Order-3), which differ in the rate of decline r in Eq. (2).

Order-1: r is 0.6. This is the CSF that we propose. The weight decreases over time.
Order-2: r is zero. This filtering considers only the latest page and ignores the rest of the browsing history. It provides pages similar to the latest page in the history.
Order-3: r is 1.0. This filtering ignores the page order in the browsing history. It provides pages similar to all pages in the history.

In addition, a subject used two page types in the keyword search: normal pages and pages divided by our method in Section 3.1.1. Therefore, we asked each subject to browse six times (three ordering methods × two page types). The browsing procedure of the experiment follows.

1. We asked the subject to start browsing the Web.
2. When the termination conditions were satisfied, we asked the subject to stop browsing. The conditions were as follows: the browsing history length is 10 or more, and the number of candidate pages in the result list of the latest keyword search is from 5 to 60.
3. We gave the subject the result list with the candidate pages in random order and asked the subject to evaluate whether each page in the list was close to his or her interests. The evaluation was ‘Good’ or ‘NG’ for each page on the list.

5.2.2. Analysis

To analyze the effectiveness of the ordering methods, we required a score that indicates how effectively a subject could browse. Here, we define the score of an ordering method. When the number of pages in the result list is M and the rank of a page is x, the page score is (M − x + 1) when the subject chose ‘Good’ and zero when the subject chose ‘NG’. The score of the ordering method is the sum of the page scores. A scoring example is shown in Table 2. In that example, the number of pages in the result list is 5. The subject selected ‘Good’ for pages C and E, and ‘NG’ for pages A, B and D. Then, the result list is sorted by the ordering method. The ordering is page E, A, C, B and D. The ranks of the ‘Good’ pages are first and third. Therefore, the score of the ordering method is 5 + 3 = 8.

Table 2
Scoring of the ordering method

Rank (x)   Score (M-x+1)   Candidate   User’s evaluation   Score of the order
1          5               Page E      Good                5
2          4               Page A      NG                  0
3          3               Page C      Good                3
4          2               Page B      NG                  0
5          1               Page D      NG                  0
Total of score                                             8

A two-sided sign test was run on the number of times that the score of one ordering method exceeded that of the other within a subject. Table 3 shows the experiment results. Tie scores are excluded. Tests were run between Order-1 and Order-2 and between Order-1 and Order-3, for normal pages and divided pages respectively.

Table 3
Analysis of the experiment

               Number of times the score exceeded    Two-sided sign test (P)
Page type      Order-1    Order-2    Order-3
Normal page    13         16         –               0.71
Normal page    16         –          13              0.71
Divided page   21         9          –               0.04
Divided page   22         –          8               0.02

There were no significant differences for normal pages. For divided pages, there were significant differences between Order-1 and Order-2 (p < 0.05) and between Order-1 and Order-3 (p < 0.05). This indicates that the proposed method is effective in improving CSF for Web browsing. We inferred that the proposed dividing method for a Web page reduces the emphasis on items that are unrelated to the user’s interests. On the other hand, it also reduces the emphasis on a relatively small number of items that are related to the user’s interests.

In Order-2 (r = 0), the system provides the subject with pages that are similar to the latest page in the browsing history. In Order-3 (r = 1), the system provides pages that are similar to the pages in the history without considering the page order. The Order-1 score of the proposed method was higher than the Order-2 and Order-3 scores. This suggests that the proposed method, based on decay over time, is effective for Web browsing.
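The scoring rule and the sign test can be checked with a few lines of code. The sketch below uses hypothetical function names (`ordering_score`, `sign_test_two_sided`) and computes an exact two-sided binomial sign test with ties already excluded, which is an assumption about how the reported P values were obtained.

```python
from math import comb


def ordering_score(ranked_pages, good_pages):
    """Sum of (M - x + 1) over the pages at rank x that the subject marked 'Good'."""
    m = len(ranked_pages)
    return sum(
        m - x + 1
        for x, page in enumerate(ranked_pages, start=1)
        if page in good_pages
    )


def sign_test_two_sided(wins_a, wins_b):
    """Exact two-sided sign test P value; ties are assumed to be removed already."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a result at least as one-sided, under a fair coin.
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Example of Table 2: ordering E, A, C, B, D with 'Good' for pages C and E.
print(ordering_score(["E", "A", "C", "B", "D"], {"C", "E"}))

# Counts from Table 3 (Order-1 against Order-2 and Order-3).
for wins in [(13, 16), (16, 13), (21, 9), (22, 8)]:
    print(wins, round(sign_test_two_sided(*wins), 2))
```

Under this assumption the rounded P values agree with those listed in Table 3.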


6. Conclusion

This paper proposed CSF for Web browsing. CSF works effectively when each hypertext node is controlled to have one topic. However, Web pages have diverse content and purposes. To reduce noise in Web pages, we proposed a new CSF for Web pages that consists of page division with HTML elements, weight calculation with TF-IDF, and a filtering method. This paper also gave an overview of a filtering system based on the proposed model. We confirmed the effectiveness of CSF through comparison with several filtering methods.

Future work will address modeling and model applications. To model a user’s interests more accurately, the rate of decline of the model should dynamically follow the shift of the user’s interests. Currently, the model is applicable to filtering keyword search results. We consider that the model could also provide information related to the user’s interests during browsing, for example by providing Page-Blocks as search results from the model alone, without keywords specified by the user.

References

[1] M. Balabanovic and Y. Shoham, Learning information retrieval agents: Experiments with automated Web browsing, AAAI Spring Symposium on Information Gathering, Palo Alto, USA, 1995.
[2] T. Hirashima, K. Hachiya, A. Kashihara and J. Toyoda, Information Filtering Using User’s Context on Browsing in Hypertext, User Modeling and User-Adapted Interaction 7(4) (1997), 239–256.
[3] N. Kushmerick, J. McKee and F. Toolan, Towards Zero-Input Personalization: Referrer-Based Page Prediction, Adaptive Hypermedia and Adaptive Web-Based Systems International Conference, AH 2000, pp. 133–143.
[4] H. Lieberman, Letizia: An agent that assists Web browsing, Proc. of IJCAI95, Montreal, Canada, 1995, pp. 924–929.
[5] D. Raggett, HTML 4.01 Specification, http://www.w3.org/TR/html4/cover.html, 1999.
[6] A. Stefani and C. Strapparava, Exploiting NLP techniques to build user model for Web sites: the use of WordNet in SiteIF Project, Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, 1999.
[7] D.H. Widyantoro, T. Ioerger and J. Yen, Learning User Interest Dynamics with a three descriptor representation, Journal of the American Society for Information Science (JASIS) (2000).
