Mining linguistic browsing patterns in the world wide web - Springer Link


Soft Computing 6 (2002) 329 – 336 Springer-Verlag 2002 DOI 10.1007/s00500-002-0186-6

Mining linguistic browsing patterns in the world wide web T.-P. Hong, K.-Y. Lin, S.-L. Wang

Abstract World-wide-web applications have grown very rapidly and have made a significant impact on computer systems. Among them, web browsing for useful information may be the most common. Because the web is so heavily used, efficient and effective web retrieval has become a very important research topic in this field. Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for a certain purpose. In this paper, we use data mining techniques to discover relevant browsing behavior from log data in web servers, which helps in making rules for the retrieval of web pages. The browsing time of a customer on each web page is used to analyze the retrieval behavior. Since the data collected are numeric, fuzzy concepts are used to process them and to form linguistic terms. A web-mining algorithm is thus proposed to find relevant browsing behavior from the linguistic data. Each page uses only the linguistic term with the maximum cardinality in later mining processes, so the number of fuzzy regions to be processed equals the number of pages. Computational time can thus be greatly reduced. The mined patterns exhibit the browsing behavior and can be used to provide appropriate suggestions to web-server managers.

Keywords Web mining, Fuzzy set, Sequential pattern, Web page, World wide web

T.-P. Hong (✉) Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, 811, Taiwan, ROC e-mail: [email protected]
K.-Y. Lin, S.-L. Wang Department of Information Management, I-Shou University, Kaohsiung, 840, Taiwan, ROC
* This is a modified and expanded version of the paper "Web mining for browsing patterns", presented at The Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, 2001, Osaka, Japan.

1 Introduction
World-wide-web applications have grown very rapidly and have made a significant impact on computer systems. Among them, web browsing for useful information may be the most common. Because the web is so heavily used, efficient and effective web retrieval has become a very important research topic in this field. Web-mining techniques have recently been developed to achieve this purpose. Cooley et al. divided web mining into two classes: web-content mining and web-usage mining. Web-content mining focuses on information discovery from sources across the world wide web. Web-usage mining, on the other hand, emphasizes the automatic discovery of user access patterns from web servers [16].

In the past, several web-mining approaches for finding sequential patterns and information of interest to users from the World Wide Web were proposed [10, 11, 14, 15]. Chen and Sycara proposed the WebMate system to keep track of user interests from the contents of the web pages browsed. It can thus help users easily search data from the WWW [11]. Chen et al. mined path-traversal patterns by first finding the maximal forward references from log data and then obtaining the large reference sequences according to the numbers of occurrences of the maximal forward references [10]. Cohen et al. sampled only portions of the server logs to extract user access patterns, which were then grouped as volumes [14]. Files in a volume could then be fetched together to increase the efficiency of a web server.

Fuzzy set theory is being used more and more frequently in intelligent systems because of its simplicity and similarity to human reasoning [38]. The theory has been applied in fields such as manufacturing, engineering, control, diagnosis, and economics, among others [19, 26, 29, 38]. In this paper, we thus propose a novel web-mining algorithm to find linguistic browsing behaviors from data logs on web servers. The browsing time of a customer on each web page is used to analyze the retrieval behavior of a web site. Since the data collected are numeric, fuzzy concepts are used to process them and to form linguistic terms. The proposed algorithm then finds relevant sequential patterns from the linguistic data.
Each page uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of the pages. The algorithm therefore focuses on the most important linguistic terms for reduced time complexity. The mined rules are expressed in linguistic terms, which are more natural and understandable for human beings.

The remaining parts of this paper are organized as follows. Data mining concepts are reviewed in Sect. 2. Related works are introduced in Sect. 3. Notation used in this paper is defined in Sect. 4. An algorithm for mining relevant fuzzy browsing behavior from log data on servers is proposed in Sect. 5. An example to illustrate the proposed algorithm is given in Sect. 6. Discussion and conclusion are stated in Sect. 7.

2 Review of data mining concepts
The rapid development of computer technology, especially increased capacities and decreased costs of storage media, has led businesses to store huge amounts of external and internal information in large databases at low cost. Mining useful information and helpful knowledge from these large databases has thus evolved into an important research area. Years of effort in data mining have produced a variety of efficient techniques. Depending on the types of databases to be processed, mining approaches may be classified as working on transactional databases, temporal databases, relational databases, and multimedia databases, among others [9]. Depending on the classes of knowledge sought, mining approaches may be classified as finding association rules, classification rules, clustering rules, and sequential patterns, among others. Among them, finding useful association rules and sequential patterns is especially important to real applications.

An association rule is an expression X → Y, where X is a set of items and Y is a single item. It means that, in the set of transactions, if all the items in X exist in a transaction, then Y is also in the transaction with a high probability. For example, assume that whenever customers in a supermarket buy bread and drink, they also buy fruit. From the transactions kept in the supermarket, the association rule bread and drink → fruit will be mined out. Agrawal and his co-workers [1–4] proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data. They divided the mining process into two phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the number of transactions in which an itemset appeared was larger than a pre-defined threshold value (called minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first.
Large itemsets containing only single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called minimum confidence) were output as association rules.

A sequential pattern is an expression X1 → X2 → ... → Xn, where Xi is a set of items. It means that, in the given set of transactions, if a customer buys all the items in X1 at some time, then he will buy all the items in X2 at some later time with a high probability. Similarly, the customer will sequentially buy all the items in X3 to Xn with a high probability. It is thus concerned with inter-transaction patterns, which are ordered itemsets. Agrawal and Srikant [4] proposed a mining algorithm to discover sequential patterns from a set of transactions. Five phases are included in their approach. In the first phase, the transactions are sorted first by customer ID as the major key and then by transaction time as the minor key. This phase thus converts the original transactions into customer sequences. In the second phase, the set of all large itemsets is found from the customer sequences by comparing their counts with a predefined support parameter α. This phase is similar to the process of mining association rules. Note that when an itemset occurs more than once in a customer sequence, it is counted only once for this customer sequence. In the third phase, each large itemset is mapped to a contiguous integer and the original customer sequences are transformed into the mapped integer sequences. In the fourth phase, the set of transformed integer sequences is used to find large sequences among them. In the fifth phase, the maximally large sequences are derived and output to users.
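The two-phase procedure for association rules described above can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the implementation of [1–4]: the function names `large_itemsets` and `rules`, the toy supermarket transactions and both thresholds are invented for this example, and thresholds are treated as "at least" rather than "larger than".

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Phase 1: level-wise search for itemsets whose count reaches min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    large = {}
    size = 1
    while candidates:
        # Count each candidate by scanning the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        large.update(current)
        # Combine large itemsets into candidates one item larger, keeping only
        # those whose every subset of the current size is itself large.
        size += 1
        joined = {a | b for a in current for b in current if len(a | b) == size}
        candidates = [c for c in joined
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))]
    return large

def rules(large, min_confidence):
    """Phase 2: rules X -> Y with a single-item consequent Y."""
    out = []
    for itemset, count in large.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            confidence = count / large[x]  # x is large because itemset is
            if confidence >= min_confidence:
                out.append((tuple(sorted(x)), y, confidence))
    return sorted(out)

# Hypothetical toy data, not taken from the paper.
transactions = [
    {"bread", "drink", "fruit"},
    {"bread", "drink", "fruit"},
    {"bread", "drink"},
    {"bread", "fruit"},
]
found = large_itemsets(transactions, min_support=2)
mined = rules(found, min_confidence=0.6)
```

On this toy data the rule bread and drink → fruit is among the rules in `mined`, with a confidence of 2/3.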

3 Related works
It is useful to extract knowledge from real-world data and to represent it in a form suited to practical use. By using fuzzy sets, linguistic representation makes it easy to express knowledge as efficient fuzzy rules and easy to explain the knowledge to human beings. Several fuzzy learning algorithms for inducing rules from a given set of data have been designed and used for specific domains [5, 6, 8, 17, 18, 20–22, 24, 25, 33, 35]. Strategies based on decision trees were proposed in [12, 13, 31–33, 36, 37]. Wang et al. proposed a fuzzy version space learning strategy for managing vague information [34]. Hong et al. also proposed a fuzzy data mining approach, which integrated fuzzy-set concepts and a conventional data mining algorithm to find association rules from quantitative data [23]. The approach consisted of three main steps.

Step 1: Transform each quantitative value in the transaction data into a fuzzy set using the given membership functions.
Step 2: Generate large itemsets by calculating the fuzzy cardinality of each candidate itemset.
Step 3: Induce fuzzy association rules from the large itemsets found in Step 2.
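As a hedged illustration of Step 1 above, the sketch below maps a numeric value to membership degrees in three fuzzy regions. The piecewise-linear shapes, the region names and the breakpoints 20, 70, 80 and 130 are hypothetical values chosen for this example only; they are not taken from [23] or from any figure in this paper.

```python
def fuzzify(value, a=20.0, b=70.0, c=80.0, d=130.0):
    """Return membership degrees of `value` in Short, Middle and Long.

    Breakpoints a < b < c < d are hypothetical: Short is 1 up to a and falls
    to 0 at b, Middle is 1 between b and c, Long rises from 0 at c to 1 at d.
    """
    def ramp_up(x, lo, hi):
        # 0 below lo, 1 above hi, linear in between
        return min(1.0, max(0.0, (x - lo) / (hi - lo)))

    rising = ramp_up(value, a, b)    # how far we have left the Short region
    falling = ramp_up(value, c, d)   # how far we have entered the Long region
    return {
        "Short": 1.0 - rising,
        "Middle": min(rising, 1.0 - falling),
        "Long": falling,
    }

memberships = fuzzify(30)
```

With these assumed breakpoints, a value of 30 belongs mostly to Short and partly to Middle, so a single number becomes a small set of (region, degree) pairs that later steps can count.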

In addition, Chan and Au [7] used fuzzy set theory and data mining technology to solve a classification problem. Maddouri et al. [28] proposed a fuzzy incremental production rule induction method that could generate imprecise and uncertain IF-THEN rules from data records. Several fuzzy clustering methods [30], such as Fuzzy C-Means (FCM), have already been exploited in the context of fuzzy modeling. A new linguistic model, based on a characterization of both the structure of linguistic concepts and the uncertainty distribution in knowledge acquisition, was also presented in [27]. It used the concept of compatibility clouds, which integrated randomness and fuzziness, to capture qualitative knowledge in a quantitative way.

Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. It can be thought of as a specific application of sequential mining techniques. In this paper, fuzzy set concepts will be used in our proposed algorithm to mine clients' browsing behavior on a web site.

Table 1. A part of the log data used in the example

Date        Time      Client-ip        Server-ip       Server-port  File-name           ...
2001-03-01  05:39:56  140.127.194.127  140.127.194.88  21           Inside.htm          ...
2001-03-01  05:40:08  140.127.194.127  140.127.194.88  21           home-bg1.jpg        ...
2001-03-01  05:40:10  140.127.194.127  140.127.194.88  21           line1.gif           ...
:           :         :                :               :            :                   ...
2001-03-01  05:40:26  140.127.194.127  140.127.194.88  21           person.asp          ...
:           :         :                :               :            :                   ...
2001-03-01  05:40:52  140.127.194.82   140.127.194.88  21           cheap.htm           ...
2001-03-01  05:40:53  140.127.194.82   140.127.194.88  21           line1.gif           ...
:           :         :                :               :            :                   ...
:           :         :                :               :            :                   ...
2001-03-01  05:41:08  140.127.194.128  140.127.194.88  21           cheap.htm           ...
:           :         :                :               :            :                   ...
2001-03-01  05:48:38  140.127.194.44   140.127.194.88  21           closing connection  ...
:           :         :                :               :            :                   ...
2001-03-01  05:48:53  140.127.194.22   140.127.194.88  21           cheap.htm           ...
:           :         :                :               :            :                   ...
2001-03-01  05:50:13  140.127.194.20   140.127.194.88  21           search.asp          ...
:           :         :                :               :            :                   ...
2001-03-01  05:53:33  140.127.194.20   140.127.194.88  21           closing connection  ...

4 Notation
$n$: the total number of log data;
$m$: the total number of files in the log data;
$c$: the total number of clients in the log data;
$n_i$: the number of log data from the $i$-th client, $1 \le i \le c$;
$D_i$: the browsing sequence of the $i$-th client, $1 \le i \le c$;
$D_{id}$: the $d$-th log transaction in $D_i$, $1 \le d \le n_i$;
$I^g$: the $g$-th file, $1 \le g \le m$;
$R^{gk}$: the $k$-th fuzzy region of $I^g$, $1 \le k \le |I^g|$, where $|I^g|$ is the number of fuzzy regions for $I^g$;
$v_{id}^{g}$: the browsing duration of file $I^g$ in $D_{id}$;
$f_{id}^{g}$: the fuzzy set converted from $v_{id}^{g}$;
$f_{id}^{gk}$: the membership value of $v_{id}^{g}$ in region $R^{gk}$;
$f_i^{gk}$: the membership value of region $R^{gk}$ in the $i$-th client sequence $D_i$;
$count^{gk}$: the scalar cardinality of region $R^{gk}$;
$max\text{-}count^{g}$: the maximum count value among the $count^{gk}$ values;
$max\text{-}R^{g}$: the fuzzy region of file $I^g$ with $max\text{-}count^{g}$;
$\alpha$: the predefined minimum support value;
$\lambda$: the predefined minimum confidence value;
$C_r$: the set of candidate sequences with $r$ files;
$L_r$: the set of large sequences with $r$ files.

Fig. 1. The membership functions used in this example

5 Fuzzy web mining for browsing patterns
Log data in a web site are used to analyze the browsing patterns on that site. Many fields exist in a log schema. Among them, the fields date, time, client-ip and file name are used in the mining process. Only the log data with .asp, .htm, .html, .jva and .cgi are considered home pages and used to analyze the browsing behavior. The other files, such as .jpg and .gif, are thought of as inclusions in the home pages and are omitted. The number of files to be analyzed is thus reduced. The log data to be analyzed are sorted first in the order of client-ip and then in the order of date and time. The duration of each web page browsed by a client can then be calculated from the time interval between the page and its next page. Since the time durations are numeric, fuzzy concepts are used here to process them and to form linguistic terms. In the log data, each transaction contains only one web page. The mining process can thus be simplified when compared to that for multiple-item transactions in Agrawal and Srikant's mining approach [4].

The proposed web-mining algorithm then calculates the scalar cardinality of each linguistic term on all the log data, and adopts an iterative search approach to find large itemsets. Each item (web page) uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of original web pages. The algorithm therefore focuses on the most important linguistic terms, which reduces its time complexity. The mining process based on fuzzy counts is then performed to find fuzzy browsing patterns from these large itemsets. The details of the proposed web-mining algorithm are described as follows.

Table 2. The resulting log data for web mining

Date        Time      Client-ip        File-name
2001-03-01  05:39:56  140.127.194.128  inside.htm
2001-03-01  05:40:26  140.127.194.128  person.asp
2001-03-01  05:40:52  140.127.194.82   cheap.htm
2001-03-01  05:41:08  140.127.194.128  cheap.htm
2001-03-01  05:41:30  140.127.194.22   homepage.htm
2001-03-01  05:41:54  140.127.194.82   inside.htm
2001-03-01  05:42:25  140.127.194.82   cheap.htm
2001-03-01  05:42:46  140.127.194.128  search.asp
2001-03-01  05:43:02  140.127.194.22   cheap.htm
2001-03-01  05:43:46  140.127.194.44   inside.htm
2001-03-01  05:44:06  140.127.194.44   search.asp
2001-03-01  05:44:07  140.127.194.82   closing connection
2001-03-01  05:44:17  140.127.194.128  closing connection
2001-03-01  05:44:31  140.127.194.22   closing connection
2001-03-01  05:45:47  140.127.194.44   person.asp
2001-03-01  05:46:46  140.127.194.38   cheap.htm
2001-03-01  05:47:45  140.127.194.44   inside.htm
2001-03-01  05:47:50  140.127.194.38   inside.htm
2001-03-01  05:47:56  140.127.194.44   search.asp
2001-03-01  05:48:19  140.127.194.38   search.asp
2001-03-01  05:48:38  140.127.194.44   closing connection
2001-03-01  05:48:53  140.127.194.20   cheap.htm
2001-03-01  05:49:33  140.127.194.38   closing connection
2001-03-01  05:50:13  140.127.194.20   search.asp
2001-03-01  05:51:14  140.127.194.20   person.asp
2001-03-01  05:53:16  140.127.194.20   inside.htm
2001-03-01  05:53:33  140.127.194.20   closing connection

The web mining algorithm

INPUT: A server log, a set of membership functions, and a predefined minimum support value $\alpha$.
OUTPUT: A set of linguistic browsing patterns.

STEP 1: Select the transactions with file names including .asp, .htm, .html, .jva, .cgi and closing connection from the log data; keep only the fields date, time, client-ip and file-name. Denote the resulting log data as $D$.

STEP 2: Transform the client-ips into contiguous integers (called encoded client IDs) for convenience, according to their first browsing time. Note that the same client-ip with two closing connections is given two integers.

STEP 3: Sort the resulting log data first by encoded client ID and then by date and time.

STEP 4: Calculate the time durations of the web pages browsed by each encoded client ID from the time interval between a web page and its next page.

STEP 5: Form a browsing sequence $D_j$ for each client $c_j$ by sequentially listing his/her $n_j$ tuples (web page, duration), where $n_j$ is the number of web pages browsed by client $c_j$. Denote the $d$-th tuple in $D_j$ as $D_{jd}$.

STEP 6: Transform the time duration $v_{id}^{g}$ of the file name $I^g$ appearing in $D_{id}$ into a fuzzy set $f_{id}^{g}$ represented as

$$f_{id}^{g} = \left( f_{id}^{g1}/R^{g1} + f_{id}^{g2}/R^{g2} + \cdots + f_{id}^{gl}/R^{gl} \right)$$

using the given membership functions, where $I^g$ is the $g$-th file name, $R^{gk}$ is the $k$-th fuzzy region of item $I^g$, $f_{id}^{gk}$ is the fuzzy membership value of $v_{id}^{g}$ in region $R^{gk}$, and $l$ is the number of fuzzy regions for $I^g$.

STEP 7: Find the membership value $f_i^{gk}$ of each region $R^{gk}$ in each browsing sequence $D_i$ as

$$f_i^{gk} = \max_{d=1}^{|D_i|} f_{id}^{gk},$$

where $|D_i|$ is the number of tuples in $D_i$.

STEP 8: Calculate the scalar cardinality of each region $R^{gk}$ as

$$count^{gk} = \sum_{i=1}^{c} f_i^{gk},$$

where $c$ is the number of browsing sequences.

STEP 9: Find $max\text{-}count^{g} = \max_{k=1}^{l} count^{gk}$, where $1 \le g \le m$, $m$ is the number of files in the log data, and $l$ is the number of regions for file $I^g$. Let $max\text{-}R^{g}$ be the region with $max\text{-}count^{g}$ for file $I^g$. $max\text{-}R^{g}$ will be used to represent the fuzzy characteristic of file $I^g$ in later mining processes.

STEP 10: Check whether the value $max\text{-}count^{g}$ of each region $max\text{-}R^{g}$, $g = 1$ to $m$, is larger than or equal to the predefined minimum support value $\alpha$. If it is, put $max\text{-}R^{g}$ in the set of large 1-sequences ($L_1$). That is,

$$L_1 = \{ max\text{-}R^{g} \mid max\text{-}count^{g} \ge \alpha,\ 1 \le g \le m \}.$$

STEP 11: If $L_1$ is null, then exit the algorithm; otherwise, do the next step.

STEP 12: Set $r = 1$, where $r$ is used to represent the length of sequential patterns currently kept.

STEP 13: Generate the candidate set $C_{r+1}$ from $L_r$ in a way similar to that in the AprioriAll algorithm [4]. Restated, the algorithm first joins $L_r$ with $L_r$, under the condition that $r-1$ items in the two sequences are the same and in the same order. Different permutations represent different candidates. The algorithm then keeps in $C_{r+1}$ the sequences all of whose sub-sequences of length $r$ exist in $L_r$.

STEP 14: Do the following substeps for each newly formed $(r+1)$-sequence $s$ with contents $(s_1, s_2, \ldots, s_{r+1})$ in $C_{r+1}$:

(a) Calculate the fuzzy value $f_i^{s}$ of $s$ in each browsing sequence $D_i$ as

$$f_i^{s} = \min_{k=1}^{r+1} f_i^{s_k},$$

where region $s_k$ must appear after region $s_{k-1}$ in $D_i$. If two or more identical subsequences exist in $D_i$, then $f_i^{s}$ is the maximum fuzzy value among those of these subsequences.

(b) Calculate the scalar cardinality of $s$ as

$$count^{s} = \sum_{i=1}^{c} f_i^{s},$$

where $c$ is the number of browsing sequences.

(c) If $count^{s}$ is larger than or equal to the predefined minimum support value $\alpha$, put $s$ in $L_{r+1}$.

STEP 15: If $L_{r+1}$ is null, then do the next step; otherwise, set $r = r + 1$ and repeat STEPs 13–15.

STEP 16: Output the maximally large $q$-sequences, $q \ge 2$, to web-site managers as browsing patterns.

After STEP 16, the sequential browsing patterns output can serve as meta-knowledge concerning the given transactions.

Table 3. Transforming the values of field client-ip into contiguous integers

Date        Time      Client ID  File-name
2001-03-01  05:39:56  1          inside.htm
2001-03-01  05:40:26  1          person.asp
2001-03-01  05:40:52  2          cheap.htm
2001-03-01  05:41:08  1          cheap.htm
2001-03-01  05:41:30  3          homepage.htm
2001-03-01  05:41:54  2          inside.htm
2001-03-01  05:42:25  2          cheap.htm
2001-03-01  05:42:46  1          search.asp
2001-03-01  05:43:02  3          cheap.htm
2001-03-01  05:43:46  4          inside.htm
2001-03-01  05:44:06  4          search.asp
2001-03-01  05:44:07  2          closing connection
2001-03-01  05:44:17  1          closing connection
2001-03-01  05:44:31  3          closing connection
2001-03-01  05:45:47  4          person.asp
2001-03-01  05:46:46  5          cheap.htm
2001-03-01  05:47:45  4          inside.htm
2001-03-01  05:47:50  5          inside.htm
2001-03-01  05:47:56  4          search.asp
2001-03-01  05:48:19  5          search.asp
2001-03-01  05:48:38  4          closing connection
2001-03-01  05:48:53  6          cheap.htm
2001-03-01  05:49:33  5          closing connection
2001-03-01  05:50:13  6          search.asp
2001-03-01  05:51:14  6          person.asp
2001-03-01  05:53:16  6          inside.htm
2001-03-01  05:53:33  6          closing connection

Table 4. Web pages sorted first by client ID and then by date and time

Date        Time      Client ID  File-name
2001-03-01  05:39:56  1          inside.htm
2001-03-01  05:40:26  1          person.asp
2001-03-01  05:41:08  1          cheap.htm
2001-03-01  05:42:46  1          search.asp
2001-03-01  05:44:17  1          closing connection
2001-03-01  05:40:52  2          cheap.htm
2001-03-01  05:41:54  2          inside.htm
2001-03-01  05:42:25  2          cheap.htm
2001-03-01  05:44:07  2          closing connection
2001-03-01  05:41:30  3          homepage.htm
2001-03-01  05:43:02  3          cheap.htm
2001-03-01  05:44:31  3          closing connection
2001-03-01  05:43:46  4          inside.htm
2001-03-01  05:44:06  4          search.asp
2001-03-01  05:45:47  4          person.asp
2001-03-01  05:47:45  4          inside.htm
2001-03-01  05:47:56  4          search.asp
2001-03-01  05:48:38  4          closing connection
2001-03-01  05:46:46  5          cheap.htm
2001-03-01  05:47:50  5          inside.htm
2001-03-01  05:48:19  5          search.asp
2001-03-01  05:49:33  5          closing connection
2001-03-01  05:48:53  6          cheap.htm
2001-03-01  05:50:13  6          search.asp
2001-03-01  05:51:14  6          person.asp
2001-03-01  05:53:16  6          inside.htm
2001-03-01  05:53:33  6          closing connection

Table 5. The web pages browsed with their durations

Client ID  (Web page, Duration)
1          (B, 30)
1          (E, 42)
1          (D, 98)
1          (C, 91)
2          (D, 62)
2          (B, 31)
2          (D, 102)
3          (A, 92)
3          (D, 89)
4          (B, 20)
4          (C, 101)
4          (E, 118)
4          (B, 11)
4          (C, 42)
5          (D, 64)
5          (B, 29)
5          (C, 74)
6          (D, 80)
6          (C, 61)
6          (E, 122)
6          (B, 17)

Table 6. The browsing sequences formed from Table 5

Client ID  Browsing sequence
1          (B, 30) (E, 42) (D, 98) (C, 91)
2          (D, 62) (B, 31) (D, 102)
3          (A, 92) (D, 89)
4          (B, 20) (C, 101) (E, 118) (B, 11) (C, 42)
5          (D, 64) (B, 29) (C, 74)
6          (D, 80) (C, 61) (E, 122) (B, 17)
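STEPs 11 to 16 form a level-wise loop that can be sketched independently of how the fuzzy counts are obtained. In the sketch below, sequences are Python tuples of region labels, `count_of` is a caller-supplied function standing in for STEP 14(a) and (b), and joining two sequences that share their first r-1 regions is one reading of the AprioriAll-style join in STEP 13. The function names are assumptions of this illustration; the counts in the usage lines are those reported in Table 10 and STEP 15 of the example in Sect. 6.

```python
from itertools import combinations

def generate_candidates(large_r):
    """STEP 13: join L_r with itself, then prune by sub-sequences of length r."""
    large_r = set(large_r)
    if not large_r:
        return []
    r = len(next(iter(large_r)))
    # Join: two large r-sequences with the same first r-1 regions.
    joined = {s + (t[-1],) for s in large_r for t in large_r if s[:-1] == t[:-1]}
    # Prune: every order-preserving sub-sequence of length r must be large.
    return sorted(c for c in joined
                  if all(sub in large_r for sub in combinations(c, r)))

def mine(large_1, count_of, min_support):
    """STEPs 11-16: grow large sequences level by level, return maximal ones."""
    levels = []
    current = sorted(large_1)
    while current:                       # STEP 11 / STEP 15 termination test
        levels.append(current)
        candidates = generate_candidates(current)
        current = [s for s in candidates if count_of(s) >= min_support]
    all_large = [s for level in levels for s in level]

    def contained(s, t):
        # True when s is a strictly shorter sub-sequence of t.
        return len(s) < len(t) and s in set(combinations(t, len(s)))

    # STEP 16: maximal large q-sequences with q >= 2.
    return [s for s in all_large
            if len(s) >= 2 and not any(contained(s, t) for t in all_large)]

# Usage with the counts reported in the worked example.
reported = {
    ("B.Short", "C.Middle"): 2.2,
    ("D.Middle", "B.Short"): 2.6,
    ("D.Middle", "C.Middle"): 2.4,
    ("D.Middle", "B.Short", "C.Middle"): 0.8,
}
patterns = mine(
    [("B.Short",), ("C.Middle",), ("D.Middle",)],
    count_of=lambda s: reported.get(s, 0.0),
    min_support=2,
)
```

Keeping the counting function separate makes the loop easy to check by hand: with the reported counts, the only candidate 3-sequence that survives pruning is (D.Middle, B.Short, C.Middle), and it fails the support test.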

6 An example
In this section, a simple example is given to show how the proposed algorithm can be used to generate sequential patterns for clients' browsing behavior according to the log data in a web server. A part of the log data is shown in Table 1. Each transaction in the log data includes the fields date, time, client-ip, server-ip, server-port and file-name, among others. Only one file name is contained in each transaction. For example, the user at client-ip 140.127.194.127 browsed the file inside.htm at 05:39:56 on March 1st, 2001.

Assume the fuzzy membership functions for a browsing duration on a web page are as shown in Fig. 1. In Fig. 1, the browsing duration is divided into three fuzzy regions: Short, Middle and Long. Thus, three fuzzy membership values are produced for each duration according to the predefined membership functions. For the log data shown in Table 1, the proposed web mining algorithm proceeds as follows.

STEP 1: The transactions with file names .asp, .htm, .html, .jva, .cgi and closing connection are selected for mining. Only the four fields date, time, client-ip and file-name are kept. Assume the resulting log data from Table 1 are shown in Table 2.

STEP 2: The values of field client-ip are transformed into contiguous integers according to each client's first browsing time. Results for Table 2 are shown in Table 3. Six clients logged in to the web server, and five web pages (homepage.htm, inside.htm, search.asp, cheap.htm and person.asp) were browsed in this example.
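STEPs 1 and 2 can be sketched as follows. The record layout `(time, client_ip, file_name)`, the helper name `select_and_encode` and the assumption that rows arrive in log (time) order are choices made for this illustration. Most rows are adapted from Tables 1 and 2; the last row is a hypothetical revisit added only to show that a client-ip receives a new encoded ID after a closing connection.

```python
KEPT_SUFFIXES = (".asp", ".htm", ".html", ".jva", ".cgi")
CLOSING = "closing connection"

def select_and_encode(rows):
    """rows: (time, client_ip, file_name) tuples in log order."""
    # STEP 1: keep page requests and closing connections, drop images etc.
    kept = [r for r in rows
            if r[2].lower().endswith(KEPT_SUFFIXES) or r[2] == CLOSING]
    # STEP 2: number clients by first appearance; a closed session ends the ID.
    next_id = 1
    active = {}                      # client_ip -> encoded ID of open session
    encoded = []
    for time, ip, name in kept:
        if ip not in active:
            active[ip] = next_id
            next_id += 1
        encoded.append((time, active[ip], name))
        if name == CLOSING:
            del active[ip]           # a later visit from this ip gets a new ID
    return encoded

rows = [
    ("05:39:56", "140.127.194.128", "inside.htm"),
    ("05:40:08", "140.127.194.128", "home-bg1.jpg"),
    ("05:40:26", "140.127.194.128", "person.asp"),
    ("05:40:52", "140.127.194.82", "cheap.htm"),
    ("05:40:53", "140.127.194.82", "line1.gif"),
    ("05:41:08", "140.127.194.128", "cheap.htm"),
    ("05:41:30", "140.127.194.22", "homepage.htm"),
    ("05:44:07", "140.127.194.82", "closing connection"),
    ("05:59:00", "140.127.194.82", "inside.htm"),  # hypothetical revisit
]
encoded = select_and_encode(rows)
```

The two image requests are dropped, the first three client-ips receive the IDs 1, 2 and 3 as in Table 3, and the hypothetical revisit is numbered 4.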

Table 7. The fuzzy sets transformed from the browsing sequences

Client ID  Fuzzy sets
1          (0.8/B.Short + 0.2/B.Middle) (0.6/E.Short + 0.4/E.Middle) (0.6/D.Middle + 0.4/D.Long) (0.8/C.Middle + 0.2/C.Long)
2          (0.2/D.Short + 0.8/D.Middle) (0.8/B.Short + 0.2/B.Middle) (0.6/D.Middle + 0.4/D.Long)
3          (0.8/A.Middle + 0.2/A.Long) (0.8/D.Middle + 0.2/D.Long)
4          (1.0/B.Short) (0.6/C.Middle + 0.4/C.Long) (0.2/E.Middle + 0.8/E.Long) (1.0/B.Short) (0.6/C.Short + 0.4/C.Middle)
5          (1.0/D.Middle) (0.8/B.Short + 0.2/B.Middle) (1.0/C.Middle)
6          (1.0/D.Middle) (0.2/C.Short + 0.8/C.Middle) (0.2/E.Middle + 0.8/E.Long) (1.0/B.Short)

STEP 3: The resulting log data in Table 3 are then sorted first by encoded client ID and then by date and time. Results are shown in Table 4.

STEP 4: The time durations of the web pages browsed by each encoded client ID are calculated. Take the first web page browsed by client 1 as an example. Client 1 retrieves the file inside.htm at 05:39:56 on March 1st, 2001 and the next file person.asp at 05:40:26 on March 1st, 2001. The duration of inside.htm for client 1 is then 30 seconds (05:40:26 − 05:39:56). Simple symbols are used here to represent web pages for convenience. Let A, B, C, D and E respectively represent homepage.htm, inside.htm, search.asp, cheap.htm and person.asp. The durations of all pages browsed by each client ID are shown in Table 5.

STEP 5: The web pages browsed by each client are listed as a browsing sequence. Each tuple is represented as (web page, duration). The resulting browsing sequences from Table 5 are shown in Table 6.

STEP 6: The time durations of the file names in each browsing sequence are represented as fuzzy sets. Take the web page B in the first browsing sequence as an example. The time duration "30" of file B is converted into the fuzzy set (0.8/Short + 0.2/Middle + 0.0/Long) by the given membership functions (Fig. 1). This step is repeated for the other files and browsing sequences. The results are shown in Table 7.

STEP 7: The membership value of each region in each browsing sequence is found. Take the region D.Middle for client 2 as an example. Its membership value is max(0.8, 0.0, 0.6) = 0.8. The membership values of the other regions can be similarly calculated.

STEP 8: The scalar cardinality of each fuzzy region in all the browsing sequences is calculated as the count value. Take the fuzzy region D.Middle as an example. Its scalar cardinality = (0.6 + 0.8 + 0.8 + 0.0 + 1.0 + 1.0) = 4.2. This step is repeated for the other regions, and the results are shown in Table 8.

STEP 9: The fuzzy region with the highest count among the three possible regions for each file is selected. Take file A as an example. Its count is 0.0 for Short, 0.8 for Middle, and 0.2 for Long. Since the count for Middle is the highest among the three counts, the region Middle is used to represent file A in later mining processes. This step is repeated for the other files. Thus, "Short" is chosen for B, "Middle" is chosen for A, C and D, and "Long" is chosen for E.
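The duration calculation of STEPs 3 to 5 can be checked with a short sketch. The helper name `browsing_sequences` and the tuple layout are assumptions of this illustration, and the code assumes that all rows fall on the same date so that times can be compared directly. The input rows are those of clients 1 and 2 in Table 4, given in log order.

```python
from datetime import datetime

SYMBOL = {"homepage.htm": "A", "inside.htm": "B", "search.asp": "C",
          "cheap.htm": "D", "person.asp": "E"}
CLOSING = "closing connection"

def seconds_between(start, end):
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

def browsing_sequences(rows):
    """rows: (time, client_id, file_name); returns {client_id: [(page, seconds)]}."""
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))          # STEP 3
    sequences = {}
    for (t0, cid0, name0), (t1, cid1, _) in zip(ordered, ordered[1:]):
        if cid0 != cid1 or name0 == CLOSING:
            continue                                            # no next page
        duration = seconds_between(t0, t1)                      # STEP 4
        sequences.setdefault(cid0, []).append((SYMBOL[name0], duration))  # STEP 5
    return sequences

rows = [
    ("05:39:56", 1, "inside.htm"),
    ("05:40:26", 1, "person.asp"),
    ("05:40:52", 2, "cheap.htm"),
    ("05:41:08", 1, "cheap.htm"),
    ("05:41:54", 2, "inside.htm"),
    ("05:42:25", 2, "cheap.htm"),
    ("05:42:46", 1, "search.asp"),
    ("05:44:07", 2, "closing connection"),
    ("05:44:17", 1, "closing connection"),
]
sequences = browsing_sequences(rows)
```

For these rows the result agrees with the first two rows of Table 6: client 1 yields (B, 30) (E, 42) (D, 98) (C, 91) and client 2 yields (D, 62) (B, 31) (D, 102).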

Table 8. The counts of the fuzzy regions

Region    Count   Region    Count   Region    Count
A.Short   0.0     C.Short   0.8     E.Short   0.6
A.Middle  0.8     C.Middle  3.2     E.Middle  0.8
A.Long    0.2     C.Long    0.6     E.Long    1.6
B.Short   4.4     D.Short   0.2
B.Middle  0.6     D.Middle  4.2
B.Long    0.0     D.Long    1.0
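The counts in Table 8 can be reproduced from the fuzzy sets of Table 7 with the sketch below, which follows STEPs 7 to 10. The dictionary layout and the function names `region_counts` and `large_1` are assumptions of this illustration; regions with zero membership are simply absent from the result.

```python
from collections import defaultdict

# Table 7: one list per client, each tuple given as {region: membership}.
TABLE_7 = {
    1: [{"B.Short": 0.8, "B.Middle": 0.2}, {"E.Short": 0.6, "E.Middle": 0.4},
        {"D.Middle": 0.6, "D.Long": 0.4}, {"C.Middle": 0.8, "C.Long": 0.2}],
    2: [{"D.Short": 0.2, "D.Middle": 0.8}, {"B.Short": 0.8, "B.Middle": 0.2},
        {"D.Middle": 0.6, "D.Long": 0.4}],
    3: [{"A.Middle": 0.8, "A.Long": 0.2}, {"D.Middle": 0.8, "D.Long": 0.2}],
    4: [{"B.Short": 1.0}, {"C.Middle": 0.6, "C.Long": 0.4},
        {"E.Middle": 0.2, "E.Long": 0.8}, {"B.Short": 1.0},
        {"C.Short": 0.6, "C.Middle": 0.4}],
    5: [{"D.Middle": 1.0}, {"B.Short": 0.8, "B.Middle": 0.2}, {"C.Middle": 1.0}],
    6: [{"D.Middle": 1.0}, {"C.Short": 0.2, "C.Middle": 0.8},
        {"E.Middle": 0.2, "E.Long": 0.8}, {"B.Short": 1.0}],
}

def region_counts(fuzzy_sequences):
    counts = defaultdict(float)
    for sequence in fuzzy_sequences.values():
        best = defaultdict(float)            # STEP 7: max per region and client
        for fuzzy_set in sequence:
            for region, value in fuzzy_set.items():
                best[region] = max(best[region], value)
        for region, value in best.items():   # STEP 8: sum over clients
            counts[region] += value
    # Rounding only removes floating-point noise from the sums.
    return {region: round(total, 10) for region, total in counts.items()}

def large_1(counts, min_support):
    best = {}                                # STEP 9: best region per file
    for region, value in counts.items():
        page = region.split(".")[0]
        if page not in best or value > counts[best[page]]:
            best[page] = region
    # STEP 10: keep the representative regions that reach the minimum support.
    return sorted(r for r in best.values() if counts[r] >= min_support)

counts = region_counts(TABLE_7)
l1 = large_1(counts, min_support=2)
```

With a minimum support of 2, the regions kept are B.Short, C.Middle and D.Middle.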

Table 9. The membership values for sequence (B.Short, C.Middle)

Client ID  Membership value of (B.Short, C.Middle)
1          0.8
2          0.0
3          0.0
4          0.6
5          0.8
6          0.0

STEP 10: The counts of the regions selected in STEP 9 are checked against the predefined minimum support value α. Assume α is set at 2 in this example. Since the count values of B.Short, C.Middle and D.Middle are larger than 2, these regions are put in L1.

STEP 11: Since L1 is not null, the next step is executed.

STEP 12: Set r = 1, where r is used to represent the length of sequential patterns currently kept.

STEP 13: The candidate set C2 is generated from L1 as follows: (B.Short, B.Short), (B.Short, C.Middle), (C.Middle, B.Short), ..., (C.Middle, D.Middle), (D.Middle, C.Middle), (D.Middle, D.Middle).

STEP 14: The following substeps are done for each newly formed candidate 2-sequence in C2:

(a) The fuzzy membership value of each candidate 2-sequence in each browsing sequence is calculated. Here, the minimum operator is used for the intersection. Take the sequence (B.Short, C.Middle) as an example. Its membership value in the fourth browsing sequence is calculated as max[min(1.0, 0.6), min(1.0, 0.4)] = 0.6, since there are two subsequences of (B.Short, C.Middle) in that browsing sequence. The results for sequence (B.Short, C.Middle) in all the browsing sequences are shown in Table 9.
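Substep (a) asks for the largest min-value over all order-preserving ways in which the candidate can be matched inside a browsing sequence. One way to compute this is a small dynamic program; this formulation and the name `sequence_membership` are assumptions of the sketch, not something stated in the paper. The two browsing sequences are those of clients 4 and 6 in Table 7.

```python
def sequence_membership(candidate, fuzzy_sequence):
    """Max over order-preserving matches of the min membership along the match."""
    # best[k] holds the best value found so far for matching the first k
    # regions of the candidate; matching nothing has value 1.0.
    best = [1.0] + [0.0] * len(candidate)
    for fuzzy_set in fuzzy_sequence:
        # Update right to left so one tuple is never used for two positions.
        for k in range(len(candidate), 0, -1):
            value = fuzzy_set.get(candidate[k - 1], 0.0)
            best[k] = max(best[k], min(best[k - 1], value))
    return best[-1]

client_4 = [{"B.Short": 1.0}, {"C.Middle": 0.6, "C.Long": 0.4},
            {"E.Middle": 0.2, "E.Long": 0.8}, {"B.Short": 1.0},
            {"C.Short": 0.6, "C.Middle": 0.4}]
client_6 = [{"D.Middle": 1.0}, {"C.Short": 0.2, "C.Middle": 0.8},
            {"E.Middle": 0.2, "E.Long": 0.8}, {"B.Short": 1.0}]

value_4 = sequence_membership(("B.Short", "C.Middle"), client_4)
value_6 = sequence_membership(("B.Short", "C.Middle"), client_6)
```

For client 4 the two possible matches give min(1.0, 0.6) and min(1.0, 0.4), so the result is 0.6; for client 6 the page C is browsed before B, so the result is 0.0. Both agree with Table 9.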

Table 10. The fuzzy counts of the candidate 2-sequences in C2

Sequence               Count   Sequence               Count
(B.Short, B.Short)     1.0     (C.Middle, C.Middle)   0.4
(B.Short, C.Middle)    2.2     (C.Middle, D.Middle)   0.0
(C.Middle, B.Short)    1.4     (D.Middle, C.Middle)   2.4
(B.Short, D.Middle)    1.2     (D.Middle, D.Middle)   0.6
(D.Middle, B.Short)    2.6

(b) The scalar cardinality (count) of each candidate 2-sequence in C2 is calculated. Results for this example are shown in Table 10.

(c) Since only the counts of the 2-sequences (B.Short, C.Middle), (D.Middle, B.Short) and (D.Middle, C.Middle) are larger than the predefined minimum support value 2, they are kept in L2.

STEP 15: Since L2 is not null, r = r + 1 = 2. STEPs 13–15 are then repeated to find L3. C3 is first generated from L2, and the sequence (D.Middle, B.Short, C.Middle) is generated. Since its count is 0.8, smaller than 2.0, it is not put in L3. L3 is thus an empty set. STEP 16 then begins.

STEP 16: The maximally large sequences are output to web-site managers. In this example, the three sequences B.Short → C.Middle, D.Middle → B.Short and D.Middle → C.Middle are output as meta-knowledge concerning the given log data.

7 Conclusions and future works
In this paper, we have proposed a novel web-mining algorithm, which can process web-server logs to discover fuzzy sequential browsing patterns among them. The duration of each web page browsed by a client is calculated from the time interval between the page and its next page. Since the time durations are numeric, fuzzy concepts are used here to process them and to form linguistic terms. Each web page uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of original web pages. The algorithm therefore focuses on the most important linguistic terms, which reduces its time complexity. A fuzzy mining process is then performed to find fuzzy browsing patterns. The mined rules are expressed in linguistic terms, which are more natural and understandable for human beings.

Although the proposed method works well in fuzzy web mining from log data, it is just a beginning. There is still much work to be done in this field. Our method assumes that the membership functions are known in advance. In [15–17], we also proposed some fuzzy learning methods to automatically derive the membership functions. In the future, we will attempt to dynamically adjust the membership functions in the proposed web mining algorithm to avoid an inappropriate choice of membership functions. We will also attempt to design specific web-mining models for various problem domains.

References
1. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large database, The 1993 ACM SIGMOD Conference, pp. 207–216

2. Agrawal R, Imielinski T, Swami A (1993) Database mining: a performance perspective, IEEE Trans Knowledge and Data Eng 5(6): 914–925
3. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules, The International Conference on Very Large Data Bases, pp. 487–499
4. Agrawal R, Srikant R (1995) Mining sequential patterns, The Eleventh International Conference on Data Engineering, pp. 3–14
5. Blishun AF (1987) Fuzzy learning models in expert systems, Fuzzy Sets and Systems 22: 57–70
6. de Campos LM, Moral S (1993) Learning rules for a fuzzy inference model, Fuzzy Sets and Systems 59: 247–257
7. Chan KCC, Au WH (1997) Mining fuzzy association rules, The Sixth ACM International Conference on Information and Knowledge Management, pp. 209–215
8. Chang RLP, Pavlidis T (1977) Fuzzy decision tree algorithms, IEEE Trans Syst, Man and Cybernetics 7: 28–35
9. Chen MS, Han J, Yu PS (1996) Data mining: an overview from a database perspective, IEEE Trans Knowledge and Data Eng 8(6): 866–883
10. Chen MS, Park JS, Yu PS (1998) Efficient data mining for path traversal patterns, IEEE Trans Knowledge and Data Eng 10: 209–221
11. Chen L, Sycara K (1998) WebMate: a personal agent for browsing and searching, The Second International Conference on Autonomous Agents, ACM
12. Clair C, Liu C, Pissinou N (1998) Attribute weighting: a method of applying domain knowledge in the decision tree process, The Seventh International Conference on Information and Knowledge Management, pp. 259–266
13. Clark P, Niblett T (1989) The CN2 induction algorithm, Machine Learning 3: 261–283
14. Cohen E, Krishnamurthy B, Rexford J (1999) Efficient algorithms for predicting requests to web servers, The Eighteenth IEEE Annual Joint Conference on Computer and Communications Societies 1: 284–293
15. Cooley R, Mobasher B, Srivastava J (1997) Grouping web page references into transactions for mining world wide web browsing patterns, Knowledge and Data Engineering Exchange Workshop, pp. 2–9
16. Cooley R, Mobasher B, Srivastava J (1997) Web mining: information and pattern discovery on the world wide web, Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558–567
17. Delgado M, Gonzalez A (1993) An inductive learning procedure to identify fuzzy systems, Fuzzy Sets and Systems 55: 121–132
18. Gonzalez A (1995) A learning methodology in uncertain and imprecise environments, Int J Intelligent Syst 10: 57–371
19. Graham I, Jones PL (1988) Expert Systems – Knowledge, Uncertainty and Decision, Chapman and Computing, Boston, pp. 117–158
20. Hong TP, Chen JB (1999) Finding relevant attributes and membership functions, Fuzzy Sets and Systems 103(3): 389–404
21. Hong TP, Chen JB (2000) Processing individual fuzzy attributes for fuzzy rule induction, Fuzzy Sets and Systems 112(1): 127–1400
22. Hong TP, Lee CY (1996) Induction of fuzzy rules and membership functions from training examples, Fuzzy Sets and Systems 84: 33–47
23. Hong TP, Kuo CS, Chi SC (1999) A data mining algorithm for transaction data with quantitative values, Intelligent Data Anal 3(5): 363–376
24. Hong TP, Tseng SS (1997) A generalized version space learning algorithm for noisy and uncertain data, IEEE Trans Knowledge and Data Eng 9(2): 336–340


25. Hou RH, Hong TP, Tseng SS, Kuo SY (1997) A new probabilistic induction method, J Automated Reasoning 18: 5–24
26. Kandel A (1992) Fuzzy Expert Systems, CRC Press, Boca Raton, pp. 8–19
27. Li D, Han J, Shi X, Chan MC (1998) Knowledge representation and discovery based on linguistic atoms, Knowledge-Based Syst 10: 431–440
28. Maddouri M, Elloumi S, Jaoua A (1998) An incremental learning system for imprecise and uncertain knowledge discovery, J Information Sci 109: 149–164
29. Mamdani EH (1974) Applications of fuzzy algorithms for control of simple dynamic plants, IEEE Proceedings, pp. 1585–1588
30. Hoppner F, Klawonn F, Kruse R, Runkler T (1999) Fuzzy Cluster Analysis, John Wiley & Sons Ltd, New York
31. Quinlan JR (1987) Decision tree as probabilistic classifier, The Fourth International Machine Learning Workshop, Morgan Kaufmann, San Mateo, CA, pp. 31–37
32. Quinlan JR (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA

33. Rives J (1990) FID3: fuzzy induction decision tree, The First International Symposium on Uncertainty, Modeling and Analysis, pp. 457–462
34. Wang CH, Hong TP, Tseng SS (1996) Inductive learning from fuzzy examples, The Fifth IEEE International Conference on Fuzzy Systems, New Orleans, pp. 13–18
35. Wang CH, Liu JF, Hong TP, Tseng SS (1999) A fuzzy inductive learning strategy for modular rules, Fuzzy Sets and Systems 103(1): 91–105
36. Weber R (1992) Fuzzy-ID3: a class of methods for automatic knowledge acquisition, The Second International Conference on Fuzzy Logic and Neural Networks, Iizuka, Japan, pp. 265–268
37. Yuan Y, Shaw MJ (1995) Induction of fuzzy decision trees, Fuzzy Sets and Systems 69: 125–139
38. Zadeh LA (1988) Fuzzy logic, IEEE Comp 83–93
39. Zimmermann HJ (1991) Fuzzy Set Theory and Its Applications, Kluwer Academic Publisher, Boston
