A Generalized Links and Text Properties Based Forum Crawler
Amit Sachan, Wee-Yong Lim, Vrizlynn L. L. Thing
Digital Forensics Lab, Cryptography & Security Department
Institute for Infocomm Research, Singapore
Email: {sachana, weylim, vriz}@i2r.a-star.edu.sg

Abstract—Web forums have become a major source of information gathering/mining due to the large amount of user-generated content they contain. Crawling of web forums is necessary to gather/mine this information. However, a generic web crawler is unable to crawl web forums efficiently and effectively because of the existence of many redundant and duplicate pages. In addition, there exists a crawling relationship among the useful pages that needs to be considered. So, for efficient crawling, we need to crawl web forums intelligently by eliminating redundant and duplicate pages and understanding the crawling relationship. Existing works in forum crawling use visual pattern recognition based methods, which make them extremely computationally expensive. In this paper, we propose a novel lightweight crawling method using the text and link properties of the pages in web forums. Theoretical analysis and experimental results show the effectiveness and efficiency of the proposed method.

Keywords: forum crawler; clustering; information retrieval

I. INTRODUCTION

Web forums have emerged as an important source of information over the Internet. Due to the large number of people discussing in web forums, a large amount of relatively unbiased and useful content is generated. This vast amount of data can be very valuable for various purposes such as national security, scam detection and user opinion mining. Web forums can be crawled to gather useful data. However, a generic breadth-first web crawler [1] may be unable to crawl web forums efficiently and effectively for the following two reasons. First, a generic breadth-first depth-limited web crawler may miss many useful pages and crawl the pages without understanding the correlation among them. To allow users to conveniently browse the pages according to their interest, multiple hierarchies of pages are common. A hierarchy usually consists of a board or seed page, list-of-thread pages and list-of-post pages. Each of these pages may be divided into multiple pages connected via page-flipping links. Unless we crawl according to the hierarchy and page-flipping links, it is difficult to understand the correlation among the pages. The second reason for the unsuitability of a generic web crawler is that it may crawl many useless (redundant and duplicate) forum pages. According to a statistic quoted by Cai et al. in [2], more than 40% of the pages crawled by a generic crawler on web forums are generally useless.

Although both generic web crawling and forum-specific crawling require the handling of useless pages, generic web crawlers, by their nature, are unable to exploit the structured hierarchy of forum sites to obtain optimal traversal paths. Hence, there is a need to take into account a priori information on the types of pages deemed useful in forum sites. Useless pages that do not contain information pertaining to the discussions, such as user profile pages and login portals, should not be crawled. Another type of useless page contains duplicate information (e.g. due to shortcut links, query links and sorting links). These pages are created to facilitate convenient browsing and searching. For example, links to the latest posts are usually placed on the board page, so that users can access the latest posts directly, without going through the corresponding list-of-thread pages. Downloading useless pages wastes network bandwidth and negatively affects repository quality, which in turn affects the performance of subsequent data analysis. To handle the above mentioned problems, a crawler specific to web forums is necessary. Of all the links present on web forums, it is necessary to identify the links corresponding to the desired pages. This capability does not exist in a generic web crawler. There are several works that deal with the problems of forum crawling. Guo et al. in [3] assumed that forums are generally created using specific software and designed a forum crawling scheme based on heuristic rules particular to the software used. But this is clearly dependent on the prevalence of the different forum software, which may change over time. Moreover, forums may not expose information on the software used to create them. Cai et al. in [2] designed a forum crawler by categorizing the pages according to their templates.
But this method incurs a high computational burden because of the costly pattern recognition operations involved during the training and online crawling phases. Wang et al. in [4] extended the work in [2] and elaborated on the automatic traversal strategy design. But this work suffers from a similarly high computational overhead as [2]. Also, the traversal strategy design in [4] requires the downloading of a large number of pages (about 5,000) during the training phase. The objective of the work in this paper is to design
a general, automatic, computation-efficient and bandwidth-efficient forum crawler. We propose a forum crawler that only uses the text and link properties of forum pages in its automated traversal strategy generation. The proposed forum crawler consists of the following three modules: i) URL sampling and clustering, ii) traversal path generation using text and link properties, and iii) traversal path score computation. In comparison with the costly pattern recognition operations in [4], the proposed crawler requires only lightweight operations such as string pattern matching, computation of the amount of text and links on the pages, and computation of a similarity matrix and scores for the traversal paths. The proposed crawler requires the download of fewer than 1,000 pages to generate a traversal strategy, in comparison with about 5,000 pages used in [4]. Also, the proposed forum crawler is more general than the existing works, as no assumptions are made on the forum software or page templates used in building the web forum. The rest of the paper is organized as follows. Section II discusses the works related to forum crawling. In Section III, we briefly discuss the organization of general web forums. In Section IV, we discuss our proposed approach to forum crawling. Experiments and performance analysis of the proposed forum crawling algorithm are discussed in Section V. Finally, the paper is concluded in Section VI.

II. RELATED WORK

Several works in the literature deal with the problem of forum crawling. Cai et al. in [2] designed a forum crawler by identifying the repetitive regions and patterns on forum pages. Their proposed algorithm is applicable to a wide variety of forums, but suffers from several drawbacks. First, it requires a long training time due to the use of relatively costly visual features and pattern detection methods.
Second, the method may not be applicable to forums whose templates do not satisfy the template properties specified by the authors. For example, in ubuntuforums.org, the page template of the board page does not satisfy the authors' repetitive-region assumption. Third, visual feature extraction is required during online traversal, which causes significant overhead during crawling. Wang et al. [4] improved the work in [2] by adding an additional module to determine the traversal strategy among the skeleton pages. The authors use coverage and informativeness as the criteria to design the traversal strategy. The proposed algorithm searches a number of states exponential in the number of candidate skeleton links to obtain an optimum traversal strategy. The authors also proposed an optimization process for selecting possible vertices in the traversal path based on significant increases in coverage or decreases in informativeness. However, it is difficult to choose a generic set of threshold values because information is presented in different manners in different web forums.

Zhang et al. in [5] proposed a system for information extraction by automatically extracting rules from web forums. The method is not directly related to forum crawling, but it may be useful for extracting the relevant links from forum pages. The authors in [6], [7] proposed methods for extracting user comments from post pages. We do not address this issue in this paper, as our focus is to generate a crawling strategy that retrieves as many useful pages as possible. Focused crawling and deep web crawling are special types of web crawling that appear similar to forum crawling. However, their main focus is different. Focused web crawlers [8], [9] follow the strategy of downloading only pages that are relevant to particular given topics, using keywords to decide the relevance of web pages. However, this strategy is not applicable to forum crawling, as the objective of forum crawling is to retrieve as much user-generated content and associated information as possible. Deep web crawlers [10] focus on generating appropriate queries to retrieve hidden content from web sites where the data is otherwise not readily viewable/available. In contrast, the focus of forum crawling is to identify and follow valuable links in the site.

III. ORGANIZATION OF WEB FORUMS

Web forums are generally organized in a hierarchical structure to facilitate the browsing experience of users. The hierarchy of pages usually consists of a board page, list-of-thread pages and list-of-post pages. Note that there may exist more (e.g. sub-board pages) or fewer (e.g. list-of-thread pages may be absent) levels of hierarchy in a forum site. We collectively refer to these pages as skeleton pages. Skeleton pages may be linked to other pages of the same type through page-flipping links, and these are referred to as the flipped-skeleton pages of the skeleton page.
In addition to skeleton and flipped-skeleton pages, useless pages such as user profile pages, login portal pages, and pages corresponding to shortcut links, sorting links, etc. also exist. Based on their content, we divide useless pages into two categories. The pages in the first category are those that do not contain useful links or text, such as user profile pages and login portal pages. These pages are referred to as redundant-information pages. The pages in the second category are those that contain useful links or text, but duplicated from one or more skeleton/flipped-skeleton pages. Examples of such pages are the pages corresponding to shortcut links and sorting links. We collectively refer to these pages as duplicate-information pages.

IV. PROPOSED FORUM CRAWLING ALGORITHM

The objectives of forum crawling are to distinguish between useful and useless pages and to establish the traversal relationship among the different useful pages. In this section, we propose a forum crawling algorithm using the outgoing

Characters                                        Symbol
Special characters {'.', '=', '&', '?', '/'}      no change
Numeric {0123456789}                              @
Others                                            *

Table I: Signature symbols for characters in a link

links and text information in the forum pages. The proposed algorithm consists of three steps. The first step is a simple collection of URLs in random pages sampled from the targeted forum site. The second step is to distinguish the different types of pages in the site based on the sampled URLs. These steps are done using our URL sampling and clustering processes. The final step deals with the generation of the traversal path using the clusters obtained in the previous step. A. URL Sampling and Clustering In this step, we categorize different types of pages into different clusters based on their URLs. Initially, a small number of random pages from the targeted web forum are sampled and all URLs in these pages are extracted. Then, we employ a two-step clustering approach as discussed in this section. The first step is a signature-based clustering. The second step involves further clustering within each cluster based on common word extraction. 1) Signature Generation and Signature Based Clustering: In this step, signatures are generated for the sampled URLs for the purpose of differentiation among different types of pages. Given a URL, its signature is generated by representing each character in the URL with a symbol and then collapsing consecutive similar symbols into a single symbol. As shown in Table I, for the signature generation process, we divide the characters in the URLs into three different categories namely special characters, numeric characters and all other characters. The basis of this categorization is that numeric and special characters usually play a dominant role in distinguishing between different types of URLs in the web forums. Special characters are characters that are used in the URLs for special purposes and these are primarily based on the reserved characters defined in RFC 3986 [11]. Sampled URLs are clustered next based on their signature. A separate cluster is made for each different signature. 
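As an illustration of the signature generation process, the following sketch (the function name and implementation details are our own) maps each character to its symbol class from Table I and collapses consecutive identical symbols:

```python
def url_signature(url: str) -> str:
    """Generate a URL signature: map characters to class symbols, collapse runs."""
    special = set(".=&?/")  # reserved characters kept unchanged (per Table I)
    symbols = []
    for ch in url:
        if ch in special:
            sym = ch        # special characters map to themselves
        elif ch.isdigit():
            sym = "@"       # numeric characters
        else:
            sym = "*"       # all other characters
        if not symbols or symbols[-1] != sym:  # collapse consecutive repeats
            symbols.append(sym)
    return "".join(symbols)

# Both example links from the paper share the signature "*.*.*/*.*?*=@":
# url_signature("www.scam.com/forumdisplay.php?f=23")
# url_signature("www.scam.com/showthread.php?t=146987")
```

Clustering by signature then amounts to grouping the sampled URLs by their signature string.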
The URLs with the same signature are placed in the same cluster. 2) Further Clustering Based on Keyword Extraction: We observed that clustering based on the generated signatures may not be sufficient to separate the different types of pages in a web forum. In several scenarios, signature collisions occur because forum sites use different strings, at the same position in the URL, for different types of pages. For example, the links "www.scam.com/forumdisplay.php?f=23" and "www.scam.com/showthread.php?t=146987" correspond to a list-of-thread page and a
list-of-post page respectively. However, the same signature (*.*.*/*.*?*=@) is generated for both links. In order to further separate the URLs in such cases, a second-level cluster separation algorithm is applied by extracting common strings from the links in a cluster. In this proposed algorithm, we search for a set of strings (i.e. keywords) within the links that can indicate different types of pages within each cluster. New clusters are formed to better distinguish the links in the original cluster based on the presence of these keywords. For each URL in each previously generated cluster, we first segment the URL into various hierarchies using '.', '/' or '?' as delimiters. Hierarchies containing the known domain of the site are ignored. Within each of the remaining hierarchies, the partial URL is further segmented into multiple strings using any of the characters in "0123456789#- +" as delimiters. Strings that occur at the same hierarchy in more than one URL are recorded as potential keywords for that hierarchy. Strings that could not be further segmented are also selected as potential keywords for the hierarchy. There can be multiple sets of keywords (one set for each hierarchy), but only the set from the left-most hierarchy that satisfies certain pre-defined threshold criteria is selected. For example, one such criterion can be that the URLs containing the keywords should contain less than a threshold number of strings in that hierarchy. The following Example 1 illustrates the second-level clustering result for a set of URLs in a cluster.

Example 1: Consider the following six links in a cluster with the signature "*.*.*/*.*?*=@":
1. www.scam.com/forumdisplay.php?f=23
2. www.scam.com/forumdisplay.php?f=2
3. www.scam.com/showthread.php?t=146987
4. www.scam.com/blog.php?u=141740
5. www.scam.com/showthread.php?t=146987
6. www.scam.com/member.php?u=141740

In these links, the domain "www.scam.com" is ignored and the search for potential keywords starts from the next hierarchy. The strings "forumdisplay.php" and "showthread.php" are chosen as the keywords, thus dividing the cluster into three clusters, each with the same signature but different keywords. The first cluster (with keyword "forumdisplay.php") contains links 1 and 2. The second cluster (with keyword "showthread.php") contains links 3 and 5. The third cluster (with no keyword) contains the remaining links 4 and 6. It should be noted that even if all URLs are categorized into one of the newly generated clusters, the original cluster with no keyword still remains. This is to hold any future unseen URLs that match the signature but do not possess any differentiable keywords. 3) Representation of Clusters: After clustering the collected URLs, let N be the total number of clusters formed. Each cluster can be represented with the following three parameters, namely, (i) signature, (ii) keyword (or lack thereof), and
(iii) the hierarchy where this keyword can be found. 4) Link and Text Properties of the Clusters: Before generating the traversal path, the text standard deviation (textSD) for the sampled pages in each cluster and the Outlink matrix representing the links between the different known clusters are computed. The textSD is computed from the amount of text on the sampled pages. The entry in the i-th row and j-th column of the Outlink matrix, denoted Outlink[i][j], is the average number of links belonging to the j-th cluster that are listed on the pages of the i-th cluster. B. Traversal Strategy Design Traversal strategy design deals with the establishment of the traversal relationship among the generated URL clusters, and it consists of three steps. The first step is the classification of clusters. The second step is the generation of the candidate traversal paths, and the third step is the selection of the best traversal path. 1) Classification of Clusters: In this step, each cluster is classified as a redundant-information cluster, a link cluster or a text cluster. Redundant-information clusters contain pages with neither useful links nor useful text. Link clusters contain pages in which useful information is present mainly in the form of URL links and their associated data. Examples of such clusters include those for the board page and list-of-thread pages. Text clusters contain pages where the desired information is mainly in the form of text. Examples of such clusters include the clusters corresponding to the list-of-post pages. We use the Outlink matrix and text standard deviation proposed in Section IV-A4 to classify the clusters. Text utility (TU) and link utility (LU) scores are calculated for all the clusters as shown in Algorithm 1. The clusters with both link utility and text utility scores below the thresholds T_LU and T_TU respectively are considered redundant-information clusters and discarded.
The clusters with a text utility score above T_TU but a link utility score below T_LU are classified as text clusters, while the remaining clusters are classified as link clusters. Redundant-information clusters are discarded in order to reduce noise and facilitate traversal path prediction. In our experiments with different web forums, we obtained about 50 to 100 different clusters, of which only a handful correspond to useful pages. Therefore, the relatively large number of redundant-information clusters does indeed cause unnecessary noise obscuring the link relationship between the useful clusters. 2) Candidate Traversal Paths Generation: Despite the removal of redundant-information clusters, both linkClusters and textClusters may still contain some noise clusters (e.g. clusters corresponding to shortcut pages). In this second step, our aim is to further detect the clusters corresponding to useful pages and generate the likely traversal paths.

Algorithm 1: Link utility and text utility calculation

  Sets: linkClusters, textClusters.
  const float: T_LU, T_TU.
  float: maxLinks = 0, maxTextSD = 0, tempLinks = 0.
  for i = 1 to N do
      tempLinks = sum_{j=1..N} Outlink[i][j]
      if tempLinks > maxLinks then maxLinks = tempLinks
      if textSD_i > maxTextSD then maxTextSD = textSD_i
  for i = 1 to N do
      LU[i] = (sum_{j=1..N} Outlink[i][j]) / maxLinks
      TU[i] = textSD_i / maxTextSD
      if (LU[i] < T_LU AND TU[i] < T_TU) then
          Discard the i-th cluster.
          for j = 1 to N do
              Outlink[i][j] = 0; Outlink[j][i] = 0
      else if (LU[i] < T_LU AND TU[i] >= T_TU) then
          Add the i-th cluster to the set textClusters.
      else
          Add the i-th cluster to the set linkClusters.
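Algorithm 1 can be sketched in Python as follows (0-based indices; the function name and the threshold values are illustrative, not taken from the paper):

```python
def classify_clusters(outlink, text_sd, t_lu=0.1, t_tu=0.1):
    """Classify clusters as link/text clusters; discard redundant-information ones.
    outlink: N x N list of lists (mutated in place); text_sd: length-N list."""
    n = len(outlink)
    row_sums = [sum(outlink[i]) for i in range(n)]
    max_links = max(row_sums)          # maxLinks in Algorithm 1
    max_text_sd = max(text_sd)         # maxTextSD in Algorithm 1
    link_clusters, text_clusters, discarded = [], [], []
    for i in range(n):
        lu = row_sums[i] / max_links   # link utility LU[i]
        tu = text_sd[i] / max_text_sd  # text utility TU[i]
        if lu < t_lu and tu < t_tu:
            discarded.append(i)        # redundant-information cluster
            for j in range(n):         # zero its row and column in Outlink
                outlink[i][j] = 0
                outlink[j][i] = 0
        elif lu < t_lu:                # here tu >= t_tu
            text_clusters.append(i)
        else:
            link_clusters.append(i)
    return link_clusters, text_clusters, discarded
```

Zeroing the discarded clusters' rows and columns keeps them from perturbing the later similarity and specificness computations.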

In our proposed algorithm, we process the link clusters before the text clusters in the traversal strategy design. The rationale is that the cluster(s) for the list-of-post pages are likely to be text clusters, while the rest of the clusters, starting from the seed page and up to the list-of-post pages, are likely to be link clusters. To generate a traversal path, we start from the cluster containing the forum board page as the current skeleton cluster. Then Steps i and ii (shown below) are applied recursively until the termination criterion (Step iii) is satisfied. Step (i) Flipped-Skeleton Cluster Selection: The selection of possible flipped-skeleton clusters for a given skeleton cluster is based on the observation that flipped-skeleton pages generally contain a similar proportion and similar types of links as the skeleton pages. To detect clusters containing pages with similar types of links, we derive the Similarity matrix to record the similarity scores between all cluster pairs. For two clusters i and j, Similarity[i, j] is defined as:

Similarity[i, j] = ( sum_{k=1..N} Min(Outlink[i, k], Outlink[j, k]) ) / ( sum_{k=1..N} Max(Outlink[i, k], Outlink[j, k]) )   (1)
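The similarity score of Equation 1 is a weighted-Jaccard measure over the two clusters' outgoing-link profiles; a minimal sketch (the function name and zero-denominator handling are our own):

```python
def similarity(outlink, i, j):
    """Similarity of Eq. (1): weighted Jaccard over rows i and j of Outlink."""
    num = sum(min(a, b) for a, b in zip(outlink[i], outlink[j]))
    den = sum(max(a, b) for a, b in zip(outlink[i], outlink[j]))
    return num / den if den else 0.0  # two empty rows are treated as dissimilar
```

A score of 1 means the two clusters' pages list identical average numbers of links to every other cluster, which is the signature behaviour expected of a skeleton cluster and its flipped-skeleton cluster.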

Let i be the current skeleton cluster; the candidate flipped-skeleton clusters are all clusters whose similarity score with respect to cluster i is above a threshold (T_sim). Nonetheless, it may be possible for undesired duplicate-information clusters to have a high similarity score. Thus, the following criteria are also considered when selecting the
corresponding flipped-skeleton cluster:
1) We observed that the URLs of flipped-skeleton pages generally contain an additional numeric value compared to those of the corresponding skeleton page. So, the signature of the flipped-skeleton cluster should contain an additional '@' character.
2) We also observed that the signatures of flipped-skeleton clusters remain shorter than the signatures of clusters containing duplicate-information pages. So, the cluster with the shortest signature out of all eligible clusters is selected as the flipped-skeleton cluster.
If no cluster satisfies the above mentioned criteria, the algorithm does not output any flipped-skeleton cluster. After this step, the Outlink matrix is modified by discarding all the flipped-skeleton candidates. This helps to increase the specificness (discussed in Step ii) of the next-skeleton links cluster to the current skeleton links cluster, as the links corresponding to the next skeleton cluster are also present in all the flipped-skeleton candidate clusters. Step (ii) Next Skeleton Cluster Selection: First, a Specificness matrix is derived after the modification of the Outlink matrix in Step i. The Specificness matrix gives the probability that the links in a cluster originate from the pages in another cluster. In particular, Specificness[i, j] is the probability that the links in cluster i originate from the pages in cluster j. Mathematically, we define Specificness[i, j] as:

Specificness[i, j] = Outlink[j, i] / ( sum_{l=1..N} Outlink[i, l] * sum_{m=1..N} Outlink[m, j] )   (2)

The cluster k (1 ≤ k ≤ N, k ≠ i) is chosen as the next skeleton cluster of the current skeleton cluster i if Specificness[i, j] * Outlink[i, j] is maximized at j = k and cluster k does not satisfy the termination criterion defined in Step iii. This criterion uses the observation that the next-skeleton links are usually specific to the current skeleton pages, and these links are present in significantly large numbers on the current skeleton pages. For example, for the forum "www.scam.com", the links for list-of-thread pages, post pages and several other types of pages are present in relatively large numbers on the board page. But the links for list-of-thread pages are more specific to the board page, as they are present mainly on the board page, whereas the links for the other types of pages are also present on other pages.
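A sketch of the next-skeleton selection, assuming our reading of the Specificness definition in Equation 2 (the function names and the zero-denominator handling are our own):

```python
def specificness(outlink, i, j):
    """Eq. (2) as we read it: Outlink[j][i] normalized by cluster i's total
    outgoing links and by the total links on the pages feeding cluster j."""
    row_i = sum(outlink[i])                                  # sum_l Outlink[i][l]
    col_j = sum(outlink[m][j] for m in range(len(outlink)))  # sum_m Outlink[m][j]
    denom = row_i * col_j
    return outlink[j][i] / denom if denom else 0.0

def next_skeleton(outlink, i):
    """Pick k != i maximizing Specificness[i][k] * Outlink[i][k] (Step ii)."""
    best, best_score = None, 0.0
    for k in range(len(outlink)):
        if k == i:
            continue
        score = specificness(outlink, i, k) * outlink[i][k]
        if score > best_score:
            best, best_score = k, score
    return best  # None if no cluster scores above zero
```

In a full implementation the candidate k would additionally be checked against the Step-iii termination criterion before being accepted.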

Step (iii) Termination Criterion: We use a minimum links threshold (T_ML) based termination criterion to decide whether a next skeleton cluster to the current skeleton cluster exists. A cluster is selected as a candidate for the next skeleton cluster only if at least T_ML links to it are present, on average, per sampled page in cluster i. Note that this criterion need not be applied to the cluster(s) for list-of-post pages, since they are deemed to be the last type of page to traverse in a web forum. If no more clusters are left for the detection of the next skeleton cluster, the algorithm to generate a traversal path terminates.

In our proposed algorithm, we use four different thresholds, namely the page similarity threshold (T_sim), link utility threshold (T_LU), text utility threshold (T_TU) and minimum links threshold (T_ML). It is difficult to select the most appropriate value for all these thresholds, since the optimal thresholds may vary across web forums. So, instead of manually selecting the thresholds for each forum, we derive a method to score the traversal path generated using a given set of four thresholds. In the automated traversal strategy generation process, we generate candidate traversal paths for several different sets of these four thresholds. The scoring and selection of the best traversal path is the final step in our traversal strategy design and is explained below.

3) Scoring of Traversal Paths: The score for a traversal strategy is calculated from a path score and a page-flipping score. The path score is calculated based on the relationship among the skeleton clusters; it demonstrates how well the skeleton clusters in a traversal path are connected. The page-flipping score is calculated based on the relationship between each skeleton cluster and its corresponding flipped-skeleton cluster. Let a traversal path T of n skeleton clusters with indices s1, s2, ..., sn and corresponding flipped-skeleton clusters with indices f1, f2, ..., fn be given as: T = (s1, f1) → (s2, f2) → ... → (sn, fn), where fi (1 ≤ i ≤ n) is null if no flipped-skeleton cluster is selected for the i-th skeleton cluster. The path score for T is given by:

pathScore(T) = ( Outlink[s1, s2] * Specificness[s1, s2] + Outlink[s2, s3] * Specificness[s2, s3] + ... + Outlink[s_{n-1}, s_n] * Specificness[s_{n-1}, s_n] ) / (P * (n - 1))   (3)

where P is a penalty term, without which the score is biased towards improbably short path lengths. However, this penalty does not affect forums in which the actual traversal paths are short, since the score becomes significantly lower if a skeleton cluster is wrongly included in the path. The page-flipping score for the traversal path T, given by Equation 4, is the average similarity score between the skeleton clusters and their corresponding flipped-skeleton clusters:

pfScore(T) = ( Similarity[s1, f1] + Similarity[s2, f2] + ... + Similarity[sn, fn] ) / m   (4)

where m = n - 1 if the board page does not contain flipped-skeleton links and m = n if it does. The traversal score for the traversal path T is then given by:

traversalScore(T) = pathScore(T) * pfScore(T)   (5)
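The scoring of Equations 3-5 can be sketched as follows; as a simplification, m is taken here to be the number of skeleton clusters that actually have a flipped-skeleton cluster, and the penalty value P and all names are illustrative:

```python
def traversal_score(outlink, spec, path, flips, sim, penalty=2.0):
    """Combine Eq. (3)-(5). path: skeleton cluster indices [s1..sn] (n >= 2);
    flips: per-skeleton flipped-cluster index, or None if absent."""
    n = len(path)
    # Eq. (3): sum of Outlink * Specificness along consecutive skeleton pairs,
    # normalized by the penalized path length.
    path_sum = sum(outlink[path[t]][path[t + 1]] * spec[path[t]][path[t + 1]]
                   for t in range(n - 1))
    path_score = path_sum / (penalty * (n - 1))
    # Eq. (4): average similarity over skeleton / flipped-skeleton pairs
    # (simplified: m counts only the pairs where a flipped cluster exists).
    pairs = [(s, f) for s, f in zip(path, flips) if f is not None]
    pf_score = sum(sim[s][f] for s, f in pairs) / len(pairs) if pairs else 0.0
    # Eq. (5): final traversal score.
    return path_score * pf_score
```

Running this scorer over the candidate paths produced by each threshold set, and keeping the highest-scoring path, implements the automated threshold selection described above.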

V. EXPERIMENTAL PERFORMANCE ANALYSIS

In this section, we present the experimental results for our proposed forum crawler. We demonstrate the crawling quality of the proposed crawler by evaluating the generated traversal paths as well as the recall, precision and specificity values based on the pages identified by the traversal paths. We have chosen sites with different URL structures to better test the algorithm's learning capabilities across a variety of forums. However, the assumption underlying this traversal path generation algorithm is that the targeted page types in each site have distinct URL structures, allowing them to be sorted into distinct clusters. Our observation is that this assumption is sufficiently valid. It is further observed that common URL structures in forum sites are usually either keyword-based or verbose-based (or mixed). The former is where the types of pages can be deduced from keywords in the URL such as "displayforum", "showthread", etc. The latter is where the URL contains the title/heading of the page instead of keywords that indicate the type of page the URL points to. Common instances of such URL structures can be found in list-of-thread or list-of-post pages, where the forum or thread titles are often part of the pages' URLs. In this case, although there are generally more words in these URLs, these words do not help as much in differentiating the various types of pages in the forum. For evaluating the traversal path, each generated traversal path is compared against the intended browsing path, which is deemed to be Seed Page → List-of-Thread pages → List-of-Thread flipped-skeleton pages → List-of-Post pages → List-of-Post flipped-skeleton pages. Pages belonging to any of the page types in the intended browsing path are termed targeted pages to be crawled. All other pages are considered redundant pages.
To generate the traversal path for each forum site, 100 pages were first randomly sampled from each site and all links in these sampled pages were clustered. Next, pages in each cluster were randomly sampled and the characteristics mentioned in Section IV-A4 were computed. Finally, a traversal path was generated using the method described in Section IV-B. Table II presents the experimental results for the clustering and traversal path generation methods for the different forum sites. A cluster is considered to be correctly built if the majority of the sampled targeted pages of the same type are grouped in the same cluster. A selected cluster in the traversal path is considered correct if it is in the correct order and correctly represents its corresponding targeted page. The results show that the proposed method is able to achieve 100% accuracy in building the clusters for the targeted types of pages in the different forums. The recall for the identification of the List-of-Thread, List-of-Thread flipped-skeleton, List-of-Post and List-of-Post flipped-skeleton clusters for building the traversal paths are
100%, 70%, 100% and 80%, respectively. The precision for the identification of the mentioned page types is 100%, 87.5%, 100% and 80%, respectively. The difficulty in identifying the flipped-skeleton clusters for some forums is due to the relatively small number of flipped-skeleton pages in those forums. Here, we would like to mention that we are unable to compare our work with the existing works [2], [4], because the source code for these works is not publicly available. To our knowledge, there exists no other work focused on deriving the crawling strategy for web forums. In addition to evaluating the clustering and traversal path generation methods, we also performed experiments to determine the bandwidth saving, recall, precision and specificity values when using the traversal paths to classify the web pages. Pages belonging to any of the targeted page types in the traversal path are considered targeted pages, while the rest are considered redundant pages. The targeted pages are regarded as the "positive" class while the redundant pages are regarded as the "negative" class. However, the page type of a targeted page must be correctly identified in the traversal path for its classification as a true positive. For example, a list-of-post page wrongly identified as a list-of-thread page is considered a false negative. For experimental analysis purposes, we randomly sampled 1,000 pages from each forum site and determined the bandwidth saving over a generic web crawler by calculating the percentage of redundant pages in the sampled pages. Here, we assume that all pages, whether redundant or useful, require the same amount of bandwidth. The bandwidth saving for the different forums is shown in Table III. The average overall bandwidth saving for the forums used in our experiments is found to be 62.1%, ranging from 31.5% (for www.cashfindforum.com) to 76.9% (for www.stormfront.org/forum).
For calculating recall, precision and specificity, we classify the sampled pages as either positive or negative using the traversal path generated for the forum site. We then manually verify the classification to compute the recall, precision and specificity values, which are shown in Table IV. For each forum site, both the automated traversal path and the manually corrected traversal path (if the automated traversal path is not correct) are used. Recall is given by the ratio of correctly identified targeted pages to the total number of sampled targeted pages. This value shows the ability of the traversal path to identify all the targeted pages in the forum site. Precision is given by the ratio of correctly identified pages to all pages identified as useful by the traversal path. This value is indicative of the traversal path's ability to correctly identify pages belonging to a page type without error. Given that the "positive" class consists of multiple targeted page types, but that there is no differentiation among the redundant pages, the specificity value, given as the total number of correctly identified redundant pages out of the total number

Forum                          Type     Seed  List-of-Thread  Flipped-skeleton  List-of-Post  Flipped-skeleton  Time
www.edaboard.com               keyword  C,T   C,T             C,T               C,T           C,T               1109 s
www.scam.com                   keyword  C,T   C,T             C,T               C,T           C,T               1134 s
ubuntuforums.org               keyword  C,T   C,T             C,T               C,T           C,T               2033 s
forums.moneysavingexpert.com   keyword  C,T   C,T             C,T               C,T           C,T               1376 s
www.stormfront.org/forum       keyword  C,T   C,T             C                 C,T           C,T               1039 s
www.cashfindforum.com          verbose  C,T   C,T             C                 C,T           C,T               1397 s
creditcardforum.com/forum      verbose  C,T   C,T             C,T               C,T           C,T               1322 s
www.scamfound.com              mixed    C,T   C,T             C,F               C,T           C,T               1280 s
www.realscam.com               mixed    C,T   C,T             C,T               C,T           C,F               1951 s
forums.islamicawakening.com    mixed    C,T   C,T             C,T               C,T           C,F               2484 s

Table II. Clustering and traversal path generation performance. 'C' indicates that a cluster has been built containing the majority of sampled targeted URLs of the same type of page. 'T' indicates that the cluster is correctly present in the generated traversal path. 'F' indicates a wrongly chosen cluster. Time indicates the training duration for each site.
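The URL clustering evaluated in Table II groups links by a signature derived from their URL structure. The sketch below is a minimal illustration of this idea, not the paper's exact signature function from Section IV-A: each path segment is abstracted into a pattern of character-class runs, and URLs with identical patterns share a cluster. All function names here are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def segment_signature(segment):
    """Abstract one path segment into a pattern of character-class runs,
    e.g. 'f22' -> 'AD', 'how-identify-scam-502' -> 'A-A-A-D'."""
    sig = []
    for token in re.split(r"([0-9]+|[a-zA-Z]+)", segment):
        if not token:
            continue
        if token.isdigit():
            sig.append("D")      # run of digits
        elif token.isalpha():
            sig.append("A")      # run of letters
        else:
            sig.append(token)    # punctuation kept literally
    return "".join(sig)

def url_signature(url):
    """Join the per-segment patterns of the URL path into one signature."""
    path = urlparse(url).path
    return "/".join(segment_signature(s) for s in path.split("/") if s)

def cluster_urls(urls):
    """Group URLs whose signatures are identical."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return clusters
```

Under this abstraction, verbose thread titles with differing word counts produce different signatures, which illustrates why verbose-based URL structures are harder to cluster than keyword-based ones.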

Forum                          Percentage bandwidth saving
www.edaboard.com               71.9%
www.scam.com                   69.8%
ubuntuforums.org               67.7%
forums.moneysavingexpert.com   75.7%
www.stormfront.org/forum       76.9%
www.cashfindforum.com          31.5%
creditcardforum.com/forum      73.3%
www.scamfound.com              70.5%
www.realscam.com               50.6%
forums.islamicawakening.com    33.2%

Table III. Percentage bandwidth saving for different forums.

of sampled redundant pages, is also calculated. Specificity provides an indication of the traversal path's ability to identify redundant pages; a higher value indicates a more focused traversal and less bandwidth wastage.

From the results shown in Table IV, we observe that the best performance is obtained for forums with keyword-based URL structures, followed by those with mixed and verbose-based URL structures. This is because the different types of pages in keyword-based forums are usually distinctly separated by their keywords. In contrast, the URLs of different page types in verbose-based forum sites may not be as well separated as those in keyword-based forum sites (e.g., URLs of multiple page types sharing the same cluster, or URLs of the same page type present in more than one cluster). For example, the two list-of-post pages with URLs "www.realscam.com/f22/2000week-wanting-stay-home-watch-movies-679" and "www.realscam.com/f22/how-identify-scam-502" will be placed in separate clusters due to the different signatures generated for their links. Hence, the performance is negatively affected for forum sites with verbose-based (or mixed) URL structures. Nonetheless, the proposed method is generally able to build clusters containing the majority of links for each targeted page type, thus allowing the design of a traversal path representing the majority of such targeted pages. Thus, we are able to achieve high precision and recall, in the range of 80% to 100%, for all the forums tested. Also, our proposed crawler uses simple text and link properties, in contrast with the costly visual feature extraction used in existing works. This helps us achieve a short training time (usually 10 to 30 minutes for the different forums). Lastly, our proposed method is more robust and general, as it is independent of the templates of the different pages on forums (as compared with [2], [4]).

VI. CONCLUSION

This paper proposes a lightweight algorithm that uses only the text and link properties of forum web pages to generate a traversal path automatically. The purpose of this traversal path generation is twofold: to identify the targeted pages to access, and to represent the traversal relationship among the different types of pages in a forum site. From the experiments, we observed that the majority of the targeted pages are clustered correctly by our clustering algorithm, based simply on their URL structures. In addition, complete traversal paths were generated correctly for a significant number of forum sites. The main advantages of the proposed forum site traversal path generation are that only a small number of pages need to be sampled from the site and that the training time is short. Future work is envisioned to improve the detection rate for forum sites with verbose-based URL structures, considering that clustering may not be perfect in such cases. We intend to use additional techniques, such as XPath learning, to further improve our work in this aspect.

ACKNOWLEDGEMENT

We gratefully acknowledge the funding of this research by Visa Worldwide Pte Limited (research grant CA/20110921/017).

REFERENCES

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, 1998.

[2] R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, "iRobot: An intelligent crawler for web forums," in WWW Conference, 2008, pp. 447-456.

[3] Y. Guo, K. Li, K. Zhang, and G. Zhang, "Board forum crawling: a web crawling method for web forum," in IEEE/WIC/ACM International Conference on Web Intelligence, 2006, pp. 745-748.

[4] Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, "Exploring traversal strategy for web forum crawling," in ACM SIGIR International Conference on Research and Development in Information Retrieval, 2008, pp. 459-466.

[5] J. Zhang, C. Zhang, W. Qian, and A. Zhou, "Automatic extraction rules generation based on XPath pattern learning," in Web Information Systems Engineering - WISE 2010 Workshops, 2011, pp. 58-69.

[6] J.-M. Yang, R. Cai, C. Wang, H. Huang, L. Zhang, and W.-Y. Ma, "A threadwise strategy for incremental crawling of web forums," in WWW Conference, 2009.

[7] W. Liu, H. Yan, and J. Xiao, "Automatically mining review records from forum web sites," in IEEE International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010, pp. 2450-2455.

[8] A. Pirkola and T. Talvensaari, "Addressing the limited scope problem of focused crawling using a result merging approach," in ACM Symposium on Applied Computing, 2010, pp. 1735-1740.

[9] S. Batsakis, E. G. M. Petrakis, and E. Milios, "Improving the performance of focused web crawlers," Data & Knowledge Engineering, vol. 68, no. 10, pp. 1001-1013, 2009.

[10] J. Lu, Y. Wang, J. Liang, J. Chen, and J. Liu, "An approach to deep web crawling by sampling," in IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008, pp. 718-724.

[11] T. Berners-Lee, R. Fielding, and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax," RFC 3986 (Standard), Internet Engineering Task Force, Jan. 2005. [Online]. Available: http://www.ietf.org/rfc/rfc3986.txt

Forum                          Precision (A)  Specificity (A)  Recall (A)  Precision (MC)  Specificity (MC)  Recall (MC)
www.edaboard.com               1              1                1           1               1                 1
www.scam.com                   1              1                1           1               1                 1
ubuntuforums.org               1              1                1           1               1                 1
forums.moneysavingexpert.com   1              1                1           1               1                 1
www.stormfront.org/forum       1              1                .943        1               1                 1
www.cashfindforum.com/forum    1              1                .80         1               1                 .819
creditcardforum.com/forum      .936           .979             .831        .936            .979              .831
www.scamfound.com              .936           .973             .946        1               1                 .947
www.realscam.com               .892           .887             .967        1               .887              .962
forums.islamicawakening.com    1              1                .882        1               1                 .93

Table IV. Precision, specificity and recall performance (A: automated traversal path, MC: manually corrected traversal path).
