Generalized and Lightweight Algorithms for Automated Web Forum Content Extraction

Wee-Yong Lim1, Vyjayanthi Raja2 and Vrizlynn L. L. Thing1
Cybercrime and Security Intelligence Department, Institute for Infocomm Research
1 Fusionopolis Way, 138632, Singapore
Email: {weylim, vriz}1@i2r.a-star.edu.sg, [email protected]

Abstract—As online forums contain a vast amount of information that can aid in the early detection of fraud and extremist activities, accurate and efficient information extraction from forum sites is very important. In this paper, we discuss the limitations of existing work on extracting information from generic web sites and forum sites, and identify the need for better-suited, generalized and lightweight algorithms that perform more accurate and efficient information extraction while eliminating noisy data from forum sites. We then propose three generalized and lightweight algorithms for accurate thread and post content extraction from web forums. We evaluate our algorithms against two strict criteria, at the granularity of node-level correctness in the DOM tree. We consider a thread or post as successfully extracted by our algorithms only if (i) all the contents of its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly under its respective thread or post. Our experiments on ten different forum sites show that our proposed thread extraction algorithm achieves an average recall and precision of 100% and 98.66%, respectively, while our core post extraction algorithm achieves an average recall and precision of 99.74% and 99.79%, respectively.

Index Terms—Online forums, information retrieval, content extraction, web intelligence.

I. INTRODUCTION

The widespread use of the Internet and the contribution of knowledge in the form of uploaded data have made it a rich information source for almost any conceivable topic. One of the most important platforms on the Web is the online forum. The dynamically expanding content of web forums, contributed by Internet users on a daily basis, has led to an ever increasing richness of information. The widespread popularity of forums is due to the global, convenient, fast and freely open discussions they facilitate. As web forum data is an accumulation of a vast collection of continually updated human knowledge and viewpoints, it can be a highly valuable source of online information for knowledge acquisition to build up domain expertise [1], improve business intelligence [2], and detect the presence of extremist activities early [3], [4]. Regardless of the application, the fundamental step in forum data mining is to fetch and extract valuable data from the various forum sites distributed across the Internet in an efficient manner. Efficiency here means retrieving valuable data while avoiding noise such as irrelevant links and advertisements.

One of the earliest techniques for generic web crawling is the breadth-first approach [5]. The crawler starts with an initial queue of URLs and dequeues each URL as it downloads the corresponding page. Each downloaded page is then parsed to extract any outlinks, which are appended to the end of the queue for processing. The process continues until a preset maximum number of URLs is reached.
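For concreteness, the following is a minimal sketch of such a breadth-first crawling loop in Python. The function names, the naive regex-based link extraction and the URL limit are our own illustrative choices and not details prescribed by [5].

from collections import deque
from urllib.parse import urljoin, urlparse
import urllib.request
import re

def crawl_breadth_first(seed_urls, max_urls=1000):
    # Breadth-first crawl: dequeue a URL, download its page, enqueue unseen outlinks.
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_urls:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        # Naive outlink extraction, for illustration only; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages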

However, the main limitation of the breadth-first crawling approach is its storage space requirement. In [6], a topic-specific crawler is proposed. It builds on the observation, drawn by the authors of [5], that two neighbouring pages tend to be semantically related. The crawler employs a human-assisted approach to learn a particular topic and identify documents related to that topic, making [6] one of the earliest works to describe a focused and targeted approach to crawling.

Although generic web crawlers are effective for applications such as search engines and topic-based web information retrieval, they are not suitable for forum crawling. This is due to the unique structure of forum sites and the associated human discussion behaviour, which is evident in the dynamic nature of forum contents and their presentation. [7]–[11] describe novel approaches to forum crawling in which the crawler learns the forum structure dynamically through forum page sampling. The forum crawlers detect the different types of forum pages and derive the site traversal paths, and the relevant pages are then downloaded from the forum sites in the operational mode. The main problem with the downloaded pages is the presence of noise, as forum pages are commonly populated with irrelevant data and links. These links may point to external sites and advertisements, which are not needed for data analysis. An important subsequent step is therefore to retrieve valuable data from the downloaded pages by extracting the relevant forum tables and lists while avoiding the noise.

In this paper, we first discuss the challenges and shortcomings of existing approaches to extracting relevant information from forum pages. We then propose three lightweight algorithms that carry out automated information retrieval from forum sites in a generic and scalable manner. We evaluate the performance of our algorithms through experiments on ten forum sites, and show that they significantly reduce noise during the information extraction phase in a fully automatic manner.

The rest of the paper is organized as follows. In Section II, we discuss existing work in the field of both web site and forum site data extraction.

Fig. 1. Table Structure of Board/Thread List

Fig. 2. Table Headers and Column Headers

Fig. 3. Mismatched styling attributes to aid in identifying outliers

We present a brief overview of forum sites and our observations in Section III. Our proposed algorithms are presented and discussed in Section IV. Experimental results are presented and discussed in Section V. Conclusions follow in Section VI.

II. RELATED WORK

Several works attempt to extract data from web pages and forums. The first approach is wrapper based [12]–[15] and uses supervised machine learning to learn data extraction rules from positive and negative samples. The structural information of the sample web pages is used to classify similar data records according to their subtree structure. However, this approach is inflexible and non-scalable due to its dependency on site templates. Manual labeling of the sample pages is also extremely labor intensive and time consuming, and has to be repeated even for pages within the same site due to varying intra-site templates.
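As a simplified illustration of grouping data records by their subtree structure, the following Python sketch treats the sequence of start tags in a candidate record's HTML fragment as its structural signature and groups records with identical signatures. The cited wrapper-based systems learn considerably richer rules from labeled samples; the function names and example fragments here are ours.

from collections import defaultdict
from html.parser import HTMLParser

class TagSignature(HTMLParser):
    # Collects the sequence of start tags in an HTML fragment as a structural signature.
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def signature(fragment):
    parser = TagSignature()
    parser.feed(fragment)
    return tuple(parser.tags)

def group_by_structure(fragments):
    # Group candidate data records whose subtrees share the same tag sequence.
    groups = defaultdict(list)
    for fragment in fragments:
        groups[signature(fragment)].append(fragment)
    return groups

# Example: two post rows share a structure that differs from an advertisement block.
rows = [
    '<tr><td class="author">alice</td><td class="msg"><p>Hello</p></td></tr>',
    '<tr><td class="author">bob</td><td class="msg"><p>Hi there</p></td></tr>',
    '<tr><td colspan="2"><img src="ad.gif"></td></tr>',
]
for sig, members in group_by_structure(rows).items():
    print(sig, len(members))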

Another approach [16], [17] relies on the visual attributes of the web pages. The pages are rendered in a web browser to ascertain the data structure on different pages. Features such as the positioning of the information units, the cell sizes, the font characteristics and the font colors are analysed to infer the semantic meaning of the contents, based on the assumption that browser rendering produces a human-understandable output format. The data structure inference from visual attributes is reported to achieve a precision of 81% and a recall of 68%. In [18], the authors identify different data blocks based on differences in their visual styles, such as width, height, background color and font. The disadvantage of the visual attribute based approach is the need to render the web pages during both the learning and extraction phases,

TABLE I
DEFINITION OF TABLES AND THEIR CORRESPONDING ROWS AND COLUMNS