Generalized and Lightweight Algorithms for Automated Web Forum Content Extraction

Wee-Yong Lim1, Vyjayanthi Raja2 and Vrizlynn L. L. Thing1
Cybercrime and Security Intelligence Department, Institute for Infocomm Research
1 Fusionopolis Way, 138632, Singapore
Email: {weylim, vriz}1@i2r.a-star.edu.sg, [email protected]

Abstract—As online forums contain a vast amount of information that can aid in the early detection of fraud and extremist activities, accurate and efficient information extraction from forum sites is very important. In this paper, we discuss the limitations of existing work on extracting information from generic web sites and forum sites, and identify the need for better-suited, generalized and lightweight algorithms that perform more accurate and efficient information extraction while eliminating noisy data from forum sites. We then propose three generalized and lightweight algorithms for accurate thread and post content extraction from web forums. We evaluate our algorithms against two strict criteria, at the granularity of node-level correctness in the DOM tree. We consider a thread or post as successfully extracted by our algorithms only if (i) all the contents of its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly under its respective thread or post. Our experiments on ten different forum sites show that our proposed thread extraction algorithm achieves an average recall and precision of 100% and 98.66%, respectively, while our core post extraction algorithm achieves an average recall and precision of 99.74% and 99.79%, respectively.

Index Terms—Online forums, information retrieval, content extraction, web intelligence.

I. INTRODUCTION

The widespread use of the Internet and the contribution of knowledge in the form of uploaded data have made it a rich information source for almost any conceivable topic. One of the most important platforms on the Web is the online forum. The dynamically expanding content of web forums, contributed by Internet users on a daily basis, has led to an ever increasing richness of information. The widespread popularity of forums is due to the global, convenient, fast and freely open discussions they facilitate. As web forum data is an accumulation of a vast collection of continually updated human knowledge and viewpoints, it can be a highly valuable source of online information for knowledge acquisition to build up domain expertise [1], improve business intelligence [2], and detect the presence of extremist activities early [3], [4]. Regardless of the application, the fundamental step in forum data mining is to fetch and extract valuable data from the various forum sites distributed across the Internet in an efficient manner. Efficiency here means retrieving valuable data while avoiding noise such as irrelevant links and advertisements.

One of the earliest techniques for generic web crawling is the breadth-first approach [5]. The crawler starts with an initial queue of URLs and dequeues each URL as it downloads the corresponding page. Each downloaded page is then parsed to extract any outlinks, which are appended to the end of the queue for processing. The process continues until a preset maximum number of URLs is reached.
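For concreteness, the following is a minimal sketch of such a breadth-first crawling loop in Python. The function names, the naive regex-based link extraction and the URL limit are our own illustrative choices and not details prescribed by [5].

from collections import deque
from urllib.parse import urljoin, urlparse
import urllib.request
import re

def crawl_breadth_first(seed_urls, max_urls=1000):
    # Breadth-first crawl: dequeue a URL, download its page, enqueue unseen outlinks.
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_urls:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        # Naive outlink extraction, for illustration only; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages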

However, the main limitation of the breadth-first crawling approach is its storage space requirement. In [6], a topic-specific crawler is proposed. It builds on the observation, drawn by the authors of [5], that two neighbouring pages tend to be semantically related. The crawler employs a human-assisted approach to learn a particular topic and identify documents related to that topic, making [6] one of the earliest works to describe a focused and targeted approach to crawling.

Although generic web crawlers are effective for applications such as search engines and topic-based web information retrieval, they are not suitable for forum crawling. This is due to the unique structure of forum sites and the associated human discussion behaviour, which is evident in the dynamic nature of forum contents and their presentation. [7]–[11] describe novel approaches to forum crawling in which the crawler learns the forum structure dynamically through forum page sampling. The forum crawlers detect the different types of forum pages and derive the site traversal paths, and the relevant pages are then downloaded from the forum sites in the operational mode. The main problem with the downloaded pages is the presence of noise, as forum pages are commonly populated with irrelevant data and links. These links may point to external sites and advertisements, which are not needed for data analysis. An important subsequent step is therefore to retrieve valuable data from the downloaded pages by extracting the relevant forum tables and lists while avoiding the noise.

In this paper, we first discuss the challenges and shortcomings of existing approaches to extracting relevant information from forum pages. We then propose three lightweight algorithms that carry out automated information retrieval from forum sites in a generic and scalable manner. We evaluate the performance of our algorithms through experiments on ten forum sites, and show that they significantly reduce noise during the information extraction phase in a fully automatic manner.

The rest of the paper is organized as follows. In Section II, we discuss existing work in the field of both web site and forum site data extraction.

Fig. 1. Table Structure of Board/Thread List

Fig. 2. Table Headers and Column Headers

Fig. 3. Mismatched styling attributes to aid in identifying outliers

We present a brief overview of forum sites and our observations in Section III. Our proposed algorithms are presented and discussed in Section IV. Experimental results are presented and discussed in Section V. Conclusions follow in Section VI.

II. RELATED WORK

Several works attempt to extract data from web pages and forums. The first approach is wrapper based [12]–[15] and uses supervised machine learning to learn data extraction rules from positive and negative samples. The structural information of the sample web pages is used to classify similar data records according to their subtree structure. However, this approach is inflexible and non-scalable due to its dependency on site templates. Manual labeling of the sample pages is also extremely labor intensive and time consuming, and has to be repeated even for pages within the same site due to varying intra-site templates.
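As a simplified illustration of grouping data records by their subtree structure, the following Python sketch treats the sequence of start tags in a candidate record's HTML fragment as its structural signature and groups records with identical signatures. The cited wrapper-based systems learn considerably richer rules from labeled samples; the function names and example fragments here are ours.

from collections import defaultdict
from html.parser import HTMLParser

class TagSignature(HTMLParser):
    # Collects the sequence of start tags in an HTML fragment as a structural signature.
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def signature(fragment):
    parser = TagSignature()
    parser.feed(fragment)
    return tuple(parser.tags)

def group_by_structure(fragments):
    # Group candidate data records whose subtrees share the same tag sequence.
    groups = defaultdict(list)
    for fragment in fragments:
        groups[signature(fragment)].append(fragment)
    return groups

# Example: two post rows share a structure that differs from an advertisement block.
rows = [
    '<tr><td class="author">alice</td><td class="msg"><p>Hello</p></td></tr>',
    '<tr><td class="author">bob</td><td class="msg"><p>Hi there</p></td></tr>',
    '<tr><td colspan="2"><img src="ad.gif"></td></tr>',
]
for sig, members in group_by_structure(rows).items():
    print(sig, len(members))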

Another approach [16], [17] relies on the visual attributes of the web pages. The pages are rendered in a web browser to ascertain the data structure on different pages. Features such as the positioning of the information units, the cell sizes, the font characteristics and the font colors are analysed to infer the semantic meaning of the contents, based on the assumption that browser rendering produces a human-understandable output format. The data structure inference from visual attributes is reported to achieve a precision of 81% and a recall of 68%. In [18], the authors identify different data blocks based on differences in their visual styles, such as width, height, background color and font. The disadvantage of the visual attribute based approach is the need to render the web pages during both the learning and extraction phases,

TABLE I
DEFINITION OF TABLES AND THEIR CORRESPONDING ROWS AND COLUMNS