THE DISSERTATION ENTITLED FOCUSED CRAWLER SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF

MASTER OF TECHNOLOGY IN COMPUTER ENGINEERING SUBMITTED BY

GOHIL BHAVESH N. GUIDED BY

Dr. DHIREN R. PATEL

Department Of Computer Engineering SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY SURAT- 395007, GUJARAT, INDIA 2007-2008

DEDICATED TO Rev. Pandurang Shastri Athavale and My Family

STUDENT DECLARATION

I hereby declare that the work presented in this thesis, entitled “Focused Crawler”, by Gohil Bhavesh N. (Roll No: P06CO956) and submitted to the Computer Engineering Department at Sardar Vallabhbhai National Institute of Technology, Surat, is an authentic record of my own work carried out during the period of ___________________ to ___________________ under the supervision of Dr. Dhiren R. Patel. The matter presented in this thesis has not been submitted by me to any other University/Institute for any purpose.

Neither the source code therein, nor the content of the project report, has been copied or downloaded from any other source. I understand that my result/grade would be revoked if it is later found to be so.

Date: _____________________

Signature of the Student:

GOHIL BHAVESH N.

CERTIFICATE This is to certify that the dissertation report entitled “FOCUSED CRAWLER” submitted by Mr. Gohil Bhavesh N. in partial fulfillment of the requirement for the award of the degree of MASTER OF TECHNOLOGY in Computer Engineering, Computer Engineering Department of the SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY, SURAT is a record of his own work carried out for the coursework for the year 2007-08. The matter embodied in the report has not been submitted elsewhere for the award of any degree or diploma.

Dr. D. R. Patel Thesis Guide Computer Engineering Dept SVNIT, Surat.

PG In-Charge Computer Engineering Dept SVNIT, Surat.

Head Computer Engineering Dept SVNIT, Surat.

EXAMINERS’ CERTIFICATE OF APPROVAL
The dissertation report entitled “FOCUSED CRAWLER” submitted by Mr. Gohil Bhavesh N. in partial fulfillment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in Computer Engineering, Computer Engineering Department of the SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY, SURAT, is hereby approved for the award of the degree.

EXAMINERS:

ABSTRACT
Search engines of past years gathered web documents using their respective web crawlers. These documents were indexed appropriately to resolve queries fired at the search engines. This class of crawlers was termed “General Crawlers” or simply “Crawlers”. Contemporary domain specific search engines (web directories), on the other hand, have the ability to respond to queries that are very information-precise. The kind of crawler used by such search engines is rightly termed a “Focused Crawler”. A focused crawler examines its crawl boundary so that it limits its crawling only to the links that are most appropriate. By doing so, it elegantly avoids irrelevant neighbourhoods of the web. This approach not only leads to a significant saving in hardware and network resources, but also helps keep the crawled database up to date.

In this dissertation work we have built a focused crawling system which first learns and builds an index from the seed URLs (trusted URLs). For subsequent resources on the web, a component has been built that determines whether a resource is relevant and thereby also classifies the links embedded in it. This component adds focus to the crawling, and hence the term focused crawler.

Keywords: Search Engine, Crawler, Focused Crawler, Classifier (Cosine Similarity), Meta Tag

ACKNOWLEDGEMENT
It is a pleasure to thank all those who made this thesis possible.

I would like to thank my guide Dr. Dhiren R. Patel for giving me the opportunity to work with him. Thank you sir, for your valuable guidance, for the patience with which you taught me, for your support and encouragement.

I extend my thanks to Dr. M.A.Zaveri, Prof. D.C.Jinwala and Mr. R.P.Gohil for their valuable suggestions during my project work.

I would like to thank my friends Dharmesh Patel, Pradip Tiwari and Ramesh Solanki for helping me in my project.

I would also like to thank all my classmates for providing a stimulating and supporting environment for work. Thanks, to all those whom I have not mentioned here, but have helped me in some or the other way in completing this work.

Lastly but most importantly, I would like to thank my family for having faith and confidence in me, for encouraging and supporting me.

Gohil Bhavesh N. P06CO956 SVNIT, Surat.

TABLE OF CONTENTS

Abstract
Acknowledgement
Table of Contents
List of Figures
1. Preface
   1.1. Problems and Motivation
   1.2. Thesis Objective
   1.3. Organization of this Report
2. Background and Related Work
   2.1. Introduction
        2.1.1 Role of Focused Crawler in Domain Specific Search Engine
        2.1.2 Existing Architecture
        2.1.3 Web Analysis Algorithms
        2.1.4 Web Search Algorithms
   2.2. Application of Focused Crawler
   2.3. Literature Survey
   2.4. Vector Space Model (Term Vector Model)
   2.5. Weighting System
   2.6. Vector Similarity
   2.7. Meta Tag
3. Proposed Focused Crawling System
   3.1. System Architecture
   3.2. Classifier
        3.2.1 Cosine Similarity Measure
   3.3. Algorithm
   3.4. Features
4. Experimental Setup
   4.1. Required Platform
   4.2. Results
        4.2.1 Threshold Vs Acceptance
        4.2.2 Threshold Vs Precision
        4.2.3 Seed URLs Vs Acceptance
        4.2.4 Seed URLs Vs Precision
   4.3. Discussion
5. Conclusion and Future Work
   5.1. Conclusion
   5.2. Future Work
References and Bibliography
Appendix A: Code Design and Program Flow
Appendix B: Setting Up and Running the Focused Crawler

LIST OF FIGURES

Fig. 1.1 Difference between Standard Crawling and Focused Crawling
Fig. 2.1 Role of Focused Crawler in Domain Specific Search Engine
Fig. 2.2 Existing Focused Crawler Architecture
Fig. 3.1 Proposed Focused Crawler System Architecture
Fig. 4.1 Threshold Vs Accepted URLs
Fig. 4.2 Threshold Vs Precision
Fig. 4.3 No of Seed URLs Vs Accepted URLs
Fig. 4.4 No of Seed URLs Vs Precision

Chapter 1

1 Preface

A focused crawler is an automated mechanism to efficiently find pages relevant to a topic on the web. Focused crawlers were proposed to traverse and retrieve only the part of the web that is relevant to a particular topic, starting from a set of pages usually referred to as the seed set. This makes efficient use of network bandwidth and storage capacity.

In this dissertation work we have built a focused crawling system which first learns and builds an index from the seed URLs (trusted URLs). For subsequent resources on the web, a component has been built that determines whether a resource is relevant and thereby also classifies the links embedded in it. This component adds focus to the crawling, and hence the term focused crawler.

1.1 Problems and Motivation Being already the largest information source in the world, the World Wide Web (WWW) is still expanding rapidly. Nowadays, millions of people are seeking information on it, and search engines play a very important role during this process. A Web crawler is a key component inside a search engine. It can traverse (a portion of) the Web space by following Web pages’ hyperlinks and storing the downloaded Web documents in local repositories that will later be indexed and used to respond to the users’ queries efficiently. However, with the huge size and explosive growth of the Web, it becomes more and more difficult for search engines to provide effective services to end-users. Moreover, such a large collection often returns thousands of result documents in response to a single query. Browsing many documents to find the relevant ones is time-consuming and tedious.

The indexable Web has more than 11.5 billion pages. Even Google, the largest search engine, has only 76.16% coverage [1]. About 7 million new pages go online each day. It is impossible for major search engines to update their collections to meet such rapid growth. As a result, end-users often find the information provided by major search engines not comprehensive or out of date.

To address the above problems, domain-specific search engines were introduced, which keep their Web collections for one or several related domains. Focused crawlers were used by the domain-specific search engines to selectively retrieve Web pages relevant to particular domains to build special Web sub-collections, which have smaller size and provide search results with high precision. The general Web crawlers behind major search engines download any reachable Web pages in breadth-first order. This may cause heavy network traffic. On the other hand, focused crawlers follow Web links that point to the relevant and high-quality Web pages. Those Web links that point to non-relevant and low-quality Web pages will never be visited. The above-mentioned major differences between general Web crawlers and focused crawlers are summarized in Figure 1.1.

Figure 1.1: Difference between Standard Crawling and Focused Crawling ((a) standard crawling, (b) focused crawling; nodes are marked as relevant or non-relevant pages)

In Figure 1.1(a), a standard crawler follows each link, typically applying a breadth-first search strategy. If the crawler starts from a document which is i steps from a target document, all the documents that are up to i-1 steps from the starting document must be downloaded before the crawler hits the target. In Figure 1.1(b), a focused crawler tries to identify the most promising links and ignores off-topic documents. If the crawler starts from a document which is i steps from a target document, it downloads only a small subset of all the documents that are up to i-1 steps from the starting document. If the search strategy is optimal, the crawler takes only i steps to discover the target.

Most focused crawlers use simple adaptations of the vector space model to judge the relevance of Web pages, and local search algorithms such as best-first search to determine the order in which the target links are visited. Some problems with traditional focused crawlers have surfaced as more knowledge about Web structure has been revealed through Web structural studies. These problems can result in low-quality domain-specific collections, which would degrade the performance of domain-specific search engines.

Another motivation is that different user communities have different requirements, and generic search engines are built without any consideration for these special requirements. Due to space constraints, such generic search engine systems might not be able to store sufficient information about all possible domains. Compared to generic search engine systems, domain-specific search systems can be built with substantially less computing power and disk space. To optimally use the resources and to make informed decisions, some intelligence has to be built into such a system. Normal search engines are built by crawling resources on the web, following links on each page, and then constructing an index which can be used for searching. The only thing to ensure here is that cycles are avoided. Focused crawling involves the system learning from a set of seed websites and building an index out of it. For subsequent resources on the web, a component has to be constructed that determines whether the resource is relevant, thereby also endorsing the links embedded in the resource. This component adds focus to the crawling, and hence the term focused crawler.

1.2. Thesis Objective
In this dissertation work we have built a focused crawling system which first learns and builds an index from the seed URLs (trusted URLs). For subsequent resources on the web, a component has been built that determines whether a resource is relevant and thereby also classifies the links embedded in it. This component adds focus to the crawling, and hence the term focused crawler.

The proposed work uses HTML tag information (title tag, meta tags), the URL name and cosine similarity to classify a page. In other words, the proposed work combines content-based and link-based classification techniques, which results in better precision than focused crawlers based only on a standard classifier.

1.3. Organization of this Report The rest of the report is organized as follows:

Chapter 2 describes and elaborates the basic focused crawling concepts and algorithms used. It also surveys the work already done in this area.

Chapter 3 describes the proposed focused crawler system architecture, including the system design, algorithm and advantages.

Chapter 4 consists of experimental setup, hardware/software specification, and obtained results with the relevant discussion.

Chapter 5 concludes and indicates potential areas of future work.

Chapter 2

2 Background and Related Work

2.1 Introduction
Generally, popular portals or search engines like Yahoo and AltaVista are used for gathering information on the WWW. However, with the explosive growth of the Web, fetching information about a specific topic is becoming an increasingly difficult task. Moreover, the Web has over 11.5 billion pages and continues to grow rapidly at about a million pages per day [1]. Such growth and flux pose basic limits of scale for today's generic search engines. Thus, much relevant information may not have been gathered, and some information may not be up to date. Because of these problems, there is now much awareness that, for serious Web users, focused portals are more useful than generic portals.

A focused crawler is an automated mechanism to efficiently find pages relevant to a topic on the web. Focused crawlers were proposed to traverse and retrieve only the part of the web that is relevant to a particular topic, starting from a set of pages usually referred to as the seed set. This makes efficient use of network bandwidth and storage capacity. Focused crawling provides a viable mechanism for frequently updating search engine indexes. Focused crawlers have also been useful for other applications like distributed processing of the complete web, with each crawler assigned to a small part of the web. They have further been used to provide customized alerts, personalized/community search engines, web databases and business intelligence. One of the first focused Web crawlers is discussed in [2]. Experiences with a focused crawler implementation were described by Chakrabarti in [3] and are briefly explained in Section 2.1.2. Focused crawlers contain two types of algorithms to keep the crawling scope within the desired domain: (1) Web analysis algorithms judge the relevance and quality of the Web pages pointed to by target URLs; and (2) Web search algorithms determine the optimal order in which the target URLs are visited [4].

2.1.1 Role of Focused Crawler in Domain Specific Search Engine
Focused crawlers are mainly used in domain specific search engines (Web directories/digital libraries). Figure 2.1 depicts their role in such search engines. The crawler component together with the classifier becomes a focused crawler; otherwise it works as a general crawler, as used in general search engines.

Figure 2.1: Role of Focused Crawler in Domain Specific Search Engine (front end: the domain specific search engine; back end: the focused crawler, consisting of the crawler, parser, classifier and index barrel, with the crawler fetching pages from the WWW)

The following steps describe the flow of a domain specific search engine, where the crawler is termed a focused crawler:

• The crawler is assigned a set of seed URLs as input to be crawled.
• These URLs are then fetched from the WWW by the crawler.
• The crawler then parses each document using the parser and separates the text from the HTML tags.
• The content is handed to the index barrel to build an index based on the available words (a minimal sketch of this indexing step follows the list).
• The crawler further crawls the links extracted from the seed URLs if the classifier endorses them according to the technique it uses.
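To make the index barrel step concrete, the following is a minimal, hypothetical Java sketch of an inverted index built from parsed page text. The class and method names are assumptions for illustration only; the actual index barrel used in this work is not reproduced here.

import java.util.*;

// Minimal sketch of an "index barrel": builds an inverted index (word -> URLs of
// pages containing it) from cleaned page text. Illustrative only.
public class IndexBarrel {
    private final Map<String, Set<String>> invertedIndex = new HashMap<String, Set<String>>();

    // Add the cleaned text of one page (identified by its URL) to the index.
    public void addDocument(String url, String cleanedText) {
        for (String token : cleanedText.toLowerCase().split("\\W+")) {
            if (token.length() == 0) continue;
            Set<String> postings = invertedIndex.get(token);
            if (postings == null) {
                postings = new HashSet<String>();
                invertedIndex.put(token, postings);
            }
            postings.add(url);
        }
    }

    // Return the URLs of all indexed pages containing the given word.
    public Set<String> lookup(String word) {
        Set<String> postings = invertedIndex.get(word.toLowerCase());
        return postings == null ? Collections.<String>emptySet() : postings;
    }
}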

2.1.2 Existing Architecture
The system in [3] uses an existing document taxonomy (e.g. pages in the Yahoo tree) and seed documents to build a model for classification of retrieved pages into categories (corresponding to nodes in the taxonomy). The use of a taxonomy also helps in better modeling of the negative class: irrelevant pages are usually not drawn from a homogeneous class but could be classified into a large number of categories, each with different properties and features. In that work the same applies to the positive class, because the user is allowed to have an interest in several non-related topics at the same time. The system is built from 3 separate components: crawler, classifier and distiller.

Figure 2.2: Existing Focused Crawler Architecture

• Crawler: A crawler thread picks the best-scoring URL from the priority queue, fetches the web page pointed to by that URL and passes it on to the classifier.
• Classifier: Finds the relevance score of that page; if the page is relevant, all the out-links on that page are inserted into the priority queue with priority proportional to the relevance score of the web page.
• Distiller: Periodically runs through the crawled pages to find hubs and authorities, and prioritizes URLs that are found to be on hub pages.

2.1.3 Web Analysis Algorithms
In general, these kinds of algorithms can be categorized into two types: content-based Web analysis algorithms and link-based Web analysis algorithms.

Content-based analysis algorithms apply indexing techniques for text analysis and keyword extraction to help determine whether a page's content is relevant to a target domain. They can incorporate domain knowledge into their analysis to improve the results. For example, they can check the words on a Web page against a list of domain-specific terminology and assign a higher weight to pages that contain words from the list. Assigning a higher weight to words and phrases in the title or headings is also standard information-retrieval practice that the algorithm can apply based on the appropriate HTML tags. The URL address often contains useful information about a page. For example, http://eprints.svnit.ac.in tells us that the page comes from the svnit.ac.in domain and that it likely contains electronic prints (research publications).

Previous research has shown that the link structure of the Web offers important information for analyzing the relevance and quality of Web pages [5]. Intuitively, the author of a Web page A who places a link to Web page B believes that B is relevant to A. The term in-links refers to the hyperlinks pointing to a page. Usually, the larger the number of in-links, the higher a page will be rated. The rationale is similar to citation analysis, in which an often-cited article is considered better than one never cited. The assumption is that if two pages are linked to each other, they are likely to be on the same topic. One study [6] found that the likelihood of linked pages having similar textual content was high compared to random pairs of pages on the Web. Anchor text is the word or phrase that hyperlinks to a target page. Anchor text can provide a good source of information about a target page because it represents how people linking to the page actually describe it. Several studies have tried to use either the anchor text or the text near it to predict a target page's content. It is also reasonable to give a link from an authoritative source, such as Yahoo (www.yahoo.com), a higher weight than a link from a personal homepage. Researchers have developed several link-analysis algorithms over the past few years. The most popular link-based Web analysis algorithms include PageRank [7] and HITS [8].

2.1.4 Web Search Algorithms
Web search algorithms are used in focused crawlers to determine the optimal order in which URLs are visited. Even though many different search algorithms have been tested in focused crawling, the two most popular ones are breadth-first search and best-first search.

Breadth-first search is the simplest strategy for crawling. It does not utilize heuristics in deciding which URL to visit next. All URLs in the current level will be visited in the order they are discovered before URLs in the next level are visited. Although breadth-first search does not differentiate Web pages of different quality or different topics, it is well suited to build collections for general search engines.

In best-first search, URLs are not simply visited in the order they are discovered; instead, some heuristics (usually results from Web analysis algorithms) are used to rank the URLs in the crawling queue, and those considered more likely to point to relevant pages are visited first. Non-promising URLs are put at the back of the queue, where they rarely get a chance to be visited. Clearly, best-first search has advantages over breadth-first search because it probes only in directions where relevant pages are located and avoids visiting irrelevant pages. However, best-first search also has some problems. It has been pointed out in [6] that, using best-first search, crawlers could miss many relevant pages and produce low recall in the final collection, because best-first search is a local search algorithm, i.e., it can only traverse the search space by probing neighbors of previously visited nodes.
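The following is a small, hypothetical Java sketch of the data structure behind best-first crawling: a crawl frontier ordered by relevance score, so that the most promising URL is always fetched next. The class and method names are assumptions and do not come from any particular crawler described above.

import java.util.*;

// Illustrative best-first crawl frontier: URLs are ordered by a relevance score
// supplied by a Web analysis algorithm; the highest-scoring URL is fetched next.
public class BestFirstFrontier {
    private static class ScoredUrl {
        final String url;
        final double score;
        ScoredUrl(String url, double score) { this.url = url; this.score = score; }
    }

    // Highest score first.
    private final PriorityQueue<ScoredUrl> queue =
            new PriorityQueue<ScoredUrl>(11, new Comparator<ScoredUrl>() {
                public int compare(ScoredUrl a, ScoredUrl b) {
                    return Double.compare(b.score, a.score);
                }
            });
    private final Set<String> seen = new HashSet<String>();

    // Add a URL with its relevance score; already-seen URLs are ignored (cycle avoidance).
    public void add(String url, double relevanceScore) {
        if (seen.add(url)) {
            queue.add(new ScoredUrl(url, relevanceScore));
        }
    }

    // Return the most promising URL, or null when the frontier is empty.
    public String next() {
        ScoredUrl best = queue.poll();
        return best == null ? null : best.url;
    }
}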

Some other, more advanced search algorithms have also been introduced into the focused crawling domain. For example, a parallel search technique called the Spreading Activation Algorithm was proposed by Chau and Chen [9] to build domain-specific collections. The algorithm effectively combines content-based and link-based Web analysis algorithms, which avoids many shortcomings of using either one alone; but as the spreading activation algorithm is still a local search algorithm, it shares the limitations of other local search algorithms.

2.2 Application of Focused Crawler
• A focused crawler acquires relevant pages steadily while a standard crawler quickly loses its way, even if they start from the same root set.
• A focused crawler can discover valuable resources that are dozens of links away from the start set. It is a very effective solution for building high-quality collections of Web pages on specific topics.
• Using focused crawlers, one can order sites according to the density of relevant pages found there. For example, one can find the top five sites specializing in cars.
• Focused crawlers also identify regions of the Web that grow or change dramatically, as against those that are relatively stable.

2.3 Literature Survey

2.3.1 Fish Search
Some early work on focused collection of data from the Web was done in [2] in the context of client-based search engines. Web crawling was simulated by a "group of fish" migrating on the web. In the so-called "fish search", each URL corresponds to a fish whose survivability depends on the relevance of the visited page and the remote server speed. Page relevance is estimated using a binary classification (a page can only be relevant or irrelevant) by means of a simple keyword or regular expression match. Fish die off only after traversing a specified number of irrelevant pages; that way, information that is not directly available within one 'hop' can still be found. On every document the fish produce offspring, the number being dependent on page relevance and the number of extracted links. The school of fish consequently 'migrates' in the general direction of relevant pages, which are then presented as results. The starting point is specified by the user by providing 'seed' pages that are used to gather initial URLs. URLs are added to the beginning of the crawl list, which makes this a sort of depth-first search.

2.3.2 Shark Search
[10] extends the fish-search algorithm into "shark-search". URLs of pages to be downloaded are prioritized by taking into account a linear combination of the source page relevance, the anchor text and neighborhood (of a predefined size) of the link on the source page, and an inherited relevance score. The inherited relevance score is the parent page's relevance score multiplied by a specified decay factor. Unlike in [2], page relevance is calculated as the similarity between the document and the query in the vector space model and can be any real number between 0 and 1. Anchor text and anchor context scores are also calculated as similarity to the query.

In the baseline focused crawler [3], the classifier is used to determine page relevance (according to the taxonomy), which also determines future link expansion. Two different rules for link expansion are presented. The hard focus rule allows expansion of links only if the class to which the source page most probably belongs is in the 'interesting' subset. The soft focus rule uses the sum of probabilities that the page belongs to one of the relevant classes to decide the visit priority for children; no page is eliminated a priori. Periodically, the distiller subsystem identifies hub pages (using a modified hubs & authorities algorithm). Top hubs are then marked for revisiting.

2.3.3 Accelerated Focused Crawling
An improved version was proposed in [11], which extends the previous baseline focused crawler to prioritize the URLs within a document. The relevance of a crawled page is obtained using the document taxonomy, as explained above, with the 'baseline' classifier. The URLs within the document are given priority based on the local neighborhood of the URL in the document. An apprentice classifier learns to prioritize the links in the document. Once a sufficient number of source pages and the target pages pointed to by URLs in the source pages are downloaded and labeled as relevant/irrelevant, the apprentice is trained. A representation for each URL is constructed based on target page relevance, source page relevance, Document Object Model (DOM) structure, co-citation and other local source page information. The apprentice is trained online to predict the relevance of the target page pointed to by a URL in the source page. The relevance so calculated is used to prioritize the URLs. The apprentice is periodically retrained to improve performance. Both these methods depend on a quality document taxonomy for good performance. The dependence on the document taxonomy makes them inapplicable to applications where the topic is too specific.

2.3.4 Intelligent Crawling
Intelligent crawling was proposed in [12], which allows users to specify arbitrary predicates. It suggests the use of arbitrary implementable predicates that use four sets of document statistics in the classifier to calculate the relevance of a document: source page content, URL tokens, linkage locality and sibling locality. The source page content allows prioritizing different URLs differently. URL tokens help in getting approximate semantics about the page. Linkage locality is based on the assumption that web pages on a given topic are more likely to link to pages on the same topic. Sibling locality is based on the assumption that if a web page points to pages of a given topic, then it is more likely to point to other pages on the same topic.

2.3.5 Ontology Based Algorithm
An ontology-based algorithm for page relevance computation is considered in [13]. After preprocessing, entities (words occurring in the ontology) are extracted from the page and counted. The relevance of the page with regard to user-selected entities of interest is then computed using several measures on the ontology graph (e.g. direct match, taxonomic and more complex relationships). The harvest rate is improved compared to the baseline focused crawler (which decides on page relevance by a simple binary keyword match) but is not compared to other types of focused crawlers.

2.3.6 Focused Crawling Using a Combination of Link Structure and Content Similarity
This article [14] introduces a crawler that uses a combination of link structure and content to do focused crawling. To implement it, one needs to maintain the link structure of pages and also introduce a metric for measuring the similarity of a page to a domain.

2.3.7 Using Context Graphs
In [15], relevant pages can be found by knowing what kinds of off-topic pages link to them. For each seed document, a graph several layers deep is constructed that consists of pages pointing to that seed page. Because this information is not directly available from the web, a search engine is used to provide backward links. The graphs for all seed pages are then merged together and a classifier is trained to recognize a specific layer. Those predictions are then used to assign a priority to each page.

2.4 Vector Space Model (Term Vector Model)
The vector space model is an algebraic model for representing text documents (and objects in general) as vectors of identifiers such as index terms.

A document is represented as a vector. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting.

2.5 Weighting System
TF.IDF is one way to combine a word's term frequency tf and document frequency df into a single weight:

\[
\mathrm{weight}(i,j) =
\begin{cases}
\left(1 + \log tf_{i,j}\right)\,\log\dfrac{N}{df_i} & \text{if } tf_{i,j} \geq 1 \\[4pt]
0 & \text{if } tf_{i,j} = 0
\end{cases}
\]

where tf_{i,j} is the term frequency, i.e. the number of occurrences of word w_i in document d_j; df_i is the document frequency, i.e. the number of documents in the collection in which w_i occurs; and N is the total number of documents.
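As a concrete illustration of this weighting scheme, the following is a minimal Java sketch of the formula above; the class and parameter names are assumptions for illustration, not the code used in this work.

// Minimal sketch of the TF.IDF weighting described above.
public final class TfIdf {
    /**
     * @param tf number of occurrences of word w_i in document d_j
     * @param df number of documents in the collection containing w_i
     * @param n  total number of documents in the collection
     * @return   weight(i, j) = (1 + log(tf)) * log(n / df), or 0 if tf == 0
     */
    public static double weight(int tf, int df, int n) {
        if (tf == 0) {
            return 0.0;
        }
        return (1.0 + Math.log(tf)) * Math.log((double) n / df);
    }
}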

2.6 Vector Similarity
To do retrieval in the vector space model, documents are ranked according to their similarity with the query, as measured by the cosine measure or normalized correlation coefficient. The vector similarity of a query Q and a document D is defined as:

\[
\mathrm{Sim}(Q, D) = \frac{\sum_{i=1}^{n} W_{qi}\, W_{di}}{\sqrt{\sum_{i=1}^{n} W_{qi}^{2}}\;\sqrt{\sum_{i=1}^{n} W_{di}^{2}}}
\]

where W_{qi} is the weight of word w_i in query Q and W_{di} is the weight of word w_i in document D.

2.7 Meta Tag
Meta tags are HTML codes that are inserted into the header of a web page, after the title tag. In the context of search engines, when people refer to meta tags, they are usually referring to the meta description tag and the meta keywords tag. These tags are not seen by users; instead, their main purpose is to provide meta document data (data about the document contents) to user agents such as search engines. They provide information about a given webpage, most often to help search engines categorize it correctly.
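As an illustration of how a crawler's parser might read such tags, the following is a small, hypothetical Java sketch that extracts the meta description with a regular expression. It assumes well-formed markup with the name attribute placed before the content attribute and is not the parser used in this work.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplistic meta description extraction; real HTML parsing is more robust.
public final class MetaTagExtractor {
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\s+name=[\"']description[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the content of the meta description tag, or null if absent.
    public static String description(String html) {
        Matcher m = DESCRIPTION.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}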

Chapter 3

3 Proposed Focused Crawling System

In this dissertation work we have built a focused crawling system which first learns and builds an index from the seed URLs (trusted URLs). For subsequent resources on the web, a component has been built that determines whether a resource is relevant and thereby also classifies the links embedded in it. This component adds focus to the crawling, and hence the term focused crawler.

The proposed work uses HTML tag information (meta tags, title tag), the URL name and cosine similarity to classify a page. In other words, the proposed work combines content-based and link-based classification techniques, which results in better precision than focused crawlers based only on a standard classifier.

3.1 System Architecture

Figure 3.1: Proposed Focused Crawler System Architecture (components: a search interface taking the seed URLs, threshold and depth; the crawler; the parser; the classifier; the index; and document storage)

• The user provides a set of URLs, called seed URLs, for the system to learn from. The interface also provides an option to bound the depth to which embedded URLs are followed.
• The crawler retrieves resources from the seed URLs and passes them on to the parser, which cleans the text and then requests the classifier to build a knowledge base, namely the index.
• The classifier uses cosine similarity to approve or reject a document. It operates on a threshold value between 0 and 1, which has to be specified along with the seed URLs. This value is a trade-off between precision and recall: precision corresponds to the correctness or relevance of the resources retrieved, while recall refers to the number of such resources retrieved. The experiments used to determine a suitable threshold for our domain of interest are described in Chapter 4.
• The cleaned-up contents of the URL are also made persistent on disk.
• After learning, the crawler follows the hyperlinks from these URLs as long as the allowed depth is non-zero.
• If the depth is zero, crawling is stopped. At this point, the classifier compares the retrieved document against the built index for similarity, decides whether to accept or reject it, and communicates the decision back to the parser.
• The parser communicates the decision to the crawler through a list of URLs that the crawler needs to crawl. If the classifier does not endorse the resource, the parser returns an empty list to the crawler.

3.2 Classifier
We use the cosine similarity measure and HTML tags (meta tags and the title tag) to classify (accept/reject) documents. In this way, both the content and the important HTML tags are checked when classifying documents.

3.2.1 Cosine similarity measure The Cosine measure is the most popular measure for evaluating document similarity based on the Vector Space Model (VSM). The VSM creates a space in which documents are represented by vectors. For a fixed collection of documents, a feature vector is generated for each document from sets of terms with associated weights. Then, a vector similarity function is used to compute the similarity between the vectors.

In the VSM, the weight w_{d,t} associated with term t in a document d is calculated as tf_{d,t} × idf_t, where tf_{d,t} and idf_t are defined as follows:

• Term frequency tf_{d,t}: the number of occurrences of term t in document d.
• Inverse document frequency idf_t = log(N / n_t), where N is the total number of documents in the collection and n_t is the number of documents containing term t.

The similarity between two documents a and b is defined as the normalized inner product of the two corresponding vectors:

\[
\mathrm{Sim}_{\cos}(a, b) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \times \|\vec{b}\|}
  = \frac{\sum_{t \in a \cap b} w_{a,t}\, w_{b,t}}{\sqrt{\sum_{t \in a} w_{a,t}^{2}} \times \sqrt{\sum_{t \in b} w_{b,t}^{2}}}
\]

where a ∩ b is the set of words common to a and b, t ∈ a (or b) means t is a unique term in document a (or b), and w_{a,t} and w_{b,t} are the weights of term t in documents a and b, respectively.
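The following is a minimal Java sketch of this cosine similarity measure, with documents represented as maps from term to weight (e.g. tf.idf weights). It is an illustration only and not the actual indexer.java implementation used in this work.

import java.util.Map;

// Minimal sketch of the cosine similarity measure used by the classifier.
public final class CosineSimilarity {

    // Returns a value in [0, 1]; higher means the two documents are more similar.
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double wb = b.get(e.getKey());      // only terms common to a and b contribute
            if (wb != null) {
                dot += e.getValue() * wb;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;                         // empty document: no similarity
        }
        return dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}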

3.3 Algorithm
We try to improve the preciseness of the topic. With the algorithms used in current focused crawlers, it is hard to build crawlers for different but similar topics. That means if you want to find topics about "finance" or "home finance", the web pages fetched by a current focused crawler may be almost the same, because for systems based only on a classifier technique it is difficult to judge which page belongs to the category "finance" and which one belongs to the category "home finance". And for systems based on the traditional vector space model, the seed URLs are very important, and it is very hard to find one or two seed URLs that represent the topic. Our algorithm combines the TF.IDF weighting scheme [16] and the vector space model [17], so that we can improve the preciseness of judging whether a web page is relevant to a topic or not.

Following is our proposed algorithm (a code sketch of Step 2 follows it):

• Step 1: Get a URL set to represent the topic
  o Use a popular meta search engine or some good search engines to get the top N URLs as seed URLs and fetch the seed web pages.
  o Combine all seed web pages into one document named D.
  o Also provide the Threshold and Depth (finishing condition: crawling stops when the depth reaches zero) as input.
• Step 2: Fetch web pages
  o Push the seed web pages onto a web page queue.
  o While the web page queue is not empty and the finishing condition is not satisfied:
    - Pop a web page from the web page queue.
    - Extract the out-links from the web page.
    - Fetch the web pages of the out-links.
    - For each fetched web page P, compute the cosine similarity of D and P: sim(D, P).
      If sim(D, P) ≥ threshold θ and the HTML tags match:
      • Push web page P onto the web page queue.
      • Index the web page (update the existing index).
  o End While
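A minimal Java sketch of the Step 2 loop is given below. The Fetcher, Parser and Classifier interfaces are hypothetical stand-ins for the crawler, parser and indexer components described in this report; the sketch illustrates the control flow only and is not the actual fcrawl.pl/indexer.java implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of Step 2 of the proposed algorithm using assumed helper interfaces.
public class FocusedCrawlLoop {
    interface Fetcher    { String fetch(String url); }                        // raw HTML, or null on failure
    interface Parser     { String clean(String html); List<String> outLinks(String html); }
    interface Classifier { double similarityToTopic(String cleanedText);      // sim(D, P)
                           boolean htmlTagsMatch(String html);
                           void index(String url, String cleanedText); }

    public void crawl(List<String> seedUrls, double threshold, int depth,
                      Fetcher fetcher, Parser parser, Classifier classifier) {
        Deque<String> queue = new ArrayDeque<String>(seedUrls);
        Set<String> visited = new HashSet<String>(seedUrls);

        while (!queue.isEmpty() && depth > 0) {        // finishing condition: depth reaches zero
            int levelSize = queue.size();
            for (int i = 0; i < levelSize; i++) {
                String url = queue.poll();
                String html = fetcher.fetch(url);
                if (html == null) continue;
                for (String link : parser.outLinks(html)) {
                    if (!visited.add(link)) continue;  // avoid cycles
                    String page = fetcher.fetch(link);
                    if (page == null) continue;
                    String text = parser.clean(page);
                    // accept only pages similar enough to the topic document D
                    if (classifier.similarityToTopic(text) >= threshold
                            && classifier.htmlTagsMatch(page)) {
                        classifier.index(link, text);
                        queue.add(link);
                    }
                }
            }
            depth--;                                   // one level of out-links processed
        }
    }
}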

3.4 Features
• It follows the robots exclusion protocol/standard: the robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. (A simplified check is sketched after this list.)
• Preciseness: as we use HTML tags and the URL name to extend the classification that endorses a document, the system performs well in terms of precision.
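The following is a deliberately simplified, hypothetical Java sketch of a robots.txt check. It only honours Disallow rules in the "User-agent: *" section and ignores Allow rules, wildcards and crawl delays, so it is an illustration of the idea rather than a complete or production implementation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt check: disallow paths listed under "User-agent: *".
public class RobotsCheck {
    public static boolean isAllowed(String site, String path) throws Exception {
        List<String> disallowed = new ArrayList<String>();
        boolean inStarSection = false;
        URL robots = new URL(site + "/robots.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                inStarSection = line.substring(11).trim().equals("*");
            } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                String rule = line.substring(9).trim();
                if (rule.length() > 0) disallowed.add(rule);
            }
        }
        in.close();
        for (String rule : disallowed) {
            if (path.startsWith(rule)) return false;   // path falls under a Disallow rule
        }
        return true;
    }
}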

Chapter 4

4 Experimental Setup

4.1 Required Platform
Any Linux distribution with Perl and Java language support and the Mozilla Firefox web browser can be used as a platform to run this system.

4.2 Results
The following performance metrics were compared after conducting focused crawling on 3 sets of seed URLs, keeping the crawling depth equal to 2.

URL set 1: Regarding 'Gujarat'
http://en.wikipedia.org/wiki/Gujarat
http://www.gujaratindia.com
http://www.webindia123.com/gujarat/index.htm
http://www.answers.com/topic/gujarat
http://india.gov.in/knowindia/st_gujurat.php

URL set 2: Regarding 'Education India'
http://en.wikipedia.org/wiki/Education_in_India
http://www.educationindiainfo.com
http://www.indiaedu.com/index.html
http://www.angelfire.com/indie/educationinfoindia
http://www.highereducationinindia.com

URL set 3: Regarding 'Culture India'
http://en.wikipedia.org/wiki/Culture_of_India
http://www.indianmirror.com/culture/cul1.html
http://www.indianchild.com/culture%20_1.htm
http://www.cultureofindia.net
http://www.sscnet.ucla.edu/southasia/Culture/culture.html

4.2.1 Threshold Vs Acceptance
A set of seed URLs was fixed (the first 3 of each set here), and the effect of the threshold value on the number of resources deemed fit by the classifier was studied. The experimental results are plotted in the graph below.

Figure 4.1: Threshold (Theta) Vs No of Accepted URLs (for URL sets 1, 2 and 3)

4.2.2 Threshold Vs Precision

Figure 4.2: Threshold (Theta) Vs Precision (%) (for URL sets 1, 2 and 3)

4.2.3 No. of Seed URLs Vs Acceptance
In this experiment, how the acceptance of the system is affected by the number of seed URLs is studied. Since the maximum number of resources was accepted (with precision) at a threshold of 0.1, this value was carried into this experiment.

The minimum number of seed URLs is 2 because, as per the vector space model, the term weight is the product of the term's inverse document frequency and its term frequency in the document, and with fewer than 2 documents the inverse document frequency, and hence the weight, becomes 0.

Figure 4.3: No of Seed URLs Vs No of Accepted URLs (for URL sets 1, 2 and 3)

4.2.4 No. of Seed URLs Vs Precision

Figure 4.4: No of Seed URLs Vs Precision (%) (for URL sets 1, 2 and 3)

4.3 Discussion
The above experimental results show that the proposed focused crawler gives the best acceptance and precision at a threshold of 0.1, with the number of seed URLs varying between 2, 3 and 4 and the depth equal to 2. Increasing the threshold decreases precision, while increasing the number of seed URLs may increase or decrease precision depending on the seed URLs themselves. Depth also affects the result, but only in terms of recall, so it has not been considered in the comparison.

Chapter 5

5 Conclusion and Future Work

5.1 Conclusion
Applying important HTML tag matching on top of cosine similarity improves precision compared to a system that uses only cosine similarity as the classification method. We have covered three main HTML tag elements, i.e. meta keywords, meta description and the title tag, to extend the matching (similarity) of documents against the existing index.

5.2 Future Work
We identified the following as possible enhancements to the implemented system:
1. A search engine on top of the accumulated index.
2. Using a different model to compute similarity in the classifier and comparing its performance against the vector space model used here.
3. Personalizing the system for users, granting them access to the multiple indices and documents constructed in response to various requests, and the ability to search through them. Group management might also be provided.
4. A systematic noise remover that would remove words of no value from the built index. This would result in a better quality index.
5. The system can be enhanced to provide more precise output by implementing a good stemming algorithm.
6. It can be enhanced by using anchor text and its surrounding text to classify documents more precisely.

References and Bibliography 1. “The indexable web is more than 11.5 billion pages”, Gulli, A. and Signorini, A., in Special interest tracks and posters of the 14th international conference on World Wide Web. p. 902 – 903, Chiba, Japan, 2005.

2. “Information Retrieval in Distributed Hypertexts”, Bra, P.D., Houben, G., Kornatzky, Y., and Post, R., in Proceedings of the 4th RIAO Conference. p. 481-491, New York, 1994.

3. “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery”, Chakrabarti, S., Berg, M.V.D., and Dom, B. in Proceedings of the 8th International WWW Conference. p. 545-562, Toronto, Canada, 1999.

4. “Using Genetic Algorithm in Building Domain-Specific Collections: An Experiment in the Nanotechnology Domain”, Qin, J. and Chen, H., in Proceedings of the 38th Annual Hawaii International Conference on System Sciences - HICSS 05. p. 102.2, Hawaii, USA, 2005.

5. “Inferring Web Communities from Link Topology”, Gibson D., Kleinberg J. and Raghavan P., in Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, Pennsylvania, USA, 1998.

6. “Focused Crawls, Tunneling, and Digital Libraries”, Bergmark D., Lagoze C. and Sbityakov A., in Proceedings of the 6th European Conference on Digital Libraries, Rome, Italy, 2002.

7. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Brin S. and Page L., Computer Networks and ISDN Systems, 30(1-7), 1998.

8. “Authoritative Sources in a Hyperlinked Environment”, Kleinberg J. M., in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, USA, 1998.

9. “Comparison of Three Vertical Search Spiders”, Chau, M. and Chen, H., IEEE Computer, 36(5): p. 56-62, 2003.

10. "The Shark-Search Algorithm – An Application: Tailored Web Site Mapping", M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalhaim and S. Ur, in Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.

11. "Accelerated focused crawling through online relevance feedback", S. Chakrabarti, K. Punera, and M. Subramanyam, Proc. of 11th Intl. Conf. on World Wide Web, pages 148–159, 2002.

12. "Intelligent crawling on the World Wide Web with arbitrary predicates", C.C. Aggarwal, F. Al-Garawi, and P.S. Yu, Proc. of 10th Intl. Conf. on WWW, 2001.

13. “Ontology-focused Crawling of Web Documents”, M. Ehrig, A. Maedche. In Proceedings of the 2003 ACM symposium on applied computing.

14. “A Method for Focused Crawling Using Combination of Link Structure and Content”, Mohsen Jamali, Hassan Sayyadi, Babak Bagheri Hariri and Hassan Abolhassani.

15. "Focused crawling using context graphs", M. Diligenti, F. Coetzee, S. Lawrence, C.L. Giles, and M. Gori, Proc. of 26th Intl. Conf. on VLDB, 2000.

16. "Foundations of Statistical Natural Language Processing", Christopher D. Manning and Hinrich Schutze.

17. "Introduction to Modern Information Retrieval", G. Salton and M. McGill, McGraw-Hill, 1983.

Appendix A: Code Design and Program Flow
Our proposed focused crawling system consists of the following files:

1. crawler.conf (input file)
2. fcrawl.pl
3. indexer.java
4. accepted.txt (output file)
5. acc_keywords.txt (output file)
6. rejected.txt (output file)

crawler.conf (input file): This configuration file contains initialization parameters, including the seed URL set, depth, threshold, crawl directory and keywords, used by the fcrawl.pl script. (A hypothetical example follows.)
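The exact syntax of crawler.conf is not reproduced in this report; the following is a hypothetical illustration of the kind of settings it holds (the parameter names, the key-value format and the crawl directory path are assumptions, while the seed URLs are taken from URL set 1 above):

# Hypothetical crawler.conf illustration (actual syntax may differ)
seed_urls = http://en.wikipedia.org/wiki/Gujarat, http://www.gujaratindia.com
depth     = 2
threshold = 0.1
crawl_dir = /home/user/crawl
keywords  = gujarat, india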

fcrawl.pl: Our system starts by executing this Perl script. It consists of the following subroutines and flow.

Based on the initial parameter settings it calls a subroutine named crawl, which is responsible for starting the crawl and calls a subroutine named train to build an initial index from the seed URLs. It also extracts the links available on the seed URLs and crawls in depth by calling the depthcrawl subroutine, which is recursive.

Both the train and depthcrawl subroutines further call a subroutine named parse to parse the documents and remove HTML tags, ads, scripts, etc., and then call indexer.java to classify the parsed document.

indexer.java: This file contains the cosine similarity classifier code, which classifies the documents passed by the train and depthcrawl subroutines.

accepted.txt: contains the list of URLs accepted by the cosine similarity classifier.

acc_keywords.txt: contains the list of URLs accepted by both the classifier and the HTML tag check.

rejected.txt: contains the list of URLs rejected by the classifier.

Appendix B: Setting Up and Running the Focused Crawler
First of all, we need to initialize parameters such as the seed URLs, threshold, depth and crawl directory in the crawler.conf file.

After that, we can start focused crawling by running the fcrawl.pl script at the command prompt with ./fcrawl.pl or perl fcrawl.pl.

To see the output, check the files accepted.txt, acc_keywords.txt and rejected.txt for the list of URLs accepted by the classifier only, the list of URLs accepted by both the classifier and the HTML tag check, and the list of URLs rejected by the classifier, respectively.
