POPSS (Peer tO Peer based Semantic Search) Submitted as an Entry for
Team Members (in alphabetical order)
• Aditya Raj †
• Alok G. Kumbhare †
• Krishna Kishore †
• Nishant Tandon †
• Sameer Agarwal †
† Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, India
CONTENTS
1 Introduction
1.1 Theme
1.2 Objective
1.3 Background and Overview
1.4 Key Features
2 System Design
2.1 Architecture
2.2 Interactions
2.3 How a Query is searched?
3 Viability Issues
3.1 Monetization
3.2 Security implications
3.3 Privacy Issues
3.4 Limitations
3.5 Extensibility
4 References
1 Introduction
1.1 Theme
To organize the world’s information and to make it universally accessible and useful.
1.2 Objective
POPSS (Peer tO Peer based Semantic Search) is a peer-to-peer search engine that takes advantage of the powerful ideas of the Semantic Web.
1.3 Background and Overview
1.3.1 Motivation
Searching today is probably about as fast as it can be, thanks to the huge computing power behind Google. The problem is that the internet is growing by leaps and bounds. Search engines that can comfortably index terabytes of data on their servers today will most likely not be able to do so 25 years from now. Such growth will make it difficult to keep storing an ever larger internet and nearly impossible to keep the indexed copies up to date. This calls for a shift in how we approach storing and manipulating data.
We present POPSS (Peer tO Peer Semantic Search), a P2P-based search engine that builds its index on different nodes of the network rather than on a central server. Smaller indexes let us update data frequently and create dynamic ontology trees and n-triples to take advantage of Semantic Search technologies. Moreover, the presence of a few supernodes (nodes that hold the Word2Peer Mapper) in the network makes searching much faster.
1.3.2 Scope of the Search Engine
Initial Stage: In its initial stage, POPSS is intended as a plug-in for very large search engines like Google, so that they can provide more relevant search results specific to local groups of people.
Advanced Stage: When the system has grown to a sufficiently large number of nodes, the P2P search can stand on its own.
1.3.3 Use Cases
Client Web Searching
When a client connected to POPSS submits a query, the query is parsed by the Semantic Parser and broken down into meaningful, non-redundant words. These words are sent to a central authority of the network called a supernode. There can be many supernodes in a P2P network, depending on the size and scale of the network and the number of peers. Once the parsed query reaches the supernode, the supernode performs a lookup in the Word2Peer Mapper, which maps every word to the node IDs of the peers that index it, and informs those nodes (the ones holding the cached and indexed data for the queried words) about the query. These nodes in turn search their ontology trees dynamically and try to determine the context of the query. Ontology trees on different peers can also be merged to generate dynamic ontology trees. If the ontology definitions are not adequate for a particular query, the search falls back to normal text search. Finally, the peers send the URLs matching the query to the client node, where they are displayed in the order given by a Ranking Algorithm.
1.4 Key Features
• Distributed indexing of data.
• Use of Semantic Web technologies that can generate n-triples and dynamic ontology trees from crawled data.
• No specialized crawler for data. A website is crawled when a user visits it.
• Optional global crawler available with the supernode that can crawl the web initially, while the number of nodes is below a certain threshold.
• More relevant and localized search results possible.
• Optimistic replication of data to ensure higher availability and performance.
2 System Design
2.1 ARCHITECTURE
POPSS uses a Reverse Word Index. Each peer group has a SERVER (or Supernode) which assigns the responsibility of storing the web index of one or more words to a group of PEERS. Each of a group’s Supernodes holds an exact copy of the Word Index. The SERVER has the Word2Peer Mapper, which records which PEER stores which WORD. Most of the processing is done on the Requester PEER, which saves CPU cycles on the other PEERS. Every query is first processed by the Semantic Parser, and a request for the final database result is then sent to the SERVER. The SERVER looks in the Word2Peer database and forwards the request, together with the Requester PEER’s address, to the respective PEERS. These PEERS send their responses directly to the Requester PEER.
The system has two parts: Supernodes and Client nodes.
SERVER or SUPER NODE (Note: We use the terms SERVER and SuperNode interchangeably.)
The SERVER has a complete database recording which word is indexed on which PEER (known as the Word2Peer Mapper). The SERVER decides the assignment of words to PEERS so that they can index data related to these words. The assignment algorithm must ensure that the Word Index covers the complete search vocabulary and that every word is stored redundantly, so that information is preserved when a user leaves the network. The SERVER has a load balancer which distributes the load evenly across all PEERS. Moreover, the Word2Peer Mapper accepts dynamic entries for words that users searched for but that were not in it initially.
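The assignment policy is not specified in this proposal, so the following is only a minimal sketch of one possible approach: each word goes to the least-loaded PEERS, with a fixed redundancy factor. The class name `Word2PeerMapper`, the `replicas` parameter and the least-loaded rule are all assumptions made for illustration.

```python
class Word2PeerMapper:
    """Hypothetical supernode table: word -> PEER IDs that index it."""

    def __init__(self, replicas=2):
        self.replicas = replicas      # assumed redundancy factor per word
        self.word_to_peers = {}       # word -> list of PEER IDs
        self.load = {}                # PEER ID -> number of assigned words

    def add_peer(self, peer_id):
        # A newly connected PEER starts with no assigned words.
        self.load.setdefault(peer_id, 0)

    def assign(self, word):
        """Return the PEERS for `word`, creating the entry on first use."""
        if word in self.word_to_peers:
            return self.word_to_peers[word]
        if len(self.load) < self.replicas:
            raise ValueError("not enough PEERS for the requested redundancy")
        # Pick the least-loaded PEERS; ties are broken by PEER ID.
        ranked = sorted(self.load, key=lambda p: (self.load[p], p))
        chosen = ranked[: self.replicas]
        for peer_id in chosen:
            self.load[peer_id] += 1
        self.word_to_peers[word] = chosen
        return chosen

    def lookup(self, word):
        """Return the PEERS for a known word, or an empty list."""
        return self.word_to_peers.get(word, [])


mapper = Word2PeerMapper(replicas=2)
for pid in ("p1", "p2", "p3"):
    mapper.add_peer(pid)

print(mapper.assign("semantic"))  # ['p1', 'p2']
print(mapper.assign("search"))    # ['p3', 'p1']
```

Calling `assign` for a word that was searched but never indexed creates its entry on demand, which mirrors the dynamic entries described above. A real load balancer would presumably weigh index size and query traffic rather than a simple word count.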
CLIENT NODE
Each CLIENT acts as a PEER. Each PEER has a unique PEER ID in the network, which PEERS and the SERVER use to communicate with each other. A user who wants to use POPSS downloads client software that connects to the SERVER, which assigns it some words to store. For every assigned word W, the client stores W’s word index. The CLIENT PEER is responsible for crawling and for responding to all search queries related to W. When a PEER visits a site, the web page is cached, and the cached page is passed through the semantic parser. The parser extracts HTML text, pure text and metadata from the web page and converts them into RDF triplets, or n-triples, as shown in the diagram:
Now, the PEER has to pass this information to the peers responsible for storing the Index of the words that occur in that web page. For this it needs the node IDs of those peers. It first looks in its local cache for these node IDs and, if they are not there, requests them from the Server. After getting the node IDs, it sends the data to the Word Index Database of the corresponding peer(s). The group of peers responsible for storing the Word Index and n-triples database for a word uses a decentralized, asynchronous replication protocol based on vector clocks to keep their copies of the data consistent. Over time, we create dynamic ontology trees from these triplets, gradually developing an ontology of the data that lets us search it more efficiently. The advantage here is that, since the indexed data per client is not too big, it can be processed relatively easily to create dynamic ontology trees. Such processing can be done while the client computer’s CPU usage is low. E.g., suppose we obtained the following data from a web page:
This data is converted to n‐triples and then its corresponding Ontology tree will look like:
Similarly, merging large ontology trees may look like:
Each peer has a built-in web server. Every query is processed by a semantic parser, and a data request for the resulting word set is sent to the SERVER. On instruction from the SERVER, the concerned PEERS of the network send the results to the requester PEER.
Indexed data is kept in a distributed manner on different PEERS. A Reverse Word Index is used here too. The semantic indexer uses Stanford's NLP, which is open source. The semantic indexer generates n-triples from HTML tags, metadata and pure text. Generating n-triples from HTML tags and metadata is almost 100% accurate. POPSS will try to generate n-triples from pure text as well as possible (around 30-40% accuracy). Because of this rather low accuracy on pure text, we maintain a normal word index too.
POPSS generates n-triples instead of a normal full-page index; these can be easily queried using SPARQL.
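To make the metadata case concrete, here is a minimal sketch, under stated assumptions, of turning a page's title and meta tags into (subject, predicate, object) triples and filtering them with a wildcard pattern. It uses Python's standard `html.parser`. The class `MetaTripleExtractor`, the helper `match` and the example URL are hypothetical. Triples are plain tuples rather than the N-Triples serialization, and `match` is only a stand-in for a single SPARQL triple pattern, not a SPARQL engine.

```python
from html.parser import HTMLParser


class MetaTripleExtractor(HTMLParser):
    """Collects (page URL, predicate, value) triples from <title> and <meta>."""

    def __init__(self, url):
        super().__init__()
        self.url = url
        self.triples = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # <meta name="author" content="..."> -> (url, "author", "...")
            self.triples.append((self.url, attrs["name"].lower(), attrs["content"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.triples.append((self.url, "title", data.strip()))


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


page = """<html><head><title>POPSS</title>
<meta name="author" content="Team POPSS">
<meta name="keywords" content="p2p, semantic search">
</head><body>Hello</body></html>"""

extractor = MetaTripleExtractor("http://example.org/popss")
extractor.feed(page)
print(match(extractor.triples, p="author"))
```

Triples from pure text would need the NLP pipeline mentioned above and are outside the scope of this sketch.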
2.2 INTERACTIONS
2.2.1 PEER and SERVER Interaction
1. When a PEER connects to the network
If PEER P is connecting to the network for the first time, the SERVER chooses a word W (or a group of words) for P and gives it the responsibility of storing the Word Index and n-triples of W. If P is not a new user, the SERVER updates its Word2Peer database so that it can redirect future search requests to P.
2. When a PEER issues a search query
PEER P sends the processed search query to the SERVER to find the PEERS for the queried words. The SERVER checks its Word2Peer database, extracts the related PEERS and forwards the search request to them. In return, these PEERS send their responses to the requester PEER.
3. When a PEER has some crawled data
When a PEER visits a site, the Browser Crawler and Semantic Parser process the HTML of the web page and decide the keywords for this URI. The PEER now has to store this information on the PEERS responsible for storing the Index of these words. For this it needs the PEER IDs of those PEERS. It looks in its local cache and, if that fails, requests the IDs from the SERVER. The SERVER sends the PEER IDs to the requester PEER.
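The cache-then-SERVER lookup in the third interaction can be sketched as follows. This is an illustration only: the classes `Supernode` and `Peer`, their method names and the in-memory dictionaries are assumptions, and all network communication is replaced by direct method calls.

```python
class Supernode:
    """Hypothetical stand-in for the SERVER's Word2Peer lookup."""

    def __init__(self, word_to_peers):
        self.word_to_peers = word_to_peers
        self.requests = 0             # counts lookups that reached the SERVER

    def peers_for(self, word):
        self.requests += 1
        return list(self.word_to_peers.get(word, []))


class Peer:
    """Hypothetical client that resolves words to PEER IDs."""

    def __init__(self, server):
        self.server = server
        self.cache = {}               # word -> PEER IDs learned earlier

    def resolve(self, word):
        """Return PEER IDs for `word`, asking the SERVER only on a cache miss."""
        peers = self.cache.get(word)
        if peers is None:
            peers = self.server.peers_for(word)
            if peers:                 # do not cache unknown words
                self.cache[word] = peers
        return peers

    def publish(self, url, keywords):
        """Map each keyword of a crawled page to the PEERS that must store it."""
        return {word: self.resolve(word) for word in keywords}


server = Supernode({"semantic": ["p1", "p2"], "search": ["p3"]})
peer = Peer(server)
peer.publish("http://example.org/a", ["semantic", "search"])
peer.publish("http://example.org/b", ["semantic"])
print(server.requests)  # 2: the second "semantic" lookup is served from the cache
```

A real client would also need to invalidate cached IDs when PEERS leave the network, which this sketch ignores.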
2.2.2 PEER to PEER Interaction
1. Send response to a search request
When the SERVER forwards a PEER’s search request, the receiving PEER sends its search results to the requester PEER.
2. Update Index
Whenever one PEER’s Index database is updated, it triggers collaborative replication of the updated data to all PEERS responsible for the same word, so that all PEERS in a group hold the same data. Optimistic Replication techniques are used here to lessen the computational and bandwidth load on any individual peer.
3. Write crawl data to specific peers
The local PEER obtains the IDs of the remote PEERS that store the specific words (either from its cache or from the SERVER) and writes to those PEERS.
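Section 2.1 mentions vector clocks as the basis for replication among the PEERS that share a word. The sketch below shows the three basic vector clock operations such a protocol would rely on: advancing a clock on a local update, merging two clocks, and detecting whether two versions are ordered or concurrent. The function names and the representation of a clock as a dictionary of counters are assumptions. How POPSS would resolve concurrent updates is not specified in this proposal.

```python
def increment(clock, peer_id):
    """Return a copy of `clock` with this PEER's counter advanced by one."""
    updated = dict(clock)
    updated[peer_id] = updated.get(peer_id, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum of two clocks, taken after syncing replicas."""
    return {p: max(a.get(p, 0), b.get(p, 0)) for p in sorted(set(a) | set(b))}


def compare(a, b):
    """Classify how version `a` relates to version `b`."""
    peers = set(a) | set(b)
    a_le_b = all(a.get(p, 0) <= b.get(p, 0) for p in peers)
    b_le_a = all(b.get(p, 0) <= a.get(p, 0) for p in peers)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # b already includes every update seen by a
    if b_le_a:
        return "after"
    return "concurrent"       # independent updates that need reconciliation


base = increment({}, "p1")    # p1 writes the first version
a = increment(base, "p1")     # p1 updates its copy again
b = increment(base, "p2")     # p2 updates its own copy independently

print(compare(base, a))       # before
print(compare(a, b))          # concurrent
print(merge(a, b))            # {'p1': 2, 'p2': 1}
```

When `compare` returns "before" or "after", a replica can simply adopt the newer version. The "concurrent" case is where an optimistic scheme has to merge or choose between the two index versions.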
2.3 How a Query is searched?
When a query is submitted by PEER P, we do some light processing and extract the relevant words from it. Let the refined query be (where each Wi is a word): W1, W2, W3, …, Wn. We send this to the SERVER, which has the Word2Peer Mapper. The SERVER finds that the PEERS related to the queried words are P1, P2, …, Pm, and notifies them that P has a query for W1, W2, W3, …, Wn.
Now consider a specific node that holds the index of Wi, Wj, Wk of our query (the number of words can vary). The idea is to merge the ontology trees (if any) of Wi, Wj, Wk and then search this merged tree for the rest of the query words by simple text matching. If the number of results is below a certain threshold, we fall back to normal text search over the indexed data. Finally, all the peers P1, P2, …, Pm return the URLs of their results, and the client PEER P displays them in an order determined by a Ranking Algorithm.
If the number of words in the query W1, W2, …, Wn that are not indexed reaches a given threshold, we hand the query over to the external parent search engine and also notify the SERVER so that it can add those words to its Word2Peer Mapper.
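The per-peer part of this procedure can be sketched as below. Several simplifications are assumed: each word's ontology tree is flattened into a set of (URL, predicate, object) triples, merging trees becomes a set union, and ranking in the fallback path is simply the number of matching query words. The function name `search_peer`, the data layout and the example URLs are hypothetical.

```python
def search_peer(query_words, triples_by_word, word_index, threshold=2):
    """Search one PEER: semantic match first, plain word index as fallback."""
    local = [w for w in query_words if w in triples_by_word or w in word_index]
    rest = [w for w in query_words if w not in local]

    # Step 1: merge the triple sets of the words this PEER indexes.
    merged = set()
    for w in local:
        merged.update(triples_by_word.get(w, ()))

    # Step 2: keep URLs whose triple text mentions every remaining word.
    hits = []
    for subject, predicate, obj in sorted(merged):
        text = f"{predicate} {obj}".lower()
        if all(w in text for w in rest) and subject not in hits:
            hits.append(subject)
    if len(hits) >= threshold:
        return "semantic", hits

    # Step 3: too few results, so fall back to the normal word index and
    # rank URLs by how many of the locally indexed query words they contain.
    scores = {}
    for w in local:
        for url in word_index.get(w, ()):
            scores[url] = scores.get(url, 0) + 1
    return "text", sorted(scores, key=lambda u: (-scores[u], u))


triples_by_word = {
    "guwahati": {("http://a.example/iitg", "located in", "Guwahati Assam")},
    "institute": {
        ("http://a.example/iitg", "type", "technology institute"),
        ("http://b.example/iisc", "type", "science institute"),
    },
}
word_index = {
    "guwahati": {"http://a.example/iitg", "http://c.example/city"},
    "institute": {"http://a.example/iitg", "http://b.example/iisc"},
}

print(search_peer(["institute", "technology"], triples_by_word, word_index, threshold=1))
print(search_peer(["guwahati", "institute"], triples_by_word, word_index, threshold=3))
```

The first call finds enough results in the merged triples. The second falls below the threshold and returns results from the word index instead. Query words are assumed to be lower-cased by the earlier processing step.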
3 Viability Issues
3.1 Monetization
The main source of revenue will be advertisements. We can use a system like Google AdSense to generate revenue based on user clicks on sponsored links. We can also share some of this revenue with our peer nodes, based on how often the words indexed on their PCs are accessed. This would motivate peers to remain active in the system and to actively cache data for us.
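The proposal does not fix a sharing formula. One simple possibility, assumed here purely for illustration, is to split a revenue pool in proportion to how often each peer's indexed words were accessed. The function name `share_revenue` and the numbers are hypothetical.

```python
def share_revenue(pool, access_counts):
    """Split `pool` among peers in proportion to their access counts."""
    total = sum(access_counts.values())
    if total == 0:
        # Nothing was accessed, so nobody earns a share.
        return {peer: 0.0 for peer in access_counts}
    return {peer: pool * count / total for peer, count in access_counts.items()}


shares = share_revenue(100.0, {"p1": 30, "p2": 10, "p3": 60})
print(shares)  # {'p1': 30.0, 'p2': 10.0, 'p3': 60.0}
```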
3.2 Security implications
There are several security issues that can be handled at implementation time, such as:
• A PEER spoofing a PEER ID: We distribute cryptographically generated IDs so that they cannot be spoofed easily.
• Authentication, data integrity and encryption: Every communication between a PEER and the SERVER, or between two PEERS, is encrypted. Digital signatures are used for authentication and data integrity.
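The proposal does not say how the IDs are generated or which signature scheme is used. A common approach, assumed here, is to derive the PEER ID from a hash of the peer's public key, so that a claimed ID can be checked against the key that is presented. Python's standard library has no public-key signatures, so the sketch below uses an HMAC with a shared secret as a symmetric stand-in for the integrity check. An HMAC is not a digital signature, and a real implementation would need an asymmetric scheme. All function names and key values are hypothetical.

```python
import hashlib
import hmac


def derive_peer_id(public_key: bytes) -> str:
    """Assumed scheme: the PEER ID is the SHA-256 digest of the public key."""
    return hashlib.sha256(public_key).hexdigest()


def id_matches_key(peer_id: str, public_key: bytes) -> bool:
    """Check that a claimed PEER ID really belongs to the presented key."""
    return hmac.compare_digest(peer_id, derive_peer_id(public_key))


def tag_message(shared_key: bytes, message: bytes) -> str:
    """Integrity tag for a message (HMAC stand-in, not a digital signature)."""
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()


def verify_message(shared_key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(tag, tag_message(shared_key, message))


key = b"example-public-key-bytes"      # placeholder key material
peer_id = derive_peer_id(key)
print(id_matches_key(peer_id, key))             # True
print(id_matches_key(peer_id, b"another-key"))  # False

secret = b"example-shared-secret"      # placeholder shared secret
tag = tag_message(secret, b"index update for word W")
print(verify_message(secret, b"index update for word W", tag))  # True
print(verify_message(secret, b"tampered update", tag))          # False
```

With an ID bound to a key in this way, a peer can only claim an ID if it also holds the matching private key, which is what makes spoofing hard once messages are signed.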
3.3 Privacy Issues
POPSS protects everyone’s privacy in the following ways:
• Only the Word Index can be accessed, and only via the server.
• Local files and data are neither accessible to nor searchable by other users.
• The content of visited web pages is searchable, but we do not store any information about who visited these web pages.
• The location of a visited web page’s word index is independent of the visiting user, so a user’s browsing history cannot be figured out from it.
3.4 Limitations
POPSS delivers complete search results only once the network has a large number of PEERS and those PEERS have accumulated enough data in their indexes. It therefore takes time to become a complete search engine.
3.5 Extensibility
POPSS will gradually add support for the FTP and SMB protocols. SMB support will make it easier for people to access each other’s SMB shares and will contribute to the richness of the data returned in results. Similarly, FTP support will enrich the quality of our search. The extension diagram is shown below:
4 References
• “Extraction and Indexing of Triplet‐Based Knowledge Using Natural Language Processing”, David Hooge
• “Semantic Search”, R. Guha, Rob McCool, Eric Miller; 2003
• Marie‐Catherine de Marneffe, Trond Grenager, Bill MacCartney, Daniel Cer, Daniel Ramage, Chloé Kiddon and Christopher D. Manning. Aligning semantic graphs for textual inference and machine reading. In AAAI Spring Symposium at Stanford. 2007.
• Bill MacCartney, Trond Grenager, Marie‐Catherine de Marneffe, Daniel Cer and Christopher D. Manning. Learning to recognize features of valid textual entailments. In Proceedings of the North American Association of Computational Linguistics (NAACL‐06). 2006.
• Ivan Herman’s slides on the Semantic Web Ontologies
• David J. Duke, Ken W. Brodlie, David A. Duce, Ivan Herman: Do You See What I Mean? IEEE Computer Graphics and Applications 25
• OWL 1.1 Web Ontology Language: Structural Specification and Functional‐Style Syntax. Peter F. Patel‐Schneider, Ian Horrocks, and Boris Motik, eds., 2006.