A Short Survey on P2P Data Indexing

Yuzhe Tang
Department of Computer Science and Engineering
Fudan University, Shanghai, China
[email protected]

I. INTRODUCTION

P2P data indexing has recently attracted considerable research effort. The proposed schemes can generally be classified under two taxonomies:

1) From a system point of view, existing schemes fall into two categories: the over-DHT indexing paradigm, which works in a layered manner and indexes data in the DHT key space (i.e., over the DHT), and the overlay-dependent indexing paradigm, which indexes data directly on a specific overlay.

2) From the viewpoint of indexing purpose, different schemes aim at supporting different query operators, such as range queries and similarity queries. Each query operator defines a distinct semantic context and demands a specific indexing scheme to fit it.

Here, we survey the three subjects most relevant to P2P indexing: over-DHT indexing for various query types; several overlay-dependent indexing schemes (mainly for range queries); and finally a classic topic, DHT overlays. Only structured P2P systems are considered.

II. OVER-DHT INDEXING PARADIGM

In the over-DHT indexing paradigm, as mentioned above, data resources and the DHT are loosely coupled by hashing a resource key, called the DHT key. In designing an over-DHT scheme, a critical issue is how to generate the DHT key so that data locality is taken into account.

A. Range/Similarity Queries

Range and similarity queries serve as fundamental operators in both DB and IR systems. To support them efficiently, locality preservation comes first, and thus effective index structures are essential. As a representative, PHT [36] uses a trie structure to index one-dimensional, discrete, bounded data, and simply uses the label of a tree node as its DHT key. In contrast, LigHT uses a more elegant method (the naming function), which gracefully distributes its index tree. For bandwidth-efficient range query processing, PHT maintains pointers between neighboring leaf nodes. For multi-dimensional indexing, PHT leverages space-filling curves [8].
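The label-as-key idea can be illustrated with a minimal sketch. It assumes a toy in-memory dictionary in place of a real DHT, SHA-1 as the uniform hash, and a linear probe over prefix lengths. The names ToyDHT, dht_key, and find_leaf are hypothetical and are not taken from PHT [36].

```python
import hashlib


def dht_key(label: str) -> int:
    """Hash a trie-node label (a bit string such as '10') to a DHT key.

    SHA-1 is an assumption here; any uniform hash would serve.
    """
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")


class ToyDHT:
    """In-memory stand-in for a DHT with put/get by key (hypothetical)."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def find_leaf(dht, bits):
    """Return the label of the leaf whose prefix covers `bits`, or None.

    Probes prefixes from shortest to longest, one DHT lookup per probe.
    """
    for length in range(len(bits) + 1):
        label = bits[:length]
        node = dht.get(dht_key(label))
        if node is None:
            return None  # no trie node carries this label
        if node["leaf"]:
            return label
    return None


# A small trie over 4-bit keys: the leaves are "0", "10" and "11".
dht = ToyDHT()
for label, is_leaf in [("", False), ("0", True), ("1", False),
                       ("10", True), ("11", True)]:
    dht.put(dht_key(label), {"label": label, "leaf": is_leaf})

print(find_leaf(dht, "1011"))  # label of the leaf responsible for key 1011
```

Because every trie node is reachable by hashing its label, a peer can address any node directly rather than walking down from the root. A binary search over prefix lengths could reduce the number of probes compared with the linear scan used here.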
RandPeer [31] applies PHT to a specific scenario: indexing membership data for QoS-sensitive P2P applications.

DST [49] efficiently supports range queries and cover queries. Similar to PHT, DST materializes its index tree by directly mapping tree node labels. DST replicates data keys across all ancestors of a leaf, which makes data insertion inefficient. When processing range queries, DST uses parallel lookups to reduce latency, which in turn incurs a high bandwidth cost. Zahn et al. [47] adapted DST to mobile ad hoc networks to support efficient range queries.

RST [16] also replicates data in internal nodes. In addition, it replicates information about the tree shape to all other nodes (peers). This replicate-to-all strategy gives each peer a global view, which benefits query processing but results in extremely high, and thus unscalable, maintenance overhead. For example, a single node split in RST could lead to a broadcast to all present peers. As an extension of RST, DKDT [17] embeds a k-d tree to support similarity search.

PRISM [41] employs reference vectors to generate DHT keys for multi-dimensional objects and supports similarity search over DHTs. Chen et al. [9] presented a framework for range indexing and proposed various strategies for mapping tree-based index structures onto DHTs. Tanin et al. [46] superimposed a quadtree over a DHT for spatial indexing and querying. Each quadtree node is mapped onto the DHT by hashing its centroid. To avoid a hot spot at the root, tree nodes are constrained to lie only between levels $f_{\min}$ and $f_{\max}$. (A similar constraint is used in RST.) Choosing a proper value for $f_{\min}$ poses a general problem for the over-DHT paradigm: how to trade off locality preservation against peer load balance.

B. Other DB/IR Queries

Towards effective P2P databases, PIER [19] was proposed as a massively distributed database query engine. In particular, it implements several equi-join algorithms on top of DHTs, which are derived from traditional join algorithms.

To support continuous two-way equi-join queries, Idreos et al. [21] proposed a two-level indexing framework. In this framework, both queries and tuples are indexed at two levels: the attribute level and the value level. That is, each query or tuple has two DHT keys, which depend on the attribute name and the tuple value, respectively. A series of algorithms is proposed to check for matches and notify the query initiator when new tuples are inserted.
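As a rough illustration of assigning two DHT keys to one tuple, the sketch below derives an attribute-level key from the attribute name and a value-level key from the attribute value. The key composition, the SHA-1 hash, and the function names are assumptions made for illustration; the algorithms of [21] are not reproduced here.

```python
import hashlib


def _hash(text: str) -> int:
    """Uniform hash into a 160-bit identifier space (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")


def two_level_keys(relation: str, attribute: str, value) -> tuple:
    """Return (attribute-level key, value-level key) for one tuple attribute.

    Assumed composition: the first key depends only on the attribute name,
    the second only on the attribute value.
    """
    attribute_key = _hash(f"attr:{relation}.{attribute}")
    value_key = _hash(f"val:{value!r}")
    return attribute_key, value_key


# Tuples of R and S that carry the same join value share a value-level key,
# so they are routed to the same peer, where matches can be checked.
r_attr, r_val = two_level_keys("R", "a", 7)
s_attr, s_val = two_level_keys("S", "b", 7)
t_attr, t_val = two_level_keys("R", "a", 8)
print(r_val == s_val, r_attr == t_attr, r_val == t_val)
```

Under this assumed composition, the attribute-level key groups everything that concerns one attribute, while the value-level key is the rendezvous point for tuples whose values are equal.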
In a recent work [20], this two-level index structure is extended and generalized to support continuous multi-way join queries.

The Distributed Inverted Index (DII) [38] is a classic framework for P2P keyword search. In DII, the inverted index is superimposed over a DHT by directly hashing the indexed keywords, and posting lists are intersected for multi-keyword search. DII has two major flaws. First, because text keywords follow a Zipf distribution, direct keyword hashing results in load imbalance. Second, because hashing destroys data locality (specifically, keyword correlation), posting list intersection is bandwidth-consuming. Many techniques have been proposed to address these problems. Reynolds and Vahdat [38] proposed three methods for posting list intersection: Bloom filters, caching, and top-k list joins. pSearch [45] addresses locality preservation: it places documents onto a DHT network according to their semantic vectors, which are produced by Latent Semantic Indexing (LSI). A recent proposal [35] carefully selects Highly Discriminative Keys (HDKs) using a set-based vector model and maps the HDKs onto the underlying DHT. To adapt to the dynamism of P2P text collections, query-driven indexing schemes [30], [43] were recently proposed. Besides snapshot search, SmartSeer [26] addresses continuous keyword search on DII. To index a query, it uses the query's most selective keyword as its DHT key.

III. OVERLAY-DEPENDENT INDEXING PARADIGM

In the overlay-dependent indexing paradigm, the underlying overlay carries data semantics (or data locality). Note that in the over-DHT paradigm, the overlay is semantic-free. This architecture, which pushes data indexing down to the overlay level, has three consequences. First, it offers efficient query processing. Second, a specific indexing scheme relies on a specific overlay, which weakens its adaptability and hinders wide deployment. Third, overlay design and implementation are typically complicated. Specifically, a "good" overlay should accommodate various factors, such as load balance, fault tolerance, scalability, and data locality. So far, no proposal is satisfactory in all of these respects. Existing overlay-dependent indexing schemes generally follow one of two approaches: DHT-modified indexing and DHT-free indexing.

A. DHT-modified Indexing

DHT-modified indexing retains the DHT framework and modifies it from within (rather than building over it) to preserve data locality. The LSH paradigm is a typical solution, in which the DHT's uniform hash is replaced by a Locality Sensitive Hash.
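The effect of swapping the hash function can be sketched as follows. The example uses a linear order-preserving mapping as a deliberately simple stand-in for a locality sensitive hash; the cited systems use their own constructions. The 16-bit identifier space, the value domain, and all function names are assumptions.

```python
import hashlib

ID_BITS = 16               # assumed size of the identifier space
ID_SPACE = 1 << ID_BITS


def uniform_hash(value: float) -> int:
    """Conventional DHT hashing: nearby values land on unrelated IDs."""
    digest = hashlib.sha1(repr(value).encode()).digest()
    return int.from_bytes(digest, "big") % ID_SPACE


def locality_hash(value: float, lo: float, hi: float) -> int:
    """Order-preserving map from the value domain [lo, hi] to IDs."""
    if not lo <= value <= hi:
        raise ValueError("value outside the indexed domain")
    fraction = (value - lo) / (hi - lo)
    return min(int(fraction * ID_SPACE), ID_SPACE - 1)


def range_to_id_interval(a: float, b: float, lo: float, hi: float) -> tuple:
    """A value range [a, b] becomes one contiguous interval of IDs."""
    return locality_hash(a, lo, hi), locality_hash(b, lo, hi)


# The range [25, 75] over the domain [0, 100] maps to one ID interval,
# so only the peers responsible for that interval need to be visited.
print(range_to_id_interval(25, 75, 0, 100))
```

The same mapping also shows the weakness of this approach: if the indexed values are skewed, the identifiers are skewed too, and the peers owning the dense part of the identifier space carry most of the load.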
By this means, some DHT overlays directly index data and support efficient range query processing [1], [42], [12], [28]. Gupta et al. [18] applied LSH to map ranges onto a DHT and provide approximate answers to range queries. For efficient similarity queries, LSH-Forest [3] refines traditional LSH by eliminating its data dependence and applies it to P2P systems. For text similarity retrieval, Bhattacharya et al. [6] adapted the vector model to DHT systems by introducing an intuitive Similarity Preserving Hash (SPH) function.

Joung et al. [24] proposed a novel keyword indexing and searching scheme. They replaced the uniform hash with a Bloom filter and modeled the underlying overlay as a multi-dimensional hypercube. To traverse the hypercube, as keyword search requires, a spanning tree is generated. Based on this framework, KISS [25] further supports prefix search queries.

While preserving data locality, LSH is weak at providing effective load balance. It breaks the uniform key distribution that DHTs assume, which in turn degrades DHT performance in many other respects.

DHT augmentation is another choice for DHT-modified indexing. Cone [4] attaches a distributed heap structure to the DHT identifier space and reconstructs data locality for P2P aggregation queries. This additional data structure typically doubles the maintenance cost of the underlying routing tables.

B. DHT-free Indexing

DHT-free indexing makes no use of full-fledged DHTs and instead designs its own overlay. The proposed schemes are based on various data structures. Skip graph [2] is a distributed range-queriable structure derived from skip lists. P-Tree [10] and P-Ring [11] are distributed B-trees on P2P networks. BATON [22] is an overlay organized as a balanced binary tree. These overlays can directly support one-dimensional data indexing. VBI-Tree [23] is a general framework that aims at mapping any existing index tree onto BATON. It can index multi-dimensional data and supports range queries and kNN queries. As a similar solution, SD-Rtree [13] uses a distributed balanced binary tree for spatial indexing. Mercury [5] uses a hierarchical ring structure to index multi-dimensional data.

In these non-hash schemes, data locality is well preserved, but at the cost of degrading many other properties. Peer load balance, for instance, is problematic in these schemes, and a non-trivial extension is always needed. In recent years, many sophisticated balancing strategies [15], [11], [27] have been proposed, but they cost much more in maintenance than DHT hashing.

IV. DHT OVERLAYS

For DHT overlays, the primary concern is topological scalability, in two respects in particular: the diameter, which bounds the number of hops of a lookup operation, and the degree, which determines the routing table size. Many proposed DHT overlays, including Chord [44], Pastry [40], Tapestry [48], and Bamboo [39], are based on the Plaxton mesh [34], which achieves a diameter of $(\beta - 1)\log_\beta N$ and a degree of $\log_\beta N$. Here, $\beta$ denotes the base of the DHT identifier space; for example, $\beta = 2$ in Chord. Another classic DHT, CAN [37], leverages a $d$-torus topology, which has degree $2d$ and diameter $\frac{1}{2} d N^{1/d}$.
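To give a feel for these bounds, the sketch below evaluates the two pairs of formulas for sample network sizes. The function names and the chosen values of N, β, and d are illustrative assumptions; the formulas are the ones stated in the text.

```python
import math


def plaxton_mesh(n: int, beta: int) -> dict:
    """Degree log_beta(N) and diameter (beta - 1) * log_beta(N)."""
    digits = math.log(n) / math.log(beta)
    return {"degree": digits, "diameter": (beta - 1) * digits}


def can_torus(n: int, d: int) -> dict:
    """Degree 2d and diameter (1/2) * d * N^(1/d) on a d-torus."""
    return {"degree": 2 * d, "diameter": 0.5 * d * n ** (1.0 / d)}


# Roughly one million peers under each topology.
n = 2 ** 20
for name, result in [
    ("Plaxton mesh, beta=2", plaxton_mesh(n, 2)),
    ("Plaxton mesh, beta=16", plaxton_mesh(n, 16)),
    ("CAN, d=2", can_torus(n, 2)),
    ("CAN, d=10", can_torus(n, 10)),
]:
    print(f"{name}: degree={result['degree']:.1f}, "
          f"diameter={result['diameter']:.1f}")
```

With a constant degree, the torus diameter grows polynomially in N, whereas the Plaxton-style formulas keep both quantities logarithmic. This is the trade-off between routing table size and hop count that the two parameters capture.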
From a graph-theoretic viewpoint, given degree $d$ and diameter $k$, the number of nodes $N$ in a graph is bounded by the Moore bound [7], $1 + d + d^2 + \dots + d^k$. The Moore bound is not generally achievable. Aiming at this optimum, several DHT overlays are inspired by the topologies of de Bruijn [14], [32], butterfly [33], and Kautz [29] graphs. Loguinov et al. [32] presented a thorough theoretical analysis of DHT scalability and fault tolerance.

REFERENCES

[1] A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. In Peer-to-Peer Computing, pages 33–40, 2002.
[2] J. Aspnes and G. Shah. Skip graphs. In SODA, pages 384–393, 2003.
[3] M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, pages 651–660, 2005.
[4] R. Bhagwan, G. Varghese, and G. M. Voelker. Cone: Augmenting DHTs to support distributed resource discovery, 2003.
[5] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: supporting scalable multi-attribute range queries. In SIGCOMM, pages 353–366, 2004.
[6] I. Bhattacharya, S. R. Kashyap, and S. Parthasarathy. Similarity searching in peer-to-peer databases. In ICDCS, pages 329–338, 2005.

[7] W. G. Bridges and S. Toueg. On the impossibility of directed Moore graphs. J. Comb. Theory, Ser. B, 29(3):339–341, 1980.
[8] Y. Chawathe, S. Ramabhadran, S. Ratnasamy, A. LaMarca, S. Shenker, and J. M. Hellerstein. A case study in building layered DHT applications. In SIGCOMM, pages 97–108, 2005.
[9] L. Chen, K. S. Candan, J. Tatemura, D. Agrawal, and D. Cavendish. On overlay schemes to support point-in-range queries for scalable grid resource discovery. In Peer-to-Peer Computing, pages 23–30, 2005.
[10] A. Crainiceanu, P. Linga, J. Gehrke, and J. Shanmugasundaram. Querying peer-to-peer networks using P-trees. In WebDB, pages 25–30, 2004.
[11] A. Crainiceanu, P. Linga, A. Machanavajjhala, J. Gehrke, and J. Shanmugasundaram. P-ring: an efficient and robust P2P range index structure. In SIGMOD Conference, pages 223–234, 2007.
[12] A. Datta, M. Hauswirth, R. John, R. Schmidt, and K. Aberer. Range queries in trie-structured overlays. In Peer-to-Peer Computing, pages 57–66, 2005.
[13] C. du Mouza, W. Litwin, and P. Rigaux. SD-Rtree: A scalable distributed Rtree. In ICDE, pages 296–305, 2007.
[14] P. Fraigniaud and P. Gauron. Brief announcement: an overview of the content-addressable network D2B. In PODC, page 151, 2003.
[15] P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. In VLDB, pages 444–455, 2004.
[16] J. Gao and P. Steenkiste. An adaptive protocol for efficient support of range queries in DHT-based systems. In ICNP, pages 239–250, 2004.
[17] J. Gao and P. Steenkiste. Efficient support for similarity searches in DHT-based peer-to-peer systems. In ICC, pages 1867–1874, 2007.
[18] A. Gupta, D. Agrawal, and A. E. Abbadi. Approximate range selection queries in peer-to-peer systems. In CIDR, 2003.
[19] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the internet with PIER. In VLDB, pages 321–332, 2003.
[20] S. Idreos, E. Liarou, and M. Koubarakis. Continuous multi-way joins over distributed hash tables. In EDBT, 2008.
[21] S. Idreos, C. Tryfonopoulos, and M. Koubarakis. Distributed evaluation of continuous equi-join queries over large structured overlay networks. In ICDE, page 43, 2006.
[22] H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A balanced tree structure for peer-to-peer networks. In VLDB, pages 661–672, 2005.
[23] H. V. Jagadish, B. C. Ooi, Q. H. Vu, R. Zhang, and A. Zhou. VBI-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In ICDE, page 34, 2006.
[24] Y.-J. Joung, C.-T. Fang, and L.-W. Yang. Keyword search in DHT-based peer-to-peer networks. In ICDCS, pages 339–348, 2005.
[25] Y.-J. Joung and L.-W. Yang. KISS: A simple prefix search scheme in P2P networks. In WebDB, 2006.
[26] J. Kannan, B. Yang, S. Shenker, P. Sharma, S. Banerjee, S. Basu, and S.-J. Lee. SmartSeer: Using a DHT to process continuous queries over peer-to-peer networks. In INFOCOM, 2006.
[27] D. R. Karger and M. Ruhl. Simple efficient load balancing algorithms for peer-to-peer systems. In SPAA, pages 36–43, 2004.
[28] D. Li, X. Lu, B. Wang, J. Su, J. Cao, K. C. C. Chan, and H. V. Leong. Delay-bounded range queries in DHT-based peer-to-peer systems. In ICDCS, page 64, 2006.
[29] D. Li, X. Lu, and J. Wu. FissionE: a scalable constant degree and low congestion DHT scheme based on Kautz graphs. In INFOCOM, pages 1677–1688, 2005.
[30] Y. Li, H. V. Jagadish, and K.-L. Tan. Sprite: A learning-based text retrieval system in DHT networks. In ICDE, pages 1106–1115, 2007.
[31] J. Liang and K. Nahrstedt. RandPeer: Membership management for QoS sensitive peer-to-peer applications. In INFOCOM, 2006.
[32] D. Loguinov, A. Kumar, V. Rai, and S. Ganesh. Graph-theoretic analysis of structured peer-to-peer systems: routing distances and fault resilience. In SIGCOMM, pages 395–406, 2003.
[33] D. Malkhi, M. Naor, and D. Ratajczak. Viceroy: a scalable and dynamic emulation of the butterfly. In PODC, pages 183–192, 2002.
[34] C. G. Plaxton, R. Rajaraman, and A. W. Richa. Accessing nearby copies of replicated objects in a distributed environment. In SPAA, pages 311–320, 1997.
[35] I. Podnar, M. Rajman, T. Luu, F. Klemm, and K. Aberer. Scalable peer-to-peer web retrieval with highly discriminative keys. In ICDE, pages 1096–1105, 2007.
[36] S. Ramabhadran, S. Ratnasamy, J. M. Hellerstein, and S. Shenker. Brief announcement: prefix hash tree. In PODC, page 368, 2004.
[37] S. Ratnasamy, P. Francis, M. Handley, R. M. Karp, and S. Shenker. A scalable content-addressable network. In SIGCOMM, pages 161–172, 2001.
[38] P. Reynolds and A. Vahdat. Efficient peer-to-peer keyword searching. In Middleware, pages 21–40, 2003.
[39] S. C. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling churn in a DHT. In USENIX Annual Technical Conference, General Track, pages 127–140, 2004.
[40] A. I. T. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, pages 329–350, 2001.
[41] O. D. Sahin, A. Gulbeden, F. Emekçi, D. Agrawal, and A. E. Abbadi. PRISM: indexing multi-dimensional data in P2P networks using reference vectors. In ACM Multimedia, pages 946–955, 2005.
[42] C. Schmidt and M. Parashar. Flexible information discovery in decentralized distributed systems. In HPDC, pages 226–235, 2003.
[43] G. Skobeltsyn, T. Luu, I. P. Zarko, M. Rajman, and K. Aberer. Web text retrieval with a P2P query-driven index. In SIGIR, pages 679–686, 2007.
[44] I. Stoica, R. Morris, D. R. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001.
[45] C. Tang, S. Dwarkadas, and Z. Xu. On scaling latent semantic indexing for large peer-to-peer systems. In SIGIR, pages 112–121, 2004.
[46] E. Tanin, A. Harwood, and H. Samet. Using a distributed quadtree index in peer-to-peer networks. VLDB J., 16(2):165–178, 2007.
[47] T. Zahn, G. Wittenburg, and J. Schiller. Towards efficient range queries in mobile ad hoc networks using DHTs. In MobiShare.
[48] B. Y. Zhao, J. Kubiatowicz, and A. D. Joseph. Tapestry: a fault-tolerant wide-area application infrastructure. Computer Communication Review, 32(1):81, 2002.
[49] C. Zheng, G. Shen, S. Li, and S. Shenker. Distributed segment tree: Support of range query and cover query over DHT. In The 5th International Workshop on Peer-to-Peer Systems (IPTPS), Feb. 2006.
