
A Short Survey on P2P Data Indexing

Yuzhe Tang
Department of Computer Science and Engineering
Fudan University, Shanghai, China

I. INTRODUCTION

P2P data indexing has recently attracted a great deal of research effort. The proposed schemes can generally be classified under two taxonomies:

1) From a system point of view, existing schemes fall into two categories: the over-DHT indexing paradigm, which indexes data in the DHT key space in a layered manner (i.e., over the DHT), and the overlay-dependent indexing paradigm, which indexes data directly on a specific overlay.
2) From the viewpoint of indexing purpose, different schemes aim at supporting different query operators, such as range queries and similarity queries. Each query operator defines a distinct semantic context and demands a specific indexing scheme to fit it.

In this survey, we cover the three subjects most relevant to P2P indexing: over-DHT indexing for various query types; several overlay-dependent indexing schemes (mainly for range queries); and finally a classic topic, DHT overlays. Only structured P2P systems are considered.

II. OVER-DHT INDEXING PARADIGM

In the over-DHT indexing paradigm, as mentioned above, data resources and the DHT are loosely coupled by hashing a resource key, called the DHT key. A critical issue in designing an over-DHT scheme is how to generate DHT keys so that data locality is respected.

A. Range/Similarity Queries

Range and similarity queries serve as fundamental operators in both DB and IR systems. Supporting them efficiently requires locality preservation first of all, so effective index structures are essential. As a representative, PHT [36] uses a trie to index one-dimensional, discrete, bounded data, and simply uses the label of each tree node as its DHT key. In contrast, LigHT uses a more elegant method (the naming function), which gracefully distributes its index tree. For bandwidth-efficient range query processing, PHT maintains pointers between neighboring leaf nodes. For multi-dimensional indexing, PHT leverages space-filling curves [8].
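The label-to-key mapping described for PHT can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the implementation from [36]: a dictionary stands in for the DHT, `KEY_BITS` and `LEAF_CAPACITY` are assumed parameters, the names `ToyPHT`, `dht_key`, and `to_bits` are hypothetical, and the leaf is located by a simple linear scan over prefix lengths.

```python
import hashlib

KEY_BITS = 8       # assumed width of the one-dimensional key domain
LEAF_CAPACITY = 2  # assumed maximum number of keys stored in one leaf


def dht_key(label: str) -> str:
    """Uniformly hash a trie node label into a DHT key."""
    return hashlib.sha1(label.encode()).hexdigest()


def to_bits(value: int) -> str:
    """Fixed-width binary string of a data key."""
    return format(value, f"0{KEY_BITS}b")


class ToyPHT:
    def __init__(self):
        # A plain dict stands in for the DHT's put/get interface.
        self.dht = {dht_key(""): {"leaf": True, "keys": []}}

    def find_leaf(self, value: int) -> str:
        """Return the label of the leaf whose prefix covers `value`."""
        bits = to_bits(value)
        for length in range(KEY_BITS + 1):
            node = self.dht.get(dht_key(bits[:length]))
            if node is not None and node["leaf"]:
                return bits[:length]
        raise KeyError(value)

    def insert(self, value: int) -> None:
        label = self.find_leaf(value)
        node = self.dht[dht_key(label)]
        node["keys"].append(value)
        # Split overflowing leaves; the overflow always follows the new key.
        while len(node["keys"]) > LEAF_CAPACITY and len(label) < KEY_BITS:
            keys = node["keys"]
            self.dht[dht_key(label)] = {"leaf": False, "keys": []}
            for bit in "01":
                child = label + bit
                self.dht[dht_key(child)] = {
                    "leaf": True,
                    "keys": [k for k in keys if to_bits(k).startswith(child)],
                }
            label = to_bits(value)[: len(label) + 1]
            node = self.dht[dht_key(label)]

    def lookup(self, value: int):
        """Return (leaf label, whether the key is stored there)."""
        label = self.find_leaf(value)
        return label, value in self.dht[dht_key(label)]["keys"]


pht = ToyPHT()
for v in (1, 2, 3, 200):
    pht.insert(v)
```

Because each trie node is addressed only through the hash of its label, the trie shape is independent of which peer stores which node; the underlying DHT remains unchanged.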
RandPeer [31] applies PHT to a specific scenario: indexing membership data for QoS-sensitive P2P applications. DST [49] efficiently supports range queries and cover queries. Similar to PHT, DST materializes its index tree by directly mapping tree node labels. DST replicates data keys across all ancestors of a leaf, which makes data insertion inefficient. When processing range queries, DST uses parallel lookups to shorten latency, which in turn incurs high bandwidth cost. Zahn et al. [47] adapted DST to mobile ad hoc networks to support efficient range queries. RST [16] also replicates data in internal nodes. In addition, it replicates information about the tree shape to all other nodes (or peers). This replicate-in-all strategy gives each peer a global view, which benefits query processing but results in extremely high, and thus unscalable, maintenance overhead. For example, a single node split in RST could lead to a broadcast to all present peers. As an extension of RST, DKDT [17] embeds a k-d tree to support similarity search. PRISM [41] employs reference vectors to generate DHT keys for multi-dimensional objects and supports similarity search over DHTs. Chen et al. [9] presented a framework for range indexing and proposed various strategies for mapping tree-based index structures into DHTs. Tanin et al. [46] superimposed a quadtree over a DHT for spatial indexing and querying. Each quadtree node is mapped into the DHT by hashing its centroid. To avoid a hot spot at the root, tree nodes are constrained to lie only between levels $f_{min}$ and $f_{max}$. (A similar constraint is used in RST.) Finding a proper value for $f_{min}$ raises a general problem for the over-DHT paradigm: how to trade off locality preservation against peer load balance.

B. Other DB/IR Queries

Toward effective P2P databases, PIER [19] was proposed as a massively distributed database query engine. In particular, it implements several equi-join algorithms on top of DHTs, derived from traditional join algorithms. To support continuous two-way equi-join queries, Idreos et al. [21] proposed a two-level indexing framework. In this framework, both queries and tuples are indexed at two levels, the attribute level and the value level. That is, each query/tuple has two DHT keys, one sensitive to the attribute name and the other to the tuple value. A series of algorithms is proposed to check for matches and notify the query initiator when new tuples are inserted.
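The idea of attribute-level and value-level DHT keys can be made concrete with a simplified Python sketch. It does not reproduce the algorithms of [21]; it only shows how the two kinds of keys might be derived and used to detect equi-join matches. Dictionaries stand in for the DHT, and the names `ToyJoinIndex`, `register_query`, and `insert_tuple` are hypothetical.

```python
import hashlib
from collections import defaultdict


def dht_key(*parts: str) -> str:
    """Hash a sequence of name parts into one DHT key."""
    return hashlib.sha1("|".join(parts).encode()).hexdigest()


class ToyJoinIndex:
    def __init__(self):
        # Attribute level: key(relation, attribute) -> registered queries.
        self.queries = defaultdict(list)
        # Value level: key(relation, attribute, value) -> stored tuples.
        self.tuples = defaultdict(list)
        self.notifications = []

    def register_query(self, query_id, left, right):
        """Register a continuous equi-join `left = right`.

        `left` and `right` are (relation, attribute) pairs; the query is
        indexed at the attribute level under both sides.
        """
        self.queries[dht_key(*left)].append((query_id, right))
        self.queries[dht_key(*right)].append((query_id, left))

    def insert_tuple(self, relation, row):
        for attr, value in row.items():
            # Attribute level: which queries mention relation.attr?
            for query_id, (other_rel, other_attr) in self.queries[dht_key(relation, attr)]:
                # Value level: probe tuples of the other side with the same value.
                probe = dht_key(other_rel, other_attr, str(value))
                for other_row in self.tuples[probe]:
                    self.notifications.append((query_id, row, other_row))
            # Store the new tuple at the value level for later arrivals.
            self.tuples[dht_key(relation, attr, str(value))].append(row)


idx = ToyJoinIndex()
idx.register_query("q1", ("R", "a"), ("S", "b"))
idx.insert_tuple("R", {"a": 5, "x": 1})
idx.insert_tuple("S", {"b": 5})
idx.insert_tuple("S", {"b": 7})
```

In this sketch only the second insertion produces a notification, since it is the only tuple whose join value meets a previously stored tuple of the other relation.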
In a recent work [20], this two-level index structure is extended and generalized to support continuous multi-way join queries.

The Distributed Inverted Index (DII) [38] is a classic framework for P2P keyword search. In DII, the inverted index is superimposed over the DHT by directly hashing the indexed keywords, and posting lists are intersected for multi-keyword search. DII has two major flaws: because text keywords follow a Zipf distribution, direct keyword hashing results in load imbalance; and because hashing destroys data locality (specifically, keyword correlation), posting list intersection is bandwidth-consuming. Many techniques have been proposed to address these problems. Reynolds and Vahdat [38] proposed three methods for posting list intersection: Bloom filters, promising caching, and top-k list join. pSearch [45] addresses locality preservation: it places documents onto a DHT network according to their semantic vectors produced by Latent Semantic Indexing (LSI). A recent proposal [35] carefully selects Highly Discriminative Keys (HDKs) using a set-based vector model and maps the HDKs to the underlying DHT. To adapt to the dynamism of P2P text collections, query-driven indexing schemes [30], [43] were recently proposed. Beyond snapshot search, SmartSeer [26] addresses continuous keyword search on DII. To index queries, it uses the most selective keyword as the query's DHT key.

III. OVERLAY-DEPENDENT INDEXING PARADIGM

In the overlay-dependent indexing paradigm, the underlying overlay itself carries data semantics (or data locality), whereas in the over-DHT paradigm the overlay is semantic-free. Lowering data indexing to the overlay level has three consequences. First, it offers efficient query processing. Second, a specific indexing scheme relies on a specific overlay, which weakens its adaptability and hinders wide deployment. Third, overlay design and implementation are typically complicated: a "good" overlay should accommodate various factors, such as load balance, fault tolerance, scalability, and data locality. No proposal so far satisfies all of these.

Existing overlay-dependent indexing schemes generally follow two approaches: DHT-modified indexing and DHT-free indexing.

A. DHT-modified Indexing

DHT-modified indexing retains the DHT framework and makes modifications within it (rather than over it) to preserve data locality. The LSH paradigm is a typical solution, in which the DHT's uniform hash is replaced by a Locality Sensitive Hash.
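The effect of replacing the uniform hash can be seen in a small Python sketch. It is only an illustration: a linear order-preserving map is assumed as a stand-in for the locality-sensitive functions of the cited schemes, the identifier ring is assumed to be split evenly among peers, and all names (`uniform_hash`, `locality_preserving_hash`, `peers_touched`) are hypothetical.

```python
import hashlib

RING_BITS = 16              # assumed size of the identifier space
RING_SIZE = 1 << RING_BITS


def uniform_hash(value: int) -> int:
    """Uniform hash: neighboring values land at unrelated identifiers."""
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:2], "big")


def locality_preserving_hash(value: int, lo: int, hi: int) -> int:
    """Order-preserving map of the domain [lo, hi] onto the ring."""
    return (value - lo) * (RING_SIZE - 1) // (hi - lo)


def peers_touched(ids, num_peers: int) -> set:
    """Peers responsible for the identifiers, with the ring split evenly."""
    width = RING_SIZE // num_peers
    return {i // width for i in ids}


# A range query over values 100..119 in the domain [0, 1000], with 16 peers.
values = range(100, 120)
lp_ids = [locality_preserving_hash(v, 0, 1000) for v in values]
uni_ids = [uniform_hash(v) for v in values]
lp_peers = peers_touched(lp_ids, 16)
uni_peers = peers_touched(uni_ids, 16)
```

Under the order-preserving map the whole range falls on a contiguous arc of the ring, so the query contacts very few peers; under the uniform hash the same values are typically spread over many peers. The price, as noted later in this section, is that skewed data produce skewed load.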
With a locality-sensitive hash in place of the uniform hash, some DHT overlays directly index data and support efficient range query processing [1], [42], [12], [28]. Gupta et al. [18] applied LSH to map ranges onto a DHT and provide approximate answers to range queries. For efficient similarity queries, LSH-Forest [3] refines traditional LSH by eliminating its data dependence and applies it to P2P systems. For text similarity retrieval, Bhattacharya et al. [6] adapted the vector model to DHT systems by introducing an intuitive Similarity Preserving Hash (SPH) function. Joung et al. [24] proposed a novel keyword indexing and searching scheme. They replaced the uniform hash with a Bloom filter and modeled the underlying overlay as a multi-dimensional hypercube. To traverse the hypercube, as keyword search requires, a spanning tree is generated. Based on this framework, KISS [25] further supports prefix search queries.

While preserving data locality, LSH is weak at providing effective load balance. It breaks the uniform key distribution that DHTs assume, and this further degrades DHT capabilities in many other respects.

DHT augmentation is another choice for DHT-modified indexing. Cone [4] attaches a distributed heap structure to the DHT identifier space and reconstructs data locality for P2P aggregation queries. This additional data structure typically doubles the maintenance cost of the underlying routing tables.

B. DHT-free Indexing

DHT-free indexing makes no use of full-fledged DHTs and instead designs its own overlay. The proposed schemes are based on various data structures. Skip graph [2] is a distributed, range-queriable structure derived from skip lists. P-Tree [10] and P-Ring [11] are distributed B-trees on P2P networks. BATON [22] is an overlay organized as a balanced binary tree. These overlays directly support one-dimensional data indexing. VBI-Tree [23] is a general framework that aims at mapping any existing index tree onto BATON. It can index multi-dimensional data and supports range queries and kNN queries. As a similar solution, SD-Rtree [13] uses a distributed balanced binary tree for spatial indexing. Mercury [5] uses a hierarchical ring structure to index multi-dimensional data.

In these non-hash schemes, data locality is well preserved, but at the price of deterioration in many other respects. Peer load balance, for instance, is problematic in these schemes, and a non-trivial extension is always needed. In recent years, many sophisticated balancing strategies [15], [11], [27] have been proposed, but they cost much more in maintenance than the DHT hashing method.

IV. DHT OVERLAYS

For DHT overlays, the primary concern is topological scalability, in two respects especially: the diameter, which bounds the number of hops of a lookup operation, and the degree, which determines the routing table size. Many proposed DHT overlays, including Chord [44], Pastry [40], Tapestry [48], and Bamboo [39], are based on the Plaxton mesh [34], which achieves $\log_\beta N$ diameter and $(\beta - 1)\log_\beta N$ degree. Here, $\beta$ is the base of the DHT identifier space; for example, $\beta = 2$ in Chord. Another classic DHT, CAN [37], leverages a $d$-torus topology, which has degree $2d$ and diameter $\frac{1}{2} d N^{1/d}$.
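To give a feel for these figures, the following Python sketch evaluates them for concrete network sizes. It takes the diameter as $\log_\beta N$ and the degree as $(\beta - 1)\log_\beta N$ for Plaxton-style overlays (one routing-table row per identifier digit, with $\beta - 1$ entries per row), and degree $2d$ with diameter $\frac{1}{2} d N^{1/d}$ for CAN. The function names are hypothetical, and the logarithm is rounded up to a whole number of identifier digits.

```python
def num_digits(n: int, beta: int) -> int:
    """Smallest k with beta**k >= n, i.e. log_beta(n) rounded up."""
    k, span = 0, 1
    while span < n:
        span *= beta
        k += 1
    return k


def plaxton_metrics(n: int, beta: int) -> dict:
    """Diameter and degree of a Plaxton-style overlay with base beta."""
    digits = num_digits(n, beta)
    return {"diameter": digits, "degree": (beta - 1) * digits}


def can_metrics(n: int, d: int) -> dict:
    """Degree and diameter of a d-dimensional torus with n nodes."""
    return {"degree": 2 * d, "diameter": 0.5 * d * n ** (1.0 / d)}


chord_like = plaxton_metrics(1024, 2)     # beta = 2, as in Chord
pastry_like = plaxton_metrics(65536, 16)  # beta = 16 is an assumed example
can_2d = can_metrics(4096, 2)
```

For $\beta = 2$ the diameter and the degree coincide; a larger base shortens lookups at the cost of a larger routing table, while CAN keeps the degree constant and pays with a polynomial diameter.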
From a graph-theoretic viewpoint, given degree $d$ and diameter $k$, the number of nodes $N$ in a graph is bounded by the Moore bound [7], $1 + d + d^2 + \dots + d^k$. The Moore bound is not generally achievable. Aiming at this optimal case, several DHT overlays are inspired by the topologies of de Bruijn [14], [32], butterfly [33], and Kautz [29] graphs. Loguinov et al. [32] presented a thorough theoretical analysis of DHT scalability and fault tolerance.

REFERENCES

[1] A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. In Peer-to-Peer Computing, pages 33–40, 2002.
[2] J. Aspnes and G. Shah. Skip graphs. In SODA, pages 384–393, 2003.
[3] M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, pages 651–660, 2005.
[4] R. Bhagwan, G. Varghese, and G. M. Voelker. Cone: Augmenting DHTs to support distributed resource discovery, 2003.
[5] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: supporting scalable multi-attribute range queries. In SIGCOMM, pages 353–366, 2004.
[6] I. Bhattacharya, S. R. Kashyap, and S. Parthasarathy. Similarity searching in peer-to-peer databases. In ICDCS, pages 329–338, 2005.
[7] W. G. Bridges and S. Toueg. On the impossibility of directed Moore graphs. J. Comb. Theory, Ser. B, 29(3):339–341, 1980.
[8] Y. Chawathe, S. Ramabhadran, S. Ratnasamy, A. LaMarca, S. Shenker, and J. M. Hellerstein. A case study in building layered DHT applications. In SIGCOMM, pages 97–108, 2005.
[9] L. Chen, K. S. Candan, J. Tatemura, D. Agrawal, and D. Cavendish. On overlay schemes to support point-in-range queries for scalable grid resource discovery. In Peer-to-Peer Computing, pages 23–30, 2005.
[10] A. Crainiceanu, P. Linga, J. Gehrke, and J. Shanmugasundaram. Querying peer-to-peer networks using P-trees. In WebDB, pages 25–30, 2004.
[11] A. Crainiceanu, P. Linga, A. Machanavajjhala, J. Gehrke, and J. Shanmugasundaram. P-ring: an efficient and robust P2P range index structure. In SIGMOD Conference, pages 223–234, 2007.
[12] A. Datta, M. Hauswirth, R. John, R. Schmidt, and K. Aberer. Range queries in trie-structured overlays. In Peer-to-Peer Computing, pages 57–66, 2005.
[13] C. du Mouza, W. Litwin, and P. Rigaux. SD-Rtree: A scalable distributed Rtree. In ICDE, pages 296–305, 2007.
[14] P. Fraigniaud and P. Gauron. Brief announcement: an overview of the content-addressable network d2b. In PODC, page 151, 2003.
[15] P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. In VLDB, pages 444–455, 2004.
[16] J. Gao and P. Steenkiste. An adaptive protocol for efficient support of range queries in DHT-based systems. In ICNP, pages 239–250, 2004.
[17] J. Gao and P. Steenkiste. Efficient support for similarity searches in DHT-based peer-to-peer systems. In ICC, pages 1867–1874, 2007.
[18] A. Gupta, D. Agrawal, and A. E. Abbadi. Approximate range selection queries in peer-to-peer systems. In CIDR, 2003.
[19] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the internet with PIER. In VLDB, pages 321–332, 2003.
[20] S. Idreos, E. Liarou, and M. Koubarakis. Continuous multi-way joins over distributed hash tables. In EDBT, 2008.
[21] S. Idreos, C. Tryfonopoulos, and M. Koubarakis. Distributed evaluation of continuous equi-join queries over large structured overlay networks. In ICDE, page 43, 2006.
[22] H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A balanced tree structure for peer-to-peer networks. In VLDB, pages 661–672, 2005.
[23] H. V. Jagadish, B. C. Ooi, Q. H. Vu, R. Zhang, and A. Zhou. VBI-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In ICDE, page 34, 2006.
[24] Y.-J. Joung, C.-T. Fang, and L.-W. Yang. Keyword search in DHT-based peer-to-peer networks. In ICDCS, pages 339–348, 2005.
[25] Y.-J. Joung and L.-W. Yang. KISS: A simple prefix search scheme in P2P networks. In WebDB, 2006.
[26] J. Kannan, B. Yang, S. Shenker, P. Sharma, S. Banerjee, S. Basu, and S.-J. Lee. SmartSeer: Using a DHT to process continuous queries over peer-to-peer networks. In INFOCOM, 2006.
[27] D. R. Karger and M. Ruhl. Simple efficient load balancing algorithms for peer-to-peer systems. In SPAA, pages 36–43, 2004.
[28] D. Li, X. Lu, B. Wang, J. Su, J. Cao, K. C. C. Chan, and H. V. Leong. Delay-bounded range queries in DHT-based peer-to-peer systems. In ICDCS, page 64, 2006.
[29] D. Li, X. Lu, and J. Wu. Fissione: a scalable constant degree and low congestion DHT scheme based on Kautz graphs. In INFOCOM, pages 1677–1688, 2005.
[30] Y. Li, H. V. Jagadish, and K.-L. Tan. Sprite: A learning-based text retrieval system in DHT networks. In ICDE, pages 1106–1115, 2007.
[31] J. Liang and K. Nahrstedt. RandPeer: Membership management for QoS sensitive peer-to-peer applications. In INFOCOM, 2006.
[32] D. Loguinov, A. Kumar, V. Rai, and S. Ganesh. Graph-theoretic analysis of structured peer-to-peer systems: routing distances and fault resilience. In SIGCOMM, pages 395–406, 2003.
[33] D. Malkhi, M. Naor, and D. Ratajczak. Viceroy: a scalable and dynamic emulation of the butterfly. In PODC, pages 183–192, 2002.
[34] C. G. Plaxton, R. Rajaraman, and A. W. Richa. Accessing nearby copies of replicated objects in a distributed environment. In SPAA, pages 311–320, 1997.
[35] I. Podnar, M. Rajman, T. Luu, F. Klemm, and K. Aberer. Scalable peer-to-peer web retrieval with highly discriminative keys. In ICDE, pages 1096–1105, 2007.
[36] S. Ramabhadran, S. Ratnasamy, J. M. Hellerstein, and S. Shenker. Brief announcement: prefix hash tree. In PODC, page 368, 2004.
[37] S. Ratnasamy, P. Francis, M. Handley, R. M. Karp, and S. Shenker. A scalable content-addressable network. In SIGCOMM, pages 161–172, 2001.
[38] P. Reynolds and A. Vahdat. Efficient peer-to-peer keyword searching. In Middleware, pages 21–40, 2003.
[39] S. C. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling churn in a DHT. In USENIX Annual Technical Conference, General Track, pages 127–140, 2004.
[40] A. I. T. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, pages 329–350, 2001.
[41] O. D. Sahin, A. Gulbeden, F. Emekçi, D. Agrawal, and A. E. Abbadi. PRISM: indexing multi-dimensional data in P2P networks using reference vectors. In ACM Multimedia, pages 946–955, 2005.
[42] C. Schmidt and M. Parashar. Flexible information discovery in decentralized distributed systems. In HPDC, pages 226–235, 2003.
[43] G. Skobeltsyn, T. Luu, I. P. Zarko, M. Rajman, and K. Aberer. Web text retrieval with a P2P query-driven index. In SIGIR, pages 679–686, 2007.
[44] I. Stoica, R. Morris, D. R. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001.
[45] C. Tang, S. Dwarkadas, and Z. Xu. On scaling latent semantic indexing for large peer-to-peer systems. In SIGIR, pages 112–121, 2004.
[46] E. Tanin, A. Harwood, and H. Samet. Using a distributed quadtree index in peer-to-peer networks. VLDB J., 16(2):165–178, 2007.
[47] T. Zahn, G. Wittenburg, and J. Schiller. Towards efficient range queries in mobile ad hoc networks using DHTs. In MobiShare.
[48] B. Y. Zhao, J. Kubiatowicz, and A. D. Joseph. Tapestry: a fault-tolerant wide-area application infrastructure. Computer Communication Review, 32(1):81, 2002.
[49] C. Zheng, G. Shen, S. Li, and S. Shenker. Distributed segment tree: Support of range query and cover query over DHT. In The 5th International Workshop on Peer-to-Peer Systems (IPTPS), Feb. 2006.