Shared Query Processing in Data Streaming Systems

by

Saileshwar Krishnamurthy

B.E. (Hons.) (Birla Institute of Technology and Science, Pilani) 1995
M.S. (Purdue University, West Lafayette, IN) 1997

A dissertation submitted in partial satisfaction of the
requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

GRADUATE DIVISION

of the

UNIVERSITY OF CALIFORNIA, BERKELEY

Committee in charge:

Professor Michael J. Franklin, Chair
Professor Joseph M. Hellerstein
Professor Douglas Dreger

Fall 2006

The dissertation of Saileshwar Krishnamurthy is approved:

Professor Michael J. Franklin, Chair

Date

Professor Joseph M. Hellerstein

Date

Professor Douglas Dreger

Date

University of California, Berkeley

Fall 2006

Shared Query Processing in Data Streaming Systems

Copyright © 2006
by
Saileshwar Krishnamurthy

Abstract

Shared Query Processing in Data Streaming Systems

by

Saileshwar Krishnamurthy

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Michael J. Franklin, Chair

In networked environments there is an increasing proliferation of sources (e.g., seismic sensors, financial tickers) that produce live data streams. As a consequence, systems that can manage streaming data have gained tremendous importance. These systems provide declarative query-based interfaces that have enabled new classes of applications that react to live streaming data in real time. As such applications flourish, they result in large numbers of concurrent queries that data streaming systems have to support. The traditional approach of executing concurrent queries separately can lead to resource shortages that severely limit the usefulness of such systems. A better alternative is shared query processing, an approach where a system shares its resources to cooperatively process multiple concurrent queries by exploiting the similarities among these queries.

Over the past two decades there has been significant research on shared query processing, which has typically led to approaches that optimize multiple concurrent queries in a static, batch-oriented fashion. This static approach is, however, unsuitable in real-world environments where queries join and leave the system in an unpredictable fashion. In this thesis, I reject the traditional methods of static shared query processing in favor of a more dynamic "on-the-fly" approach. In particular, I develop on-the-fly shared processing techniques for the following kinds of queries: (1) joins with varying predicates, (2) aggregates with varying windows, and (3) joins and aggregates with varying predicates and windows. The techniques developed in this thesis can be used to share both computation resources in single-site systems and communication resources in distributed systems.

Furthermore, this thesis shows that systems using these techniques can achieve significant improvements in scalability and performance. For instance, shared computation was shown in experiments to enable a system to support 8 to 16 times (i.e., roughly an order of magnitude) as many concurrent queries as a system that uses existing unshared and shared approaches. Similarly, shared communication was shown in experiments to enable up to a 50% reduction in bandwidth consumption as compared to earlier techniques. In summary, this thesis advances the state of the art in two important ways. The first is to demonstrate that shared query processing can offer significant scalability improvements that are crucial in data streaming systems. The second is to show that on-the-fly approaches to sharing are feasible, and can make shared query processing useful in real-world scenarios.

Professor Michael J. Franklin, Chair


Date

Acknowledgements

Mike Franklin has been an incredible fount of inspiration and guidance throughout my years at Berkeley. As an advisor he showed me how to look at the big picture, choose the problems that really matter, and identify the best way to solve those problems. Mike also taught me the importance of effectively communicating research results. His untiring patience with my efforts has made me a better writer and public speaker. As a mentor, Mike has provided me with invaluable advice in career development and helped me make many hard choices. As I enter the next phase of my professional life, I cannot imagine a better person to have as a colleague.

Joe Hellerstein is one of the most energetic professors I have worked with. He was my advisor for part of my graduate work, and a leader of the TelegraphCQ project I worked on. Joe has always been unconditionally available for a deep technical conversation on any topic, whether it related to a project for a class, a paper for a conference, or just an engineering problem in TelegraphCQ.

Most of my research has centered around the TelegraphCQ and HiFi projects. I am fortunate to have worked on such vibrant projects, and am extremely grateful for the support of outstanding researchers who were also excellent engineers. This thesis would not have been possible without the energies of Owen Cooper, Fred Reiss, Mehul Shah, Sirish Chandrasekaran, Shawn Jeffery, Shariq Rizvi, Anil Edakkunni, Wei Hong, Amol Deshpande, and Sam Madden. Many of these colleagues have become great friends and I treasure these personal relationships. I have also had the pleasure of working with and mentoring Chung Wu and Garrett Jacobson, two exceptional undergraduate students whose hard work I am extremely thankful for.


Thanks are also due to other students in the database research group – Matt Denny, David Liu, Yanlei Diao, Boon Thau Loo, Ryan Huebsch, Alexandra Meliou, Tyson Condie, and David Chu – who have made the last few years extremely satisfying, both intellectually and personally. Soda Hall would have been very boring without other graduate students, and I am glad that I could always count on the company of Yatish Patel, Steve Sinha, Mark Whitney, Sonesh Surana, Manikandan Narayanan, and Karthik Lakshminarayan.

While the path to a Ph.D. is full of twists and turns, the biggest burdens are borne by those close to the scholar. Priya has been my partner, my friend, and my rock through it all. She made incredible sacrifices so that I could focus on my work, and her love and support have been my biggest motivation. Although our parents live halfway around the world, it has always seemed as if they were right by my side egging me on every step of the way. All our siblings – Manju, Kaveri, Latha, Raja, and Sriram – have enriched my life in countless ways. Thank you all for everything.


Dedicated to the memory of Blanco, my beloved companion on the countless long walks where I got most of my best ideas. You are missed and never forgotten.


Contents

1 Introduction
  1.1 Shared Query Processing
  1.2 The Benefits of Sharing
  1.3 Shared Processing: Bad News and Good News
  1.4 An Overview of Sharing Techniques
  1.5 Contributions
  1.6 Summary

2 Related Work
  2.1 Shared query processing
  2.2 Unshared Stream Query Processing
  2.3 Distributed Processing
  2.4 Summary

3 Data Stream Management
  3.1 Static Dataflows
  3.2 Adaptive Dataflows
  3.3 Summary

4 Precision Sharing for Joins with Varying Predicates
  4.1 Introduction
  4.2 Precision Sharing
  4.3 TULIP: Tuple Lineage in Plans
  4.4 Adaptive Precision Sharing
  4.5 Performance of TULIP
  4.6 Performance of CAR
  4.7 Chapter Summary

5 Sharing for Aggregation with Varying Windows
  5.1 Introduction
  5.2 Windowed Aggregation
  5.3 Slicing a stream with Paired Windows
  5.4 Sharing Aggregates with Varying Time Windows
  5.5 Shared Communication
  5.6 Performance study
  5.7 Putting it all together
  5.8 Summary

6 Sharing for Aggregation with Varying Windows and Predicates
  6.1 Introduction
  6.2 Varying Selection Predicates
  6.3 Putting it all together
  6.4 Performance Study
  6.5 Summary

7 Conclusions

Bibliography

A The TelegraphCQ System
  A.1 Challenges
  A.2 System Overview
  A.3 Status and Major Features
  A.4 Summary

B The HiFi System
  B.1 Motivation
  B.2 HiFi Design
  B.3 Summary

Chapter 1

Introduction

Recently, there has been a profusion of applications that provide increased visibility into live data streams. These applications challenge traditional data management systems, where data is stored first and queried later. In response, a new class of data streaming systems that offer query-based interfaces has been developed in order to support applications that monitor, summarize, clean, and raise alerts on data streams. As these applications get more sophisticated, they result in a streaming system having to support large numbers of concurrent queries. The straightforward approach of processing such queries separately can lead to an overuse of system resources, which can in turn cause significant scalability bottlenecks. Since such bottlenecks can severely limit the usefulness of a system, it is vital to design a system that can process large numbers of concurrent queries without overloading its resources.

Shared query processing is an alternative approach that processes large numbers of concurrent queries in a cooperative fashion by sharing system resources among the queries being processed. In this thesis, I develop a suite of shared query processing techniques. I begin this chapter by explaining what shared query processing means in a data management system. Next, I explain why shared processing is an important and challenging problem in data stream management systems. I then describe the specific opportunities that a data streaming system presents for this problem. Finally, I present a roadmap of the specific shared query processing techniques that are developed in this thesis.

1.1 Shared Query Processing

Shared query processing is a technique to share the resources used to process multiple concurrent queries, by exploiting the similarities among these queries. This technique is in contrast to the traditional model where each query is processed separately in an isolated fashion. As explained later (in Section 1.2), shared processing is crucial for enabling a data management system to handle large numbers of concurrent queries. Since this thesis focuses on shared processing in data streaming systems, we now describe the types of queries that execute in such systems, and the kinds of similarities we can expect among them.

At Berkeley, we have built two data streaming systems. The first is a single-site system called TelegraphCQ [Chandrasekaran et al., 2003] that processes streams in a centralized fashion. The second is HiFi [Franklin et al., 2005], a distributed streaming system. HiFi uses a hierarchy of stream processors (e.g., instances of TelegraphCQ) to successively collect and aggregate data from widely distributed sources (e.g., sensors in seismic monitoring networks [Benz et al., 2000]).

A Data Stream Management System (DSMS) like TelegraphCQ or HiFi can process what are called continuous queries. As the name suggests, a continuous query (CQ) is long-lived and continues to run until its creator stops its execution. An application that is built using a DSMS submits one or more CQs to the system, and continually consumes their results in order to take some action. For example, consider a scenario where a financial money manager poses continuous queries over streams of trading data from equities markets (e.g., NYSE). An example of one such query is shown as Query 1.1 below.

Query 1.1: Total $ value of high-volume transactions of stocks whose β > 1.05

SELECT   T.symbol, sum(T.volume * T.price)
FROM     Trades T [RANGE '5 minutes' SLIDE '30 seconds'], Information I
WHERE    (T.symbol = I.symbol) and (T.volume > 100) and (I.beta > 1.05)
GROUP BY T.symbol

This query monitors the total amount of money transacted in high-volume trades (volume > 100) of all stocks whose beta coefficient exceeds 1.05 (beta > 1.05).[1] The query computes this total transaction value over data within a time interval (the last 5 minutes), and updates these results periodically (every 30 seconds). In the query, the time interval and periodicity are specified with RANGE and SLIDE clauses. These clauses are collectively known as window clauses, and are applied to the stream Trades, which represents the trading data from stock markets. This stream is combined with a static table (Information) using the join predicate (T.symbol = I.symbol). In the query, the predicates that involve only the attributes of a single stream or table are known as individual predicates. For instance, the individual predicates in Query 1.1 are (T.volume > 100) and (I.beta > 1.05).

[1] The beta coefficient is the sensitivity of the returns of an asset to that of the market.
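To make the window semantics concrete, the following is a minimal Python sketch of a periodic sliding-window aggregate in the spirit of Query 1.1's RANGE/SLIDE clauses. It is an illustration only (the class name and the brute-force re-summation are mine, not TelegraphCQ's implementation); Chapter 5 shows how partial aggregates avoid rescanning the window.

from collections import deque

class SlidingWindowSum:
    """Toy periodic sliding-window SUM, e.g. RANGE 300s, SLIDE 30s."""

    def __init__(self, range_s, slide_s):
        self.range_s, self.slide_s = range_s, slide_s
        self.window = deque()          # (timestamp, value), oldest first
        self.next_close = slide_s      # time at which the next window closes

    def on_tuple(self, ts, value):
        """Feed one stream tuple; return any (close_time, sum) results due."""
        results = []
        while ts >= self.next_close:   # a window closes at every SLIDE boundary
            close = self.next_close
            # evict tuples older than the RANGE interval [close - range, close)
            while self.window and self.window[0][0] < close - self.range_s:
                self.window.popleft()
            results.append((close, sum(v for _, v in self.window)))
            self.next_close += self.slide_s
        self.window.append((ts, value))
        return results

w = SlidingWindowSum(range_s=300, slide_s=30)
for ts, dollars in [(5, 100.0), (40, 250.0), (70, 75.0)]:
    for close, total in w.on_tuple(ts, dollars):
        print(f"t={close}s: 5-minute total = {total}")   # t=30s: 100.0, t=60s: 350.0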

Application User               Range        Slide        Volume    Beta
Automated algorithmic trader   5 seconds    1 second     (> 0)     (> 1.00)
Institutional money manager    10 minutes   90 seconds   (> 100)   (> 1.05)
Individual day trader          1 hour       10 minutes   (> 500)   (> 1.25)
Individual "value" investors   1 week       1 day        (> 0)     (> 1.50)

Table 1.1: Examples of different parameters for continuous queries

Query 1.1 is an example of a query that is tailored to the requirements of a professional money manager. In general, however, the specific application requirements will govern the various parameters of a given query. In Table 1.1, we show examples of various different application requirements, each of which would result in Query 1.1 being issued with different parameters. Each line of the table shows a specific application requirement and the corresponding values for the RANGE clause, SLIDE clause, volume predicate, and beta predicate of Query 1.1. For instance, the first line of the table shows an example requirement that an automated algorithmic trader might have. Such a trader might need to compute the aggregates over a short interval (5 seconds), update results at even shorter intervals (1 second), and would be interested in considering all trades (volume > 0) of stocks whose beta coefficient exceeds 1.0 (beta > 1.00). Since different applications may all be active at the same time, a DSMS must be able to support large numbers of concurrent continuous queries without violating their performance requirements.

In a traditional data management system, multiple concurrent queries are executed in isolation from each other. Typically, the system will have multiple concurrent threads (or processes) of control, each of which is devoted to processing precisely one query. This query-per-thread approach is used even when the concurrent queries are similar and operate over the same data, as is the case in the financial trading scenario above. Such an approach adversely affects the scalability of a system since it limits the number of queries that can be executed concurrently. With the query-per-thread approach, the scalability problem is exacerbated in a DSMS since CQs are long-lived and always active. In contrast, in traditional systems there is often "think time", i.e., intervals between successive query requests when an application executes business and presentation logic, during which there are fewer active queries competing for system resources.

Shared query processing is an alternative to the query-per-thread model, where a system executes concurrent queries in a cooperative fashion in a single thread, by "sharing" the available processing resources among the different queries. A system that employs this resource sharing approach will typically reuse the results of various data manipulation operations among different queries. Thus, such a system can run the queries corresponding to the requirements in Table 1.1 using a single shared "query plan". This shared query processing approach is the focus of this thesis.

1.2 The Benefits of Sharing

The crux of this thesis is that shared query processing is crucial in a DSMS like TelegraphCQ or HiFi. More specifically, the primary contribution of this thesis is to show how to build a DSMS that can share its computation and communication resources among many concurrent queries, by exploiting the similarities between them. Without such a shared approach, the naïve technique of executing multiple concurrent queries separately can lead to scalability and performance problems, as each additional query can add significant load to the system. Thus, a DSMS that is not built for shared processing from the ground up will likely not be able to satisfy its consumers. In contrast, the techniques developed in this dissertation provide the following two properties:

1. Increased scalability. Shared query processing lets a system support workloads with large numbers of concurrent queries. With the techniques developed in this thesis, more applications can use the system without increasing the investment in hardware infrastructure. For instance, in a single-site system, precision sharing of join queries (see Chapter 4) is shown in experiments to support up to 16 times more queries than previous techniques, with comparable latency in query results.

2. Reduced communication costs. Even with an infrastructure that can handle many concurrent queries adequately (e.g., with low result latency), it is vital to minimize the recurring communication costs of the system. Shared hierarchical aggregation using partial push-down (see Chapter 5) is shown in experiments to halve the recurring bandwidth usage of a non-shared approach.

1.3 Shared Processing: Bad News and Good News

Despite its importance, shared query processing has been a particularly hard problem to solve in data management systems in general. It turns out, however, that data streaming systems offer compelling opportunities for sharing. In this section, we first outline the challenges in shared query processing. Next, we describe the opportunities for shared processing in a streaming system. Finally, we introduce the notion of "on-the-fly sharing" and argue that it is vital for shared query processing.

1.3.1 The Sharing Challenge

There has been significant prior research on shared query processing (reviewed in Section 2.1), under the rubric of Multiple Query Optimization (MQO) (e.g., [Sellis, 1988; Roy et al., 2000]). These existing techniques statically analyze a fixed set of queries in order to find a global shared plan that represents an optimal execution strategy. Such a "compile-time" approach has the following two disadvantages:

1. Complexity of Analysis. A static MQO approach that attempts to find an efficient strategy for a basket of queries can be prohibitively expensive, especially for some of the cases that we consider. For example, in Section 5.4.1 we show that it can be very expensive (in space) to even represent a common windowed aggregate sub-query whose results can be used to process multiple concurrent queries. In addition, this cost of analysis is exacerbated when the number of queries is very large.

2. Dynamic Environment. In a real-world environment, queries are typically added and removed in an ad hoc fashion. A static approach would require expensive recompilation for each such event, making it impractical.

These disadvantages are the reasons why current static MQO techniques are infeasible in real-world systems. In contrast, the techniques developed in this thesis permit on-the-fly sharing, where queries can be added to and removed from a streaming system, as explained in more detail in Section 1.3.3. These on-the-fly techniques are feasible because they enable incremental MQO as new queries are added to the system. Incremental optimization is a powerful concept and a major contribution of this thesis, as it can enable the benefits of MQO without any complex a priori analysis of the queries being processed (e.g., Section 6.2.3).

1.3.2 The Sharing Opportunity

We now describe why data stream management offers a compelling opportunity for shared query processing. It turns out that sharing is inherently more feasible in a DSMS than in a traditional non-streaming system. The increased feasibility of sharing in a DSMS does not depend on the specific sharing techniques used. Instead, the sharing opportunity arises because of the following aspects of streaming systems, and the applications that use them:

1. Long-lived queries. Continuous queries are long-lived and can be profitably shared as and when they arrive, provided there exists an appropriate incremental sharing technique. In contrast, a typical query in a non-streaming system does not last long, and the system cannot reasonably delay its execution on the off chance that a new query with similar requirements will come along in the near future.

2. Applications query the future. Since continuous queries are often (but not exclusively) interested in future data, a new query can "hop on" at any time and join the processing of an existing query without having to individually process data already considered by the existing query.

3. Transaction isolation. Streaming data is generally append-only, without updates. Thus individual queries need not be transactionally isolated from each other, making it easier to share their processing.

1.3.3 On-the-fly Sharing

We end this section by introducing the idea of on-the-fly sharing, which is a unifying theme for the techniques developed in this thesis. As explained earlier in this section, the static approach has been the key factor limiting the real-world efficacy of prior work on shared query processing. With a static approach, when a new query is added to the system, an expensive recompilation process is required to produce a new shared plan. Furthermore, the state accumulated in the already executing queries makes an "in-flight" change of shared plans very complicated and intrusive. In contrast, a key theme across all the techniques of this thesis is that shared query processing is most effective with an incremental approach that lets queries be added and removed on-the-fly in a relatively inexpensive and unintrusive fashion, while still providing the scalability benefits of MQO.

This idea of on-the-fly sharing has been used before in non-streaming, as well as streaming, contexts. As an example of the former, the data warehouse engine from Red Brick Systems used a shared scan operator whose results could be used by multiple different queries, as reported in [Fernandez, 1994]. More recently, in the context of stream processing, systems like CACQ [Madden et al., 2002b] and PSoup [Chandrasekaran and Franklin, 2002] also permit on-the-fly sharing for join queries. In this thesis, the use of on-the-fly sharing is a core component of every sharing technique developed. In fact, an important contribution of this work is the demonstration that on-the-fly sharing is widely applicable, especially for aggregate queries.

1.4 An Overview of Sharing Techniques

Having described the general concept of on-the-fly sharing for a data streaming system, we now present an overview of the three main shared query processing techniques that comprise this thesis. Before we describe each of these techniques, we outline a plan for attacking the shared query processing problem.

1.4.1 A plan of attack

The main goal of this thesis is to develop techniques to share the processing of queries that have the same basic form as Query 1.1, with the sorts of variations listed in Table 1.1. We tackle this high-level shared processing problem by carving it into a set of smaller sub-problems that we can solve separately. In each sub-problem, we consider how to share multiple queries that are identical except for variations in one or more aspects. We now briefly describe each of the sub-problems that are solved as part of this thesis.

1. Varying predicates (Joins): We investigate (in Chapter 4) how to share multiple join queries that do not compute aggregates and are identical except for their individual predicates. These queries are similar to Query 1.1, except that they do not compute any grouping or aggregate expressions.

2. Varying predicates (Aggregates): We address (in Chapter 6) how to share multiple single-stream input aggregate queries that are identical except for their individual predicates. These queries are similar to Query 1.1, except that they do not compute any join expressions.

3. Varying windows (Aggregates): We investigate (in Chapter 5) how to share multiple single-stream input aggregate queries that are identical except for their window clauses. These queries are similar to Query 1.1, except that they do not compute any join expressions.

4. Varying windows and predicates (Joins and Aggregates): Here we study (in Chapter 6) how to share multiple queries that have the same form as Query 1.1, and are identical except for their window clauses and individual predicates.

Clearly the aforementioned sub-problems do not constitute an exhaustive list that is sufficient to solve the shared stream query processing problem in its entirety. The survey of related work in Section 2.1.2 describes other important sub-problems that have been tackled in the literature. In this thesis we focus on sub-problems involving queries with join and aggregate operations, and with varying windows and predicates, for the following reasons. In terms of operations, we chose joins and aggregates because they are important building blocks that can be used in a very wide range of queries. In a streaming context in particular, joins are a key mechanism to correlate multiple streams of data, and aggregates are vital in summarizing high-volume streams. Similarly, in terms of variations we chose predicates and windows because a predicate is the primary way for a user to isolate data items of interest in a query, and windows are a fundamental means to restrict a potentially unbounded data stream to a finite subset that can be summarized.

Now that we have defined the sub-problems that are at the core of this thesis, we provide a brief summary of the actual techniques developed to solve them.

1.4.2 Precision Sharing for Joins with Varying Predicates

The first technique developed in this thesis arises in the context of a study of the fundamental trade-offs in shared query processing for both streaming and non-streaming systems. As originally proposed for non-streaming systems in [Sellis, 1988], the goal of sharing is "to limit the redundancy due to accessing the same data multiple times in different queries." Systems that use sharing try to limit redundancy by avoiding repeated work. Repeated work is caused by applying the same operation to a given data item multiple times. In the process of avoiding repeated work, however, these existing systems often perform unnecessary work. Although this tension has thus far gone unnoticed in the simpler cases considered by earlier papers, it is a source of significant performance problems when existing sharing schemes are used in more general, complex cases. In this thesis we show that this tension is not irreconcilable even in complex cases. Towards this end, the notion of precision sharing is defined in Chapter 4, and can be used to characterize any shared query processing scheme. When sharing is precise, it is possible to avoid the overheads of repeated work as well as those of unnecessary work. Armed with the concept of precision sharing, this thesis develops and evaluates techniques that enable precision sharing in data streaming systems.

Precision sharing imposes two requirements on a shared processing scheme. First, the scheme must perform no repeated work. Second, the scheme must never generate any unnecessary tuples, or zombies, that are not required for any query in the system. The latter requirement exists because the production, and subsequent removal, of zombies is wasteful. In other words, avoiding repeated work must not come at the cost of wasteful work. While this definition of precision sharing leads to a generic characterization of any sharing scheme, the actual techniques developed in Chapter 4 are focused on queries that join multiple streams and apply different individual predicates on each stream. It is exactly such join queries that produce zombies as a side-effect. This side-effect has previously gone undetected because earlier work only considered queries with varying predicates on exactly one of the streams being joined, and sharing such queries can easily avoid repeated work without producing zombies. The techniques developed for precision sharing in Chapter 4 lead to increased scalability in a DSMS. For instance, a system that uses these techniques is shown in experiments to support up to 16 times more queries than one that uses existing techniques, with comparable latency in query results.
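To make the zombie problem concrete, here is a small Python sketch of a naively shared join (my illustration, not the TULIP mechanism of Chapter 4): the symbol-equality join is computed once for all queries, and each query's individual predicates are applied afterwards. A joined tuple whose owner list is empty is a zombie; the join work spent producing it benefits no query.

# Two queries share one join but differ in their individual predicates.
queries = {
    "Q1": (lambda r: r["volume"] > 100, lambda s: s["beta"] > 1.05),
    "Q2": (lambda r: r["volume"] > 500, lambda s: s["beta"] > 1.25),
}

def naively_shared_join(r_tuples, s_tuples):
    """Join first, filter per query afterwards; yields (r, s, owners)."""
    for r in r_tuples:
        for s in s_tuples:
            if r["symbol"] == s["symbol"]:
                owners = [q for q, (pr, ps) in queries.items()
                          if pr(r) and ps(s)]
                yield r, s, owners          # owners == [] means a zombie

r_stream = [{"symbol": "ACME", "volume": 50}]
s_stream = [{"symbol": "ACME", "beta": 2.00}]
for r, s, owners in naively_shared_join(r_stream, s_stream):
    print(owners if owners else "zombie")   # volume=50 fails both -> zombie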

1.4.3 Shared Aggregation with Varying Windows

The second technique developed in this thesis arises in the context of streaming applications that deal with large volumes of data. While such applications often use join queries in an initial pre-processing step, they typically summarize and analyze these high-volume data streams by using aggregate queries as their main data processing operation. Sharing such aggregate queries is particularly challenging when the input data streams are produced by widely distributed data sources. Aggregate queries that operate over data streams (e.g., Query 1.1) typically involve the use of window clauses to restrict the amount of data that is summarized. An investigation of how to efficiently process multiple aggregate queries that have varying window clauses is presented in Chapter 5. This chapter first considers how to share computation resources in a single-site system like TelegraphCQ. A new technique, called Shared Time Slices (STS), is developed in order to share the processing of aggregate queries with varying windows. The STS approach chops an input stream into non-overlapping sets of contiguous tuples (called slices) that can be combined to form partial aggregates, which can in turn be aggregated to answer each query.
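The slicing idea can be pictured with a deliberately simplified Python sketch. It assumes (for illustration only) one fixed slice width that evenly divides every query's window; the actual STS technique of Chapter 5 derives slice boundaries from the queries' RANGE and SLIDE clauses using paired windows.

SLICE = 10                            # slice width in seconds (assumed common)
QUERIES = {"q30s": 30, "q60s": 60}    # query name -> RANGE in seconds
slices = []                           # one partial SUM per closed slice

def close_slice(partial_sum):
    """Append one slice's partial aggregate and answer every query from it."""
    slices.append(partial_sum)
    # each query combines just the partials its window covers
    return {name: sum(slices[-(rng // SLICE):]) for name, rng in QUERIES.items()}

for partial in [5.0, 7.0, 1.0, 2.0, 4.0, 6.0, 3.0]:
    print(close_slice(partial))       # both queries answered from shared slices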


We then turn to situations where the data sources are widely distributed and managed by a hierarchical system like HiFi. Such a system is often built from a heterogeneous spectrum of nodes, ranging from tiny sensors streaming data from the edges of the network to "beefy" servers with more computation resources higher in the hierarchy. In such environments, it is very important to keep bandwidth consumption low. For example, low consumption can reduce operating costs in a wired network with usage-based pricing, and can increase battery life in a wireless sensor network. In order to share the communication resources of such a system while processing multiple windowed aggregate queries, the key question to address is where aggregation takes place. While aggressive push-down of aggregates permits shared computation at the edges, the push-down strategy does not allow the resources of the rest of the system to be shared. In contrast, while pulling up aggregation lets communication resources be shared, the pull-up strategy also results in moving large volumes of data through the system. Thus, both the push-down and pull-up approaches are unsuitable for a hierarchical streaming system. Instead, a new technique called Partial Push-Down (PPD) is developed that permits effective sharing across the HiFi resource hierarchy. This technique first extracts the non-overlapping parts of each query. Next, it composes these non-overlapping parts to form a common sub-query. Finally, it pulls up the overlapping parts of each query and pushes down the non-overlapping common sub-query.

The STS and PPD techniques described above can lead to increased scalability and decreased communication costs in a data streaming system. For instance, a system that uses STS is shown in experiments to support up to 8 times more queries than one that uses existing techniques, with comparable latency. In addition, a hierarchical system that uses PPD is shown in experiments to require only half the bandwidth used by an unshared approach.
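The bandwidth intuition behind partial push-down can also be sketched, again under the simplifying assumption of a single common slice width (the real PPD technique derives the pushed-down common sub-query from the queries' actual window clauses):

SLICE = 10   # seconds; assumed to evenly divide both queries' windows

def edge(readings):
    """Edge node: collapse raw (ts, value) readings into per-slice partial sums."""
    partials = {}
    for ts, v in readings:
        partials[ts // SLICE] = partials.get(ts // SLICE, 0.0) + v
    return partials                    # only this dict crosses the network

def root(partials, range_s, now):
    """Root node: answer one query from the slices its window covers."""
    first = (now - range_s) // SLICE
    return sum(v for sl, v in partials.items() if first <= sl < now // SLICE)

readings = [(t, 1.0) for t in range(60)]      # 60 raw tuples at the edge...
partials = edge(readings)                     # ...become 6 shipped partials
print(len(readings), "->", len(partials))     # 60 -> 6
print(root(partials, 30, 60), root(partials, 60, 60))   # 30.0 60.0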

1.4.4 Shared Aggregation with Varying Predicates and Windows

The third technique developed in this thesis arises in the context of shared processing of aggregate queries that have varying predicates as well as windows. Sharing aggregate queries with varying predicates is a particularly hard problem that has, thus far, not been addressed in the MQO literature, despite being important even in a traditional non-streaming context. Furthermore, sharing aggregate queries that vary in more than one dimension (e.g., predicates and windows) has not been attempted before.

Chapter 6 presents a staged approach to sharing the processing of queries with varying predicates and windows. First, the STS approach described above is used for queries with identical predicates and varying windows. Next, for queries with identical windows and varying predicates, a technique called Shared Data Fragments (SDF) is developed. This technique divides a stream into disjoint groups of tuples (called fragments) where all tuples in a fragment behave identically with respect to the predicates of the queries. Finally, for cases where both windows and predicates can vary, a method called Shared Data Shards (SDS) is developed that effectively combines STS and SDF.

A potential danger with the SDF approach is the possible exponential growth in the number of fragments, with very few tuples in each fragment. This could lead to severe degradation of performance, because of the significant overhead of managing the fragments without any attendant benefits. This situation is addressed by developing a partitioning technique that organizes a set of queries into smaller subsets, each of which can be shared with the techniques described above. The SDS technique can lead to significant scalability and performance improvements. For instance, a system that uses the SDS technique was shown in experiments to support up to 8 times more queries than one that uses existing techniques.
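To give the flavor of Shared Data Fragments, here is a small sketch (my simplification, not the Chapter 6 implementation). Each tuple gets a signature recording which queries' predicates it satisfies; tuples with the same signature form a fragment that is aggregated once, and each query combines the partials of every fragment whose signature includes it.

from collections import defaultdict

PREDS = {   # per-query selection predicates over the same stream (illustrative)
    "q1": lambda t: t["volume"] > 100,
    "q2": lambda t: t["beta"] > 1.05,
}
fragments = defaultdict(float)   # signature (set of queries) -> partial SUM

def on_tuple(t):
    sig = frozenset(q for q, p in PREDS.items() if p(t))
    if sig:                                    # tuples matching no query are dropped
        fragments[sig] += t["volume"] * t["price"]

def answers():
    out = defaultdict(float)
    for sig, partial in fragments.items():     # each fragment aggregated once...
        for q in sig:
            out[q] += partial                  # ...and credited to all its queries
    return dict(out)

on_tuple({"volume": 200, "beta": 1.10, "price": 10.0})   # satisfies q1 and q2
on_tuple({"volume": 50,  "beta": 1.20, "price": 20.0})   # satisfies q2 only
print(answers())                                          # {'q1': 2000.0, 'q2': 3000.0}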

1.5 Contributions

To summarize, the following are the main contributions of this dissertation:

• First, I identify the importance of shared query processing as a key element of a data stream management system. I motivate the need for new sharing techniques that enable streaming systems to support large numbers of concurrent queries. A unifying theme across these techniques is that they all exploit the idea of on-the-fly sharing, where queries can be added to and removed from the system in an ad hoc manner. I argue that such an incremental approach is essential to make shared query processing practical in real-world systems.

• Second, I propose Precision Sharing as a way to characterize any shared processing scheme that can reconcile the tension between avoiding repeated work and performing unnecessary work (Chapter 4). In particular, I develop precisely shared techniques to process join queries with varying individual predicates.

• Third, I propose techniques to share computation and communication resources in processing aggregate queries with varying windows in a hierarchy of stream processors (Chapter 5). The computation and communication resources are shared using the Shared Time Slices and Partial Push-Down approaches, respectively.

• Fourth, I propose techniques to share computation resources in processing aggregate queries with varying predicates and windows (Chapter 6). I develop the Shared Data Fragments technique for queries where only predicates vary, and the Shared Data Shards technique for query workloads where both predicates and windows vary.

1.6 Summary

A new class of monitoring applications is being enabled by the declarative interfaces offered by modern Data Stream Management Systems. These applications are typically used to sense, analyze, and respond to data streams generated by different kinds of data sources. As these applications proliferate, it will become increasingly vital that data streaming systems be able to execute large numbers of concurrent queries. For this reason, a DSMS must be able to share the resources used in processing multiple concurrent queries.

In this chapter I have outlined the importance of sharing, and argued why the traditional static, compile-time approaches of batch-mode shared processing have had limited success. This thesis offers an alternative prescription in the form of on-the-fly sharing, which does not suffer from the disadvantages of the static approaches. I develop three sharing techniques that embody this on-the-fly model in order to share different kinds of streaming queries.

The remainder of this dissertation describes these solutions in greater detail and places them in the context of related work in the query processing research community. Chapter 2 presents a broad survey of prior work related to the contributions of this thesis. Chapter 3 drills down into the core area of shared stream processing to provide the context for the specific techniques developed in this thesis. Chapter 4 describes precision sharing for join queries with varying predicates. Chapter 5 addresses sharing aggregate queries with varying windows in both single-site and distributed systems. Chapter 6 describes techniques to share aggregate queries with both varying windows and varying predicates. Finally, concluding remarks and an outlook on future work are presented in Chapter 7.

Chapter 2

Related Work

In this chapter I survey previous research related to this thesis. The work in this thesis is at the intersection of shared query processing, stream processing, and distributed systems. A high-level overview of the literature in these areas is shown in Figure 2.1. In the figure, each area is labeled with representative projects.

[Figure 2.1: Related work. A Venn-style diagram of three overlapping areas, each labeled with representative projects: Sharing (MQO, MatView); Streaming (Tapestry, Tribeca, Aurora); Distributed (Garlic, Astrolabe, IrisNet); Sharing + Streaming (NiagaraCQ, STREAM, TelegraphCQ); Sharing + Distributed (DBCache, Semantic Cache); Streaming + Distributed (TAG, Borealis); Sharing + Streaming + Distributed (HiFi*).]

In what follows, I present a classification of different approaches for each major area using a figure and a textual description. In these figures, entries marked with an asterisk represent contributions of this thesis.

2.1 Shared query processing

As stated in Section 1.1, shared query processing enables a data management system to execute multiple concurrent queries in a cooperative fashion. In contrast, traditional approaches execute each query separately. The aim of shared processing is to increase the number of queries that a system can run concurrently, by sharing its resources among all running queries.

[Figure 2.2: Related work on shared query processing. A taxonomy tree: Shared Processing divides into Non-Streaming and Streaming. Non-Streaming covers Single-site (MQO: e.g., Sellis; MatView: Agrawal et al.) and Distributed (DBCache, Semantic Cache). Streaming covers Semi-structured (XFilter, YFilter, XMLTK) and Structured; Structured divides into Joins+Selections (NiagaraCQ (Chen et al.), CACQ, PSoup, Hammad et al., Precision Sharing*) and Aggregation, which divides into Varying Groups (Srivastava et al.), Varying Windows (Non-periodic: PSoup, Arasu et al.; Periodic: Shared Slices*), and Varying Predicates (Shared Fragments*).]

A taxonomy of shared query processing is shown in Figure 2.2. At a high level, existing work on shared processing can be classified as being part of either streaming or non-streaming systems. Each of these is addressed in turn below.

2.1.1 Non-streaming shared query processing

The earliest work in shared query processing was done for non-streaming systems. In general, the approach taken in this work is to statically compile a group of queries into an optimal shared query plan. As explained earlier in Chapter 1, this static "compile-time" approach suffers from two disadvantages. First, the analysis can be extremely expensive, especially when the number of queries is very large. Second, it is very difficult to use a static approach in systems where queries are being added to and removed from the system in an ad hoc fashion. Such changes result in frequent and expensive reoptimization, and "in-flight" changes of query plans are complicated by the state accumulated in the operators of the shared plan. In the rest of this section we discuss some specific important contributions from this literature. We first cover single-site systems, and then discuss distributed systems.

2.1.1.1 Single-site systems

In his seminal paper on Multiple Query Optimization (MQO) [Sellis, 1988], Sellis described techniques to form shared query plans from individual query plans. In particular, Sellis considered two main alternatives. The first is a two-phase approach that finds optimal plans for each individual query, and combines these locally optimal plans to form a shared plan that is not necessarily globally optimal. The second directly compiles the set of queries into a globally optimal shared plan in a single phase. The main insight from this work is that, while the second approach can lead to a better quality plan, the first approach is easier to implement, because it can use existing single-query optimizers and then merge the resulting plans by looking for common sub-queries. In contrast, the second approach requires the development of sophisticated global optimization techniques that are also more expensive to execute.

A significant consequence of the work by Sellis has been several efforts to develop heuristics for single-phase multiple query optimization. For example, one important contribution in this area is [Roy et al., 2000], which presents a heuristic MQO technique that can be easily added to a traditional single-query optimizer like Volcano [Graefe and McKenna, 1991]. Such heuristic techniques are beginning to mitigate the first disadvantage of static MQO, i.e., the expensive nature of global query optimization. They do not, however, help with the second disadvantage of this approach, i.e., the expense of recompilation and of "in-flight" changes to a global shared plan when queries join and leave the system in an unpredictable fashion.

Unfortunately, a static approach that optimizes a fixed set of queries is not feasible in a real-world scenario where queries join and leave the system in an ad hoc fashion. The static approach cannot be used in general, because a system cannot delay the execution of a query indefinitely on the off chance that another similar query, with which the first query can be profitably shared, will be submitted very soon. Thus, despite nearly two decades of research on shared query processing in non-streaming contexts, there has been only limited use of MQO technology in most real-world systems.

A situation where the static MQO approach does work well is in decision support systems like data warehouses. These online analytic processing (OLAP) systems are used to run highly complex queries that are typically long-running. In such systems, it is vital to define the right materialized views in order to reduce the processing time for queries. The importance of defining the right materialized views has led to research on wizards [Mistry et al., 2001; Agrawal et al., 2000; Zilio et al., 2004]

that essentially solve an MQO problem in order to find the materialized views that represent common sub-queries for the queries in the workload.

2.1.1.2 Distributed systems

Although there has been significant research work on distributed data management systems (see Section 2.3 for a brief survey), only a small part of it involves shared query processing, and we mention this work here only for completeness. One instance of MQO in distributed contexts comes out of research in systems that use query caching. In such systems [Dar et al., 1996; Altinel et al., 2003], applications talk to a remote database server through a client-side cache. The cache runs the queries on a remote server, keeps a local copy of the results of the queries, and returns the results to applications. When a new query is sent to the cache, it checks to see if it can be answered using the cached results of earlier queries. The techniques used in the cache are very similar to those developed for materialized views and MQO. A query whose results are reused is akin to a shared common sub-query. Although such systems are designed in a distributed fashion, they mainly help in reducing overall computational costs. If the cache and application are colocated, however, the technique can also help in reducing communication costs.
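As a much-simplified illustration of this kind of result reuse, consider a client-side cache that answers a new range query locally whenever its predicate is subsumed by that of a previously cached query. The names and the one-dimensional subsumption test below are mine, not the actual DBCache or semantic-cache designs.

cache = {}   # (lo, hi) -> rows previously fetched for that range predicate

def query(lo, hi, fetch_remote):
    """Answer 'SELECT ... WHERE lo <= x AND x <= hi', reusing cached results."""
    for (clo, chi), rows in cache.items():
        if clo <= lo and hi <= chi:                      # subsumed by a cached query
            return [r for r in rows if lo <= r <= hi]    # answered locally
    rows = fetch_remote(lo, hi)                          # miss: run on the server
    cache[(lo, hi)] = rows
    return rows

remote = lambda lo, hi: [x for x in range(100) if lo <= x <= hi]
query(10, 50, remote)            # fetched remotely and cached
print(query(20, 30, remote))     # subsumed by (10, 50): served from the cache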

2.1.2 Streaming shared query processing

In contrast to sharing in non-streaming systems, which is mostly based on finding common sub-queries, sharing in streaming systems has generally been accomplished by “indexing” potentially large sets of queries to efficiently process incoming data. This technique has been applied to systems that deal with semi-structured data, like XML documents, as well as those that deal with structured data, like relational set-oriented tuples.

The work in this dissertation focuses on structured data; sharing in systems dealing with semi-structured data is mentioned only for completeness. Examples of such systems that deal with semi-structured data, and implement some form of shared query processing, include XFilter [Altinel and Franklin, 2000], YFilter [Diao et al., 2003], and XMLTK [Green et al., 2004]. Such systems generally process queries that perform filter and transform operations on XML documents. These operations typically address portions of a document, can be fairly complicated, and are generally expressed using a language like XPath [Clark and DeRose, 1999]. In addition, since these queries typically operate over exactly one data item at a time, they can be processed in a stateless fashion without buffering any data.

In contrast, systems that deal with structured data streams process queries that perform filter, join, and aggregation operations over windows, i.e., subsets of data items, of a stream. While the filters in these queries are fairly simple and can generally be expressed as a function of the attributes of a single data tuple, the join and aggregation operations are typically much more complicated and, in general, must be processed in a stateful manner. As described in Section 1.1, the window clauses that are used in join and aggregate queries generally define a sliding window of contiguous tuples over a data stream. A sliding window is specified with two parameters: a range, which is the interval of interest over which the aggregate is computed, and an optional slide, which is a periodic interval controlling when results are reported. A window with a slide is called periodic, and one without a slide is called aperiodic. With an aperiodic window, the system must produce the latest query results "on demand" at any time a client application requests them. Aperiodic queries are most useful for disconnected clients that check in with a remote server from time to time.

Shared processing of join queries with varying individual predicates and window specifications has been studied extensively in streaming systems. For instance, [Hammad et al., 2003] showed how to share the processing of windowed join queries that have varying window specifications. Work that focuses on shared processing of join queries with varying individual predicates includes NiagaraCQ [Chen et al., 2000; Chen et al., 2002] and CACQ [Madden et al., 2002b]. The NiagaraCQ and CACQ papers restrict their focus to join queries where the individual predicates in every query involve the attributes of an identical stream. That is, if the individual predicates in one query involve the attributes of a particular stream, then the individual predicates of every other query being shared must also involve the attributes of the same stream. In contrast, Chapter 4 of this thesis considers shared processing of join queries without this restriction.

These techniques for processing joins do not share any aggregate operations among queries. Aggregates are, however, vital in processing high-volume streams such as financial trading data from the equities markets. All the existing approaches for shared processing of aggregate queries assume that the aggregate functions computed are distributive (e.g., max, min, sum, and count) or algebraic (e.g., avg) [Gray et al., 1996] and can thus be evaluated in constant state using partial aggregates over disjoint partitions of the data.[1] The techniques developed in this dissertation make the same assumption; a small sketch of such partial aggregation appears after the list below. Furthermore, shared processing of streaming aggregate queries has to consider variations in three aspects of each query: the window expressions, the grouping parameters, and the predicate filters. Existing sharing techniques require that all the queries being shared are identical except for variations in precisely one of these aspects. We now consider each of these variations in turn.

[1] The functions used for the partial aggregates can, in general, be different from those for the overall aggregate, as is discussed further in Section 5.3.

1. Varying Groups. The work reported in [Srivastava et al., 2005] showed how to share aggregate queries that only differ in their grouping parameters. This approach was aimed at a memory-limited stream processor such as Gigascope [Cranor et al., 2003], and reuses techniques from [Harinarayan et al., 1996; Deshpande et al., 1998] that were developed for efficient processing of aggregate queries in OLAP systems.

2. Varying Windows. The work reported in [Chandrasekaran and Franklin, 2002] and [Arasu and Widom, 2004] explores shared processing of aggregates with varying aperiodic windows. Here, the model is for clients that only intermittently connect to the server to fetch the latest aggregate value on demand. This aperiodic window technique is not, however, a complete solution, for the following reasons:

• First, the techniques for shared aggregation with aperiodic windows do not exploit scenarios where it is preferable for the system to push results to an application at well-defined, periodic intervals. Such scenarios are common when the results of a query feed a visualizer, or perhaps an enterprise messaging architecture. The aperiodic window technique can still be used by programming client applications to poll the system at periodic intervals. Unfortunately, with this approach the system incurs heavy space and time overheads, since it does not take advantage of the periodic nature of these poll requests.

• Second, the techniques for aperiodic shared aggregation are best used as the final operator in a dataflow, and are not, in general, suitable for an upstream view that produces results used in other downstream queries that interact with on demand client requests. When building complex dataflows that compose multiple queries, however, it is the upstream queries that are likely less specialized and have greater sharing opportunities. We now present situations where an upstream aperiodic shared aggregate feeds a downstream query. In these examples, the on demand shared aggregation techniques are either suitable or unsuitable for the upstream view.

(a) On demand sharing is suitable. If the downstream query only applies a predicate on its input stream, then an on demand client request of the downstream query can be satisfied by making a recursive on demand request to the upstream view.

(b) On demand sharing is not suitable. If the downstream query also computes an aperiodic streamed aggregate, then the downstream query requires a continuous stream of inputs (representing aggregates computed by the upstream view), and so the upstream view must produce a stream of output tuples for each and every input tuple it receives, irrespective of any client requests on the downstream query. In other words, the upstream shared aggregate view must continuously compute its results and cannot take advantage of the fact that client requests on the downstream query are intermittent.

Chapter 5 of this dissertation examines the problem of sharing aggregate queries that can have varying periodic windows. The solution developed for that problem is called Shared Time Slices.

3. Varying Predicates. The problem of sharing streamed aggregate queries that have varying selection predicates has not previously been addressed. The closest piece of work in this regard is for non-streaming systems in [Deshpande et al., 1998], with the additional restriction that the predicates must exclusively be over grouping attributes. Chapter 6 of this dissertation addresses this problem of sharing aggregate queries with varying selection predicates, where a new technique called Shared Data Fragments is developed. In addition, that chapter also considers the shared aggregation problem when queries have varying predicates and windows.
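Returning to the constant-state assumption above, the sketch promised earlier shows why distributive and algebraic aggregates compose from partial aggregates over disjoint partitions. Note that the per-partition state (here a (sum, count, min, max) tuple of my choosing) can differ from the overall function, as the footnote observes.

def partial(chunk):
    """Constant-size state for one disjoint partition of the data."""
    return (sum(chunk), len(chunk), min(chunk), max(chunk))

def combine(partials):
    s = sum(p[0] for p in partials)    # SUM is distributive
    n = sum(p[1] for p in partials)    # COUNT is distributive
    return {"sum": s, "count": n,
            "min": min(p[2] for p in partials),   # MIN is distributive
            "max": max(p[3] for p in partials),   # MAX is distributive
            "avg": s / n}                         # AVG is algebraic: sum/count

print(combine([partial([1, 2, 3]), partial([10, 20])]))
# {'sum': 36, 'count': 5, 'min': 1, 'max': 20, 'avg': 7.2}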

To summarize, the work reviewed in this section is at the confluence of shared and streamed query processing. The bulk of the contributions of this thesis fall under the same category. While this related work section provides a backdrop to the problem of sharing in streaming contexts, additional details on specific techniques are presented in the subsequent chapters of the thesis.

2.2 Unshared Stream Query Processing

In the previous section, we reviewed prior work related to the core contributions of this thesis, i.e., shared query processing. Since this thesis is rooted in a streaming context, we now briefly survey the literature on stream query processing that does not deal with sharing.

[Figure 2.3: Related work on stream query processing. Systems: Tapestry, Tribeca, STREAM, TelegraphCQ, Aurora, Gigascope, StreamMill. Languages: TQL, CQL, GSQL, ESL. Techniques: joins (MJoin [Viglas et al.], SteM [Raman et al.], Golab et al.), aggregates (Arasu et al., Li et al., paired windows), approximation (TelegraphCQ [Reiss et al.], Aurora [Tatbul et al.], STREAM [Babcock et al.]).]

Figure 2.3 classifies this body of research into three broad areas: research systems, query languages, and specific techniques.


We begin this section by describing the early stream processing systems. Next, we provide a brief overview of some of the many modern CQ systems developed in the research community. Finally, we describe some important unshared query processing techniques from this body of work. Note that we initially focus on single-site systems. Distributed streaming systems are described subsequently in Section 2.3.

2.2.1 First Generation Systems

We begin by introducing the Tapestry [Terry et al., 1992] and Tribeca [Sullivan, 1996] systems, two projects that made seminal contributions in the field of processing continuous queries over data streams. Both can be thought of as “first-generation CQ systems” that served as precursors to the data stream management projects that have recently gained prominence in the research community.

Continuous queries (CQ) were first proposed in the Tapestry system for running filtering queries over append-only databases such as mail and news databases. In Tapestry, queries are written in a SQL-like language called TQL, and have continuous semantics that are defined as follows: “the result of a continuous query is the set of data that would be returned if the query were executed at every instant in time.” In other words, evaluating a TQL query against a database can be thought of as a two-step process. In the first step, a one-time SQL query is executed over a snapshot of the database at every time instant. In the second step, the results of all the one-time queries are merged using set union. For performance reasons, a continuous TQL query is rewritten into an incremental query that only needs to examine new data.
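To make these semantics concrete, the following is a minimal Python sketch (the function names and the toy mail database are illustrative, not drawn from Tapestry itself) that contrasts the definitional snapshot-union evaluation with the incremental rewrite over append-only data.

def continuous_semantics(snapshots, predicate):
    # Definitional view: run the one-time query over the snapshot at
    # every instant, then merge all the results with set union.
    result = set()
    for snapshot in snapshots:
        result |= {t for t in snapshot if predicate(t)}
    return result

def incremental_rewrite(snapshots, predicate):
    # Equivalent rewrite for append-only data: at each instant, examine
    # only the tuples appended since the previous snapshot.
    result, seen = set(), 0
    for snapshot in snapshots:
        result |= {t for t in snapshot[seen:] if predicate(t)}
        seen = len(snapshot)
    return result

# An append-only mail database observed at three instants; the query
# filters for subjects containing "urgent".
db = [[("bob", "urgent: server down")],
      [("bob", "urgent: server down"), ("ann", "lunch?")],
      [("bob", "urgent: server down"), ("ann", "lunch?"),
       ("cal", "urgent: fix")]]
pred = lambda t: "urgent" in t[1]
assert continuous_semantics(db, pred) == incremental_rewrite(db, pred)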


Tribeca is another early CQ system; it also provided a query language [Sullivan and Heybey, 1998], and was built for applications that monitor network logs. A major contribution of Tribeca was the identification of a set of stream processing operators. The Tribeca system converts a query into a network of unary dataflow operators that each take precisely one stream as an input, and produce one or more output streams.

The query languages in Tapestry and Tribeca both have shortcomings. The Tapestry language (TQL) relies on snapshot semantics and supports neither aggregates nor sliding window join (see Section 2.1.2) operations. In contrast, while the Tribeca query language does support sliding window operations, it does not support join operations since Tribeca only supports unary streaming operators. Also, the Tribeca operators cannot deal with historical data. Despite these shortcomings, both Tapestry and Tribeca are extremely significant projects and were the intellectual forebears of many subsequent research projects. We explore some of these projects next.

2.2.2 Next Generation Systems

Over the last 5 years, there have been a number of “next-generation” CQ projects that have gained prominence in the data management research community. We now briefly describe some of these CQ systems.

1. Aurora. In the Aurora system [Carney et al., 2002], users build query plans directly from operators rather than through declarative queries. In practice, one way to accomplish this is with a Graphical User Interface (GUI) tool. Like Tribeca, a major focus in Aurora is to define an appropriate set of stream processing operators that, when used together, are powerful and expressive. A key contribution of the Aurora project was its focus on “load shedding” [Tatbul, 2003], where the system drops a fraction of input tuples under high load, as is the case when there are sudden bursts of input data.

2. STREAM. In the STREAM system [Motwani et al., 2003], queries are submitted in CQL (Continuous Query Language) [Arasu et al., 2006], a new query language that has gained significant traction in the research community. The CQL language is based on “black box” mappings between streams and relations. A CQL query operates on streams and relations to produce streams. More precisely, a CQL query is defined with up to three mappings: a window specification derived from SQL:1999 [Melton and Simon, 2002] that is used to map portions of streams to relations, SQL itself to map relations to relations, and new operators to map relations to streams (a minimal sketch of these mappings appears after this list). A major focus of the STREAM project was minimizing the memory profile [Babcock et al., 2003] of a streaming dataflow.

3. TelegraphCQ. The TelegraphCQ system [Chandrasekaran et al., 2003] also uses the CQL query language and permits queries that involve streams and relations. Its focus was on shared (see Section 2.1) and adaptive query processing. A good overview of adaptive query processing can be found in [Hellerstein et al., 2000], where a query processing system is defined to be adaptive if it has three characteristics: (1) it receives information from its environment, (2) it uses this information to determine its behavior, and (3) this process iterates over time, generating a feedback loop between environment and behavior. We describe adaptive techniques in more detail in Chapter 3.

4. Gigascope. The Gigascope system [Cranor et al., 2003] is designed for network applications like traffic analysis, and deals with very high data rates. Its query language is a stream-only extension of SQL called GSQL. An important attribute of Gigascope is that some portions of the system run directly on a network interface card (NIC) while the remainder runs on the general purpose host computer that controls the NIC. As a consequence, it can push down some aspects of stream processing right to the NIC and thereby gains significant performance improvements.


5. StreamMill. The StreamMill project [Zaniolo et al., b] aims to unify the processing of structured and semi-structured streams. It focuses on query languages (e.g., ESL [Zaniolo et al., a]) that are more expressive than SQL-based languages like CQL, as discussed in [Law et al., 2004].
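The CQL mappings referenced in item 2 above can be sketched in Python; this is a deliberately simplified illustration (the window length, the ISTREAM-style difference step, and all names are invented for exposition, not STREAM's actual implementation).

def stream_to_relation(stream, now, range_secs):
    # Window specification: the portion of the stream within the last
    # range_secs seconds, viewed as a finite relation.
    return [t for t in stream if now - t["ts"] <= range_secs]

def relation_to_relation(relation):
    # Plain SQL step: a selection over the finite relation.
    return [t for t in relation if t["volume"] > 5000]

def relation_to_stream(prev, curr):
    # ISTREAM-style step: emit tuples newly inserted into the relation
    # since the previous instant.
    return [t for t in curr if t not in prev]

trades = [{"ts": 1, "volume": 9000}, {"ts": 4, "volume": 100},
          {"ts": 5, "volume": 7000}]
prev = []
for now in (4, 5):
    curr = relation_to_relation(stream_to_relation(trades, now, range_secs=3))
    print(relation_to_stream(prev, curr))  # newly qualifying trades
    prev = curr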

2.2.3 Stream Processing Techniques

The aforementioned projects, as well as other efforts, have produced a number of important non-shared query processing techniques. We describe some of them here.

Join processing is a crucial requirement for correlating multiple streams. Raman et al. [Raman et al., 2003] proposed decoupling a binary join operator into a unary State Module (SteM) for adaptivity (described in more detail in Section 3.2). Multi-way joins over a set of streams have been studied by Viglas et al. [Viglas et al., 2003] and Golab et al. [Golab and Ozsu, 2003].

The first techniques for efficient unshared processing of periodic windowed aggregates were the Explicit WindowId (EWID) [Li et al., 2005b] and paned window [Li et al., 2005a] approaches. Chapter 5 of this thesis develops the paired window scheme, which significantly improves on paned windows.

An important challenge for a DSMS is to cope with high volume data streams that are consumed at a rate slower than they are produced. In such situations, approximate query processing is a popular technique because streaming data, for instance from sensors, is often inherently an approximation of reality. Furthermore, data streams can often be very “bursty”, for instance when there is excessive trading of a particular equity for some reason, forcing systems to “shed load” under peak conditions.


Load shedding is generally accomplished by discarding one or more data tuples before they are completely processed in a dataflow. Examples of such techniques appear in [Reiss and Hellerstein, 2006], [Tatbul, 2003] and [Babcock et al., 2003]. These papers describe ways for users to trade accuracy for latency in query results with Quality of Service (QoS) specifications. Trading accuracy for latency is a reasonable choice in situations where the system does not archive the input data streams. In such situations, the system is said to be operating under the “one-look” assumption (i.e., all operations on the stream must be performed using only one pass through the data) [Garofalakis et al., 2002]. There are other scenarios, however, where the incoming data is important and must be archived for the correctness of future queries. For these situations, an alternative approach is proposed in OSCAR [Chandrasekaran and Franklin, 2004], an enhanced sequential access method that allows the system to trade the quality of accessed data against the I/O load incurred. In other words, OSCAR chooses to degrade the accuracy of a currently running query so that the input data burst can be accommodated on disk, and the accuracy of future queries is not compromised.

2.3 Distributed Processing

Having described previous work in shared and streamed query processing, we now turn to “distributed processing”, the third area of research that is relevant to this thesis. More precisely, here we classify related work in query processing over data from distributed sources in a network. We consider projects that deal with semi-structured data, as well as those that deal with structured data (see Figure 2.4).

The IrisNet [Deshpande et al., 2003] project provides a software infrastructure for managing a distributed set of sensors. The data from the sensors is treated as a large, distributed, continuously-updated XML document that can be queried by users.


[Figure 2.4: Related work on distributed processing. Semi-structured: IrisNet. Structured: non-streaming traditional mediators (Garlic), non-streaming live snapshot systems (Astrolabe), and streaming systems, either unshared (TAG, Tributaries-Deltas, Aurora*) or shared (HiFi).]

The queries are pushed down to the edges of the sensor network.

Distributed processing for structured data has been explored in traditional systems in the context of federated mediators like Garlic [Josifovski et al., 2002]. Here applications query a mediator, which in turn sends query requests to remote sources. The mediator answers the application queries by combining the data from the remote sources. In this work it was generally found that communication costs mattered more than computation costs.

Astrolabe [van Renesse et al., 2003] is a system for processing structured data from distributed data sources. Unlike a traditional mediator that fetches data in response to each user query, Astrolabe is closer to a streaming system as it supports a rudimentary form of continuous queries.


In Astrolabe, an aggregate query is written in an SQL-like language, and is registered with the system by a client. In response, the system commits to producing a live snapshot of results for the query whenever the client requests one. An Astrolabe query continues to live until its creator stops its execution. A client application can thus use the results of such a query to get current aggregate snapshots of the distributed data sources. Since applications interact with the system by requesting query results at any time, an important focus of the project is to develop techniques that enable the system to process a client's request efficiently on demand. Thus, unlike in streaming systems where the results of a streaming query are pushed to the client, in Astrolabe the onus is on the client to pull the latest answers from the server, and the server must be able to respond to that request in an efficient manner.

Distributed streaming systems that process data in a structured fashion have also arisen in the context of sensor networks. TAG [Madden et al., 2002a] is an example of such a system, where aggregate queries are posed with an epoch clause that identifies when to generate results. In TAG, aggregate results are accumulated and communicated in a hierarchical bottom-up fashion through the nodes of a sensor network. More recently, Tributaries-Deltas was proposed in [Manjhi et al., 2005] as an alternative to TAG. The main difference between TAG and Tributaries-Deltas is in the specific way in which the nodes of a sensor network organize themselves into an aggregation hierarchy. In both systems query processing is unshared.

Aurora* [Cherniack et al., 2003] is a sequel to the Aurora system, and is intended for operating over geographically dispersed streams in a distributed fashion. Aurora* handles the same kinds of queries as Aurora, but partitions a query plan across the various distributed stream processing nodes. Queries in Aurora* are processed in an unshared fashion.

The work in this thesis is in the context of the HiFi [Franklin et al., 2005] project at UC Berkeley, which is aimed at hierarchical aggregation of data streams from widely distributed receptors.


As discussed in Chapter 5, a major focus of this project is to share communication resources among multiple concurrent queries.

2.4 Summary

In this chapter, we presented a broad survey of previous research that is related to the various contributions in this thesis. We covered each of the three relevant areas individually, as well as techniques that straddle two or more of them. Now that the prior literature has been surveyed, the central purpose of this thesis can be restated. This thesis sets out to build techniques that permit shared query processing in streaming and distributed systems. There are two major goals towards this end. The first goal is to show that it is possible to build a system that can profitably share different types of queries, including joins and aggregates, which vary in different ways. The second goal is to develop sharing techniques that are feasible and do not suffer from the inadequacies of traditional MQO approaches.


Chapter 3

Data Stream Management

In this chapter I provide specific background on data stream management that is necessary to understand the thesis work reported in this dissertation. The material presented here is at a deeper level than the survey of related work in Chapter 2, and drills down into certain specific aspects of stream processing technology. Readers familiar with data stream management technology can skip this chapter.

I first describe “operator dataflows”, a fundamental mechanism in any data stream management system. Since shared processing of multiple concurrent queries is a prime focus of this dissertation, the discussion in this chapter is oriented towards the execution of dataflows that are used for shared query processing.

A data streaming system that supports sharing is tasked with running a set of concurrent queries that are added to the system, either up front as a set, or incrementally in an ad hoc fashion. In either case, the system converts the queries into a network of operators called a dataflow. These dataflow operators (e.g., join, aggregate) are similar to the relational operators used in traditional systems. As the system receives a continuous stream of data items (called tuples), it processes the tuples through the dataflow and continually updates the results of the queries.



While, at a high level, most data streaming systems (such as those described in Section 2.2) behave in a similar fashion, the dataflows they process are of two distinct kinds: static and adaptive. In the rest of this chapter, I describe how sharing is accomplished in both static and adaptive dataflows in data stream management systems.

3.1 Static Dataflows

Static dataflows that are used in streaming systems are a straightforward extension of traditional dataflows that are used in non-streaming systems. We consider, in turn, unshared and shared dataflows.

3.1.1 Unshared static dataflows

A dataflow of operators is usually represented as a query plan. For instance, consider the example shown in Query 3.1. This query returns only high volume trades, those where the number of shares exchanged exceeds a threshold (5000), from a stream of stock market transactions.

Query 3.1 Select high volume stock transactions

SELECT T.*
FROM   Trades T
WHERE  T.volume > 5000

A simple plan for this query is shown in Figure 3.1. In this plan, data tuples are first scanned from the stream Trades and then sent to σ, the filter operator labeled with its predicate (volume > 5000). The qualifying tuples that satisfy the predicate of the filter operator are then sent to an output operator Out. A query plan can be significantly more complex than the simple example shown above.

[Figure 3.1: Static unshared dataflow example. Trades → σ(volume > 5000) → Out.]

For instance, operators need not be unary. That is, they may have multiple inputs (or children). Thus, a query plan can in general be represented as a tree of operators. A sub-tree of a query plan is called a sub-plan.

Operators and the Iterator model

We now describe the model that is generally used in implementing operators and processing static dataflows in traditional non-streaming systems. An excellent and comprehensive treatment of this topic is presented in [Graefe, 1993]. A query processor in such a system has the capacity to execute many different operators. When the system receives a new query, an optimizer “compiles” the query into a dataflow of operators that will correctly, and efficiently, answer the query.

In most such systems, operators are defined using the iterator model. In this model, each operator defines four methods (init, next, rescan, and close). The system's query executor initializes a plan by calling the init method on the root of the operator plan, which in turn recursively calls the init method of the other operators of the plan. Similarly, the executor responds to a fetch request from the client that submitted the query by calling the next method on the root of the plan. Again, this method is recursively dispatched through the entire plan to fetch the results of a query. The rescan method is used to re-initialize a sub-plan whose results are repeatedly used by the rest of the plan. Finally, when the last result (typically delineated as a special NULL value) has been fetched from the plan, the executor invokes the close method on the root of the plan. In response to a close call, an operator performs cleanup operations (e.g., freeing any allocated memory) and recursively dispatches the request to its children in the plan.
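As an illustration of the iterator model, the following is a minimal Python sketch of a scan and a filter operator exposing the four methods described above; the class structure is invented for exposition and is not drawn from any particular system.

class Scan:
    # Leaf operator over a bounded, in-memory input.
    def __init__(self, tuples):
        self.tuples = tuples
    def init(self):
        self.pos = 0
    def next(self):
        if self.pos == len(self.tuples):
            return None            # None delineates the last result
        t = self.tuples[self.pos]
        self.pos += 1
        return t
    def rescan(self):
        self.pos = 0               # re-initialize the sub-plan
    def close(self):
        pass                       # nothing to clean up

class Filter:
    # Selection operator: passes through tuples satisfying a predicate.
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def init(self):
        self.child.init()          # recursively initialize the sub-plan
    def next(self):
        while (t := self.child.next()) is not None:
            if self.predicate(t):
                return t
        return None
    def rescan(self):
        self.child.rescan()
    def close(self):
        self.child.close()         # recursively dispatch cleanup

# The executor drives the plan by repeatedly calling next() on the root.
plan = Filter(Scan([{"volume": 9000}, {"volume": 10}]),
              lambda t: t["volume"] > 5000)
plan.init()
while (t := plan.next()) is not None:
    print(t)                       # {'volume': 9000}
plan.close()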


Blocking and Pipelining Operators

It is important to understand that the operators used in the iterator model of a traditional system can, in general, be blocking. That is, a call to a sub-plan must block (wait) until the sub-plan fetches a result. This process can take a long time if the sub-plan has to look at a large amount of data, for instance from a very large table on disk. In a traditional system, such a plan is typically guaranteed to end at some point even if it does take a very long time to execute (the exceptions involve plans for certain kinds of “recursive queries” that are beyond the scope of this dissertation). This is because queries in a traditional system operate on a bounded set of data.

In a streaming system, however, blocking operators are generally infeasible. For instance, consider a “Sort” operator that might be part of a dataflow used as the input to a “Merge Join”. A Sort operator needs to see all the tuples of its input set in order to produce a sorted output set of tuples, and thus cannot operate on an unbounded amount of data. To use blocking operators on streaming data, it is possible to apply a windowing operation to map a subset of a stream to a temporary relation (this is the case with CQL, as explained in Section 2.2.2), and then apply traditional query plans with blocking operators. This approach is, however, recognized as being very inefficient, as pointed out in [Terry et al., 1992] in the context of Tapestry, the very first CQ system. Thus, in practice, streaming dataflows are generally built with non-blocking, or pipelined, operators. A pipelined operator can make progress as long as its sub-plans continue to produce results. The symmetric hash join [Wilschut and Apers, 1993] is
an example of a pipelined operator. This operator is “symmetric” because it treats both inputs uniformly (as opposed to a nested loop join, which treats its input sub-plans differently). Let us now consider the operation of a symmetric hash join that joins two streams R and S. This operator maintains two hash tables, one for each stream, hashed on the respective join attributes of the streams. The operator fetches a tuple from either stream (say R), and performs two operations. First, it inserts the tuple in the hash table for R. This insertion operation is called a “build”. Next, it uses this tuple to look up the hash table consisting of previously built tuples from S in order to produce join tuples. This lookup operation is called a “probe”. The symmetric hash join operator can continue to make progress even if one of its inputs is temporarily blocked, so long as the blocked input returns an appropriate indication. Furthermore, this operator can produce join results without examining all the tuples from any of its inputs. This is very important in a streaming system, since it deals with unbounded data streams.

A variant of the symmetric hash join is developed as part of the TULIP technique in Chapter 4. The TULIP approach marries the static dataflow approach that is described in this section with some of the ideas underlying adaptive dataflow technology that are described later in this chapter. For the rest of this chapter, we use the generic term “join operator” to indicate a pipelined symmetric hash join unless explicitly stated otherwise.
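A minimal Python sketch of the build and probe steps of a symmetric hash join follows; the attribute name and tuple layout are illustrative assumptions, not drawn from any particular system.

from collections import defaultdict

class SymmetricHashJoin:
    # Pipelined symmetric hash join on R.a = S.a: each arriving tuple is
    # first built into its own stream's hash table, then used to probe
    # the hash table of previously built tuples from the other stream.
    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def process(self, source, t):
        self.tables[source][t["a"]].append(t)          # build
        other = "S" if source == "R" else "R"
        matches = self.tables[other][t["a"]]           # probe
        return [(t, m) if source == "R" else (m, t) for m in matches]

join = SymmetricHashJoin()
print(join.process("R", {"a": 1, "x": "r1"}))  # []: no S tuples built yet
print(join.process("S", {"a": 1, "y": "s1"}))  # one (r, s) join tuple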

Fjords: An alternative to the iterator model

As we saw above, the iterator model can be used in a streaming system where the operators are all pipelined and non-blocking. This model is driven by repeatedly fetching results from the root of a streaming dataflow. Here, each operator works in a strictly “pull-based” fashion. That is, an operator gets input data by “pulling” data from its children in a plan. While the pull-based approach works fine in situations where all sources are strongly connected to the system, it is less than ideal when sources are remote and intermittently connected.
In a sense, all the non-blocking and pipelined operators in a plan are useless if a scan operator's pull from a single remote source blocks and stops the rest of the system from making progress. To solve such problems, the Fjord model, an alternative to the iterator model, was developed in [Madden and Franklin, 2002]. In the Fjord model, an operator makes no assumptions about how it is connected with another operator that serves as its input. Instead, each operator has two queues (input and output), and must implement precisely one method, processNext. When an operator is scheduled by the system (typically accomplished by having a scheduling module call the operator's processNext method), it reads any tuples from its input queue, processes these tuples, and sends any resulting output tuples to its output queue. With this approach, it is the system's responsibility to hook up the actual queues that connect the output of one operator to the input of another. Furthermore, the system also schedules the different operators appropriately. For instance, the scans can run in independent threads, fetching data from remote sources and writing the data to their output queues. The advantage of this scan-per-thread model is that a slowdown of one scan does not affect the processing of tuples from any other scan. The other advantage of this model is that Fjord operators can be separated across process and network boundaries by using an appropriate queue abstraction. The TelegraphCQ system (described in Appendix A) uses the Fjord model for its operators.
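The following is a minimal Python sketch of a Fjord-style operator under the description above; the scheduling and queue-wiring responsibilities that belong to the system are reduced to comments, and all names are illustrative.

from collections import deque

class FjordFilter:
    # Fjord-style operator: it makes no assumptions about its neighbors,
    # and simply drains its input queue into its output queue.
    def __init__(self, predicate):
        self.input, self.output = deque(), deque()
        self.predicate = predicate

    def processNext(self):
        while self.input:
            t = self.input.popleft()
            if self.predicate(t):
                self.output.append(t)

# The system is responsible for wiring queues together and for calling
# processNext(); a scan running in its own thread would append here.
op = FjordFilter(lambda t: t > 5000)
op.input.extend([9000, 10, 7000])
op.processNext()
print(list(op.output))  # [9000, 7000]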

3.1.2 Shared static dataflows

We now describe shared static dataflows, an execution mechanism for processing multiple concurrent queries in streaming systems. This mechanism is central to many of the techniques developed in this thesis. These techniques are TULIP (Chapter 4), Shared Time Slices and Partial Push-Down (Chapter 5), and Shared Data Fragments and Shared Data Shards (Chapter 6).


The shared static dataflow technique extends the pipelined unshared dataflows described in the previous section in order to exploit sharing by identifying and using common sub-plans. In this model, a common sub-plan serves as an input to multiple operators. Such a dataflow (called a “shared plan”) is represented by a Directed Acyclic Graph (DAG) of operators, in contrast to an unshared plan, which can be represented with a tree of operators. Shared static dataflows are now explained with the help of Example 3.1 below.

Example 3.1 Consider two queries Q1 and Q2 that are described with the relational algebra expressions shown below. Each query applies an individual predicate (r1 and r2 for queries Q1 and Q2 respectively) on an input stream R, and joins the resulting stream with another stream S.

• Q1: σr1(R) ⋈ S
• Q2: σr2(R) ⋈ S

Figure 3.2 shows a static dataflow plan that can be used to process the queries in Example 3.1 in a shared fashion. In this plan, tuples are scanned from streams R and S to a symmetric hash join operator (depicted with ⋈). The join tuples produced are then sent to a Copy operator. This operator copies every tuple in its input stream to each of its output streams. Thus, an identical stream of join tuples (produced by a common sub-plan that evaluates the sub-query R ⋈ S) is fed to the selection operators of each individual query. These operators (shown as σr1 and σr2) apply their respective filter predicates and send the qualifying tuples to the output operators (shown as OutQ1 and OutQ2) of each query.

[Figure 3.2: Static shared dataflow example. Streams R and S feed a shared join ⋈; its output passes through a Copy operator to σr1 → OutQ1 and σr2 → OutQ2.]

When new tuples arrive in the system, they are driven through the dataflow plan according to an operator scheduling policy. In a static dataflow, while different operators may be executed at different times, the paths taken by each tuple from a given stream to its various destinations are the same for all tuples of the stream.

Of course, with the static approach, it is important to determine an efficient shared query plan, a process that involves finding common sub-queries whose results can be used to answer the individual queries. For example, the NiagaraCQ approach in [Chen et al., 2002; Chen et al., 2000] describes detailed ways to form grouped plans for multiple queries. In Section 2.1, we identified the two main approaches to MQO: (a) optimize each individual query and then look for sharing opportunities in the access plans, and (b) globally optimize all the queries to produce a shared access plan. The first approach is easier to employ and is used in NiagaraCQ to group together plans for queries with similar structure. When a new query enters the system it is matched against the various existing groups of queries, and then attached to the group that it matches most closely. For example, if the new query is a selection over a stream R with a predicate on the attribute R.b, it will be matched to other queries that are identical, except for having possibly different predicates on the attribute R.b.


The NiagaraCQ approach is revisited in Chapter 4 in the context of understanding a fundamental trade-off in shared query processing.

In this section we described the mechanisms underlying static dataflows with an emphasis on stream processing. In addition, we also explained how these static dataflows can be used in unshared and shared contexts.
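Before turning to adaptive dataflows, here is a minimal Python sketch of the shared static plan of Figure 3.2 for Example 3.1: a single stream of join tuples is fanned out by a Copy operator to each query's selection. The predicates and tuple layout are illustrative assumptions.

def copy_operator(stream, branches):
    # Copy operator: forward every input tuple to each output branch.
    for t in stream:
        for branch in branches:
            branch(t)

def selection(predicate, out):
    # Per-query selection operator feeding that query's output operator.
    def op(t):
        if predicate(t):
            out.append(t)
    return op

q1_out, q2_out = [], []
r1 = lambda t: t["r"]["a"] > 50   # Q1's predicate on R (illustrative)
r2 = lambda t: t["r"]["a"] > 75   # Q2's predicate on R (illustrative)

# Output of the shared sub-plan R ⋈ S, produced once for both queries.
join_tuples = [{"r": {"a": 60}, "s": {"b": 1}},
               {"r": {"a": 80}, "s": {"b": 2}}]
copy_operator(join_tuples, [selection(r1, q1_out), selection(r2, q2_out)])
print(len(q1_out), len(q2_out))   # 2 1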

3.2 Adaptive Dataflows

This section describes a very different type of operator dataflow that can be characterized as adaptive. We consider, in turn, unshared and shared adaptive dataflows. The approaches that are described here provide context for two techniques that are described later in this dissertation.

3.2.1 Unshared adaptive dataflows

Adaptive dataflows were invented in the context of recent research on adaptive query processing. This work was a response to the problems faced by traditional non-streaming systems that operate in uncertain environments. For instance, consider a federated mediator that processes queries over data that it fetches from remote web services. In such a situation, the system may not have the statistical information necessary for its optimizer to do a good job. Furthermore, the environment the system operates in might change frequently, even while executing a given plan. For example, a connection to a remote source may be temporarily lost, or the rate at which data is fetched from different sources might vary over time. These factors can affect the performance of a system.

Adaptive query processing research led to many innovative query processing techniques.


A good overview of these techniques can be found in [Hellerstein et al., 2000], where a query processing system is defined to be adaptive if it has three characteristics: (1) it receives information from its environment, (2) it uses this information to determine its behavior, and (3) this process iterates over time, generating a feedback loop between environment and behavior. The most important of these characteristics is the third, i.e., the feedback mechanism that allows the system to make multiple decisions.

This section focuses on continuous adaptivity, one of the many contributions in adaptive query processing. In particular, we consider the important work on the Eddy [Avnur and Hellerstein, 2000] mechanism for pipelined query plans.


This approach takes advantage of the fact that a pipelined operator continually makes progress, and so it is possible to get feedback for adaptivity on a per-tuple basis.

An eddy is a special dataflow operator with an iterator interface that is used to construct an adaptive query plan. Unlike other operators, which each evaluate a particular relational algebra expression, the job of an Eddy is restricted to routing tuples to operators, and scheduling which operator to execute next. To accomplish the former, an Eddy uses an idea called “tuple lineage”, where each tuple carries a payload that contains extra information. This payload consists of a “steering” vector that keeps track of the operators a tuple must visit (e.g., different selection predicates) and those that it has already visited. The set of operators a tuple must visit can be computed based on the signature of the tuple, i.e., the set of base tuples that are its constituents. A tuple can have more than one constituent if it is produced while evaluating a join expression.

Thus, at any point in the lifetime of the tuple, the Eddy can examine the steering vector to identify the set of operators that the tuple can be sent to next. These operators are called candidates for tuple routing, and at each routing step the Eddy needs to make a decision (based on a routing policy) as to which candidate operator a given tuple is next routed to. For instance, the Eddy could make routing decisions using an adaptive policy that is based on maintaining feedback information from each operator (e.g., its effective selectivity).


The Eddy uses this information to run a lottery scheduling scheme [Waldspurger and Weihl, 1994] to determine which candidate operator to route a tuple to (and execute) next. The steering vector of a tuple is also used to determine when the tuple has been completely processed, i.e., routed through all the operators necessary to satisfy the requirements of the query. Thus, with an adaptive query plan such as one based on an Eddy, different tuples from a given stream can take multiple paths to the output. This is in stark contrast to the traditional static dataflow approach.

These adaptive query processing techniques are beginning to influence the database industry. For instance, while the leading commercial database vendors do not use extremely adaptive techniques of the sort described above for the Eddy framework, they have begun to build adaptive optimizers such as LEO [Stillger et al., 2001].
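The per-tuple routing just described can be sketched in a few lines of Python; the routing policy below is a fixed stand-in for lottery scheduling, and all names are illustrative.

def eddy_route(t, operators, policy):
    # Route one tuple through its remaining operators. The steering
    # vector is modeled as the set of operator ids the tuple must still
    # visit; the policy picks the next candidate at each routing step.
    to_visit = set(operators)
    while to_visit:
        op_id = policy(to_visit)          # per-tuple routing decision
        to_visit.discard(op_id)
        if not operators[op_id](t):       # the operator rejects the tuple
            return None
    return t                              # visited every operator: output

ops = {"sigma_a": lambda t: t["a"] > 50,
       "sigma_b": lambda t: t["b"] < 10}
policy = min   # fixed stand-in for an adaptive, lottery-based policy
print(eddy_route({"a": 60, "b": 3}, ops, policy))  # passes both filters
print(eddy_route({"a": 40, "b": 3}, ops, policy))  # None: rejected early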

3.2.2 Shared adaptive dataflows

We now describe how adaptive dataflows can be extended to incorporate shared processing. Two important projects in this area are CACQ [Madden et al., 2002b] and PSoup [Chandrasekaran and Franklin, 2002], both of which support adaptive tuple routing. The topic of CACQ-style shared adaptive dataflows is central to the development of CAR, a new shared adaptive dataflow technique, in Chapter 4 of this dissertation.

The CACQ approach

In particular, we focus on the CACQ approach, which extends the Eddy framework described in the previous section for shared query processing. The CACQ version of the Eddy operator is called a Shared Eddy. In CACQ, the lineage of a tuple is extended to include a “completion” vector in addition to the “steering” vector described for the unshared Eddy. The completion vector of a tuple is generally implemented as a bitmap with one bit for each query in the system.


It is used to keep track of the status of the tuple vis-à-vis each of these queries. More specifically, if it is known that a tuple cannot possibly answer a particular query (e.g., because it failed to satisfy a predicate of the query), then this fact is recorded in the tuple's completion vector (e.g., by setting the corresponding bit). The queries that a given tuple cannot possibly answer are called its dead queries, and the queries that a given tuple could possibly answer are called its live queries. In other words, if a tuple is live for a particular query, then that query is not yet recorded in the tuple's completion vector. Similarly, if a tuple is dead for a particular query, the query is recorded in the tuple's completion vector. The shared Eddy routes tuples to new operators specific to CACQ. We focus on these CACQ operators next.

The CACQ operators

In addition to the Shared Eddy, CACQ uses two other kinds of operators, a Grouped Selection Filter (GSFilter) and a State Module (SteM). The GSFilter operator is used to evaluate similar selection predicates from multiple concurrent queries. Likewise, the SteM operator [Raman et al., 2003] is used to process similar join expressions from multiple concurrent queries. These CACQ operators are aware of the completion vector of each input tuple. In fact, two otherwise identical tuples with different completion vectors may be processed differently. The operation of the GSFilter and SteM operators is described below.

GSFilter. A GSFilter is an operator used by a shared Eddy to evaluate similar predicates from multiple queries. These predicates typically involve simple comparison operations (e.g., =, ≠, >) on an expression over the attributes of a given stream and a constant value. A key feature of a GSFilter operator is that it can efficiently evaluate multiple predicates in a shared fashion by maintaining an index over the constant values for each kind of comparison operator.


When the GSFilter operator receives a new tuple, the operator efficiently probes the index to identify all registered predicates that the tuple does not satisfy. The queries that these unsatisfied predicates belong to are then all considered dead with respect to this tuple, and so the operator marks all these dead queries in the completion vector of the tuple. If, at the end of processing a tuple (as indicated by its steering vector), there are any queries associated with the tuple that are still live, then the tuple is sent to the output operators associated with those queries. Notice that with an appropriate index the probing operation can be very efficient in the number of predicates handled by the GSFilter (e.g., with a binary tree index the probing operation has a complexity of O(log n) when the GSFilter handles n predicates).

As an example, let the predicates r1 and r2, in the queries Q1 and Q2 from Example 3.1, be (R.a > 50) and (R.a > 75) respectively. A GSFilter for these predicates would maintain an index (e.g., a binary tree) for the comparison operator (>) over the constants 50 and 75. Let us suppose that the shared Eddy fetches a tuple from the R stream, where the value of the R.a attribute of the tuple is 60, and then routes it to the GSFilter. The GSFilter then probes the index to find the set of all index entries with a constant value greater than 60. In this case it will be the singleton set consisting of the constant 75. The GSFilter then determines that this tuple cannot satisfy the query Q2 (i.e., Q2 is a dead query for this tuple) and therefore records this fact in the completion vector of the tuple.
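The following is a minimal Python sketch of such a GSFilter for “>” predicates, using a sorted list and binary search in place of a binary tree; the names and structure are illustrative, not drawn from CACQ's implementation.

import bisect

class GSFilter:
    # Grouped filter for predicates of the form (R.a > constant): the
    # constants are kept sorted so that a single binary search finds
    # every registered predicate a given tuple fails.
    def __init__(self, predicates):
        # predicates: (query id, constant) pairs, e.g. Q1 has R.a > 50
        self.entries = sorted(predicates, key=lambda p: p[1])
        self.constants = [c for _, c in self.entries]

    def process(self, value, completion):
        # Every constant >= value corresponds to a failed predicate;
        # mark its query dead in the tuple's completion vector.
        for qid, _ in self.entries[bisect.bisect_left(self.constants, value):]:
            completion.add(qid)
        return completion

gsf = GSFilter([("Q1", 50), ("Q2", 75)])
print(gsf.process(60, set()))  # {'Q2'}: the tuple fails R.a > 75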

The GSFilter is a key tool in the development of three specific schemes in this dissertation. The first technique (in Chapter 4) is called TULIP, and uses the notion of tuple lineage in static shared plans that contain GSFilter operators. The second technique (also in Chapter 4) is called CAR, and is a variant of CACQ that adopts some of the features of static dataflows. The third technique (in Chapter 6) is called Shared Data Fragments, and uses GSFilter operators for shared processing of aggregate queries.
SteM. A SteM is an operator used by a shared Eddy to evaluate similar binary join expressions (i.e., join expressions of the form R ⋈a S over two streams) from multiple queries. Although these join expressions can involve various different streams, one half of the join must occur in every expression. Recall from Section 3.1.1 that a symmetric join operator performs both build and probe operations with its input tuples. Thus, a SteM can be thought of as one half of a decoupled symmetric join operator. While a SteM can process tuples from multiple streams, it builds tuples only from one stream (the build stream). Tuples from the other streams (the probe streams) are used only in probe operations. Since a SteM can be shared across multiple similar queries, it maintains information on all the join expressions of these queries that it must evaluate. Notice that a SteM can evaluate multiple join expressions for a single pair of build and probe streams (e.g., R.a = S.a and R.a = S.a + 2). In addition, the SteM also maintains an index (e.g., a hash table) for the build operation. In order to implement a windowed join, the SteM periodically ages out build tuples from its index.

When a SteM receives a tuple from a build stream, it hashes the tuple into its associated hash table. When a SteM receives a tuple from a probe stream, it consults the various join expressions associated with that probe stream, and for each such join expression it looks up (or probes) the hash table for build tuples that can match the probe tuple. Thus, each probe can result in the production of multiple join tuples. Note that a join tuple produced by matching a particular join expression (e.g., R.a = S.a) may not be valid for queries associated with another join expression (e.g., R.a = S.a + 2). Therefore, the SteM ensures that a join tuple is only sent to the queries it is valid for, by recording the fact that all other queries are dead in the tuple's completion vector. For example, consider two concurrent queries, with respective join expressions of R ⋈a S and R ⋈a T.


Here, the R.a SteM can be used by both queries, and R tuples need to be built only in the hash table of the R.a SteM. When the R.a SteM receives a tuple from the R stream, it builds the tuple into its index. If the R.a SteM receives a tuple from an S (or T) stream, it probes the index with the tuple and produces RS (or RT) join tuples that are sent to the appropriate queries via the Eddy. Note that the SteM abstraction is independent of the specific indexes used. For instance, while a hash table makes sense for join predicates with equality expressions (e.g., R.a = S.a), a hash table cannot be used with a join predicate that uses a range comparison expression (e.g., R.a > S.a), for which a binary tree might be better suited. If the size of the windows is very large, then this index can be backed onto disk.
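A minimal Python sketch of a SteM's build and probe operations follows, specialized to equality joins on a single attribute; the windowed aging of build tuples and the multi-expression bookkeeping are omitted, and all names are illustrative.

from collections import defaultdict

class SteM:
    # State module for one build stream, hashed on its join attribute;
    # tuples from probe streams are matched against what has been built.
    def __init__(self, attr):
        self.attr = attr
        self.table = defaultdict(list)

    def build(self, t):
        self.table[t[self.attr]].append(t)

    def probe(self, t, attr):
        return self.table[t[attr]]   # build tuples matching the probe tuple

# A single R.a SteM serves both R ⋈a S and R ⋈a T:
r_stem = SteM("a")
r_stem.build({"a": 1, "src": "R"})
print(r_stem.probe({"a": 1, "src": "S"}, "a"))  # RS match
print(r_stem.probe({"a": 1, "src": "T"}, "a"))  # RT match, same SteM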

An example: CACQ in action

We now demonstrate how the CACQ scheme can be used to process two concurrent queries.

Example 3.2 Consider the queries Q3 and Q4 that are described with the relational algebra expressions shown below. Each query applies an individual predicate (r1 and r2 for queries Q3 and Q4 respectively) on an input stream R, another individual predicate (s1 and s2 for queries Q3 and Q4 respectively) on the other input stream S, and joins the two streams formed by applying the individual predicates.

• Q3: σr1(R) ⋈ σs1(S)
• Q4: σr2(R) ⋈ σs2(S)

In Figure 3.3 we show how CACQ processes the queries Q3 and Q4 in a shared fashion. Scan modules for R and S are scheduled to bring data into the system. The tuples are fetched from the scan modules by the shared Eddy, which adaptively routes the tuples through its slave operators. There are two GSFilters, one for all the predicates over R and one for those over S, and two SteM operators.

[Figure 3.3: CACQ: Eddy, SteMs and Grouped Filter.]

We use r and s to denote the signatures of base tuples from the R and S streams respectively. In this example, a tuple with signature r has to be built into the R SteM and probed into the S SteM, and the adaptivity features of CACQ play no role, as there is only one join to be performed. Figure 3.4 shows the various operations performed while processing the r and s tuples in CACQ for this example. The figure shows that a base tuple goes through Build, GSFilter, Probe, and Out operators in sequence. In the figure, an r tuple is first built into the R SteM, and then probed into the S SteM. Similarly, an s tuple is first built into the S SteM, and then probed into the R SteM. For simplicity, we assume that the predicates in question are not expensive, and so CACQ always orders the GSFilter before a Probe. Note that while this figure shows an Out operation, in CACQ there is no explicit Out operator and all output processing is a responsibility of the shared Eddy. In a shared Eddy, any intermediate tuple could satisfy a query.


Thus, in CACQ, the shared Eddy checks every such intermediate tuple to see if it satisfies any queries, in order to deliver tuples to query outputs.

[Figure 3.4: Effective tuple dataflow in CACQ. r tuples flow BuildR → GSFilter(r1, r2) → ProbeS; s tuples flow BuildS → GSFilter(s1, s2) → ProbeR; both probes feed OutQ3,Q4.]

Correctness Considerations

For correctness reasons, CACQ [Madden et al., 2002b] requires that “a singleton tuple must be inserted into all its associated SteMs before it is routed to any of the other SteMs with which it needs to be joined”. This requirement is realized by a system constraint that forces tuples to be built directly into their associated SteMs right after they have been scanned. If this constraint is not enforced, a race condition is possible. We now describe how the race condition can be triggered. Consider the following sequence of events, which can result from the scheduling policy of a shared Eddy that does not enforce this constraint:

1. Scan:(R)→r. Scan the stream R and produce a tuple r.

2. Probe:(S.a,r)→X. Probe a SteM built on S.a with the tuple r and produce a set of join tuples X.

3. Scan:(S)→s. Scan the stream S and produce a tuple s.


4. Probe:(R.a,s)→Y. Probe a SteM built on R.a with the tuple s and produce a set of join tuples Y.

5. Build:(S.a,s)→φ. Build the tuple s in the SteM on S.a.

6. Build:(R.a,r)→φ. Build the tuple r in the SteM on R.a.

In this example, it is easy to see that neither X nor Y (the sets of join tuples produced by the probes on the S.a and R.a SteM operators respectively) will contain the r ⋈ s join tuple. This is a race condition, caused by the fact that the probe operations for the tuples from the R and S streams take place before their respective build operations.

Note that this race condition can also be avoided by enforcing the following different constraint: “Once a tuple has been scanned from a stream, it (and all tuples derived by processing it, such as join tuples) must be fully processed through all operators of the system before another tuple can be scanned.” We call this the “draining constraint”, since it requires the system to “drain” all operators of any outputs caused by processing a tuple before scanning and processing another tuple. These points in time when the system has drained all operators are also ideal for adding and removing queries in the system. The draining constraint is used in the TelegraphCQ system described in Appendix A.
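The race and the effect of the build-before-probe constraint can be replayed in a few lines of Python; the SteM here is a stripped-down stand-in (equality join on attribute a) so the block runs standalone.

from collections import defaultdict

class SteM:
    def __init__(self):
        self.table = defaultdict(list)
    def build(self, t):
        self.table[t["a"]].append(t)
    def probe(self, t):
        return self.table[t["a"]]

r_stem, s_stem = SteM(), SteM()
r, s = {"a": 1, "src": "R"}, {"a": 1, "src": "S"}

# The six-event schedule above: both probes run before either build.
X = s_stem.probe(r)          # event 2: the S.a SteM is still empty
Y = r_stem.probe(s)          # event 4: the R.a SteM is still empty
s_stem.build(s)              # event 5
r_stem.build(r)              # event 6
assert X == [] and Y == []   # the r ⋈ s join tuple is lost: the race

# Build-before-probe (CACQ's constraint) recovers the join tuple:
r2, s2 = SteM(), SteM()
r2.build(r); X = s2.probe(r)   # still []: s has not been scanned yet
s2.build(s); Y = r2.probe(s)   # finds r
assert Y == [{"a": 1, "src": "R"}]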

The correctness constraints described above play an important role in understanding the shortcomings of CACQ, and in the development of CAR (an alternate shared adaptive approach) in Chapter 4 of this dissertation.

In this section we explained in detail some important mechanisms that are used in unshared and shared adaptive streaming dataflows.

3.3 Summary

In this chapter, I provided background on data stream management that is necessary to understand the contributions reported in this dissertation. While I presented a broad survey of related work in Chapter 2, here I drilled down into shared data stream management, the area at the core of this thesis's contributions. In particular, this chapter described operator dataflows, a fundamental execution mechanism for a DSMS. The explanation covered static and adaptive dataflow technology in both non-streaming and streaming systems. Furthermore, background material on shared processing in static and adaptive dataflows was also presented. With this background on data stream management, I can now consider the three shared query processing problems that were identified in Chapter 1. The next chapter begins this process by exploring an important trade-off in shared query processing.


Chapter 4

Precision Sharing for Joins with Varying Predicates

In the last chapter I provided a detailed description of the mechanisms underlying static and adaptive dataflows for data streaming systems. With this background, we are ready to explore the major issues that are the subject of this thesis work, i.e., shared processing of multiple concurrent queries in data streaming systems. In this chapter I present the first of three major contributions (summarized in Section 1.4) that comprise the work in this dissertation, namely resolving a fundamental tension in shared processing of certain kinds of queries (e.g., join queries with varying predicates) that exists between the savings due to avoiding repeated work and the overhead due to unnecessary wasted work. (The work reported in this chapter was published in [Krishnamurthy et al., 2004].) This tension has thus far gone unnoticed because existing approaches to sharing such queries have only considered simpler cases where the tension does not arise. It turns out, however, that the tension does arise when these existing approaches are applied to more complex and general cases. The consequence of this tension is a significant performance degradation that


can severely limit the advantages of shared processing. I resolve this tension by developing novel shared processing techniques that marry ideas from both static and adaptive dataflow technology (reviewed in Chapter 3). These techniques are focused on join queries because joins are one of the most important features of any data management system. The subsequent chapters of this dissertation focus on sharing aggregate queries.

I begin this chapter by formally defining the notion of precision sharing as a way to characterize any approach to shared query processing. Next, I use a series of examples to demonstrate how prior work (in static and adaptive systems) on sharing join queries violates the requirements for precision sharing. I then propose TULIP and CAR, shared query processing techniques for static and adaptive systems respectively, which do not violate the requirements for precision sharing. TULIP extends static dataflow technology with the idea of “tuple lineage” that is borrowed from adaptive query processing (see Chapter 3). In contrast, CAR modifies a continuously adaptive system by imposing ordering constraints that are inspired by static dataflows. Finally, I report on an experimental study that evaluates the benefits of precision sharing in the context of TULIP and CAR.

4.1 Introduction

This section first introduces a fundamental tension that can arise in existing static and adaptive approaches to shared stream processing. Next, we propose the notion of Precision Sharing, a way to characterize a sharing scheme without this tension. Finally, we outline a strategy for effective precision sharing in both static and adaptive dataflows.



4.1.1 The Tension between Repeated and Wasted Work

As described in [Sellis, 1988], shared query processing in non-streaming systems aims “to limit the redundancy due to accessing the same data multiple times in different queries.” The goal of limiting redundancy is also true for sharing in stream query processing.

[Figure 4.1: Sharing 2 queries: redundancy and waste.]

We illustrate the potential for redundancy with an example involving two join queries, each with different individual predicates. Figure 4.1 shows two diagonally shaded rectangles that represent the qualifying join tuples that form the results of each query. The intersection of these two rectangles, shown as the rectangle shaded with a cross hatch pattern, represents the set of qualifying join tuples that form the results of both queries, i.e., the intersection of the result sets of both queries. Without sharing, the overlapping tuples are produced twice, a redundancy. In attempting to avoid redundancy, however, current shared schemes produce too much data. A shared scheme from the literature (such as NiagaraCQ) would produce the join tuples in the entire enclosing rectangle, including the useless tuples shown in the two darkly shaded rectangles. This example will be revisited in more detail later in this chapter (in Section 4.2.2). From this example, it would appear that sharing has to balance the inherent tensions of:


• Repeated work caused by applying an operation multiple times for a given tuple, or for its copies.

• Wasted work caused by the production and removal of “useless tuples”.

As explained later in this chapter, this tension was not noticed in previous approaches (e.g., NiagaraCQ and CACQ) that only consider simpler cases. The goal of this chapter is to show that this tension is not, in fact, irreconcilable. In particular, we demonstrate how to design and implement techniques that resolve the tension in both static and adaptive dataflows.

4.1.2 Precision Sharing

Precision sharing is a way to characterize shared query processing schemes that avoid the overhead of repeated work as well as that of wasted work. Precision sharing is defined in terms of certain requirements, associated with the overheads of repeated and wasted work, that a shared processing scheme needs to satisfy. Furthermore, precision sharing can be supported in streaming and non-streaming systems, and can be used with static and adaptive dataflows. For convenience we use the terms “precisely shared” and “imprecisely shared” to describe shared plans that satisfy and do not satisfy the requirements of precision sharing, respectively.

Static shared dataflows

NiagaraCQ was the first streaming system with a static shared dataflow. There are, however, cases where static plans produced by NiagaraCQ are imprecisely shared. The solution developed in this chapter for a precisely shared version of such a dataflow relies on the idea of tuple lineage from the adaptive query processing literature. While tuple lineage was developed for highly variable environments (as described in Section 3.2), an important insight of this thesis is that tuple lineage is more generally applicable.


Specifically, tuple lineage can be used to make static shared dataflows precisely shared. This approach is called TULIP, or TUple LIneage in Plans.

Adaptive shared dataflows

CACQ was the first shared adaptive dataflow system. It turns out that for the cases where static plans produced by NiagaraCQ are imprecisely shared, the adaptive dataflows produced by CACQ are also imprecisely shared. The strategy for developing a technique for adaptive precision sharing is based on ideas borrowed from static query processing. The insight is to place constraints on how tuples are routed in an adaptive scheme to ensure that sharing is precise. This approach is called CAR, or Constrained Adaptive Routing.

4.1.3 Research Contributions

The main contributions in this chapter are to:

1. Define the notion of precision sharing to characterize shared processing schemes that can avoid the overhead of repeated work without performing unnecessary wasted work.

2. Demonstrate the general utility of tuple lineage beyond adaptive query processing, and show how it can be used to achieve static precision sharing.

3. Show how to implement adaptive precision sharing, with appropriate operator routing, by imposing constraints on adaptivity.

4. Experimentally evaluate the static and adaptive techniques for precision sharing mentioned above.

The rest of this chapter is organized as follows. First, in Section 4.2, we formally define precision sharing and explain pitfalls in prior art. This is followed by a description of TULIP, a solution for precision sharing in static dataflows, in Section 4.3.


Next, in Section 4.4, we consider the problems of precision sharing in adaptive dataflows, and describe CAR, a solution for precision sharing in such dataflows. We then present studies that evaluate the performance of TULIP and CAR in Sections 4.5 and 4.6 respectively. Finally, we conclude the chapter with a summary of findings in Section 4.7.

4.2 Precision Sharing

In this section we introduce and explain the importance of precision sharing, a way to characterize the overheads of shared query processing. Precision sharing can be defined in terms of all operations performed on tuples in a shared dataflow. Precision sharing is a characterization of a sharing scheme where, for all stream inputs, the following properties both hold:

• PS1. For each tuple processed, any given operation may be applied to it, or any copy of it, at most once.

• PS2. No operator shall produce a tuple whose presence or absence in the dataflow has no effect on the result of any query, irrespective of any other possible input.

PS1 is fairly self-explanatory, and is really the motivation behind all shared query processing schemes. PS2 can be clarified by considering the future of a tuple emitted by an operator. There are three possibilities: (1) the tuple is part of the result of at least one query, irrespective of any condition that it may encounter in its future; (2) the tuple is part of the result of at least one query, only if it matches some condition in its future; and (3) the tuple is definitely not part of the result of any query, irrespective of any condition in its future.



PS2 prohibits the production of tuples of the third kind, which are referred to as “zombies”.

A plan that does not satisfy PS1 suffers from redundancy. A plan that does not satisfy PS2 results in the wasteful production and subsequent elimination of zombies. We say that a given plan is precisely shared if it satisfies both the properties PS1 and PS2 for all possible inputs, and imprecisely shared otherwise. Approaches in the MQO literature [Sellis, 1988; Tan and Lu, 1995; Dalvi et al., 2001] have all assumed that reducing redundancy is paramount, without considering its side-effects. This definition of precision sharing lets us characterize the nature of such side-effects, and is essential to limiting unnecessary work for the query processor. In the following sections, we will demonstrate how existing shared techniques can lead to imprecisely shared query plans.

4.2.1 Imprecise sharing in action

We now consider examples of imprecise sharing of join queries in the presence of selections on individual sources. We begin with Example 4.1, which builds on a scenario studied in [Chen et al., 2000; Chen et al., 2002] in the context of the NiagaraCQ project.

Example 4.1 Consider the following scenario involving two queries, Q1 and Q2, each of which joins the streams R and S, and applies differing selection predicates on R.

• Q1: σr1(R) ⋈ S
• Q2: σr2(R) ⋈ S

NiagaraCQ considers two alternative query plans for this example. We now explain, in detail, how each plan works. In what follows, the join operators are assumed to be symmetric hash joins (see Section 3.1.1).


The first plan uses a selection-pull-up strategy and is shown in Figure 4.2(a). Recall that we used this plan previously in Section 3.1.2, while describing static shared dataflows. In this plan, tuples are scanned from streams R and S into a symmetric hash join operator (depicted with ⋈). The join tuples produced are then sent to a “Copy” operator, which copies every tuple in its input stream to each of its output streams. Thus, an identical stream of join tuples (produced by a common sub-plan that evaluates the sub-query R ⋈ S) is fed to the selection operators of each individual query. These operators (shown as σr1 and σr2) apply their respective filter predicates and send the qualifying tuples to the output operators (shown as OutQ1 and OutQ2) of each query. Thus, the selection-pull-up strategy lets the join operation R ⋈ S be shared between both queries, thereby saving repeated work.

The second plan uses a selection-push-down strategy and is shown in Figure 4.2(b). In this plan, tuples are scanned from the stream R and sent to a Copy operator, which copies all of its inputs to two separate output streams. (In practice, NiagaraCQ combines the Copy operator and its immediate downstream filters together, using an index for the filter predicates; we separate them here for ease of exposition. Also, we use Out to represent a generic output operator that is equivalent to TriggerAction in NiagaraCQ.) Each output is fed to a different selection operator of each individual query (shown as σr1 and σr2). The outputs of these operators are fed to two separate join operators. Each of these join operators has an additional input, fed from the outputs of another Copy operator, which is in turn fed with tuples scanned from the other stream S. The selection-push-down strategy opts to reduce the amount of work performed by the joins by pushing selections down. In the process, this strategy can end up performing repeated work in the join operators.

Selection push-down (Figure 4.2(b)) violates PS1 in two ways. First, a tuple rx from R that passes both predicates r1 and r2 will be processed in both join operators, producing identical join tuples. Second, every tuple from S will be inserted twice – once in each join operator (assuming the use of symmetric hash joins).

[Figure 4.2: Imprecise sharing of joins with selections. (a) Selection pull-up; (b) Selection push-down.]

Multiple operations on such a tuple rx, and on all tuples in S, are examples of PS1 violations. Each such example is an instance of redundancy in the plan. Note that in this selection-push-down example, PS2 is obeyed, as each tuple from each join operator must satisfy at least one query.

Selection pull-up (Figure 4.2(a)), on the other hand, can violate PS2. For example, the output of the join operator can include an (rx, sx) tuple where rx fails both predicates, r1 and r2, satisfying neither query. The tuple (rx, sx) is an example of a zombie tuple, and shows how increased sharing can cause wasteful work. Note that this plan has only one join operator, which produces the common sub-expression R ⋈ S, and has no redundancy. Since no operation is applied on any tuple more than once, PS1 is satisfied in this case.

We have shown how both selection pull-up and selection push-down violate at least one of the properties of precision sharing. A third alternative, however, was proposed in later work on NiagaraCQ [Chen et al., 2002]. This is a variant of pull-up called filtered pull-up, which creates and then pushes down predicate disjunctions ahead of the join.

[Figure 4.3: Filtered pull-up with no PS2 violations]

Applying the filtered pull-up technique to the example considered above leads to the plan shown in Figure 4.3. This plan is identical to the selection-pull-up plan (Figure 4.2(a)), except that it contains another operator, a disjunctive predicate (r1 ∨ r2), which is pushed down between the join and the scan on R. Pushing down this disjunction serves to reduce the input of the join operator, thereby reducing the amount of work that needs to be performed. Notice that the disjunction is a conservative predicate and is required only for performance and not for correctness, much like a Bloom filter [Bloom, 1970]. The results of a query are not affected if the disjunction lets through “false positives”. In the presence of a large number of queries, the disjunction can be implemented using a predicate index, much like a Grouped Selection Filter (see Section 3.2.2).

If, in the presence of a large number of predicates, the evaluation of the disjunction is expensive, the operator can be turned on or off adaptively. In other words, if |σr1∨r2(R)| is approximately |R| in the recent history, then the system can have the operator simply pass its input on to its output.

Unlike selection pull-up, the filtered pull-up plan for this example satisfies PS2. This is because every R tuple rx that reaches the join operator must have passed at least one of the r1 and r2 predicates, so every join tuple (rx, sx) must also satisfy at least one of the queries Q1 and Q2. Strictly speaking, filtered pull-up does not satisfy PS1, since the predicate r1 may be applied twice to a given tuple (before and after the join). Because the filtered pull-up plan does not satisfy PS1, it is imprecisely shared despite satisfying PS2. The effects of these PS1 violations are, however, not very severe, and it is not surprising that the experimental and simulation results in NiagaraCQ [Chen et al., 2002] generally show this plan as the most efficient.

It is reasonable to ask whether a filtered pull-up plan will always eliminate PS2 violations. It turns out that the answer is no, and we explain why in the next section.
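Before turning to that question, the pass-through behavior just described can be sketched as follows (a hypothetical fragment, not TelegraphCQ code; the class name, window size, and threshold are invented for illustration):

    class AdaptiveDisjunction:
        """Conservative pre-filter for r1 v r2 that disables itself when
        |sigma_{r1 v r2}(R)| is approximately |R| in recent history."""

        def __init__(self, predicates, window=1000, threshold=0.95):
            self.predicates = predicates
            self.window, self.threshold = window, threshold
            self.seen = self.passed = 0
            self.bypass = False

        def process(self, t):
            if self.bypass:
                return t                          # pass input straight through
            ok = any(p(t) for p in self.predicates)
            self.seen += 1
            self.passed += ok
            if self.seen >= self.window:          # end of observation window
                if self.passed >= self.threshold * self.seen:
                    self.bypass = True            # disjunction no longer pays off
                self.seen = self.passed = 0
            return t if ok else None              # None means filtered out

Because the disjunction is conservative, turning it off affects only performance, never correctness.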

4.2.2 Why filtered pull-up is not good enough

We now show why a filtered pull-up strategy may not, in general, eliminate PS2 violations. We demonstrate this with Example 4.2. Notice that the only difference between this example and Example 4.1 is the addition of selection predicates on S.

Example 4.2 Consider the following scenario involving two queries, Q3 and Q4, each of which joins the streams R and S. Each query also has different selection predicates on both R and S.

• Q3: σr1(R) ⋈ σs1(S)
• Q4: σr2(R) ⋈ σs2(S)

The filtered pull-up technique suggests that we pick the plan in Figure 4.4. Notice that this plan is very similar to the filtered pull-up plan for Example 4.1 shown in Figure 4.3. In this plan, tuples are scanned from the streams R and S, and respectively sent to filter operators with predicates r1 ∨ r2 and s1 ∨ s2. The qualifying tuples from both filters are then fed into a join operator, whose results are read by a Copy operator. The Copy operator copies its input stream into two output streams, each of which is fed to a filter with a conjunctive predicate corresponding to one of the individual queries Q3 and Q4. The outputs of these filter operators are the outputs of the queries, and so are sent to the appropriate Out operators.

[Figure 4.4: Filtered pull-up with PS2 violations]

The behavior of this query plan is shown in Figure 4.5. The figure shows a large rectangle that represents a two-dimensional coordinate space. The points on the x-axis and y-axis represent tuples belonging to windows of streams S and R respectively. These windows are represented by the thin rectangles labeled S and R. For ease of exposition, we use S and R to represent the sets of tuples in these windows in the rest of this discussion. Thus, the points in the outer rectangle correspond to tuples in the relational algebra expression R ⋈ S, i.e., the results of joining the two stream windows. Furthermore, the figure shows four other thin rectangles: R1, R2, S1 and S2, each of which represents a set of tuples. The sets of tuples in R1 and R2 are respectively defined by the relational algebra expressions σr1(R) and σr2(R). Similarly, the sets of tuples in S1 and S2 are respectively defined by the relational algebra expressions σs1(S) and σs2(S). Here Q1 = R1 ⋈ S1 and Q2 = R2 ⋈ S2 denote the result sets of the two queries.

[Figure 4.5: Filtered pull-up and zombies]

Observe that the inputs to the join operator shown in Figure 4.5 are the sets R1 ∪ R2 and S1 ∪ S2. This join operator produces the set (R1 ∪ R2) ⋈ (S1 ∪ S2). Notice that this is a superset of (Q1 ∪ Q2), the desired result. Thus, the join produces extra tuples (the set (R1 ∪ R2) ⋈ (S1 ∪ S2) − (Q1 ∪ Q2)) that are not in the results of any query. These extra tuples are zombies, and are indicated in the figure as the two darkly shaded areas inside the smaller rectangle.

With two queries, it is easy to see the relationship between result-set commonality and waste. When the intersection of Q1 and Q2 (result-set commonality) is larger, the wasted work is less, and vice versa. When more queries are added to the system, however, situations with both high commonality and high waste become increasingly likely. In Figure 4.6, we show an illustration of such a scenario. The lightly shaded areas represent results of individual queries. The darkly shaded areas denote zombie tuples that are produced for no utility. In such cases, when there is both redundancy and waste, both the push-down and pull-up models are expensive.

[Figure 4.6: Zombies with many queries]

The upshot of this example is that, in spite of pushing down disjunctions, a shared join can produce unnecessary zombie tuples that have to be eliminated later in the dataflow. In Example 4.2, the worst-case overhead of lost precision is the maximal area of the region identified as the output of the shared join operator, i.e., |R1 ∪ R2| × |S1 ∪ S2|. With two streams, the overhead is quadratic, and as the number of streams increases the overhead becomes more significant. In other words, these overheads are exponential in the number of participating streams. We see more examples of such high overheads in the next section.
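A quick back-of-the-envelope calculation shows how fast this waste can grow (the selectivity and window size below are invented for illustration):

    # Worst-case join output of a k-way shared plan with pushed-down
    # disjunctions: the product of the filtered window sizes.
    fraction = 0.5          # assumed |R1 u R2| / |R| on every stream
    window = 1000           # assumed tuples per stream window

    for k in (2, 3, 4):     # number of participating streams
        print(k, "streams:", int((fraction * window) ** k), "join tuples")

With these numbers the shared join can emit 250,000 tuples for two streams and 125,000,000 for three, even if only a small fraction of them satisfies any query.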

4.2.3 Disjunctions on intermediate results

We have shown how filtered pull-up can cause the production of zombie tuples, violating property PS2. Now, we will show how zombies cause further inefficiencies when they participate in later join work, producing even more zombies. We demonstrate this with Example 4.3.

Example 4.3 Consider the following scenario involving two queries, Q5 and Q6, each of which joins the streams R, S, and T. Each query also has different selection predicates on all three of R, S, and T.

• Q5: σr1(R) ⋈ σs1(S) ⋈ σt1(T)
• Q6: σr2(R) ⋈ σs2(S) ⋈ σt2(T)

A solution based on the pull-up strategy is to reuse the shared plan of Q3 and Q4 from Figure 4.4 and attach a join operator with T to each of OutQ3 and OutQ4. That approach, however, could result in substantial duplicate join processing if there is significant overlap in the result sets of Q3 and Q4. This causes a PS1 violation that was not present in either of the pull-up schemes of the previous section. Given that the plan already suffered from a PS2 violation, the resultant plan would be very inefficient.

The alternative is to discard the split from the plan shown in Figure 4.4 and use its input, complete with zombies, in another shared join with T. This, however, exacerbates the zombie situation, as the zombies that are input to the join cause even more zombies to be produced. These tuples will still ultimately be eliminated by the conjuncts evaluated at the top of the plan. Note that in this situation's worst case, the number of zombie tuples is the product of the cardinalities of the filtered sets of each source.

In this situation, the effects of zombies can be ameliorated by pushing a partial disjunction down between the RS and ST join operators. In Example 4.3, for instance, the partial disjunctive predicate can be (r1 ∧ s1) ∨ (r2 ∧ s2), as shown in the query plan in Figure 4.7. Note that this plan still produces zombies after the RS join operator and so is in violation of PS2. In addition, a careful examination of this plan reveals that the predicates r1, r2, s1 and s2 are each applied three times, and t1 and t2 twice. This is a violation of PS1.

[Figure 4.7: Eliminating zombies using disjunctions]

With more streams being joined, the disjunction push-down scheme becomes increasingly complicated, suggesting that this approach is not very scalable. Now suppose further that we are executing the queries Q5 and Q6 along with queries Q3 and Q4. In keeping with the stated aim to share aggressively without generating zombies, we need to modify the plan in Figure 4.7 to produce the plan shown in Figure 4.8. Clearly the plan gets increasingly complicated, with a lot of work being spent repeatedly re-evaluating predicates – the predicates on R and S are each potentially evaluated four times for a given tuple. In addition to these violations of precision sharing, an efficient implementation of the Copy operator is also more complicated. Recall, from Section 4.2.1, that in practice the Copy operator is combined with all the predicates that are executed immediately after it. These predicates are built into a query index that the Copy consults to route tuples. When the predicates involve more than one attribute, as is the case here, this index will have to be multi-dimensional.

[Figure 4.8: Eliminating zombies using disjunctions: many queries, many disjunctions]

4.2.4 Reviewing imprecisely shared static plans

We now review the imprecisely shared static plans that we considered earlier in this section. In particular, we identify the problems that cause each of the violations of precision sharing present in these plans. We first summarize the problems, and then consider how they can be solved.

• Problem (a). PS1 violation in selection push-down. In the selection-push-down plan (Figure 4.2(b)), a tuple from a stream is copied and sent to two different join operators, where the same build and probe operations are applied to each copy of the tuple.

• Problem (b). PS1 violation in filtered pull-up. In the filtered pull-up plans (Figures 4.3 and 4.4), a predicate may be evaluated multiple times for a tuple.

• Problem (c). PS2 violation in selection pull-up and filtered pull-up. In both the selection-pull-up (Figure 4.2(a)) and filtered pull-up (Figure 4.4) plans, join operators can produce zombie tuples that have to be subsequently processed and eliminated.

With problem (a), the only time we can expect a selection-push-down strategy to be competitive is when a tuple from a stream is sent to very few join operators, a situation that arises when the sets of qualifying tuples that satisfy the individual predicates of each query have very few overlaps with each other. An identical observation was also made in NiagaraCQ [Chen et al., 2002]. Thus, in general, the pull-up strategies are the best way to reduce this overhead of repeated work (join operations).

Problem (b) arises because static plans throw away the results of earlier predicate evaluations. This makes sense in classical non-shared systems, where predicates are generally conjuncts and the presence of a tuple above a filter is enough to deduce that the tuple has passed every conjunct of the filter. An alternative approach is to memoize the effect of each predicate evaluation with each tuple (i.e., for every tuple, store the results of applying each predicate for later reuse) and thereby avoid multiple predicate evaluations.

Problem (c) is again the result of discarding information on predicate evaluation. If, for each tuple, the information on each predicate evaluation is memoized with the tuple, then a smart join operator can easily avoid producing zombie tuples. Thus, the key factor in a solution that avoids PS1 and PS2 violations is the ability to memoize the effects of early predicate evaluation (i.e., of predicates that are evaluated near the leaves of query plans).

To summarize, we have shown that there are situations where standard techniques of shared query processing are not precise. In an attempt to efficiently reuse common work, they can end up producing useless data that can be exponential in the number of streams involved. Not only is the production of such useless tuples wasteful, the work done to subsequently eliminate them is a further waste. We have also identified a key aspect (memoization of predicate evaluation) of an alternate solution that can generate precisely shared static query plans. In the next section we develop this alternate solution.

4.3 TULIP: Tuple Lineage in Plans

In the last section, we formally defined the concept of precision sharing and showed how existing sharing techniques can generate imprecisely shared query plans. In this section, we propose TULIP (TUple LIneage in Plans), a solution for static precision sharing. TULIP uses the idea of “tuple lineage” from adaptive dataflow processing (see Section 3.2).

4.3.1 Tuple Lineage

The analysis in Section 4.2.4 identified memoizing predicate evaluation as key to precision sharing. We now consider the use of “tuple lineage” for this purpose. We reviewed the notion of tuple lineage earlier in this thesis (see Section 3.2.1) in the context of adaptive dataflow techniques. Although tuple lineage has thus far only been used with adaptive dataflows, a key insight is that it is more generally applicable, and in fact is also useful in static dataflows.

As described in CACQ (see Section 3.2.2), all tuples that flow through the system carry lineage information that consists of: (1) a steering vector [Avnur and Hellerstein, 2000] that describes the operators in the dataflow that have been visited (done) and are to be visited (ready), and (2) a completion vector [Madden et al., 2002b] that describes the queries in the system that are “dead” for this tuple, i.e., those that this tuple cannot satisfy. In CACQ, the distinction between these two parts of lineage was blurred, while in truth they have two distinct roles. The steering vector is exclusively a tuple routing mechanism: apart from the routing infrastructure, such as an Eddy operator, no other operator should use its contents. In contrast, the completion vector is a query sharing mechanism, and thus need not be visible to the routing fabric (the Eddy).
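The separation of the two roles can be pictured with a small schematic (illustrative Python; TelegraphCQ's actual data structures differ):

    from dataclasses import dataclass, field

    @dataclass
    class StreamTuple:
        data: tuple
        # Steering vector: operators already visited (done) and still to
        # be visited (ready); read and written only by the routing fabric.
        done_ops: set = field(default_factory=set)
        ready_ops: set = field(default_factory=set)
        # Completion vector: bit i set means query i is "dead" for this
        # tuple; a query-sharing mechanism, invisible to the routing fabric.
        dead_queries: int = 0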

4.3.2 The TULIP solution

In this section we present TULIP, a solution for static shared query processing. This solution must address all three of the problems – (a), (b), and (c) – identified in Section 4.2.4.

We solve Problem (a) (i.e., PS1 violations in selection push-down) by ensuring that join operators are shared and that predicate disjunctions are pushed below the shared joins (in a manner similar to filtered pull-up). That is, we rely on an appropriate MQO scheme to produce a shared plan that uses a filtered pull-up strategy for a set of queries.

We solve Problem (b) (i.e., PS1 violations in filtered pull-up) by using tuple lineage to memoize the results of predicate evaluations. Since we only need the completion vector part of a tuple's lineage, for the rest of this section we refer to this portion as the “lineage vector”. The insight for the solution comes from Rete [Forgy, 1982], a discrimination network for the many-pattern/many-match problem, the most time-consuming part of which is the match step. To avoid performing the same tests repeatedly, the Rete algorithm stores the result of the match in temporary state associated with the object being matched. Similarly, in TULIP, each tuple carries a lineage vector that the system uses to keep track of the queries that this tuple cannot satisfy, based on the predicates and joins that have already been evaluated.

Furthermore, the TULIP operators that evaluate predicates and joins update the lineage vectors of the tuples that they output. Finally, we solve Problem (c) (i.e., PS2 violations in filtered pull-up) by using a special join operator that exploits tuple lineage to suppress the production of zombie tuples.

We now explain how TULIP evaluates predicate filters, join expressions, and output operations. We use three kinds of operators, one for each of these operations, all of which exploit tuple lineage. These operators are called lineage-sensitive operators, and we describe them below.

1. Grouped Selection Filters (GSFilter). The GSFilter is an operator (first proposed in CACQ) that implements the disjunction of multiple predicates and memoizes the results of each predicate into the tuple's lineage vector. (Section 3.2.2 has more details on how a GSFilter works.) In contrast, a disjunctive predicate of the sort used in the filtered pull-up strategy only checks whether a tuple satisfies at least one of its predicates, and does not retain any information on which particular predicates were not satisfied. Thus, a GSFilter can set up a tuple's lineage vector so that the predicates of the disjunction associated with the GSFilter need never be re-evaluated for the tuple later in the plan. This approach is similar to index OR-ing strategies [Mohan et al., 1990] for disjunctive predicates that are used in classical systems.

2. Zombie-killing symmetric join (ZKSJoin). We use a novel ZKSJoin operator in order to avoid producing zombies while evaluating a join expression. This operator is a simple extension of the symmetric hash join operator that was described earlier in Section 3.1.1. The ZKSJoin operator depends on the system to ensure that tuples go through grouped filters prior to entering the join. The operator preserves the lineage vector of the tuples that it inserts (builds) into its index. Further, when the operator evaluates a join expression by probing the index with a probe tuple and finds a matching build tuple, it computes the union of the lineage vectors of the build and probe tuples. If this union includes all queries that use the operator, then the match is a zombie tuple, and it is promptly discarded. Note that since a lineage vector (the completion vector in this case) is implemented as a bitmap, the union of two lineage vectors can be efficiently computed as a bitwise OR over the vectors.

3. Output (Out). We use the Out operator to direct tuples to the results of different queries. The Out operator is similar to those used with classic static plans (see Section 3.1), except that it sends its input tuples to specific queries based on the completion vectors of the input tuples. That is, the Out operator uses the completion vector of a tuple to determine the set of queries that are live for that tuple.

To summarize, TULIP involves the following:

1. Any appropriate MQO scheme that results in filtered pull-ups can be used to generate an initial shared plan.

2. The initial plan is modified so that all disjunctions that are pushed down are replaced with GSFilter operators.

3. The join operators of the modified plan are replaced with ZKSJoin operators.

4. The output operators of the modified plan are replaced with Out operators.
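The two key lineage-sensitive operators can be sketched as follows (illustrative Python for a two-query plan; this is not the TelegraphCQ implementation, and all names are invented):

    from collections import defaultdict

    ALL_DEAD = 0b11               # two queries: bit 0 for Q3, bit 1 for Q4

    class GSFilter:
        """Disjunction of per-query predicates; memoizes each failed
        predicate as a 'dead query' bit in the tuple's lineage vector."""
        def __init__(self, preds):            # preds[i]: predicate of query i
            self.preds = preds

        def process(self, t, lineage=0):
            for i, p in enumerate(self.preds):
                if not p(t):
                    lineage |= 1 << i         # query i is dead for this tuple
            return None if lineage == ALL_DEAD else (t, lineage)

    class ZKSJoin:
        """Symmetric hash join that discards zombies by taking the
        bitwise-OR union of build and probe lineage vectors."""
        def __init__(self, key):
            self.key = key
            self.index = (defaultdict(list), defaultdict(list))  # per side

        def process(self, t, lineage, side):
            self.index[side][self.key(t)].append((t, lineage))   # build
            out = []
            for other, olin in self.index[1 - side].get(self.key(t), []):
                union = lineage | olin        # cheap bitwise OR of bitmaps
                if union != ALL_DEAD:         # some query is still live
                    out.append(((t, other), union))
            return out                        # zombies are never emitted

For instance, an R tuple that fails only r2 (lineage 0b10) joining an S tuple that fails only s1 (lineage 0b01) yields a union of 0b11 = ALL_DEAD: the match satisfies neither Q3 nor Q4, so it is suppressed before it is ever produced.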

Putting it all together


We now revisit the driving example of this chapter: the scenario that shares queries Q3, Q4, Q5 and Q6 (from Examples 4.2 and 4.3). This scenario can be processed using a shared static query plan based on the TULIP model, as shown in Figure 4.9. In this example, tuples are scanned from the R, S and T streams and fed to respective GSFilter operators, which are shown with the predicates they evaluate. The outputs of the GSFilters are then fed into two ZKSJoin operators based on a fixed join order. Here, the outputs from the GSFilter operators that process the R and S streams are first joined together. The output of this ZKSJoin operator is then fed to the second ZKSJoin operator, whose other input is the output of the GSFilter operator that processes the T stream.

[Figure 4.9: Using TULIP for precision sharing]

We now consider the precision sharing properties of this approach. First, PS1 is satisfied, as this plan does not perform any operation on a given tuple more than once. As shown in the figure, join operations are shared in the ZKSJoin operators, so build and probe operations are not repeated for any tuple. Similarly, each predicate is evaluated only once, in the GSFilters that are pushed down below the ZKSJoin operators. Next, PS2 is also satisfied, since the ZKSJoin operators receive input tuples whose lineage vectors memoize the results of predicate evaluation (in the GSFilter operators) and can thus avoid producing zombie tuples of any kind.

It is instructive to compare this plan with the equivalent traditional shared plan in Figure 4.8. Not only is the TULIP plan an example of precision sharing, it is easy to see that the plan for sharing many queries looks very similar to a plan for a single query (simply replace each GSFilter with an ordinary filter σ, and each ZKSJoin with an ordinary join ⋈). This makes it easy to use TULIP with multiple queries. In contrast, recall (from Section 4.2.3) that the filtered pull-up plan gets increasingly complicated as we deal with more queries and streams.

The main insight in TULIP is that the use of lineage helps to memoize predicate evaluation and avoid repetitive computations, à la Rete networks. Furthermore, lineage-sensitive operators like GSFilter, ZKSJoin and Out can exploit lineage to recognize and eliminate potential zombie tuples even before they are produced. These uses of tuple lineage ensure that TULIP does not violate properties PS1 and PS2, respectively.

It is important to note that there can be many precisely shared plans, and there is no guarantee that any one of them is optimal and results in the lowest computational costs. When an optimizer estimates the cost of a plan, it uses the number of tuples at each stage of the plan to determine the cost of each operator in accordance with its cost model. With TULIP, a new set of plans that emit fewer tuples between operators can now be considered during plan enumeration. The estimated cost of each operator is slightly higher because of the overhead of lineage manipulation. The key issue is the expected number of zombies produced at each stage: if this number can be estimated, then the optimizer can choose between TULIP and other plans in its pursuit of an optimal solution.

The use of tuple lineage has an associated overhead: the storage and manipulation costs of the completion vectors of tuples, which can be particularly profligate in memory consumption – a bit per tuple per query results in space overhead that is linear in the product of the number of queries and the number of currently active tuples. The net result is that the use of tuple lineage in a static query plan does not come for free. Thus, there can be cases where TULIP underperforms existing shared techniques. We investigate this issue in the performance study reported in Section 4.5.

In this section we outlined TULIP, a solution for processing multiple concurrent queries with a shared static dataflow. We used an example to demonstrate how a TULIP query plan can be precisely shared. Later in this chapter, we evaluate TULIP with a performance study in Section 4.5.
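To put the “bit per tuple per query” overhead in perspective, a quick calculation (all numbers invented for illustration):

    queries = 256
    active_tuples = 100_000                 # assumed tuples holding lineage
    mib = queries * active_tuples / 8 / 2**20
    print(f"completion-vector state: {mib:.1f} MiB")   # about 3.1 MiB

Modest in absolute terms, but this cost is paid on every tuple and grows with both the query population and the window sizes.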

4.4 Adaptive Precision Sharing

The previous two sections examined precision sharing in static shared dataflows and outlined TULIP, a solution to this problem. In this section, we study precision sharing in a shared adaptive dataflow system like CACQ, and present CAR (Constrained Adaptive Routing), a solution for this problem. We begin by showing how CACQ is susceptible to violations of precision sharing (recall the detailed review of CACQ presented earlier in this thesis, in Section 3.2.2). We then develop CAR, a technique that solves this problem. Just as we used ideas from the adaptive approach to make static sharing precise, it turns out that we can use techniques from the static world to remove the precision sharing violations from the adaptive approach.


4.4.1 Precision sharing violations in CACQ

We now show how CACQ violates the precision sharing properties we defined in Section 4.2. We are concerned only with sharing, and do not address any of the considerable benefits that an adaptive system may have in volatile scenarios. Two kinds of precision sharing violations are possible in CACQ:

1. Zombie production (PS2 violation). As we explained in Section 3.2.2, CACQ requires that as soon as a tuple is scanned from a given stream, it is always immediately built into any SteM associated with the stream, in order to enforce a correctness constraint. That is, in CACQ the build tuples that are sent to a SteM carry no useful lineage, since their completion vectors have not been updated by GSFilter operators. Thus, when a SteM produces join tuples, there is no way to combine the lineage vectors of the probe and build tuples to eliminate zombies. To see why this is so, recall the description of the ZKSJoin operator, which does not produce zombies (Section 4.3.2). The ZKSJoin eliminates zombies by computing the union of the lineage vectors of the build and probe tuples. Since, in CACQ, the build tuples carry no useful lineage, the SteM cannot eliminate any zombies, and therefore CACQ violates the PS2 property of precision sharing.

Explained another way, the production of zombies depends on where individual selection predicates are placed in a plan with respect to join operators. With a conventional binary join operator there are the two choices explored by NiagaraCQ and discussed in Section 4.2: pushing the selections below the join (“selection push-down”), or pulling them above the join (“filtered pull-up”). In an adaptive approach like CACQ, however, the internal build and probe operations of a join are decoupled (as shown in Figure 3.4), and there are three choices for locating selection predicates (as GSFilter operators): after the probe, between the build and the probe, and before the build. Since, in CACQ, the build and scan are performed together, only two choices are possible – either between the build and probe, or after the probe – with the routing policy guiding the decision in an adaptive fashion on a per-tuple basis. Unfortunately, both choices result in the production of zombies.

Note that it is possible to use a draining constraint (Section 3.2.2), as is the case in TelegraphCQ (described in Appendix A). With a draining constraint, there is no need for tuples to be built into their SteM as soon as they have been scanned: once a tuple is scanned, the system is free to route it to any operator from a set of candidates, including GSFilters, probe SteMs, and build SteMs. The order in which these operations are applied to a tuple does not affect the correctness of the results. Thus, with a draining constraint, all three choices for locating selection predicates are possible. The problem, however, is that such a routing policy still cannot guarantee that every tuple passes through a GSFilter before being built into a SteM. Thus, this approach allows only some fraction of zombies to be eliminated.

2. Repeated output processing (PS1 violation). In CACQ, the main code in the eddy is organized as a loop that continuously drives the invocation of operators. Thus, output processing is done every time a tuple is returned to the eddy as a result of an operator invocation, i.e., in each iteration of the loop. An intermediate tuple's steering vector is compared with the completion requirements of each query, and if the tuple satisfies any query it is immediately delivered. Not only is this an expensive operation, especially in the presence of a large number of queries, but a given tuple may also be processed repeatedly as an output for multiple queries. This is a violation of the PS1 property. The performance study of TULIP in Section 4.5 shows that repeated processing of the same tuple in the outputs of multiple queries (a PS1 violation) can drastically hurt performance. What we really need is a way to route tuples to output operators only when they are finally ready for them.

4.4.2 CAR: Constrained Adaptive Routing

Here we propose Constrained Adaptive Routing (CAR) as an alternative to CACQ. We will show that this scheme has almost all the adaptivity benefits of CACQ and still satisfies precision sharing. As explained in Section 4.4.1, CACQ violates precision sharing by producing zombies and by repeating output processing operations. The former is caused by a hidden constraint (build along with scan) that forces poor selection placement; the latter arises because output processing is performed in an unconstrained, ad hoc fashion.

The root of the problem is that there are multiple constraints that must be satisfied in an adaptive dataflow. Some, such as build before probe, are for correctness; others, such as filters before build and output only when done, are for performance. Thus, the goal in this section is to design an architecture where such constraints can be expressed explicitly, ensuring both correctness and performance.

In CAR, we achieve this with operator precedence, a routing mechanism that is an alternative to the unconstrained routing of tuples in an eddy. In this approach, we record precedence relationships between operators in a precedence graph. As with CACQ, this mechanism is used to generate a set of candidate operators to which tuples must be routed. In its simplest form, this is a graph whose nodes are sets of operators (called “candidates”) and whose edges represent legal transitions from one node to the next. When a tuple is routed through the candidates of a particular node, it is subject to a routing policy such as the lottery scheme in CACQ. This routing mechanism ensures that CAR can adaptively respond to changes in selectivity, data rates, etc.
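One way to picture the mechanism (a schematic sketch in Python; the node names follow the flavor of Figure 4.10, but the code-level representation is invented):

    # Precedence graph for R tuples of queries Q3/Q4: each node lists the
    # candidate operators a tuple may visit next; the edges encode the
    # constraints "filters before build", "build before probe",
    # and "output only when done".
    PRECEDENCE = {
        "scan_R":         ["gsfilter_r1_r2"],
        "gsfilter_r1_r2": ["build_stem_R"],
        "build_stem_R":   ["probe_stem_S"],
        "probe_stem_S":   ["out_Q3_Q4"],
        "out_Q3_Q4":      [],
    }

    def route(t, node, operators, policy=lambda cands: cands[0]):
        """Drive a tuple through the graph; within a node, any adaptive
        policy (e.g., CACQ's lottery scheme) picks among the candidates."""
        while PRECEDENCE[node]:
            node = policy(PRECEDENCE[node])   # constrained, yet adaptive
            t = operators[node](t)
            if t is None:                     # filtered out or zombie
                return None
        return t

With several joins, a node would contain all currently legal probe SteMs, and the policy would order them adaptively; the constraints only rule out orderings that produce zombies or repeat output processing.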

There are, however, certain ways in which CAR is less adaptive than CACQ. We explain these cases later in this section.

[Figure 4.10: Operator precedence graph for CAR]

In Figure 4.10 we show an operator precedence graph for the queries Q3 and Q4. There are 8 nodes in the graph, and some operators (such as the SteMs and Outputs) appear in more than one node. Clearly, with this routing scheme in CAR, tuples are filtered and then built into SteMs. This routing policy enables the early recognition and elimination of zombies, preserving the PS2 property. A given tuple is subject to output processing only once – when it is ready – which preserves the PS1 property. Thus, in this example, the CAR strategy is precisely shared.

Effects on adaptivity

Adaptive dataflow schemes like CACQ have had a significant impact because they make it easy for a system to respond to changes in its environment. These changes can be in terms of data (e.g., a spike in the input rate of a stream) as well as in terms of queries (e.g., a new query was just added to the system). We now consider how CAR and CACQ differ in terms of adaptivity.

Note that statically fixing predicate placement can hurt adaptivity. In order to reduce zombies, GSFilters ought to be processed before build operations. If, however, the filters in question are expensive and cost more than join operations, then reducing zombies may be less important. Adaptivity in CACQ allowed for efficient join ordering as well as delayed execution of expensive filters, at the cost of zombies. In contrast, with CAR, while joins are still ordered efficiently (and without zombies), this comes at the cost of early evaluation of expensive filters. If there are no expensive predicates, however, there is no downside to the CAR approach of pushing their evaluation early in query plans. In these situations, CAR provides the same adaptivity benefits as CACQ. On the other hand, if there is a filter that is known to be expensive, it is easy to fix the CAR precedence graph to revert to CACQ behavior. This leads to an interesting question: is it possible to make this choice adaptively? While it is not yet clear how to devise such a routing policy, we believe that in practice simple filters are much more common, and the heuristic of reverting to CACQ in the presence of expensive predicates should be enough for most applications.

The main insight of this approach is the use of techniques from the static world. A purely adaptive approach makes routing decisions every step of the way. In contrast, placing constraints on adaptivity makes it possible to ensure that, for any tuple, its associated predicates are evaluated (by GSFilter operators) before it is built into any SteM. We used an example to demonstrate that the CAR approach of constrained adaptivity can satisfy the requirements of precision sharing.

4.5 Performance of TULIP

In this section we study the performance of TULIP, the static precision sharing approach, and compare it to the static filtered pull-up and selection-push-down schemes described in NiagaraCQ.


4.5.1 Experimental setup

The experiments were performed on a 2.8 GHz Intel Pentium IV processor with 512 MB of main memory. TULIP was implemented in the TelegraphCQ system [Krishnamurthy et al., 2003; Chandrasekaran et al., 2003]. Since TelegraphCQ does not have a static multiple-query optimizer, we programmatically hooked up static plans using the TelegraphCQ operators. We used the same technique to evaluate the NiagaraCQ schemes in TelegraphCQ. To evaluate the static NiagaraCQ plans fairly, we set up the system so that no lineage information is stored in intermediate tuples and TelegraphCQ's operators do not perform any unnecessary work manipulating lineage. For instance, the disjunctions of filtered pull-up are realized with a GSFilter that does not set lineage, and the symmetric join operator ignores lineage. We emphasize that the intermediate data structures in the NiagaraCQ measurements we present have no space overhead for lineage.

We also note that in these experiments we follow the NiagaraCQ approach of implementing Copy operators, not what is shown in the static plans in Section 4.2, where the Copy operators are separate from the predicate filters that follow them. More specifically, in these experiments we use a Copy operator that probes its input tuples into a predicate index implemented by a GSFilter. This approach lets Copy operators send tuples only to the plan elements of queries that passed the probe. The top of each plan has one Out operator for each query. In the TULIP implementation, TelegraphCQ's intermediate tuples have lineage turned on, and TULIP plans use GSFilters, zombie-killing symmetric hash joins, and output operators that manipulate lineage.

In both implementations, the output operator makes a tuple available for delivery to a query by queuing it to the process managing the query's connection. The queue is in shared memory, access to which can be expensive, so for all of these experiments we suppress output production. Even so, output processing is still not trivial, since latency computations involve a system call to find the current time for each output tuple. The cost of output processing is still, however, cheaper than the actual system overhead of sending the same tuple multiple times through shared memory.

It is important to see where the savings of zombie elimination come from. In TelegraphCQ, where all the operators execute in a single thread of execution in one process, the cost of operator invocation is minimal: a function call and a pointer copy. The real savings is the avoidance of unnecessary zombie production and elimination. In other systems, where operators are often invoked in different threads (e.g., Aurora), the savings are even greater, as having fewer zombies leads to fewer operator invocations, which in turn mean less context-switching overhead.

[Figure 4.11: Experimental setup: query result sets with fewer overlaps]

[Figure 4.12: Experimental setup: query result sets with greater overlaps]

The experiments reported here all share a set of queries that are joins on streams R and S with individual predicates on each stream. The queries have identical structure and correspond to queries Q3 and Q4 from Section 4.2.2. The template of these queries is shown in Query 4.1.

Query 4.1 Experimental setup: query template

    SELECT R.a, R.b, S.a, S.b
    FROM R, S
    WHERE R.a = S.a
      AND R.b > const_0 AND R.b < const_1
      AND S.b > const_2 AND S.b < const_3;

We generate 256 queries for the experiments by supplying values for the constants in each of the queries, in two setups. We show these setups visually in Figures 4.11 and 4.12. As before, shaded areas represent results of queries, and darkly shaded pieces are zombies that would be generated by selection pull-up. We used TULIP to log the number of zombies actually eliminated; this is shown for both cases in Figures 4.13 and 4.14.

In the first setup, shown in Figure 4.11, the result set of each query overlaps with those of two other queries. In this case, as queries are added to the system, more and more zombies are produced, as shown in Figure 4.13.
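The dissertation does not list the constants themselves; the sketch below shows one plausible way to instantiate 256 such queries for the fewer-overlaps setup (all values are invented for illustration):

    def make_queries(n=256, width=20, step=10):
        """Each query's predicate range overlaps only those of its two
        neighbours (half-window overlap); constants are illustrative."""
        queries = []
        for i in range(n):
            lo, hi = i * step, i * step + width
            queries.append(
                "SELECT R.a, R.b, S.a, S.b FROM R, S "
                f"WHERE R.a = S.a AND R.b > {lo} AND R.b < {hi} "
                f"AND S.b > {lo} AND S.b < {hi};"
            )
        return queries

    print(make_queries(2)[1])   # the second instantiated query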

[Figure 4.13: Experimental setup: zombies with fewer overlaps in query result sets]

Conversely, in the second setup, shown in Figure 4.12, the result set of each query overlaps with many other sets. To achieve this, the first two queries are arranged so that they have almost no overlap (i.e., they are the two queries farthest apart), and every query that is subsequently added overlaps with one or both of the first two queries. Since each such query contributes no extra zombies, the effect of adding queries is to steadily reduce the number of zombies produced, as shown in Figure 4.14.

[Figure 4.14: Experimental setup: zombies with greater overlaps in query result sets]

In these experiments, we measure the average latency of the results of each query. Synthetic data is generated (from a uniform distribution) and piped into TelegraphCQ by an external process. Each tuple arriving at the system is timestamped on entry, before it is read by any scan operator. When a tuple arrives at an output operator, we examine its components and compute the difference between the current time and the time it originally entered the system. This represents the latency of the tuple; we report the average latency in the experimental results in this section.

We considered the four static approaches that we investigated earlier: (a) selection pull-up (SPU), (b) filtered pull-up (FPU) (Figure 4.4), (c) selection push-down (SPD), and (d) TULIP. In these graphs, we do not report the SPU case, as it is dominated by FPU. (Plans for selection pull-up and push-down with predicates on only one source are shown in Figure 4.2; the multiple-predicate case is a simple extension.)
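The latency bookkeeping described above amounts to the following (a schematic sketch; TelegraphCQ's actual implementation differs):

    import time

    def stamp_on_entry(t):
        return (time.monotonic(), t)     # timestamp before any scan operator

    class LatencyTracker:
        def __init__(self):
            self.total = 0.0
            self.count = 0

        def observe(self, stamped):      # called at an output operator
            entered, _ = stamped
            self.total += time.monotonic() - entered
            self.count += 1

        def average_ms(self):
            return 1000.0 * self.total / self.count if self.count else 0.0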

4.5.2 Performance results

For each setup, we plot (in Figures 4.15 and 4.16) the average latency of result tuples for each approach against the number of queries being shared. Note that the number of queries is shown on a log2 scale on the x-axis. In both setups, the average latency for all plans is very small (under 25 ms) for 2 queries and increases steadily as queries are added. For each approach, there is a certain number of queries at which there is a knee in the graph, showing the scheme's scalability limits. The following overheads affect average latencies:

• PS1 violations: repeated work for the same tuples in intersecting result sets:
  – (SPD): in the various separate join operators.
  – (FPU): in output processing.

• PS2 violations: (FPU) unnecessary work caused by the production of zombies in joins and their removal afterward.

• Other: (TULIP) CPU instructions for lineage management. The state overhead was negligible in the experiments.

[Figure 4.15: Static query plans: average query latencies with fewer overlaps in query result sets]

Setup 1 (Fewer overlaps): As seen in Figure 4.15, for 32 or fewer queries the performance of all three plans remains similar. Latencies increase steadily from 6 ms to 17 ms, while the number of zombies produced by FPU increases from 14 to 9133. At 64 queries, the latency for FPU jumps to 72 ms, while that of SPD and TULIP stays at 30 ms. In this case, doubling the number of queries leads to a four-fold increase in the number of zombies (≈ 39000). The zombie overheads of FPU slow it materially, and it does not scale well beyond 64 queries. For 128 and 256 queries, the average latency of FPU is 430 ms and 4300 ms respectively; these data points are not shown in the figure.

Returning to SPD and TULIP, beyond 64 queries the performance of both approaches starts degrading. As queries are added, each new query causes more tuples that cannot be easily eliminated before joins. TULIP is, however, slightly more expensive than SPD, and at 256 queries its latency is 147 ms as opposed to SPD's 125 ms. In general, sharing does not have much advantage when the results of the queries being shared have few overlaps. This is exactly what we observe in this case, and the minimally shared SPD scheme does slightly better overall. The repeated-work overhead in SPD is less than that of lineage management in TULIP; both costs are dwarfed by the zombie overheads of FPU.

[Figure 4.16: Static query plans: average query latencies with greater overlaps in query result sets]

Setup 2 (Greater overlaps): As seen in Figure 4.16, in the “greater overlaps” scenario all three plans behave similarly for 4 or fewer queries, with latencies ≈ 25 ms. For 2 queries, FPU is the outright winner, as the two initial queries have no overlap. From 4 to 32 queries, the performance of FPU and SPD both degrade quickly. As queries are added, more tuples overlap, causing repeated work. One instance of increased repeated work is in output processing, for which SPD and FPU behave similarly. These new tuples, however, also cause: (1) repeated join overheads in SPD and (2) overheads resulting from zombies in FPU. As zombies decrease from ≈ 49000 to plateau at ≈ 25000, the former overheads increase and the latter decrease. From 32 to 64 queries, SPD and FPU perform the same. Beyond 64 queries, the join overheads of SPD become much worse, leading to SPD having a latency of 8.02 seconds for 128 queries as opposed to 1.7 seconds for FPU (these are not shown in the graph). The TULIP scheme, however, performs very well, degrading gracefully as queries are added. At 256 queries, the latency of TULIP is 113 ms. In contrast, the FPU and SPD schemes reach a comparable overhead of 111 ms and 102 ms at only 16 queries. For the same latency, TULIP scales to 16 times as many queries as the traditional schemes.

Summary: The insights of the performance analysis of static approaches are as follows:

1. The overheads of both repeated work and unnecessary work are significant.

2. The two setups (greater and fewer overlaps) demonstrate two extreme cases, each favoring one of the two traditional approaches (FPU and SPD).

3. In each extreme case, the TULIP solution of precision sharing performs very well. While in the case of minimal sharing it is competitive with the ideal FPU, in the face of high sharing it is more than an order of magnitude better than either traditional scheme.

These experiments demonstrate the robustness of TULIP. When sharing is useful, TULIP gives significant improvements over the best known approaches. When there is not much use in sharing, the extra overhead of TULIP (that of lineage management) is measurable but modest. This suggests that TULIP can give very good benefits in many cases while staying competitive otherwise.

4.6 Performance of CAR

In this section we examine the performance of CAR, the constrained adaptive routing technique described above. The experimental setup and methodology are identical to those described for static plans in Section 4.5. For each of the two setups, we report the average latencies of query results for CAR and CACQ in Figures 4.17 and 4.18. As in the previous section, the number of queries is shown on a log2 scale on the x-axis.

As in the static case, for both setups the average latency of CACQ and CAR with 2 queries is small (5-30 ms) and increases steadily with query addition until scalability limits are reached. The following overheads can affect latencies:

• PS1 violations: (CACQ) repeated output processing of the same tuple in different queries.

• PS2 violations: (CACQ) unnecessary work caused by the production and removal of zombies.

• Other: (CAR, CACQ) CPU instructions involving lineage management.

In this experiment, for CACQ the tuples produced by probes into SteMs are immediately ready for output. There are no additional filtering steps, and so there are, in fact, no PS1 violations causing output processing overheads.

[Figure 4.17: Adaptive query plans: average query latencies with fewer overlaps in query result sets]

[Figure 4.18: Adaptive query plans: average query latencies with greater overlaps in query result sets]

In both setups, the performance of CAR comfortably outstrips that of CACQ. Just like TULIP, CAR degrades gracefully with the addition of new queries. In the fewer-overlaps case, with 2 queries there are actually no overlaps. Despite this, the production of 14 zombies is enough to cause CACQ's latency to be 21 ms as opposed to 6 ms for CAR. This result shows the savings in output processing (PS1 preservation) for CAR. At 256 queries, the latency of CAR is ≈ 151 ms; at an equivalent latency, CACQ supports only 18 queries. For this latency, CAR supports 14 times (an order of magnitude) more queries than CACQ.

In the greater-overlaps case, CACQ scales more gracefully than with fewer overlaps. Note that in this case, the relative overheads of zombies actually drop with more queries. With 256 queries, CACQ has a latency of 550 ms as opposed to 131 ms for CAR. Note that CACQ can support a latency of 131 ms for only 48 queries, while CAR handles 5 times as many. The performance difference in both overlap setups is related to the number of zombies produced; with fewer overlaps, the production of zombies cripples CACQ.

It is also important to note that, in comparison to the static schemes, CAR performs almost as well as TULIP. With 256 queries, the latency of CAR in the greater-overlaps case is 131 ms as opposed to 113 ms for TULIP; in the fewer-overlaps case it is 151 ms for CAR as opposed to 147 ms for TULIP. These results are not surprising, as the only difference between CAR and TULIP is the cost of adaptivity. Since there are no choices to be made in these experiments, the latency differences we observe let us reckon the baseline cost of adaptivity.

In summary, the experiments reported here on adaptive approaches indicate that:

1. The overheads of producing zombies, or unnecessary work, are significant in adaptive dataflows, even when relatively few zombies are produced.

2. In both overlap setups, the CAR approach of adaptive precision sharing performs very well.

3. In these scenarios, the baseline costs of adaptivity are not very significant.

4.7

Chapter Summary

The development of shared query processing techniques has focused on reducing the overheads of redundancy. Aggressive reduction of repeated work can, however, cause additional wasted work due to the processing of useless data. In previous work, this inherent tension between repeated work and wasted work has either been taken for granted, or not even noticed. The major contributions in this chapter are: (1) to show that this tension is not irreconcilable, and (2) to develop both static and adaptive techniques that balance the tension gracefully.

We defined precision sharing as a way to characterize any sharing scheme that has neither repeated work nor wasted work. We then showed how previous approaches to shared stream processing led to imprecisely shared plans. Armed with these observations, we first charted a strategy to make static shared plans precise. The insight is that tuple lineage, an idea from adaptive query processing, is actually more generally applicable. We then proposed TULIP, or “TUple LIneage in static Plans”, a technique that makes static shared plans precisely shared by using tuple lineage.

The next contribution was to show how adaptive shared query processors can also violate precision sharing. Here we reversed the strategy used with TULIP, and adopted the idea of operator ordering from static dataflows to make adaptive approaches precisely shared. Using this strategy, a new approach CAR, or “Constrained Adaptive Routing”, is able to achieve precision sharing while possibly sacrificing a small degree of adaptivity. Furthermore, in many common cases (e.g., where predicates are inexpensive) CAR is as adaptive as CACQ, the previous state of the art approach.

Finally, we reported the results of a performance study of the various schemes: precise and imprecise, static and adaptive. Key results from this study include (1) TULIP can support roughly an order of magnitude (between 8x and 16x) more queries than a traditional approach (e.g., filtered pull-up from NiagaraCQ) with comparable latency in producing results, and (2) CAR can offer a dramatic reduction over the previous state of the art (CACQ) in the average latency of query results. In one scenario, the reduction was as much as 75%, and in another scenario CACQ is so overwhelmed that it cannot even run more than 32 queries. Thus, these experiments show that the precision sharing approaches either significantly outperform, or are competitive with, the other schemes under different conditions.

In summary, there are two major contributions in this chapter. First, this chapter shows that previous research on shared processing of join queries is incomplete, as it considers an oversimplified scenario (the only variations are predicates on the streams in a join). Second, and more importantly, this chapter uncovers a fundamental issue (the tension between repeated work and wasted work) in shared query processing of join queries that has thus far gone unnoticed. Now that we have addressed how to share join queries, we consider the best ways to share the processing of aggregate queries in the next two chapters.


Chapter 5

Sharing for Aggregation with Varying Windows

The previous chapter presented precision sharing, a fundamental issue in shared query processing that is particularly relevant while concurrently executing multiple streaming join queries in a single-site system like TelegraphCQ. Now that we have addressed the sharing of join queries, we turn to the problem of processing aggregate queries in both single-site systems like TelegraphCQ and hierarchical systems like HiFi. (The work reported in this chapter was published in [Krishnamurthy and Franklin, 2005].)

5.1 Introduction

Enterprises are deploying receptors such as network monitors (for intrusion detection) and RFID readers (for asset tracking) across wide geographies, driving the need for infrastructure to manage and process the data streams produced. As described in Appendix B, we have built HiFi [Franklin et al., 2005], a general purpose system to manage, process, and query data streams from widely dispersed receptors. Figure 5.1 shows distributed intrusion detection powered by HiFi, where edge nodes monitor network traffic.

Figure 5.1: Using HiFi for distributed intrusion detection

In large-scale monitoring applications there can be a large number of users executing similar concurrent queries. Using a naïve approach to execute such queries can lead to scalability and performance problems, as each additional query can add significant load to the system. Instead, it is vital to share computational and communication resources by exploiting the similarities in these queries. The techniques developed in this chapter let a HiFi system exploit such sharing in order to support a large number of concurrent queries.

5.1.1 Challenges in Monitoring Distributed High-Volume Streams

In the intrusion detection example, administrators can pose periodic windowed aggregate queries over network monitor data. Such queries have two parameters: a range, which is an interest interval over which the aggregate is computed, and a slide, which is a periodic interval (the slide is the period) controlling when results are reported. Thus, HiFi deployments with the following properties are considered:

1. Periodic: Reporting results at specific intervals avoids flooding the network with data. This technique is used to save power in sensor databases like TAG [Madden et al., 2002a], and to reduce computational costs in single-site streaming systems like TelegraphCQ.

2. Overlapping: Large ranges and small slides result in aggregates over overlapping windows. Tuples that are aggregated in multiple windows make efficient execution challenging. Li [Li et al., 2005a; Li et al., 2005b] and Arasu [Arasu and Widom, 2004] have proposed ways to address these challenges.

3. Sharing: Resource sharing is a common way to scale streaming systems to many queries. For example, [Arasu and Widom, 2004] describes how to share on-demand aggregate computation in a single node system.

4. Hierarchical: Bottom-up computation, across space and time, lets HiFi scale with streams from widely deployed receptors. Note that the characteristics of the different nodes in a HiFi system are normally heterogeneous, with lightweight receptors at the leaves of the hierarchy and bigger systems near the top. TAG [Madden et al., 2002a] uses a hierarchical approach with homogeneous sensor networks but does not support overlapping windows.

Challenge: The goal in this chapter is to efficiently share the processing of a large number of aggregate queries with periodic, overlapping windows over high-volume data streams from widely distributed receptors that are organized in a heterogeneous hierarchy.

While some of these four properties have been addressed in other systems, all of them together are necessary in emerging large-scale receptor-based systems like HiFi. Furthermore, the contributions in this chapter are the first attempt to support all of these properties together. For instance, while TAG [Madden et al., 2002a] supports periodic hierarchical aggregates, it does not support overlapping windows and sharing.


Similarly, STREAM [Arasu and Widom, 2004] shares aperiodic overlapping windows in a single-site system.

Figure 5.2: High-level overview of partial push-down

It turns out that these properties are mutually incompatible for plans that either push down or pull up aggregates. Aggregate pull-up moves raw data up the hierarchy and increases communication overheads. On the other hand, while aggressive aggregate push-down reduces the communication costs of individual queries, it lowers opportunities for sharing, leading to high overall computation and communication costs. This tension has important implications for distributed monitoring systems like HiFi. Bandwidth consumption in a wired network affects real operating costs with usage-based pricing; in a wireless sensor network it determines battery life. In contrast, computing capacity is plentiful in high-end servers and scarce in sensors and edge nodes. Thus, it is vital to distribute the load across the HiFi resource hierarchy. We resolve this tension using a partial push-down technique that has the following benefits:

1. Communication, and some computation, is shared across the hierarchy.

2. The remaining computation that cannot be shared is pulled up the hierarchy, where resources are abundant.


The partial push-down technique is implemented as a series of operations on a set of queries, as shown in Figure 5.2. First, a new paired window rewrite extracts the non-overlapping parts of each query. These non-overlapping parts are then composed to form a common sub-query. Finally, we pull up the overlapping parts of each query and push down the non-overlapping common sub-query.

5.1.2 Contributions

Efficient shared processing of aggregate queries in a hierarchy is hard because the standard push-down and pull-up approaches are both unsuitable. Thus, we need to invent techniques like new query rewrites and partial push-down in order to let a HiFi system exploit sharing and support many concurrent queries. The specific contributions of this chapter are the following:

1. Paired Windows. A novel way to extract and execute non-overlapping components of an aggregate over an overlapping window. The paired window rewriting is provably optimal and superior to previous work. (Section 5.3)

2. Shared Time Slices (STS). A new technique to share the processing of aggregates with varying windows that chops the input stream into time slices. We prove that the new paired window technique is an optimal method to form such slices. (Section 5.4)

3. Partial Push-down. An innovative way to share communication resources in processing aggregates over non-overlapping windows. (Section 5.5)

4. Performance study. A validation of these techniques with a performance study using real data. (Section 5.6)

5. Putting it all together. A principled way to overlay partial push-down across a heterogeneous hierarchy. (Section 5.7)


We first present necessary background in Section 5.2 below.

5.2 Windowed Aggregation

We begin this section with a discussion of windowed aggregation and a motivating example to show its use in the problem of network intrusion detection. We then review earlier work on unshared and shared processing of windowed aggregate queries.

In this chapter we restrict the focus to aggregate functions that are classified as distributive (e.g., max, min, sum, and count) and algebraic (e.g., avg), as proposed in [Gray et al., 1996]. These are typical aggregates in database systems, and can be evaluated with constant state independent of the size of their input data set. Further, they can be computed using partial aggregates over disjoint partitions of their input (the functions used for the partial aggregates can, in general, be different from those for the overall aggregate), a technique used with parallel databases (e.g., Bubba [Bancilhon et al., 1987]), sensor networks (e.g., TAG [Madden et al., 2002a]) and streams (e.g., STREAM [Arasu and Widom, 2004], PSoup [Chandrasekaran and Franklin, 2002]).

Example: Distributed Intrusion Detection

Consider the problem of distributed intrusion detection in an enterprise. Here, each node at the edge of an enterprise monitors all the incoming network traffic (i.e., incoming TCP SYN packets) using a tool such as Snort [Roesch, 1999]. The edge nodes stream this data to a higher-level server. Figure 5.1 shows a two-level hierarchy that can be extended with additional levels if there are too many edges. Let the union of data from the edges be a single virtual stream Snort, with attributes SrcIP and time. Detecting a Distributed Denial of Service (DDOS) attack requires the ability to compute the total number of incoming packets from each unique source IP address. In the example HiFi-based system, this counting can be done with an appropriate windowed aggregate query.

Since data streams are unbounded, an aggregate query over a data stream must have a window specification. In CQL [Arasu et al., 2006], a window is specified with a RANGE, and an optional SLIDE clause. For example, Query 5.1 computes the total number of packets from each unique source IP in “12 minute” windows (range), reporting these results at “5 minute” intervals (slide).

Query 5.1 Count packets from each unique source IP

SELECT SrcIP, count(*)
FROM Snort S [RANGE ‘12 min’ SLIDE ‘5 min’]
GROUP BY S.SrcIP

In general, the actual range and slide will depend on the requirements of the user. For instance, Figure 5.3 shows the user interface for MYSQLIDS [Lewis], a post-processing tool that uses static Snort data in a relational database with an interface that lets a user pick a range interval for analysis. The top half of the figure shows a pane listing a set of sensors (Snort sources) from which the user selects a subset. The bottom half of the figure is more interesting for us. It shows various options that the user can set to restrict the data that is queried. In particular, the left side of the bottom pane has a set of radio buttons indicating an interval of interest, where the user must pick one of a set of choices such as “Last 15 minutes”, “Last hour”, “Last 2 hours”, etc.

Now, if we have Snort data streams feeding into a HiFi system, this tool can easily be reused to run continuously on the streams in the fashion of a dashboard. In this situation, the tool would generate a continuous query based on the user’s requirements, submit the query to the system, and continuously update its visualization as it keeps fetching results from the system. Here, the intervals of interest that the user specifies with the radio buttons will correspond precisely to RANGE clauses in the queries the tool generates. With a dashboard interface, however, the user will generally also specify a reporting frequency (e.g., “30 seconds”) to indicate how often the visualization must be updated. This requirement will correspond precisely to SLIDE clauses in the query. In a large system it is likely that many concurrent users of such a tool could issue queries that compute over, and report at, different intervals.

Figure 5.3: UI of an intrusion detection tool
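To make the mapping from dashboard settings to CQL concrete, a hypothetical helper might generate the continuous query as sketched below in Python; the function name and its parameters are illustrative assumptions, not part of MYSQLIDS or TelegraphCQ.

def make_dashboard_query(range_spec: str, slide_spec: str) -> str:
    # RANGE comes from the interval-of-interest radio button; SLIDE comes
    # from the dashboard's chosen reporting frequency.
    return ("SELECT SrcIP, count(*) "
            f"FROM Snort S [RANGE '{range_spec}' SLIDE '{slide_spec}'] "
            "GROUP BY S.SrcIP")

# "Last 15 minutes", refreshed every 30 seconds:
print(make_dashboard_query("15 min", "30 sec"))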

There are two main types of windows, aperiodic and periodic, which have been used with streaming aggregate queries. While aperiodic windows have been studied extensively in single-site systems, periodic windows can reduce communication costs in a distributed system.


Aperiodic windows. A window is aperiodic if it has no SLIDE clause. For such a query the system must return an aggregate computed over the specified range whenever a client application demands it. Here, successive client requests can result in computing aggregates over overlapping window intervals of the input.

Periodic windows. A window is periodic if it includes a SLIDE clause. A periodic window can be classified based on its range r and slide s as follows:

1. Hopping: when r < s, the windows are disjoint.

2. Tumbling: when r = s, the windows are disjoint and cover the entire input.

3. Overlapping: when r > s, each window overlaps with some others.

We focus on queries that compute aggregates over periodic overlapping windows of streams, such as Query 5.1. (While windows can be time-based or count-based, we only study the former here.) With non-overlapping (hopping or tumbling) windows, such aggregates are easily computed and only require constant space, as a tuple can be discarded after being accumulated in the aggregate. In contrast, with overlapping windows a tuple is included in multiple windows and cannot be discarded in this way. While we focus on the harder problem of sharing the processing of such overlapping window aggregates, the techniques developed in this chapter apply equally well to non-overlapping windows.

Sharing windowed aggregates

Shared processing of aggregates with varying aperiodic windows was explored in [Arasu and Widom, 2004] by building on the work in [Chandrasekaran and Franklin, 2002]. While this work lets users specify different range intervals, it cannot exploit the cases where users are ready for results to be pushed to them periodically, and thus it incurs heavy space and time overheads. Further, a shared aggregate that produces results on demand is not suitable for a streaming view whose results are used in other queries, since its on-demand nature forces it to be used only as the final operator in a string of queries. In fact, a composition of multiple queries is likely to get more specialized downstream (especially with selections), where there are fewer sharing opportunities. There is no known earlier work on the problem of sharing streaming aggregates with varying periodic windows. This problem is considered in the context of single-site and hierarchical systems in the rest of this chapter.

5.3 Slicing a stream with Paired Windows

In this section, we show how to efficiently process a single aggregate query with a periodic overlapping window, such as Query 5.2 below. Here I demonstrate a new approach called paired windows that chops an input stream into non-overlapping slices of tuples that can be combined to form partial aggregates, which can in turn be aggregated to answer each query. The paired windows approach is a significant improvement over earlier work.

Query 5.2 Count packets from all source IP addresses

SELECT count(*)
FROM Snort S [RANGE ‘r’ SLIDE ‘s’]

The idea of chopping a stream into multiple slices was first introduced as the paned window approach in [Li et al., 2005a], where all the slices are of equal size and are called “panes”. We improve on this with paired windows, which chops a stream into pairs of possibly unequal slices. The paired windows technique is superior to paned windows as it can never lead to more slices. While both paned and paired windows can be used in the shared slices approach for processing multiple aggregates, we prove in Section 5.4 that paired windows always produce sharing that is better than, or at least as good as, that of paned windows. Paned and paired windows are both special cases of what we call non-overlapping sliced windows. We now define overlapping, and non-overlapping sliced, windows.

Definition 5.1 (Overlapping) An overlapping window W with range r and slide s (r > s) is denoted by W[r, s] and is defined at time t as the tuples in the interval [t − r, t] if t mod s = 0, and as the empty set φ otherwise.

Definition 5.2 (Sliced) A sliced window W that has m slices is denoted by W(s1, . . . , sm). We say that W has |W| slices, a period s = s1 + · · · + sm, and that each slice si has an edge ei = s1 + · · · + si. At time t, W is the tuples in the interval [t − si, t] if t mod s = ei for some 1 ≤ i ≤ m, and the empty set φ otherwise.

Intuition. An aggregate over an overlapping window W[r, s] can always be computed by a two-step process that aggregates partial aggregates over a sliced window V(s1, . . . , sk, . . . , sm) with period s, if and only if sk + · · · + sm = r mod s. These sliced windows can be paned or paired, defined as:

1. Paned: X(g, . . . , g), where g is the greatest common divisor of r and s.

2. Paired: Y(s1, s2), where s2 = r mod s and s1 = s − s2.

This intuition is based on the following Lemma 5.1, and is proved in Corollaries 5.1 and 5.2 below.
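Before turning to the formal treatment, the two rewrites can be made concrete with a minimal Python sketch (illustrative only, not the TelegraphCQ implementation). For the window W[18, 15] used as a running example below, it yields the paned window X(3, 3, 3, 3, 3) and the paired window Y(12, 3).

from math import gcd

def paned_slices(r: int, s: int) -> list[int]:
    # Paned: s/g equal slices of width g = gcd(r, s) per period s.
    g = gcd(r, s)
    return [g] * (s // g)

def paired_slices(r: int, s: int) -> list[int]:
    # Paired: at most two slices, s2 = r mod s and s1 = s - s2.
    s2 = r % s
    return [s] if s2 == 0 else [s - s2, s2]

assert paned_slices(18, 15) == [3, 3, 3, 3, 3]   # X(3, 3, 3, 3, 3)
assert paired_slices(18, 15) == [12, 3]          # Y(12, 3)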


Lemma 5.1 An aggregate over a window W[r, s] can be computed from partial aggregates of a window V(s1, . . . , sk, . . . , sn) with period s if and only if:

sk + · · · + sn = r mod s

Proof: Aggregates over W[r, s] are computed over intervals of |r| at times 0 mod s. We first note that:

r = r mod s + ⌊r/s⌋ · s    (5.1)

(If): Suppose sk + · · · + sn is r mod s. From (5.1):

r = (sk + · · · + sn) + ⌊r/s⌋ · (s1 + · · · + sn)    (5.2)

Thus an aggregate over an interval |r| can be computed at times 0 mod s with partial aggregates over these ⌊r/s⌋ · n + n − k + 1 contiguous slices of V:

{sk, . . . , sn}, followed by ⌊r/s⌋ repetitions of {s1, . . . , sn}

(Only if): Suppose W can be computed from partial aggregates over V. To aggregate all input tuples we must use a set of contiguous slices from V. Since V has period s, these slices must include an integral number b of complete periods and a residual set of slices (sa, . . . , sn, where a ≤ n) from the previous period:

{sa, . . . , sn}, followed by b repetitions of {s1, . . . , sn}


The sum of these slice widths must be:

r = (sa + · · · + sn) + b · (s1 + · · · + sn)

Since sa + · · · + sn ≤ s and b = ⌊r/s⌋, it must be true that sa + · · · + sn = r − ⌊r/s⌋ · s. By assigning a to k, there exists some k, where 1 ≤ k ≤ n, such that sk + · · · + sn = r mod s. □

Corollary 5.1 (Paned Windows) An aggregate over W[r, s] is computable from partial aggregates over V(g, . . . , g), where g is gcd(r, s), s = gn, r = gp, and n < p.

Proof: From number theory we know that g = gcd(r mod s, s). Thus, r mod s = mg where m < n. So there exists k = n − m + 1, with 0 < k < n, such that sk + · · · + sn = r mod s (from Lemma 5.1). □


Corollary 5.2 (Paired windows) An aggregate over W[r, s] can be computed from partial aggregates over V(s1, s2) where s1 + s2 = s and s2 = r mod s.

Proof: Set k = n = 2 in Lemma 5.1. □


While paned windows break a window of period s into s/g slices of equal size g, paired windows split a window into a “pair” of exactly two unequal slices. Let A be the aggregate function we are computing over W[r, s]. The partial aggregation step uses a function G, and the final aggregation step uses a function H. (Since A is distributive or algebraic, G and H always exist [Gray et al., 1996]. For example, if A is count, then G is count and H is sum.) In the partial aggregation step, we apply the function G over all the tuples in each slice of V. In the final aggregation step, an overlapping window aggregate operator buffers these partial aggregates (which we call “sliced aggregates”) and successively applies the function H on these sets of sliced aggregates to compute the query results. For example, Figure 5.4 shows how an aggregate over a window W[18, 15] can use X(3, 3, 3, 3, 3), a paned window of 5 slices, or a paired window Y(12, 3).

Figure 5.4: Paned vs Paired Windows

We now analyze the relative costs of executing a single sliced window aggregate using the paired and paned window approaches. The analysis focuses on the costs of partial and final aggregation. Let T denote the set of tuples in each period s of the window W[r, s]. We measure costs in terms of the number of aggregate operations needed to process the tuples in T, and summarize it for paned windows and worst-case paired windows in Table 5.1.

Technique             Partial   Final
Paned                 |T|       (1/g)r
Paired (worst-case)   |T|       (2/s)r

Table 5.1: Complexity of Paned and Paired Windows

In both cases the partial aggregation step requires |T| operations. The cost of final aggregation depends on the number of partial aggregates, and thus the number of slices, in a window period (its slide). If there are m such slices in a period s, the number of final aggregations in each period, i.e., the number of partial aggregates that are buffered, is m(r/s). In a period s, paned windows always have s/g slices, while paired windows have either 2 slices (the worst case, when r mod s ≠ 0) or 1 slice. This results in a final aggregation cost of (s/g)(r/s) for paned windows, and 2r/s for paired windows in the worst case (and r/s in the best case). Since the paired window option never has more slices than paned windows, it is always faster than, or at least as fast as, paned windows. This is proved formally in Lemma 5.2 below.

Lemma 5.2 (Paired vs Paned) If an aggregate over W[r, s] is computable from partial aggregates over a paned window WP(p, . . . , p) with |WP| slices, as well as over a paired window WS(s1, s2) with |WS| slices, then |WS| ≤ |WP|.

Proof: If g is gcd(r, s), then |WP| = s/g. If g ≠ s, then |WS| is 2, s/g ≥ 2, and |WS| ≤ |WP|. Otherwise, if g = s, then s2 = r mod s = 0 and |WS| = 1 = |WP|. □


For the rest of this chapter we focus on paired sliced windows. We will also compare the performance of paired and paned windows in Section 5.6.
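As a concrete illustration of the two-step scheme, here is a minimal Python sketch, not the TelegraphCQ operator code, under two stated assumptions: the aggregate is count (so G is count and H is sum), and window intervals are treated as half-open, (t − r, t].

def paired_count(timestamps, r, s, horizon):
    """Evaluate count(*) over W[r, s] via the paired window Y(s1, s2).

    Partial step (G = count): one count per non-overlapping slice; slice
    edges repeat each period s at offsets {s - r % s, s}. Final step
    (H = sum): at every report time t (a multiple of s), sum the partial
    counts of the slices ending inside (t - r, t]; the paired property
    guarantees that t - r always falls on a slice edge.
    """
    s2 = r % s
    offsets = [s] if s2 == 0 else [s - s2, s]
    edges = [p + off for p in range(0, horizon, s) for off in offsets]
    partial, prev = {}, 0
    for e in edges:                        # partial aggregation per slice
        partial[e] = sum(1 for ts in timestamps if prev < ts <= e)
        prev = e
    return [(t, sum(c for e, c in partial.items() if t - r < e <= t))
            for t in range(s, horizon + 1, s)]

# Sanity check against direct evaluation for W[18, 15] on toy timestamps:
data = [1, 4, 7, 9, 14, 16, 20, 25, 29]
for t, c in paired_count(data, r=18, s=15, horizon=45):
    assert c == sum(1 for ts in data if t - 18 < ts <= t)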

5.4 Sharing Aggregates with Varying Time Windows

The last section described “paired windows”, an efficient scheme to evaluate a single sliding window aggregate query. Now that we know how to evaluate one such query efficiently, in this section we address the next phase of the problem, i.e., sharing computational resources in processing aggregates with varying windows. Towards this end, we develop a new technique that we call Shared Time Slices (STS), which can offer over an order of magnitude improvement over state of the art unshared techniques. (STS will be revisited in Section 6.3, when addressing the sharing of aggregates with varying windows and predicates.) The queries that we consider here have identical predicates, and differ only in their periodic time windows, like Query 5.2 (Section 5.3). Later in this dissertation (Chapter 6) we relax this constraint of identical predicates. We first show how to combine the paired windows of multiple queries to produce partial aggregates over slices of the input stream that can still answer each query. We then show why it is not feasible to combine paired windows statically, and instead present a Slice Manager that accomplishes this on-the-fly.

5.4.1 Combining multiple sliced windows

Here we show how to combine the paired, or paned, sliced windows of a set of queries in order to efficiently answer each individual query. We will prove that the paired window approach is the optimal way of using non-overlapping sliced aggregates to process multiple aggregate queries with differing periodic windows. We start with Q, a set of n queries that compute the same aggregate function over a common input stream, where each query has different range and slide parameters. More precisely, each query Qi in Q has range ri and slide si. These queries can be like Query 5.2, with different values for r and s, even where r and s are relatively prime. For simplicity we further assume that for all i, ri mod si ≠ 0.

Figure 5.5: Possible plans for multiple queries

The queries in Q can be processed in either an unshared or shared fashion. We consider each alternative in turn:

1. Unshared sliced aggregation: We process each query separately using paired windows, as shown in the query plan in Figure 5.5(a). The stream scan input is replicated to each of the n operator chains, each with a single sliced window aggregate that produces partial aggregates, which are fed to an overlapping window aggregate operator.

2. Shared sliced aggregation: We compose the paired windows of each query in Q into a single common sliced window (details below). Figure 5.5(b) shows the input stream processed by a shared sliced window aggregate producing a stream of partial aggregates, which is replicated to the n overlapping window aggregates.

5.4.1.1 Explicit common sliced window composition

We now show how to compose multiple sliced windows of the queries in Q to form a common sliced window. Partial aggregates computed with the common sliced window can then be used to answer each individual aggregate query. The main requirement is that the partial aggregates over the common sliced window must be computed at every unique slice edge of each individual sliced window. Sliced windows can be composed only if they have the same period. Thus the period of a composite sliced window is the least common multiple (lcm) of the periods (or slides) of the individual windows. With unequal periods, windows are stretched to the common period (lcm) by repeating their slice vectors. For example, Figure 5.6 shows how to compose two sliced windows U(12, 3) and V(6, 3). Here U and V have differing periods (15 and 9), and we stretch them respectively by factors of 3 and 5 to produce U³(12, 3, 12, 3, 12, 3) and V⁵(6, 3, 6, 3, 6, 3, 6, 3, 6, 3). We then compose U³ and V⁵ to produce a new composite sliced window W(6, 3, 3, 3, 3, 6, 3, 3, 3, 3, 6, 3). Note that the ovals in the figure show shared edges in U³ and V⁵.
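The stretching and edge-union steps admit a direct implementation. The following Python sketch (illustrative only; requires Python 3.9+ for math.lcm) reproduces the example above. Section 5.4.2 replaces this explicit construction precisely because the composite slice vector can become very large.

from math import lcm

def compose(*windows):
    """Compose sliced windows (given as slice-width vectors) into a common
    sliced window over the lcm of their periods: stretch each window by
    repeating its slice vector, union the edge sets, and read the composite
    slice widths back off the sorted edges.
    """
    periods = [sum(w) for w in windows]
    big = lcm(*periods)
    edges = set()
    for w, p in zip(windows, periods):
        acc = 0
        for width in w * (big // p):     # stretched slice vector
            acc += width
            edges.add(acc)
    ordered = sorted(edges)
    return [b - a for a, b in zip([0] + ordered, ordered)]

# U(12, 3) and V(6, 3) compose, over lcm(15, 9) = 45, into
# W(6, 3, 3, 3, 3, 6, 3, 3, 3, 3, 6, 3):
assert compose([12, 3], [6, 3]) == [6, 3, 3, 3, 3, 6, 3, 3, 3, 3, 6, 3]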


Figure 5.6: Composing sliced windows

We can now analyze the relative complexities of the unshared and shared approaches in processing the queries in Q. Let S represent the composite period (lcm) of {s1, . . . , sn}, the slides of the queries in Q, and let T be the set of input tuples processed in the composite period S. Let E represent the number of slices, and thus the number of partial aggregates, formed in the common sliced window with period S. The partial aggregation step costs n|T| for unshared sliced aggregation, and |T| for shared sliced aggregation. In the unshared sliced approach, the cost of the final aggregation step for the query Qi at each period si is (2/si)ri, from Table 5.1. Over a composite period S, there are S/si such steps, resulting in a total per-query cost of (S/si)(2ri/si). In the shared sliced case, the cost of the final aggregation step for the query Qi at each period si is (E/S)si(ri/si), leading to a total per-query cost of (S/si)(E/S)ri = E(ri/si) over the composite period S. These costs are summarized in Table 5.2 below.

Technique         Partial   Final
Unshared sliced   n|T|      Σi (S/si)(2ri/si)
Shared sliced     |T|       Σi E(ri/si)

Table 5.2: Unshared versus Shared Aggregation


5.4.1.2 To share, or not to share

While the total final aggregation cost without sharing is always less than that with sharing, the total partial aggregation cost in the unshared case is always more than in the shared case. Let λ represent the input data rate and γ the rate at which partial aggregates are produced by the shared sliced window. Over a composite period S, λ is |T|/S and γ is E/S. We can solve for λ and say that the shared approach is always better as long as the input rate λ is high enough, as required by the following inequality:

λ > (γ Σi ri/si − 2 Σi ri/si²) / (n − 1)    (5.3)

The critical factor is the “extent” of sharing in the common sliced window, which is reflected by the common partial aggregate data rate γ. We can formally define this as the number of partial aggregate tuples (the number of unique slice edges) in each period of the common sub-query. The value of γ depends on the query workload. It is lowest when one sliced window subsumes all others, and highest when only the final edge of all windows is shared:

max1≤i≤n (2/si) ≤ γ ≤ Σ1≤i≤n (2/si) − n + 1    (5.4)
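Inequality (5.3) is straightforward to evaluate for a given workload. The following small Python sketch (illustrative only; the γ value must come from the workload, per (5.4)) computes the break-even input rate above which the shared plan wins:

def breakeven_rate(windows, gamma):
    """Right-hand side of (5.3): the input rate above which shared sliced
    aggregation beats the unshared plan. `windows` is a list of (r_i, s_i)
    pairs and `gamma` is the partial-aggregate rate of the common sliced
    window (unique slice edges per unit time).
    """
    n = len(windows)
    return (gamma * sum(r / s for r, s in windows)
            - 2 * sum(r / s**2 for r, s in windows)) / (n - 1)

# Two queries (ranges and slides in seconds) with an assumed gamma:
print(breakeven_rate([(720, 300), (900, 600)], gamma=0.01))  # ~0.018 tuples/s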

In theory, with a very low input rate it may be better not to share. In practice, however, low input rates are unlikely, and when they do occur, the shared and unshared approaches both cost so little that the choice does not matter. As an example, consider the motivating scenario that uses network monitoring data, with two cases for the input rate λ: a realistic rate of 375 tuples/second, and a contrived rate of 0.0005 tuples/second (≈ 1.8 tuples/hour). We analyze both cases in Figure 5.7 below, with query workloads where the slide varies between 5 and 10 minutes and the range between 10 and 15 minutes. Clearly, for the realistic data rate, the shared approach heavily outperforms the unshared approach. For the artificially low data rate, the unshared approach is better for 83 or more queries.

Figure 5.7: Analysis: Shared vs Unshared (two panels plotting total aggregations, partial + final, against the number of queries, for input rates of 1.8 tuples/hour and 375 tuples/sec)

5.4.1.3 On the optimality of shared paired windows

While we already know that paired windows outperform paned windows for individual queries, we now show that shared sliced aggregation is faster using paired windows than using paned windows. In Theorem 5.1 we show that composing paired windows leads to a lower number of edges E in a composite period S than composing paned, or any other sliced, windows. We know from Table 5.2 that a smaller value for E lowers the cost of shared sliced aggregation, and so using paired windows is always optimal.

Theorem 5.1 Let 𝒲 be {W1(r1, s), . . . , Wn(rn, s)}, a set of n windows. Let W be the common sliced window formed by composing the paired windows of each Wi in 𝒲. There exists no common sliced window W′, formed by composing any other sliced windows of each Wi in 𝒲, where |W′| < |W|.

Proof: Without loss of generality, let each Wi have identical slide s (or else stretch as in Section 5.4.1). So every sliced window of each Wi has edges at 0 and s. The paired window for each Wi has only one other edge, at s − ri mod s. From Lemma 5.1, every sliced window of each Wi must also have an edge at s − ri mod s. Thus, the edges of the paired windows for each Wi must exist in all possible sliced windows of Wi. Since the edges of a composite sliced window are the union of all edges of its constituents, any composition W′ of arbitrary sliced windows must include every edge of W, the composition of paired-window rewritings, and so |W| ≤ |W′|. □


For the rest of this chapter, unless mentioned otherwise, we always refer to paired windows when we talk about sliced aggregation.

5.4.2 On-the-fly sliced window composition

In Section 5.4.1 we saw how to explicitly compose sliced windows to form a common sliced window that can be used to efficiently process all queries in Q. While composing sliced windows is conceptually simple, the resulting common sliced window can have a very long period (the lcm of the individual periods) with a large number of slices. Even tens of queries, each with periods under 100 seconds, can produce a composite sliced window with a period of 10⁶ seconds and hundreds of thousands of slices. Such a window with a large slice vector consumes a lot of space and is expensive to compute. Here we present an elegant alternative that produces a stream of partial aggregates for the common sliced window “on-the-fly”, without explicit composition.

Figure 5.8: Slice Manager: Partial Aggregates

In this on-the-fly approach, we have a Slice Manager that keeps track of time and determines when to end the next slice, i.e., the time of the next slice edge. Figure 5.8 shows the Slice Manager demarcating the end of each slice with a heartbeat tuple. This heartbeat is a signal to the downstream sliced window aggregate operator that it is time to emit partial sliced aggregates. This approach is very similar to the well-known sorting strategy for grouped aggregation [Graefe, 1993].

The pseudocode for the core routines of the Slice Manager is shown in Algorithm 5.1. For simplicity we assume that each individual window Wi(a, b) is a paired window, although the technique is easily extended to arbitrary sliced windows. The algorithm is initialized with a set of paired windows W1, . . . , Wn by calling addEdges to add the edges of the first slice of each paired window to a priority queue H with operations enqueue, dequeue, and peek. Each edge identifies its time, the window it belongs to, and a boolean that records if it is the last slice in the window. The queue is ordered by increasing edge times. The Slice Manager repeatedly calls advanceWindowGetNextEdge, which returns the time of the next slice edge. At each call, this function discards all edges that have the same time and that are at the top of the queue. If any edge belongs to the last slice of a window, it calls addEdges to add another set of edges for it.

Algorithm 5.1 Slice Manager

proc addEdges(H, ts, W(a, b))
    enqueue(H, edge(ts + a, W, false));
    enqueue(H, edge(ts + a + b, W, true));
end

proc initializeWindowState({W1, . . . , Wn})
    initializePriorityQueue(H);
    for i := 1 to n
        addEdges(H, 0, Wi);
    end
end

proc advanceWindowGetNextEdge(H)
    comment: Discard all edges at the current time tc.
    comment: Add new edges for subsequent periods.
    var Time tc ← peek(H).time;
    var Edge ec;
    while (tc == peek(H).time)
        ec ← dequeue(H);
        if (ec.last = true) then
            addEdges(H, ec.time, ec.window);
        fi
    end
    return tc;
end

The Slice Manager passively outputs each input tuple, except when it receives a tuple with a timestamp greater than that of the next slice edge. At this point, it first inserts a heartbeat tuple in the output stream. When a query leaves the system, all edges corresponding to it are removed from the priority queue of the Slice Manager. Similarly, when a new query joins the system, appropriate edges are created for it by calling addEdges with the current time and the new paired window as an argument.
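For concreteness, here is a compact Python rendering of Algorithm 5.1; it is an illustrative sketch using a binary heap, not the TelegraphCQ implementation, and heartbeats are emitted as marker pairs in the output stream.

import heapq

class SliceManager:
    """On-the-fly slicing per Algorithm 5.1: a priority queue of upcoming
    slice edges, two per paired window, re-armed whenever a window's last
    edge is consumed."""

    def __init__(self, windows):
        self.heap = []               # entries: (edge_time, is_last, (s1, s2))
        for w in windows:
            self._add_edges(0, w)

    def _add_edges(self, ts, w):
        s1, s2 = w
        heapq.heappush(self.heap, (ts + s1, False, w))
        heapq.heappush(self.heap, (ts + s1 + s2, True, w))

    def _advance(self):
        # Discard all edges at the current minimum time; re-arm finished
        # windows with the edges of their next period.
        tc = self.heap[0][0]
        while self.heap and self.heap[0][0] == tc:
            _, last, w = heapq.heappop(self.heap)
            if last:
                self._add_edges(tc, w)
        return tc

    def process(self, stream):
        # Pass tuples through; emit a heartbeat whenever a tuple's
        # timestamp crosses the next slice edge.
        next_edge = self._advance()
        for ts in stream:
            while ts > next_edge:
                yield ("heartbeat", next_edge)
                next_edge = self._advance()
            yield ("tuple", ts)

# Two paired windows, e.g. (12, 3) and (6, 3):
for item in SliceManager([(12, 3), (6, 3)]).process([1, 4, 7, 13, 16]):
    print(item)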


To summarize this section, we showed how to efficiently process a set of streaming aggregate queries with identical selection predicates but varying periodic windows. A key feature of this approach is that it does not require any static analysis of the query set, and can easily accommodate the addition and removal of queries.

5.5 Shared Communication

Section 5.4 examined shared and unshared query plans that can be used to process multiple aggregate queries in a streaming system. The focus in that section was on sharing computation. This section explores ways to share communication resources while executing such queries across a hierarchy of stream processors.

Consider executing Q, the set of n periodic aggregate queries from Section 5.4, in a two-level hierarchy (we consider a full hierarchy in Section 5.7). Both the shared and unshared plans from Section 5.4 have three levels of operators, with the overlapping aggregate on top, the non-overlapping sliced aggregate in the middle, and the sensor scan at the bottom. The network interface can be between any of these three levels, resulting in three choices for aggregate location: no push-down, where all aggregation is at the central server and none at the edges; partial push-down, where the overlapping aggregates are at the central server and the sliced aggregates are at the edges; and full push-down, where no aggregates are at the central server and all are at the edges. This leads to the six possible plans shown in Figure 5.9, where a box surrounding a query plan indicates a single node in which the included plan is processed, and a diamond represents a network scan operator that reads tuples from the network interface. The communication cost of each of these six plans is shown in Table 5.3.

Of these six plans, the two with no push-down, (1) and (4), fetch raw data from lower to higher level nodes and are competitive only with low data rates, when any approach works well. Since we are interested in high data rates we do not consider these further.

Figure 5.9: Query plans for shared communication

Num   Aggregation   Push-down   Bandwidth
1     Unshared      None        nλ
2     Unshared      Partial     Σi 2/si
3     Unshared      Full        Σi 1/si
4     Shared        None        λ
5     Shared        Partial     γ
6     Shared        Full        Σi 1/si

Table 5.3: Communication costs for different plans

Since (2) involves the same computations that (3) does, and consumes twice as much bandwidth as (3) does, we ignore (2) henceforth. With respect to communication cost, (3) and (6) are identical, and so we only consider the latter (6), as it also shares computation. Thus, the lowest communication costs are with either shared partial push-down (5) or shared full push-down (6). With partial push-down we share communication and computation resources, while with full push-down we only share computation. Shared communication pays off when the common sub-query data rate is less than the total bandwidth of the fully pushed-down aggregates:

γ < Σi (1/si)    (5.5)


We know from (5.4) that max1≤i≤n (2/si) ≤ γ ≤ Σi (2/si). That is, the common sub-query is formed by composing paired windows, each of which consumes double the bandwidth of its full push-down equivalent. If the extent of sharing is high enough, then γ can be less than the upper bound in (5.5) above. We examine the “extent” of sharing experimentally in the performance study in the next section.

5.6 Performance study

In this section we report the results of a detailed performance study that investigates the benefits of two techniques: shared communication of partial aggregates in a two-level hierarchy and shared computation of aggregates in a single node. For these experiments sliced aggregation was implemented in the TelegraphCQ system [Krishnamurthy et al., 2003] and deployed on a cluster of quad 500 MHz Pentium-III nodes.

5.6.1 Experimental setup

The experiments use the logs of a Snort [Roesch, 1999] sensor installed in the nodes of PlanetLab [Peterson et al., 2002] to track incoming TCP SYN packets. (The inherent load on PlanetLab makes Snort drop some data.) These logs were collected from a single node in a 24-hour period beginning at 4:00 am on May 1, 2005. There were 523761 tuples in this trace. The experiments process workloads of synthetic query sets in the TelegraphCQ implementation.

Each workload is a query set characterized by the parameters in Table 5.4. The workloads consist of two kinds of queries: the grouped aggregate of Query 5.1 and the ungrouped aggregate of Query 5.2. All n queries in a given workload have identical slides (s) and varying ranges (r) generated uniformly from their respective intervals.

Param.   Description    Values
q        Query          {Grouped, Ungrouped} Aggregate
n        Num. queries   {2, 4, 8, 16, 32}
r        Window range   (1500, 2500) seconds
s        Window slide   (1000, 2000) seconds

Table 5.4: Query workload properties

In addition, all queries are overlapping, i.e., no query where r is smaller than s is generated. For each query set size, we generate 50 individual query sets. For each workload, we run the queries in each of the following ways:

1. Unshared aggregate (with paired windows)

2. Shared aggregate (with a common paned window sub-query)

3. Shared aggregate (with a common paired window sub-query)

For each aggregate operator we measure the number of tuples it produces, as well as the total wall-clock time it consumes. From this we can compute:

1. Communication cost for full and partial push-down, in terms of the average number of tuples that have to be streamed from the lower-level to the upper-level node in a hierarchy.

2. Computation cost for shared and unshared plans in a single node, in terms of the average total time consumed in aggregation.
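As a concrete reading of Table 5.4, here is a hypothetical Python sketch of such a workload generator; the dissertation does not show the actual generator, and the shared-slide interpretation follows the “identical slides” statement above.

import random

def make_workload(n, seed=0):
    # One slide per query set, shared by all n queries; per-query ranges
    # are redrawn until each query is overlapping (r > s).
    rng = random.Random(seed)
    s = rng.uniform(1000, 2000)            # window slide, seconds
    queries = []
    while len(queries) < n:
        r = rng.uniform(1500, 2500)        # window range, seconds
        if r > s:
            queries.append((r, s))
    return queries

# 50 query sets per query-set size, as in the study:
workloads = [make_workload(8, seed=i) for i in range(50)]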

5.6.2 Communication costs

We now present the results of a study of the communication costs of the shared partial push-down plan (5) and the shared full push-down plan (6) from Section 5.5. Communication is shared only in the former. The results are plotted in Figures 5.10 and 5.11. Here we also consider the full aggregate pull-up strategy, where raw data gets streamed to the top-level node.

In the ungrouped aggregate case (shown in Figure 5.10), full push-down is slightly more efficient than paired partial push-down. We do not plot the costs of pull-up and paned partial push-down as they are very much more expensive (523761 and 86278 tuples respectively). For 2 queries, full push-down needs to stream only 118 tuples as opposed to 177 tuples for partial push-down. With more queries, however, the difference between the two plans drops. For instance, with 32 queries, full push-down streams 1898 tuples as opposed to 1920 tuples for partial push-down.

In the grouped aggregate case (shown in Figure 5.11), full push-down is cheaper than full pull-up up to 8 queries (512535 vs 523761 tuples) and cheaper than paned partial push-down up to 4 queries (257675 vs 455401 tuples). Paired partial push-down is, however, far more efficient than all the other schemes, needing between 86472 tuples for 2 queries and 183910 tuples for 32 queries.

Figure 5.10: Ungrouped Aggregation Communication: Partial vs Full push-down


Figure 5.11: Grouped Aggregation Communication: Partial vs Full push-down

We now summarize the results of the study of communication costs for different query plans over a two-level hierarchy:

1. With ungrouped aggregation, partial push-down is competitive with (while slightly more expensive than) full push-down. As the number of queries increases, the costs of full push-down approach those of partial push-down. In either case, however, the actual costs are very low (under 2000 tuples over a 24-hour period), so either approach would work well enough.

2. With grouped aggregation, partial push-down performs significantly better than the other approaches. In particular, full push-down scales very poorly with more queries.

5.6.3 Computation costs

Here we present the results of a study of the unshared aggregate and shared aggregate plans (from Figure 5.5), each using paired window sub-queries. We also considered shared aggregation using a paned window common sub-query. In addition, we measured the costs of the paired window common sub-query component of the shared aggregation plan. The results of the study are plotted in Figures 5.12 and 5.13.

In the ungrouped aggregate case (shown in Figure 5.12), the average computation time of all three approaches increases with more queries, as expected. The shared aggregate with paired window sub-queries (“shared paired”), however, significantly outperforms the other two plans. For instance, at 2 queries the average time is 11 seconds for “shared paired” as opposed to 13 seconds for “shared paned” and 17 seconds for “unshared”. At 32 queries, these differences are magnified and the average time is 39 seconds for “shared paired”, 100 seconds for “shared paned” and 241 seconds for “unshared”. Note that the cost of the paired common sub-query remains stable, from 8.9 seconds for 2 queries through 9.1 seconds for 32 queries.

Figure 5.12: Ungrouped Aggregation Computation: Shared vs Unshared

In the grouped aggregate case (shown in Figure 5.13), the average computation time of all approaches increases with more queries, as with ungrouped aggregation (albeit at a higher rate). Here “unshared” consistently outperforms “shared paned” for all query sizes, from 53 vs 94 seconds for 2 queries through 1013 vs 1120 seconds for 32 queries. The “shared paired” approach is significantly cheaper than the other two for all query sizes, costing from 33 seconds for 2 queries through 505 seconds for 32 queries. The cost of the paired common sub-query is stable, from 18 seconds for 2 queries to 23 seconds for 32 queries.

We now summarize the results of the study of computation costs for shared aggregates:

1. With grouped and ungrouped aggregation, the “shared paired” approach significantly outperforms the “unshared” and “shared paned” approaches.

2. In the workloads studied, the paned common sub-query almost always reduces to a [SLICES ‘1 sec’] sub-query, showing how the paned window rewriting can make sharing very difficult.

3. In both cases, the cost of the paired sub-query component of the shared paired aggregation remains stable with increasing numbers of queries.

5.7 Putting it all together

Now we apply the insights developed earlier in this chapter to a typical HiFi system that consists of distributed receptors on the “edge” of a network, intermediate nodes, as well as a central enterprise server. The edges handle heavy loads, and sharing is vital to reduce their overheads. The most important lessons learned from the experiments reported in Section 5.6 are:

1. Shared communication has real benefits with multiple aggregates using a partial push-down strategy. These benefits grow with more queries.


2. Shared computation has benefits with multiple aggregates at high enough data rates. With such high rates (as shown in the performance study), the benefits of sharing outweigh the costs of sharing, which in turn increase with more queries.

Figure 5.13: Grouped Aggregation Computation: Shared vs Unshared

The second lesson in particular is striking, as the overheads of sharing are normally fixed, not variable. Thus sharing can be a concern with low data rates in tiny devices where computation is expensive. The graphs in Figures 5.12 and 5.13, however, reveal that the common paired window sub-query has an almost fixed cost even with increasing queries. So it is only super-query computation (which is really unshared) that can worsen because of sharing, leading to the following strategy:

1. Partial non-overlapping aggregates across time are pushed down right into the leaves (receptors).

2. Intermediate nodes, from the edges to the root, aggregate shared partial aggregates across space.


3. The overall super-queries are executed only in the central server.

An example of this strategy is shown in Figure 5.14 for a query monitoring overall consumption with a range and slide of 18 and 15 seconds. The leaf nodes process a partial push-down query with a sliced window (SLICES ‘3 sec’,‘2 sec’). The partial aggregates produced by the leaf nodes are sent to intermediate nodes that compute aggregates over space with NOW windows (from CQL [Arasu et al., 2006]). A NOW window collects multiple contiguous tuples of a stream that have identical timestamps, as is the case with the partial aggregate tuples produced by the leaf nodes. The tuples produced by the intermediate nodes are in turn sent to the root node, which processes the periodic window aggregate. Note that as new queries are added to the system, the only change is to send the corresponding sliced windows of each new query to the leaf nodes, where they can be added on-the-fly.

Figure 5.14: Partial push-down in the hierarchy

This approach has the following very desirable properties: (1) communication is shared wherever it matters, from tiny sensors to wired networks; (2) intermediate nodes execute a single shared aggregate, which would not be possible with full push-down; (3) computation is unshared only at the highest level, where capacity is higher; and (4) overall computation across time is only carried out at the level which issued the original query.
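The division of labor across the three tiers can be sketched in a few lines of Python. This is illustrative only: real HiFi nodes communicate over a network, the leaf partials come from the pushed-down sliced sub-query of Section 5.4, and it assumes half-open window intervals and a sum-style aggregate.

from collections import defaultdict

def now_window_merge(leaf_streams):
    # Intermediate node: aggregate across space with a NOW window, i.e.,
    # sum partial aggregates carrying identical slice-edge timestamps.
    merged = defaultdict(int)
    for stream in leaf_streams:
        for ts, partial in stream:
            merged[ts] += partial
    return sorted(merged.items())

def root_overlapping_agg(partials, r, s, horizon):
    # Root node: final aggregation over the periodic window W[r, s].
    return [(t, sum(p for ts, p in partials if t - r < ts <= t))
            for t in range(s, horizon + 1, s)]

# Two leaves running a pushed-down sliced sub-query for W[18, 15]; each
# pair is (slice-edge time, partial sum):
leaf1 = [(12, 4), (15, 1), (27, 3), (30, 1)]
leaf2 = [(12, 2), (15, 0), (27, 5), (30, 2)]
print(root_overlapping_agg(now_window_merge([leaf1, leaf2]), 18, 15, 30))
# -> [(15, 7), (30, 12)]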


5.8 Summary

In this chapter we focused on the problem of shared processing of aggregate queries with varying window specifications. We chose aggregate queries because they are extremely important in dealing with high volumes of data. Earlier (in Chapter 4), we considered shared processing of join queries, the other important query operation in the context of stream data management. In addition to studying this problem in the context of a single-site system, we also considered the situation where the data sources are widely dispersed monitoring networks that can be managed through hierarchical aggregation, a technique to successively collect and aggregate data from distributed receptors through a hierarchy of data stream processors. This is in the context of the HiFi system (described in Appendix B), which is aimed at monitoring applications over distributed data streams. Thus it is vital to share communication as well as computation resources while executing concurrent windowed aggregate queries.

More specifically, in this chapter we showed how to share the processing of many similar aggregate queries on periodic, overlapping windows across a heterogeneous hierarchy. It turns out that the obvious choices, aggregate push-down and pull-up, are both unsuitable: push-down reduces sharing, while pull-up increases communication costs. Instead, we developed a new approach of partial push-down of aggregates across a hierarchy. First, we developed the novel paired window rewrite (superior to prior work) to extract non-overlapping components of aggregate queries. Next, we showed how to compose these non-overlapping components to share computation. Finally, we push the composed non-overlapping component down the hierarchy and pull up the overlapping aggregates to the root.

We also conducted a detailed performance study with real-world data in order to validate these new techniques. These performance experiments showed that a system that uses STS can support up to 8 times more queries than one that uses existing techniques, with comparable latency in execution time. In addition, a hierarchical system that uses PPD was shown in experiments to require only half the bandwidth used by an unshared approach. Thus, the techniques developed in this chapter are crucial for widely distributed monitoring networks to scale and support large numbers of concurrent queries, as demanded by emerging applications.

Now that we have considered the effects of varying predicates for join queries (in the previous chapter), and varying windows for aggregate queries (in this chapter), we are ready to consider the problem of varying predicates and windows. We explore this in Chapter 6, next.


Chapter 6

Sharing for Aggregation with Varying Windows and Predicates

In the previous two chapters I showed how to share the processing of join queries with varying predicates (Chapter 4), and aggregate queries with varying window specifications (Chapter 5), in the context of data stream management systems. In order to combine the techniques of those two chapters, we need a way to share the processing of aggregate queries that can have varying window specifications and selection predicates. In this chapter, I develop techniques to solve this problem. (The work reported in this chapter was published in [Krishnamurthy et al., 2006].)

In particular, this chapter considers shared processing in single-site systems for monitoring applications such as financial analysis and network intrusion detection. (Sharing aggregate queries with varying predicates in a distributed system involves a different set of challenges, and has been studied in [Huebsch et al., 2006].) These systems often have to process many similar but different queries over common data. Since executing each query separately can lead to significant scalability and performance problems, it is vital to share resources by exploiting similarities in the queries.


In this chapter we present ways to efficiently share streaming aggregate queries with differing periodic windows and arbitrary selection predicates. A major contribution is that this sharing technique does not require any up-front multiple query optimization. This is a significant departure from existing techniques that rely on complex static analyses of fixed query workloads. This approach is particularly vital in streaming systems, where queries can join and leave the system at any point. I also present a detailed performance study that evaluates the strategies developed in this chapter with an implementation and a real data set. In these experiments, the techniques developed in this chapter give as much as an order of magnitude performance improvement over the state of the art.

6.1 Introduction

We begin this section by presenting an example that motivates the problem this chapter focuses on, i.e., shared processing of streaming aggregate queries with varying windows and predicates. Consider a system that monitors stock market trades and has multiple users interested in the total transaction value of trades in a sliding window. While some of these users might care only about stocks of a particular sector, or only about high-volume trades, others might compute complex user-defined predicates (as suggested in CASPER [Denny and Franklin, 2005]) on fluctuating quantities like stock price. Similarly, the aggregation window that different users are interested in can vary widely. Money managers in financial institutions who run algorithmic trading systems might want aggregates over 5-10 minute windows reported every 60-90 seconds, depending on the specific financial models they use. In contrast, day traders with individual investing strategies might only need these results every 5-10 minutes. Clearly such a system will have to support hundreds of queries, each with different predicates and sliding windows. (This scenario is similar to what was presented in Chapter 1, except that joins are excluded here.)

Aggregate Queries           Techniques
Predicates   Windows        Shared       Section
Same         Different      Slices       5.4
Different    Same           Fragments    6.2
Different    Different      Shards       6.3

Table 6.1: A staged approach to shared aggregation

Table 6.1: A staged approach to shared aggregation In such a scenario, a stream processor needs to support hundreds (or more) of aggregate queries that differ in their predicates and windows. In this chapter, we attack this problem in stages, by considering in turn query sets with the characteristics shown in Table 6.1. The table lists the technique developed for each stage, and which section of the dissertation explains each technique. Note that the first stage, queries that differ only in their window specifications, was addressed earlier in Section 5.4 in Chapter 5. In the second stage, we show how to share aggregate queries that have varying selection predicates. Finally, we put these techniques together to solve the problem of sharing aggregate processing for queries with different windows and predicates. As discussed in Section 2.1, shared processing of aggregate queries has been studied for Multiple Query Optimization (MQO) in static databases [Harinarayan et al., 1996; Deshpande et al., 1998], as well as with streams [Arasu and Widom, 2004; Srivastava et al., 2005]. These earlier approaches statically analyze a fixed set of queries in order to find an optimal execution strategy. We argue that this “compiletime” approach is infeasible for two reasons: 1. Dynamic Environment. Queries can join and leave a streaming system at any time. A static approach would require expensive recompilation (Aurora [Carney 3

This scenario is similar to what was presented in Chapter 1, except that joins are excluded here.

134

et al., 2002]) at each such event, consuming precious system resources.

2. Complexity of Analysis. It turns out that for the specific problem that we consider, i.e., shared aggregation with varying windows and selections, static analysis is prohibitively expensive. This is because of the high cost of analysis for varying windows (Section 5.4.2), and because of the complexity and unknowns in isolating common sub-expressions in a set of arbitrary predicates (Section 6.2.2).

In contrast to these previous approaches, the techniques developed in this chapter operate on-the-fly and focus on the data, with very little upfront query analysis. This lets a system support complex multi-query optimization in an extremely dynamic environment. Beyond streaming systems, the fragment sharing approach developed in this chapter can be used in traditional static databases, in order to share aggregate queries that have different predicates. As in the previous chapter, we restrict the focus to queries that compute aggregate functions classified as distributive (e.g., max, min, sum, and count) or algebraic (e.g., avg).

We now summarize the main contributions of this chapter:

1. Shared Data Fragments (SDF). A novel approach to sharing the processing of aggregates with varying, arbitrary selection predicates that breaks the input data into disjoint sets of tuples. (Section 6.2)

2. Shared Data Shards (SDS). An innovative way to combine STS and SDF to share aggregates with varying windows and predicates. We know of no other approach to sharing aggregates that supports more than one kind of variation in the queries. (Section 6.3)

3. On-the-fly MQO. A common feature of the STS, SDF, and SDS techniques is that they operate on data on-the-fly and require no static analysis of the queries

as a whole. This is vital in a dynamic streaming system, and very promising in static systems.

4. Performance Study. We validate the approach presented above with a study of an implementation of the techniques (STS, SDF, and SDS) that evaluates their performance with a real data set. (Section 6.4)

6.2

Varying Selection Predicates

In this section, we consider sharing the processing of aggregates with varying predicates. For this, we develop Shared Data Fragments (SDF), a novel technique for this problem. Later, in Section 6.3, we will combine SDF with STS, the solution to the first half of the problem (i.e., sharing the processing of aggregates with varying windows), to achieve this chapter's goal of shared processing of aggregate queries with varying windows and varying predicates. The SDF technique can offer up to an order of magnitude improvement over state-of-the-art unshared techniques. We begin with a precise problem formulation. Next, we present the intuition for shared fragments. Finally, we explain a novel on-the-fly scheme that obviates the need for static analysis of queries.

6.2.1

Problem Statement

We start with Q, a set of n streaming aggregate queries that each compute the same aggregate function with an identical sliding window over an input stream, where each query applies an arbitrarily complex selection predicate. The predicate can include conjuncts, disjuncts, and even complex user-defined predicates. We say that the query Qi in Q has a complex predicate p(Qi) (abbreviated to pi) and an ungrouped aggregate A. For simplicity, we assume that each query Qi has a tumbling window W (i.e., one where the RANGE and SLIDE parameters are the same) and is similar to Query 6.1.

Query 6.1 Total value of low volume mid-cap xacts
SELECT sum(T.price * T.volume)
FROM   Trades T [RANGE '5 min' SLIDE '5 min']
WHERE  (T.volume > 100) AND midcap(T.symbol)

Let W split the input stream into contiguous sets of tuples, and let T denote such a set. For ease of exposition, we focus on aggregation for a single set of tuples T for the rest of this section. Since we consider tumbling windows here, we merely have to apply the techniques we develop to each subsequent set of tuples. We represent the subset of T that satisfies pi by pi(T). Thus, we need to compute for each query Qi the aggregate A(pi(T)), which we denote by Ai(T), over the set of tuples T.

Figure 6.1: Unshared Aggregation

In a state-of-the-art system like TelegraphCQ [Krishnamurthy et al., 2003] or STREAM [Motwani et al., 2003], the tuples in T are processed by evaluating the predicates of all queries over each tuple in T. For many kinds of selections, systems like TelegraphCQ [Krishnamurthy et al., 2003] and NiagaraCQ [Chen et al., 2000] build an index of the predicates to efficiently process tuples. After these predicates are processed, however, the input set T is split into n subsets that are each aggregated separately, a process we call unshared aggregation (Figure 6.1). In this section, the goal is to develop an alternate evaluation scheme that has fewer total aggregate operations, and hence a reduced associated cost.
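To make this baseline concrete, here is a minimal sketch of unshared aggregation in the style of the Java prototype described later in Section 6.4.1. The double[] tuple representation and the use of sum(price * volume) as the aggregate are illustrative assumptions, not the prototype's actual interfaces.

import java.util.List;
import java.util.function.Predicate;

// Unshared aggregation: every tuple in the window is tested against every
// query's predicate and, on a match, folded into that query's own aggregate.
class UnsharedAggregation {
    // window: the tuples in T; predicates: p1..pn, one per query
    static double[] aggregate(List<double[]> window,
                              List<Predicate<double[]>> predicates) {
        double[] results = new double[predicates.size()];
        for (double[] t : window) {
            for (int i = 0; i < predicates.size(); i++) {
                if (predicates.get(i).test(t)) {
                    results[i] += t[0] * t[1];   // e.g., price * volume
                }
            }
        }
        return results;   // one aggregation per (tuple, satisfied query) pair
    }
}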

6.2.2

The Intuition

The main intuition is to use the predicates {p1, . . . , pn} to partition the tuples in a window of the input stream into disjoint subsets that we call fragments. The tuples in each fragment can then be aggregated to form partial fragment aggregates, which can in turn be processed (via another aggregation) to produce the results for the various queries. In other words, a set of tuples T in a window of the input stream is partitioned into {F0, F1, . . . , Fk}, a set of k + 1 disjoint fragments:

T = F0 ∪ F1 ∪ F2 ∪ · · · ∪ Fk

Each fragment Fi is associated with Q(Fi) ∈ 2^Q, a subset of the query set Q, where every tuple in the fragment Fi satisfies the predicates of every query in Q(Fi), and of no other query. The convention followed here is that F0 is a special fragment whose associated query set Q(F0) is the empty set ∅. The tuples in F0 satisfy none of the predicates {p1, . . . , pn}, and thus do not need to participate in the aggregates of any query and can safely be ignored. Formally, each fragment Fi is created by applying the predicates p1, . . . , pn on the tuples in T:

Fi = {t | (t ∈ p(q) ∀q ∈ Q(Fi)) ∧ (t ∉ p(q) ∀q ∉ Q(Fi))}

Once we partition tuples into fragments, we can efficiently compute the individual aggregates in a two-step process. A conceptual view of the entire process, from partitioning tuples to computing final aggregates for each query, is shown in Figure 6.2.


Figure 6.2: Conceptual View of Shared Fragments

The figure shows how a set of tuples T is first processed by an imaginary "fragment filter" (we explain how this would work in practice later in this section). The sole responsibility of the fragment filter is to partition the set of tuples in T into {F1, . . . , Fk}, a set of k disjoint fragments. Next, in a partial aggregation step, we aggregate the raw data in each fragment Fi to produce a fragment aggregate with the value G(Fi), denoted by Gi. Finally, in the final aggregation step, we aggregate these fragment aggregates to produce each query's results. As in Section 5.3, we denote the aggregation functions used in the partial and final steps by G and H respectively. Thus Ai(T), the result of the aggregate query Qi, is given by:

Ai(T) = H{G(Fj) | ∀Fj, Qi ∈ Q(Fj)}    (6.1)

We now illustrate this concept of fragment aggregation with Example 6.1 below.

Example 6.1 Consider a set of 3 queries Q1, Q2, and Q3 with predicates p1, p2, and p3. These result in a set of 8 possible fragments {F0, . . . , F7} with signatures as shown in Figure 6.3.


Figure 6.3: All possible fragments

For the queries in Example 6.1, each aggregate can be computed from the fragments {F0, . . . , F7} as follows:

A1(T) = A(p1(T)) = H{G(F4), G(F5), G(F6), G(F7)}
A2(T) = A(p2(T)) = H{G(F2), G(F3), G(F6), G(F7)}
A3(T) = A(p3(T)) = H{G(F1), G(F3), G(F5), G(F7)}

As we explain later in Section 6.2.3, the approach developed here uses a dynamic implementation of this conceptual notion of shared fragments. One can, however, conceive of a more traditional static implementation in the style of a traditional MQO system. We now explain why such a hypothetical static approach is not suitable for dynamic streaming environments.

A static approach would use a priori analysis of the fixed set of n queries in Q to determine which of the 2^n possible fragments can actually occur. This analysis would, however, involve computing the subsumption relationships between the predicates of various queries. Unfortunately, this process is known to be computationally expensive [Jarke, 1985]. Further, the system would need to manage this set of possible fragments, and for each tuple, efficiently compute which fragment it actually belongs to. Since there can be an exponential number of fragments, managing them can be expensive. Even if it were possible to statically analyze the queries, the high cost of fragment management is difficult to ameliorate. This is because a static technique may not reveal a tight upper bound on the number of fragments, and will hence have high overheads. Static techniques can make pessimistic estimates of the number of fragments for the following reasons:

1. The predicates that we target can be arbitrarily complex and include opaque user-defined functions that cannot, in general, be easily analyzed.

2. A static analysis may not have access to information such as functional dependencies, or correlations, between attributes, and might overestimate the number of fragments. In streaming systems, these might even vary significantly over time. For instance, in the stock trading application, heavy price fluctuation may be accompanied by high-volume trades more often when the market is falling than when it is stable.

3. Most importantly, in a streaming context, the real number of fragments is actually bounded by |T|, the number of tuples in a window. This number is not fixed, especially when we consider, in Section 6.3, aggregate queries with varying selections and windows.

Beyond the fact that a static analysis of possible fragments is hard, and not complete in general, it is very unsuitable for the dynamic requirements of a data stream processor. A single query joining or leaving the system can greatly affect the fragment computations. Thus, while the conceptual model of shared fragments that we introduced here is useful, we believe that a traditional MQO-style static analysis is not feasible for a data streaming system, and inappropriate for a traditional system.

Figure 6.4: Dynamic Shared Fragments: Partial Aggregation

6.2.3

Dynamic Shared Fragment Aggregation

We now describe a dynamic implementation of Shared Data Fragments. The main insight is that we can use existing data streaming technology to efficiently, and dynamically, identify the fragment a tuple belongs to. Notice that since this approach is dynamic, it is entirely free of the pitfalls of static analysis listed above. Thus, this scheme is useful even in a traditional, non-streaming, MQO context. We next explain in detail the partial and final aggregation steps of this approach, and then present an analysis of the relative costs of unshared and shared aggregation.

6.2.3.1 Partial Aggregation

Figure 6.4 shows a pipeline of stages representing partial aggregation in the dynamic shared fragment approach.

In the first stage, a set of input tuples T is sent to an operator like a GSFilter (described in Section 3.2.2) that implements shared selections. This shared selections operator produces an augmented stream of tuples T′, where each tuple carries along with it a signature that identifies the precise subset of queries that the associated tuple satisfies. While these selections may actually be applied in a shared fashion (e.g., by the Grouped Selection Filter described in Section 3.2.2), sharing selections is not a requirement in this approach. Thus, given a set of tuples T, we can apply the predicates {p1, . . . , pn} on each tuple ti in T to produce its signature bi, and generate the augmented set T′ of pairs of the form (ti, bi). The signature of a tuple is, in fact, the same as the completion vector portion of its lineage (as described in Section 3.1.2), and is typically implemented with a bitmap containing one bit for each of the n queries in Q. We use the same implementation in this approach. Since the signature of a tuple encodes the queries that it satisfies, it also identifies the unique fragment that the tuple belongs to, as well as the queries associated with that fragment. We represent the set of queries Q(Fi) of a fragment Fi with a bit vector bi that has n bits. We call bi the fragment signature of Fi. Note that the j-th bit of bi is set if and only if Qj ∈ Q(Fi). We use |bi| to denote the cardinality of Q(Fi), i.e., the number of queries satisfied by the tuples in the fragment Fi.

In the second stage, these augmented tuples are then processed by a Fragment Manager that dynamically combines, and aggregates, all tuples with identical signatures, i.e., those that belong to the same fragment. Given each tuple-signature pair (ti, bi), we look up the signature bi in a data structure such as a hash table, or a trie, and accumulate ti into an associated "in-place aggregate" using the partial aggregate function G. At the end of the partial aggregation step the fragment manager outputs a set of k fragment aggregate-signature pairs of the form (Gi, bi), where Gi is G(Fi), i.e., the result of applying the partial aggregation function G to the tuples in the fragment Fi. Notice that this strategy of partial aggregation is very similar to the well-known hash-based grouped aggregation [Graefe, 1993]. This observation means that a Fragment Manager can be easily implemented by reusing a standard implementation of a hash-based grouped aggregate operator.
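As a concrete illustration, the following is a minimal Java sketch of such a hash-based Fragment Manager. It assumes, purely for illustration, that G is sum and that signatures are java.util.BitSet bitmaps; the class and method names are hypothetical and are not taken from TelegraphCQ or any other system.

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Partial aggregation over fragments: tuples with identical signatures are
// combined in place, exactly like hash-based grouped aggregation with the
// signature bitmap acting as the grouping key.
class FragmentManager {
    private final Map<BitSet, Double> fragments = new HashMap<>();

    // Accumulate one augmented tuple (ti, bi), with G = sum.
    // The signature must not be mutated after it is used as a key.
    void accumulate(double value, BitSet signature) {
        if (signature.isEmpty()) return;   // fragment F0: satisfies no query
        fragments.merge(signature, value, Double::sum);
    }

    // At the end of the window, emit the k (Gi, bi) fragment aggregate pairs.
    Map<BitSet, Double> fragmentAggregates() {
        return fragments;
    }
}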

6.2.3.2 Final Aggregation

In the final aggregation step (not shown in Figure 6.4), we combine the set of k fragment aggregate-signature pairs, {(G1, b1), . . . , (Gk, bk)}, as defined in equation (6.1). Algorithm 6.1 shows a straightforward technique for this step. We first initialize the final aggregate values A1, . . . , An for each individual query. Then, we consider each fragment aggregate-signature pair in turn, and forward it to every query it is associated with, for aggregation. These queries are picked using the signature of each fragment aggregate.

Algorithm 6.1 Final Aggregation
proc FinalAggregation({(G1, b1), . . . , (Gk, bk)})
    for i := 1 to n
        initialize(Ai);
    end
    for i := 1 to k
        for j := 1 to n
            if (bi[j] = true) then Aj ← H(Aj, Gi);
        end
    end
end
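For completeness, here is a direct Java rendering of Algorithm 6.1 that continues the hypothetical FragmentManager sketch above (with H = sum, as it would be for a sum aggregate):

import java.util.BitSet;
import java.util.Map;

class FinalAggregator {
    // Fan each fragment aggregate out to every query whose bit is set in
    // the fragment's signature; bit j corresponds to query Qj.
    static double[] finalAggregation(Map<BitSet, Double> fragmentAggregates, int n) {
        double[] results = new double[n];           // initialize(Ai) for each query
        for (Map.Entry<BitSet, Double> e : fragmentAggregates.entrySet()) {
            BitSet b = e.getKey();
            for (int j = b.nextSetBit(0); j >= 0; j = b.nextSetBit(j + 1)) {
                results[j] += e.getValue();         // Aj <- H(Aj, Gi)
            }
        }
        return results;
    }
}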

6.2.3.3 Analysis

We now analyze the cost of processing the set of n selective aggregate queries with the unshared and shared techniques. As in Section 5.4, this analysis focuses on the computational complexity of aggregation operations in the partial and final aggregation steps. We measure time complexity in terms of the number of aggregate operations carried out while processing the tuples in T. Table 6.2 summarizes the analysis parameters. Each tuple t ∈ T is augmented with a signature b to form the augmented set T′. Let k be the number of unique signatures in T′, and let B be a set of k signature-frequency pairs {(b1, f1), . . . , (bk, fk)}, where each pair (bi, fi) denotes that the signature bi occurs fi times in T′. We say that the expected cardinality of the signature of tuples in T′ is α, and that the average cardinality of each signature in B is β:

α = ( Σ_{(b,f)∈B} |b| · f ) / |T|

β = ( Σ_{(b,f)∈B} |b| ) / k

Symbol   Explanation
n        Number of queries
T        Set of tuples in the window
T′       Augmented set of tuples in the window
k        Number of unique signatures in T′
B        Set of k signature-frequency pairs in T′
α        Expected cardinality for each signature in T′
β        Average cardinality for each signature in B

Table 6.2: Parameters

Table 6.3 summarizes the costs of the unshared and shared techniques to process n selective aggregate queries. With unshared aggregation, there is only a single "final aggregation" step. Here, each tuple in T is subjected to as many aggregations as the number of queries it satisfies. Although this method only uses the input set T, its cost can be calculated by considering the augmented input set T′ as follows:

Σ_{(t,b)∈T′} |b| = Σ_{(b,f)∈B} |b| · f = |T| α

With the shared approach, the partial aggregation step involves exactly |T| aggregations and produces a set of k fragment aggregates, each associated with a signature in B. As each fragment aggregate is aggregated as many times as the cardinality of its associated signature, the final cost is:

Σ_{(b,f)∈B} |b| = kβ

Technique   Partial   Final
Unshared    0         α|T|
Shared      |T|       kβ

Table 6.3: Unshared and Shared Aggregation Costs

From this analysis, the shared approach is cheaper than the unshared approach when k ≪ |T|, i.e., the number of fragment aggregates has to be significantly less than the size of the input data set. If this is not true, and k ≈ |T| (recall that k cannot be greater than |T|), then the expected cardinality of each signature in T′ approaches the average cardinality of each signature in B, i.e., α ≈ β. In this situation, the cost of shared aggregation approaches |T| + |T|α, which exceeds |T|α, the cost of unshared aggregation.

Consider, for instance, an example based on a scenario with 256 queries from the performance study (the scenario is explained in detail in Section 6.4.3) that involves one hour of 10-minute intervals of stock market data. Here |T|, the average size of the input set for 10-minute intervals over one hour, is 189,445. Let the number of fragments caused by the query workload be k. Also, let us assume for simplicity that an identical number of tuples belong to each fragment, i.e., that α is the same as β. From this workload we set α = β = 76. Figure 6.5 shows the simulated costs of shared and unshared aggregation for this scenario, where the number of fragments k varies from 1 to 189,445.
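Plugging these numbers into the costs of Table 6.3 makes the gap concrete. Using the fragment count k = 12,068 that was observed in the actual workload (discussed below), a back-of-the-envelope calculation gives:

    Unshared:  |T| · α = 189,445 × 76 ≈ 14.4 million aggregations
    Shared:    |T| + k · β = 189,445 + (12,068 × 76) ≈ 1.1 million aggregations

That is, roughly a 13-fold reduction in aggregate operations.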


Figure 6.5: Analysis: Shared vs. Unshared (simulated total aggregations, partial + final, as the number of fragments varies; the Shared Fragments cost at k = 12,068 is marked with an arrow)

In the actual workload, the average number of unique fragments was 12,068. The arrow in the figure shows the point that represents the cost of shared aggregation for this case. While the maximum possible number of fragments is governed by the number and nature of the queries (especially the number of attributes they involve) in a workload, the actual number of fragments in a window depends on the nature of the data set. In this workload, which uses real data, the number of fragments (k = 12,068) is far smaller than the number of tuples in a window (|T| = 189,445). In this particular case, there are fewer fragments because the values found in the real stock market data are not uniformly distributed over their entire domain. For example, in the first 10-minute interval of the stock market data discussed above, the volume attribute of each trade tuple has values between 10 and 1.6 million. There are, however, fewer than 2,000 distinct values seen for this attribute in roughly 200,000 tuples of the interval. Further, the most common value, 100, occurs in half the tuples, and the 15 most common values together account for over 90% of the tuples. We expect that in practice this situation, i.e., the number of observed fragments being small relative to the number of tuples in a given window, will hold for many real applications. In the next section, we address situations where this hypothesis is not true, i.e., query and data workloads that lead to deeply fragmented input streams.


6.2.4

Partial Sharing: Deeply Fragmented Inputs

Here, we consider the effects of using SDF in scenarios where the query and data workload leads to large numbers of fragments. The analysis from Section 6.2.3.3 shows that in such a situation, one we refer to as having Deeply Fragmented Inputs (DFI), the SDF approach will perform very poorly compared to a simple unshared approach. We then present "partial sharing", the strategy used to solve this problem. Finally, we describe two specific partial sharing techniques and how they can be used in real-world situations.

6.2.4.1 The effects of deeply fragmented inputs on SDF

We now explain why workloads with too many fragments cause trouble for a system that is based on the SDF approach.

Figure 6.6: Deeply Fragmented Input Set

Let us suppose that an SDF system runs a workload with n queries and an input data set T. If this is a DFI workload, T can be partitioned into as many as 2^n unique fragments (of course, the number of fragments cannot exceed |T|). For example, Figure 6.6 shows how 5 queries can result in the input set having 28 fragments.


The actual extent of fragmentation depends on the specific nature of the predicates in the queries, as well as on the distribution of the input data set. For instance, as we explained in the previous section, fewer fragments are possible when the values in the input data set are not uniformly distributed over the input domain. Similarly, when the queries involve predicates over fewer attributes of the input data, fewer fragments are possible. To understand why this is so, consider an input data set T with exactly one numeric attribute with a domain D. Let us assume that each of the n queries has a single range predicate over the numeric attribute. Let us denote these predicates by p1, . . . , pn, and the subsets of T that satisfy these predicates by p1(T), . . . , pn(T). It is easy to see that each of these subsets can be represented by an interval in the domain D. Clearly, there can be at most O(n) disjoint fragments formed by the intersection of n such intervals. In general, if the input data has j numeric attributes, then n queries with range predicates on these attributes can have up to O(n^j) possible fragments.
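As a tiny, hypothetical illustration of the single-attribute case: two range predicates p1 ≡ (a < 50) and p2 ≡ (a > 30) can carve the domain into only three non-empty fragments, namely a ≤ 30 (satisfying p1 only), 30 < a < 50 (satisfying both p1 and p2), and a ≥ 50 (satisfying p2 only); no distribution of the data can produce more than these.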

6.2.4.2 Partial sharing

With deeply fragmented input data, we must recognize that there are times when processing all queries in a single shared group is not a good idea. In these cases, an alternative approach is to divide the set of n queries into smaller groups of queries, and apply the SDF technique to each group separately. This is shown schematically in Figure 6.7, where {Q1, . . . , Qn}, the set of n queries, is split into two groups, one with the queries {Q1, . . . , Qm} and the other with the queries {Qm+1, . . . , Qn}. In what follows, we reuse the notation summarized in Table 6.2 and described in Section 6.2.3.3. The intuition here is that, instead of n queries producing a deeply fragmented data set with k ≈ |T| fragments, a smaller group of m queries may result in no more than 2^m fragments, where 2^m ≪ |T|. Notice, however, that this approach has a downside.


Figure 6.7: Shared Fragmentation with 2 Groups

While having fewer fragments results in a cheaper final aggregation phase, the raw input data set T has to be replicated to each query group, leading to more expensive partial aggregation. The key problem, then, is to determine how to partition the n queries in Q into groups of queries. The insight is that by appropriately grouping queries, we can trade off higher partial aggregation costs for considerably lower final aggregation costs.

Let the queries in Q be partitioned into R, a set of p disjoint subsets of Q, where R is {R1, . . . , Rp} such that Ri ⊂ Q for each Ri in R, and R1 ∪ · · · ∪ Rp = Q. For a set of input tuples T, let Ti ⊆ T be the set of tuples that satisfy at least one query in the group Ri (resulting in ki fragments). Further, let the average cardinality of each signature of these ki fragments be βi. The total cost of this alternative approach is the sum of the costs of partial and final aggregation. We use CR to denote this total cost, and define it in (6.2) based on the costs summarized in Table 6.3:

CR = Σ_{1≤i≤p} (|Ti| + ki βi)    (6.2)


Note that the SDF approach is a special case of this grouping scheme where p = 1, i.e., there is only one group. Conversely, the unshared aggregation approach is also a special case of this grouping scheme where p = n, i.e., there are n groups.

6.2.4.3 Grouping techniques for partial sharing

We now describe two grouping techniques, each involving a training phase where the behavior of the queries is observed over a window's worth of data. Both techniques use the data in the training window to partition the queries in Q. Let T be the set of tuples in the training window, T′ be the set of augmented tuples that are formed from the tuples in T for the set of queries in Q, and B be the set of k signature-frequency pairs in T′. We refer to B as the "training information"; it can be used in either technique. Note that both these training schemes can (and should) choose a single group corresponding to SDF if there are not enough fragments to justify partitioning.

Random Equal Groups. In this technique, we partition the n queries into Rp = {R1, . . . , Rp}, a set of p groups, each of size n/p. Each query Q in Q is assigned to exactly one group at random. The main issue here is in choosing the right value for p ∈ {1, . . . , n}, the number of groups. We determine this by considering in turn each possible value for p from 1 to n. For each case, we assign specific queries to each group in a random fashion. If we can compute CRp for each case, then it is simple to choose a value of p that results in a minimal value of CRp.

We now explain how to efficiently compute CRp for a given query partition Rp from the pre-computed training information B. From (6.2), CRp can be determined provided we know the values of |Ti|, ki, and βi for each group of queries Ri in Rp. An efficient method to compute CRp is presented in Algorithm 6.2 below.

Algorithm 6.2 Computing CRp, the total cost of executing the groups of queries in Rp
proc compute-total-grouped-cost(B, G)
    comment: B is the set of signature-frequency pairs in the training information
    comment: G is the set of group signatures, one for each group Ri in Rp
    var Integer kβ ← 0;
    var Integer |T| ← 0;
    var Bitmap bg;
    foreach (b, f) ∈ B
        foreach g ∈ G
            bg ← mask(b, g);
            kβ ← kβ + |bg|;
            if (|bg| ≠ 0) then |T| ← |T| + f;
        end
    end
    return (|T| + kβ)
end

In this algorithm, the procedure compute-total-grouped-cost evaluates CRp using the training information B, and G, the set of group signatures corresponding to the queries in each group Ri in Rp. A group signature gi ∈ G is a bitmap of size n where the only bits that are set are those that represent the queries of the corresponding group Ri. For instance, if n is 8 and Ri is the set of queries {Q0, Q3, Q4, Q7}, then gi is the bitmap (10011001). The procedure begins by initializing the variables kβ and |T| to 0. It then performs a nested loop where the outer loop considers each signature-frequency pair (b, f) in B, and the inner loop considers each group signature g in G. In each iteration of the inner loop, the procedure evaluates a bitmap bg by masking the signature b with the group signature g. Then the variable kβ is incremented by |bg|, the cardinality of the bitmap bg. Similarly, the variable |T| is incremented by the frequency f, provided |bg| is not zero (this matches the definition of |Ti| as a count of tuples, not of signatures). Once the nested loop has completed, the procedure evaluates CRp as the expression |T| + kβ. This algorithm performs |B| iterations of the outer loop, and Σ_{g∈G} |g| (which is the same as n) iterations of the inner loop. Thus this algorithm has a complexity of O(n|B|).
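To make this procedure concrete, here is a compact Java sketch of compute-total-grouped-cost together with the loop that Random Equal Groups uses to pick p. The SignatureFreq record and all other identifiers are illustrative assumptions, not the dissertation prototype's actual code.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

class PartialSharing {
    // One (signature, frequency) pair from the training information B.
    record SignatureFreq(BitSet signature, long frequency) {}

    // Algorithm 6.2: estimated total cost |T| + k*beta of executing the
    // query groups whose membership is given by the group signatures in G.
    static long computeTotalGroupedCost(List<SignatureFreq> B, List<BitSet> G) {
        long kBeta = 0, sizeT = 0;
        for (SignatureFreq sf : B) {
            for (BitSet g : G) {
                BitSet bg = (BitSet) sf.signature().clone();
                bg.and(g);                                  // mask(b, g)
                kBeta += bg.cardinality();
                if (!bg.isEmpty()) sizeT += sf.frequency(); // tuples replicated to this group
            }
        }
        return sizeT + kBeta;
    }

    // Random Equal Groups: for each p in 1..n, assign queries to p groups of
    // roughly equal size at random, and keep the cheapest partition found.
    static List<BitSet> chooseRandomEqualGroups(List<SignatureFreq> B, int n) {
        List<Integer> queries = new ArrayList<>();
        for (int q = 0; q < n; q++) queries.add(q);
        List<BitSet> best = null;
        long bestCost = Long.MAX_VALUE;
        for (int p = 1; p <= n; p++) {
            Collections.shuffle(queries);                   // random assignment
            List<BitSet> groups = new ArrayList<>();
            for (int i = 0; i < p; i++) groups.add(new BitSet(n));
            for (int i = 0; i < n; i++) groups.get(i % p).set(queries.get(i));
            long cost = computeTotalGroupedCost(B, groups);
            if (cost < bestCost) { bestCost = cost; best = groups; }
        }
        return best;
    }
}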

Figure 6.8: Example of agglomerative clustering process. (Step 0 starts with five singleton clusters (Q0) (Q1) (Q2) (Q3) (Q4), the unshared extreme; Step 1 merges to (Q0) (Q1) (Q2Q4) (Q3); Step 2 to (Q0Q1) (Q2Q4) (Q3); Step 3 to (Q0Q1Q2Q4) (Q3); Step 4 ends with the single fully shared cluster (Q0Q1Q2Q3Q4). Steps 1-3 represent partial sharing.)

Clustered Groups. In this approach, we relax the restriction that all groups have the same size, and try to be smart about assigning queries to groups. We use an assignment technique called agglomerative clustering [Johnson, 1967], which starts by creating n clusters, each with exactly one distinct query in Q. We then perform n − 1 iterations, where at the i-th step we merge precisely two of the remaining n − i + 1 clusters, until we are left with a single cluster containing all queries. Figure 6.8 shows a tree representing the progress of an example of this clustering scheme, which begins at Step 0 with 5 clusters of one query each, and ends at Step 4 with 1 cluster consisting of 5 queries. The figure shows the merging of two clusters with thick arrows. While Step 0 and Step 4 correspond to the unshared and fully shared (SDF) approaches respectively, Steps 1, 2, and 3 are examples of partial sharing. The clustering procedure tries to choose the best approach from these options.

At each step, we consider each possible new cluster (denoted by Rc) in turn, and compute its cluster execution cost. For this computation, we use the procedure compute-total-grouped-cost in Algorithm 6.2, with inputs consisting of the training data B and a singleton set G containing the group signature of the cluster Rc. In each step we choose a cluster such that the total aggregation cost across all clusters is minimized. After all iterations, we compare the total aggregation cost at each step and choose the cluster assignment corresponding to the lowest total aggregation cost seen in these n − 1 steps.

In each step of the iteration, the algorithm is conceptually performing an all-pairs computation in order to compute the cost of a new cluster and find the associated total execution cost. In practice, however, only the first iteration needs an all-pairs computation; it has a cost of n²|B|, since it calls compute-total-grouped-cost n² times. By recording the results of each of these n² calls (e.g., in a two-dimensional array), each subsequent iteration only needs to call compute-total-grouped-cost as many times as there are clusters to consider. This means that after the first step, the costs of the remaining iterations are (n − 1)|B|, (n − 2)|B|, . . . , 2|B|. Thus the total cost of the training phase is O(n²|B|). Note that if the training window is very large, we can use a random sample instead.

In the stock market example, there is enough downtime overnight that the training phase can be used, based on the previous day's data, to re-cluster or regroup the query set. While this does involve a static phase in shared aggregation, we emphasize that the training phase is still quite operational: it does not examine and try to analyze the actual predicates of each query. The ability to efficiently share aggregates with arbitrary selections by treating these selections as "black boxes" is an important contribution of this work.
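A simplified Java sketch of this clustering loop, continuing the hypothetical PartialSharing class above, is shown below. For brevity it recomputes each trial cost from scratch instead of memoizing the pairwise costs as described, so it is slower than the O(n²|B|) training phase of the text; the greedy structure, however, is the same.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class ClusteredGroups {
    // Greedy agglomerative clustering: start from n singleton clusters and,
    // at each step, merge the pair of clusters that minimizes the total
    // grouped cost, remembering the cheapest configuration seen at any step.
    static List<BitSet> choose(List<PartialSharing.SignatureFreq> B, int n) {
        List<BitSet> clusters = new ArrayList<>();
        for (int q = 0; q < n; q++) {
            BitSet s = new BitSet(n);
            s.set(q);
            clusters.add(s);                  // Step 0: the unshared extreme
        }
        List<BitSet> best = new ArrayList<>(clusters);
        long bestCost = PartialSharing.computeTotalGroupedCost(B, clusters);
        while (clusters.size() > 1) {
            int mi = -1, mj = -1;
            long stepBest = Long.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    List<BitSet> trial = new ArrayList<>(clusters);
                    BitSet merged = (BitSet) trial.get(i).clone();
                    merged.or(trial.get(j));
                    trial.remove(j);          // remove j first since j > i
                    trial.remove(i);
                    trial.add(merged);
                    long cost = PartialSharing.computeTotalGroupedCost(B, trial);
                    if (cost < stepBest) { stepBest = cost; mi = i; mj = j; }
                }
            }
            BitSet merged = (BitSet) clusters.get(mi).clone();
            merged.or(clusters.get(mj));
            clusters.remove(mj);
            clusters.remove(mi);
            clusters.add(merged);
            if (stepBest < bestCost) {        // keep the cheapest step overall
                bestCost = stepBest;
                best = new ArrayList<>(clusters);
            }
        }
        return best;                          // 1 cluster = SDF, n clusters = unshared
    }
}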

Although we could not find evidence of high fragmentation in the real workloads that we considered, such as the stock market data used in the performance study in Section 6.4, we must be prepared to handle such a situation in practice. In fact, for the stock market workload both of the grouping techniques presented here always recommend a single group containing all queries, i.e., the SDF approach.

In summary, we presented in this section the Shared Data Fragments (SDF) approach to efficiently process multiple streaming aggregate queries with varying selections and identical windows. We proposed an innovative dynamic implementation of Shared Data Fragments that leverages existing advances in stream query processing without any of the drawbacks of static analysis. We also considered the effects of workloads which can cause SDF to perform very poorly, and proposed modifications of SDF to handle such cases. We present a detailed experimental evaluation of these approaches in Section 6.4.

6.3

Putting it all together

In this section we are now ready to address this chapter's main problem: shared processing of aggregate queries with varying predicates and windows. The approach is to put together the techniques developed earlier in this dissertation (STS in Section 5.4 and SDF in Section 6.2) that solve simpler versions of this problem. We show here that these techniques work well together to form a new technique, Shared Data Shards (SDS), and solve the main problem. We compare the SDS approach with Unshared Sliced (US), a scheme where each query has an operator chain with a sliced window aggregate followed by an overlapping window aggregate. For each input tuple, we can apply the predicates of the queries, possibly in shared fashion. Then, for each query the tuple satisfies, we replicate the tuple to its operator chain.

Figure 6.9: Shards

From a high level, both STS and SDF partition an input data set into chunks of tuples that are partially aggregated, and then aggregate these partial aggregates to answer individual queries. Thus, when both windows and selections vary, we can conceive of an approach that partitions the input set to form what we call shards (these can be thought of as fragmented slices or sliced fragments), as shown in Figure 6.9, partially aggregating the tuples in each shard, and aggregating these partial aggregates to answer each query. We cannot, however, further partition a partial aggregate formed by sliced or fragmented aggregation. Instead, we must first apply one partitioning operation (slicing or fragmentation) to the input stream, and then apply the other operation to these partitions before computing any partial aggregates. Thus, the main question is: "which partitioning operation, slicing or fragmentation, should we perform first?"

To answer this question, we must look back to the implementations of STS and SDF that we presented earlier. Recall that while partial aggregate computation is similar to the sorting-based strategy for grouped aggregates in STS, it is similar to a hashing-based strategy for grouped aggregates in SDF. The reason that the sorting-based strategy is possible for STS is that in a data stream, tuples are naturally sorted by time. In contrast, contiguous tuples do not necessarily have identical signatures (and hence do not belong to identical fragments), and so SDF needs to use a hashing-based strategy. The consequence of this observation is that, while the partial aggregation step is performed "in-place" along with the partitioning step in SDF, it can safely be separated from the partitioning step in STS.

Figure 6.10: Shared Data Shards

Using this insight, we propose the Shared Data Shards (SDS) technique, which uses elements from both the STS and SDF techniques. This technique is shown as a pipeline of stages in Figure 6.10. As in STS, we use a Slice Manager that is aware of the paired windows of each query in the system to demarcate slice edges in an input stream using heartbeats. These slices of tuples are then passed on to a shared selections operator (e.g., a GSFilter) to produce slices of augmented tuples, just as in STS. These augmented tuples are then sent to an SDF-style Fragment Manager that computes partial aggregates of shards. Next, these shard aggregates are processed using SDF-style Final Aggregation (Section 6.2.3.2) and sent to the appropriate per-query overlapping window aggregates. Finally, these operators produce result tuples for each query.
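As a rough illustration of the operator chain, the following skeleton shows the order of the SDS stages in code form. Every interface and name here is a hypothetical stand-in, and the internals of each stage are elided; this is a sketch of the data flow, not TelegraphCQ's actual operator API.

import java.util.BitSet;
import java.util.Map;

// Hypothetical skeleton of the SDS pipeline: Slice Manager -> shared
// selections -> shard partial aggregation -> final aggregation fan-out
// to per-query overlapping window aggregates.
class SdsPipeline {
    interface SliceManager    { boolean sliceEdge(long ts); }        // paired-window heartbeats
    interface SharedFilter    { BitSet signature(double[] tuple); }  // e.g., a GSFilter
    interface ShardAggregator {
        void accumulate(double[] tuple, BitSet sig);                 // in-place partial aggregation
        Map<BitSet, Double> drain();                                 // shard aggregates at a slice edge
    }
    interface WindowAggregate { void addPartial(double g); }         // per-query overlapping windows

    private final SliceManager slices;
    private final SharedFilter filter;
    private final ShardAggregator shards;
    private final WindowAggregate[] perQuery;

    SdsPipeline(SliceManager s, SharedFilter f, ShardAggregator a, WindowAggregate[] q) {
        this.slices = s; this.filter = f; this.shards = a; this.perQuery = q;
    }

    // Process one input tuple through the SDS stages, in order.
    void onTuple(double[] tuple, long ts) {
        if (slices.sliceEdge(ts)) {                      // 1. Slice Manager (as in STS)
            for (Map.Entry<BitSet, Double> e : shards.drain().entrySet()) {
                BitSet b = e.getKey();                   // 4. SDF-style final aggregation
                for (int j = b.nextSetBit(0); j >= 0; j = b.nextSetBit(j + 1))
                    perQuery[j].addPartial(e.getValue());
            }
        }
        BitSet sig = filter.signature(tuple);            // 2. shared selections (GSFilter)
        shards.accumulate(tuple, sig);                   // 3. shard partial aggregation (as in SDF)
    }
}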


An analysis of the relative complexities of the unshared (US) and shared (SDS) schemes for processing aggregate queries with varying selections and windows is essentially the same as that for the unshared and shared (SDF) schemes in Section 6.2.3.3. The main difference is that, unlike STS and SDF, the SDS approach ends up splitting the input data into much smaller shards. Even so, in the performance study in Section 6.4, we will show that for the real data sets we used, such as the stock market trading data, the performance of SDS greatly exceeded that of US.

Integrating SDS with TULIP plans for join queries. We now consider how SDS can be modified to share queries that compute join operations in addition to aggregates with varying windows and individual predicates. As described above, the SDS approach involves passing tuples through a chain of operators beginning with a Slice Manager and a GSFilter. In particular, the output tuples produced by the GSFilter are augmented with lineage information consisting of a completion vector. This chain of operators can be reordered with the GSFilter being processed before the Slice Manager, since the lineage vectors of tuples do not affect the Slice Manager in any way. Furthermore, if the queries being processed involve joins of multiple streams in addition to aggregates, then an arbitrary operator tree can be substituted in place of the GSFilter in the chain of operators of the SDS approach, as long as the substituted operator tree also produces tuples augmented with lineage vectors. An example of such a tree of operators is a TULIP plan (from Chapter 4) without Output operators. Thus SDS can be elegantly integrated with TULIP in order to share the processing of queries that compute aggregate and join operations, with varying windows and individual predicates.

In this section, we presented Shared Data Shards (SDS), a new approach to efficiently share the processing of multiple streaming aggregate queries that differ in selections and periodic windows. This solves an important problem in understanding how to share aggregate queries that vary in more than one aspect. A key property of this scheme is that it achieves the objectives of shared processing without requiring any prior analysis of the queries involved. This is important for two reasons stated earlier: (1) the kinds of analysis required for sharing such queries are hard, and (2) in a data streaming context, queries are expected to join and leave at any time, and system resources cannot be engaged in complex multi-query optimization while processing live data.

6.4

Performance Study

In this section, we present a detailed performance study of the various approaches that we developed in this chapter. This study uses real-world data as well as synthetic data. The real-world data is based on intra-day trading data from the NYSE and the NASDAQ that corresponds to the examples used in this chapter. The synthetic data set is used to force scenarios where the SDF technique for shared aggregation with varying predicates leads to a very large number of fragments. We focus on each of the three stages that are part of this chapter's goal of shared aggregation with varying predicates and windows, as summarized in Table 6.4. We first describe the experimental setup and then present and analyze the results.

Workload   Predicates   Windows     Section
(A)        Same         Different   6.4.2
(B)        Different    Same        6.4.3
(C)        Different    Different   6.4.4

Table 6.4: Query Workloads

6.4.1

Experiment Setup

We built a prototype aggregate query processor for data streams in Java. With this prototype we can realize query plans for all the schemes described in this chapter. The data we use is summarized in Table 6.5 below. The real-world portion consists of a stream (Trades) with intra-day trading data from the NYSE [NYSE] and NASDAQ [NASDAQ] stock markets on December 1st, 2004 during the regular trading hours of 09:30 AM to 4:30 PM; a static table (Close) with the previous day's closing price for all stocks; a static table (Index) that reflects whether or not a given stock is in any of the Russell 3000, Russell 2000, or Russell 1000 indexes; and a synthetic stream (Four) that has four integer attributes, each uniformly distributed in the interval [1, . . . , 10000].

Name     Schema                          Type     Source
Trades   (Time, Symbol, Price, Vol)      Stream   NYSE/NASDAQ
Close    (Symbol, CP)                    Table    NYSE/NASDAQ
Index    (Symbol, R3000, R2000, R1000)   Table    NYSE/NASDAQ
Four     (a, b, c, d)                    Stream   Synthetic

Table 6.5: Data Schema

Each real-world workload has a query set with {16, 32, 64, 128, 256} queries over one hour's worth of data starting at 12:00 noon. The queries are all based on the template in Query 6.2. They involve a join on the Symbol attribute between the Trades stream and the Close and Index tables. Each query computes the total transaction value of all trades, subject to possible restrictions, in a sliding window.

Query 6.2 Query Template for Stock Market Data
SELECT sum(T.price * T.vol)
FROM   Trades T [RANGE r SLIDE s], Close C, Index X
WHERE  T.Symbol = C.Symbol AND T.Symbol = X.Symbol
       [AND Member AND Value]

The condition checked could be whether or not the trade represented an "outlier" in terms of its volume, or whether or not it revealed a significant price movement of the equity from the previous day's closing price. These restrictions are summarized in Table 6.6. Queries in (A) have a range and slide picked uniformly at random from [600, 900] and [300, 600] seconds respectively. Queries in (B) have a complex predicate with "Member" and "Value" conjuncts. The Member conjunct picks with uniform probability a market index from one of {R3000, R2000, R1000}, and with uniform probability checks whether or not the traded stock belongs to it. The Value conjunct picks with equal probability a quantity that is one of the volume, value, or % change for the day, and checks, with equal probability, whether this quantity is greater than or less than a constant. Note that queries in (A) have no predicates, and queries in (B) have the range and slide both set to 600 seconds. We explain the queries in (C) in Section 6.4.4.

Each synthetic workload has a query set with {16, 32, 64, 128} queries, all based on the template in Query 6.3. The queries are aggregates over tumbling windows of data, each with 10,000 rows, and each query has a range predicate. We run every experiment here in two ways: the range predicate is "Inequality-Two" in the first way, and "Inequality-Four" in the second. The possible forms of both these predicates are defined in Table 6.6. In each case, the constant Z is picked uniformly at random from the interval [1, . . . , 10000].

Query 6.3 Query Template for Synthetic Data
SELECT count(*)
FROM   Four T [RANGE 10000 ROWS SLIDE 10000 ROWS]
WHERE  [Inequality-Four | Inequality-Two]

In this chapter we focus on processing shared aggregates. We assume that predicate evaluation can be sped up by using well-known techniques such as Rete [Forgy, 1982], CACQ [Madden et al., 2002b], and NiagaraCQ [Chen et al., 2000]. Thus, in the experiments reported in this section, we measure the actual time spent processing the aggregates and any accompanying overheads, such as the use of hash tables. In order to minimize any effects of I/O, we buffer the data in the hash tables in memory. Each value that we report for any workload with n queries is an average computed from 10 iterations, each with a different set of n queries.

Type        Name               Values
Window      Range: r           [600, 900] seconds
            Slide: s           [300, 600] seconds
Predicate   Member             X.R3000: true, false
                               X.R2000: true, false
                               X.R1000: true, false
Predicate   Value              Vol: > V, < V
                               Vol*Price: > W, < W
                               abs(Price-CP)/CP: > F, < F
Predicate   Inequality-Two     a: > X, < X
                               b: > X, < X
Predicate   Inequality-Four    a: > Z, < Z
                               b: > Z, < Z
                               c: > Z, < Z
                               d: > Z, < Z

Table 6.6: Query Parameters

6.4.2

(A) Same Predicates, Different Windows

In this workload, we examine queries with identical selection predicates and different periodic windows, over a real data set. We compare the execution time across 4 strategies (Unshared Paned, Unshared Paired, Shared Paned, Shared Paired) based on the unshared sliced and shared sliced approaches discussed in Section 5.4.

Figure 6.11: Sliced Aggregates (execution time vs. number of queries for the Unshared Paned, Unshared Paired, Shared Paned, and Shared Paired strategies)

The results for all 4 strategies are shown in Figure 6.11. Here, the shared approaches significantly outperform the unshared schemes by more than an order of magnitude. For instance, with 256 queries, unshared paired costs 50.48 seconds, while shared paired costs only 2.63 seconds. This is because, although unshared paired has fewer final aggregations (6,813) than shared paired (1,462,391), unshared paired has many more partial aggregations (291,490,816) than shared paired (1,138,636). Further, for all query sizes, the paired approach outperforms the paned approach. For the 256 queries case, shared paned costs 6.90 seconds, more than twice as much as shared paired. This is because shared paned has far more final aggregations


(2,340,842), and has to buffer many more partial aggregate tuples. These results match the analysis from Section 5.4.1. First, this analysis correctly predicted (Section 5.4.1.2) that with high data rates the shared approaches will heavily outperform the unshared approaches. Second, we proved in Section 5.4.1.3 that paired windows are better than paned windows, and in fact are optimal, for shared processing. Again, this matches the experimental results.

6.4.3

(B) Different Predicates, Same Windows

In this workload, we examine queries with differing selection predicates and identical tumbling windows. Since these are tumbling windows, the paired and paned options are identical and there is no final overlapping window aggregate step. We consider two types of workloads, one with the stock market data, and the other with synthetic data.

6.4.3.1 Workload with real-world data

Here we consider the workload that uses real-world stock market data, and compare the execution time for the Unshared and Shared Data Fragments strategies from Section 6.2. The results are shown in Figure 6.12, with a split of the partial and final aggregation costs for the Shared Data Fragments scheme.

Figure 6.12: Unshared/Shared Data Fragments (execution time vs. number of queries)

For all query sizes, the Shared Data Fragments (SDF) approach vastly outperforms the state-of-the-art Unshared Sliced (US) approach. For example, with 256 queries, the US scheme uses 27.25 seconds, while the SDF scheme uses only 3.01 seconds, a savings of 89% and almost an order of magnitude. In this case, US has to compute 116,101,112 aggregations, while SDF has to compute only 7,771,956 total aggregations (1,094,947 partial and 6,677,009 final). Note that although SDF performs only 6% as many aggregations as US, its total time is about 11% of US's. This is because of the overheads of SDF, such as hash table and bitmask operations.

To understand why the SDF scheme gives us such outstanding results, we focus on the number of tuples and the average number of unique fragments. For instance, with 256 queries, the window from 12:20 to 12:30 has 218,583 tuples, but only 11,944 unique fragments. In fact, over a 1-hour time period from 12:00 to 1:00, the average number of tuples per 10-minute window is 192,945 and the average number of fragments per window is 11,407. More interestingly, over the entire 1-hour period (with six 10-minute windows), there are 1,181,901 tuples, but only 34,327 unique fragments. Notice that while any static analysis scheme would have analyzed the query workload as splitting


the data set into at least 34,327 fragments, the dynamic approach gives us all the benefits of such a static approach while only having to manage on average 11,407, and at most 11,893, fragments.

In summary, we found that SDF can provide an order of magnitude improvement over the Unshared approach. The SDF scheme performs very well because of the small number of unique fragments that occur in each window.

6.4.3.2 Workload with synthetic data

Here we consider workloads that are based on synthetic data. We study the two grouping strategies for partial sharing from Section 6.2.4.3: Random Equal Groups (REG) and Clustered Groups (CG). We consider two separate workloads, one where the queries have predicates that involve only 2 attributes, and the other where the queries have predicates that involve 4 attributes. In each case we compare the Unshared (US) approach with the SDF, REG, and CG techniques by running the queries with 20 successive windows of 10,000 tuples each. With both grouping techniques (REG and CG) we use the first window of 10,000 tuples as a training window, and thus report the measurements for the remaining 19 windows.

Queries with predicates involving 2 attributes. The results for the query workload with predicates involving 2 attributes are shown in Figure 6.13. The figure shows a split of the partial and final aggregation costs for the US, SDF, REG, and CG techniques. For all query sizes the US approach is outperformed by the other three shared techniques. For instance, with 128 queries the US scheme uses 4.06 seconds, while the most expensive shared scheme (SDF) uses only 1.19 seconds, a savings of over 70%. The big difference from the workloads using real data is that while SDF is marginally cheaper than REG and CG for workloads with 32 and fewer queries, SDF gets progressively more expensive for workloads with 64 and more queries. In particular, for workloads with 64 queries, SDF uses 0.28 seconds in comparison to 0.30 seconds and 0.24 seconds for REG and CG respectively. For workloads with 128 queries, SDF (with 1.19 seconds) is a clear loser compared to REG (with 0.53 seconds) and CG (with 0.23 seconds).

Figure 6.13: Synthetic Data: Queries with predicates involving 2 attributes (execution time, split into partial and final aggregation costs, vs. number of queries)

To understand the poor performance of SDF in this workload, we focus on the relative costs of partial aggregation and final aggregation. We see that for the SDF technique in this workload, the final aggregation cost (1.02 seconds) dominates that of partial aggregation (0.17 seconds) by almost an order of magnitude. This is in sharp contrast to the experiments reported in Section 6.4.3.1, where even for workloads with 256 queries the final aggregation cost was no more than the partial aggregation cost. We now focus on the number of partial aggregations (corresponding to |T|) and


the number of final aggregations (corresponding to kβ) for a specific representative experiment in this workload. For this case, we found that the values of |T| and kβ were 190,000 and 4,569,209 for SDF, 570,000 and 777,648 for REG, and 380,000 and 117,390 for CG respectively. The high value of kβ relative to |T| means that this is a DFI situation for SDF, i.e., one where the workload has led to the input data set being very deeply partitioned. Notice that the values of |T| for REG and CG are respectively triple and double that for SDF. This shows that REG and CG have chosen 3 and 2 groups respectively, and have thus taken on a higher number of partial aggregations. In return, they have substantially fewer final aggregations to perform (nearly 1/6th and 1/39th that of SDF for REG and CG respectively). Thus in this case, although SDF is cheaper than US even for 128 queries, it suffers from the effects of deep fragmentation as the number of queries in a workload increases. Further, REG and CG offer significant performance improvements over SDF in such DFI situations. Notice also that decreasing the number of final aggregations is not just a matter of increasing the number of groups. For instance, CG has fewer groups (2) than REG (3) but performs roughly 1/6th as many final aggregations. This is because while REG assigns a query to a group in a random fashion, CG performs a much smarter assignment of queries to groups, and thus does a better job at trading the costs of partial aggregation for those of final aggregation.

Queries with predicates involving 4 attributes. We now consider experiments with workloads of queries having predicates involving 4 attributes. The results of these experiments (shown in Figure 6.14) reinforce the trend of SDF performing poorly in synthetic data workloads with greater numbers of concurrent queries. As in the previous case, the figure shows a split of the partial and final aggregation costs for the US, SDF, REG, and CG techniques.

Figure 6.14: Synthetic Data: Queries with predicates involving 4 attributes (execution time, split into partial and final aggregation costs, vs. number of queries)

Here, the total costs of US are 58% more than those of SDF for 32 and fewer queries, the same as those of SDF for 64 queries, and approximately 18% less than those of SDF for 128 queries. In all the experiments presented in this chapter, this 128-query case is the only instance of a shared approach performing worse than an unshared approach. With workloads having both 64 and 128 queries, however, the REG and CG techniques perform much more efficiently than US. For instance, with 128 queries US uses 4.12 seconds while REG and CG use 2.03 seconds (a savings of 50%) and 0.63 seconds (a savings of 85%) respectively. The reasons why REG and CG outperform SDF for 32 and greater queries are the same as in the previous case for queries with predicates involving 2 attributes: while these workloads have the DFI property for SDF, they are handled by REG and

CG by trading higher partial aggregation for final aggregation. It is also instructive to examine the case with 16 queries, where SDF uses 0.13 seconds and is marginally better than REG and CG, which both use 0.15 seconds. This slight performance advantage of SDF holds despite the fact that REG and CG choose a single group of queries and end up performing the exact same number of partial and final aggregations as SDF. The marginally higher costs of REG and CG can be attributed to the extra bit masking operations that these techniques perform on a per-tuple basis. As in the previous case (predicates involving 2 attributes), CG outperforms REG as the number of queries increases. This is because CG is able to assign queries to groups in a smarter fashion. For example, while a typical 128-query experiment results in REG choosing 6 groups and performing 1,460,120 final aggregations, CG chooses only 3 groups and needs only 524,409 final aggregations.

In summary, here we found that synthetic data workloads can be used to create scenarios with deeply fragmented input sets where SDF performs increasingly poorly with greater numbers of concurrent queries. Some of these situations are so extreme that SDF is even outperformed by the naïve unshared approach. In these situations, the partial sharing approaches of REG and CG shine. In particular, CG outperforms REG in every instance and provides significant advantages over the other approaches.

6.4.4

(C) Different Predicates, Different Windows

We now consider workloads that represent this chapter’s main problem, i.e., aggregate queries with varying predicates and windows. We consider, in turn, a regular workload and a low sharing workload; for each case we compare the performance of the Shared Data Shards (SDS) approach with the Unshared Paired (UP) approach (from Section 6.3). In the first, regular case, we consider query sets with sizes in {16, 36, 64, 144, 256}. Here, a query set with n queries has predicates chosen from √n distinct predicates


(based on (B) above), and windows chosen from √n distinct windows (based on (A) above). These sets are constructed so that all queries in a set of n queries are unique, but with exactly √n unique predicates and √n unique windows. In the second, low sharing case, we have sets of queries with sizes in {16, 32, 64, 128, 256}. Here, each query set is constructed by combining the properties of the sets used in (A) and (B) above. Thus, a given query set will have neither any duplicate windows, nor any duplicate predicates.

6.4.4.1

Regular Workload

The results for the regular workload are shown in Figure 6.15. We do not separate out the final aggregation cost in the UP scheme as it is negligible. For all query set sizes, the SDS approach consistently outperforms the UP method. For instance, for 256 queries, the UP method costs 31.19 seconds, while the SDS approach costs only 3.67 seconds, a savings of 89%. Notice that here, SDS provides nearly an order of magnitude improvement over UP, for a workload where every query is different. This is because in this case, UP has to perform 100,909,915 (100,902,960 partial + 6,955 final) aggregations, while SDS only has to perform 3,395,132 (1,138,636 partial + 2,256,496 final) aggregations. Again, while SDS only performs about 3.3% of the aggregations of UP, its cost is about 11% of UP’s. The difference can be attributed to the overheads of SDS. Note that the difference here is about 7%, more than the 5% difference in the SDF case. This is because SDS has all the overheads of SDF, as well as the additional overheads of the Slice Manager.

6.4.4.2

Low Sharing Workload

We now consider the low sharing workload where every query is unique, and every predicate and window is also unique across all queries. This is a low sharing workload,


Figure 6.15: Regular: Unshared/Shared Data Shards. (Plot of time in seconds against the number of queries (16, 36, 64, 144, 256), splitting the partial and final aggregation costs of Shared Data Shards against the partial aggregation cost of Unshared Paired, for the regular workload with different predicates and different windows.)

where at first blush it might seem that there are no easy sharing opportunities. Even so, the results, shown in Figure 6.16, are very encouraging. The partial and final aggregation costs for SDS are split, and we omit the final aggregation costs of UP. For all query set sizes, SDS significantly outperforms UP. For instance, for 256 queries, the total cost for UP is 34.56 seconds, while that for SDS is only 17.73 seconds. This is a savings of approximately 49%, or nearly a factor of 2. Again, these savings can be attributed to the fewer aggregations that SDS performs (65,278,107 total = 1,294,064 partial + 63,984,043 final) as compared to UP (120,679,022 total = 120,672,210 partial + 6,812 final). Notice that this savings is less than the order of magnitude improvements we see


Figure 6.16: Low Sharing: Unshared/Shared Data Shards. (Plot of time in seconds against the number of queries (16, 32, 64, 128, 256), splitting the partial and final aggregation costs of Shared Data Shards against the partial aggregation cost of Unshared Paired, for the low sharing workload with different predicates and different windows.)

in the regular workload above. Essentially, when both predicates and windows vary, the input stream gets partitioned into small “shards” (Figure 6.9). Since there is more variation in this workload, there are more shards, each with fewer tuples. We emphasize that in such a low sharing workload, where there are no repeated windows or predicates, the other schemes (STS and SDF) cannot be used. In summary, these experiments show that when both predicates and windows vary, the opportunities for sharing can be small or large. In either case, SDS can exploit these opportunities and provide improvements of between a factor of 2 (for fewer opportunities) and a factor of 10 (for greater opportunities) over UP.

Summary


The performance study examined different approaches to processing sets of aggregate queries with varying predicates and windows. In all the experiments with real-world data, the dynamic sharing approach developed in this chapter gives large benefits over the state of the art. In the experiments with synthetic data, however, the basic shared scheme performs poorly. In these situations the partial sharing techniques perform very well and provide large benefits over the unshared and shared approaches. The specific conclusions we can draw are:

1. Paired beats Paned. Paired windows are always superior to paned windows, whether for a single query or for multiple queries where only windows vary. This result reinforces the proof of the optimality of paired windows in Section 5.4.1.

2. Shared Data Fragments beats Unshared (with no DFI). In the experiments with real-world data when only predicates vary, we found that there are far fewer unique fragments than tuples in any given window. That is, the input data set is not very deeply fragmented in these scenarios. As a result, SDF offers up to an order of magnitude improvement over the unshared technique.

3. Partial Sharing beats Shared Data Fragments and Unshared (with DFI). In the experiments with synthetic data when only predicates vary, we found the input data set is highly fragmented. As a result, SDF performs poorly, and in some cases even worse than the unshared approach. In these situations, however, the partial sharing approaches of REG and CG perform very well and offer up to an order of magnitude improvement over the unshared approach.

4. Shared Data Shards beats Unshared Paired. In the experiments, we found that when predicates and windows both vary, the SDS approach can offer improvements of between a factor of 10 and a factor of 2 over UP. The latter case


is for extremely aggressive stress tests where no window or predicate occurs in more than one query.
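As a concrete illustration of the fragment idea behind conclusions 2 through 4, here is a minimal sketch of Shared-Data-Fragments-style evaluation of a single window, with one SUM query per predicate. The restriction to SUM, the names, and the toy data are mine; the operators developed in this chapter additionally handle windows (via slices and shards) and other aggregate functions.

    from collections import defaultdict

    def shared_fragment_sums(tuples, predicates, value):
        """Evaluate one SUM(value) query per predicate over one window,
        sharing a single scan of the input."""
        # Partial aggregation: one running sum per fragment, where a
        # fragment is the bit-vector of predicates a tuple satisfies.
        partials = defaultdict(float)
        for t in tuples:
            sig = tuple(p(t) for p in predicates)
            partials[sig] += value(t)
        # Final aggregation: each query combines exactly the fragments
        # whose bit for that query is set.
        answers = [0.0] * len(predicates)
        for sig, s in partials.items():
            for q, bit in enumerate(sig):
                if bit:
                    answers[q] += s
        return answers

    # Example: three queries over (price, volume) trades, summing volume.
    trades = [(10.0, 5), (11.0, 2), (9.5, 7), (12.0, 1)]
    preds = [lambda t: t[0] > 10,          # query 1: price > 10
             lambda t: t[1] >= 5,          # query 2: volume >= 5
             lambda t: t[0] * t[1] > 50]   # query 3: notional > 50
    print(shared_fragment_sums(trades, preds, value=lambda t: t[1]))  # [3.0, 12.0, 7.0]

A DFI workload is precisely one where nearly every tuple lands in its own fragment, so the partial pass saves nothing while the final pass still pays per query; partial sharing then splits the predicates into groups to rein in the final pass.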

6.5

Summary

In this chapter we explored the problem of sharing the execution of multiple concurrent aggregate queries that can have varying predicates as well as windows. This work builds on the contributions of earlier chapters, where we considered the effects of varying predicates for join queries (Chapter 4) and varying windows for aggregate queries (Chapter 5). Recall that in the last chapter we developed “Shared Time Slices”, a technique for shared aggregation for queries with varying windows. Here, we first considered sharing aggregate queries that have identical windows but varying predicates. We proposed a dynamic sharing approach called “Shared Data Fragments” for this problem. For situations where this approach performs worse than the unshared techniques, we developed highly efficient schemes that are based on the idea of “partial sharing”, i.e., sharing the execution of groups of queries instead of sharing all queries in one group. Finally, we showed how these two techniques can be combined in the “Shared Data Shards” approach, in order to solve the main problem of the chapter. These strategies mark a significant departure from the state of the art in shared processing of aggregate queries, where queries are normally optimized together statically to produce an efficient execution plan. The traditional approach is not suitable in a streaming system where queries join and leave at any time. Further, we showed why the static approach is computationally expensive for queries with varying selections or varying windows. Instead, we proposed innovative ways to share the processing of queries on-the-fly by examining the data as it streams in, with very little upfront query analysis. Not only is this a very efficient scheme, it lets us elegantly handle


environments with lots of churn, i.e., where queries come and go very often. Finally, we evaluated an implementation of the approach developed in this chapter with real-world and synthetic data sets. In the former case, the experiments based on stock market trading data showed that this approach can perform excellently in the real world. In the latter case, the experiments based on synthetic data showed that the partial sharing schemes that we developed can provide excellent performance even in cases where the fully shared schemes perform poorly. To summarize, in this chapter we considered the effects of varying predicates as well as windows for aggregate queries. We use building blocks from previous chapters in designing the techniques in this chapter. With this chapter we have completed the major contributions of the thesis work in this dissertation. In the next chapter we present concluding remarks and an outlook for future work.


Chapter 7

Conclusions

Data stream management has emerged as an important area of research in response to the large volumes of live data feeds that are being produced by a new wave of pervasive data sources. These sources can be of many different types, and provide data either from other applications (e.g., stock market tickers, network monitors), or from the physical environment (e.g., wireless sensors, RFID readers). Systems that manage streaming data from such sources often provide declarative query-based interfaces. These interfaces have enabled new classes of applications that are being used to monitor, summarize, raise alerts on, and clean data streams in real time. As these applications become ubiquitous, they generate workloads for streaming systems that involve large numbers of concurrent queries. Separately executing each query of such a workload can lead to excessive resource usage in a data streaming system, and severely affect its utility. This thesis prescribes an alternative approach, shared processing, to enable the efficient execution of large numbers of concurrent queries in a data streaming system. Previous techniques for shared query processing typically handle only a static set of queries, and do so in a batch-oriented fashion. This static approach is not feasible


in a real world scenario where queries join and leave the system at unpredictable times. In this thesis, I reject this static approach and instead argue in favor of on-the-fly sharing, a more dynamic approach where a new query can easily “hop on” to a shared query plan that processes the existing queries. The main problem considered in this thesis is that of sharing join and aggregate queries, with varying windows and predicates. The approach taken is to find solutions to smaller sub-problems that can be carved out of the higher-level main problem. The sub-problems are that of sharing the following different kinds of queries: (1) joins with varying predicates, (2) aggregates with varying windows, and (3) joins and aggregates, with varying predicates and windows. The solutions to these sub-problems can then be combined, along with other related techniques developed by others, to solve the main problem. A unifying theme across these techniques is that they all exploit the idea of on-the-fly sharing, where queries can be added to and removed from the system in an ad hoc manner. In particular, in this thesis I developed various techniques to solve each of the three sub-problems described above. These techniques can be used to share both computation resources in a single-site system and communication resources in a distributed system. I now summarize these techniques below:

1. First, I explored a fundamental tension in shared query processing, that of sharing repeated work versus avoiding wasted work, in the context of join queries with varying predicates. I defined what it means for a shared query processing scheme to reconcile this tension, and characterized such schemes as satisfying the conditions of precision sharing. I then developed TULIP and CAR, precision sharing schemes for static and adaptive dataflows respectively. A system that uses these precision sharing techniques was shown in experiments to support up to 16 times more queries than one that uses existing shared processing techniques, with no


increase in latency.

2. Second, I studied the problem of sharing aggregate queries with varying windows. For this problem, I developed Shared Time Slices (STS) to share computation resources in a single-site system, as well as Partial Push-Down to share communication resources in a distributed system. A system that uses STS was shown in experiments to support up to 8 times more queries than one that uses existing techniques, again with no increase in latency. Furthermore, a distributed system that uses partial push-down was shown in experiments to require only half the bandwidth used by a non-shared approach.

3. Third, I studied the problem of sharing aggregate queries with both varying windows and varying predicates. For this problem, I first developed Shared Data Fragments (SDF) to solve the problem of sharing aggregates where only the predicates vary. I then developed Shared Data Shards (SDS), a technique that solves the problem of sharing aggregates with varying windows and predicates by putting the STS and SDF techniques together. A system that uses the SDS technique was shown in experiments to support between 2 and 8 times more queries than one that uses existing techniques, even in cases where the queries have neither any windows nor any predicates in common.

Looking beyond the specific contributions of this thesis, there is much interesting work that can be done. I focus on two areas of data management, streaming and non-streaming, where shared processing can have the greatest potential impact. As described earlier in the thesis, data streaming systems have certain intrinsic qualities that provide compelling scenarios for shared query processing. Apart from these intrinsic qualities, many streaming systems are being built from scratch and thus offer fresh opportunities to try out new ideas in shared processing. This thesis has argued heavily in favor of on-the-fly sharing as opposed to static sharing in such


streaming systems. There is, however, another aspect of existing shared processing schemes, one that is part of most of the techniques developed in this thesis. This aspect is the idea that either all queries are shared together, or there is no sharing whatsoever. There are possibly many situations where a middle ground is feasible, where the set of queries can be partitioned into smaller groups of queries that can be shared together. This thesis presented two such approaches to partial sharing, which were modifications of the SDF approach. I believe that this philosophy of partial sharing can be a very important and fruitful area of further research. In this thesis, I have argued that the traditional static approach to sharing has been a primary impediment to a more widespread use of shared query processing in real world systems. In fact, apart from the techniques specific to window clauses, all the other on-the-fly techniques developed in this thesis are applicable in non-streaming systems. The real common thread in these techniques is that of tuple lineage. Since tuple lineage was first proposed in the context of adaptive query processing, it has generally been considered too disruptive a change in the existing query processors of traditional data management systems. As shown in the TULIP technique, however, some parts of tuple lineage can be used for significant benefit, and in a very non-disruptive fashion, in static systems. I believe that techniques like TULIP can be a key enabler for the adoption of on-the-fly sharing in traditional systems. In summary, there are two major contributions in this thesis. The first is to demonstrate the significant performance and scalability benefits of shared query processing in data streaming systems. The second is to show that it is possible to get these benefits with an on-the-fly approach that can make shared query processing feasible in real world systems. These contributions are crucial in developing data streaming systems that can handle extremely dynamic and heavy workloads. It is very important to support such workloads in order to make data stream management viable in real world scenarios.

Bibliography

[Agrawal et al., 2000] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In Proceedings of VLDB Conference, 2000.

[Altinel and Franklin, 2000] Mehmet Altinel and Michael J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of VLDB Conference, pages 53–64, 2000.

[Altinel et al., 2003] Mehmet Altinel, Christof Bornhövd, Sailesh Krishnamurthy, C. Mohan, Hamid Pirahesh, and Berthold Reinwald. Cache tables: Paving the way for an adaptive database cache. In Proceedings of VLDB Conference, pages 718–729, 2003.

[Arasu and Widom, 2004] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In Proceedings of VLDB Conference, pages 336–347, 2004.

[Arasu et al., 2006] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal, 15(2):121–142, June 2006.

[Arnold et al., 2000] Ken Arnold, James Gosling, and David Holmes. The Java Programming Language. Addison-Wesley, 3rd edition, 2000.

[Avnur and Hellerstein, 2000] Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously adaptive query processing. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 261–272, 2000.

[Babcock et al., 2003] Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani. Chain: Operator scheduling for memory minimization in data stream systems. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2003.


[Bancilhon et al., 1987] Francois Bancilhon, Ted Briggs, Setrag Khoshafian, and Patrick Valduriez. FAD, a powerful and simple database language. In Proceedings of VLDB Conference, pages 97–105, 1987.

[Benz et al., 2000] Harley Benz, John Filson, Walter Arabasz, Lind Gee, and Lisa Wald. ANSS: Advanced National Seismic System. U.S.G.S Fact Sheet 075-00, United States Geological Survey, 2000.

[Bizarro et al., 2005] Pedro Bizarro, Shivnath Babu, David DeWitt, and Jennifer Widom. Content-based routing: Different plans for different data. In Proceedings of VLDB Conference, pages 757–768, 2005.

[Bloom, 1970] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[Carney et al., 2002] Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Monitoring streams – a new class of data management applications. In Proceedings of VLDB Conference, 2002.

[Chandrasekaran and Franklin, 2002] Sirish Chandrasekaran and Michael J. Franklin. Streaming queries over streaming data. In Proceedings of VLDB Conference, 2002.

[Chandrasekaran and Franklin, 2004] Sirish Chandrasekaran and Michael J. Franklin. Remembrance of streams past: Overload-sensitive management of archived streams. In Proceedings of VLDB Conference, pages 348–359, 2004.

[Chandrasekaran et al., 2003] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of Conference on Innovative Data Systems Research, 2003.

[Chen et al., 2000] Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2000.

[Chen et al., 2002] Jianjun Chen, David J. DeWitt, and Jeffrey F. Naughton. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In Proceedings of IEEE Conference on Data Engineering, 2002.


[Cherniack et al., 2003] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Ugur Cetintemel, Ying Xing, and Stan Zdonik. Scalable distributed stream processing. In Proceedings of Conference on Innovative Data Systems Research, 2003.

[Clark and DeRose, 1999] James Clark and Steve DeRose. XML path language (XPath), 1999. http://www.w3.org/TR/xpath.

[Cooper et al., 2004] Owen Cooper, Anil Edakkunni, Michael J. Franklin, Wei Hong, Shawn R. Jeffery, Sailesh Krishnamurthy, Frederick Reiss, Shariq Rizvi, and Eugene Wu. HiFi: A unified architecture for high fan-in systems. In Proceedings of VLDB Conference, pages 1357–1360, 2004.

[Cranor et al., 2003] Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk. Gigascope: A stream database for network applications. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 647–651, 2003.

[Dalvi et al., 2001] Nilesh N. Dalvi, Sumit K. Sanghai, Prasan Roy, and S. Sudarshan. Pipelining in multi-query optimization. In ACM Symposium on Principles of Database Systems, 2001.

[Dar et al., 1996] Shaul Dar, Michael J. Franklin, Björn Thór Jónsson, Divesh Srivastava, and Michael Tan. Semantic data caching and replacement. In Proceedings of VLDB Conference, pages 330–341, 1996.

[Denny and Franklin, 2005] Matthew Denny and Michael J. Franklin. Predicate result range caching for continuous queries. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2005.

[Deshpande et al., 1998] Prasad M. Deshpande, Karthikeyan Ramasamy, Amit Shukla, and Jeffrey F. Naughton. Caching multidimensional queries using chunks. In Proceedings of ACM SIGMOD International Conference on Management of Data, 1998.

[Deshpande et al., 2003] Amol Deshpande, Suman Kumar Nath, Phillip B. Gibbons, and Srinivasan Seshan. Cache-and-query for wide area sensor databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 503–514, 2003.

[Diao et al., 2003] Yanlei Diao, Mehmet Altinel, Michael J. Franklin, Hao Zhang, and Peter Fischer. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems, 28(4):467–516, December 2003.

[Fernandez, 1994] Phillip M. Fernandez. Red Brick Warehouse: A read-mostly RDBMS for open SMP platforms. In Proceedings of ACM SIGMOD International Conference on Management of Data, page 492, 1994.

[Forgy, 1982] Charles L. Forgy. Rete: A fast algorithm for the many pattern/many object match problem. Artificial Intelligence, 19(1):17–37, September 1982.

[Franklin et al., 2005] Michael J. Franklin, Shawn R. Jeffery, Sailesh Krishnamurthy, Shariq Rizvi, Fredrick Reiss, Eugene Wu, Owen Cooper, Anil Edakkunni, and Wei Hong. Design considerations for high fan-in systems: The HiFi approach. In Proceedings of Conference on Innovative Data Systems Research, 2005.

[Garofalakis et al., 2002] Minos N. Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Querying and mining data streams: you only get one look. In Proceedings of ACM SIGMOD International Conference on Management of Data, page 635, 2002.

[Golab and Ozsu, 2003] Lukasz Golab and M. Tamer Ozsu. Processing sliding window multi-joins in continuous queries over data streams. In Proceedings of VLDB Conference, pages 500–511, 2003.

[Graefe and McKenna, 1991] Goetz Graefe and William J. McKenna. Extensibility and search efficiency in the Volcano optimizer generator. Technical Report CU-CS-91-563, University of Colorado at Boulder, 1991.

[Graefe, 1993] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73–170, June 1993.

[Gray et al., 1996] Jim Gray, Adam Bosworth, A. Layman, and Hamid Pirahesh. Data Cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total. In Proceedings of IEEE Conference on Data Engineering, February 1996.

[Green et al., 2004] Todd J. Green, Ashish Gupta, Gerome Miklau, Makoto Onizuka, and Dan Suciu. Processing XML streams with deterministic automata and stream indexes. ACM Transactions on Database Systems, 29(4), December 2004.

[Hammad et al., 2003] Moustafa A. Hammad, Michael J. Franklin, Walid G. Aref, and Ahmed K. Elmagarmid. Scheduling for shared window joins over data streams. In Proceedings of VLDB Conference, 2003.

[Harinarayan et al., 1996] Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. Implementing data cubes efficiently. In Proceedings of ACM SIGMOD International Conference on Management of Data, 1996.


[Hellerstein et al., 2000] Joseph M. Hellerstein, Michael J. Franklin, Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, and Mehul A. Shah. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 2000.

[Huebsch et al., 2006] Ryan Huebsch, Minos Garofalakis, Joseph M. Hellerstein, and Ion Stoica. Sharing aggregate computation for distributed queries. Submitted for publication, 2006.

[Jarke, 1985] Matthias Jarke. Common subexpression isolation in multiple query optimization. In Query Processing in Database Systems. Springer Verlag, 1985.

[Johnson, 1967] Steven Johnson. Hierarchical clustering schemes. Psychometrika, 32:241–254, 1967.

[Josifovski et al., 2002] Vanja Josifovski, Peter M. Schwarz, Laura M. Haas, and Eileen Tien Lin. Garlic: A new flavor of federated query processing for DB2. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 524–532, 2002.

[Krishnamurthy and Franklin, 2005] Sailesh Krishnamurthy and Michael J. Franklin. Shared hierarchical aggregation over receptor streams. Technical Report UCB/CSD-05-1381, UC Berkeley, 2005.

[Krishnamurthy et al., 2003] Sailesh Krishnamurthy, Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Samuel Madden, Fred Reiss, and Mehul A. Shah. TelegraphCQ: An architectural status report. IEEE Data Engineering Bulletin, 26(1), 2003.

[Krishnamurthy et al., 2004] Sailesh Krishnamurthy, Michael J. Franklin, Joseph M. Hellerstein, and Garrett Jacobson. The case for precision sharing. In Proceedings of VLDB Conference, 2004.

[Krishnamurthy et al., 2006] Sailesh Krishnamurthy, Chung Wu, and Michael J. Franklin. On-the-fly sharing for streamed aggregation. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2006.

[Law et al., 2004] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Query languages and data models for database sequences and data streams. In Proceedings of VLDB Conference, 2004.

[Lewis, ] Robert K. Lewis. MYSQLIDS – A quick look approach to intrusion detection systems. http://www.codeproject.com/internet/mysqlids.asp.

[Li et al., 2005a] Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. No pane, no gain: Efficient evaluation of sliding-window aggregates over data streams. SIGMOD Record, March 2005.

[Li et al., 2005b] Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. Semantics and evaluation techniques for window aggregates in data streams. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2005.

[Madden and Franklin, 2002] Samuel R. Madden and Michael J. Franklin. Fjording the Stream: An Architecture for Queries over Streaming Sensor Data. In Proceedings of IEEE Conference on Data Engineering, 2002.

[Madden et al., 2002a] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TAG: a tiny aggregation service for ad-hoc sensor networks. In Proceedings of Usenix Symposium on Operating Systems Design and Implementation, 2002.

[Madden et al., 2002b] Samuel R. Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive continuous queries over streams. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2002.

[Manjhi et al., 2005] Amit Manjhi, Suman Nath, and Phillip B. Gibbons. Tributaries and deltas: Efficient and robust aggregation in sensor network streams. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2005.

[Melton and Simon, 2002] Jim Melton and Alan R. Simon. SQL:1999 – Understanding Relational Language Components. Morgan Kaufmann, 2002.

[Mistry et al., 2001] Hoshi Mistry, Prasan Roy, S. Sudarshan, and Krithi Ramamritham. Materialized view selection and maintenance using multi-query optimization. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2001.

[Mohan et al., 1990] C. Mohan, Don Haderle, Yun Wang, and Josephine Cheng. Single table access using multiple indexes: optimization, execution, and concurrency control techniques. In Proceedings of Conference on Extending Database Technology, pages 29–43, 1990.

[Motwani et al., 2003] Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, and Rohit Varma. Query processing, resource management, and approximation in a data stream management system. In Proceedings of Conference on Innovative Data Systems Research, 2003.

[NASDAQ, ] NASDAQ. NASTRAQ: North American Securities Tracking and Quantifying System. http://www.nastraq.com/description.htm.

[NYSE, ] NYSE. NYSE TAQ: Daily Trades and Quotes Database. http://www.nysedata.com/info/productdetail.asp?dpbid=13.

[Peterson et al., 2002] Larry Peterson, Tom Anderson, David Culler, and Timothy Roscoe. A blueprint for introducing disruptive technology into the internet. In HotNets, 2002.

[Plagemann et al., 2004] Thomas Plagemann, Vera Goebel, Andrea Bergamini, Giacomo Tolu, Guillaume Urvoy-Keller, and Ernst W. Biersack. Using data stream management systems for traffic analysis – a case study. In 5th International Workshop on Passive and Active Network Measurement, 2004.

[Raman et al., 2003] Vijayshankar Raman, Amol Deshpande, and Joseph M. Hellerstein. Using state modules for adaptive query processing. In Proceedings of IEEE Conference on Data Engineering, 2003.

[Reiss and Hellerstein, 2006] Frederick Reiss and Joseph M. Hellerstein. Declarative network monitoring with an underprovisioned query processor. In Proceedings of IEEE Conference on Data Engineering, 2006.

[Rizvi et al., 2005] Shariq Rizvi, Shawn R. Jeffery, Sailesh Krishnamurthy, Michael J. Franklin, Nathan Burkhart, Anil Edakkunni, and Linus Liang. Events on the edge. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 885–887, 2005.

[Roesch, 1999] Martin Roesch. Snort – lightweight intrusion detection for networks. In USENIX LISA Conference, 1999.

[Roy et al., 2000] Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh Bhobe. Efficient and extensible algorithms for multi query optimization. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2000.

[Sellis, 1988] Timos K. Sellis. Multiple-query optimization. ACM Transactions on Database Systems, March 1988.

[Shah et al., 2001] Mehul A. Shah, Samuel R. Madden, Michael J. Franklin, and Joseph M. Hellerstein. Java support for data-intensive systems: Experiences building the Telegraph dataflow system. SIGMOD Record, 30(4):103–114, December 2001.

[Srivastava et al., 2005] Divesh Srivastava, Nick Koudas, Rui Zhang, and Beng Chin Ooi. Multiple aggregations over data streams. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2005.

[Stillger et al., 2001] Michael Stillger, Guy M. Lohman, Volker Markl, and Mokhtar Kandil. LEO – DB2’s learning optimizer. In Proceedings of VLDB Conference, pages 19–28, 2001.

[Sullivan and Heybey, 1998] Mark Sullivan and Andrew Heybey. Tribeca: A system for managing large databases of network traffic. In USENIX Annual Technical Conference, 1998.

[Sullivan, 1996] Mark Sullivan. Tribeca: A stream database manager for network traffic analysis. In Proceedings of VLDB Conference, 1996.

[Tan and Lu, 1995] K. Tan and H. Lu. Workload scheduling for multiple query processing. Information Processing Letters, 55(5):251–257, 1995.

[Tatbul, 2003] Nesime Tatbul. Load shedding in a data stream manager. In Proceedings of VLDB Conference, September 2003.

[Terry et al., 1992] Douglas B. Terry, David Goldberg, David Nichols, and Brian M. Oki. Continuous queries over append-only databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 321–330, 1992.

[van Renesse et al., 2003] Robbert van Renesse, Kenneth P. Birman, and Werner Vogels. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. ACM Transactions on Computer Systems, 21(2):164–206, 2003.

[Viglas et al., 2003] Stratis Viglas, Jeffrey F. Naughton, and Josef Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In Proceedings of VLDB Conference, 2003.

[Waldspurger and Weihl, 1994] Carl A. Waldspurger and William E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proceedings of Usenix Symposium on Operating Systems Design and Implementation, 1994.

[Wilschut and Apers, 1993] Annita N. Wilschut and Peter M. G. Apers. Dataflow query execution in a parallel main-memory environment. Distributed and Parallel Databases, 1(1):103–128, 1993.


[Xu et al., 2005] Wei Xu, Joseph L. Hellerstein, Bill Kramer, and David Patterson. Control considerations for scalable event processing. In Distributed Systems: Operations and Management Workshop, pages 233–244, 2005.

[Zaniolo et al., a] Carlo Zaniolo, Yijian Bai, Vladimir Braverman, Ivy Yan-Nei Law, Richard Luo, Hyun Jin Moon, Frank Myers, Hetal Thakkar, and Xin Zhou. Expressive stream language (ESL) manual. http://magna.cs.ucla.edu/stream-mill/doc/index.html.

[Zaniolo et al., b] Carlo Zaniolo, Yijian Bai, Vladimir Braverman, Ivy Yan-Nei Law, Richard Luo, Hyun Jin Moon, Frank Myers, Hetal Thakkar, and Xin Zhou. The StreamMill project. http://magna.cs.ucla.edu/stream-mill/.

[Zilio et al., 2004] Daniel C. Zilio, Calisto Zuzarte, Sam Lightstone, Wenbin Ma, Guy M. Lohman, Roberta Cochrane, Hamid Pirahesh, Latha S. Colby, Jarek Gryz, Eric Alton, Dongming Liang, and Gary Valentin. Recommending materialized views and indexes with IBM DB2 Design Advisor. In ICAC, 2004.


Appendix A

The TelegraphCQ System

In this chapter, I describe the TelegraphCQ data stream management system that was built by a team of researchers (including myself) at UC Berkeley. TelegraphCQ served as the implementation vehicle for many of the techniques that were developed in this dissertation. My focus in this chapter is to describe the major architectural aspects of TelegraphCQ. The TelegraphCQ system is aimed at processing large numbers of concurrent continuous queries over high-volume and highly-variable data streams. This motivation led to the following two major design goals for the TelegraphCQ system: (1) shared query processing in order to support large numbers of concurrent queries, and (2) adaptive query processing in order to handle highly-variable data streams. The major contribution of the TelegraphCQ project was the demonstration that a shared adaptive query processor is feasible. While some of the novel query execution techniques used in TelegraphCQ are described earlier in Section 3.2 of this dissertation, a more complete description of the research that motivated the TelegraphCQ project is presented in [Chandrasekaran et al., 2003]. The TelegraphCQ system is also a building block for the HiFi system (also built at Berkeley) that is described in Appendix B of this dissertation. I begin this chapter with a brief overview of the major challenges that were faced in the design and development of TelegraphCQ. Next, I describe the architecture of the TelegraphCQ system. Finally, I summarize the current status and the major features of TelegraphCQ.

A.1

Challenges

This section describes the major challenges in the design of the TelegraphCQ data stream management system.


Before developing the TelegraphCQ system, the TelegraphCQ team built a system for adaptive dataflow processing in the Java programming language [Arnold et al., 2000] as part of earlier research efforts [Avnur and Hellerstein, 2000; Madden et al., 2002b; Raman et al., 2003; Chandrasekaran and Franklin, 2002]. Although this research heavily influenced the design of TelegraphCQ, the experience of using Java in a systems development project was not positive, as described in [Shah et al., 2001]. After considering a few alternatives, the team decided to use the PostgreSQL open source database system as a starting point for the new implementation. Although a continuous query system like TelegraphCQ is quite different from a traditional query processor, the team was able to reuse a significant amount of PostgreSQL code. In particular, the extensible nature of PostgreSQL (e.g., the ability to load user-defined code, and the availability of rich data types such as intervals) was very useful. As described earlier in this chapter, a major design goal of TelegraphCQ was to implement a shared adaptive continuous query processor. Accomplishing this goal in the codebase of a conventional database posed a number of challenges:

• Adaptivity: While the TelegraphCQ query executor uses a lot of existing PostgreSQL functionality, the actual interaction between streaming operators is significantly different and does not follow the iterator model (described earlier in Section 3.1.1) used in PostgreSQL. Instead, the TelegraphCQ system relies on prior work on adaptive query processing, with operators such as an Eddy [Avnur and Hellerstein, 2000] for adaptive tuple routing and operator scheduling, as well as fjords [Madden and Franklin, 2002] for inter-operator communication (please see Section 3.2 for more details). Note that unlike in the original CACQ paper, TelegraphCQ uses the draining constraint described in Section 3.2.2. A minimal sketch of eddy-style routing follows this list.

• Shared continuous queries: TelegraphCQ aims at supporting large numbers of queries by sharing the system’s resources among the queries that are processed concurrently. The general approach is based on the design of CACQ [Madden et al., 2002b] and is described earlier in Section 3.2.2 of this dissertation. This requirement resulted in a large-scale change to the conventional process-per-connection model of PostgreSQL.

• Data ingress operations: Traditional federated database systems fetch data from remote data sources using an operator in a query plan. Typically, such an operator uses user-defined wrapper functions that fetch data across the network. These wrappers are often built “in the factory”, i.e., they are bundled as part of the system itself. For a shared query processing system, however, it is vital that all operators are non-blocking, and it is not possible to guarantee this with traditional wrappers.
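For illustration only, here is a minimal, selection-only sketch of the eddy-style routing mentioned in the adaptivity challenge above. The real eddy of Section 3.2 also handles joins, tracks lineage with ready/done bits on each tuple, and chooses the next operator adaptively; the routing policy below is deliberately trivial.

    def eddy(tuples, operators):
        """Route each tuple through every operator exactly once, in a
        per-tuple order, emitting it only if no operator drops it."""
        out = []
        for t in tuples:
            done = [False] * len(operators)   # which operators t has visited
            alive = True
            while alive and not all(done):
                i = done.index(False)         # a real eddy adapts this choice
                alive = operators[i](t)       # selection: keep (True) or drop
                done[i] = True
            if alive:
                out.append(t)
        return out

    # Two selection operators over a toy integer stream.
    print(eddy(range(10), [lambda x: x % 2 == 0, lambda x: x > 3]))  # [4, 6, 8]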


A.2

System Overview

This section presents an overview of the TelegraphCQ system. We begin this discussion by outlining the architecture of PostgreSQL at a high level, and describe the components of PostgreSQL that we leverage in engineering TelegraphCQ. We then present the architecture of TelegraphCQ itself.

Figure A.1: PostgreSQL Architecture

A.2.1

The PostgreSQL Architecture

Figure A.1 shows the basic process structure of PostgreSQL. PostgreSQL uses a process-per-connection model. The figure shows a client application (top right), a Postmaster process (small rectangle in the top left), a PostgreSQL server process (the big rectangle in the center), a shared memory segment which is used to organize the data structures that are shared by multiple processes (e.g., buffer pool, latches), and a disk that serves as secondary storage. The Postmaster process forks new server processes in response to new client connections. Among the different components in each server process, the Listener is responsible for accepting requests on a connection and returning data to the client.


When a new query arrives, it is parsed, optimized, and compiled into an access plan that is then processed by the query Executor component.

A.2.2

Modifying PostgreSQL to build TelegraphCQ

The components that have only been changed minimally in TelegraphCQ are shaded in dark gray in Figure A.1. These include the Postmaster, Listener, System Catalog, Query Parser and Optimizer. Components shown in light gray (the Executor, Buffer Manager and Access Methods) are leveraged by TelegraphCQ, but with significant changes. In addition to these components, the TelegraphCQ system also adopts the frontend components (not shown in the figure) of PostgreSQL in order to get access to important client-side call-level interface implementations such as ODBC and JDBC.

A.2.3

The TelegraphCQ Architecture

Figure A.2 depicts a bird’s eye view of TelegraphCQ. The figure shows (as ovals) the three processes that comprise the TelegraphCQ server. These processes communicate using a shared memory infrastructure. The TelegraphCQ Front End contains the Listener, Catalog, Parser, Planner and mini-Executor. The actual query processing takes place in a separate process called the TelegraphCQ Back End. Finally, the TelegraphCQ Wrapper ClearingHouse is used to host the data ingress operators which make fresh tuples available for processing, archiving them if required. As in PostgreSQL, there is a Postmaster process (not shown in the figure) that listens on a well-known port and forks a Front End (FE) process for each fresh connection it receives. The listener accepts commands from a client and, based on the command, chooses where to execute it. DDL statements and queries over tables are executed in the FE process itself. Continuous queries, i.e., those that involve streams, as well as combinations of streams and tables, are pre-compiled and sent via the Query Plan queue (in shared memory) to the Back End (BE) process. The pre-compilation process is a two-step approach. First, the FE calls the regular PostgreSQL optimizer to produce a query plan. Here, the optimizer is instructed to use non-blocking join operators. Second, the FE walks the plan tree produced by the optimizer and produces an equivalent adaptive plan that uses a single-query eddy (see Section 3.2.1 for details). The BE executor continually dequeues fresh queries and dynamically folds them into the current running query. Query results are, in turn, placed in the client-specific Query Result queues. Once the FE has handed a query off to the BE, it produces an FE-local query plan that the mini-executor runs to return results to the connected client. The FE-local query plan is a standard PostgreSQL plan that is associated with a cursor with which a client can execute repeated fetch requests.
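The handoff just described is, at its core, a producer-consumer protocol over queues. The threaded sketch below mimics that flow; TelegraphCQ itself uses shared-memory queues between separate processes, and folding a plan into the shared running query is far more involved than the append here, so all names and the plan strings are purely illustrative.

    import queue
    import threading

    plan_queue = queue.Queue()   # stands in for the shared-memory Query Plan queue
    result_queues = {}           # one Query Result queue per client connection

    def back_end():
        """BE loop: fold each newly arriving plan into the shared running
        plan and push results to the submitting client's result queue."""
        shared_plan = []
        while True:
            client, plan = plan_queue.get()
            if plan is None:                      # shutdown sentinel
                return
            shared_plan.append((client, plan))    # 'folding in', much simplified
            result_queues[client].put(
                "running %s alongside %d other queries" % (plan, len(shared_plan) - 1))

    def front_end(client, plan):
        """FE side: hand the plan to the BE, then act as a mini-executor
        that drains the client's result queue on a fetch."""
        result_queues[client] = queue.Queue()
        plan_queue.put((client, plan))
        print(client, "fetched:", result_queues[client].get())

    be = threading.Thread(target=back_end)
    be.start()
    front_end("client-1", "SELECT * FROM stream1 ...")
    front_end("client-2", "SELECT * FROM stream2 ...")
    plan_queue.put((None, None))
    be.join()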


Figure A.2: TelegraphCQ Architecture

The FE-local plan simply consists of a special Scan operator that dequeues tuples from the Result queue on a fetch request from the client. Since a continuous query does not end until destroyed by the application, clients should submit such queries as part of named cursors. We have added a continuous query mode to psql, the standard PostgreSQL interactive client. In this mode, SELECT statements are automatically converted into named cursors which can then be iterated over to continually fetch results. Finally, there is the TelegraphCQ Wrapper ClearingHouse (WCH) process that manages data ingress operations. As with traditional federated systems, we use user-defined wrapper functions to massage data into a TelegraphCQ format (essentially that of PostgreSQL). In many of these systems, the wrapper also manages the remote network connections. In contrast, the TelegraphCQ WCH manages non-blocking sockets for all active connections, and simply schedules the user-defined wrapper code when there is data on its associated socket. Thus, the user-defined wrapper code limits itself to reading data off a network socket and managing format conversions.
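The clearinghouse pattern (one event loop owning every non-blocking source socket, invoking wrapper code only when bytes are ready) can be sketched as follows. This is an illustration of the design, not TelegraphCQ's actual WCH code; the CSV wrapper, port number, and tuple handling are made up.

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def csv_wrapper(data):
        """A 'user-defined wrapper' in the sense above: it only converts
        raw bytes into tuples and never touches the network itself."""
        return [tuple(line.split(b",")) for line in data.splitlines() if line]

    def clearinghouse(port=9000):
        srv = socket.socket()
        srv.bind(("127.0.0.1", port))
        srv.listen()
        srv.setblocking(False)
        sel.register(srv, selectors.EVENT_READ, data=None)
        while True:
            for key, _ in sel.select(timeout=1.0):
                if key.data is None:                 # a new source is connecting
                    conn, _ = key.fileobj.accept()
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ, data=csv_wrapper)
                else:                                # data ready: schedule the wrapper
                    chunk = key.fileobj.recv(4096)
                    if not chunk:                    # source went away
                        sel.unregister(key.fileobj)
                        key.fileobj.close()
                    else:
                        for tup in key.data(chunk):  # fresh tuples for the executor
                            print("ingested:", tup)

    # clearinghouse()  # runs forever, serving sources on port 9000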


The chief challenge in using PostgreSQL is supporting the TelegraphCQ features it was not designed for: streaming data, continuous queries, shared processing and adaptivity. The biggest impact of these changes is to PostgreSQL’s process-per-connection model. In TelegraphCQ (see Figure A.2) we have a dedicated process, the TelegraphCQ Back End (BE), for executing shared long-running continuous queries. This is in addition to the per-connection TelegraphCQ Front End (FE) process that fields queries from, and returns results to, the client. The FE process is also used to process non-streaming (i.e., traditional) queries, and Data Definition Language (DDL) statements that are used to create and remove system objects like streams and wrappers. TelegraphCQ also has a dedicated Wrapper ClearingHouse process, which ensures that data ingress operations do not impede the progress of query execution.

A.3

Status and Major Features

At the time of writing this thesis, the TelegraphCQ team has completed release 2.1 of the system. At this point, the team does not anticipate any future releases of TelegraphCQ from Berkeley. In its current state, the system is very usable. In fact, various projects outside Berkeley have downloaded the TelegraphCQ source code for different purposes. Examples of these include modifying the system for data management research [Bizarro et al., 2005], as well as using the system for research in other fields such as network monitoring [Plagemann et al., 2004; Xu et al., 2005]. The following are some of the major features that are supported by the current incarnation of the TelegraphCQ system:

1. Support for archived and unarchived streams.

2. A rich query language (based on CQL) that includes joins, sliding window aggregates, sub-queries and recursion.

3. Support for hybrid queries that involve streams and local tables.

4. On-the-fly addition and removal of queries. The system can dynamically fold a new query plan into a running shared plan, as well as disentangle the artifacts of a query that is removed from the system.

5. A non-blocking wrapper clearinghouse that can run built-in and user-defined wrappers.

A.4

Summary

In this chapter I provided a brief overview of the TelegraphCQ data stream management system that was built at UC Berkeley. The TelegraphCQ system served as an


implementation vehicle for most of the techniques in the thesis work that I presented in this dissertation. From a research perspective, a major contribution of the TelegraphCQ system was the demonstration that a shared adaptive query processor is feasible. From an engineering perspective, it was notable that although these technologies (i.e., sharing and adaptivity) are considered radical, it was possible to implement them in a standard relational database system. The TelegraphCQ system is an important building block of the HiFi distributed streaming system that was also built at UC Berkeley. In the next chapter I present a brief overview of HiFi.


Appendix B

The HiFi System

In this chapter, I describe the HiFi distributed data stream management system developed at UC Berkeley. The HiFi project served as a motivation for the problem of shared processing of windowed aggregate queries in a hierarchy that was explored in Chapter 5 of this dissertation. My focus in this chapter is to present a sample of the research problems that drive the HiFi project. A more complete description of the research issues that HiFi addresses is presented in [Franklin et al., 2005]. I begin this chapter by motivating the reasons for the HiFi research project and describe some of the problems the project targets. Since the HiFi project is still in progress at the time of writing this dissertation, I only present a basic design of the system.

B.1

Motivation

As a motivation, consider a retail enterprise that needs to monitor its supply chain. This supply chain includes its suppliers, distribution centers, as well as retail stores. In a typical deployment (shown in Figure B.1), there will be sensors and RFID readers on individual store shelves or dock doors that continuously report readings about the objects in their vicinity. There are also various other sources such as point-of-sale data, website click-streams, etc. The data from these sources needs to be available at various levels of the system, for different uses. For example, these levels can be within the store (e.g., to restock shelves), at the warehouse (e.g., to replenish various stores), and at the central site (e.g., to determine the progress of promotions). The philosophy of the HiFi project is that, instead of building piecemeal “stove-pipe” solutions for each of these separate problems, a better approach is to build a system that manages all these streams of an enterprise.

Figure B.1: A High Fan-In Example: Supply Chain Monitoring. (The figure depicts streams flowing from sources such as RFID readers, scanners, sensors, point-of-sale terminals, click-streams, vehicles, and supplier data, through stores, warehouses, branches, and distribution centers, with filtering, processing, reformatting, and analysis along the way, up to a central data warehouse.)

B.2

HiFi Design

HiFi is a distributed system with a “high fan-in” architecture. Its edges consist of sources such as sensor networks, and its interior nodes are traditional computers that can host a DSMS like TelegraphCQ. The devices on the edges produce data streams that are aggregated from the local receptors they control. Data streams from multiple edge nodes are further aggregated at a higher level interior node, all the way up to the root of the hierarchy. Consumers of data from HiFi may also be widely dispersed across the enterprise, and may even be entities external to the enterprise (e.g., the suppliers of a retail chain). Thus, the data that reaches the center of the hierarchy in a “fan-in” manner will be disseminated to clients who are at various other locations. This leads to the “bow-tie” architecture that is shown in Figure B.2. The basic design for HiFi follows from the bow-tie architecture described above. Each node in the fan-in hierarchy needs the basic ability to manipulate streams from its child nodes, and to send streams to a parent node. This basic need can be satisfied by having each node run a Data Stream Management System (DSMS).
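The fan-in side of this architecture amounts to hierarchical aggregation: leaves reduce raw receptor readings to partial aggregates, and interior nodes merge the partial aggregates of their children. A minimal sketch follows (illustrative only; HiFi nodes of course do this continuously, per window, inside a DSMS):

    def leaf(readings):
        # A leaf turns raw receptor readings into a (count, sum) pair.
        return (len(readings), sum(readings))

    def merge(children):
        # An interior node merges its children's partial aggregates;
        # algebraic aggregates like AVG are finalized only at the root.
        return (sum(c for c, _ in children), sum(s for _, s in children))

    shelf1 = leaf([3.0, 4.0, 5.0])     # readings from one store shelf
    shelf2 = leaf([10.0, 2.0])
    store = merge([shelf1, shelf2])    # store-level partial aggregate
    cnt, tot = merge([store])          # central site
    print("enterprise-wide average reading:", tot / cnt)   # 4.8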


Figure B.2: The High Fan-In Bow-tie

Now, the key problem to be solved is in co-ordinating each DSMS. One possible approach that we considered is implementing this using standard federated database techniques from systems like Garlic [Josifovski et al., 2002]. Such federated systems have the desirable property that they are “loosely coupled”, in that distributed nodes communicate with each other using standard protocols. That is, a mediator connects to a remote node like any other application. It uses a “wrapper” interface to submit an SQL query to the remote system, and fetch its results. This is in contrast to clustered partitioned systems (e.g., Volcano [Graefe and McKenna, 1991]) that are tightly coupled, where query plans and not queries are exchanged between nodes. In this federated model, a query is optimized in a recursive fashion, where each node is only aware of its children. Here, a query submitted to the root node of the hierarchy will be automatically rewritten to use data from a stream that is populated by running one or more queries in remote nodes. Each node in the system performs the same actions (as the root node) when it receives a query. This allows queries to be recursively propagated through the HiFi hierarchy. There are, however, certain requirements of a HiFi system that suggest the use of an even more loosely coupled model than that of federated systems. These reasons are listed below:

1. Managing failures with long-lived queries. Since these systems run continuous queries that are long-lived, failures are more likely to occur during the lifetime of a query. In the event of failure it becomes important for data to be “re-routed” through other nodes of HiFi. We believe that this is easier to accomplish if each node is managed and operated independently of any other nodes in the system. In such a model, a DSMS is “insulated” from failures elsewhere in the system.


2. Sharing can require global query optimization. Efficient shared processing can require the system to be able to perform global query optimization, and not the recursive strategy described above. To understand why this is the case, let us consider a situation where HiFi nodes can have multiple parents (e.g., to make the system more fault tolerant). More specifically, consider a given node (n) that may not have a complete view of every query that is running on any of its children, and has to decide how to optimize a query q that it receives. This is because a child node (c) might have already received a sub-query q′1 from another of its parents that was asked to process a query (q1). If the node n does not know anything about q1, then it will also not know that its child c is executing the sub-query q′1. Blissful in its ignorance, the node n might request that its child c run another sub-query q′. Now, it might well happen that an overall optimization goal (e.g., reduce communication costs) might be better served if the child c ran a separate common sub-query (q′qq1) whose results can be used in processing both q and q1 in the parents of c, as opposed to answering q and q1 by running both the queries q′ and q′1 in the child c. To achieve this, we need a global query optimization strategy.

The basic issue in both the reasons listed above is that the federated model is not loosely coupled enough for the purposes required. This is because in the federated model, a node deals directly with other remote nodes.
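A toy numeric instance of this scenario (the rates below are made up purely for illustration) shows how a globally chosen common sub-query can undercut the recursively chosen plan:

    # Hypothetical output rates (tuples/sec) shipped out of the child c.
    rate_q_prime = 1000     # result stream of q'   (requested by one parent)
    rate_q1_prime = 1000    # result stream of q1'  (requested by the other)
    rate_common = 800       # result stream of the common sub-query q'qq1

    # Recursive optimization: each parent independently pushes its own
    # sub-query down to c, so c ships two separate result streams.
    recursive_cost = rate_q_prime + rate_q1_prime      # 2000 tuples/sec

    # Global optimization: c runs the common sub-query once and ships that
    # single stream to each of its two parents, which then finish q and q1.
    global_cost = 2 * rate_common                      # 1600 tuples/sec

    print("recursive:", recursive_cost, "global:", global_cost)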

Figure B.3: The HiFi Glue. (The figure shows a hierarchy of HiFi nodes, each consisting of the HiFi Glue wrapped around a local data stream query processor; the glue is responsible for DSQP management, query planning, archiving, and internode coordination and communication.)

In contrast to the federated model, we envisage a design for HiFi where each HiFi node is an independent DSMS, and is completely oblivious of the rest of the system. Each node contains an instance of a special component, the HiFi Glue, that completely envelopes the local DSMS as shown in Figure B.3.


The HiFi Glue is the fabric that seamlessly binds the system together. It coordinates its local DSMS, communicates with other HiFi Glue nodes, and manages incoming and outgoing streams. The major requirements that the HiFi Glue places on its associated DSMS are the following:

1. The ability to process continuous queries.

2. The ability to add and remove continuous queries on-the-fly.

3. The ability to add and remove sources on-the-fly.
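A minimal sketch of that contract, written as an abstract interface (the method names are mine for illustration, not TelegraphCQ's or HiFi's actual API):

    from abc import ABC, abstractmethod

    class LocalDSMS(ABC):
        """What the HiFi Glue needs from the DSMS it envelopes."""

        @abstractmethod
        def add_query(self, cq):
            """Fold a continuous query into the running shared plan;
            return an identifier for later removal."""

        @abstractmethod
        def remove_query(self, query_id):
            """Disentangle the query's artifacts from the shared plan."""

        @abstractmethod
        def add_source(self, name, address):
            """Start ingesting a new incoming stream on the fly, e.g.,
            when data is re-routed around a failed node."""

        @abstractmethod
        def remove_source(self, name):
            """Stop ingesting a stream that has moved or gone away."""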

B.3

Summary

In this chapter I provided a brief overview of the HiFi distributed data stream management system developed at UC Berkeley. The HiFi project motivated my consideration of a distributed hierarchical context for shared processing of windowed aggregate queries. The HiFi project is still in progress at the time of writing this dissertation. The team has built initial prototypes that were demonstrated [Cooper et al., 2004; Rizvi et al., 2005] at the VLDB 2004 and SIGMOD 2005 conferences. These prototypes have served to test the basic motivation of HiFi, which is that data streams from widely dispersed sources can be managed using a distributed streaming system that is organized using cascading streams and hierarchical aggregation.

on how GPUs can be programmed for heavy-duty database constructs, such as ... 2. PRELIMINARIES. As the GPU is designed for graphics applications, the basic data .... processor Sorting for Large Database Management. SIGMOD 2006: ...

Efficient Query Processing for Streamed XML Fragments
Institute of Computer System, Northeastern University, Shenyang, China ... and queries on parts of XML data require less memory and processing time.

Efficient Top-k Hyperplane Query Processing for ...
ABSTRACT. A query can be answered by a binary classifier, which sep- arates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a h

Efficient Exact Edit Similarity Query Processing with the ...
Jun 16, 2011 - edit similarity queries rely on a signature scheme to gener- ... Permission to make digital or hard copies of all or part of this work for personal or classroom ... database [2], or near duplicate documents in a document repository ...

A Space-Efficient Indexing Algorithm for Boolean Query Processing
index are 16.4% on DBLP, 26.8% on TREC, and 39.2% on ENRON. We evaluated the query processing time with varying numbers of tokens in a query.

GPUQP: Query Co-Processing Using Graphics ...
computing devices including PCs, laptops, consoles and cell phones. GPUs are .... using the shared memory to sort all bitonic sequences whose sizes are small ...

Sempala: Interactive SPARQL Query Processing on Hadoop - GitHub
Impala [1] is an open-source MPP SQL query engine for Hadoop inspired by ..... installed. The machines were connected via Gigabit network. This is actually ..... or cloud service (Impala is also supported by Amazon Elastic MapReduce).

Data Exchange: Query Answering for Incomplete Data ...
republish, to post on servers or to redistribute to lists, requires prior specific permission .... customers (and [email protected] is an email address of a customer) ...