Chapter 5: Overview of Query Processing

• Query Processing Overview • Query Optimization • Distributed Query Processing Steps Acknowledgements: I am indebted to Arturas Mazeika for providing me his slides of this course.

DDB 2008/09

J. Gamper

Page 1

Query Processing Overview

• Query processing: A 3-step process that transforms a high-level query (of relational calculus/SQL) into an equivalent and more efficient lower-level query (of relational algebra). 1. Parsing and translation – Check syntax and verify relations. – Translate the query into an equivalent relational algebra expression. 2. Optimization – Generate an optimal evaluation plan (with lowest cost) for the query plan. 3. Evaluation – The query-execution engine takes an (optimal) evaluation plan, executes that plan, and returns the answers to the query.

DDB 2008/09

J. Gamper

Page 2

Query Processing . . . • The success of RDBMSs is due, in part, to the availability – of declarative query languages that allow to easily express complex queries without knowing about the details of the physical data organization and – of advanced query processing technology that transforms the high-level user/application queries into efficient lower-level query execution strategies.

• The query transformation should achieve both correctness and efficiency – The main difficulty is to achieve the efficiency – This is also one of the most important tasks of any DBMS

• Distributed query processing: Transform a high-level query (of relational calculus/SQL) on a distributed database (i.e., a set of global relations) into an equivalent and efficient lower-level query (of relational algebra) on relation fragments.

• Distributed query processing is more complex – Fragmentation/replication of relations – Additional communication costs – Parallel execution DDB 2008/09

J. Gamper

Page 3

Query Processing Example

• Example: Transformation of an SQL-query into an RA-query. Relations: EMP(ENO, ENAME, TITLE), ASG(ENO,PNO,RESP,DUR) Query: Find the names of employees who are managing a project? – High level query SELECT FROM WHERE

ENAME EMP,ASG EMP.ENO = ASG.ENO AND DUR > 37

– Two possible transformations of the query are: ∗ Expression 1: ΠEN AM E (σDU R>37∧EM P.EN O=ASG.EN O (EM P

× ASG))

∗ Expression 2: ΠEN AM E (EM P ⋊ ⋉EN O (σDU R>37 (ASG))) – Expression 2 avoids the expensive and large intermediate Cartesian product, and therefore typically is better.

DDB 2008/09

J. Gamper

Page 4

Query Processing Example . . .

• We make the following assumptions about the data fragmentation – Data is (horizontally) fragmented: ∗ Site1: ASG1 = σEN O≤”E3” (ASG) ∗ Site2: ASG2 = σEN O>”E3” (ASG) ∗ Site3: EM P 1 = σEN O≤”E3” (EM P ) ∗ Site4: EM P 2 = σEN O>”E3” (EM P ) ∗ Site5: Result – Relations ASG and EMP are fragmented in the same way – Relations ASG and EMP are locally clustered on attributes RESP and ENO, respectively

DDB 2008/09

J. Gamper

Page 5

Query Processing Example . . . • Now consider the expression ΠEN AM E (EM P ⋊ ⋉EN O (σDU R>37 (ASG))) • Strategy 1 (partially parallel execution): – Produce ASG′1 and move to Site 3 – Produce ASG′2 and move to Site 4 – Join ASG′1 with EMP1 at Site 3 and move the result to Site 5 – Join ASG′2 with EMP2 at Site 4 and move the result to Site 5 – Union the result in Site 5

• Strategy 2: – Move ASG1 and ASG2 to Site 5 – Move EMP1 and EMP2 to Site 5 – Select and join at Site 5

• For simplicity, the final projection is omitted. DDB 2008/09

J. Gamper

Page 6

Query Processing Example . . . • Calculate the cost of the two strategies under the following assumptions: – Tuples are uniformly distributed to the fragments; 20 tuples satisfy DUR>37 – size(EMP) = 400, size(ASG) = 1000 – tuple access cost = 1 unit; tuple transfer cost = 10 units – ASG and EMP have a local index on DUR and ENO

• Strategy 1 – Produce ASG’s: (10+10) * tuple access cost – Transfer ASG’s to the sites of EMPs: (10+10) * tuple transfer cost – Produce EMP’s: (10+10) * tuple access cost * 2

20 200 40

– Transfer EMP’s to result site: (10+10) * tuple transfer cost

200

– Total cost

460

• Strategy 2 – Transfer EMP1 , EMP2 to site 5: 400 * tuple transfer cost

4,000

– Transfer ASG1 , ASG2 to site 5: 1000 * tuple transfer cost

10,000

– Select tuples from ASG1

∪ ASG2 : 1000 * tuple access cost

– Join EMP and ASG’: 400 * 20 * tuple access cost

8,000 23,000

– Total cost DDB 2008/09

1,000

J. Gamper

Page 7

Query Optimization

• Query optimization is a crucial and difficult part of the overall query processing • Objective of query optimization is to minimize the following cost function: I/O cost + CPU cost + communication cost

• Two different scenarios are considered: – Wide area networks ∗ Communication cost dominates · low bandwidth · low speed · high protocol overhead ∗ Most algorithms ignore all other cost components – Local area networks ∗ Communication cost not that dominant ∗ Total cost function should be considered

DDB 2008/09

J. Gamper

Page 8

Query Optimization . . . • Ordering of the operators of relational algebra is crucial for efficient query processing • Rule of thumb: move expensive operators at the end of query processing • Cost of RA operations: Operation

Complexity

Select, Project

O(n)

(without duplicate elimination) Project

O(n log n)

(with duplicate elimination) Group Join Semi-join

O(n log n)

Division Set Operators Cartesian Product DDB 2008/09

J. Gamper

O(n2 ) Page 9

Query Optimization Issues

Several issues have to be considered in query optimization

• Types of query optimizers – wrt the search techniques (exhaustive search, heuristics) – wrt the time when the query is optimized (static, dynamic)

• Statistics • Decision sites • Network topology • Use of semijoins

DDB 2008/09

J. Gamper

Page 10

Query Optimization Issues . . .

• Types of Query Optimizers wrt Search Techniques – Exhaustive search ∗ Cost-based ∗ Optimal ∗ Combinatorial complexity in the number of relations – Heuristics ∗ Not optimal ∗ Regroups common sub-expressions ∗ Performs selection, projection first ∗ Replaces a join by a series of semijoins ∗ Reorders operations to reduce intermediate relation size ∗ Optimizes individual operations

DDB 2008/09

J. Gamper

Page 11

Query Optimization Issues . . .

• Types of Query Optimizers wrt Optimization Timing – Static ∗ Query is optimized prior to the execution ∗ As a consequence it is difficult to estimate the size of the intermediate results ∗ Typically amortizes over many executions – Dynamic ∗ Optimization is done at run time ∗ Provides exact information on the intermediate relation sizes ∗ Have to re-optimize for multiple executions – Hybrid ∗ First, the query is compiled using a static algorithm ∗ Then, if the error in estimate sizes greater than threshold, the query is re-optimized at run time

DDB 2008/09

J. Gamper

Page 12

Query Optimization Issues . . .

• Statistics – Relation/fragments ∗ Cardinality ∗ Size of a tuple ∗ Fraction of tuples participating in a join with another relation/fragment – Attribute ∗ Cardinality of domain ∗ Actual number of distinct values ∗ Distribution of attribute values (e.g., histograms) – Common assumptions ∗ Independence between different attribute values ∗ Uniform distribution of attribute values within their domain

DDB 2008/09

J. Gamper

Page 13

Query Optimization Issues . . .

• Decision sites – Centralized ∗ Single site determines the ”best” schedule ∗ Simple ∗ Knowledge about the entire distributed database is needed – Distributed ∗ Cooperation among sites to determine the schedule ∗ Only local information is needed ∗ Cooperation comes with an overhead cost – Hybrid ∗ One site determines the global schedule ∗ Each site optimizes the local sub-queries

DDB 2008/09

J. Gamper

Page 14

Query Optimization Issues . . .

• Network topology – Wide area networks (WAN) point-to-point ∗ Characteristics · Low bandwidth · Low speed · High protocol overhead ∗ Communication cost dominate; all other cost factors are ignored ∗ Global schedule to minimize communication cost ∗ Local schedules according to centralized query optimization – Local area networks (LAN) ∗ Communication cost not that dominant ∗ Total cost function should be considered ∗ Broadcasting can be exploited (joins) ∗ Special algorithms exist for star networks

DDB 2008/09

J. Gamper

Page 15

Query Optimization Issues . . .

• Use of Semijoins – Reduce the size of the join operands by first computing semijoins – Particularly relevant when the main cost is the communication cost – Improves the processing of distributed join operations by reducing the size of data exchange between sites – However, the number of messages as well as local processing time is increased

DDB 2008/09

J. Gamper

Page 16

Distributed Query Processing Steps

DDB 2008/09

J. Gamper

Page 17

Conclusion

• Query processing transforms a high level query (relational calculus) into an equivalent lower level query (relational algebra). The main difficulty is to achieve the efficiency in the transformation

• Query optimization aims to mimize the cost function: I/O cost + CPU cost + communication cost

• Query optimizers vary by search type (exhaustive search, heuristics) and by type of the algorithm (dynamic, static, hybrid). Different statistics are collected to support the query optimization process

• Query optimizers vary by decision sites (centralized, distributed, hybrid) • Query processing is done in the following sequence: query decomposition→data localization→global optimization→ local optimization

DDB 2008/09

J. Gamper

Page 18

Chapter 5: Overview of Query Processing

calculus/SQL) on a distributed database (i.e., a set of global relations) into an equivalent and efficient lower-level query (of ... ASG2 to site 5: 1000 * tuple transfer cost. 10,000. – Select tuples from ASG1 ∪ ASG2: 1000 * tuple access cost. 1,000 .... Wide area networks (WAN) point-to-point. ∗ Characteristics. · Low bandwidth.

226KB Sizes 0 Downloads 374 Views

Recommend Documents

Chapter 5
not in the domain. The only critical point is x = 0. As x moves away from 0 on either side, the values of y decrease. The function has a local maximum value at (0, ...... (b) Since. ,. dV. dV dr dt dr dt. = we have. 2 . dV dr rh dt dt π. = (c). 2. 2

Linked Data Query Processing Strategies
Recently, processing of queries on linked data has gained at- ... opment is exciting, paving new ways for next generation applications on the Web. ... In Sections 3 & 4 we present our approach to stream-based query ..... The only “interesting”.

Chapter 5 - DLSCRIB
Three different washing solutions are being compared to study their ... Plot the mean tensile strengths observed for each chemical type in Problem 4.3 and ...... np y p y .... h... n-1. Treatment x Squares. Squares. Treatments .... h.j.. SS. SS np y

Chapter 5
Every Document object has: •forms - an array of references to the forms of the document. •Each forms object has an elements array, which has references to the form's elements. Document also has property arrays for anchors, links, & images. JavaSc

Chapter 5 Density matrix formalism
In chap 2 we formulated quantum mechanics for isolated systems. In practice systems interect with their environnement and we need a description that takes this ...

chapter 5.pdf
Memory Management. 3. Device Management. 4. File Management. 5. Security Management. User. Utilities Application Software. Operating System. Hardware.

Word Chapter 5
Select the text from the first paragraph (do not select the title) to the end of the ... Insert the Simple Quote (NOT Simple text box—scroll down to find it) and make ...

Chapter 5.pdf
Loading… Page 1. Whoops! There was a problem loading more pages. Chapter 5.pdf. Chapter 5.pdf. Open. Extract. Open with. Sign In. Main menu. Displaying Chapter 5.pdf.

Chapter 5.pdf
hydraulic cylinder BC (short link). Explain the significance. of each force on the .... Chapter 5.pdf. Chapter 5.pdf. Open. Extract. Open with. Sign In. Main menu.

Chapter NR 5
cubic feet (one decimal place). Step 2. Compute Maximum Weight Capacity. Formula: Capacity = Cubic Capacity. 62.5 * Boat Weight B 5. Capacity = [(. 62.5) *. ] B 5. Capacity = pounds (nearest whole number). Note: The volume of integral structure aft o

Chapter 5 and 6 - GitHub
Mar 8, 2018 - These things are based on the sampling distribution of the estimators (ˆβ) if the model is true and we don't do any model selection. • What if we do model selection, use Kernels, think the model is wrong? • None of those formulas

GPUQP: Query Co-Processing Using Graphics Processors - hkust cse
on how GPUs can be programmed for heavy-duty database constructs, such as ... 2. PRELIMINARIES. As the GPU is designed for graphics applications, the basic data .... processor Sorting for Large Database Management. SIGMOD 2006: ...

Efficient Query Processing for Streamed XML Fragments
Institute of Computer System, Northeastern University, Shenyang, China ... and queries on parts of XML data require less memory and processing time.

Efficient Top-k Hyperplane Query Processing for ...
ABSTRACT. A query can be answered by a binary classifier, which sep- arates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a h

Efficient Exact Edit Similarity Query Processing with the ...
Jun 16, 2011 - edit similarity queries rely on a signature scheme to gener- ... Permission to make digital or hard copies of all or part of this work for personal or classroom ... database [2], or near duplicate documents in a document repository ...

A Space-Efficient Indexing Algorithm for Boolean Query Processing
index are 16.4% on DBLP, 26.8% on TREC, and 39.2% on ENRON. We evaluated the query processing time with varying numbers of tokens in a query.

REQUEST: Region-Based Query Processing in Sensor ...
In wireless sensor networks, node failures occur frequently. The effects of these failures can ..... tion service for ad-hoc sensor networks. SIGOPS Oper. Syst. Rev.

5 Meeting Overview - Consensus Building Institute
Measurement and Goal Setting – Emissions goals are aspirational, imperfect, and important. Setting one overarching numerical Cambridge emissions goal is necessarily inaccurate due to inconsistencies in data collection and the fact that some data ca

GPUQP: Query Co-Processing Using Graphics ...
computing devices including PCs, laptops, consoles and cell phones. GPUs are .... using the shared memory to sort all bitonic sequences whose sizes are small ...

Shared Query Processing in Data Streaming Systems
systems that can manage streaming data have gained tremendous ..... an application executes business and presentation logic, where there are fewer ..... systems (see Section 2.3 for a brief survey), only a small part of it involves shared ...... proc

Top-k Linked Data Query Processing
score bounds (and thus allow an earlier termination) as compared to top-k .... In a pull-based implementation, operators call a next method on their in-.

Sempala: Interactive SPARQL Query Processing on Hadoop - GitHub
Impala [1] is an open-source MPP SQL query engine for Hadoop inspired by ..... installed. The machines were connected via Gigabit network. This is actually ..... or cloud service (Impala is also supported by Amazon Elastic MapReduce).

5 Meeting Overview - Consensus Building Institute
many different efforts could connect to and build on. ... Update its website to highlight sustainability initiatives and services and best practices for residents and ...