High Performance Computing
For senior undergraduate students

Lecture 7: Principles of Parallel Algorithms 08.11.2016

Dr. Mohammed Abdel-Megeed Salem Scientific Computing Department Faculty of Computer and Information Sciences Ain Shams University

Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data-based Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition


3.2 Decomposition Techniques
• So how does one decompose a task into various subtasks?
• While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:
  – recursive decomposition
  – data decomposition
  – exploratory decomposition
  – speculative decomposition


3.2.1 Recursive Decomposition
• Generally suited to problems that are solved using the divide-and-conquer strategy.
• A given problem is first decomposed into a set of sub-problems.
• These sub-problems are recursively decomposed further until a desired granularity is reached.


Recursive Decomposition: Example
A classic example of a divide-and-conquer algorithm on which we can apply recursive decomposition is quicksort.

• Consider the problem of sorting a sequence A of n elements using the quicksort algorithm.
• Quicksort is a divide-and-conquer algorithm that starts by selecting a pivot element x and then partitions the sequence A into two subsequences A0 and A1 such that all the elements in A0 are smaller than x and all the elements in A1 are greater than or equal to x.
• Each of the subsequences A0 and A1 is sorted by recursively calling quicksort.
• The recursion terminates when each subsequence contains only a single element.


Recursive Decomposition: Example

In this example, once the list has been partitioned around the pivot, each sublist can be processed concurrently (i.e., each sublist represents an independent subtask). This can be repeated recursively.
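To make this concrete, here is a minimal runnable sketch of quicksort's recursive decomposition in Python. The depth cutoff, the helper name parallel_quicksort, and the use of threads are illustrative assumptions, not part of the slides; in CPython, threads illustrate the task structure rather than delivering real speedup.

    import threading

    def parallel_quicksort(A, depth=3):
        # Below the depth cutoff (or for trivial lists), sort serially:
        # spawning ever-smaller subtasks would cost more than it saves.
        if len(A) <= 1 or depth == 0:
            return sorted(A)
        pivot, rest = A[0], A[1:]
        A0 = [x for x in rest if x < pivot]    # elements smaller than the pivot
        A1 = [x for x in rest if x >= pivot]   # elements >= the pivot
        result = {}
        # The two sublists are independent subtasks: sort A0 in a new thread
        # while this thread sorts A1, then combine the results.
        t = threading.Thread(
            target=lambda: result.update(left=parallel_quicksort(A0, depth - 1)))
        t.start()
        right = parallel_quicksort(A1, depth - 1)
        t.join()
        return result["left"] + [pivot] + right

    print(parallel_quicksort([5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2]))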


Recursive Decomposition: Example
We define a task as the work of partitioning a given subsequence. Therefore, Figure 3.8 also represents the task graph for the problem. Initially, there is only one sequence (i.e., the root of the tree), and we can use only a single process to partition it. The completion of the root task results in two subsequences (A0 and A1, corresponding to the two nodes at the first level of the tree), and each one can be partitioned in parallel. Similarly, the concurrency continues to increase as we move down the tree.


Recursive Decomposition: Example
The problem of finding the minimum number in a given list can be fashioned as a divide-and-conquer algorithm.

    procedure SERIAL_MIN(A, n)
    begin
        min := A[0];
        for i := 1 to n - 1 do
            if (A[i] < min) min := A[i];
        endfor;
        return min;
    end SERIAL_MIN


Recursive Decomposition: Example
We can rewrite the loop as follows:

    procedure RECURSIVE_MIN(A, n)
    begin
        if (n = 1) then
            min := A[0];
        else
            lmin := RECURSIVE_MIN(A, n/2);
            rmin := RECURSIVE_MIN(&(A[n/2]), n - n/2);
            if (lmin < rmin) then
                min := lmin;
            else
                min := rmin;
            endelse;
        endelse;
        return min;
    end RECURSIVE_MIN


Recursive Decomposition: Example
• The code on the previous slide can be decomposed naturally using a recursive decomposition strategy.
• Consider finding the minimum number in the set {4, 9, 1, 7, 8, 11, 2, 12}.
• The task dependency graph associated with this computation is a binary tree of pairwise-minimum tasks.
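A serial Python rendering of RECURSIVE_MIN on that set may help; the function name and the list slicing are illustrative choices (slicing copies the halves, which the pointer-based pseudocode above avoids):

    def recursive_min(A):
        # Base case: a single element is its own minimum.
        if len(A) == 1:
            return A[0]
        mid = len(A) // 2
        # The two recursive calls are independent tasks; in a parallel
        # formulation each could be mapped to a different process.
        lmin = recursive_min(A[:mid])
        rmin = recursive_min(A[mid:])
        return lmin if lmin < rmin else rmin

    print(recursive_min([4, 9, 1, 7, 8, 11, 2, 12]))   # prints 1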


Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data-based Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition


3.2.2 Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across various tasks.
• This partitioning induces a decomposition of the problem.
• Data can be partitioned in various ways; this critically impacts the performance of a parallel algorithm.


Data Decomposition: Output Data Decomposition
• Often, each element of the output can be computed independently of the others (simply as a function of the input).
• A partition of the output across tasks decomposes the problem naturally.


Output Data Decomposition: Example
Consider the problem of multiplying two n x n matrices A and B to yield matrix C. The output matrix C can be partitioned into four tasks as follows:

Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
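A minimal sketch of this output partitioning with NumPy (an illustrative choice of language and block size; the slides treat the tasks abstractly): each of the four tasks computes one block of C from a block-row of A and a block-column of B.

    import numpy as np

    n = 4                                   # block size; the matrices are 2n x 2n
    A = np.random.rand(2 * n, 2 * n)
    B = np.random.rand(2 * n, 2 * n)
    C = np.zeros((2 * n, 2 * n))

    def blk(M, i, j):
        # The (i, j) n-by-n block of M, with 1-based block indices as above.
        return M[(i - 1) * n : i * n, (j - 1) * n : j * n]

    # Four independent tasks, one per output block C[i, j].
    for i in (1, 2):
        for j in (1, 2):
            C[(i - 1) * n : i * n, (j - 1) * n : j * n] = (
                blk(A, i, 1) @ blk(B, 1, j) + blk(A, i, 2) @ blk(B, 2, j)
            )

    assert np.allclose(C, A @ B)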


Output Data Decomposition: Example
A partitioning of the output data does not result in a unique decomposition into tasks. For example, for the same problem as on the previous slide, with an identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2


Output Data Decomposition: Example
• Consider the problem of counting the instances of given itemsets in a database of transactions.
• We are given a set T containing n transactions and a set I containing m itemsets.
• The goal is to find the number of times that each itemset in I appears in all the transactions.


Output Data Decomposition: Example
• The computation can be decomposed into two tasks by partitioning the output into two parts and having each task compute its half of the frequencies. Here the itemsets input has also been partitioned.
• The primary motivation for the decomposition is to have each task independently compute the subset of frequencies assigned to it.


Output Data Decomposition: Example
From the previous example, the following observations can be made:
• If the database of transactions is replicated across the processes, each task can be accomplished independently, with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.


Input Data Partitioning
• In many cases, this is the only natural decomposition because the output is not clearly known a priori (e.g., the problem of finding the minimum in a list, sorting a given list, etc.).
• A task is associated with each input data partition. The task performs as much of the computation as it can using its own part of the data. Subsequent processing combines these partial results.


Input Data Partitioning: Example In the database counting example, the input (i.e., the transaction set) can be partitioned. This induces a task decomposition in which each task generates partial counts for all itemsets. These are combined subsequently for aggregate counts.
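A sketch of this input partitioning in Python (the tiny transaction database and the two-way split are illustrative): each task produces partial counts for all itemsets over its own share of the transactions, and the partial counts are then combined.

    from collections import Counter

    transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    itemsets = [frozenset({"a", "b"}), frozenset({"b", "c"})]

    def partial_counts(chunk):
        # One task: scan only this partition of the input, but count all itemsets.
        c = Counter()
        for t in chunk:
            for s in itemsets:
                if s <= t:                 # itemset s appears in transaction t
                    c[s] += 1
        return c

    mid = len(transactions) // 2
    parts = [partial_counts(transactions[:mid]),    # task 1's partition
             partial_counts(transactions[mid:])]    # task 2's partition

    # Subsequent processing combines the partial results into aggregate counts.
    totals = sum(parts, Counter())
    print(dict(totals))    # {frozenset({'a', 'b'}): 2, frozenset({'b', 'c'}): 2}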


Partitioning Input and Output Data
Often input and output data decomposition can be combined for a higher degree of concurrency. For the itemset counting example, the transaction set (input) and the itemset counts (output) can both be decomposed, as sketched below.
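A sketch of the combined decomposition (Python; the 2 x 2 task grid is an illustrative choice): each task counts one partition of the itemsets over one partition of the transactions, and partial counts are reduced along the transaction dimension.

    from collections import Counter

    transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    itemsets = [frozenset({"a", "b"}), frozenset({"b", "c"})]

    def count(ts, iss):
        # One task: count the itemsets in iss over the transactions in ts only.
        return Counter({s: sum(1 for t in ts if s <= t) for s in iss})

    t_parts = [transactions[:2], transactions[2:]]   # input (transaction) partition
    i_parts = [itemsets[:1], itemsets[1:]]           # output (itemset) partition

    # 2 x 2 = 4 independent tasks; summing reduces along the transaction axis.
    totals = Counter()
    for ts in t_parts:
        for iss in i_parts:
            totals += count(ts, iss)
    print(dict(totals))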

Intermediate Data Partitioning
• Computation can often be viewed as a sequence of transformations from the input to the output data.
• In these cases, it is often beneficial to use one of the intermediate stages as a basis for decomposition.

Intermediate Data Partitioning: Example

Let us revisit the example of dense matrix multiplication. We first show how we can visualize this computation in terms of intermediate matrices D.

Intermediate Data Partitioning: Example
A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks:

Stage I
Task 01: D1,1,1 = A1,1 B1,1
Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2
Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1
Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2
Task 08: D2,2,2 = A2,2 B2,2

Stage II
Task 09: C1,1 = D1,1,1 + D2,1,1
Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1
Task 12: C2,2 = D1,2,2 + D2,2,2
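The same 8 + 4 task structure can be sketched in NumPy (illustrative; the dictionary keyed by (k, i, j) stands in for the intermediate matrices Dk,i,j):

    import numpy as np

    n = 4
    A = np.random.rand(2 * n, 2 * n)
    B = np.random.rand(2 * n, 2 * n)

    def blk(M, i, j):
        # The (i, j) n-by-n block of M, 1-based as on the slide.
        return M[(i - 1) * n : i * n, (j - 1) * n : j * n]

    # Stage I: eight independent multiplications produce the intermediates D.
    D = {(k, i, j): blk(A, i, k) @ blk(B, k, j)
         for i in (1, 2) for j in (1, 2) for k in (1, 2)}

    # Stage II: four independent additions; each depends on two Stage I tasks.
    C = np.block([[D[(1, 1, 1)] + D[(2, 1, 1)], D[(1, 1, 2)] + D[(2, 1, 2)]],
                  [D[(1, 2, 1)] + D[(2, 2, 1)], D[(1, 2, 2)] + D[(2, 2, 2)]]])

    assert np.allclose(C, A @ B)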

Intermediate Data Partitioning: Example
The task dependency graph for the decomposition into 12 tasks (shown on the previous slide) has two levels: each Stage II addition task depends on exactly two Stage I multiplication tasks.

The Owner Computes Rule
• The owner computes rule generally states that the process assigned a particular data item is responsible for all computation associated with it.
• In the case of input data decomposition, the owner computes rule implies that all computations that use an input data item are performed by its process.
• In the case of output data decomposition, the owner computes rule implies that the output is computed by the process to which the output data is assigned.

Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data-based Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition


Exploratory Decomposition
• The decomposition of the problem goes hand in hand with its execution.
• It involves the exploration (search) of a state space of solutions.

Exploratory Decomposition: Example
A simple application of exploratory decomposition is the solution of the 15-puzzle (a tile puzzle). We show a sequence of three moves that transforms a given initial state (a) into the desired final state (d).


Of course, the problem of computing the solution, in general, is much more difficult than in this simple example.

Exploratory Decomposition: Example
State space graph: each node is a configuration, and each edge connects configurations that can be reached from one another by a single move of a tile.

Exploratory Decomposition: Example
The state space can be explored by generating the various successor states of the current state and viewing them as independent tasks.

Exploratory Decomposition
• First, a few levels of configurations starting from the initial configuration are generated serially, until the search tree has a sufficient number of leaf nodes.
• Each leaf node is then assigned to a task to explore further, until at least one of them finds a solution.
• As soon as one of the concurrent tasks finds a solution, it can inform the others to terminate their searches.
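The sketch below captures that pattern on a toy state space (Python; the integer tree, the serial expansion depth, and the threading.Event termination flag are all illustrative assumptions, since a real 15-puzzle solver is far longer):

    import threading

    GOAL = 45
    found = threading.Event()

    def successors(s):
        # Toy state space: each configuration s has two successor configurations.
        return (2 * s + 1, 2 * s + 2)

    def explore(state, depth):
        # One task: search a subtree, stopping early once any task has succeeded.
        if found.is_set() or depth == 0:
            return
        if state == GOAL:
            print("solution found at state", state)
            found.set()               # inform the other tasks to terminate
            return
        for s in successors(state):
            explore(s, depth - 1)

    # Serially expand a few levels from the initial configuration...
    frontier = [0]
    for _ in range(3):
        frontier = [s for f in frontier for s in successors(f)]

    # ...then assign each leaf of the partial tree to a concurrent task.
    threads = [threading.Thread(target=explore, args=(leaf, 10)) for leaf in frontier]
    for t in threads:
        t.start()
    for t in threads:
        t.join()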


Exploratory Decomposition: Speedup
• In many instances of exploratory decomposition, the decomposition technique may change the amount of work done by the parallel formulation.
• This results in super- or sub-linear speedups: if the parallel search happens to find a solution after exploring fewer states than the serial search would have, the speedup is super-linear; if it explores states the serial search would never have visited, the speedup can be sub-linear.

Speculative Decomposition
• Speculative decomposition is similar to evaluating one or more of the branches of a switch statement in C in parallel before the input for the switch is available.
• While one task is performing the computation that will eventually resolve the switch, other tasks could pick up the multiple branches of the switch in parallel.
• When the input for the switch has finally been computed, the computation corresponding to the correct branch would be used, while that corresponding to the other branches would be discarded.
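A sketch of this in Python (the slow_condition stand-in and the two branch functions are invented for illustration): every branch is started speculatively, and only the result selected by the eventually computed input is kept.

    import concurrent.futures
    import time

    def slow_condition():
        time.sleep(0.1)               # the computation that resolves the "switch"
        return 1

    def branch_0():
        return "result of branch 0"

    def branch_1():
        return "result of branch 1"

    with concurrent.futures.ThreadPoolExecutor() as ex:
        # Speculatively start every branch before the switch input is available.
        futures = {0: ex.submit(branch_0), 1: ex.submit(branch_1)}
        selector = slow_condition()
        # Use the correct branch's result; the other branches' work is discarded.
        print(futures[selector].result())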

Speculative Decomposition
• In some applications, dependencies between tasks are not known a priori.
• There are generally two approaches to dealing with such applications: conservative approaches, which identify independent tasks only when they are guaranteed not to have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous.
• Conservative approaches may yield little concurrency, and optimistic approaches may require a roll-back mechanism in the case of an error.

Speculative Decomposition: Example
• A classic example of speculative decomposition is discrete event simulation.
• The central data structure in a discrete event simulation is a time-ordered event list.
• Events are extracted precisely in time order, processed, and, if required, resulting events are inserted back into the event list.
• Consider your day as a discrete event system: you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep.
• Each of these events may be processed independently; however, while driving to work you might meet with an unfortunate accident and not get to work at all.
• Therefore, an optimistic scheduling of the other events will have to be rolled back.

Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for decomposing a problem. Consider the following examples:
• In quicksort, recursive decomposition alone limits concurrency (why?). A mix of data and recursive decompositions is more desirable.
• In discrete event simulation, there might be concurrency in task processing. A mix of speculative decomposition and data decomposition may work well.

Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data-based Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition

• Characteristics of Tasks

Characteristics of Tasks
Relevant task characteristics include:
• Task generation.
• Task sizes.
• Knowledge of task sizes.
• Size of data associated with tasks.

Task Generation
• Static task generation: concurrent tasks can be identified a priori. Typical examples are matrix operations, graph algorithms, and image processing applications.
• Dynamic task generation: tasks are generated as the computation proceeds. A classic example is game playing: each 15-puzzle board is generated from the previous one.

Task Sizes
• Task sizes may be uniform (i.e., all tasks are the same size) or non-uniform.
• An example of uniform task sizes is matrix-vector multiplication.
• Non-uniform task sizes may or may not be determinable (or estimable) a priori.
• Quicksort is an example of the latter: the sizes of the subtasks depend on the pivot choices.

Size of Data Associated with Tasks
• The size of the data associated with a task may be small or large when viewed in the context of the size of the task.
• A small context implies that an algorithm can easily communicate the task to other processes dynamically (e.g., the 15-puzzle).
• A large context ties the task to a process.

Characteristics of Task Interactions
Tasks may communicate with each other in various ways. The associated dichotomy is:
• Static interactions: the tasks and their interactions are known a priori. These are relatively simple to code into programs.
• Dynamic interactions: the timing of interactions, or the set of interacting tasks, cannot be determined a priori. These interactions are harder to code, especially, as we shall see, using message passing APIs.

Characteristics of Task Interactions
• Regular interactions: there is a definite pattern (in the graph sense) to the interactions. These patterns can be exploited for efficient implementation.
• Irregular interactions: interactions lack well-defined topologies.

Characteristics of Task Interactions
A simple example of a regular static interaction pattern is image dithering, where the underlying communication pattern is a structured (2-D mesh) one.

Characteristics of Task Interactions
The multiplication of a sparse matrix with a vector is a good example of a static irregular interaction pattern: each task owns one or more rows, and which entries of the vector a task must read is dictated by the sparsity pattern of its rows (see the sketch below).
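For concreteness, a compressed sparse row (CSR) matrix-vector product is sketched below in Python (the 3 x 3 example matrix is illustrative); the vector entries each row-task reads are exactly the irregular, sparsity-determined interactions described above.

    # CSR storage: nonzero values, their column indices, and row start offsets.
    values  = [3.0, 1.0, 2.0, 5.0]     # nonzeros of a 3x3 sparse matrix
    col_idx = [0, 2, 1, 0]             # column of each nonzero
    row_ptr = [0, 2, 3, 4]             # row i owns values[row_ptr[i]:row_ptr[i+1]]
    b = [1.0, 2.0, 3.0]

    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):            # each row is an independent task
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Task i reads b[col_idx[k]]: which entries it touches depends on the
            # sparsity pattern, giving a static but irregular interaction pattern.
            y[i] += values[k] * b[col_idx[k]]

    print(y)                           # [6.0, 4.0, 5.0]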

Characteristics of Task Interactions
• Interactions may be read-only or read-write.
• In read-only interactions, tasks just read data items associated with other tasks.
• In read-write interactions, tasks read as well as modify data items associated with other tasks.

Characteristics of Task Interactions
• Interactions may be one-way or two-way.
• A one-way interaction can be initiated and accomplished by one of the two interacting tasks.
• A two-way interaction requires participation from both tasks involved in an interaction.
• One-way interactions are somewhat harder to code in message passing APIs.

Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data-based Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition

• Characteristics of Tasks
• Parallel Algorithm Models

Parallel Algorithm Models
• Data Parallel Model: tasks are statically (or semi-statically) mapped to processes, and each task performs similar operations on different data.
• Task Graph Model: starting from a task dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.

• Master-Slave Model: one or more master processes generate work and allocate it to worker processes. This allocation may be static or dynamic.
• Pipeline / Producer-Consumer Model: a stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid Models: a hybrid model may be composed either of multiple models applied hierarchically or of multiple models applied sequentially to different phases of a parallel algorithm.
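As a sketch of the master-slave (master-worker) model in Python (queue-based dynamic allocation is one common realization; the squaring "work" is a stand-in):

    import multiprocessing as mp

    def worker(tasks, results):
        # Each worker repeatedly pulls a work item until the master sends None.
        for item in iter(tasks.get, None):
            results.put(item * item)    # stand-in for real work

    if __name__ == "__main__":
        tasks, results = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=worker, args=(tasks, results))
                   for _ in range(4)]
        for w in workers:
            w.start()
        # The master generates work and allocates it dynamically to the workers.
        for i in range(20):
            tasks.put(i)
        for _ in workers:
            tasks.put(None)             # one stop signal per worker
        print(sorted(results.get() for _ in range(20)))
        for w in workers:
            w.join()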

Contacts
High Performance Computing, 2016/2017
Dr. Mohammed Abdel-Megeed M. Salem
Faculty of Computer and Information Sciences, Ain Shams University
Abbassia, Cairo, Egypt
Tel.: +2 011 1727 1050
Email: [email protected]
Web: https://sites.google.com/a/fcis.asu.edu.eg/salem

