High Performance Computing
For senior undergraduate students

Lecture 5: Communication Model 01.11.2016

Dr. Mohammed Abdel-Megeed Salem Scientific Computing Department Faculty of Computer and Information Sciences Ain Shams University

Outline • 2.3 Dichotomy of Parallel Computing Platforms – 2.3.1 Control Structure of Parallel Platforms – 2.3.2 Communication Model of Parallel Platforms • Shared-Address-Space Platforms • Message-Passing Platforms

Dr. Mohammed Abdel-Megeed Salem

High Performance Computing 2016/ 2017

Lecture 5

2

Communication Models • Forms of data exchange between parallel tasks: – accessing a shared data space, and – exchanging messages.

• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. • Platforms that support messaging are called message-passing platforms or multicomputers.


Shared-Address-Space Platforms • A shared address space supports a common data space that is accessible to all processors. • Processors interact by modifying data objects stored in this shared address space. • Shared-address-space platforms also support SIMD programming.


Shared-Address-Space Platforms

• Typical shared-address-space architectures: – (a) Uniform-memory-access shared-address-space computer; – (b) Uniform-memory-access shared-address-space computer with caches and memories; – (c) Non-uniform-memory-access shared-address-space computer with local memory only.


Uniform vs Non-Uniform Memory Access • Memory can be local (exclusive to a processor) or global (common to all processors). • If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) multiprocessor. • On the other hand, if the time taken to access certain memory words is longer than others, the platform is called a non-uniform memory access (NUMA) multiprocessor.



Uniform vs Non-Uniform Memory Access • Algorithm design for NUMA machines requires exploiting locality to improve performance. • Programming NUMA platforms: – reads are easy, since they are implicitly visible to other processors; – read/write access to shared data requires mutual exclusion.

• Caches in NUMA require coordinated access to multiple copies. This leads to the cache coherence problem.
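To illustrate the mutual-exclusion requirement above, here is a minimal Python sketch, with threads standing in for processors sharing an address space; the counter, thread count, and iteration count are illustrative choices, not part of the lecture:

```python
import threading

# Shared counter updated by several "processors" (threads); the lock
# provides the mutual exclusion that writes to shared data require.
counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # the read-modify-write must be atomic
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; without it, updates could be lost
```

Reads need no such protection: a processor that only reads the counter sees whatever value was last written, which is why the slide calls reads "easy".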


Shared Address Space vs Shared Memory Machines • It is important to note the difference between the terms shared address space and shared memory. • We refer to the former as a programming abstraction and to the latter as a physical machine attribute. • It is possible to provide a shared address space using a physically distributed memory.


Message-Passing Platforms • A set of processors, each with its own (exclusive) memory. • Memory addresses in one processor do not map to another processor: there is no global address space. • Changes made to local memory have no effect on the memory of other processors: there is no cache coherence requirement. • Instances of such a view arise naturally from clustered workstations and non-shared-address-space multicomputers.


Message-Passing Platforms • Interactions between different nodes are accomplished using messages over a network. • A message can carry data, work, or synchronization actions. • Programmed using (variants of) send and receive primitives, together with helpers such as GetID and NumProcs. • Libraries such as MPI and PVM provide such primitives.


Message-Passing Platforms • Interactions are accomplished by sending and receiving messages; the basic operations are send and receive. • Since send and receive operations must specify target addresses, there must be a mechanism to assign an ID to each of the multiple processes executing a parallel program. • This ID is typically made available to the program using a function such as whoami. • One other function that is typically needed is numprocs, which specifies the number of processes participating in the ensemble.


Message-Passing Platforms Advantages: • Memory is scalable with the number of processors: increasing the number of processors increases the total memory proportionately. • Each processor can rapidly access its own memory without interference and without the overhead of maintaining cache coherence. • Cost effectiveness: can use commodity, off-the-shelf processors and networking.


Message-Passing Platforms Disadvantages: • The programmer is responsible for many of the details of data communication between processors. • It may be difficult to map existing data structures, based on global memory, to this memory organization. • Memory access times are non-uniform (NUMA-like): remote data must be fetched via messages, which is far slower than local access.


Message Passing vs Shared Address Space • Message passing requires little hardware support, other than a network. • Shared-address-space platforms can easily emulate message passing.
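A minimal sketch of that emulation: each "node" owns an inbox in the shared address space, send is a synchronized write into the receiver's inbox, and receive is a synchronized read from one's own. The inbox structure, node functions, and ping/pong exchange are illustrative:

```python
import queue
import threading

# One inbox per node, living in shared memory; queue.Queue is
# internally synchronized, standing in for lock-protected buffers.
inboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(dest, msg):
    inboxes[dest].put(msg)      # write into the receiver's inbox

def receive(rank):
    return inboxes[rank].get()  # blocking read from one's own inbox

def node0():
    send(1, "ping")
    print(receive(0))  # prints "pong"

def node1():
    msg = receive(1)
    send(0, "pong" if msg == "ping" else "?")

t0, t1 = threading.Thread(target=node0), threading.Thread(target=node1)
t0.start(); t1.start(); t0.join(); t1.join()
```

The reverse direction, building a shared address space on top of message passing, needs every shared read and write turned into a message round-trip, which is why message-passing hardware needs so little support while the emulation cost falls on software.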


Outline • 2.4 Physical Organization of Parallel Platforms – 2.4.1 Architecture of an Ideal Parallel Computer – 2.4.2 Interconnection Networks for Parallel Computers – 2.4.3 Network Topologies – 2.4.4 Evaluating Static Interconnection Networks – 2.4.5 Evaluating Dynamic Interconnection Networks – 2.4.6 Cache Coherence in Multiprocessor Systems


Architecture of an Ideal Parallel Computer • A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM. • PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors. • Processors share a common clock but may execute different instructions in each cycle.


Architecture of an Ideal Parallel Computer • Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses (concurrent or exclusive, crossed with read or write): – Exclusive-read, exclusive-write (EREW) PRAM. – Concurrent-read, exclusive-write (CREW) PRAM. – Exclusive-read, concurrent-write (ERCW) PRAM. – Concurrent-read, concurrent-write (CRCW) PRAM.


Architecture of an Ideal Parallel Computer • Several protocols are used to resolve concurrent writes: – Common: write only if all values are identical. – Arbitrary: write the data from a randomly selected processor. – Priority: follow a pre-determined priority order. – Sum: write the sum of all data items.
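The four write-resolution protocols can be sketched as a small Python function; the resolve helper and its arguments are illustrative, and "arbitrary" here simply picks the first writer:

```python
# Resolve the values that a set of processors concurrently tries to
# write to a single memory cell, under one of the four protocols.
def resolve(values, protocol, priorities=None):
    if protocol == "common":
        # Write succeeds only if all processors agree on the value.
        if len(set(values)) != 1:
            raise ValueError("common protocol: conflicting values")
        return values[0]
    if protocol == "arbitrary":
        return values[0]  # any one writer wins; pick the first here
    if protocol == "priority":
        # Lowest priority number wins (a pre-determined order).
        winner = min(range(len(values)), key=lambda i: priorities[i])
        return values[winner]
    if protocol == "sum":
        return sum(values)
    raise ValueError("unknown protocol")

print(resolve([5, 5, 5], "common"))                          # 5
print(resolve([3, 7, 9], "priority", priorities=[2, 0, 1]))  # 7
print(resolve([1, 2, 3], "sum"))                             # 6
```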


Interconnection Networks for Parallel Computers • Interconnection networks carry data between processors and to memory. • Interconnects are made of switches and links (wires, fiber). • Interconnects are classified as static or dynamic.


Interconnection Networks for Parallel Computers • Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks. • Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.


Static and Dynamic Interconnection Networks


Interconnection Networks • Switches map a fixed number of inputs to outputs. – provide support for internal buffering (when the requested output port is busy), – routing (to alleviate congestion on the network), and – multicast (same output on multiple ports).

• The total number of ports on a switch is the degree of the switch.




Network Topologies: Buses • A bus-based network is perhaps the simplest network: it consists of a shared medium that is common to all the nodes. • All processors access a common bus for exchanging data. • The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium. • However, the bandwidth of the shared bus is a major bottleneck.


Network Topologies: Buses

Bus-based interconnects (a) with no local caches; (b) with local memory/caches.

Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Buses, Example • p processors share a bus to memory. Each processor accesses k data items, and each data access takes time t_cycle. • Execution time lower bound: t_cycle × k × p seconds.


Network Topologies: Buses, Example • p processors share a bus to memory. Each processor accesses k data items, and each data access takes time t_cycle. • Execution time lower bound: t_cycle × k × p seconds. • Now assume that 50% of the memory accesses (0.5k) are made to local data. • Execution time lower bound: 0.5 × t_cycle × k (local data accesses) + 0.5 × t_cycle × k × p (shared data accesses).
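A worked instance of these two lower bounds, with illustrative values for p, k, and t_cycle (the lecture fixes none of them):

```python
# p processors, k accesses per processor, t_cycle seconds per access.
p, k, t_cycle = 8, 1000, 1e-8

# Every access goes over the shared bus, so all k*p accesses serialize:
all_shared = t_cycle * k * p

# 50% of accesses hit local memory (and proceed concurrently on all
# processors), while the other 50% still serialize on the bus:
half_local = 0.5 * t_cycle * k + 0.5 * t_cycle * k * p

print(all_shared, half_local)
```

With these numbers the local memory nearly halves the bound, which is the point of the slide: caching local data relieves the bus bottleneck, but the shared fraction still scales with p.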


Network Topologies: Crossbars A crossbar network uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p processors to b memory banks.


Network Topologies: Crossbars • The cost of a crossbar of p processors grows as O(p2). • This is generally difficult to scale for large values of p. • Crossbars have excellent performance scalability but poor cost scalability. • Buses have excellent cost scalability, but poor performance scalability.


Network Topologies: Completely Connected Network • Each processor is connected to every other processor. • The number of links in the network scales as O(p2). • While the performance scales very well, the hardware complexity is not realizable for large values of p. • In this sense, these networks are static counterparts of crossbars.
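The O(p2) link growth made concrete: a completely connected network of p nodes needs p*(p-1)/2 bidirectional links (num_links is an illustrative helper, not a standard function):

```python
# Number of links in a completely connected network of p nodes:
# each of the p nodes links to the other p-1, and each link is
# shared by two nodes, giving p*(p-1)/2.
def num_links(p):
    return p * (p - 1) // 2

print(num_links(8))     # 28 links for the 8-node example
print(num_links(1024))  # 523776: quadratic growth makes large p impractical
```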


Network Topologies: Completely Connected and Star Connected Networks

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.


Network Topologies: Star Connected Network • Every node is connected only to a common node at the center. • Distance between any pair of nodes is O(1). However, the central node becomes a bottleneck. • In this sense, star connected networks are static counterparts of buses.


Network Topologies: Linear Arrays, Meshes, and k-d Meshes • In a linear array, each node has two neighbors, one to its left and one to its right. • If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.


Network Topologies: Linear Arrays, Meshes, and k-d Meshes • A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. • A further generalization to d dimensions has nodes with 2d neighbors. • A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
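The hypercube structure can be sketched directly: label the p = 2^d nodes with d-bit integers; neighbors differ in exactly one bit, so each node has d = log2(p) neighbors (hypercube_neighbors is an illustrative helper):

```python
# Neighbors of a node in a d-dimensional hypercube: flip each of the
# d bits of its label in turn.
def hypercube_neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

p, d = 16, 4  # 16 nodes -> 4-dimensional hypercube
print(hypercube_neighbors(0, d))            # [1, 2, 4, 8]
print(len(hypercube_neighbors(5, d)) == d)  # True: degree equals log2(p)
```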


Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.


Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.


Network Topologies: Tree Properties • The distance between any two nodes is no more than 2 log p. • Links higher up the tree potentially carry more traffic than those at the lower levels.
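The 2 log p bound can be checked with a small sketch using 1-based heap labels for a complete binary tree, where node i's parent is i // 2 (tree_distance is an illustrative helper):

```python
import math

# Distance between two nodes of a complete binary tree: walk the
# deeper node up until the labels meet at the common ancestor.
def tree_distance(a, b):
    hops = 0
    while a != b:
        if a > b:
            a //= 2  # step toward the root
        else:
            b //= 2
        hops += 1
    return hops

p = 15  # complete binary tree with 15 nodes (depth 3)
print(tree_distance(8, 15))  # 6: up to the root and back down
print(tree_distance(8, 15) <= 2 * math.log2(p + 1))  # True
```

The worst case pairs two leaves in opposite subtrees of the root, giving log p hops up plus log p hops down, which is exactly the bound on the slide.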


Contacts High Performance Computing, 2016/2017 Dr. Mohammed Abdel-Megeed M. Salem Faculty of Computer and Information Sciences, Ain Shams University Abbassia, Cairo, Egypt Tel.: +2 011 1727 1050 Email: [email protected] Web: https://sites.google.com/a/fcis.asu.edu.eg/salem

