High Performance Computing
For senior undergraduate students

Lecture 5: Communication Model 01.11.2016

Dr. Mohammed Abdel-Megeed Salem Scientific Computing Department Faculty of Computer and Information Sciences Ain Shams University

Outline • 2.3 Dichotomy of Parallel Computing Platforms – 2.3.1 Control Structure of Parallel Platforms – 2.3.2 Communication Model of Parallel Platforms • Shared-Address-Space Platforms • Message-Passing Platforms

Dr. Mohammed Abdel-Megeed Salem

High Performance Computing 2016/ 2017

Lecture 5

2

Communication Models • Forms of data exchange between parallel tasks: – accessing a shared data space, and – exchanging messages.

• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. • Platforms that support messaging are called message-passing platforms or multicomputers.


Shared-Address-Space Platforms • A shared address space supports a common data space that is accessible to all processors. • Processors interact by modifying data objects stored in this shared address space. • Shared-address-space platforms also support SIMD programming.


Shared-Address-Space Platforms

• Typical shared-address-space architectures: – (a) Uniform-memory-access shared-address-space computer; – (b) Uniform-memory-access shared-address-space computer with caches and memories; – (c) Non-uniform-memory-access shared-address-space computer with local memory only.


Uniform vs Non-Uniform Memory Access • Memory can be local (exclusive to a processor) or global (common to all processors). • If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) multiprocessor. • On the other hand, if the time taken to access certain memory words is longer than others, the platform is called a non-uniform memory access (NUMA) multiprocessor.



Uniform vs Non-Uniform Memory Access • Algorithm design for NUMA machines requires exploiting locality to improve performance. • Programming NUMA platforms: – reads are easy, since they are implicitly visible to other processors; – read/write access to shared data requires mutual exclusion.

• Caches in NUMA require coordinated access to multiple copies. This leads to the cache coherence problem.
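To illustrate the mutual-exclusion requirement above, here is a minimal Python sketch, with threads standing in for processors sharing an address space; the counter, thread count, and iteration count are illustrative choices, not part of the lecture:

```python
import threading

# Shared counter updated by several "processors" (threads); the lock
# provides the mutual exclusion that writes to shared data require.
counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # the read-modify-write must be atomic
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; without it, updates could be lost
```

Reads need no such protection: a processor that only reads the counter sees whatever value was last written, which is why the slide calls reads "easy".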


Shared Address Space vs Shared Memory Machines • It is important to note the difference between the terms shared address space and shared memory. • We refer to the former as a programming abstraction and to the latter as a physical machine attribute. • It is possible to provide a shared address space using a physically distributed memory.


Message-Passing Platforms • A set of processors, each with its own (exclusive) memory. • Memory addresses in one processor do not map to another processor: there is no global address space. • Changes made to local memory have no effect on the memory of other processors: there is no cache coherence requirement. • Instances of such a view arise naturally from clustered workstations and non-shared-address-space multicomputers.


Message-Passing Platforms • Interactions between different nodes are accomplished using messages over a network. • A message can carry data, work, or synchronization actions. • Programmed using (variants of) send and receive primitives, together with helpers such as GetID and NumProcs. • Libraries such as MPI and PVM provide such primitives.


Message-Passing Platforms • Interactions are accomplished by sending and receiving messages; the basic operations are send and receive. • Since send and receive operations must specify target addresses, there must be a mechanism to assign an ID to each of the multiple processes executing a parallel program. • This ID is typically made available to the program using a function such as whoami. • One other function that is typically needed is numprocs, which specifies the number of processes participating in the ensemble.


Message-Passing Platforms Advantages: • Memory is scalable with the number of processors: increasing the number of processors increases the total memory proportionately. • Each processor can rapidly access its own memory without interference and without the overhead of maintaining cache coherence. • Cost effectiveness: can use commodity, off-the-shelf processors and networking.


Message-Passing Platforms Disadvantages: • The programmer is responsible for many of the details of data communication between processors. • It may be difficult to map existing data structures, based on global memory, to this memory organization. • Memory access times are non-uniform (NUMA-like): remote data must be fetched via messages, which is far slower than local access.


Message Passing vs Shared Address Space • Message passing requires little hardware support, other than a network. • Shared-address-space platforms can easily emulate message passing.
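A minimal sketch of that emulation: each "node" owns an inbox in the shared address space, send is a synchronized write into the receiver's inbox, and receive is a synchronized read from one's own. The inbox structure, node functions, and ping/pong exchange are illustrative:

```python
import queue
import threading

# One inbox per node, living in shared memory; queue.Queue is
# internally synchronized, standing in for lock-protected buffers.
inboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(dest, msg):
    inboxes[dest].put(msg)      # write into the receiver's inbox

def receive(rank):
    return inboxes[rank].get()  # blocking read from one's own inbox

def node0():
    send(1, "ping")
    print(receive(0))  # prints "pong"

def node1():
    msg = receive(1)
    send(0, "pong" if msg == "ping" else "?")

t0, t1 = threading.Thread(target=node0), threading.Thread(target=node1)
t0.start(); t1.start(); t0.join(); t1.join()
```

The reverse direction, building a shared address space on top of message passing, needs every shared read and write turned into a message round-trip, which is why message-passing hardware needs so little support while the emulation cost falls on software.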


Outline • 2.4 Physical Organization of Parallel Platforms – 2.4.1 Architecture of an Ideal Parallel Computer – 2.4.2 Interconnection Networks for Parallel Computers – 2.4.3 Network Topologies – 2.4.4 Evaluating Static Interconnection Networks – 2.4.5 Evaluating Dynamic Interconnection Networks – 2.4.6 Cache Coherence in Multiprocessor Systems


Architecture of an Ideal Parallel Computer • A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM. • PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors. • Processors share a common clock but may execute different instructions in each cycle.


Architecture of an Ideal Parallel Computer • Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses (concurrent or exclusive, crossed with read or write): – Exclusive-read, exclusive-write (EREW) PRAM. – Concurrent-read, exclusive-write (CREW) PRAM. – Exclusive-read, concurrent-write (ERCW) PRAM. – Concurrent-read, concurrent-write (CRCW) PRAM.


Architecture of an Ideal Parallel Computer • Several protocols are used to resolve concurrent writes: – Common: write only if all values are identical. – Arbitrary: write the data from a randomly selected processor. – Priority: follow a pre-determined priority order. – Sum: write the sum of all data items.
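The four write-resolution protocols can be sketched as a small Python function; the resolve helper and its arguments are illustrative, and "arbitrary" here simply picks the first writer:

```python
# Resolve the values that a set of processors concurrently tries to
# write to a single memory cell, under one of the four protocols.
def resolve(values, protocol, priorities=None):
    if protocol == "common":
        # Write succeeds only if all processors agree on the value.
        if len(set(values)) != 1:
            raise ValueError("common protocol: conflicting values")
        return values[0]
    if protocol == "arbitrary":
        return values[0]  # any one writer wins; pick the first here
    if protocol == "priority":
        # Lowest priority number wins (a pre-determined order).
        winner = min(range(len(values)), key=lambda i: priorities[i])
        return values[winner]
    if protocol == "sum":
        return sum(values)
    raise ValueError("unknown protocol")

print(resolve([5, 5, 5], "common"))                          # 5
print(resolve([3, 7, 9], "priority", priorities=[2, 0, 1]))  # 7
print(resolve([1, 2, 3], "sum"))                             # 6
```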


Interconnection Networks for Parallel Computers • Interconnection networks carry data between processors and to memory. • Interconnects are made of switches and links (wires, fiber). • Interconnects are classified as static or dynamic.


Interconnection Networks for Parallel Computers • Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks. • Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.


Static and Dynamic Interconnection Networks


Interconnection Networks • Switches map a fixed number of inputs to outputs. – provide support for internal buffering (when the requested output port is busy), – routing (to alleviate congestion on the network), and – multicast (same output on multiple ports).

• The total number of ports on a switch is the degree of the switch.




Network Topologies: Buses • A bus-based network is perhaps the simplest network: it consists of a shared medium that is common to all the nodes. • All processors access a common bus for exchanging data. • The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium. • However, the bandwidth of the shared bus is a major bottleneck.


Network Topologies: Buses

Bus-based interconnects (a) with no local caches; (b) with local memory/caches.

Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Buses, Example • p processors share a bus to memory. Each processor accesses k data items, and each data access takes time t_cycle. • Execution time lower bound: t_cycle × k × p seconds.


Network Topologies: Buses, Example • p processors share a bus to memory. Each processor accesses k data items, and each data access takes time t_cycle. • Execution time lower bound: t_cycle × k × p seconds. • Now assume that 50% of the memory accesses (0.5k) are made to local data. • Execution time lower bound: 0.5 × t_cycle × k (local data accesses) + 0.5 × t_cycle × k × p (shared data accesses).
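A worked instance of these two lower bounds, with illustrative values for p, k, and t_cycle (the lecture fixes none of them):

```python
# p processors, k accesses per processor, t_cycle seconds per access.
p, k, t_cycle = 8, 1000, 1e-8

# Every access goes over the shared bus, so all k*p accesses serialize:
all_shared = t_cycle * k * p

# 50% of accesses hit local memory (and proceed concurrently on all
# processors), while the other 50% still serialize on the bus:
half_local = 0.5 * t_cycle * k + 0.5 * t_cycle * k * p

print(all_shared, half_local)
```

With these numbers the local memory nearly halves the bound, which is the point of the slide: caching local data relieves the bus bottleneck, but the shared fraction still scales with p.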


Network Topologies: Crossbars A crossbar network uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p processors to b memory banks.


Network Topologies: Crossbars • The cost of a crossbar of p processors grows as O(p2). • This is generally difficult to scale for large values of p. • Crossbars have excellent performance scalability but poor cost scalability. • Buses have excellent cost scalability, but poor performance scalability.


Network Topologies: Completely Connected Network • Each processor is connected to every other processor. • The number of links in the network scales as O(p2). • While the performance scales very well, the hardware complexity is not realizable for large values of p. • In this sense, these networks are static counterparts of crossbars.
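The O(p2) link growth made concrete: a completely connected network of p nodes needs p*(p-1)/2 bidirectional links (num_links is an illustrative helper, not a standard function):

```python
# Number of links in a completely connected network of p nodes:
# each of the p nodes links to the other p-1, and each link is
# shared by two nodes, giving p*(p-1)/2.
def num_links(p):
    return p * (p - 1) // 2

print(num_links(8))     # 28 links for the 8-node example
print(num_links(1024))  # 523776: quadratic growth makes large p impractical
```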


Network Topologies: Completely Connected and Star Connected Networks

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.


Network Topologies: Star Connected Network • Every node is connected only to a common node at the center. • Distance between any pair of nodes is O(1). However, the central node becomes a bottleneck. • In this sense, star connected networks are static counterparts of buses.


Network Topologies: Linear Arrays, Meshes, and k-d Meshes • In a linear array, each node has two neighbors, one to its left and one to its right. • If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.


Network Topologies: Linear Arrays, Meshes, and k-d Meshes • A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. • A further generalization to d dimensions has nodes with 2d neighbors. • A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
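The hypercube structure can be sketched directly: label the p = 2^d nodes with d-bit integers; neighbors differ in exactly one bit, so each node has d = log2(p) neighbors (hypercube_neighbors is an illustrative helper):

```python
# Neighbors of a node in a d-dimensional hypercube: flip each of the
# d bits of its label in turn.
def hypercube_neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

p, d = 16, 4  # 16 nodes -> 4-dimensional hypercube
print(hypercube_neighbors(0, d))            # [1, 2, 4, 8]
print(len(hypercube_neighbors(5, d)) == d)  # True: degree equals log2(p)
```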


Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.


Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.


Network Topologies: Tree Properties • The distance between any two nodes is no more than 2 log p. • Links higher up the tree potentially carry more traffic than those at the lower levels.
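The 2 log p bound can be checked with a small sketch using 1-based heap labels for a complete binary tree, where node i's parent is i // 2 (tree_distance is an illustrative helper):

```python
import math

# Distance between two nodes of a complete binary tree: walk the
# deeper node up until the labels meet at the common ancestor.
def tree_distance(a, b):
    hops = 0
    while a != b:
        if a > b:
            a //= 2  # step toward the root
        else:
            b //= 2
        hops += 1
    return hops

p = 15  # complete binary tree with 15 nodes (depth 3)
print(tree_distance(8, 15))  # 6: up to the root and back down
print(tree_distance(8, 15) <= 2 * math.log2(p + 1))  # True
```

The worst case pairs two leaves in opposite subtrees of the root, giving log p hops up plus log p hops down, which is exactly the bound on the slide.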


Contacts High Performance Computing, 2016/2017 Dr. Mohammed Abdel-Megeed M. Salem Faculty of Computer and Information Sciences, Ain Shams University Abbassia, Cairo, Egypt Tel.: +2 011 1727 1050 Email: [email protected] Web: https://sites.google.com/a/fcis.asu.edu.eg/salem

