High Performance Computing For senior undergraduate students
Lecture 5: Communication Model 01.11.2016
Dr. Mohammed Abdel-Megeed Salem Scientific Computing Department Faculty of Computer and Information Sciences Ain Shams University
Outline • 2.3 Dichotomy of Parallel Computing Platforms – 2.3.1 Control Structure of Parallel Platforms – 2.3.2 Communication Model of Parallel Platforms • Shared-Address-Space Platforms • Message-Passing Platforms
Dr. Mohammed Abdel-Megeed Salem
High Performance Computing 2016/ 2017
Lecture 5
2
Communication Models • Forms of data exchange between parallel tasks: – accessing a shared data space, and – exchanging messages.
• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. • Platforms that support messaging are called message-passing platforms or multicomputers.
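To make the two models concrete, here is a minimal Python sketch (an illustration only, not a parallel-platform implementation): two threads update one object that lives in a common address space, while two other threads exchange a value only through an explicit message channel (a `Queue` standing in for a network link).

```python
import threading
import queue

# Shared-address-space style: both workers modify one object in a common
# address space (plain Python threads sharing `counter`).
counter = {"value": 0}
lock = threading.Lock()

def shared_worker():
    with lock:                      # coordinate access to the shared object
        counter["value"] += 1

# Message-passing style: no shared data; the value moves only as a message.
channel = queue.Queue()

def sender():
    channel.put(42)                 # explicit "send"

def receiver(result):
    result.append(channel.get())    # explicit "receive"

received = []
threads = [threading.Thread(target=shared_worker) for _ in range(2)]
threads += [threading.Thread(target=sender),
            threading.Thread(target=receiver, args=(received,))]
for t in threads: t.start()
for t in threads: t.join()

print(counter["value"])  # 2: both updates landed in the shared object
print(received)          # [42]: the value arrived only via the message
```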
Shared-Address-Space Platforms • The "shared-address-space" view supports a common data space that is accessible to all processors. • Processors interact by modifying data objects stored in this shared address space. • Shared-address-space platforms may also support SIMD execution.
Shared-Address-Space Platforms
• Typical shared-address-space architectures: – (a) Uniform-memory-access shared-address-space computer; – (b) Uniform-memory-access shared-address-space computer with caches and memories; – (c) Non-uniform-memory-access shared-address-space computer with local memory only.
Uniform vs Non-Uniform Memory Access • Memory can be local (exclusive to a processor) or global (common to all processors). • If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) multiprocessor. • On the other hand, if the time taken to access certain memory words is longer than others, the platform is called a non-uniform memory access (NUMA) multiprocessor.
Control Structures
(Figure slides; diagrams not reproduced in this text.)
Uniform vs Non-Uniform Memory Access • Algorithm design for NUMA machines requires attention to data locality to achieve good performance. • Programming NUMA platforms: – reads are easy, since data are implicitly visible to other processors; – writes to shared data require mutual exclusion.
• Caches in NUMA machines require coordinated access to multiple copies of data. This leads to the cache coherence problem.
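The need for mutual exclusion on shared writes can be sketched in a few lines of Python (threads standing in for processors; the specific counts are illustrative): each thread performs a read-modify-write on a shared total, and the lock makes that sequence atomic with respect to the other threads.

```python
import threading

# 4 threads each add 1 to a shared total 25,000 times. The lock serializes
# the read-modify-write of `total`; without it, two threads could read the
# same old value and one of the two increments would be lost.
total = 0
lock = threading.Lock()

def worker():
    global total
    for _ in range(25_000):
        with lock:          # mutual exclusion around the shared write
            total += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(total)  # 100000: no update was lost
```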
Shared Address Space vs Shared Memory Machines • It is important to note the difference between the terms shared address space and shared memory. • We refer to the former as a programming abstraction and to the latter as a physical machine attribute. • It is possible to provide a shared address space using a physically distributed memory.
Message-Passing Platforms • A set of processors, each with its own (exclusive) memory. • Memory addresses in one processor do not map to another processor: there is no global address space. • Changes made to local memory have no effect on the memory of other processors: no cache coherence is needed across nodes.
• Instances of such a view arise naturally from clustered workstations and non-shared-address-space multicomputers.
Message-Passing Platforms • Interactions between different nodes are accomplished using messages sent over a network. • A message may carry data, work, or synchronization actions. • Programmed using (variants of) send and receive primitives, together with identification primitives {GetID, NumProcs}. • Libraries such as MPI and PVM provide such primitives.
Message-Passing Platforms • Interactions are accomplished by sending and receiving messages; the basic operations are send and receive. • Since the send and receive operations must specify target addresses, there must be a mechanism to assign an ID to each of the multiple processes executing a parallel program. • This ID is typically made available to the program using a function such as whoami. • One other function that is typically needed is numprocs, which specifies the number of processes participating in the ensemble.
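A minimal sketch of this model in Python (threads standing in for processes, and a per-rank mailbox queue standing in for the network; `whoami`/`numprocs` mirror the primitives named above, which real libraries expose as e.g. MPI's `MPI_Comm_rank`/`MPI_Comm_size`):

```python
import threading
import queue

NUMPROCS = 4
mailboxes = [queue.Queue() for _ in range(NUMPROCS)]  # one inbox per rank

def send(dest, msg):
    mailboxes[dest].put(msg)          # explicit send to a target address

def receive(whoami):
    return mailboxes[whoami].get()    # blocking receive on own inbox

results = [None] * NUMPROCS

def worker(whoami):
    if whoami == 0:
        # rank 0 sends a token to every other rank
        for dest in range(1, NUMPROCS):
            send(dest, f"hello rank {dest}")
        results[0] = "sent"
    else:
        results[whoami] = receive(whoami)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUMPROCS)]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # ['sent', 'hello rank 1', 'hello rank 2', 'hello rank 3']
```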
Message-Passing Platforms Advantages: • Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately. • Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency. • Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Message-Passing Platforms Disadvantages: • The programmer is responsible for many of the details associated with data communication between processors. • It may be difficult to map existing data structures, based on global memory, to this memory organization. • Memory access is non-uniform: data residing on a remote node takes longer to access than local data (NUMA behavior).
Message Passing vs Shared address • Message passing requires little hardware support, other than a network. • Shared address space platforms can easily emulate message passing.
Outline • 2.4 Physical Organization of Parallel Platforms – 2.4.1 Architecture of an Ideal Parallel Computer – 2.4.2 Interconnection Networks for Parallel Computers – 2.4.3 Network Topologies – 2.4.4 Evaluating Static Interconnection Networks – 2.4.5 Evaluating Dynamic Interconnection Networks – 2.4.6 Cache Coherence in Multiprocessor Systems
Architecture of an Ideal Parallel Computer • A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM. • PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors. • Processors share a common clock but may execute different instructions in each cycle.
Architecture of an Ideal Parallel Computer • Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses: (concurrent or exclusive) × (read or write). – Exclusive-read, exclusive-write (EREW) PRAM. – Concurrent-read, exclusive-write (CREW) PRAM. – Exclusive-read, concurrent-write (ERCW) PRAM. – Concurrent-read, concurrent-write (CRCW) PRAM.
Architecture of an Ideal Parallel Computer • Several protocols are used to resolve concurrent writes. – Common: write only if all values are identical. – Arbitrary: write the data from a randomly selected processor. – Priority: follow a pre-determined priority order. – Sum: Write the sum of all data items.
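The four resolution protocols can be sketched as one Python function (an illustration of the rules above, not a PRAM simulator; `writes` holds the values several processors try to write to one cell in the same cycle, ordered by processor priority):

```python
import random

def resolve_concurrent_write(writes, protocol):
    """Resolve one concurrent write to a single memory cell.

    writes: values attempted this cycle, writes[0] from the
    highest-priority processor.
    """
    if protocol == "common":
        # write succeeds only if all processors attempt the same value
        if len(set(writes)) != 1:
            raise ValueError("common protocol: conflicting values")
        return writes[0]
    if protocol == "arbitrary":
        return random.choice(writes)   # any one processor's value wins
    if protocol == "priority":
        return writes[0]               # highest-priority processor wins
    if protocol == "sum":
        return sum(writes)             # combine all attempted values
    raise ValueError(f"unknown protocol: {protocol}")

print(resolve_concurrent_write([5, 5, 5], "common"))    # 5
print(resolve_concurrent_write([3, 7, 1], "priority"))  # 3
print(resolve_concurrent_write([3, 7, 1], "sum"))       # 11
```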
Interconnection Networks for Parallel Computers • Interconnection networks carry data between processors and to memory. • Interconnects are made of switches and links (wires, fiber). • Interconnects are classified as static or dynamic.
Interconnection Networks for Parallel Computers • Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks. • Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.
Static and Dynamic Interconnection Networks
Interconnection Networks • Switches map a fixed number of inputs to outputs. – provide support for internal buffering (when the requested output port is busy), – routing (to alleviate congestion on the network), and – multicast (same output on multiple ports).
• The total number of ports on a switch is the degree of the switch.
Network Topologies: Buses • A bus-based network is perhaps the simplest network: a shared medium common to all the nodes. • All processors access the common bus to exchange data. • The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium. • The bandwidth of the shared bus is a major bottleneck.
Network Topologies: Buses
Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
Network Topologies: Buses, Example • p processors share a bus to the memory. Each processor accesses k data items, and each data access takes time t_cycle. • Execution time lower bound: t_cycle × k × p seconds. • Now assume that 50% of the memory accesses (0.5k) are made to local data. • Execution time lower bound: 0.5 × t_cycle × k (local data access) + 0.5 × t_cycle × k × p (shared data access).
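The two lower bounds can be checked numerically. A small sketch (the function names and the values of p, k, and t_cycle are made-up illustration inputs):

```python
def all_shared_bound(p, k, t_cycle):
    # every access crosses the shared bus, so all k*p accesses serialize
    return t_cycle * k * p

def half_local_bound(p, k, t_cycle):
    # 0.5k local accesses proceed in parallel on every processor;
    # the 0.5k shared accesses per processor still serialize on the bus
    return 0.5 * t_cycle * k + 0.5 * t_cycle * k * p

p, k, t = 16, 1000, 1e-9     # 16 processors, 1000 accesses, 1 ns per access
print(all_shared_bound(p, k, t))   # about 1.6e-05 s
print(half_local_bound(p, k, t))   # about 8.5e-06 s: roughly halved
```

Serving half of the accesses from local memory nearly halves the bound here, because only the shared accesses contend for the bus.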
Network Topologies: Crossbars A crossbar network uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner.
A completely non-blocking crossbar network connecting p processors to b memory banks.
Network Topologies: Crossbars • The cost of a crossbar of p processors grows as O(p²). • This is generally difficult to scale for large values of p. • Crossbars have excellent performance scalability but poor cost scalability. • Buses have excellent cost scalability, but poor performance scalability.
Network Topologies: Completely Connected Network • Each processor is connected to every other processor. • The number of links in the network scales as O(p²). • While the performance scales very well, the hardware complexity is not realizable for large values of p. • In this sense, these networks are static counterparts of crossbars.
Network Topologies: Completely Connected and Star Connected Networks
(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.
Network Topologies: Star Connected Network • Every node is connected only to a common node at the center. • Distance between any pair of nodes is O(1). However, the central node becomes a bottleneck. • In this sense, star connected networks are static counterparts of buses.
Network Topologies: Linear Arrays, Meshes, and k-d Meshes • In a linear array, each node has two neighbors, one to its left and one to its right. • If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
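The effect of the wraparound link is easy to quantify: a sketch of the shortest hop count between nodes in a linear array versus a ring (1-D torus) of p nodes (function names are ad hoc for the illustration):

```python
def linear_array_distance(i, j):
    # only one path exists: walk along the array
    return abs(i - j)

def ring_distance(i, j, p):
    d = abs(i - j)
    return min(d, p - d)   # the wraparound link may give a shorter route

print(linear_array_distance(0, 7))  # 7 hops end to end
print(ring_distance(0, 7, 8))       # 1: one hop across the wraparound link
```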
Network Topologies: Linear Arrays, Meshes, and k-d Meshes • A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. • A further generalization to d dimensions has nodes with 2d neighbors. • A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
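Hypercube connectivity can be sketched with node labels: for p = 2^d nodes, each label is a d-bit number and two nodes are neighbors exactly when their labels differ in one bit, so every node has d = log2(p) neighbors.

```python
def hypercube_neighbors(node, d):
    # flip each of the d bits in turn; each flip yields one neighbor
    return [node ^ (1 << bit) for bit in range(d)]

d = 3                                 # 3-D hypercube: p = 8 nodes
print(hypercube_neighbors(0b000, d))  # [1, 2, 4]
print(hypercube_neighbors(0b101, d))  # [4, 7, 1]
```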
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound links (2-D torus); and (c) a 3-D mesh with no wraparound.
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Network Topologies: Tree Properties • The distance between any two nodes is no more than 2 log p. • Links higher up the tree potentially carry more traffic than those at the lower levels.
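The 2 log p bound follows because a message climbs to the lowest common ancestor of the two nodes and then descends. A sketch for a complete binary tree whose p processing nodes are the leaves (labeled 0..p-1; a right-shift of the label moves one level up the tree):

```python
def tree_distance(i, j):
    # climb both leaves toward the root until they meet at their
    # lowest common ancestor; the path is `up` hops up plus `up` down
    up = 0
    while i != j:
        i >>= 1
        j >>= 1
        up += 1
    return 2 * up

print(tree_distance(0, 15))  # 8 == 2 * log2(16) for p = 16 leaves
print(max(tree_distance(a, b)
          for a in range(16) for b in range(16)))  # 8: the bound is tight
```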
Contacts High Performance Computing, 2016/2017 Dr. Mohammed Abdel-Megeed M. Salem Faculty of Computer and Information Sciences, Ain Shams University Abbassia, Cairo, Egypt Tel.: +2 011 1727 1050 Email:
[email protected] Web: https://sites.google.com/a/fcis.asu.edu.eg/salem