High Performance Computing
By:
Prof. Sumit S. Shinde PICT, Pune
INTRODUCTION: Computer Architecture Basics Computer Generations Elements of Modern Computing System Serial Computers Vs Parallel Computers
Parallelism Bit Level Parallelism (Word Size) Instruction Level Parallelism (ILP) Data Parallelism (Data Distribution) Task Parallelism (Thread Distribution)
Dependencies
Let Pi and Pj be two program segments. Bernstein's conditions describe when the two are independent and can be executed in parallel. For Pi, let Ii be the set of all input variables and Oi the set of output variables, and likewise for Pj. Pi and Pj are independent if they satisfy: Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, and Oi ∩ Oj = ∅.
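The three conditions above can be checked mechanically once each segment's input and output variable sets are known. A minimal sketch (the segments and variable names are illustrative, not from the source):

```python
# Check Bernstein's conditions for two program segments, each
# described by its set of input and output variables.

def bernstein_independent(I_i, O_i, I_j, O_j):
    """True if Pi and Pj can run in parallel:
    Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, and Oi ∩ Oj = ∅."""
    return not (I_i & O_j) and not (I_j & O_i) and not (O_i & O_j)

# P1: c = a + b  (reads a, b; writes c)
# P2: d = a * 2  (reads a;    writes d)
print(bernstein_independent({"a", "b"}, {"c"}, {"a"}, {"d"}))  # True

# P3: a = c + 1  (reads c, which P1 writes -> flow dependence)
print(bernstein_independent({"a", "b"}, {"c"}, {"c"}, {"a"}))  # False
```

Here the second pair fails both the first condition (P1 reads a, which P3 writes) and the flow-dependence condition (P3 reads c, which P1 writes), so the segments must run in order.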
Parallel Computing Clock rates of computers increased from MHz to GHz. Processors are capable of executing multiple instructions in the same cycle. A wide variety of parallel platforms is currently available, differing in cost, performance and application requirements.
Motivating Parallelism Computational Power Argument - from transistors to FLOPs Memory/Disk Speed Argument - the gap between processor speed and memory speed presents a tremendous performance bottleneck. Data Communication Argument
Scope of Parallel Computing Applications in engineering and design - design of airfoils, internal combustion engines, high-speed circuits. Scientific applications - bioinformatics; analyzing biological sequences with a view to developing new drugs and cures for diseases requires innovative algorithms and large-scale computational power.
Advancements in physics & chemistry have resulted in the design of new materials and the understanding of chemical pathways. - Weather modeling, mineral prospecting, flood prediction, etc. Commercial Applications: - Availability of large-scale transaction data has sparked considerable interest in data mining and analysis for optimizing business and marketing decisions. - Use of effective parallel algorithms for problems such as clustering and time-series analysis.
Applications in Computer Systems: network intrusion detection, cryptography. Embedded systems increasingly rely on distributed control algorithms.
Organization and contents of Text
Parallel Programming Platforms:
Traditional logical view of a sequential computer: processor, memory and datapath; each presents a bottleneck. Architectural innovations have addressed these bottlenecks. The objective is to provide sufficient detail for programmers to write efficient code on a variety of platforms, to develop cost models for quantifying the performance of various parallel algorithms, and to optimize the serial performance of codes before attempting parallelization.
Implicit Parallelism: Trends in Microprocessor Architecture. Improvements in clock speed give cost-effective performance gains, but increments in clock speed are severely diluted by the limitations of memory technology. Higher levels of device integration yield large transistor counts (the issue is how to utilize them).
Pipelining and superscalar execution Stages in instruction execution (fetch, schedule, decode, operand fetch, execute, store). Ex.: an assembly line. - The penalty of misprediction is flushing out a large number of instructions. - The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution.
True Data Dependency The results of an instruction may be required by a subsequent instruction.
Resource Dependency Two instructions compete for a single processor resource.
Branch Dependency Scheduling instructions a priori across branches may lead to errors.
Vertical Waste If no instructions are issued on the execution units during a cycle, it is referred to as vertical waste.
Horizontal Waste If only some of the execution units are used during a cycle, it is termed horizontal waste.
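How true data dependencies limit superscalar issue can be sketched with a toy scheduler: instructions are packed into issue cycles of a fixed width, and an instruction that reads a register written by a not-yet-completed earlier instruction must wait. The three-operand instruction format and register names below are illustrative assumptions; only true data dependencies are modeled (no anti/output dependencies, no forwarding).

```python
# Toy out-of-order issue model for a width-limited superscalar core.
# instrs: list of (dest, src1, src2) register tuples.

def schedule(instrs, width=2):
    """Return a list of cycles; each cycle lists the indices issued
    together. An instruction waits while any earlier instruction
    that writes one of its sources has not yet completed."""
    cycles, done = [], set()
    remaining = list(range(len(instrs)))
    while remaining:
        issued = []
        for i in remaining:
            if len(issued) == width:
                break  # issue width exhausted (horizontal limit)
            dest, src1, src2 = instrs[i]
            # true data dependency on an uncompleted earlier instruction?
            blocked = any(instrs[j][0] in (src1, src2)
                          for j in range(i) if j not in done)
            if not blocked:
                issued.append(i)
        cycles.append(issued)
        for i in issued:
            done.add(i)
            remaining.remove(i)
    return cycles

code = [
    ("r1", "r0", "r0"),  # 0
    ("r2", "r1", "r0"),  # 1: depends on 0
    ("r3", "r0", "r0"),  # 2: independent of 0 and 1
    ("r4", "r2", "r3"),  # 3: depends on 1 and 2
]
print(schedule(code))  # [[0, 2], [1], [3]]
```

Instruction 1 cannot pair with instruction 0 (true dependency on r1), so the second issue slot of cycle 1 is filled by instruction 2; cycle 3 issues only one instruction, an example of horizontal waste.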
Very long instruction word processors
VLIW processors rely on the compiler to resolve dependencies and resource availability at compile time. Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word to be executed on multiple functional units at the same time. Loop unrolling, branch prediction and speculative execution all play an important role in VLIW processors.
Limitations of Memory System Performance (latency, bandwidth) Improving effective memory latency using caches - hit ratio - memory-bound computation. Impact of memory bandwidth - increase the size of memory blocks - cache line. Alternate approaches for hiding memory latency (prefetching, multithreading, spatial locality)
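The effect of cache lines and spatial locality can be made concrete with a simple counting model: traversing a row-major n x n matrix row by row fetches each cache line once and uses every word in it, while a column-major traversal (with a cache too small to hold n lines) misses on every access. The parameter values below are illustrative assumptions, not from the source.

```python
# Toy cache-miss model for one full pass over an n x n row-major
# matrix, with `line_words` matrix elements per cache line.

def traversal_misses(n, line_words):
    """Return (row_major_misses, column_major_misses) under an
    idealized model: row-major reuse of each fetched line vs.
    column-major eviction of every line before reuse."""
    row_major = (n * n) // line_words  # one miss per line, fully used
    column_major = n * n               # every access fetches a new line
    return row_major, column_major

print(traversal_misses(1024, 8))  # (131072, 1048576)
```

Under these assumptions the column-major traversal suffers `line_words` times more misses, which is why loop orderings that follow the storage layout matter for memory-bound computations.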
Dichotomy of Parallel Computing Platforms (physical and logical views) Control structure of a parallel platform a. SIMD b. MIMD
SIMD Allows for an 'activity mask' that controls whether each processing element participates in an operation. Requires less hardware and less memory, but requires extensive design effort.
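The activity-mask idea can be sketched in a few lines: one instruction is applied across all data lanes, and a boolean mask decides which lanes commit the result. Plain Python lists stand in for the vector lanes here; the operation and values are illustrative.

```python
# SIMD-style masked operation: every lane sees the same instruction,
# but only lanes with mask == True participate.

def simd_add(a, b, mask):
    """Lane-wise a + b; masked-off lanes keep their old value."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

a = [1, 2, 3, 4]
b = [10, 10, 10, 10]
mask = [True, False, True, False]  # e.g. result of a lane-wise compare
print(simd_add(a, b, mask))        # [11, 2, 13, 4]
```

This is how SIMD machines handle conditionals: rather than branching per element, both sides of a branch are executed with complementary activity masks.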
MIMD - Each processor executes a different program independently. - A variant is Single Program Multiple Data (SPMD). - Requires more hardware and more memory than SIMD.
Communication model of parallel platforms (Shared Address Space Platforms, Uniform Memory Access, Non uniform memory access)
Uniform - If the time required to access any memory word in the system (local or global) is identical, the platform is called uniform memory access (UMA).
Non-uniform - If the time required to access certain memory words in the system is longer than others, the platform is called non-uniform memory access (NUMA).
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
- Parallel Random Access Machine (PRAM) - EREW - CREW - ERCW - CRCW Concurrent-write resolution protocols: - Common - Arbitrary - Priority - Sum
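The four concurrent-write protocols of a CRCW PRAM differ only in how simultaneous writes to the same cell are merged. A minimal sketch, where each write is a `(processor_id, value)` pair (the representation is an illustrative assumption):

```python
# Resolve a set of concurrent writes to one memory cell in one
# CRCW PRAM step, under each of the four standard protocols.

def resolve(writes, protocol):
    values = [v for _, v in writes]
    if protocol == "common":
        # valid only if all processors attempt the same value
        assert len(set(values)) == 1, "common: values must agree"
        return values[0]
    if protocol == "arbitrary":
        return values[0]       # any one write may win; pick the first
    if protocol == "priority":
        return min(writes)[1]  # lowest processor id wins
    if protocol == "sum":
        return sum(values)
    raise ValueError(protocol)

writes = [(2, 5), (0, 3), (1, 5)]
print(resolve(writes, "priority"))  # 3 (processor 0 has top priority)
print(resolve(writes, "sum"))       # 13
```

The choice of protocol affects what algorithms a CRCW model can run in one step; for example, Sum-CRCW can count contributors to a cell in O(1).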
Interconnection Networks for Parallel Computers a. Static network (direct n/w) b. Dynamic network (indirect n/w)
Functionality of Switches: -Degree of switch -Mapping from input to output ports -routing, multicast, internal buffering
Cost of Switches - mapping hardware grows as the square of the degree - packaging cost grows with the number of pins
Network Topologies: Bus Based Networks:
Scalable in terms of cost, unscalable in terms of performance
CrossBar Networks:
Scalable in terms of performance but unscalable in terms of cost
Multistage Networks:
More scalable than the bus in terms of performance and more scalable than the crossbar in terms of cost.
Omega Network
Perfect Shuffle Pass Through Cross-Over
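The perfect-shuffle wiring of an Omega network with p = 2^k inputs sends input i to the output whose k-bit binary representation is a one-bit left rotation of i. A short sketch of that mapping:

```python
# Perfect-shuffle connection for an Omega network stage with
# 2**k endpoints: left-rotate the k-bit representation of i.

def perfect_shuffle(i, k):
    """Left-rotate the k-bit binary representation of i by one bit."""
    msb = (i >> (k - 1)) & 1
    return ((i << 1) & ((1 << k) - 1)) | msb

k = 3  # 8 inputs/outputs
print([perfect_shuffle(i, k) for i in range(1 << k)])
# [0, 2, 4, 6, 1, 3, 5, 7]
```

Each of the log p stages applies this shuffle followed by 2x2 switches that either pass the two inputs through or cross them over, which is where the pass-through/cross-over terminology comes from.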
Completely- connected Network
Linear Arrays, Meshes, and k-d Meshes
Tree Based Networks A] static tree B] dynamic tree
-Fat tree
Communication costs in parallel machines
Message passing costs in parallel computers Startup time (ts): time required to handle a message at the sending and receiving nodes. Per-hop time (th) (node latency): time taken by the header to travel between two directly connected nodes. Per-word transfer time (tw): if the channel bandwidth is r words per second, each word takes time tw = 1/r to traverse the link.
Store and Forward Routing Each intermediate node receives and stores the entire message before forwarding it. For a message of m words traversing l links, t_comm = ts + (m·tw + th)·l.
Packet Routing
Message is broken into packets, each carrying error, routing and sequencing fields. - Packet size is (r + s), where r words come from the original message and s is the additional information. - Packetizing takes time m·tw1, proportional to the length m of the message. - A packet takes th·l + tw2(r + s) time to reach the destination. - The destination receives the remaining m/r − 1 packets, one every tw2(r + s) seconds. - Total communication time: t_comm = ts + th·l + tw·m, where tw = tw1 + tw2(1 + s/r).
Cut- Through Routing
Message is broken into fixed-size units called flow control digits, or flits. Flits do not carry the per-packet overheads of routing and sequencing information; for a message of m words over l links, t_comm = ts + l·th + m·tw.
Levels of parallelism Instruction Level Parallelism (ILP) Transaction Level Parallelism Task Level/Thread Level Parallelism Memory Parallelism - the ability to have multiple pending memory operations, such as cache misses or TLB misses. Function Parallelism - achieved by applying different operations to different data elements simultaneously.
Parallel Processing Models SIMD Model - A single instruction stream is broadcast. - It is a class of parallel computer. - Exploits data-level parallelism. - Ex. adjusting the volume of digital audio, contrast in a digital image. MIMD Model SPMD Model
Dataflow Model Static Approach
Manchester Dynamic Approach
Demand Driven Computation
Parallel Processing Architectures N-Wide Superscalar Architecture Multi Core Architecture Multi Threaded Processors - Advantages - Disadvantages