High Performance Computing
By:
Prof. Sumit S. Shinde PICT, Pune
INTRODUCTION: Computer Architecture Basics Computer Generations Elements of Modern Computing System Serial Computers Vs Parallel Computers
Parallelism Bit Level Parallelism (Word Size) Instruction Level Parallelism (ILP) Data Parallelism (Data Distribution) Task Parallelism (Thread Distribution)
Dependencies
Let Pi and Pj be two program segments. Bernstein's conditions describe when the two are independent and can be executed in parallel. For Pi, let Ii be the set of all input variables and Oi the set of output variables, and likewise for Pj. Pi and Pj are independent if they satisfy: Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, and Oi ∩ Oj = ∅.
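The three conditions above can be checked mechanically once each segment's input and output variable sets are known. A minimal sketch (the segments and variable names are illustrative, not from the source):

```python
# Check Bernstein's conditions for two program segments, each
# described by its set of input and output variables.

def bernstein_independent(I_i, O_i, I_j, O_j):
    """True if Pi and Pj can run in parallel:
    Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, and Oi ∩ Oj = ∅."""
    return not (I_i & O_j) and not (I_j & O_i) and not (O_i & O_j)

# P1: c = a + b  (reads a, b; writes c)
# P2: d = a * 2  (reads a;    writes d)
print(bernstein_independent({"a", "b"}, {"c"}, {"a"}, {"d"}))  # True

# P3: a = c + 1  (reads c, which P1 writes -> flow dependence)
print(bernstein_independent({"a", "b"}, {"c"}, {"c"}, {"a"}))  # False
```

Here the second pair fails both the first condition (P1 reads a, which P3 writes) and the flow-dependence condition (P3 reads c, which P1 writes), so the segments must run in order.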
Parallel Computing Clock rates of computers increased from MHz to GHz. Processors are capable of executing multiple instructions in the same cycle. A wide variety of parallel platforms is currently available, differing in cost, performance and application requirements.
Motivating Parallelism Computational Power Argument - from transistors to FLOPs Memory/Disk Speed Argument - the gap between processor speed and memory speed presents a tremendous performance bottleneck. Data Communication Argument
Scope of Parallel Computing Applications in engineering and design - design of airfoils, internal combustion engines, high-speed circuits. Scientific applications - bioinformatics; analyzing biological sequences with a view to developing new drugs and cures for diseases requires innovative algorithms and large-scale computational power.
Advancements in physics & chemistry have resulted in the design of new materials and the understanding of chemical pathways. - Weather modeling, mineral prospecting, flood prediction, etc. Commercial Applications: - Availability of large-scale transaction data has sparked considerable interest in data mining and analysis for optimizing business and marketing decisions. - Use of effective parallel algorithms for problems such as clustering and time-series analysis.
Applications in Computer Systems: network intrusion detection, cryptography. Embedded systems increasingly rely on distributed control algorithms.
Organization and contents of Text
Parallel Programming Platforms:
Traditional logical view of a sequential computer: processor, memory and datapath; each presents a bottleneck. Architectural innovations have addressed these bottlenecks. The objective is to provide sufficient detail for programmers to write efficient code on a variety of platforms, to develop cost models for quantifying the performance of various parallel algorithms, and to optimize the serial performance of codes before attempting parallelization.
Implicit Parallelism: Trends in Microprocessor Architecture. Improvements in clock speed give cost-effective performance gains, but increments in clock speed are severely diluted by the limitations of memory technology. Higher levels of device integration yield large transistor counts (the issue is how to utilize them).
Pipelining and superscalar execution Stages in instruction execution (fetch, schedule, decode, operand fetch, execute, store). Ex.: an assembly line. - The penalty of misprediction is flushing out a large number of instructions. - The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution.
True Data Dependency The results of an instruction may be required by a subsequent instruction.
Resource Dependency Two instructions compete for a single processor resource.
Branch Dependency Scheduling instructions a priori across branches may lead to errors.
Vertical Waste If no instructions are issued on the execution units during a cycle, it is referred to as vertical waste.
Horizontal Waste If only some of the execution units are used during a cycle, it is termed horizontal waste.
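How true data dependencies limit superscalar issue can be sketched with a toy scheduler: instructions are packed into issue cycles of a fixed width, and an instruction that reads a register written by a not-yet-completed earlier instruction must wait. The three-operand instruction format and register names below are illustrative assumptions; only true data dependencies are modeled (no anti/output dependencies, no forwarding).

```python
# Toy out-of-order issue model for a width-limited superscalar core.
# instrs: list of (dest, src1, src2) register tuples.

def schedule(instrs, width=2):
    """Return a list of cycles; each cycle lists the indices issued
    together. An instruction waits while any earlier instruction
    that writes one of its sources has not yet completed."""
    cycles, done = [], set()
    remaining = list(range(len(instrs)))
    while remaining:
        issued = []
        for i in remaining:
            if len(issued) == width:
                break  # issue width exhausted (horizontal limit)
            dest, src1, src2 = instrs[i]
            # true data dependency on an uncompleted earlier instruction?
            blocked = any(instrs[j][0] in (src1, src2)
                          for j in range(i) if j not in done)
            if not blocked:
                issued.append(i)
        cycles.append(issued)
        for i in issued:
            done.add(i)
            remaining.remove(i)
    return cycles

code = [
    ("r1", "r0", "r0"),  # 0
    ("r2", "r1", "r0"),  # 1: depends on 0
    ("r3", "r0", "r0"),  # 2: independent of 0 and 1
    ("r4", "r2", "r3"),  # 3: depends on 1 and 2
]
print(schedule(code))  # [[0, 2], [1], [3]]
```

Instruction 1 cannot pair with instruction 0 (true dependency on r1), so the second issue slot of cycle 1 is filled by instruction 2; cycle 3 issues only one instruction, an example of horizontal waste.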
Very long instruction word processors
VLIW processors rely on the compiler to resolve dependencies and resource availability at compile time. Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word to be executed on multiple functional units at the same time. Loop unrolling, branch prediction and speculative execution all play an important role in VLIW processors.
Limitations of Memory System Performance (latency, bandwidth) Improving effective memory latency using caches - hit ratio - memory-bound computation. Impact of memory bandwidth - increase the size of memory blocks - cache line. Alternate approaches for hiding memory latency (prefetching, multithreading, spatial locality)
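The effect of cache lines and spatial locality can be made concrete with a simple counting model: traversing a row-major n x n matrix row by row fetches each cache line once and uses every word in it, while a column-major traversal (with a cache too small to hold n lines) misses on every access. The parameter values below are illustrative assumptions, not from the source.

```python
# Toy cache-miss model for one full pass over an n x n row-major
# matrix, with `line_words` matrix elements per cache line.

def traversal_misses(n, line_words):
    """Return (row_major_misses, column_major_misses) under an
    idealized model: row-major reuse of each fetched line vs.
    column-major eviction of every line before reuse."""
    row_major = (n * n) // line_words  # one miss per line, fully used
    column_major = n * n               # every access fetches a new line
    return row_major, column_major

print(traversal_misses(1024, 8))  # (131072, 1048576)
```

Under these assumptions the column-major traversal suffers `line_words` times more misses, which is why loop orderings that follow the storage layout matter for memory-bound computations.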
Dichotomy of Parallel Computing Platforms (physical and logical views) Control structure of a parallel platform a. SIMD b. MIMD
SIMD Allows for an 'activity mask' that controls whether each processing element participates in an operation. Requires less hardware and less memory, but requires extensive design effort.
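The activity-mask idea can be sketched in a few lines: one instruction is applied across all data lanes, and a boolean mask decides which lanes commit the result. Plain Python lists stand in for the vector lanes here; the operation and values are illustrative.

```python
# SIMD-style masked operation: every lane sees the same instruction,
# but only lanes with mask == True participate.

def simd_add(a, b, mask):
    """Lane-wise a + b; masked-off lanes keep their old value."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

a = [1, 2, 3, 4]
b = [10, 10, 10, 10]
mask = [True, False, True, False]  # e.g. result of a lane-wise compare
print(simd_add(a, b, mask))        # [11, 2, 13, 4]
```

This is how SIMD machines handle conditionals: rather than branching per element, both sides of a branch are executed with complementary activity masks.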
MIMD - Each processor executes a different program independently. - A variant is Single Program Multiple Data (SPMD). - Requires more hardware and more memory than SIMD.
Communication model of parallel platforms (Shared Address Space Platforms, Uniform Memory Access, Non uniform memory access)
Uniform - If the time required to access any memory word in the system (local or global) is identical, the platform is called uniform memory access (UMA).
Non-uniform - If the time required to access certain memory words in the system is longer than others, the platform is called non-uniform memory access (NUMA).
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
- Parallel Random Access Machine (PRAM) - EREW - CREW - ERCW - CRCW Concurrent-write resolution protocols: - Common - Arbitrary - Priority - Sum
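The four concurrent-write protocols of a CRCW PRAM differ only in how simultaneous writes to the same cell are merged. A minimal sketch, where each write is a `(processor_id, value)` pair (the representation is an illustrative assumption):

```python
# Resolve a set of concurrent writes to one memory cell in one
# CRCW PRAM step, under each of the four standard protocols.

def resolve(writes, protocol):
    values = [v for _, v in writes]
    if protocol == "common":
        # valid only if all processors attempt the same value
        assert len(set(values)) == 1, "common: values must agree"
        return values[0]
    if protocol == "arbitrary":
        return values[0]       # any one write may win; pick the first
    if protocol == "priority":
        return min(writes)[1]  # lowest processor id wins
    if protocol == "sum":
        return sum(values)
    raise ValueError(protocol)

writes = [(2, 5), (0, 3), (1, 5)]
print(resolve(writes, "priority"))  # 3 (processor 0 has top priority)
print(resolve(writes, "sum"))       # 13
```

The choice of protocol affects what algorithms a CRCW model can run in one step; for example, Sum-CRCW can count contributors to a cell in O(1).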
Interconnection Networks for Parallel Computers a. Static network (direct n/w) b. Dynamic network (indirect n/w)
Functionality of Switches: -Degree of switch -Mapping from input to output ports -routing, multicast, internal buffering
Cost of Switches - mapping hardware grows as the square of the degree - packaging cost grows with the number of pins
Network Topologies: Bus Based Networks:
Scalable in terms of cost, unscalable in terms of performance
CrossBar Networks:
Scalable in terms of performance but unscalable in terms of cost
Multistage Networks:
More scalable than the bus in terms of performance and more scalable than the crossbar in terms of cost.
Omega Network
Perfect Shuffle Pass Through Cross-Over
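The perfect-shuffle wiring of an Omega network with p = 2^k inputs sends input i to the output whose k-bit binary representation is a one-bit left rotation of i. A short sketch of that mapping:

```python
# Perfect-shuffle connection for an Omega network stage with
# 2**k endpoints: left-rotate the k-bit representation of i.

def perfect_shuffle(i, k):
    """Left-rotate the k-bit binary representation of i by one bit."""
    msb = (i >> (k - 1)) & 1
    return ((i << 1) & ((1 << k) - 1)) | msb

k = 3  # 8 inputs/outputs
print([perfect_shuffle(i, k) for i in range(1 << k)])
# [0, 2, 4, 6, 1, 3, 5, 7]
```

Each of the log p stages applies this shuffle followed by 2x2 switches that either pass the two inputs through or cross them over, which is where the pass-through/cross-over terminology comes from.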
Completely- connected Network
Linear Arrays, Meshes, and k-d Meshes
Tree Based Networks A] static tree B] dynamic tree
-Fat tree
Communication costs in parallel machines
Message passing costs in parallel computers Startup time (ts): time required to handle a message at the sending and receiving nodes. Per-hop time (th) (node latency): time taken by the header to travel between two directly connected nodes. Per-word transfer time (tw): if the channel bandwidth is r words per second, each word takes time tw = 1/r to traverse the link.
Store and Forward Routing Each intermediate node receives and stores the entire message before forwarding it. For a message of m words traversing l links, t_comm = ts + (m·tw + th)·l.
Packet Routing
Message is broken into packets, each carrying error, routing and sequencing fields. - Packet size is (r + s), where r words come from the original message and s is the additional information. - Packetizing takes time m·tw1, proportional to the length m of the message. - A packet takes th·l + tw2(r + s) time to reach the destination. - The destination receives the remaining m/r − 1 packets, one every tw2(r + s) seconds. - Total communication time: t_comm = ts + th·l + tw·m, where tw = tw1 + tw2(1 + s/r).
Cut- Through Routing
Message is broken into fixed-size units called flow control digits, or flits. Flits do not carry the per-packet overheads of routing and sequencing information; for a message of m words over l links, t_comm = ts + l·th + m·tw.
Levels of parallelism Instruction Level Parallelism (ILP) Transaction Level Parallelism Task Level/Thread Level Parallelism Memory Parallelism - the ability to have multiple pending memory operations, such as cache misses or TLB misses. Function Parallelism - achieved by applying different operations to different data elements simultaneously.
Parallel Processing Models SIMD Model - A single instruction stream is broadcast. - It is a class of parallel computer. - Exploits data-level parallelism. - Ex. adjusting the volume of digital audio, contrast in a digital image. MIMD Model SPMD Model
Dataflow Model Static Approach
Manchester Dynamic Approach
Demand Driven Computation
Parallel Processing Architectures N-Wide Superscalar Architecture Multi Core Architecture Multi Threaded Processors - Advantages - Disadvantages