High Performance Computing

By:

Prof. Sumit S. Shinde PICT, Pune

INTRODUCTION:
- Computer Architecture Basics
- Computer Generations
- Elements of Modern Computing System
- Serial Computers vs. Parallel Computers

Parallelism
- Bit-level parallelism (word size)
- Instruction-level parallelism (ILP)
- Data parallelism (data distribution)
- Task parallelism (thread distribution)

Dependencies 

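Let Pi and Pj be two program segments. Bernstein's conditions describe when the two are independent and can be executed in parallel. For Pi, let Ii be the set of all input variables and Oi the set of all output variables, and likewise Ij and Oj for Pj. Pi and Pj are independent if they satisfy

- Ij ∩ Oi = ∅ (Pj does not read anything that Pi writes)
- Ii ∩ Oj = ∅ (Pi does not read anything that Pj writes)
- Oi ∩ Oj = ∅ (the two segments do not write the same variable)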
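Bernstein's conditions can be checked mechanically once the input and output sets of each segment are known. The following is a minimal sketch, assuming each segment is described only by sets of variable names; the function name `bernstein_independent` and the sample segments are hypothetical and chosen for illustration.

```python
def bernstein_independent(in_i, out_i, in_j, out_j):
    """Return True if segments Pi and Pj satisfy Bernstein's conditions."""
    in_i, out_i, in_j, out_j = map(set, (in_i, out_i, in_j, out_j))
    flow = in_j & out_i      # Pj reads something Pi writes
    anti = in_i & out_j      # Pi reads something Pj writes
    output = out_i & out_j   # both write the same variable
    return not (flow or anti or output)


# Hypothetical segments:
#   P1: c = a + b      inputs {a, b}, outputs {c}
#   P2: d = a * 2      inputs {a},    outputs {d}
#   P3: a = c - 1      inputs {c},    outputs {a}
p1_p2 = bernstein_independent({"a", "b"}, {"c"}, {"a"}, {"d"})
p1_p3 = bernstein_independent({"a", "b"}, {"c"}, {"c"}, {"a"})
print(p1_p2)  # True: P1 and P2 only share a read of a
print(p1_p3)  # False: P3 reads c (written by P1) and writes a (read by P1)
```

P1 and P2 may run in parallel because they only share an input; P1 and P3 may not, because each one writes a variable the other reads.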

Parallel Computing
- Clock rates of computers have increased from MHz to GHz.
- Processors are capable of executing multiple instructions in the same cycle.
- A wide variety of parallel platforms is currently available, differing in cost, performance and application requirements.

Motivating Parallelism
- Computational power argument: from transistors to FLOPs.
- Memory/disk speed argument: the gap between processor speed and memory speed presents a tremendous performance bottleneck.
- Data communication argument.

Scope of Parallel Computing

Applications in engineering and design:
- Design of airfoils, internal combustion engines and high-speed circuits.

Scientific applications:
- Bioinformatics: analyzing biological sequences with a view to developing new drugs and cures for diseases requires innovative algorithms and large-scale computational power.
- Advances in physics and chemistry have resulted in the design of new materials and a better understanding of chemical pathways.
- Weather modeling, mineral prospecting, flood prediction, etc.

Commercial applications:
- The availability of large-scale transaction data has sparked considerable interest in data mining and analysis for optimizing business and marketing decisions.
- Effective parallel algorithms are used for problems such as clustering and time-series analysis.

Applications in computer systems:
- Network intrusion detection.
- Cryptography.
- Embedded systems, which increasingly rely on distributed control algorithms.

Organization and contents of Text

Parallel Programming Platforms:

- The traditional logical view of a sequential computer consists of a processor, memory and a datapath; each of these presents a bottleneck.
- Architectural innovations have addressed these bottlenecks.
- The objective is to provide sufficient detail for programmers to write efficient code on a variety of platforms.
- Develop cost models for quantifying the performance of various parallel algorithms.
- Optimize the serial performance of codes before attempting parallelization.

Implicit Parallelism: Trends in Microprocessor Architecture
- Improvements in clock speed.
- Cost-effective performance gains.
- Gains from higher clock speeds are severely diluted by the limitations of memory technology.
- Higher levels of device integration provide large transistor counts; the issue is how to utilize them.

Pipelining and superscalar execution
- Stages in instruction execution: fetch, schedule, decode, operand fetch, execute, store. Pipelining overlaps these stages across instructions, much like an assembly line.
- The penalty of a branch misprediction is that a large number of instructions must be flushed from the pipeline.
- The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution.
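The assembly-line analogy can be made concrete with a small cycle count. This is a simplified sketch, assuming an ideal pipeline in which every stage takes one cycle and each misprediction costs a fixed number of flushed cycles; the function names and the `flush_penalty` parameter are hypothetical and not taken from the slides.

```python
def unpipelined_cycles(n_instr, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr, n_stages, mispredictions=0, flush_penalty=0):
    """Ideal pipeline: fill time plus one instruction completed per cycle.

    Each misprediction is modeled as a fixed number of wasted cycles
    (an assumption made for illustration only).
    """
    return n_stages + n_instr - 1 + mispredictions * flush_penalty


n, k = 100, 6  # 100 instructions on a 6-stage pipeline
ideal = pipelined_cycles(n, k)
with_flushes = pipelined_cycles(n, k, mispredictions=10, flush_penalty=5)
print(unpipelined_cycles(n, k), ideal, with_flushes)  # 600 105 155
```

In this model the speedup approaches the number of stages for long instruction streams, and every flush moves it further from that limit.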

True data dependency: The result of an instruction may be required by a subsequent instruction.

Resource dependency: Two instructions compete for a single processor resource.

Branch dependency: Scheduling instructions a priori across branches may lead to errors.

Vertical waste: If no instructions are issued on the execution units during a cycle, it is referred to as vertical waste.

Horizontal waste: If only some of the execution units are used during a cycle, it is termed horizontal waste.
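Vertical and horizontal waste can be counted from an issue trace. The sketch below assumes a hypothetical trace format (the number of instructions issued in each cycle) and counts waste in issue slots; the function name `issue_slot_waste` is illustrative.

```python
def issue_slot_waste(issued_per_cycle, width):
    """Count wasted issue slots on a `width`-way superscalar processor.

    Returns (vertical_waste, horizontal_waste, utilization).
    """
    vertical = horizontal = 0
    for n in issued_per_cycle:
        if not 0 <= n <= width:
            raise ValueError("issued count must be between 0 and width")
        if n == 0:
            vertical += width          # whole cycle idle
        else:
            horizontal += width - n    # cycle only partly used
    total = width * len(issued_per_cycle)
    used = total - vertical - horizontal
    utilization = used / total if total else 0.0
    return vertical, horizontal, utilization


# Two-way superscalar, five cycles: one empty cycle, two half-used cycles.
print(issue_slot_waste([2, 1, 0, 1, 2], width=2))  # (2, 2, 0.6)
```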



Very long instruction word processors

- VLIW processors rely on the compiler to resolve dependencies and resource availability at compile time.
- Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word, to be executed on multiple functional units at the same time.
- Loop unrolling, branch prediction and speculative execution all play an important role in VLIW processors.

Limitations of Memory System Performance (Latency, Bandwidth)
- Improving effective memory latency using caches
  - Hit ratio
  - Memory-bound computation
- Impact of memory bandwidth
  - Increasing the size of memory blocks
  - Cache line
- Alternate approaches for hiding memory latency (prefetching, multithreading, spatial locality)
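The effect of the hit ratio on effective memory latency can be illustrated with the usual weighted-average model. This is a sketch under a simplifying assumption: a hit costs the cache latency and a miss costs the full main-memory latency. The function name and the latency values are hypothetical.

```python
def effective_latency(hit_ratio, cache_latency, memory_latency):
    """Average access time when a fraction `hit_ratio` of accesses hit the cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_latency + (1.0 - hit_ratio) * memory_latency


# Assumed latencies: 1 ns cache, 100 ns main memory.
for h in (0.0, 0.5, 0.9, 0.99):
    print(h, round(effective_latency(h, 1.0, 100.0), 2))
```

Even at a 90% hit ratio the average access in this model is roughly ten times slower than a cache hit, which is why memory-bound computations benefit so much from better locality.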

Dichotomy of Parallel Computing Platforms (Physical and Logical Views)

Control structure of parallel platforms:
a. SIMD
b. MIMD

SIMD
- Allows an 'activity mask' that determines whether a processing unit participates in an operation or not.
- Requires less hardware and less memory.
- Requires extensive design effort.

MIMD
- Each processing element executes a different program independently.
- A variant is Single Program Multiple Data (SPMD).
- Requires more hardware and more memory than SIMD.
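The SIMD activity mask can be emulated in a few lines. The sketch below is a sequential emulation in plain Python, so it only illustrates the semantics: on real SIMD hardware all lanes execute the same instruction in lockstep and the mask disables some of them. The helper name `simd_apply` and the sample data are hypothetical.

```python
def simd_apply(op, data, mask):
    """Apply `op` to every element whose mask bit is set; leave the rest unchanged."""
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same length")
    return [op(x) if active else x for x, active in zip(data, mask)]


samples = [10, -20, 30, -40]
mask = [x < 0 for x in samples]              # only negative lanes participate
halved = simd_apply(lambda x: x // 2, samples, mask)
print(halved)  # [10, -10, 30, -20]
```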



Communication model of parallel platforms (shared address space platforms, uniform memory access, non-uniform memory access)

Uniform: If the time required to access any memory word in the system (local or global) is identical, the platform is called a uniform memory access (UMA) platform.

Non-uniform: If the time required to access certain memory words in the system is longer than for others, the platform is called a non-uniform memory access (NUMA) platform.

Physical Organization of Parallel Platforms 

Architecture of an Ideal Parallel Computer

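- Parallel Random Access Machine (PRAM)
  - EREW (exclusive read, exclusive write)
  - CREW (concurrent read, exclusive write)
  - ERCW (exclusive read, concurrent write)
  - CRCW (concurrent read, concurrent write)

Protocols for resolving concurrent writes:
- Common
- Arbitrary
- Priority
- Sum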
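The four concurrent-write protocols differ only in how simultaneous writes to one memory cell are resolved. The following is a small sketch, assuming writes are given as hypothetical `(processor_id, value)` pairs and that a lower processor id means higher priority; neither convention comes from the slides.

```python
def resolve_crcw(writes, protocol):
    """Resolve concurrent writes to a single cell under a CRCW protocol."""
    if not writes:
        raise ValueError("no writes to resolve")
    values = [value for _, value in writes]
    if protocol == "common":
        # Allowed only if every processor writes the same value.
        if len(set(values)) != 1:
            raise ValueError("common protocol requires identical values")
        return values[0]
    if protocol == "arbitrary":
        # Any single write may succeed; picking the first is one legal outcome.
        return values[0]
    if protocol == "priority":
        # Assumption: the lowest processor id has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if protocol == "sum":
        return sum(values)
    raise ValueError(f"unknown protocol: {protocol}")


writes = [(2, 5), (0, 7), (1, 5)]
print(resolve_crcw(writes, "priority"))  # 7
print(resolve_crcw(writes, "sum"))       # 17
```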

Interconnection Networks for Parallel Computers
a. Static network (direct network)
b. Dynamic network (indirect network)

Functionality of switches:
- Degree of the switch
- Mapping from input to output ports
- Routing, multicast and internal buffering

Cost of switches:
- Mapping hardware: grows with the square of the switch degree
- Packaging cost: grows with the number of pins

Network Topologies:

Bus-Based Networks:

Scalable in terms of cost, unscalable in terms of performance.

Crossbar Networks:

Scalable in terms of performance but unscalable in terms of cost.

Multistage Networks:

More scalable than the bus in terms of performance and more scalable than the crossbar in terms of cost.
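The cost side of this trade-off can be seen by counting switching elements. The sketch below assumes p inputs and p outputs with p a power of two, and uses the omega network built from 2x2 switches as a representative multistage network; the function names are hypothetical.

```python
def crossbar_switches(p):
    """A p x p crossbar needs one switching point per input/output pair."""
    return p * p


def omega_switches(p):
    """An omega network has log2(p) stages of p/2 two-by-two switches."""
    stages = p.bit_length() - 1
    if p < 2 or (1 << stages) != p:
        raise ValueError("p must be a power of two, at least 2")
    return (p // 2) * stages


for p in (8, 64, 1024):
    print(p, crossbar_switches(p), omega_switches(p))
```

For 64 nodes the crossbar needs 4096 switching points against 192 two-by-two switches in the omega network, which is the sense in which the multistage design scales better in cost.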



Omega Network

Building blocks: perfect shuffle interconnection; 2x2 switches operating in pass-through or cross-over mode.
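Routing through an omega network combines the perfect shuffle with a pass-through or cross-over decision at each stage. The following is a minimal sketch of destination-tag routing, assuming p is a power of two and 2x2 switches; the function name `omega_route` and the returned tuple layout are hypothetical.

```python
def omega_route(src, dst, p):
    """Trace a message from input `src` to output `dst` in a p-input omega network.

    Returns a list of (stage, switch_mode, position_after_stage).
    """
    n = p.bit_length() - 1
    if p < 2 or (1 << n) != p:
        raise ValueError("p must be a power of two, at least 2")
    if not (0 <= src < p and 0 <= dst < p):
        raise ValueError("src and dst must be in range(p)")
    pos = src
    path = []
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (n - 1))) & (p - 1)
        # The switch output is selected by the next destination bit, MSB first.
        want = (dst >> (n - 1 - stage)) & 1
        mode = "pass-through" if (pos & 1) == want else "cross-over"
        pos = (pos & ~1) | want
        path.append((stage, mode, pos))
    return path


for step in omega_route(src=2, dst=7, p=8):
    print(step)
# (0, 'cross-over', 5)
# (1, 'pass-through', 3)
# (2, 'cross-over', 7)
```

At each stage the switch passes the message straight through when the corresponding bits of source and destination agree, and crosses it over when they differ.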



Completely-Connected Network



Linear Arrays, Meshes, and k-d Meshes

Tree-Based Networks
a. Static tree
b. Dynamic tree

- Fat tree

Communication costs in parallel machines

Message passing costs in parallel computers
- Startup time (ts): the time required to handle a message at the sending and receiving nodes.
- Per-hop time (th), also called node latency: the time taken by the header to travel between two directly connected nodes.
- Per-word transfer time (tw): if the channel bandwidth is r words per second, each word takes time tw = 1/r to traverse the link.



Store and Forward Routing



Packet Routing

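A message is broken into packets, each assembled with error, routing and sequencing fields.
- The size of a packet is (r+s), where r is the portion of the original message it carries and s is the additional information.
- m·tw1 is the packetizing time, which depends on the length m of the message.
- The first packet takes th·l + tw2·(r+s) time to reach the destination, where l is the number of hops.
- The destination receives m/r - 1 additional packets, one every tw2·(r+s) seconds.

Adding the startup time ts, the total communication time is
tcomm = ts + m·tw1 + th·l + tw2·(r+s) + (m/r - 1)·tw2·(r+s) = ts + th·l + tw·m
where tw = tw1 + tw2·(1 + s/r).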
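The packet routing cost can be checked numerically by adding the individual terms and comparing with the closed form ts + th·l + tw·m, where tw = tw1 + tw2·(1 + s/r). This is a sketch with hypothetical function names and arbitrary parameter values, assuming the startup time ts is added to the terms listed above and that m is a multiple of r.

```python
def packet_time_stepwise(m, l, ts, th, tw1, tw2, r, s):
    """Sum the individual cost terms of packet routing."""
    packetizing = m * tw1                       # building the packets
    first_packet = th * l + tw2 * (r + s)       # first packet crosses l hops
    remaining = (m / r - 1) * tw2 * (r + s)     # the other m/r - 1 packets
    return ts + packetizing + first_packet + remaining


def packet_time_closed(m, l, ts, th, tw1, tw2, r, s):
    """Closed form with an effective per-word time tw."""
    tw = tw1 + tw2 * (1 + s / r)
    return ts + th * l + tw * m


params = dict(m=1024, l=4, ts=10, th=2, tw1=0.5, tw2=0.25, r=64, s=16)
print(packet_time_stepwise(**params), packet_time_closed(**params))  # 850.0 850.0
```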



Cut-Through Routing

The message is broken into fixed-size units called flow control digits, or flits. Because flits do not contain the overheads of packets, they can be much smaller than packets.
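The difference between store-and-forward and cut-through routing shows up clearly in the standard textbook cost model, which is not spelled out in the slides: store-and-forward costs ts + (m·tw + th)·l because the whole message is received at every hop, while cut-through costs ts + l·th + m·tw because only the header pays the per-hop delay. The sketch below uses hypothetical function names and parameter values.

```python
def store_and_forward_time(m, l, ts, th, tw):
    """Each of the l hops receives the full m-word message before forwarding it."""
    return ts + (m * tw + th) * l


def cut_through_time(m, l, ts, th, tw):
    """Flits follow the header in a pipeline, so m*tw is paid only once."""
    return ts + l * th + m * tw


# Assumed values: 100-word message, 4 hops, ts = 50, th = 1, tw = 0.5.
sf = store_and_forward_time(m=100, l=4, ts=50, th=1, tw=0.5)
ct = cut_through_time(m=100, l=4, ts=50, th=1, tw=0.5)
print(sf, ct)  # 254.0 104.0
```

For a single hop the two models give the same time; the advantage of cut-through grows with both the message length and the number of hops.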

Levels of parallelism
- Instruction-level parallelism (ILP)
- Transaction-level parallelism
- Task-level / thread-level parallelism
- Memory parallelism: the ability to have multiple pending memory operations, such as cache misses or TLB misses.
- Function parallelism: achieved by applying different operations to different data elements simultaneously.

Parallel Processing Models
- SIMD model
  - A single instruction stream is broadcast to all processing units.
  - It is a class of parallel computer.
  - Exploits data-level parallelism.
  - Examples: adjusting the volume of digital audio, adjusting the contrast of a digital image.
- MIMD model
- SPMD model

Dataflow Model

Static Approach

Manchester Dynamic Approach



Demand Driven Computation

Parallel Processing Architectures
- N-wide superscalar architecture
- Multi-core architecture
- Multi-threaded processors
  - Advantages
  - Disadvantages
