High Performance Computing For senior undergraduate students
Lecture 1: Course Information & Introduction 27.09.2016
Dr. Mohammed Abdel-Megeed Salem Scientific Computing Department Faculty of Computer and Information Sciences Ain Shams University
Course Information • Lecture Time: Tuesday : 14:00 – 17:00 • Lecturer: Dr. Mohammed Abdel-Megeed Salem –
[email protected] – https://sites.google.com/a/fcis.asu.edu.eg/salem
Course Information - Textbook Introduction to Parallel Computing, 2nd Edition. Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Addison-Wesley, 2003.
Course Information - Content
• Introduction (Chapter 1)
• Parallel Programming Platforms (Chapter 2)
• Principles of Parallel Algorithm Design (Chapter 3)
• Analytical Modeling of Parallel Programs (Chapter 5)
• Programming Using the Message-Passing Paradigm (Chapter 6)
• Programming Shared Address Space Platforms (Chapter 7)
• Dense Matrix Operations (Chapter 8)
• Application of Parallel Programming in Image Processing
Course Information - Grading
• Attendance and Participation  05%
• Midterm                       10%
• Assignments                   05%
• Lab Participation             05%
• Final Lab Exam                10%
• Final Exam                    65%
Parallel Computing!! • What is Parallel Computing? – Multiple processors cooperating concurrently to solve one problem. – Parallel Computing is the use of parallel processing for running advanced application programs efficiently, reliably and quickly.
Parallel Computing - Why • Why is Parallel Computing Important?
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  - Too difficult: build large wind tunnels.
  - Too expensive: build a throw-away passenger jet.
  - Too slow: wait for climate evolution.
  - Too dangerous: weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.
Parallel Computing – Example: Global Climate Modeling Problem
• Problem is to compute a function of (latitude, longitude, elevation, time) yielding temperature, pressure, humidity, wind velocity.
• Approach:
  - Discretize the domain, e.g., a measurement point every 10 km.
  - Devise an algorithm to predict weather at time t+Δt given time t.
Parallel Computing – Example: Global Climate Modeling Problem
• One piece is modeling the fluid flow in the atmosphere
  - Roughly 100 Flops per grid point with 1 minute timestep
• Computational requirements:
  - To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
  - Weather prediction (7 days in 24 hours): 56 Gflop/s
  - Climate prediction (50 years in 30 days): 4.8 Tflop/s
• To double the grid resolution, computation is 8x to 16x.
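These rates follow from simple arithmetic; the short check below uses the slide's own figure of 5 x 10^11 flops per simulated minute, so small differences from the slide's rounded numbers are expected:

```python
# Back-of-the-envelope check of the climate-modeling flop rates
# (assumes the slide's figure of 5e11 flops per simulated minute).

FLOPS_PER_STEP = 5e11   # flops per 1-minute timestep (from the slide)
STEP_SECONDS = 60       # each timestep covers 60 s of simulated time

# Real time: simulate 60 s of weather in 60 s of wall-clock time.
realtime = FLOPS_PER_STEP / STEP_SECONDS
print(f"real time:       {realtime / 1e9:6.1f} Gflop/s")

# Weather prediction: 7 simulated days in 24 wall-clock hours -> 7x real time.
weather = realtime * 7
print(f"7-day forecast:  {weather / 1e9:6.1f} Gflop/s")

# Climate prediction: 50 simulated years in 30 wall-clock days.
climate = realtime * (50 * 365) / 30
print(f"50-year climate: {climate / 1e12:6.2f} Tflop/s")
```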
Parallel Computing - Measures • High Performance Computing (HPC) units are: • —Flop: floating point operation • —Flop/s: floating point operations per second • —Bytes: size of data
Parallel Computing - Measures
• Mega:  Mflop/s = 10^6 flop/sec,  Mbyte = 2^20 = 1048576 ~ 10^6 bytes
• Giga:  Gflop/s = 10^9 flop/sec,  Gbyte = 2^30 ~ 10^9 bytes
• Tera:  Tflop/s = 10^12 flop/sec, Tbyte = 2^40 ~ 10^12 bytes
• Peta:  Pflop/s = 10^15 flop/sec, Pbyte = 2^50 ~ 10^15 bytes
• Exa:   Eflop/s = 10^18 flop/sec, Ebyte = 2^60 ~ 10^18 bytes
• Zetta: Zflop/s = 10^21 flop/sec, Zbyte = 2^70 ~ 10^21 bytes
• Yotta: Yflop/s = 10^24 flop/sec, Ybyte = 2^80 ~ 10^24 bytes
Technology Trends: Microprocessor Capacity - Moore's Law
• 2X transistors/chip every 1.5 years: called "Moore's Law".
• Microprocessors have become smaller, denser, and more powerful.
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. (Slide source: Jack Dongarra)
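A doubling every 18 months compounds dramatically. The sketch below projects transistor counts from the Intel 4004's roughly 2,300 transistors (1971), purely to illustrate the growth law, not to match any real chip:

```python
# Transistor count under Moore's law: doubling every 18 months (1.5 years).
def moores_law(t_years, n0=2300, period_years=1.5):
    """Predicted transistor count t_years after a chip with n0 transistors."""
    return n0 * 2 ** (t_years / period_years)

# Projecting forward from the Intel 4004 (~2300 transistors, 1971):
for year in (1971, 1981, 1991, 2001):
    print(year, f"{moores_law(year - 1971):,.0f}")
```

This is the calculation behind the "Your Turn" exercise below: plot real component counts against this curve to see how closely the industry tracked the prediction.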
Applications • Engineering: – Design of aircraft – Design and simulation of micro- and nano-scale systems.
• Commercial: – The largest parallel computers power Wall Street! – Data mining and analysis for optimizing business and marketing decisions.
Applications • Computers: – Embedded systems increasingly rely on distributed control algorithms. – Network intrusion detection, cryptography, etc. – Optimizing performance of modern automobiles. – Networks, mail-servers, search engines… – Visualization
• Computational Sciences: – Bioinformatics: Functional and structural characterization of genes and proteins. – Astrophysics: exploring the evolution of galaxies. – Weather modeling, flood/tornado prediction.
Parallel Programming Approaches • Explicitly: – define tasks, work decomposition, data decomposition, communication, synchronization.
• Implicitly: – define tasks only, with the rest implied; or define tasks and work decomposition, with the rest implied.
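To make the explicit style concrete, the sketch below spells out the task definition, the work decomposition, and the final combination by hand, using Python's standard concurrent.futures module. The sum-of-squares task is just a placeholder; with CPython's global interpreter lock, threads illustrate the structure rather than real speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """One explicitly defined task: sum of squares over [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

n, workers = 100_000, 4
step = n // workers
# Explicit work/data decomposition: split [0, n) into equal chunks.
chunks = [(i * step, (i + 1) * step) for i in range(workers)]
with ThreadPoolExecutor(max_workers=workers) as pool:
    parts = list(pool.map(partial_sum, chunks))
# Explicit combination of the partial results.
total = sum(parts)
print(total)
```

Everything the implicit style would hide (how the range is split, how results flow back, when tasks synchronize) is visible in the code.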
Explicit Parallel Computing
• Algorithm development is harder
  - complexity of specifying and coordinating concurrent activities
• Software development is much harder
  - lack of standardized & effective development tools, programming models, and environments
• Rapid pace of change in computer system architecture
  - today's hot parallel algorithm may not be suitable for tomorrow's parallel computer!
Automatic Parallelism
• Bit level parallelism
  - within floating point operations, etc.
• Instruction level parallelism (ILP)
  - multiple instructions execute per clock cycle
• Memory system parallelism
  - overlap of memory operations with computation
• OS parallelism
  - multiple jobs run in parallel on commodity SMPs
Your Turn • 1. Go to the Top 500 Supercomputers site (http://www.top500.org/) and list the five most powerful supercomputers along with their FLOPS rating. • 2. Collect statistics on the number of components in state of the art integrated circuits over the years. Plot the number of components as a function of time and compare the growth rate to that dictated by Moore's law.
Scope of Parallelism • Conventional architectures coarsely comprise a processor, a memory system, and a datapath. • Each of these components presents significant performance bottlenecks. • Parallelism addresses each of these components in significant ways.
Implicit Parallelism: Trends in Microprocessor Architectures • Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). • Higher levels of device integration have made available a large number of transistors. • The question of how best to utilize these resources is an important one. • Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. • The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
Explicit Parallelism • An explicitly parallel program must specify concurrency and interaction between concurrent subtasks. • The former is sometimes also referred to as the control structure and the latter as the communication model.
Control Structure of Parallel Programs • Parallelism can be expressed at various levels of granularity - from instruction level to processes. • Between these extremes exist a range of models, along with corresponding architectural support.
Concurrency • What is Concurrency? And does concurrency have different levels or layers?
Concurrency • Two events are said to be concurrent if they occur within the same time interval. • Two or more tasks executing over the same time interval are said to execute concurrently.
Concurrency • The tasks may alternate executing. However, the time slices are so short that both tasks appear to execute simultaneously. – Two tasks may occur concurrently within the same second, with each task executing within different fractions of that second: the first task may execute for the first tenth of the second and pause, the second task may execute for the next tenth and pause, the first task may resume in the third tenth, and so on.
Concurrency • Concurrent tasks can execute in a single or multiprocessing environment. • In a single processing environment, concurrent tasks exist at the same time and execute within the same time period by context switching. • In a multiprocessor environment, concurrent tasks may execute at the same instant over the same time period.
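This alternation can be made visible with two threads logging their progress. A minimal sketch using Python's standard threading module; the exact interleaving is up to the scheduler, so only the combined result is predictable:

```python
import threading

log = []

def task(name, steps):
    # Record who ran each step; the order of entries from the two
    # threads depends on how the scheduler interleaves them.
    for i in range(steps):
        log.append((name, i))

t1 = threading.Thread(target=task, args=("A", 3))
t2 = threading.Thread(target=task, args=("B", 3))
t1.start(); t2.start()
t1.join(); t2.join()

# Both tasks completed within the same time interval, possibly interleaved.
print(log)
```

On a single processor the two tasks share the CPU by context switching; on a multiprocessor they may genuinely run at the same instant, but the program is written the same way.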
Concurrency Levels
• Instruction Level
• Routine Level
• Object Level
• Application Level
Concurrency Levels • Instruction Level – Multiple parts of a single instruction can be executed simultaneously.
• Routine Level – Functions are assigned to different threads and can execute "simultaneously" if enough processors are available.
Concurrency Levels • Object Level – Objects residing in different threads or processes may execute their methods concurrently.
• Application Level – Two or more applications can cooperatively work together to solve some problem.
• Different applications utilize different aspects of parallelism: • Data intensive applications utilize high aggregate throughput. • Server applications utilize high aggregate network bandwidth. • Scientific applications typically utilize high processing and memory system performance. • It is important to understand each of these performance bottlenecks.
Parallelism in Uni-Processors • A uniprocessor (one CPU) system can perform two or more tasks simultaneously. • The tasks need not be related to each other, so such a system can process two different instructions simultaneously.
Uniprocessor Architecture
• Von Neumann Architecture
  [Diagram: Input Devices feed data to Memory; the CPU (Control Unit + Arithmetic and Logic Unit) exchanges data with Memory and Storage; results flow as information to Output Devices.]
CPU Execution Cycle
CPU Execution Cycle • The (fetch-decode-execute cycle) to run programs as follows: – 1. The control unit fetches the next instruction from main memory using the program counter to determine where the instruction is located. – 2. The instruction is decoded into a language that the ALU can understand.
CPU Execution Cycle • 3. Any data operands required to execute the instruction are fetched from memory and placed into registers within the CPU. • 4. The ALU executes the instruction and places results in registers or memory.
CPU Execution Cycle • Each clock cycle, the CPU performs an operation • Operations may be simple (logical .or.) or complex (floating point division) • The set of operations supported by the hardware is called the instruction set architecture (ISA) • Depending on the ISA, an operation may require one or more instructions
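The fetch-decode-execute steps above can be sketched as a loop over a toy instruction set. The three-instruction ISA below is invented purely for illustration:

```python
# A toy fetch-decode-execute loop for an invented 3-instruction ISA.
program = [
    ("LOAD", 5),   # acc = 5
    ("ADD", 3),    # acc += 3
    ("HALT", 0),
]

pc, acc, running = 0, 0, True
while running:
    op, arg = program[pc]      # 1. fetch: program counter locates the instruction
    pc += 1
    if op == "LOAD":           # 2. decode the opcode...
        acc = arg              # 3-4. ...fetch the operand and execute,
    elif op == "ADD":          #      with the result kept in the register 'acc'
        acc += arg
    elif op == "HALT":
        running = False

print(acc)   # -> 8
```

A real CPU does the same thing in hardware, with the decode step translating bit patterns rather than tuples, but the control flow of the cycle is the same.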
Contacts High Performance Computing, 2016/2017 Dr. Mohammed Abdel-Megeed M. Salem Faculty of Computer and Information Sciences, Ain Shams University Abbassia, Cairo, Egypt Tel.: +2 011 1727 1050 Email:
[email protected] Web: https://sites.google.com/a/fcis.asu.edu.eg/salem