10x10: Using Extreme Heterogeneity to Build a General-Purpose Processor with Exascale Energy-Efficiency

Apala Guha†, Yao Zhang†, Raihan Rasool†, Lei Zhang†, Amirali Shambayati†, Andrew A. Chien†‡
†Large-scale systems group, Department of Computer Science, University of Chicago
‡Mathematics and Computer Science Division, Argonne National Laboratory
http://10x10.cs.uchicago.edu
{aguha, zhangyaobit, raihan, zlei, achien, amirali}@cs.uchicago.edu

MOTIVATION

• Due to power limits, single-processor performance is no longer increasing.
• We need extreme energy-efficiency.
• GPPs and GPUs have some benefits, but extreme heterogeneity has potential for even greater energy efficiency.

[Figure: Performance vs. time. Dennard scaling gives way to energy-limited scaling, with three design directions: big core (10's), small core (100's), and heterogeneous (incl. hybrid). Annotations: "90/10 architectural optimization no longer works" and "Breaking a 50-year computing paradigm, what's the best way?"]

[Figure: Power (W) vs. performance (GIns/s), log-scale axes. Labeled designs: Augmented reality, Object recognition, Image-recognition, Dyser, Epiphany, Tilera, Atom N2800, IBM Cell, Cortex A8, Sandybridge, Nehalem, Kepler, LAP, Anton, Grape-8, 3D lighting accelerator, Cryptomaniac, QsCores, H.264 encoder - custom, H.264 encoder ASIC, H.264 general-purpose, H.264 SIMD+VLIW, H.264 fused ins. Annotation: "Accelerators can be 100 to 10,000 times more energy-efficient than multi-cores."]

REQUIREMENTS OF A SOLUTION

• 10x10 has to satisfy both energy efficiency and programmability.
  o Energy-efficiency of customized silicon (SoCs and ASICs)
  o Programmability of multi-core systems

[Figure: Spectrum of energy efficiency vs. programmability. Vertical axis: EE [Ops/J] @ fixed process; horizontal axis: programmability/portability. Labels: Compute Chip, ASIC, SoC/IP, Accel, GPU + features, +GPU+M-core, +M-core, Core, Ideal.]

• ASICs, SoC, GPU, parallel CPU, CPU: where are we going? Overlap, dominate?
• The answer is deeply a hardware and software question:
  o Waning of Moore's Law, end of days Moore's Law, success of near-threshold and device scaling heroics
  o Software translation technology for cross-compilation, transformation and optimization, higher-level programming

SOLUTION

• Each core is an ensemble of micro-engines.
• Code regions are executed by the best matched micro-engines, and the rest remain 'off'.
• Together, the micro-engines provide high performance and energy efficiency for entire applications.
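The dispatch idea in the SOLUTION bullets (each code region runs on its best-matched micro-engine while the others stay off) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the poster: the engine names, the per-operation energy figures, the region names, and the rule of choosing the engine with the lowest estimated energy.

```python
from dataclasses import dataclass


@dataclass
class MicroEngine:
    name: str
    # Hypothetical energy cost per operation class (arbitrary units).
    energy_per_op: dict


def best_engine(region_ops, engines):
    """Return the engine with the lowest estimated energy for one code region."""
    def energy(engine):
        # An engine that lacks an operation class cannot run the region.
        if any(op not in engine.energy_per_op for op in region_ops):
            return float("inf")
        return sum(n * engine.energy_per_op[op] for op, n in region_ops.items())
    return min(engines, key=energy)


def schedule(regions, engines):
    """Map each region to one active engine; every other engine is 'off'."""
    plan = {}
    for region_name, ops in regions.items():
        active = best_engine(ops, engines)
        off = [e.name for e in engines if e is not active]
        plan[region_name] = (active.name, off)
    return plan


# Illustrative engines and regions (all values are made up).
ENGINES = [
    MicroEngine("scalar", {"int": 1.0, "fp": 2.0, "mem": 1.5}),
    MicroEngine("vector_fp", {"fp": 0.3, "mem": 1.0}),
    MicroEngine("bit_ops", {"int": 0.2, "mem": 1.2}),
]
REGIONS = {
    "stencil_loop": {"fp": 800, "mem": 200},
    "hash_loop": {"int": 900, "mem": 100},
    "mixed_loop": {"int": 300, "fp": 300, "mem": 100},
}
PLAN = schedule(REGIONS, ENGINES)
for region, (active, off) in PLAN.items():
    print(f"{region}: run on {active}, power-gate {off}")
```

In this toy setup the general-purpose "scalar" engine acts as the fallback for regions that no specialized engine fully supports, which is one plausible way to keep whole applications runnable.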

10x10 CHALLENGES

• What code features should the micro-engines target?
• How to design the micro-engines?
• What is the programming level for achieving efficiency?

CODE REGION CLUSTERING: WHAT CODE FEATURES SHOULD THE MICRO-ENGINES TARGET?

[Figure: Workload → Factor into "10" Bins → Micro-engine per Bin → Compose]

• We cluster similar hot code regions from a variety of applications. The goal is to build one micro-engine targeting each cluster and compose them into a core.
• We use PARSEC, Biobench, UHPC challenge apps and embedded apps.
• We select the hot loop nests from these apps.
• We study the operation mix, dataflow patterns, and memory access patterns for these hot loop nests.
• We cluster together similar loop nests.

MICRO-ENGINE CO-DESIGN: HOW TO DESIGN THE MICRO-ENGINES?

Generality: Cover as many clusters as possible within the chip area constraint. Our micro-engine designs are based on fine-grained program features, which are extracted from data flow graphs (DFGs) covering a couple of instructions and thus promise maximum reusability among clusters.
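The flow of clustering hot loop nests and then covering as many clusters as possible within the chip area constraint can be sketched in a few lines. This is only an illustration under stated assumptions: the poster does not name a clustering algorithm, so plain k-means over operation-mix vectors is swapped in here, and the loop names, feature values, engine areas, and greedy selection rule are all hypothetical.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain k-means; seeds centroids with the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def compose_core(cluster_sizes, engine_area, area_budget):
    """Greedy heuristic: most loop nests covered per unit area, within budget."""
    chosen, used = [], 0.0
    order = sorted(cluster_sizes,
                   key=lambda c: cluster_sizes[c] / engine_area[c],
                   reverse=True)
    for c in order:
        if used + engine_area[c] <= area_budget:
            chosen.append(c)
            used += engine_area[c]
    return chosen, used


# Hypothetical hot loop nests described by (int, fp, memory) operation mix.
LOOPS = {
    "fft_butterfly": (0.1, 0.7, 0.2),
    "hash_probe": (0.7, 0.0, 0.3),
    "stream_copy": (0.1, 0.1, 0.8),
    "stencil": (0.1, 0.6, 0.3),
    "seq_align": (0.6, 0.0, 0.4),
    "crc": (0.8, 0.0, 0.2),
}
# Hypothetical silicon area of the micro-engine built for each cluster.
AREA = {0: 2.0, 1: 1.0, 2: 1.5}

LABELS = kmeans(list(LOOPS.values()), k=3)
SIZES = dict(Counter(LABELS))
CHOSEN, USED = compose_core(SIZES, AREA, area_budget=3.0)
print(dict(zip(LOOPS, LABELS)), CHOSEN, USED)
```

The greedy pass is a heuristic rather than an optimal knapsack solution; it is used here only because it makes the trade-off between coverage and area easy to see.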

Flexibility: Micro-engine designs should consider both computation structures and data types to achieve better matching with clusters.

Efficient Implementation: Efficient design automation techniques enable high quality implementations of many micro-engines.

PROGRAMMABILITY: WHAT IS THE PROGRAMMING LEVEL FOR ACHIEVING EFFICIENCY?

[Figure: layered programming stack]

• An abstract layer
  o Parallelism abstraction: inter-node, intra-node, manycore, vector
  o Data abstraction: adaptive layout for distributed and hierarchical memory
• C-based programming model: MPI, OpenMP, OpenCL, SSE
  o Programmer-controlled parallelism mapping & data distribution
  o High performance at the cost of programming effort and portability
• Hardware

CONCLUSIONS

• Paradigm shift in architecture design inevitable.
  o 90/10 optimization not viable anymore.
  o Individual code patterns will have larger influence on architecture.
• Extreme heterogeneity needed:
  o Hybrid CPU-GPU computing not enough.
  o 10x10 marks a shift to this new paradigm for GPPs.
• Heterogeneous solutions must be usable.
  o Usability challenges become more severe as heterogeneity increases.
  o Elegant programmability solutions necessary.

ACKNOWLEDGEMENTS

This work was supported in part by the National Science Foundation under award NSF OCI0-1057921 and the Defense Advanced Research Projects Agency under award HR0011-13-2-0014. The contents do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.
