10x10: Using Extreme Heterogeneity to Build a General-Purpose ...


10x10: Using Extreme Heterogeneity to Build a General-Purpose Processor with Exascale Energy-Efficiency

Apala Guha†, Yao Zhang†, Raihan Rasool†, Lei Zhang†, Amirali Shambayati†, Andrew A. Chien†‡
†Large-scale systems group, Department of Computer Science, University of Chicago
‡Mathematics and Computer Science Division, Argonne National Laboratory
http://10x10.cs.uchicago.edu
{aguha, zhangyaobit, raihan, zlei, achien, amirali}@cs.uchicago.edu

MOTIVATION

[Figure: Performance vs. Time. Regions: Dennard Scaling, Energy-limited Scaling. Labels: Big Core (10’s), Small Core (100’s), Heterogeneous (incl. Hybrid). Annotations: "90/10 architectural optimization no longer works"; "Breaking a 50-year computing paradigm, what’s the best way?"]

• Due to power limits, single-processor performance is no longer increasing.
• We need extreme energy-efficiency.
• GPPs and GPUs have some benefits, but extreme heterogeneity has potential for even greater energy efficiency.

SOLUTION

• Each core is an ensemble of micro-engines.
• Code regions are executed by the best-matched micro-engines, and the rest remain ‘off’.
• Together, the micro-engines provide high performance and energy efficiency for entire applications.
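The idea of a core as an ensemble of micro-engines, where each code region runs on the best-matched engine while the others stay off, can be sketched roughly as below. This is only an illustration: the `MicroEngine` class, the `dispatch` function, the engine names, the region kinds, and the efficiency scores are all hypothetical and are not taken from the 10x10 design.

```python
from dataclasses import dataclass


@dataclass
class MicroEngine:
    name: str
    # Hypothetical relative efficiency per code-region kind (higher is better).
    efficiency: dict
    powered_on: bool = False


def dispatch(region_kind, engines):
    """Power on the best-matched micro-engine for a region; turn the rest off."""
    best = max(engines, key=lambda e: e.efficiency.get(region_kind, 0.0))
    for engine in engines:
        engine.powered_on = engine is best
    return best


# Illustrative core with three made-up micro-engines.
core = [
    MicroEngine("general", {"control": 1.0, "stencil": 1.0, "sort": 1.0}),
    MicroEngine("vector", {"stencil": 8.0}),
    MicroEngine("sorter", {"sort": 5.0}),
]

for kind in ["control", "stencil", "sort"]:
    chosen = dispatch(kind, core)
    print(kind, "->", chosen.name)
```

In this sketch exactly one engine is powered at a time, which mirrors the "rest remain off" bullet; a real design could of course keep several engines active.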

[Figure: Spectrum of energy efficiency vs. programmability. Vertical axis: EE [Ops/J] @ fixed process. Horizontal axis: Programmability/Portability. Labeled points include ASIC, SoC/IP, Accel, GPU, M-core, Core, and Ideal.]

Spectrum of energy efficiency vs. programmability
• Compute: ASICs, SoC, GPU, parallel CPU, CPU
• Where are we going? Overlap, dominate?
• Answer is deeply a hardware and software question
  o Waning of Moore’s Law, end of days Moore’s Law, success of near-threshold and device scaling heroics
  o Software translation technology for cross-compilation, transformation and optimization, higher-level programming

[Figure: Power (W) vs. Performance (GIns/s), logarithmic axes. Labeled design points: Augmented reality, Object recognition, Image-recognition, Dyser, Epiphany, Tilera, Atom N2800, IBM Cell, H.264 encoder - custom, 3D lighting accelerator, Cortex A8, H.264 encoder ASIC, H.264 general-purpose, H.264 SIMD+VLIW, H.264 fused ins, Cryptomaniac, QsCores, LAP, Anton, Grape-8, Sandybridge, Nehalem, Kepler.]

Accelerators can be 100 to 10,000 times more energy-efficient than multi-cores.

REQUIREMENTS OF A SOLUTION

• 10x10 has to satisfy both energy efficiency and programmability.
  o Energy-efficiency of customized silicon (SoCs and ASICs)
  o Programmability of multi-core systems
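The energy-efficiency metric behind the power-versus-performance comparison above (operations per joule) is simply performance divided by power. The short sketch below shows the arithmetic. The two design points are hypothetical round numbers chosen for illustration, not values read from the poster's plot, and `ops_per_joule` is an illustrative helper name.

```python
def ops_per_joule(giga_ops_per_second, watts):
    """Energy efficiency: operations per second divided by joules per second."""
    return giga_ops_per_second * 1e9 / watts


# Hypothetical design points (NOT measurements from the plot).
multicore = ops_per_joule(giga_ops_per_second=10.0, watts=100.0)
accelerator = ops_per_joule(giga_ops_per_second=10.0, watts=0.1)

# Same throughput at 1/1000 of the power gives roughly 1000x the ops per joule.
ratio = accelerator / multicore
print(f"accelerator is about {ratio:.0f}x more energy-efficient")
```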

10x10 CHALLENGES

• What code features should the micro-engines target?
• How to design the micro-engines?
• What is the programming level for achieving efficiency?

CODE REGION CLUSTERING: WHAT CODE FEATURES SHOULD THE MICRO-ENGINES TARGET?

[Figure: Workload → Factor into “10” Bins → Micro-engine per Bin → Compose]

• We cluster similar hot code regions from a variety of applications. The goal is to build one micro-engine targeting each cluster and compose them into a core.
• We use PARSEC, Biobench, UHPC challenge apps and embedded apps.
• We select the hot loop nests from these apps.
• We study the operation mix, dataflow patterns, and memory access patterns of these hot loop nests.
• We cluster together similar loop nests.

MICRO-ENGINE CO-DESIGN: HOW TO DESIGN THE MICRO-ENGINES?

• Generality: Cover as many clusters as possible within the chip area constraint. Our micro-engine designs are based on fine-grained program features, which are extracted from data flow graphs (DFGs) covering a couple of instructions and thus promise maximum reusability among clusters.
• Flexibility: Micro-engine designs should consider both computation structures and data types to achieve better matching with clusters.
• Efficient Implementation: Efficient design automation techniques enable high-quality implementations of many micro-engines.

PROGRAMMABILITY: WHAT IS THE PROGRAMMING LEVEL FOR ACHIEVING EFFICIENCY?

• An abstract layer
  o Parallelism abstraction: inter-node, intra-node, manycore, vector
  o Data abstraction: adaptive layout for distributed and hierarchical memory
• C-based programming model: MPI, OpenMP, OpenCL, SSE
  o Programmer-controlled parallelism mapping & data distribution
  o High performance at the cost of programming effort and portability
• Hardware
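The code-region clustering step described above (grouping hot loop nests whose operation mix and memory access patterns are similar) can be illustrated with a small sketch. The poster does not say which clustering algorithm was used; the example below uses simple leader clustering with a distance threshold purely for brevity. The loop-nest names, the three features, and all numeric values are hypothetical.

```python
import math

# Hypothetical feature vectors per hot loop nest:
# (fraction of FP ops, fraction of memory ops, fraction of strided accesses)
loop_nests = {
    "nest_a": (0.70, 0.20, 0.90),
    "nest_b": (0.65, 0.25, 0.85),
    "nest_c": (0.05, 0.45, 0.30),
    "nest_d": (0.10, 0.40, 0.35),
    "nest_e": (0.40, 0.30, 0.95),
}


def cluster(features, threshold):
    """Leader clustering: join the nearest cluster if its leader is within threshold."""
    clusters = []  # list of (leader_vector, [member names])
    for name, vec in features.items():
        nearest = min(clusters, key=lambda c: math.dist(c[0], vec), default=None)
        if nearest is not None and math.dist(nearest[0], vec) <= threshold:
            nearest[1].append(name)
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


# Each resulting bin would be a candidate target for one micro-engine.
bins = cluster(loop_nests, threshold=0.2)
for i, members in enumerate(bins):
    print(f"bin {i}: {members}")
```

The threshold controls how many bins come out; a real study would more likely use richer features (including dataflow patterns) and a standard method such as k-means or hierarchical clustering.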

CONCLUSIONS

• Paradigm shift in architecture design inevitable.
  o 90/10 optimization not viable anymore.
  o Individual code patterns will have larger influence on architecture.
• Extreme heterogeneity needed:
  o Hybrid CPU-GPU computing not enough.
  o 10x10 marks a shift to this new paradigm for GPPs.
• Heterogeneous solutions must be usable.
  o Usability challenges become more severe as heterogeneity increases.
  o Elegant programmability solutions necessary.


ACKNOWLEDGEMENTS


This work was supported in part by the National Science Foundation under award NSF OCI0-1057921 and the Defense Advanced Research Projects Agency under award HR0011-13-2-0014. The contents do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.