10x10: Using Extreme Heterogeneity to Build a General-Purpose Processor with Exascale Energy-Efficiency
Apala Guha†, Yao Zhang†, Raihan Rasool†, Lei Zhang†, Amirali Shambayati†, Andrew A. Chien†‡
†Large-scale systems group, Department of Computer Science, University of Chicago
‡Mathematics and Computer Science Division, Argonne National Laboratory
http://10x10.cs.uchicago.edu
{aguha, zhangyaobit, raihan, zlei, achien, amirali}@cs.uchicago.edu
MOTIVATION
[Figure: performance vs. time, contrasting Dennard scaling with energy-limited scaling. Curve labels: big core (10's), small core (100's), heterogeneous (incl. hybrid).]
90/10 architectural optimization no longer works.
Breaking a 50-year computing paradigm: what's the best way?
• Due to power limits, single-processor performance is no longer increasing.
• We need extreme energy efficiency.
• GPPs and GPUs have some benefits, but extreme heterogeneity has the potential for even greater energy efficiency.
SOLUTION
• Each core is an ensemble of micro-engines.
• Code regions are executed by the best-matched micro-engines, and the rest remain 'off'.
• Together, the micro-engines provide high performance and energy efficiency for entire applications.
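The idea that each code region runs on its best-matched micro-engine while the others stay off can be sketched in a few lines. This is a minimal illustration, not the 10x10 implementation: the `MicroEngine` class, the engine names, the region kinds, and every energy figure are hypothetical and chosen only to make the selection rule concrete.

```python
from dataclasses import dataclass, field


@dataclass
class MicroEngine:
    name: str
    # Hypothetical energy cost in pJ per operation, keyed by code-region kind.
    energy_per_op: dict = field(default_factory=dict)


def pick_engine(kind, engines, general):
    """Pick the engine with the lowest assumed energy for this region kind.

    Falls back to the general-purpose engine when no specialized engine
    supports the region kind.
    """
    supported = [e for e in engines if kind in e.energy_per_op]
    if not supported:
        return general
    return min(supported, key=lambda e: e.energy_per_op[kind])


def run(regions, engines, general, general_cost):
    """Return (plan, total energy in pJ); exactly one engine is active per region."""
    plan, total = [], 0
    for kind, ops in regions:
        engine = pick_engine(kind, engines, general)
        per_op = engine.energy_per_op.get(kind, general_cost)
        plan.append((kind, engine.name))
        total += per_op * ops
    return plan, total


# Illustrative numbers only; they are not measurements from the poster.
general = MicroEngine("general")
engines = [
    MicroEngine("vector", {"dense_loop": 200, "stencil": 300}),
    MicroEngine("bitops", {"crypto": 100}),
    MicroEngine("stream", {"stencil": 250}),
]
regions = [("dense_loop", 1000), ("crypto", 500), ("branchy", 200)]
plan, total = run(regions, engines, general, general_cost=1000)
```

In this toy setup the dense loop goes to the vector engine, the crypto region to the bit-operations engine, and the branchy region falls back to the general-purpose engine, which is the only region paying the general-purpose energy cost.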
[Figure: spectrum of energy efficiency vs. programmability. Vertical axis: EE [Ops/J] @ fixed process. Horizontal axis: programmability/portability. Labels: ASIC, chip SoC/IP, accel, GPU + features, +M-core, core, ideal.]
Spectrum of energy efficiency vs. programmability: compute ASICs, SoC, GPU, parallel CPU, CPU. Where are we going? Overlap, dominate? The answer is deeply a hardware and software question:
– Waning of Moore's Law, end-of-days Moore's Law, success of near-threshold and device scaling heroics
– Software translation technology for cross-compilation, transformation and optimization, higher-level programming
REQUIREMENTS OF A SOLUTION
• 10x10 has to satisfy both energy efficiency and programmability.
• Energy efficiency of customized silicon (SoCs and ASICs)
• Programmability of multi-core systems
[Figure: power (W) vs. performance (GIns/s) on logarithmic axes. Labeled points: Dyser, Epiphany, Tilera, Atom N2800, IBM Cell, Cortex A8, Sandybridge, Nehalem, Kepler, LAP, Anton, Grape-8, Cryptomaniac, QsCores, 3D lighting accelerator, H.264 encoder - custom, H.264 encoder ASIC, H.264 general-purpose, H.264 SIMD+VLIW, H.264 fused ins. Workload labels: augmented reality, object recognition, image recognition.]
Accelerators can be 100 to 10,000 times more energy-efficient than multi-cores.
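As a reading aid for the power-versus-performance plot, energy efficiency in operations per joule is performance divided by power. The subscripted symbols below are notation introduced here for illustration; they do not appear on the poster.

```latex
\[
\mathrm{EE}\;[\mathrm{Ops/J}]
  \;=\; \frac{\text{Performance}\;[\mathrm{Ops/s}]}{\text{Power}\;[\mathrm{W}]},
\qquad
\frac{\mathrm{EE}_{\text{accel}}}{\mathrm{EE}_{\text{multi-core}}}
  \;=\; \frac{P_{\text{multi-core}}}{P_{\text{accel}}}
  \quad\text{at equal performance.}
\]
```

Under this definition, a 100x gain in energy efficiency at equal performance corresponds to a 100x reduction in power.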
10x10 CHALLENGES
• What code features should the micro-engines target?
• How to design the micro-engines?
• What is the programming level for achieving efficiency?
[Figure label: Workload]
CODE REGION CLUSTERING: WHAT CODE FEATURES SHOULD THE MICRO-ENGINES TARGET?
[Figure: factor into "10" bins, then one micro-engine per bin, then compose.]
• We cluster similar hot code regions from a variety of applications. The goal is to build one micro-engine targeting each cluster and compose them into a core.
• We use PARSEC, Biobench, UHPC challenge apps, and embedded apps.
• We select the hot loop nests from these apps.
• We study the operation mix, dataflow patterns, and memory access patterns of these hot loop nests.
• We cluster together similar loop nests.
MICRO-ENGINE CO-DESIGN: HOW TO DESIGN THE MICRO-ENGINES?
Generality: Cover as many clusters as possible within the chip area constraint. Our micro-engine designs are based on fine-grained program features, extracted from data flow graphs (DFGs) covering a couple of instructions, which promises maximum reusability among clusters.
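The generality goal, covering as many clusters as possible within a chip area constraint, resembles a budgeted coverage problem. The sketch below uses a simple greedy heuristic (most newly covered clusters per unit area). It is an assumption for illustration, not the selection method used for 10x10, and the candidate names, areas, and cluster ids are hypothetical. Greedy selection is not guaranteed to be optimal.

```python
def select_engines(candidates, area_budget):
    """Greedy budgeted coverage over candidate micro-engines.

    candidates: dict mapping name -> (area, set of cluster ids it covers).
    Repeatedly picks the candidate that covers the most not-yet-covered
    clusters per unit area among those that still fit in the budget.
    Returns (names in pick order, covered cluster ids, area used).
    """
    chosen, covered, used = [], set(), 0
    remaining = dict(candidates)
    while remaining:
        best_name, best_score = None, 0.0
        for name, (area, clusters) in remaining.items():
            gain = len(clusters - covered)
            if gain == 0 or used + area > area_budget:
                continue  # adds nothing new, or does not fit
            score = gain / area
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break  # no remaining candidate both helps and fits
        area, clusters = remaining.pop(best_name)
        chosen.append(best_name)
        covered |= clusters
        used += area
    return chosen, covered, used


# Hypothetical candidates: (area in arbitrary units, clusters covered).
candidates = {
    "simd": (4, {0, 1, 2}),
    "bitops": (1, {3}),
    "stream": (3, {2, 4}),
    "fsm": (5, {5, 6}),
}
chosen, covered, used = select_engines(candidates, area_budget=8)
```

With a budget of 8 area units, the heuristic picks the small bit-operations engine first, then the SIMD and streaming engines, and leaves out the large FSM engine because it no longer fits.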
Flexibility: Micro-engine designs should consider both computation structures and data types to achieve better matching with clusters.
Efficient implementation: Efficient design automation techniques enable high-quality implementations of many micro-engines.
PROGRAMMABILITY: WHAT IS THE PROGRAMMING LEVEL FOR ACHIEVING EFFICIENCY?
An abstract layer:
• Parallelism abstraction: inter-node, intra-node, manycore, vector
• Data abstraction: adaptive layout for distributed and hierarchical memory
C-based programming model:
• MPI, OpenMP, OpenCL, SSE
• Programmer-controlled parallelism mapping and data distribution
• High performance at the cost of programming effort and portability
[Figure label: Hardware]
CONCLUSIONS
• Paradigm shift in architecture design inevitable.
  o 90/10 optimization not viable anymore.
  o Individual code patterns will have larger influence on architecture.
• Extreme heterogeneity needed:
  o Hybrid CPU-GPU computing not enough.
  o 10x10 marks a shift to this new paradigm for GPPs.
• Heterogeneous solutions must be usable.
  o Usability challenges become more severe as heterogeneity increases.
  o Elegant programmability solutions necessary.
ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation under award NSF OCI0-1057921 and the Defense Advanced Research Projects Agency under award HR0011-13-2-0014. The contents do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.