GPU Power Model
Nandhini Sudarsanan
Nathan Vanderby
Neeraj Mishra
Usha Vinodh
Chi Xu
Outline
- Introduction and Motivation
- Analytical Model Description
- Experiment Setup
- Results
- Conclusion and Further Work
CSCI 8205: GPU Power Model
5/4/11
Introduction
- Develop a methodology for building an accurate power model for a GPU.
- Validate the model on an NVIDIA GTX 480 GPU.
- Measure the power efficiency of various NVIDIA SDK benchmarks.
- An accurate power model can help:
  - Explore architectural and algorithmic trade-offs.
  - Determine how to balance the workload between the GPU and the CPU.
Motivation
- Power consumption is a key criterion for future hardware devices and embedded software.
- The effect of increased power density has not been felt until now:
  - Supply voltage was scaled back as well.
  - Current and power density remained constant.
- Further reduction in supply voltage will be difficult:
  - Supply voltage is approaching the threshold voltage.
  - Gate oxide thickness is almost 1 nm.
Motivation
GPU Processing Power
Price of Power
- Maximum load = a lot of power
  - NVIDIA 8800 GTX: 137 W
  - Intel Xeon LS5400: 50 W
Power Wall
- Power density in GPUs is larger than even in high-end CPUs.
- Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001].
- Power gating, clock gating, and other hardware-based schemes are not used in most GPUs [Kim, ISCA 2010].
- An accurate power model can help:
  - Explore architectural and algorithmic trade-offs.
  - Determine how to balance the workload between the GPU and the CPU.
Background
- Power consumption can be divided into:
  Power = Dynamic_power + Static_power + Short_Ckt_Power
- Dynamic power is determined by run-time events:
  - Fixed-function units: texture filtering and rasterization
  - Programmable units: memory and floating point
- Static power is determined by:
  - circuit technology
  - chip layout
  - operating temperature
  P = VCC * N * Kdesign * Ileak
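The two formulas on this slide can be exercised with a few lines of arithmetic. The sketch below is only an illustration: the slide does not define the symbols, so reading N as the transistor count, Kdesign as a design-dependent factor, and Ileak as a per-transistor leakage current is an assumption, and every number is hypothetical rather than measured.

```python
def static_power(vcc, n_transistors, k_design, i_leak):
    """Static power in watts: P = VCC * N * Kdesign * Ileak."""
    return vcc * n_transistors * k_design * i_leak


def total_power(dynamic, static, short_circuit):
    """Total power as the sum of the three components named on the slide."""
    return dynamic + static + short_circuit


# Hypothetical values, chosen only to exercise the arithmetic.
p_static = static_power(vcc=1.0, n_transistors=3.0e9, k_design=0.5, i_leak=2.0e-8)
p_total = total_power(dynamic=120.0, static=p_static, short_circuit=5.0)
print(f"static = {p_static:.1f} W, total = {p_total:.1f} W")
```

Because Ileak rises with operating temperature, a real model would probably treat it as a function of the sampled temperature rather than as a constant.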
Previous Power Models
- Statistical power modeling approach for GPUs [Matsuoka 2010]
  - Uses 13 CUDA performance counters (ld, st, branch, TLB miss) to obtain a profile.
  - Finds the correlation between profiles and power by statistical model learning.
  - Information not captured by the counters is lost.
- Cycle-level simulation-based power model [Skadron, HWWS'04]
  - Assumes a hypothetical architecture to explore new GPU microarchitectures and model power and leakage properties.
  - Cycle-level processor simulations are time-consuming [Martonosi & Isci 2003].
  - They do not allow a complete view of operating system effects or I/O [Isci 2003].
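To make "statistical model learning" concrete, one simple form it could take is a linear least-squares fit from counter profiles to measured power. This is a sketch under that assumption, not the cited work's actual model; the function names, the two-counter profiles, and the power values are all made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_power_model(profiles, powers):
    """Least-squares fit of power = w0 + sum_i w_i * counter_i."""
    rows = [[1.0] + list(p) for p in profiles]  # leading 1.0 is the intercept
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, powers)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up profiles (two counters per kernel) and made-up power readings
# generated from power = 50 + 2*c0 + 3*c1, so the fit should recover them.
profiles = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
powers = [50.0 + 2.0 * c0 + 3.0 * c1 for c0, c1 in profiles]
weights = fit_power_model(profiles, powers)
```

The limitation noted on the slide shows up directly here: any activity that no counter records has no column in the profile, so the fit cannot attribute power to it.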
Outline
- Introduction and Motivation
- Analytical Model Description
  - Parser
  - Power Model
- Experiment Setup
- Results
- Conclusion and Further Work
Need for a Parser
- GPGPU-Sim output is not tailored to our needs.
- GPGPU-Sim is time-consuming.
- The parser is very fast.
- GPGPU-Sim supports only CUDA 2.3 or earlier.
Limitations of the Parser
- Dynamic loops are not determined automatically.
- Branches are assumed to be taken.
- Highly tailored to our specific needs.
- A change in the PTX layout might require a change to the parser.
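The slides do not show the parser itself, so the following is only a minimal sketch of what a static PTX opcode counter might look like. The function name, the one-statement-per-line assumption, and the sample fragment are all hypothetical; the fragment is illustrative text, not a complete kernel.

```python
import re
from collections import Counter

# Optional predicate guard in front of an instruction, e.g. "@%p1" or "@!%p1".
PREDICATE = re.compile(r"^@!?%\w+\s+")


def count_ptx_opcodes(ptx_text):
    """Count opcodes in PTX text, assuming one statement per line."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        line = raw.split("//")[0].strip()  # drop trailing comments
        if not line or line in ("{", "}"):
            continue
        if line.startswith(".") or line.endswith(":"):
            continue  # directives (.reg, .entry, ...) and labels
        line = PREDICATE.sub("", line)
        opcode = line.split()[0].rstrip(";")
        counts[opcode] += 1
    return counts


SAMPLE = """
.visible .entry k()
{
    .reg .f32 %f<4>;
    mov.f32 %f1, 0f3F800000;   // 1.0
BB0_1:
    add.f32 %f2, %f1, %f1;
    add.f32 %f3, %f2, %f1;
    @%p1 bra BB0_1;
    ret;
}
"""

counts = count_ptx_opcodes(SAMPLE)
```

This counter sees each instruction once no matter how many times the loop body executes, which is the first limitation listed above: trip counts for dynamic loops would have to be supplied by hand.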
Fermi Architecture: sm_20
- Memory Hierarchy
  - PCIe & RAM
  - L2 Cache
  - L1 Cache
  - Shared Memory
  - Registers
- Streaming Multiprocessor (SM)
  - 32 ALUs, 32 FPUs, 4 SFUs
  - 2 pipelines, 16-24 stages
  - 2 warp schedulers, 2 instructions/cycle
Factors in the Power Model
- Temperature
- Number of SMs
Power Model PTX Level
Power Model Assembly Level
Experiment Setup - Hardware
- Measure power consumption and temperature
  - Sample temperature at 10 Hz from the GPU sensor
  - Current clamp for the PCIe and GPU power cables
  - Data acquisition card at 100 Hz
- GPU performance counter profile
  - 57 counters per kernel
  - 9 executions
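Current-clamp samples have to be turned into power and energy before they can be compared with a model. A minimal sketch of that conversion follows, assuming a constant 12 V supply rail (the slides do not state the voltage) and using made-up current readings; the function names are hypothetical.

```python
def average_power(current_samples_a, rail_voltage_v):
    """Mean power in watts over the window, assuming a constant rail voltage."""
    if not current_samples_a:
        raise ValueError("need at least one current sample")
    mean_current = sum(current_samples_a) / len(current_samples_a)
    return rail_voltage_v * mean_current


def energy_joules(current_samples_a, rail_voltage_v, sample_rate_hz):
    """Energy as a rectangle-rule sum of V * I * dt over all samples."""
    dt = 1.0 / sample_rate_hz
    return sum(rail_voltage_v * i * dt for i in current_samples_a)


# Made-up readings: one second at 100 Hz, alternating between 10 A and 12 A.
samples = [10.0, 12.0] * 50
p_avg = average_power(samples, rail_voltage_v=12.0)
e_total = energy_joules(samples, rail_voltage_v=12.0, sample_rate_hz=100.0)
```

At 100 Hz a kernel that runs for only a few milliseconds yields no usable samples, which is consistent with the later requirement that benchmarks run long enough for accurate current measurement.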
Experiment Setup - Software
- Driver API
  - Generate and modify PTX code
  - Minimize control loops
  - Stress one type of PTX instruction per kernel (over 95% of instructions)
  - 76 kernels
  - Choose block and grid sizes wisely
- CUDA 4.0
  - Built-in binary-to-assembly converter (cuobjdump)
- Timer interrupt to collect temperature
- Remote login
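The "over 95%" target for a stress kernel can be checked mechanically while the PTX is being generated. The sketch below builds a kernel body as a list of PTX-style lines and reports the stressed fraction. The helper names and the fragment are hypothetical, and register declarations and the kernel entry are omitted, so this is not a loadable kernel.

```python
def make_stress_body(prologue, instruction, repeats, epilogue):
    """Return body lines: setup, one instruction repeated, then teardown."""
    return list(prologue) + [instruction] * repeats + list(epilogue)


def stressed_fraction(body, instruction):
    """Fraction of body lines that are the stressed instruction."""
    return body.count(instruction) / len(body)


# Illustrative fragment only; not a complete, loadable PTX kernel.
prologue = ["mov.f32 %f1, 0f3F800000;", "mov.f32 %f2, 0f40000000;"]
epilogue = ["ret;"]
stressed = "add.f32 %f1, %f1, %f2;"

body = make_stress_body(prologue, stressed, repeats=97, epilogue=epilogue)
fraction = stressed_fraction(body, stressed)
print(f"{len(body)} lines, {fraction:.0%} are the stressed instruction")
```

Unrolling the repeated instruction instead of looping over it keeps control overhead low. The repeated line here forms a read-after-write chain on %f1, so a real generator would probably rotate destination registers to limit data hazards.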
Limitations of PTX
- Higher level than assembly
  - 30 out of 76 PTX instructions take multiple assembly instructions
  - Divide, sqrt, etc.: one PTX line, a library routine in assembly
- Compiler optimizations from PTX to assembly
- Doesn’t reflect RAW dependencies
- Performance counter results are based on assembly
Benchmarks
- Small number of overhead operations (loop counters, initialization, etc.).
- Computationally intensive work, so that an experiment runs long enough for accurate current measurement.
- High utilization of the CUDA cores, with as few data hazards as possible.
- Grid and block sizes chosen so that all SMs are used, since idle SMs leak.
- Accordingly, 7 benchmarks were selected from the CUDA SDK.
Benchmarks
- Our benchmarks
  - 2D convolution
  - Matrix Multiplication
  - Vector Addition
  - Vector Reduction
  - Scalar Product
  - DCT 8x8
  - 3DFD
Results
Conclusion and Further Work
- Conclusion
- Further Work
  - Take context switches into account
  - Consider multiple kernels running simultaneously
The End
- Thanks
- Q&A