GPU Power Model

Nandhini Sudarsanan, Nathan Vanderby, Neeraj Mishra, Usha Vinodh, Chi Xu




Outline
- Introduction and Motivation
- Analytical Model Description
- Experiment Setup
- Results
- Conclusion and Further Work

CSCI 8205: GPU Power Model

5/4/11


Introduction
- Develop a methodology for building an accurate power model for a GPU.
- Validate with an NVIDIA GTX 480 GPU.
- Measure power efficiency of various NVIDIA SDK benchmarks.
- An accurate power model can help:
  - Explore various architectural and algorithmic trade-offs.
  - Figure out the balance of workload between GPU and CPU.


Motivation
- Power consumption: a key criterion for future hardware devices and embedded software.
- The effect of increased power density has not been felt until now:
  - Supply voltage was scaled back too.
  - Current and power density remained constant.
- Further reduction in supply voltage will be difficult in future:
  - Supply voltage is approaching the threshold voltage.
  - Gate oxide thickness is almost equal to 1 nm.


Motivation


GPU Processing Power


Price of Power
- Maximum load = a lot of power
  - Nvidia 8800 GTX: 137 W
  - Intel Xeon LS5400: 50 W


Power Wall
- Power density in GPUs is larger than in even high-end CPUs.
- Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001].
- Power gating, clock gating and other hardware-based schemes are not used in most GPUs [Kim, ISCA 2010].
- An accurate power model can help:
  - Explore various architectural and algorithmic trade-offs.
  - Figure out the balance of workload between GPU and CPU.


Background
- Power consumption can be divided into:

  $\text{Power} = \text{Dynamic\_power} + \text{Static\_power} + \text{Short\_Ckt\_Power}$

- Dynamic power is determined by run-time events:
  - Fixed-function units: texture filtering and rasterization
  - Programmable units: memory and floating point
- Static power is determined by:
  - circuit technology
  - chip layout
  - operating temperature

  $P = V_{CC} \cdot N \cdot K_{design} \cdot I_{leak}$
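The two expressions on this slide can be evaluated directly. The sketch below is only an illustration: the slide does not define the symbols, so reading N as a transistor count, K_design as a design-dependent factor and I_leak as a per-transistor leakage current is an assumption, and every numeric value is a made-up placeholder rather than a measurement.

```python
def static_power(v_cc, n, k_design, i_leak):
    """Static power from the slide: P = V_CC * N * K_design * I_leak (watts)."""
    return v_cc * n * k_design * i_leak


def total_power(dynamic_w, static_w, short_circuit_w):
    """Power = Dynamic_power + Static_power + Short_Ckt_Power (watts)."""
    return dynamic_w + static_w + short_circuit_w


# Placeholder inputs (assumptions for illustration only, not measured values).
p_static = static_power(v_cc=1.0, n=3.0e9, k_design=0.1, i_leak=1.0e-7)
p_total = total_power(dynamic_w=120.0, static_w=p_static, short_circuit_w=5.0)
print(f"static = {p_static:.1f} W, total = {p_total:.1f} W")
```

Because the static term scales with N, unused parts of the chip still contribute to it, which is the reason idle SMs matter later in the benchmark selection.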


Previous Power Models
- Statistical power modeling approach for GPUs [Matsuoka 2010]
  - Uses 13 CUDA performance counters (ld, st, branch, TLB miss) to obtain a profile.
  - Finds the correlation between profiles and power by statistical model learning.
  - A lot of information that the counters do not capture is lost.
- Cycle-level simulation based power model [Skadron, HWWS'04]
  - Assumes a hypothetical architecture to explore new GPU microarchitectures and to model power and leakage properties.
  - Cycle-level processor simulations are time consuming [Martonosi & Isci 2003].
  - Does not allow a complete view of operating system effects and I/O [Isci 2003].
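To make the counter-based statistical approach concrete, here is a minimal sketch that fits power as a linear function of counter values. The slide does not say which learning method the cited work uses, so ordinary least squares is swapped in as an assumption; the function names and the tiny synthetic data set are hypothetical.

```python
def fit_linear_power_model(profiles, powers):
    """Least-squares fit of power ~ b0 + sum_i b_i * counter_i.

    profiles: one list of counter values per kernel; powers: watts per kernel.
    Returns the coefficients [b0, b1, ..., bk].
    """
    rows = [[1.0] + list(p) for p in profiles]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, powers))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict_power(coeffs, profile):
    """Evaluate the fitted model for one counter profile."""
    return coeffs[0] + sum(b * c for b, c in zip(coeffs[1:], profile))


# Synthetic example: two counters, power generated as 50 + 10*c1 + 20*c2.
profiles = [[1, 0], [0, 1], [1, 1], [2, 1]]
powers = [60.0, 70.0, 80.0, 90.0]
coeffs = fit_linear_power_model(profiles, powers)
```

A model of this kind can only explain power through the events the counters expose, which is the limitation the slide points out.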


Outline
- Introduction and Motivation
- Analytical Model Description
  - Parser
  - Power Model
- Experiment Setup
- Results
- Conclusion and Further Work


Need for a Parser
- GPGPU-Sim output is not tailored to our needs.
- GPGPU-Sim is time consuming.
- The parser is very fast.
- GPGPU-Sim only supports CUDA 2.3 or prior.


Limitations of the Parser
- Dynamic loops are not automatically determined.
- Branches are assumed to be taken.
- Highly tailored to our specific needs.
- A change in the PTX layout might require changes to the parser.
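The limitations above are easier to see in a small example. The sketch below is not the authors' parser; it is a hypothetical, simplified opcode counter that mirrors two of the listed constraints: loop trip counts must be supplied by hand (the `loop_trip_counts` argument is an assumption), and every branch is counted as taken.

```python
import re
from collections import Counter

# Optional predicate guard (@%p1 or @!%p1), then the opcode with its suffixes.
_INSTR = re.compile(r"^(?:@!?%\w+\s+)?([a-z][\w.]*)")


def count_ptx_opcodes(ptx_text, loop_trip_counts=None):
    """Count executed base opcodes, scaling loop bodies by given trip counts."""
    trips = loop_trip_counts or {}
    counts = Counter()
    open_loops = []   # stack of (label, multiplier in effect before the loop)
    multiplier = 1
    for raw in ptx_text.splitlines():
        line = raw.split("//")[0].strip()
        if not line or line.startswith((".", "{", "}")):
            continue                          # directives, braces, blank lines
        if line.endswith(":"):                # a label may open a known loop
            label = line[:-1]
            if label in trips:
                open_loops.append((label, multiplier))
                multiplier *= trips[label]
            continue
        match = _INSTR.match(line)
        if not match:
            continue
        opcode = match.group(1).split(".")[0]  # add.f32 -> add
        counts[opcode] += multiplier           # branches counted as taken
        if opcode == "bra" and open_loops:
            target = line.rstrip(";").split()[-1]
            if target == open_loops[-1][0]:    # back edge closes the loop
                multiplier = open_loops.pop()[1]
    return counts


sample = """
.entry k
{
    .reg .f32 %f<3>;
    mov.u32 %r1, 0;
LOOP:
    add.f32 %f1, %f1, %f2;
    mul.f32 %f2, %f2, %f1;
    add.u32 %r1, %r1, 1;
    setp.lt.u32 %p1, %r1, 100;
    @%p1 bra LOOP;
    ret;
}
"""
counts = count_ptx_opcodes(sample, {"LOOP": 100})
```

The regular expression and the label handling depend on the textual layout of the PTX, so a change in that layout would require changes here as well.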


Fermi Architecture: sm_20
- Memory hierarchy
  - PCIe & RAM
  - L2 cache
  - L1 cache
  - Shared memory
  - Registers
- Streaming Multiprocessor
  - 32 ALUs, 32 FPUs, 4 SFUs
  - 2 pipelines, 16-24 stages
  - 2 warp schedulers, 2 instructions/cycle


Factors in the Power Model
- Temperature
- Number of SMs




Power Model
- PTX level


Power Model
- Assembly level


Experiment Setup - Hardware
- Measure power consumption and temperature
  - Sample temperature at 10 Hz from the GPU sensor
  - Current clamp for the PCIe and GPU power cable
  - Data acquisition card at 100 Hz
- GPU performance counters
  - Profile 57 counters per kernel
  - 9 executions
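One way the clamp readings could be turned into power figures is sketched below. It assumes both measured paths sit on a 12 V rail and ignores any other supply rails; that voltage, the function names and the sample values are assumptions for illustration and do not come from the slides. Only the 100 Hz and 10 Hz rates are taken from the setup above.

```python
def average_power_and_energy(pcie_a, cable_a, sample_rate_hz=100.0,
                             rail_voltage_v=12.0):
    """Return (average watts, joules) from two equally long current traces."""
    total_w = [rail_voltage_v * (a + b) for a, b in zip(pcie_a, cable_a)]
    avg_w = sum(total_w) / len(total_w)
    # Rectangle rule: each sample stands for 1 / sample_rate_hz seconds.
    energy_j = sum(total_w) / sample_rate_hz
    return avg_w, energy_j


def temperature_per_power_sample(temps_c, power_count,
                                 temp_rate_hz=10.0, power_rate_hz=100.0):
    """Repeat each temperature reading for the power samples in its interval."""
    ratio = int(power_rate_hz / temp_rate_hz)
    last = len(temps_c) - 1
    return [temps_c[min(i // ratio, last)] for i in range(power_count)]


# One second of made-up data: 2 A on the first path, 10 A on the second.
pcie = [2.0] * 100
cable = [10.0] * 100
avg_w, energy_j = average_power_and_energy(pcie, cable)
temps = temperature_per_power_sample([60 + i for i in range(10)], 100)
```

Holding each temperature reading for ten power samples is the simplest alignment between the two rates; interpolation would be an alternative.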


Experiment Setup - Software
- Driver API
  - Generate and modify PTX code
  - Minimize control loops
  - Stress one type of PTX instruction per kernel, over 95%
  - 76 kernels
  - Wisely choose block and grid size
- CUDA 4.0
  - Built-in binary -> assembly converter (cuobjdump)
- Timer interrupt to collect temperature
- Remote login
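The idea of a kernel that stresses a single PTX instruction type can be illustrated with a small generator. This is a hypothetical sketch, not the authors' tooling: the kernel name, the register layout and the header directives are assumptions, and the generated text has not been checked against `ptxas`. A real toolchain might also remove the repeated instructions as dead code unless the result is stored.

```python
def make_stress_kernel(instruction, repeats, name="stress_kernel"):
    """Return PTX text whose body repeats one instruction `repeats` times."""
    header = [
        ".version 2.3",          # assumed PTX version for this sketch
        ".target sm_20",
        f".entry {name}",
        "{",
        "    .reg .f32 %f<3>;",
        "    mov.f32 %f1, 0f3F800000;   // 1.0",
        "    mov.f32 %f2, 0f3F800000;   // 1.0",
    ]
    body = [f"    {instruction}"] * repeats   # unrolled, no control loop
    footer = ["    ret;", "}"]
    return "\n".join(header + body + footer) + "\n"


def stressed_fraction(ptx_text, opcode):
    """Fraction of instructions in the text whose base opcode is `opcode`."""
    instructions = []
    for raw in ptx_text.splitlines():
        line = raw.split("//")[0].strip()
        if line.endswith(";") and not line.startswith("."):
            instructions.append(line)
    hits = sum(1 for line in instructions
               if line.split(".")[0].split()[0] == opcode)
    return hits / len(instructions)


ptx = make_stress_kernel("add.f32 %f1, %f1, %f2;", repeats=100)
fraction = stressed_fraction(ptx, "add")
```

With 100 repeats and three overhead instructions, the stressed instruction makes up 100 of 103 instructions, which is above the 95% target mentioned on the slide.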


Limitations of PTX
- Higher level than assembly
  - 30 out of 76 PTX instructions take multiple assembly instructions
  - Divide, sqrt, etc.: 1 PTX line, a library routine in assembly
- Compiler optimizations from PTX -> assembly
- Doesn't reflect RAW dependencies
- Performance counter results are based on assembly


Benchmarks
- Small number of overhead operations (loop counters, initialization, etc.).
- Computationally intensive work, so that an experiment runs long enough for accurate current measurement.
- High utilization of the CUDA cores, with as few data hazards as possible.
- Grid and block sizes chosen so that all SMs are used, since idle SMs leak.
- Accordingly, 7 benchmarks were selected from the CUDA SDK.
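The point about using all SMs can be sketched as a launch-size helper. The GTX 480 has 15 SMs; rounding the block count up to a multiple of that number is a heuristic assumed here for illustration, not a rule stated in the slides, and the function name is hypothetical. How blocks are actually distributed across SMs is decided by the hardware scheduler.

```python
import math

GTX480_SM_COUNT = 15   # GTX 480: 480 CUDA cores / 32 cores per SM


def grid_size_covering_all_sms(num_elements, block_size,
                               sm_count=GTX480_SM_COUNT):
    """Blocks needed to cover the data, rounded up to a multiple of sm_count."""
    blocks_needed = math.ceil(num_elements / block_size)
    return math.ceil(blocks_needed / sm_count) * sm_count


grid = grid_size_covering_all_sms(1_000_000, 256)   # 3907 blocks -> 3915
small = grid_size_covering_all_sms(100, 256)        # 1 block -> 15
```

Threads in the padding blocks fall outside the data, so a kernel launched this way needs a bounds check on its global index.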


Benchmarks
- Our benchmarks
  - 2D convolution
  - Matrix Multiplication
  - Vector Addition
  - Vector Reduction
  - Scalar Product
  - DCT 8x8
  - 3DFD


Results


Conclusion and Further Work
- Conclusion
- Further Work
  - Take into account context switches
  - Consider multiple kernels running simultaneously


The End
Thanks
Q&A
