GPU Power Model

Nandhini Sudarsanan, Nathan Vanderby, Neeraj Mishra, Usha Vinodh, Chi Xu




Outline
- Introduction and Motivation
- Analytical Model Description
- Experiment Setup
- Results
- Conclusion and Further Work

CSCI 8205: GPU Power Model

5/4/11


Introduction
- Develop a methodology for building an accurate power model for a GPU.
- Validate it on NVIDIA's GTX 480 GPU.
- Measure the power efficiency of various NVIDIA SDK benchmarks.
- An accurate power model can help:
  - Explore architectural and algorithmic trade-offs.
  - Determine how to balance the workload between GPU and CPU.


Motivation
- Power consumption is a key criterion for future hardware devices and embedded software.
- The effect of increased power density has not been felt until now:
  - Supply voltage was scaled down as well.
  - Current and power density remained constant.
- Further reduction in supply voltage will be difficult:
  - Supply voltage is approaching the threshold voltage.
  - Gate oxide thickness is already close to 1 nm.


Motivation


GPU Processing Power


Price of Power
- Maximum load = a lot of power
  - NVIDIA 8800 GTX: 137 W
  - Intel Xeon LS5400: 50 W


Power Wall
- Power density in GPUs is larger than in even high-end CPUs.
- Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001].
- Power gating, clock gating, and other hardware-based schemes are not used in most GPUs [Kim, ISCA 2010].
- An accurate power model can help:
  - Explore architectural and algorithmic trade-offs.
  - Determine how to balance the workload between GPU and CPU.


Background
- Power consumption can be divided into:
  Power = Dynamic_power + Static_power + Short_Ckt_Power
- Dynamic power is determined by run-time events in:
  - Fixed-function units: texture filtering and rasterization
  - Programmable units: memory and floating point
- Static power is determined by:
  - Circuit technology
  - Chip layout
  - Operating temperature
  P_static = VCC * N * Kdesign * Ileak
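The two formulas on this slide can be evaluated directly. The sketch below is a minimal illustration, assuming the usual reading of the symbols (VCC as supply voltage, N as transistor count, Kdesign as a design-dependent constant, Ileak as per-transistor leakage current). The function names and all numeric values are placeholders chosen for illustration, not measurements from this work.

```python
def static_power(vcc, n_transistors, k_design, i_leak):
    """Static power in watts: P = VCC * N * Kdesign * Ileak."""
    return vcc * n_transistors * k_design * i_leak


def total_power(dynamic, static, short_circuit):
    """Total power as the sum of the three components on the slide."""
    return dynamic + static + short_circuit


# Placeholder inputs (assumptions, not measured values).
p_static = static_power(vcc=1.0, n_transistors=3.0e9, k_design=0.5, i_leak=2.0e-8)
p_total = total_power(dynamic=120.0, static=p_static, short_circuit=5.0)
print(f"static: {p_static:.1f} W, total: {p_total:.1f} W")
```

Because Ileak grows with temperature, the static term is the one that links the temperature measurements to the power model.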


Previous Power Models
- Statistical power modeling approach for GPUs [Matsuoka 2010]
  - Uses 13 CUDA performance counters (ld, st, branch, TLB miss) to obtain a profile.
  - Finds the correlation between profiles and power by statistical model learning.
  - Much information that the counters do not capture is lost.
- Cycle-level simulation-based power model [Skadron, HWWS'04]
  - Assumes a hypothetical architecture to explore new GPU microarchitectures and to model power and leakage properties.
  - Cycle-level processor simulations are time consuming [Martonosi & Isci 2003].
  - Does not allow a complete view of operating system effects or I/O [Isci 2003].


Outline
- Introduction and Motivation
- Analytical Model Description
  - Parser
  - Power Model
- Experiment Setup
- Results
- Conclusion and Further Work


Need for a Parser
- GPGPU-Sim output is not tailored to our needs.
- GPGPU-Sim is time consuming.
- The parser is very fast.
- GPGPU-Sim supports only CUDA 2.3 or earlier.


Limitations of the Parser
- Dynamic loops are not determined automatically.
- Branches are assumed to be taken.
- Highly tailored to our specific needs.
- A change in the PTX layout might require changes to the parser.
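The slides do not show the parser itself, so the following is only a hypothetical sketch of what a PTX opcode counter with these limitations could look like. The function name `count_ptx_opcodes`, the `loop_trips` argument, and the sample PTX are all assumptions made for illustration. Trip counts must be supplied by hand (dynamic loops are not detected), every branch is counted as taken, and nested loops are not handled.

```python
import re
from collections import Counter

# Optional predicate guard (e.g. "@%p1" or "@!%p1"), then the opcode with its suffixes.
_INSTR = re.compile(r"^(?:@!?%\w+\s+)?([a-z][\w.]*)")


def count_ptx_opcodes(ptx_text, loop_trips=None):
    """Count PTX opcodes; a loop body is weighted by a user-supplied trip count.

    A loop body runs from a label listed in loop_trips to the branch back to it.
    """
    loop_trips = loop_trips or {}
    counts = Counter()
    label, weight = None, 1
    for raw in ptx_text.splitlines():
        line = raw.split("//")[0].strip()      # drop comments
        if not line or line[0] in ".{}":       # skip directives and braces
            continue
        if line.endswith(":"):                 # label: look up its trip count
            label = line[:-1]
            weight = loop_trips.get(label, 1)
            continue
        match = _INSTR.match(line)
        if not match:
            continue
        opcode = match.group(1)
        counts[opcode] += weight               # branches are assumed taken
        if opcode.startswith("bra") and label is not None and label in line:
            label, weight = None, 1            # back-edge closes the loop body
    return counts


sample = """
.entry k (.param .u64 out) {
    .reg .f32 %f<3>;
    mov.f32 %f1, 0f3F800000;   // 1.0
$Loop:
    add.f32 %f1, %f1, %f2;
    add.f32 %f1, %f1, %f2;
    setp.lt.s32 %p1, %r1, %r2;
    @%p1 bra $Loop;
    ret;
}
"""
counts = count_ptx_opcodes(sample, loop_trips={"$Loop": 100})
print(counts)
```

A regular-expression pass like this is fast because it never simulates execution, which is also why it depends on the PTX text layout staying stable.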


Fermi Architecture: sm_20
- Memory hierarchy
  - PCIe and RAM
  - L2 cache
  - L1 cache
  - Shared memory
  - Registers
- Streaming multiprocessor (SM)
  - 32 ALUs, 32 FPUs, 4 SFUs
  - 2 pipelines, 16-24 stages
  - 2 warp schedulers, 2 instructions/cycle


Factors in the Power Model
- Temperature
- Number of SMs




Power Model
- PTX level


Power Model
- Assembly level
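The PTX-level and assembly-level models appear only as figures in the deck, so their exact form is not recoverable here. As a rough illustration of how an instruction-level power model is commonly structured, the sketch below assumes a linear form: average power equals static power plus the summed per-instruction energy divided by run time. The function name, the per-instruction energies, and all numbers are hypothetical and are not taken from this work.

```python
def estimate_power(counts, energy_per_instr_nj, runtime_s, static_power_w):
    """Average power (W) under an assumed linear instruction-energy model."""
    dynamic_j = sum(
        n * energy_per_instr_nj[op] * 1e-9  # nJ -> J
        for op, n in counts.items()
    )
    return static_power_w + dynamic_j / runtime_s


# Hypothetical inputs: executed instruction counts and energy per instruction.
counts = {"add.f32": 2_000_000_000, "ld.global.f32": 100_000_000}
energy_nj = {"add.f32": 1.0, "ld.global.f32": 20.0}

power_w = estimate_power(counts, energy_nj, runtime_s=0.1, static_power_w=40.0)
print(f"estimated average power: {power_w:.1f} W")
```

In a model of this shape, the per-instruction energies would be fitted from microbenchmarks that each stress one instruction type, and the static term would carry the dependence on temperature and on the number of SMs.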


Experiment Setup - Hardware
- Measure power consumption and temperature
  - Sample temperature at 10 Hz from the GPU sensor
  - Current clamp on the PCIe and GPU power cables
  - Data acquisition card sampling at 100 Hz
- GPU performance counters
  - Profile 57 counters per kernel
  - 9 executions
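Current samples from the clamp have to be converted to power before they can be compared with a model. The sketch below shows one way to do that, assuming a constant 12 V supply rail and evenly spaced samples at the 100 Hz acquisition rate. The rail voltage, the function names, and the sample values are assumptions for illustration only.

```python
def average_power(current_samples_a, rail_voltage_v=12.0):
    """Mean power in watts from current samples on one supply rail."""
    return rail_voltage_v * sum(current_samples_a) / len(current_samples_a)


def energy_joules(current_samples_a, sample_rate_hz=100.0, rail_voltage_v=12.0):
    """Energy in joules, treating each sample as constant over one sample period."""
    dt = 1.0 / sample_rate_hz
    return sum(rail_voltage_v * i * dt for i in current_samples_a)


# Made-up current trace in amperes (four samples = 40 ms at 100 Hz).
trace = [10.0, 12.0, 14.0, 12.0]
print(average_power(trace), energy_joules(trace))
```

At 100 Hz a kernel must run for a noticeable fraction of a second to yield more than a handful of samples, which is why the benchmarks need to be long-running.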


Experiment Setup - Software
- Driver API
  - Generate and modify PTX code
  - Minimize control loops
  - Stress one type of PTX instruction per kernel (over 95% of instructions)
  - 76 kernels
  - Choose block and grid sizes wisely
- CUDA 4.0
  - Built-in binary-to-assembly converter (cuobjdump)
- Timer interrupt to collect temperature
- Remote login
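The 95% target can be checked mechanically once a kernel's instruction mix is known. The sketch below is a hypothetical helper, not the authors' tooling: it builds an opcode list from a stressed instruction plus a fixed set of overhead instructions and reports the share of the dominant opcode. The names `unrolled_body` and `dominant_fraction` and the overhead opcodes are assumptions.

```python
from collections import Counter


def unrolled_body(instr, repeats, overhead):
    """Opcode list: the overhead instructions plus `repeats` copies of `instr`."""
    return list(overhead) + [instr] * repeats


def dominant_fraction(opcodes):
    """Return the most frequent opcode and its share of all instructions."""
    opcode, n = Counter(opcodes).most_common(1)[0]
    return opcode, n / len(opcodes)


# Assumed loop and setup overhead for one microbenchmark kernel.
overhead = ["mov.u32", "setp.lt.s32", "bra", "ret"]
body = unrolled_body("mul.f32", repeats=96, overhead=overhead)
opcode, share = dominant_fraction(body)
print(opcode, share, share > 0.95)
```

Unrolling the stressed instruction is what keeps the control overhead small; with four overhead instructions, at least 77 copies are needed to exceed 95%.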


Limitations of PTX
- Higher level than assembly
  - 30 of the 76 PTX instructions map to multiple assembly instructions
  - Divide, sqrt, etc.: one PTX line, a library routine in assembly
- Compiler optimizations from PTX to assembly
- Doesn't reflect RAW dependencies
- Performance counter results are based on assembly


Benchmarks
- Small number of overhead operations (loop counters, initialization, etc.).
- Computationally intensive work, so that each experiment runs long enough for accurate current measurement.
- High utilization of the CUDA cores, with as few data hazards as possible.
- Grid and block sizes chosen so that all SMs are used, since idle SMs leak.
- Accordingly, 7 benchmarks were selected from the CUDA SDK.


Benchmarks
- Our benchmarks
  - 2D convolution
  - Matrix multiplication
  - Vector addition
  - Vector reduction
  - Scalar product
  - DCT 8x8
  - 3DFD


Results


Conclusion and Further Work
- Conclusion
- Further work
  - Take context switches into account
  - Consider multiple kernels running simultaneously


The End Thanks Q&A
