A hardware and compiler assisted programmable warp scheduler for GPGPUs
Lifeng Liu1, Meilin Liu1, Chongjun Wang2
1. Wright State University, Department of Computer Science and Engineering
2. Nanjing University, Department of Computer Science and Technology

Introduction

Condition 1 & 3:

Condition 2:

u Propose a programmable warp scheduler. u Estimate the best size of high priority warp group by a compiler framework to avoid L1 cache thrashing. u Develop a compiler framework to analyze the L1 cache foot prints of a specified GPU program using an extended polyhedron model and insert scheduler control instructions at proper locations. u The hardware overhead of our programmable warp scheduler is small.

(5) (4)

- w, w’: warp IDs of M and M’

Cache block condition:

(6)

- gdimy, gdimx: grid size
- bdimy, bdimx: block size
- end0, end1, …: loop boundaries

Our Contributions
- Develop the hardware module of a programmable warp scheduler for GPUs
- Add two PTX instructions to control the behavior of the warp scheduler
- Propose an extended polyhedron model for GPU programs
- Develop a compiler framework based on the extended polyhedron model to insert scheduler control instructions automatically in GPU programs
- Evaluate the performance of our programmable warp scheduler on GPGPU-sim

Motivation Example

- β, β’: cache block index
- γ, γ’: offset inside the cache block

The number of cache blocks occupied by an m-dimensional array access can be estimated with (7), where
- E: the polyhedron defined by (3), (4), (5) and (6)
- F: the polyhedron defined by (3), (4) and (5)
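To make the notion of an occupied cache block concrete, the following is a minimal brute-force sketch. It is not the polyhedral estimate (7): it simply enumerates every access of a toy kernel and counts distinct cache block indexes. The access pattern A[i0][tid + i1], the function name and all parameter values (warp size, row length, element size, block size, loop bounds) are assumptions chosen for illustration, and the split of a byte address into a block index β and an offset γ is the usual address decomposition rather than a formula taken from the poster.

```python
def cache_blocks_touched(theta, warp_size=32, row_len=1024,
                         elem_bytes=4, block_bytes=128, taps=3):
    """Count distinct cache blocks touched by `theta` warps, by enumeration.

    Hypothetical kernel: thread `tid` reads A[i0][tid + i1] for
    i0, i1 in range(taps), with A stored row-major.
    """
    blocks = set()
    for w in range(theta):                      # warp ID
        for lane in range(warp_size):           # thread within the warp
            tid = w * warp_size + lane
            for i0 in range(taps):              # loop indexes
                for i1 in range(taps):
                    addr = (i0 * row_len + tid + i1) * elem_bytes
                    # beta: cache block index, gamma: offset inside the block
                    beta, gamma = divmod(addr, block_bytes)
                    blocks.add(beta)            # gamma is not needed for counting
    return len(blocks)


print(cache_blocks_touched(1))  # 6
print(cache_blocks_touched(2))  # 9
```

With these assumed parameters each extra warp widens the accessed column range by one 128-byte block per row, so the count grows by 3 per warp. Enumeration like this scales with the iteration space, which is one reason a closed-form estimate is attractive.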

With Oθ obtained, we initialize θ to 1 and increase it by 1 at a time until Oθ exceeds the total number of cache blocks in the L1 cache. We then record the final value as the best high priority warp group size. The estimated best high priority warp group size for the ‘conv’ benchmark is reported in Figure 1.
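The search over θ described above can be sketched as a short loop. This is a minimal sketch under stated assumptions: `footprint` is a hypothetical callable standing in for the estimate Oθ, `max_warps` is an assumed upper bound on the group size, and the sketch interprets "the final value" as the largest θ whose estimated footprint still fits in the L1 cache.

```python
from typing import Callable


def best_group_size(footprint: Callable[[int], int],
                    l1_blocks: int, max_warps: int) -> int:
    """Return the largest theta whose estimated footprint fits in L1.

    footprint(theta) stands in for O_theta (hypothetical callable).
    Starts from theta = 1 and grows by 1, as in the text; if even
    theta = 1 does not fit, 1 is still returned as the minimum group.
    """
    theta = 1
    while theta < max_warps and footprint(theta + 1) <= l1_blocks:
        theta += 1
    return theta


# Illustrative numbers only: 40 blocks per warp, 128 blocks of L1 cache.
print(best_group_size(lambda t: 40 * t, l1_blocks=128, max_warps=48))  # 3
```

In this example the footprints for θ = 1, 2, 3 are 40, 80 and 120 blocks, which fit, while θ = 4 needs 160 blocks and does not, so the loop stops at 3.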

Evaluation

Test platform: GPGPU-sim, developed by Bakhoda et al. [1]. The compiler framework is based on the polyhedron compiler framework PLUTO [2].

Group size estimation accuracy:

Figure 1. Performance vs. group size. (The simulation result is obtained by running a 2D convolution program on the GPGPU simulator configured with 8K, 16K and 32K L1 cache.)

Due to the performance cliff phenomenon, it is important to select the best high priority group size.

Compiler Assisted Programmable Warp Scheduler

Table 1. Estimation accuracy

The performance of our proposed programmable warp scheduler (PWS) is compared with four existing warp scheduling techniques: loose round-robin (LRR), two-level [4], greedy-then-oldest (GTO) [3] and CCWS [5] on selected benchmark GPU kernels.

In the extended polyhedron model, an instance of a statement nested in an n-level loop can be represented by its execution vector (1), where bidy and bidx represent block indexes, tidy and tidx represent thread indexes, and i0, i1, … represent loop indexes.

An m-dimensional array access can be represented as a memory access vector (2), where α1, α2, … are linear combinations of e (3).

Definition: When self-eviction occurs, two memory accesses M and M’ must be concurrent memory accesses that meet the following conditions:
1. M and M’ are issued by the same statement.
2. M and M’ are issued by warps located in the high priority warp group.
3. M and M’ are issued simultaneously.
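As an illustration of the memory access vector described above, the sketch below evaluates each index αk as a linear combination of the entries of e, taking e to be the execution vector. The layout of e (including a trailing constant 1 for affine offsets), the 16x16 thread block size and the example access are all assumptions made for this sketch. They are not taken from equations (1) to (3).

```python
from typing import Sequence, Tuple


def access_vector(coeffs: Sequence[Sequence[int]],
                  e: Sequence[int]) -> Tuple[int, ...]:
    """Evaluate each array index alpha_k as a linear combination of e."""
    return tuple(sum(c * x for c, x in zip(row, e)) for row in coeffs)


# Assumed execution vector layout: (bidy, bidx, tidy, tidx, i0, i1, 1).
e = (1, 2, 3, 4, 1, 2, 1)

# Hypothetical 2D access A[16*bidy + tidy + i0][16*bidx + tidx + i1],
# i.e. one coefficient row per array dimension.
coeffs = [
    (16, 0, 1, 0, 1, 0, 0),   # alpha_1
    (0, 16, 0, 1, 0, 1, 0),   # alpha_2
]

M = access_vector(coeffs, e)
print(M)  # (20, 38)
```

Writing the access as a coefficient matrix applied to e is what lets a compiler reason about all instances of a statement at once instead of enumerating them.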


Figure 2. Speedups

References
[1] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. ISPASS 2009.
[2] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. PLDI ’08.
[3] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. SIGARCH Comput. Archit. News, 39(3):235–246, June 2011.
[4] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. MICRO-44.
[5] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. MICRO-45, pages 72–83, Washington, DC, USA, 2012. IEEE Computer Society.
