A Hardware and Compiler Assisted Programmable Warp Scheduler for GPGPUs
Lifeng Liu¹, Meilin Liu¹, Chongjun Wang²
1. Wright State University, Department of Computer Science and Engineering
2. Nanjing University, Department of Computer Science and Technology
Introduction
Condition 1 & 3 and condition 2 are formalized as constraints (4) and (5).
• Propose a programmable warp scheduler.
• Estimate the best size of the high priority warp group with a compiler framework to avoid L1 cache thrashing.
• Develop a compiler framework that analyzes the L1 cache footprint of a given GPU program using an extended polyhedron model and inserts scheduler control instructions at the proper locations.
• The hardware overhead of our programmable warp scheduler is small.
-w, w’ : warp ID of M and M’ Cache block condition:
(6)
- gdimy, gdimx: grid size
- bdimy, bdimx: block size
- end0, end1, …: loop boundaries
Our Contributions
• Develop the hardware module of the programmable warp scheduler for GPUs
• Add two PTX instructions to control the behavior of the warp scheduler
• Propose an extended polyhedron model for GPU programs
• Develop a compiler framework based on the extended polyhedron model that automatically inserts scheduler control instructions into GPU programs
• Evaluate the performance of our programmable warp scheduler on GPGPU-sim
Motivation Example
-β,β’ : cache block index -γ, γ’: offset inside cache block
The number of cache blocks occupied by an m-dimensional array access can be estimated with (7), where
- E: the polyhedron defined by constraints (3), (4), (5), and (6)
- F: the polyhedron defined by constraints (3), (4), and (5)
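The count in (7) can be approximated by direct enumeration. The sketch below is a hypothetical stand-in for the polyhedron-based count: it assumes a row-major float32 array accessed as A[tidy][tidx], 32-thread warps, and 128-byte cache lines, and counts the distinct cache blocks touched when the θ high-priority warps issue the same load simultaneously:

```python
# Estimate O_theta: the number of distinct L1 cache blocks touched by the
# theta high-priority warps issuing the same load simultaneously.
# Hypothetical access pattern: A[tidy][tidx], row-major float32 array.
LINE_SIZE = 128     # bytes per cache line (assumed)
WARP_SIZE = 32      # threads per warp
ELEM_SIZE = 4       # sizeof(float)

def footprint(theta, bdimx, row_bytes):
    """Count distinct cache blocks touched by warps 0..theta-1."""
    blocks = set()
    for w in range(theta):                  # condition 2: high-priority group
        for lane in range(WARP_SIZE):       # all lanes issue simultaneously
            tid = w * WARP_SIZE + lane
            tidy, tidx = divmod(tid, bdimx)
            addr = tidy * row_bytes + tidx * ELEM_SIZE
            blocks.add(addr // LINE_SIZE)   # beta: cache block index
    return len(blocks)

# One warp covering a 32-float row segment touches one 128-byte line.
print(footprint(1, 32, 1024))  # -> 1
print(footprint(4, 32, 1024))  # -> 4
```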
With Oθ obtained, we initialize θ to 1 and repeatedly increase it by 1 until Oθ exceeds the total number of cache blocks in the L1 cache; the last value of θ whose footprint still fits is recorded as the best high priority warp group size. The estimated best high priority warp group size for the 'conv' benchmark is reported in Figure 1.
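The search described above can be sketched as a simple loop. The footprint function passed in is a hypothetical stand-in for the polyhedron-based count Oθ, and the cache-block total is an assumed configuration value:

```python
# Find the largest high-priority group size theta whose estimated
# L1 footprint O_theta still fits in the cache.
NUM_CACHE_BLOCKS = 128   # e.g. 16 KB L1 with 128-byte lines (assumed)

def best_group_size(footprint, num_blocks=NUM_CACHE_BLOCKS):
    """footprint(theta) -> estimated number of cache blocks O_theta."""
    theta = 1
    while footprint(theta + 1) <= num_blocks:  # grow until O_theta would overflow L1
        theta += 1
    return theta

# Hypothetical stand-in footprint: each extra warp adds 8 cache blocks.
print(best_group_size(lambda t: 8 * t))  # -> 16
```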
Evaluation
Test platform: GPGPU-sim, developed by Bakhoda et al. [1]. The compiler framework is based on the polyhedral compiler framework PLUTO [2].
Group size estimation accuracy:
Figure 1. Performance vs. group size. (The simulation results are obtained by running the 2D convolution program on the GPGPU simulator configured with 8K, 16K, and 32K L1 caches.)
Due to the performance cliff phenomenon, it is important to select the best high priority group size.
Compiler Assisted Programmable Warp Scheduler
Table 1. Estimation accuracy

The performance of our proposed programmable warp scheduler (PWS) is compared with four existing warp scheduling techniques: loose round-robin (LRR), two-level [4], greedy-then-oldest (GTO) [3], and CCWS [5], on selected benchmark GPU kernels.
In the extended polyhedron model, an instance of a statement nested in an n-level loop can be represented by its execution vector

e = (bidy, bidx, tidy, tidx, i0, i1, …)^T   (1)

where bidy and bidx are block indexes, tidy and tidx are thread indexes, and i0, i1, … are loop indexes. An m-dimensional array access can be represented as a memory access vector

M = (α1, α2, …, αm)^T   (2)

where α1, α2, … are linear combinations of the elements of e, and the components of e are bounded by the grid size, block size, and loop boundaries (3).

Definition: When self-eviction occurs, the two memory accesses M and M′ must be concurrent memory accesses that meet the following conditions:
1. M and M′ are issued by the same statement.
2. M and M′ are issued by warps located in the high priority warp group.
3. M and M′ are issued simultaneously.
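To make the vectors concrete, the sketch below checks the three self-eviction conditions for a pair of execution vectors of the same statement. It assumes 32-thread warps, row-major thread numbering within a block, and (as a simplifying assumption) that the high priority group consists of warps 0..θ-1:

```python
WARP_SIZE = 32  # threads per warp (typical for NVIDIA GPUs)

def warp_id(tidy, tidx, bdimx):
    """Warp ID of a thread within its block (row-major thread numbering)."""
    return (tidy * bdimx + tidx) // WARP_SIZE

def concurrent_pair(e1, e2, bdimx, theta):
    """Check conditions 2 and 3 for two execution vectors
    e = (bidy, bidx, tidy, tidx, i0, i1, ...) of the SAME statement
    (condition 1 is assumed to hold by construction)."""
    w1 = warp_id(e1[2], e1[3], bdimx)
    w2 = warp_id(e2[2], e2[3], bdimx)
    in_group = w1 < theta and w2 < theta   # condition 2 (assumed group: warps 0..theta-1)
    simultaneous = e1[4:] == e2[4:]        # condition 3: same loop iteration
    return in_group and simultaneous and w1 != w2

# Two threads in different warps of the same block, same iteration i0 = 5:
e_a = (0, 0, 0, 0, 5)
e_b = (0, 0, 1, 0, 5)   # tidy = 1 -> a different warp when bdimx = 32
print(concurrent_pair(e_a, e_b, bdimx=32, theta=4))  # True
```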
Figure 2. Speedups
References
[1] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. ISPASS 2009.
[2] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. PLDI '08.
[3] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. SIGARCH Comput. Archit. News, 39(3):235–246, June 2011.
[4] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. MICRO-44.
[5] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. MICRO-45, pages 72–83, 2012.