A Hardware and Compiler Assisted Programmable Warp Scheduler for GPGPUs
Lifeng Liu¹, Meilin Liu¹, Chongjun Wang²
1. Wright State University, Department of Computer Science and Engineering
2. Nanjing University, Department of Computer Science and Technology
Introduction
Condition 1 & 3 and condition 2 are formalized as constraints (4) and (5).
• Propose a programmable warp scheduler.
• Estimate the best size of the high priority warp group with a compiler framework to avoid L1 cache thrashing.
• Develop a compiler framework that analyzes the L1 cache footprint of a given GPU program using an extended polyhedron model and inserts scheduler control instructions at the proper locations.
• The hardware overhead of our programmable warp scheduler is small.
-w, w’ : warp ID of M and M’ Cache block condition:
(6)
- gdimy, gdimx: grid size
- bdimy, bdimx: block size
- end0, end1, …: loop boundaries
Our Contributions
• Develop the hardware module of the programmable warp scheduler for GPUs
• Add two PTX instructions to control the behavior of the warp scheduler
• Propose an extended polyhedron model for GPU programs
• Develop a compiler framework based on the extended polyhedron model that automatically inserts scheduler control instructions into GPU programs
• Evaluate the performance of our programmable warp scheduler on GPGPU-sim
Motivation Example
-β,β’ : cache block index -γ, γ’: offset inside cache block
The number of cache blocks occupied by an m-dimensional array access can be estimated with (7), where
- E: the polyhedron defined by constraints (3), (4), (5), and (6)
- F: the polyhedron defined by constraints (3), (4), and (5)
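The count in (7) can be approximated by direct enumeration. The sketch below is a hypothetical stand-in for the polyhedron-based count: it assumes a row-major float32 array accessed as A[tidy][tidx], 32-thread warps, and 128-byte cache lines, and counts the distinct cache blocks touched when the θ high-priority warps issue the same load simultaneously:

```python
# Estimate O_theta: the number of distinct L1 cache blocks touched by the
# theta high-priority warps issuing the same load simultaneously.
# Hypothetical access pattern: A[tidy][tidx], row-major float32 array.
LINE_SIZE = 128     # bytes per cache line (assumed)
WARP_SIZE = 32      # threads per warp
ELEM_SIZE = 4       # sizeof(float)

def footprint(theta, bdimx, row_bytes):
    """Count distinct cache blocks touched by warps 0..theta-1."""
    blocks = set()
    for w in range(theta):                  # condition 2: high-priority group
        for lane in range(WARP_SIZE):       # all lanes issue simultaneously
            tid = w * WARP_SIZE + lane
            tidy, tidx = divmod(tid, bdimx)
            addr = tidy * row_bytes + tidx * ELEM_SIZE
            blocks.add(addr // LINE_SIZE)   # beta: cache block index
    return len(blocks)

# One warp covering a 32-float row segment touches one 128-byte line.
print(footprint(1, 32, 1024))  # -> 1
print(footprint(4, 32, 1024))  # -> 4
```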
With Oθ obtained, we initialize θ to 1 and repeatedly increase it by 1 until Oθ exceeds the total number of cache blocks in the L1 cache; the last value of θ whose footprint still fits is recorded as the best high priority warp group size. The estimated best high priority warp group size for the 'conv' benchmark is reported in Figure 1.
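The search described above can be sketched as a simple loop. The footprint function passed in is a hypothetical stand-in for the polyhedron-based count Oθ, and the cache-block total is an assumed configuration value:

```python
# Find the largest high-priority group size theta whose estimated
# L1 footprint O_theta still fits in the cache.
NUM_CACHE_BLOCKS = 128   # e.g. 16 KB L1 with 128-byte lines (assumed)

def best_group_size(footprint, num_blocks=NUM_CACHE_BLOCKS):
    """footprint(theta) -> estimated number of cache blocks O_theta."""
    theta = 1
    while footprint(theta + 1) <= num_blocks:  # grow until O_theta would overflow L1
        theta += 1
    return theta

# Hypothetical stand-in footprint: each extra warp adds 8 cache blocks.
print(best_group_size(lambda t: 8 * t))  # -> 16
```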
Evaluation
Test platform: GPGPU-sim, developed by Bakhoda et al. [1]. The compiler framework is based on the polyhedral compiler framework PLUTO [2].
Group size estimation accuracy:
Figure 1. Performance vs. group size. (The simulation results are obtained by running the 2D convolution program on the GPGPU simulator configured with 8K, 16K, and 32K L1 caches.)
Due to the performance cliff phenomenon, it is important to select the best high priority group size.
Compiler Assisted Programmable Warp Scheduler
Table 1. Estimation accuracy

The performance of our proposed programmable warp scheduler (PWS) is compared with four existing warp scheduling techniques: loose round-robin (LRR), two-level [4], greedy-then-oldest (GTO) [3], and CCWS [5], on selected benchmark GPU kernels.
In the extended polyhedron model, an instance of a statement nested in an n-level loop can be represented by its execution vector

e = (bidy, bidx, tidy, tidx, i0, i1, …)^T   (1)

where bidy and bidx are block indexes, tidy and tidx are thread indexes, and i0, i1, … are loop indexes. An m-dimensional array access can be represented as a memory access vector

M = (α1, α2, …, αm)^T   (2)

where α1, α2, … are linear combinations of the elements of e, and the components of e are bounded by the grid size, block size, and loop boundaries (3).

Definition: When self-eviction occurs, the two memory accesses M and M′ must be concurrent memory accesses that meet the following conditions:
1. M and M′ are issued by the same statement.
2. M and M′ are issued by warps located in the high priority warp group.
3. M and M′ are issued simultaneously.
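To make the vectors concrete, the sketch below checks the three self-eviction conditions for a pair of execution vectors of the same statement. It assumes 32-thread warps, row-major thread numbering within a block, and (as a simplifying assumption) that the high priority group consists of warps 0..θ-1:

```python
WARP_SIZE = 32  # threads per warp (typical for NVIDIA GPUs)

def warp_id(tidy, tidx, bdimx):
    """Warp ID of a thread within its block (row-major thread numbering)."""
    return (tidy * bdimx + tidx) // WARP_SIZE

def concurrent_pair(e1, e2, bdimx, theta):
    """Check conditions 2 and 3 for two execution vectors
    e = (bidy, bidx, tidy, tidx, i0, i1, ...) of the SAME statement
    (condition 1 is assumed to hold by construction)."""
    w1 = warp_id(e1[2], e1[3], bdimx)
    w2 = warp_id(e2[2], e2[3], bdimx)
    in_group = w1 < theta and w2 < theta   # condition 2 (assumed group: warps 0..theta-1)
    simultaneous = e1[4:] == e2[4:]        # condition 3: same loop iteration
    return in_group and simultaneous and w1 != w2

# Two threads in different warps of the same block, same iteration i0 = 5:
e_a = (0, 0, 0, 0, 5)
e_b = (0, 0, 1, 0, 5)   # tidy = 1 -> a different warp when bdimx = 32
print(concurrent_pair(e_a, e_b, bdimx=32, theta=4))  # True
```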
Figure 2. Speedups
References
[1] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. ISPASS 2009.
[2] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. PLDI '08.
[3] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. SIGARCH Comput. Archit. News, 39(3):235–246, June 2011.
[4] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving GPU performance via large warps and two-level warp scheduling. MICRO-44.
[5] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. MICRO-45, pages 72–83, 2012.