Inter-Block Scoreboard Scheduling in a JIT Compiler for VLIW Processors

Benoît Dupont de Dinechin
Research & Development Responsible
STS Compilation Expertise Center
STMicroelectronics Grenoble (France)
[email protected]

Inter-Block Scoreboard Scheduling

Presentation Outline

• JIT for VLIW Motivations
• Scheduling Background
• Scoreboard Scheduler
• Inter-Block Scheduling
• Experimental Results
• Summary and Conclusions

4th HiPEAC Industrial Workshop – November 26th 2007


JIT for VLIW Motivations

Systems-On-Chip at STMicroelectronics
• STMicroelectronics SoC primary markets:
  – mobile devices (phones)
  – consumer electronics (set-top boxes, car entertainment)
• STMicroelectronics SoCs typically comprise:
  – Host processors: ARM family, ST40/SH4 processors
  – Application processors: DSPs, VLIW-Media (ST200 family)
  – Dedicated or reconfigurable hardware blocks
• Problem: how to expose application processors to third-party application developers beyond firmware APIs?
STMicroelectronics is investigating the CLI program representation


Microsoft .NET Common Language Infrastructure (CLI)
• The Microsoft Common Language Infrastructure (CLI) is the open ECMA standard that supports .NET
• Unlike Java, the CLI representation supports a wide variety of languages, including C
• MS Visual Studio 2003 generates CLI from C; MS Visual Studio 2005, however, generates CLI from C++ (C is no longer supported)
• Active open-source community, with GNU Portable.NET (pnet) focused on C and Mono focused on C#
• Both projects provide a VEE with interpreter, various JITs, compilers and utilities; however, they lack a C compiler
Media-processing code is written in C/C++ and can be compiled to CLI


CLI Just-In-Time Compilation at STMicroelectronics
• STMicroelectronics developed a GCC4-based C to CLI compiler (gcc/st/cli)
  → warm reception from GCC4 researchers (INRIA) and from the Mono project, who need a C compiler
• Between 2005 and 2007, STMicroelectronics developed a series of JITs for the ST231 VLIW and the ARM
  → about half the performance (geometric mean) of the best static compilation on media processing
• CLI executables are processor-neutral, so binding to a processor can be delayed and a single binary is flashed
• Same C to CLI compilation chain for all processors in the SoC
  → enables third-party developers to program the application processors (ST240 VLIW), not just the host processors (ARM)
• Expect an excellent fit with Software Component frameworks: CLI software components could be optimized after being loaded


The ST200 VLIW Media Family (ST210, ST220, ST231, ST240)

• 4-issue VLIW from the Lx architecture, Faraboschi et al. [ISCA'00]
• Partially predicated with SELECT operations (Fisher-style VLIW)


JIT for VLIW Challenges Addressed in this Work
• Instruction scheduling is a key optimization of VLIW code generation; however, it is time- and memory-expensive
• Java-centric JIT compilation either omits prepass scheduling, or restricts it to a few code paths (IBM Testarossa for the IBM zSeries 990 and the POWER4 processors)
• JIT compilation of CLI media processing programs exposes more instruction-level parallelism than Java JIT compilation
• VLIW processors that lack interlocking hardware require hazard-free postpass scheduling across all program paths
• Expensive prepass schedules, including software pipelines, should not be destroyed by postpass scheduling
Inter-Block Scoreboard Scheduling for postpass VLIW scheduling


Scheduling Background

Dependence Graph
[Figure: dependence graph over tasks 0 to 5]

Semi-Active Schedule
[Figure: Gantt chart of a semi-active schedule of tasks 1 to 5]
• To complete any task earlier, the execution sequence must change
• Produced by any 'greedy scheduler' (no unenforced idling)


Active Schedule
[Figure: Gantt chart of an active schedule of tasks 1 to 5]
• To complete any task earlier, another task must be delayed
• Produced by any 'operation scheduler' (serial SGS)

Non-Delay Schedule
[Figure: Gantt chart of a non-delay schedule of tasks 1 to 5]
• No execution resource is left idle if a task can start executing
• Produced by any 'cycle scheduler' (parallel SGS)


List Scheduling Properties
Cycle Scheduler: scan time slots in increasing order
• If two tasks may be scheduled at the same slot, priority breaks ties
Operation Scheduler: schedule tasks in priority order
• The priority order must be a topological sort of the dependences
Cost Breakdown of List Scheduling
• Build the dependence graph: O(n^2)
• Maintain and prioritize the dependence-ready tasks: O(n log n)
• Check resource conflicts and schedule tasks: O(nmp)
With n the number of tasks, m the number of execution resources, p the maximum processing time of tasks
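The two list scheduling flavors can be contrasted on a toy instance. The sketch below is an illustration only, assuming m identical resources and dependence latencies equal to processing times; the instance (TASKS, PRIORITY, M) and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Task = Tuple[int, List[str]]  # (processing time, predecessors)

# Hypothetical instance: four tasks on two identical resources.
TASKS: Dict[str, Task] = {
    "a": (1, []),
    "b": (2, ["a"]),
    "c": (2, ["a"]),
    "d": (3, []),
}
PRIORITY = ["a", "b", "c", "d"]  # a topological sort of the dependences
M = 2


def _free(usage: Dict[int, int], start: int, duration: int, m: int) -> bool:
    """True if fewer than m resources are busy in every slot of the interval."""
    return all(usage.get(start + k, 0) < m for k in range(duration))


def _reserve(usage: Dict[int, int], start: int, duration: int) -> None:
    for k in range(duration):
        usage[start + k] = usage.get(start + k, 0) + 1


def operation_schedule(tasks, priority, m):
    """Serial SGS: take tasks in priority order, place each at its earliest feasible date."""
    usage, start = {}, {}
    for t in priority:
        duration, preds = tasks[t]
        date = max((start[p] + tasks[p][0] for p in preds), default=0)
        while not _free(usage, date, duration, m):
            date += 1
        _reserve(usage, date, duration)
        start[t] = date
    return start


def cycle_schedule(tasks, priority, m):
    """Parallel SGS: scan dates in increasing order, start every ready task that fits."""
    usage, start = {}, {}
    date = 0
    while len(start) < len(tasks):
        for t in priority:  # priority only breaks ties within a time slot
            if t in start:
                continue
            duration, preds = tasks[t]
            ready = all(p in start and start[p] + tasks[p][0] <= date for p in preds)
            if ready and _free(usage, date, duration, m):
                _reserve(usage, date, duration)
                start[t] = date
        date += 1
    return start


print(operation_schedule(TASKS, PRIORITY, M))
print(cycle_schedule(TASKS, PRIORITY, M))
```

On this instance the operation scheduler starts d only at date 3, leaving one resource idle at date 0 although d could start there, whereas the cycle scheduler starts d at date 0 and delays c to date 3.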


Applications to Instruction Scheduling
• The notions of Semi-Active, Active and Non-Delay schedules apply to instruction scheduling provided the reservation tables are monotonic:
  – the entries of each reservation table row are non-increasing with time
• It is known that Non-Delay ⊂ Active ⊂ Semi-Active, and that Non-Delay may exclude all optimal solutions (unlike Active)
• With single-cycle reservation tables, Active schedules are Non-Delay schedules and the reservation tables are monotonic
• Classic instruction schedulers [Muchnick 1997] are cycle-based → Non-Delay schedules
• Classic modulo schedulers [Rau 1995, Ruttenberg 1996] are operation-based (with backtracking) → Active schedules
Prepass instruction schedules and modulo schedules are Active
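The monotonicity condition on reservation tables is easy to state in code. The following is a minimal sketch, assuming a reservation table is represented as a mapping from resource name to the units used in cycles 0, 1, 2, ... after issue; the three example tables and their resource names are hypothetical.

```python
from typing import Dict, List

# Hypothetical reservation tables: resource -> units used in cycle 0, 1, 2, ... after issue.
SINGLE_CYCLE = {"ISSUE": [1], "MEM": [1]}
PIPELINED = {"ISSUE": [1, 0, 0], "MUL": [1, 1, 0]}
LATE_USE = {"ISSUE": [1, 0], "WRITEBACK": [0, 1]}  # resource first needed one cycle after issue


def is_monotonic(table: Dict[str, List[int]]) -> bool:
    """True if the entries of every row are non-increasing with time."""
    return all(a >= b for row in table.values() for a, b in zip(row, row[1:]))


for name, table in [("single-cycle", SINGLE_CYCLE), ("pipelined", PIPELINED), ("late use", LATE_USE)]:
    print(name, is_monotonic(table))
```

A single-cycle table has one-entry rows, so the check holds trivially, in line with the third bullet above.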


Scoreboard Scheduler

Scoreboard Scheduler Principles
Emulate an OOO (out-of-order) superscalar processor without register renaming:
• An operation is always scheduled within a moving time window [current date, current date + window size]
• The execution resources are tracked by a resource table:
  – a sliding window into the schedule reservation table
• The dependences are maintained using two action arrays:
  – RAW, WAR and WAW latencies are encoded with only two arrays
  – access actions and write actions record the last time a dependence resource was accessed or written
  – a dependence resource is a (sub)register, or Control, Memory or Volatile
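The principles above can be sketched as a small class. This is a simplified illustration, not the scheduler of the talk: it assumes single-cycle reservation tables, a zero-latency WAR rule, a WAW rule that only forces the new write to land after the previous one, and it never discards table entries that fall behind the window. The class name, method names, latencies (load 3, ALU 1) and capacities (one memory unit) are assumptions.

```python
from collections import defaultdict


class Scoreboard:
    """Toy scoreboard scheduler: one resource table plus two action arrays."""

    def __init__(self, window_size, capacity):
        self.window_size = window_size
        self.capacity = capacity        # resource -> units available per cycle
        self.current = 0                # first date of the moving window
        self.table = defaultdict(lambda: defaultdict(int))  # date -> resource -> units used
        self.access = defaultdict(int)  # dependence resource -> last date it was read
        self.write = defaultdict(int)   # dependence resource -> date its last write lands

    def _fits(self, date, uses):
        return all(self.table[date][r] + n <= self.capacity[r] for r, n in uses.items())

    def schedule(self, reads, writes, uses, latency):
        """Return the issue date chosen for one operation."""
        date = self.current
        for r in reads:
            date = max(date, self.write[r])                 # RAW
        for r in writes:
            date = max(date, self.access[r],                # WAR (assumed zero latency)
                       self.write[r] - latency + 1)         # WAW (assumed rule)
        while not self._fits(date, uses):                   # first date with free resources
            date += 1
        if date > self.current + self.window_size:          # slide the window forward
            self.current = date - self.window_size
        for r, n in uses.items():
            self.table[date][r] += n
        for r in reads:
            self.access[r] = max(self.access[r], date)
        for r in writes:
            self.write[r] = max(self.write[r], date + latency)
        return date


sb = Scoreboard(window_size=5, capacity={"ISSUE": 4, "MEM": 1})
dates = [
    sb.schedule(["$r16"], ["$r23"], {"ISSUE": 1, "MEM": 1}, latency=3),  # ldb $r23 = 9[$r16]
    sb.schedule(["$r18"], ["$r18"], {"ISSUE": 1}, latency=1),            # add $r18 = $r18, -12
    sb.schedule(["$r15", "$r23"], ["$r15"], {"ISSUE": 1}, latency=1),    # hypothetical consumer of $r23
    sb.schedule(["$r16"], ["$r24"], {"ISSUE": 1, "MEM": 1}, latency=3),  # hypothetical second load
]
print(dates)
```

The last operation is placed at an earlier date than the one before it, which is the out-of-order behavior the moving window allows; no dependence graph or priority queue is built.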


Scoreboard Scheduler Example 1 (window size = 5)
[Scoreboard state after scheduling each operation; entries are shown left-aligned from column +0]

issue=0   ldb $r23 = 9[$r16]
current=0 | +0 +1 +2 +3 +4 +5 +6 +7
----------|------------------------
ISSUE     | 1
MEM       | 1
CTL       |
ODD       |
EVEN      |
----------|------------------------
Control   | a
$r16      | a
          |
$r23      | aw aw w  w  w  w
Memory    | a  a

issue=0   add $r18 = $r18, -12
current=0 | +0 +1 +2 +3 +4 +5 +6 +7
----------|------------------------
ISSUE     | 2
MEM       | 1
CTL       |
ODD       |
EVEN      |
----------|------------------------
Control   | a
$r16      | a
$r18      | aw aw w  w
$r23      | aw aw w  w  w  w
Memory    | a  a


Scoreboard Scheduler Example 2 (window size = 5)
[Scoreboard state after scheduling each operation; entries are shown left-aligned from column +0]

issue=4   shl $r24 = $r24, 24
current=0 | +0 +1 +2 +3 +4 +5 +6 +7
----------|------------------------
ISSUE     | 3  2  1  3  1
MEM       | 1  1  1  1
CTL       |
ODD       |
EVEN      |
----------|------------------------
Control   | a  a  a  a  a
$b3       | aw aw aw w  w
          |
$r16      | aw aw aw aw aw w  w
$r18      | aw aw w  w
$r19      | aw aw w  w
$r20      | aw aw aw aw w  w  w  w
$r23      | aw aw aw aw aw w  w
$r24      | aw aw aw aw aw aw w  w
Memory    | a  a  a  a  a

issue=5   add $r15 = $r15, $r24
current=1 | +0 +1 +2 +3 +4 +5 +6 +7
----------|------------------------
ISSUE     | 2  1  3  1  1
MEM       | 1  1  1
CTL       |
ODD       |
EVEN      |
----------|------------------------
Control   | a  a  a  a  a
$b3       | aw aw w  w
$r15      | aw aw aw aw aw aw w  w
$r16      | aw aw aw aw w  w
$r18      | aw w  w
$r19      | aw w  w
$r20      | aw aw aw w  w  w  w
$r23      | aw aw aw aw w  w
$r24      | aw aw aw aw aw w  w
Memory    | a  a  a  a


Scoreboard Scheduling Properties
Efficient implementation compared to list schedulers:
• O(n) maintenance of the action arrays instead of O(n^2) dependence graph construction
• The number of resource availability checks is bounded by window size + maximum columns count
• No priority management, which saves the O(n log n) contribution
Definition 1. A fixed-point schedule is a schedule that is invariant when used as a priority list in an operation scheduler.
Theorem 1. Scoreboard scheduling a fixed-point schedule yields the same schedule.
Corollary 1. Scoreboard scheduling an active schedule yields the same schedule.


Inter-Block Scheduling

Inter-Region Scheduling
Definition 2. The inter-region scheduling problem is to schedule each scheduling region such that the resource and dependence constraints inherited from the region's (transitive) predecessors, possibly including itself, are satisfied.
• Inter-region scheduling does not move code between regions
• The basic technique is NOP padding at region boundaries (Open64)
• Meld Scheduling [Abraham 1996] is a prepass inter-region scheduling technique demonstrated on superblocks
• When the scheduling regions are reduced to basic blocks, we call it the inter-block scheduling problem
We solve postpass inter-block scheduling as a data-flow problem


Inter-Block Scheduling Data-Flow Analysis
A forward data-flow problem solved by a work-list algorithm:
Propagated Facts: the scoreboard scheduler states at the start and at the end of each basic block, and the operation issue dates relative to the start of their basic block
Transfer Function: scoreboard schedule the operations in the order of their previous issue dates; moreover, the issue date of any given operation cannot decrease
Meet Function: translate time to nullify the scoreboard scheduler current dates at the end of the predecessor basic blocks, then take the maximum of the resource table and action array entries
Theorem 2. The proposed data-flow analysis formulation for inter-block scheduling is monotone.
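The shape of this analysis can be sketched on a much reduced model. The code below is an assumption-laden illustration, not the formulation of the talk: the propagated state keeps only pending write latencies (no resource table), operations issue in order within a block (so program order coincides with the order of previous issue dates), and issue width is unlimited. The functions meet, transfer and inter_block_schedule, the control-flow graph and the latencies are all hypothetical.

```python
from collections import deque

# An operation is (registers read, registers written, latency).
# A state maps a register to the number of cycles, counted from the block
# boundary, until its pending write lands.


def meet(states):
    """Entry-wise maximum of predecessor end states (already translated to date 0)."""
    merged = {}
    for state in states:
        for reg, t in state.items():
            merged[reg] = max(merged.get(reg, 0), t)
    return merged


def transfer(ops, in_state, prev_dates):
    """Reschedule a block; issue dates never decrease from one visit to the next."""
    ready = dict(in_state)
    dates, date = [], 0
    for (reads, writes, latency), prev in zip(ops, prev_dates):
        date = max([date, prev] + [ready.get(r, 0) for r in reads])
        for w in writes:
            ready[w] = max(ready.get(w, 0), date + latency)
        dates.append(date)
    length = dates[-1] + 1 if dates else 0
    # Translate time so that the end of the block becomes date 0.
    out_state = {reg: t - length for reg, t in ready.items() if t > length}
    return dates, out_state


def inter_block_schedule(blocks, succs):
    preds = {b: [] for b in blocks}
    for b, ss in succs.items():
        for s in ss:
            preds[s].append(b)
    dates = {b: [0] * len(ops) for b, ops in blocks.items()}
    out = {b: {} for b in blocks}
    work = deque(blocks)
    while work:
        b = work.popleft()
        new_dates, new_out = transfer(blocks[b], meet(out[p] for p in preds[b]), dates[b])
        if new_dates != dates[b] or new_out != out[b]:
            dates[b], out[b] = new_dates, new_out
            for s in succs[b]:
                if s not in work:
                    work.append(s)
    return dates


# Hypothetical CFG: pre-header -> loop (self edge) -> exit.
BLOCKS = {
    "pre": [([], ["r1"], 3)],                                # load r1
    "loop": [(["r1"], ["r2"], 1), ([], ["r1"], 3)],          # r2 = f(r1); load r1
    "exit": [(["r1", "r2"], ["r3"], 1)],                     # r3 = g(r1, r2)
}
SUCCS = {"pre": ["loop"], "loop": ["loop", "exit"], "exit": []}
result = inter_block_schedule(BLOCKS, SUCCS)
print(result)
```

In this toy run the load latency pending at the end of the pre-header and of the loop body pushes the first operation of "loop" and of "exit" to date 2, which corresponds to the delay a hazard-free schedule must insert on a non-interlocked processor.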


Data-Flow Convergence and Fix-Points
Theorem 3. The inter-block scoreboard scheduling data-flow analysis converges in bounded time.
This holds thanks to the no-idling property of scoreboard scheduling.
Theorem 4. Any active schedule that obeys the inter-block scheduling constraints is a fixed-point of the inter-block scoreboard scheduling data-flow analysis.
Use of NOP Padding to Improve Fix-Points
• Loop header schedules may suffer damaging effects from long pending dependence latencies originating in loop pre-headers
• This is addressed by pre-padding the low-frequency paths that merge into a high-frequency path, as identified by the mutual-most-likely rule of Trace Scheduling


Experimental Results

The STMicroelectronics CLI-JIT
Expression Trees: the CLI expressions of the evaluation stack are typed and converted to a tree form.
Instruction Selection: machine-level instructions are generated and the ABI conventions are implemented.
SSA Construction, SSA Destruction: coalesce register copies and ensure ISA register operand constraints.
Register Allocation: linear-scan register allocation [Wimmer 2005].
Postpass Scheduling: inter-block scoreboard scheduling.
Instruction Encoding: encode instructions, match bundle templates, encode instruction groups into bundles.
Create & Patch Code: emit code, trampolines and relay jumps.


STMicroelectronics CLI-JIT compilation time breakdown


Summary and Conclusions

A new postpass instruction scheduling technique addresses JIT compilation for VLIW processors:
• Low postpass scheduling time compared to list scheduling, thanks to the emulation of an OOO-like dynamic scheduler
• Produces correct schedules across all program paths → a requirement for non-interlocked VLIW processors
• Does not change active schedules → does not change prepass schedules and software pipelines that are active
• Fully implemented in the STMicroelectronics CLI JIT compiler for ST200 VLIW processors and ARM processors (V5e, V6)
Future work: compute scheduling priorities for non-postpass scheduled code by reverse scoreboard scheduling.
