Inter-Block Scoreboard Scheduling in a JIT Compiler for VLIW Processors
Benoît Dupont de Dinechin, Research & Development Responsible, Software, Tools and Services (STS), STMicroelectronics, Grenoble (France)
Presentation Outline
• JIT for Media Processing
• Classic List Scheduling
• Scoreboard Scheduling
• Inter-Block Scheduling
• ST200 VLIW Experiments
• Observations and Conclusions
Euro-Par 2008 – August 28th 2008
JIT for Media Processing
Systems-On-Chip (SoCs) at STMicroelectronics
• STMicroelectronics SoC(s) used in: consumer electronics (set-top boxes, car infotainment), telecoms infrastructure, mobile phones
• STMicroelectronics SoC(s) typically comprise:
  – Host processors: ARM family, ST40/SH4 processors
  – Application processors: DSPs, VLIW-Media (ST200 family)
  – Programmable hardware: processor with custom extensions, coarse-grained reconfigurable arrays (CGRA), GP-GPU
• By using a processor-neutral program representation, and AOT or JIT compilation, C / C++ media processing code may dispatch to different processors
⇒ need byte-code for C / C++ programs
The Microsoft .NET Common Language Infrastructure (CLI) standard
The ST200 VLIW Media Family (ST210, ST220, ST231, ST240)
• Lx architecture [ISCA'00], partial predication with SELECT
• 63 × 32-bit general registers, 8 × 1-bit branch registers
• Scheduled resources: 4×ISSUE, 1×MEM, 1×CTL, 2×ODD
JIT for Media Processing Post-Pass Scheduling Challenges
• Achieve efficiency (code quality) and speed (compilation time)
  – C media processing kernels expose significantly more instruction-level parallelism than Java applications
• Satisfy post-pass scheduling constraints along all program paths
  – Required on VLIW processors without interlocking hardware (MIPS ≡ Microprocessor without Interlocked Pipeline Stages)
• Preserve pre-pass region schedules that are still valid and that satisfy post-pass scheduling constraints at boundaries
  – Not only local pre-pass schedules, but also global pre-pass schedules including software pipelines (cyclic schedules)
Classic Approaches in Static and JIT Compilers
• Open64-based ST200 VLIW production compiler: post-pass schedule superblock regions, insert NOPs between regions to prevent scheduling hazards
• IBM Testarossa Java JIT compiler (zSeries 990 and POWER4): apply pre-pass and post-pass scheduling to a few code paths
Proposed Approach: Inter-Block Scoreboard Scheduling
• Scoreboard Scheduling is a restriction of classic Operation Scheduling that can be implemented efficiently
• Inter-Block Scheduling is an iterative scheduling constraint propagation reminiscent of forward data-flow analysis
• Combining these two techniques addresses all our “JIT for Media Processing Post-Pass Scheduling Challenges”
Classic List Scheduling
Sample Dependence Graph
Assume two execution units (scheduled resources) and 5 operations:
[Figure: sample dependence graph over operations O0 to O5]
The dependence graph contains a dummy operation O0
Critical-Path Scheduling Priorities
Defined as longest path from operation start to end of execution:

Operation        O1  O2  O3  O4  O5
Execution Time    1   2   1   2   1
Priority          4   2   3   2   1
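The critical-path priorities can be recomputed with a short recursion: the priority of an operation is its execution time plus the largest priority among its successors. The sketch below is only illustrative. The execution times come from the table, but the dependence edges (O1→O3, O3→O4, O1→O5) are an assumption chosen to be consistent with the listed priorities, since the original graph drawing is not available here. The names `exec_time`, `succs` and `priority` are hypothetical.

```python
from functools import lru_cache

# Execution times taken from the table.
exec_time = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}

# ASSUMPTION: these edges are not from the slides; they are one dependence
# graph that reproduces the listed priorities (the dummy O0 is omitted).
succs = {1: [3, 5], 2: [], 3: [4], 4: [], 5: []}

@lru_cache(maxsize=None)
def priority(op):
    """Longest path from the start of `op` to the end of execution."""
    return exec_time[op] + max((priority(s) for s in succs[op]), default=0)

priorities = {op: priority(op) for op in exec_time}
print(priorities)
```

Under the assumed edges, the computed values match the Priority row of the table.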
Cycle Scheduling (Graham List Scheduling)
• Schedule by non-decreasing time slot order
• At each time slot, try to schedule all the dependence-ready operations in priority order
[Figure: cycle schedule of operations O1 to O5 on the two execution units]
Cycle Scheduling produces 'Non-Delay Schedules'
• No execution resources are left idle if there exists an operation that could start executing
• Non-Delay Schedules may not contain optimal schedules (for Makespan, Max-Lateness, and other regular measures)
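A minimal sketch of Cycle Scheduling follows: time advances slot by slot, and at each slot the dependence-ready operations are tried in priority order on any free unit. It assumes identical, non-pipelined units (an operation holds its unit for its whole execution time) and reuses the sample execution times and priorities. The predecessor lists are an assumption consistent with those priorities, not the graph from the slides, and `cycle_schedule` is a hypothetical name.

```python
def cycle_schedule(exec_time, preds, priority, units):
    """Graham list scheduling: fill time slots in non-decreasing order."""
    start = {}
    busy_until = [0] * units          # date at which each unit becomes free
    t = 0
    while len(start) < len(exec_time):
        # Operations whose predecessors have all completed by time t.
        ready = [op for op in exec_time if op not in start and
                 all(p in start and start[p] + exec_time[p] <= t
                     for p in preds[op])]
        ready.sort(key=lambda op: (-priority[op], op))
        for op in ready:
            for u in range(units):
                if busy_until[u] <= t:        # unit u is idle at time t
                    start[op] = t
                    busy_until[u] = t + exec_time[op]
                    break
        t += 1
    return start

exec_time = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}
priority = {1: 4, 2: 2, 3: 3, 4: 2, 5: 1}
# ASSUMPTION: hypothetical dependences (O1->O3, O3->O4, O1->O5).
preds = {1: [], 2: [], 3: [1], 4: [3], 5: [1]}

cycle_dates = cycle_schedule(exec_time, preds, priority, units=2)
print(cycle_dates)
```

No unit is ever left idle while a ready operation exists, which is the non-delay property stated above.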
Operation Scheduling
• Consider operations in priority order, which must be a topological sort of the dependence graph
• Schedule an operation at the earliest time slot possible
[Figure: operation schedule of operations O1 to O5 on the two execution units]
Operation Scheduling produces 'Active Schedules'
• No operation can be completed earlier without delaying another operation
• Active Schedules contain Non-Delay Schedules and also optimal schedules (for Makespan, Max-Lateness, etc.)
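The following sketch illustrates Operation Scheduling: operations are taken one at a time in a priority order that is also a topological order, and each is placed at the earliest slot where its dependences and the resource table allow it. It assumes identical, non-pipelined units tracked as a per-slot usage count. The dependences are a hypothetical graph consistent with the sample priorities, and `operation_schedule` is an illustrative name.

```python
from collections import defaultdict

def operation_schedule(order, exec_time, preds, units):
    """Place each operation, in the given order, at its earliest feasible slot."""
    used = defaultdict(int)           # time slot -> number of busy units
    start = {}
    for op in order:
        # Earliest date allowed by the already scheduled predecessors.
        t = max((start[p] + exec_time[p] for p in preds[op]), default=0)
        # Delay until every slot the operation occupies has a free unit.
        while any(used[t + d] >= units for d in range(exec_time[op])):
            t += 1
        for d in range(exec_time[op]):
            used[t + d] += 1
        start[op] = t
    return start

exec_time = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}
# ASSUMPTION: hypothetical dependences (O1->O3, O3->O4, O1->O5).
preds = {1: [], 2: [], 3: [1], 4: [3], 5: [1]}
# Decreasing priority (4, 3, 2, 2, 1), which is a topological order here.
order = [1, 3, 2, 4, 5]

op_dates = operation_schedule(order, exec_time, preds, units=2)
print(op_dates)
```

Unlike Cycle Scheduling, the loop is over operations rather than time slots, so a later operation may fill a hole left earlier in the resource table.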
Cases of Unit Execution Time (Pipelined Execution)
• Cycle Scheduling computes the same schedules as Operation Scheduling
• Optimality proved for various shapes of dependence graph
• Classic Graham performance bound for $m$ resources: $2 - \frac{1}{m}$
• Performance bound for $k$ types of resources, $m_i$ units of resource $i$, and $z + 1$ maximum latency: $(k + 1) - \frac{1}{(z+1)\,\max_{0 \le i\,\dots}\,\dots}$
Scoreboard Scheduling
Operation Scheduling within a time window that never moves backward.
• Any operation is scheduled within a time window [window_start, window_start + window_size] of a constant number of time slots
• The window_start cannot decrease and is lazily increased
• Operation priority is the incoming order, initial or as computed by the pre-pass scheduler ⇒ direct reuse of pre-pass scheduling efforts
Properties of Scoreboard Scheduling
• Bound on the number of resource checks proportional to window_size
• Same schedules as Operation Scheduling for window_size → ∞
Corollary 1. Schedules produced by Operation Scheduling or Cycle Scheduling are invariant under Scoreboard Scheduling.
Scoreboard Scheduler Implementation (1)
• The SABLE Java optimizer scheduler [Verbrugge 2002] replaces the dependence graph by an array of dependence lists (ADL), indexed by dependence record r, and constructed in O(n) time
• A register dependence $\mathit{latency}_{i \to j}$ can be computed as follows:

RAW Dependence
  (a) $\mathit{latency}_{i \to j} \ge \mathit{write\_stage}[i][r] - \mathit{read\_stage}[j][r] + \mathit{RAW}[r]$
  (b) $\mathit{latency}_{i \to j} \ge \mathit{RAW}[r]$
WAW Dependence
  (c) $\mathit{latency}_{i \to j} \ge \mathit{write\_stage}[i][r] - \mathit{write\_stage}[j][r] + \mathit{WAW}[r]$
  (d) $\mathit{latency}_{i \to j} \ge \mathit{WAW}[r]$
WAR Dependence
  (e) $\mathit{latency}_{i \to j} \ge \mathit{read\_stage}[i][r] - \mathit{write\_stage}[j][r] + \mathit{WAR}[r]$
  (f) $\mathit{latency}_{i \to j} \ge \mathit{WAR}[r]$

• We omit explicit dependences and compute latencies as needed, based on the last access date access_actions[r] and the last write date write_actions[r] of each dependence record r
• We track scheduled resource uses with a classic resource table
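The three pairs of constraints share one shape: the stage at which operation i accesses record r, minus the stage at which operation j accesses it, plus a per-record constant, with the constant alone as a second lower bound. The sketch below reads each pair as two bounds that must both hold and returns the smallest latency satisfying them. That reading, the function name `dependence_latency`, and the stage numbers in the example are assumptions for illustration only.

```python
def dependence_latency(kind, stage_i, stage_j, constant):
    """Smallest latency i -> j satisfying both bounds of one dependence kind.

    kind      "RAW": stage_i = write_stage[i][r], stage_j = read_stage[j][r]
              "WAW": stage_i = write_stage[i][r], stage_j = write_stage[j][r]
              "WAR": stage_i = read_stage[i][r],  stage_j = write_stage[j][r]
    constant  RAW[r], WAW[r] or WAR[r] for the dependence record r
    """
    if kind not in ("RAW", "WAW", "WAR"):
        raise ValueError(f"unknown dependence kind: {kind}")
    return max(stage_i - stage_j + constant, constant)

# Hypothetical pipeline stages, not taken from any ST200 documentation.
raw = dependence_latency("RAW", stage_i=3, stage_j=1, constant=0)
waw = dependence_latency("WAW", stage_i=3, stage_j=3, constant=1)
war = dependence_latency("WAR", stage_i=1, stage_j=3, constant=0)
print(raw, waw, war)
```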
Scoreboard Scheduler Implementation (2)
try_schedule: Given an operation i, return the earliest dependence- and resource-feasible issue_date ≥ window_start; first, ensure:

Effect     Constraints
Read[r]    issue_date ≥ write_actions[r] − read_stage[i][r] + RAW[r]
Write[r]   issue_date ≥ write_actions[r] − write_stage[i][r] + WAW[r]
           issue_date ≥ access_actions[r]

then, increase issue_date while conflicts with the resource table

add_schedule: Schedule an operation i at issue_date returned by try_schedule; first, update access_actions and write_actions:

Effect     Updates
Read[r]    access_actions[r] ← max(access_actions[r], issue_date + WAR[r])
Write[r]   access_actions[r] ← max(access_actions[r], issue_date + WAW[r])
           write_actions[r] ← issue_date + write_stage[i][r]

window_start ← max(window_start, issue_date − window_size)
then, update the resource table by adding reservation_table[i]
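The two procedures can be sketched as a small class. This is an illustration under simplifying assumptions, not the implementation from the talk: RAW, WAW and WAR are single constants instead of per-record arrays, each operation reserves its resources only in its issue cycle, and the resource table is an unbounded dictionary rather than a structure limited to the window. The operation encoding (`reads`, `writes`, `reservation`) and the helper `run` are hypothetical.

```python
from collections import defaultdict

RAW, WAW, WAR = 0, 1, 0      # ASSUMPTION: hypothetical constants, same for all r

class Scoreboard:
    def __init__(self, window_size, capacity):
        self.window_size = window_size
        self.window_start = 0
        self.capacity = capacity                    # resource -> units per cycle
        self.access_actions = defaultdict(int)      # record -> last access date
        self.write_actions = defaultdict(int)       # record -> last write date
        self.resource_table = defaultdict(lambda: defaultdict(int))

    def _conflict(self, op, date):
        row = self.resource_table[date]
        return any(row[res] + n > self.capacity[res]
                   for res, n in op["reservation"].items())

    def try_schedule(self, op):
        date = self.window_start
        for r, stage in op["reads"].items():        # Read[r] constraint
            date = max(date, self.write_actions[r] - stage + RAW)
        for r, stage in op["writes"].items():       # Write[r] constraints
            date = max(date, self.write_actions[r] - stage + WAW,
                       self.access_actions[r])
        while self._conflict(op, date):
            date += 1
        return date

    def add_schedule(self, op, date):
        for r in op["reads"]:
            self.access_actions[r] = max(self.access_actions[r], date + WAR)
        for r, stage in op["writes"].items():
            self.access_actions[r] = max(self.access_actions[r], date + WAW)
            self.write_actions[r] = date + stage
        self.window_start = max(self.window_start, date - self.window_size)
        for res, n in op["reservation"].items():
            self.resource_table[date][res] += n

def run(ops, window_size):
    sb = Scoreboard(window_size, {"ISSUE": 2, "MEM": 1})
    dates = []
    for op in ops:                                  # incoming order is the priority
        date = sb.try_schedule(op)
        sb.add_schedule(op, date)
        dates.append(date)
    return dates

ops = [
    {"reads": {}, "writes": {"r1": 3}, "reservation": {"ISSUE": 1, "MEM": 1}},
    {"reads": {"r1": 0}, "writes": {"r2": 1}, "reservation": {"ISSUE": 1}},
    {"reads": {}, "writes": {"r3": 1}, "reservation": {"ISSUE": 1}},
]
wide = run(ops, window_size=4)
narrow = run(ops, window_size=2)
print(wide, narrow)
```

With the wider window the third operation is hoisted back to cycle 0, as Operation Scheduling would do. With the narrower window, window_start has already advanced to 1 when the third operation arrives, so it cannot be issued earlier than cycle 1.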
Inter-Block Scheduling
Scheduling of each basic block such that the resource and dependence constraints inherited from the predecessor basic blocks are also satisfied.
Inter-Block Scheduling Constraint Solver
Use a work-list algorithm to propagate the scoreboard scheduler states at the start and the end of each basic block until a fixed point is reached. For each basic block extracted from the work-list:
• From the start scoreboard scheduler state, scoreboard schedule operations in non-decreasing order of their previous issue dates ⇒ new issue dates and new end scoreboard scheduler state
• For all successors of this basic block, meet their start scoreboard scheduler state with the end scoreboard scheduler state just obtained ⇒ if start state changes, put successor on the work-list
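The work-list propagation can be illustrated with a deliberately reduced model. Here the scheduler state is only a map from register to the number of cycles after block entry before its value is available, resources and WAW/WAR effects are ignored, and issue dates are never allowed to decrease between passes. All names (`schedule_block`, `meet_max`, `inter_block_schedule`) and the two-block example are hypothetical; a real solver would carry the full scoreboard state.

```python
from collections import deque

def schedule_block(ops, start_state, prev_dates):
    """Reschedule one block; ops are (dst, srcs, latency) tuples."""
    order = sorted(range(len(ops)), key=lambda k: prev_dates[k])
    avail = dict(start_state)          # register -> date its value is available
    dates = list(prev_dates)
    for k in order:
        dst, srcs, latency = ops[k]
        # Wait for all sources, and never issue earlier than on the last pass.
        dates[k] = max([avail.get(s, 0) for s in srcs] + [prev_dates[k]])
        avail[dst] = dates[k] + latency
    length = max(dates) + 1 if dates else 0
    # End state: latencies still pending once the block has been left.
    end_state = {r: d - length for r, d in avail.items() if d > length}
    return dates, end_state

def meet_max(s, e):
    """Entry-wise maximum of two states."""
    return {r: max(s.get(r, 0), e.get(r, 0)) for r in set(s) | set(e)}

def inter_block_schedule(blocks, succs):
    start = {b: {} for b in blocks}
    dates = {b: [0] * len(blocks[b]) for b in blocks}
    worklist = deque(blocks)           # every block is scheduled at least once
    while worklist:
        b = worklist.popleft()
        dates[b], end_state = schedule_block(blocks[b], start[b], dates[b])
        for s in succs[b]:
            merged = meet_max(start[s], end_state)
            if merged != start[s]:     # start state changed: reschedule s
                start[s] = merged
                if s not in worklist:
                    worklist.append(s)
    return dates, start

# Block A loads r1 with latency 3; block B, which also loops on itself, uses r1.
blocks = {"A": [("r1", [], 3)], "B": [("r2", ["r1"], 1)]}
succs = {"A": ["B"], "B": ["B"]}
dates, start = inter_block_schedule(blocks, succs)
print(dates, start)
```

In this example the use of r1 in block B is pushed to cycle 2 of B, because two cycles of the load latency from block A are still pending at the entry of B.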
Scheduling Constraint Solver Meet Operator
Given the end scoreboard scheduler state e, the start scoreboard scheduler state s, and a delay along the control-flow edge:
• Elapse time so e's window_start reaches the issue date of the last operation plus one (zero if e's basic block is empty) plus delay
• Translate the time so that e's window_start becomes zero
• Update s by taking the maximum of e and s entries in the resource table, the access_actions and the write_actions
The Non-Decrease Rule of Inter-Block Scheduling
Non-Decrease Rule: The operation issue dates must not decrease when rescheduling a basic block.
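The three steps of the meet operator can be sketched on a dictionary-based state. This is one possible reading, stated as an assumption: "elapse and translate" is implemented as subtracting a shift from every date in e and discarding whatever falls before the new time zero, and an empty block contributes a shift equal to the edge delay only. The state layout and the name `meet` are hypothetical.

```python
def meet(e, s, last_issue_date, delay=0):
    """Merge end state e into start state s; return True if s changed.

    States are dicts with "access_actions" and "write_actions"
    (record -> date) and "resource_table" (date -> {resource: units}).
    last_issue_date is None when e's basic block is empty.
    """
    # Steps 1 and 2: elapse time, then translate so the new origin is zero.
    shift = (0 if last_issue_date is None else last_issue_date + 1) + delay
    changed = False
    # Step 3: entry-wise maximum on the dependence records.
    for key in ("access_actions", "write_actions"):
        for r, date in e[key].items():
            rel = max(0, date - shift)
            if rel > s[key].get(r, 0):
                s[key][r] = rel
                changed = True
    # Step 3, continued: entry-wise maximum on the resource table.
    for date, row in e["resource_table"].items():
        if date >= shift:              # reservations before the shift have elapsed
            srow = s["resource_table"].setdefault(date - shift, {})
            for res, n in row.items():
                if n > srow.get(res, 0):
                    srow[res] = n
                    changed = True
    return changed

# Hypothetical end state of a block whose last operation issued at date 0.
e = {"access_actions": {"r1": 1}, "write_actions": {"r1": 3},
     "resource_table": {0: {"ISSUE": 1}, 2: {"MEM": 1}}}
s = {"access_actions": {}, "write_actions": {}, "resource_table": {}}

first = meet(e, s, last_issue_date=0)
second = meet(e, s, last_issue_date=0)
print(first, second, s)
```

The second call leaves s unchanged, so the successor would not be put back on the work-list.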
Convergence and Fixed-Points
Theorem 3. Inter-Block Scoreboard Scheduling converges in bounded time.
Theorem 4. Any locally scheduled program that satisfies the inter-block scheduling constraints is a fixed-point of Inter-Block Scoreboard Scheduling.
• Any pre-pass valid region schedule, which also satisfies the inter-block scheduling constraints at its boundary basic blocks, will be unchanged by Inter-Block Scoreboard Scheduling (pre-pass region schedules include: superblock schedules; trace schedules; wavefront schedules; software pipelines)
• Inter-Block Scoreboard Scheduling of a program with enough NOP padding to satisfy the inter-block scheduling constraints converges with only one pass on each basic block
ST200 VLIW Experiments
Scoreboard Scheduler Compared to Cycle Scheduler
• Compare the proposed scoreboard scheduler at window_size = 15 to the 'Cycle Scheduling' implementation of [Abraham 2000], improved with the linear-time dependence construction of [Verbrugge 2002], and a radix-4 heap unified ready queue
• Use the Open64-based ST200 VLIW production compiler at -O3 connected to the CLI-JIT code generator, and collect valgrind --tool=callgrind x86 instruction fetch profiles
• Code sent to the Scoreboard Scheduler is heavily optimized by Open64 and is also pre-pass scheduled / software pipelined
• Benchmarks from synthetic kernels and compute-intensive parts of the STMicroelectronics media processing firmware
Benchmark Selection and Scheduling Results (1)

Origin      Size   IPC    Cost Ratio   Perf. Ratio   Res. Query Ratio
mergesort     12   0.92         2.35          1.00               0.60
maxindex      12   2.00         2.52          1.00               0.67
fft32x32s     20   4.00         2.57          1.00               0.50
autcor        21   1.50         3.34          1.00               1.08
d6arith       27   0.87         2.78          1.00               0.60
sfcfilter     29   2.90         3.00          1.00               0.62
strwc         32   3.56         3.17          1.00               0.70
bitonic       34   3.78         3.55          1.00               1.00
floydall      52   1.41         3.62          1.00               0.67
pframe        59   1.59         3.82          1.00               0.63
Benchmark Selection and Scheduling Results (2)

Origin      Size   IPC    Cost Ratio   Perf. Ratio   Res. Query Ratio
polysyn       79   2.55         5.95          1.19               1.29
huffdec2      81   0.80         4.23          1.00               0.56
fft32x32s     83   3.61         5.21          1.09               1.00
dbuffer      108   3.18         5.67          1.03               1.00
polysyn      137   3.51         7.29          1.03               1.50
transfo      230   3.59         9.00          1.16               1.04
qplsf5       231   2.96         8.91          1.13               0.11
polysyn      256   1.63         8.79          1.00               0.57
polysyn      297   3.23         9.95          1.04               0.76
radial33     554   3.26        18.78          1.21               1.95
Scheduling Time as a Function of Basic Block Size
Time Breakdown for Cycle Scheduling and Scoreboard Scheduling
• 'Resource' accumulates the relative time spent in resource checking
• For a clean VLIW like the ST200, resource checking is not a scheduler bottleneck ⇒ finite-state automata resource checking [Proebsting Fraser 1994][Bala Rubin 1995] is not justified
Experiments with the STMicroelectronics CLI-JIT Compiler
• Expression Trees: The CLI expressions of the evaluation stack are typed and converted to a tree form
• Instruction Selection: Machine-level instructions are generated and the calling conventions (ABI) are implemented
• SSA Construction, SSA Destruction: Coalesce register copies and enforce ISA and ABI register operand constraints
• Register Allocation: Linear-scan register allocation [Wimmer 2005]
• Post-Pass Scheduling: Proposed inter-block scoreboard scheduler
• Instruction Encoding: Encode instructions, match bundle templates, encode instruction groups into bundles
• Create & Patch Code: Emit code, trampolines and relay jumps
STMicroelectronics CLI-JIT Compilation Time Breakdown
• Post-Pass Scheduling accounts for 10% of the compilation time (geometric average)
Observations and Conclusions
JIT Compilation of C Media Processing Applications
• New application domain for JIT and AOT compilation
• More instruction-level parallelism to exploit than with Java
• VLIW processors, possibly clustered, without interlocks
Inter-Block Scoreboard Scheduling Main Results
• Achieve efficiency (code quality) and speed (compilation time)
• Satisfy post-pass scheduling constraints along all program paths
• Preserve pre-pass region schedules that are still valid and that satisfy post-pass scheduling constraints at boundaries
Other Contributions of this Work
• Scoreboard Scheduling operates like the hardware scheduler of an out-of-order superscalar processor (without register renaming) ⇒ understand how Active Schedules produced by compilers can be reconstructed by hardware schedulers
• The non-decrease rule, which protects valid cyclic schedules from acyclic rescheduling, is more generally applicable
• Inter-Block Scheduling is mostly useful for post-pass scheduling
Extensions of this Work
• Scoreboard Scheduling for pre-pass scheduling on SSA form
• Scoreboard Scheduling for processors with register renaming