Inter-Block Scoreboard Scheduling in a JIT Compiler for VLIW Processors

Benoît Dupont de Dinechin, Research & Development Responsible, Software, Tools and Services (STS), STMicroelectronics, Grenoble (France)

Inter-Block Scoreboard Scheduling

Presentation Outline
• JIT for Media Processing
• Classic List Scheduling
• Scoreboard Scheduling
• Inter-Block Scheduling
• ST200 VLIW Experiments
• Observations and Conclusions

Euro-Par 2008 – August 28th 2008

JIT for Media Processing

Systems-On-Chip (SoCs) at STMicroelectronics
• STMicroelectronics SoCs are used in: consumer electronics (set-top boxes, car infotainment), telecoms infrastructure, mobile phones
• STMicroelectronics SoCs typically comprise:
  – Host processors: ARM family, ST40/SH4 processors
  – Application processors: DSPs, VLIW-Media (ST200 family)
  – Programmable hardware: processors with custom extensions, coarse-grained reconfigurable arrays (CGRA), GP-GPU
• By using a processor-neutral program representation, and AOT or JIT compilation, C / C++ media processing code may be dispatched to different processors
⇒ need byte-code for C / C++ programs: the Microsoft .NET Common Language Infrastructure (CLI) standard

The ST200 VLIW Media Family (ST210, ST220, ST231, ST240)

• Lx architecture [ISCA'00], partial predication with SELECT
• 63 × 32-bit general registers, 8 × 1-bit branch registers
• Scheduled resources: 4 × ISSUE, 1 × MEM, 1 × CTL, 2 × ODD

Post-Pass Scheduling Challenges
• Achieve efficiency (code quality) and speed (compilation time)
  – C media processing kernels expose significantly more instruction-level parallelism than Java applications
• Satisfy post-pass scheduling constraints along all program paths
  – Required on VLIW processors without interlocking hardware (MIPS ≡ Microprocessor without Interlocked Pipeline Stages)
• Preserve pre-pass region schedules that are still valid and that satisfy post-pass scheduling constraints at boundaries
  – Not only local pre-pass schedules, but also global pre-pass schedules including software pipelines (cyclic schedules)

Classic Approaches in Static and JIT Compilers
• Open64-based ST200 VLIW production compiler: post-pass schedule superblock regions, insert NOPs between regions to prevent scheduling hazards
• IBM Testarossa Java JIT compiler (zSeries 990 and POWER4): apply pre-pass and post-pass scheduling to a few code paths

Proposed Approach: Inter-Block Scoreboard Scheduling
• Scoreboard Scheduling is a restriction of classic Operation Scheduling that can be implemented efficiently
• Inter-Block Scheduling is an iterative scheduling constraint propagation reminiscent of forward data-flow analysis
• Combining these two techniques addresses all our “JIT for Media Processing Post-Pass Scheduling Challenges”

Classic List Scheduling

Sample Dependence Graph
Assume two execution units (scheduled resources) and 5 operations:

[Figure: dependence graph with nodes 0 to 5]

The dependence graph contains a dummy operation O0

Critical-Path Scheduling Priorities
Defined as longest path from operation start to end of execution:

Operation        O1   O2   O3   O4   O5
Execution Time    1    2    1    2    1
Priority          4    2    3    2    1
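The critical-path priorities can be illustrated with a minimal sketch. The execution times come from the table; the dependence edges (O1 → O3 → O4, with O2 and O5 independent) are an assumption, chosen only because they reproduce the listed priorities, since the original figure did not survive extraction. The sketch also assumes that a dependence latency equals the execution time of the producing operation.

```python
from functools import lru_cache

# Execution times taken from the table.
EXEC_TIME = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}

# Hypothetical dependence edges (the slide's figure is not recoverable):
# O1 -> O3 -> O4, with O2 and O5 independent.
SUCCESSORS = {1: [3], 2: [], 3: [4], 4: [], 5: []}


@lru_cache(maxsize=None)
def priority(op):
    """Longest path from the start of `op` to the end of execution."""
    tail = max((priority(s) for s in SUCCESSORS[op]), default=0)
    return EXEC_TIME[op] + tail


PRIORITIES = {op: priority(op) for op in EXEC_TIME}
print(PRIORITIES)
```

Under these assumed edges, the computed values match the Priority row of the table.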

Cycle Scheduling (Graham List Scheduling)
• Schedule by non-decreasing time slot order
• At each time slot, try to schedule all the dependence-ready operations in priority order

[Figure: Gantt chart of the cycle schedule of operations O1 to O5 on the two execution units]

Cycle Scheduling produces 'Non-Delay Schedules'
• No execution resources are left idle if there exists an operation that could start executing
• The set of Non-Delay Schedules may not contain an optimal schedule (for Makespan, Max-Lateness, and other regular measures)
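A minimal sketch of Cycle Scheduling, written as a time-driven loop over two identical, non-pipelined execution units. The execution times and priorities are those of the sample table; the dependence edges are hypothetical (O1 → O3 → O4, O2 and O5 independent), and ties in priority are broken by operation number, which is also an assumption.

```python
EXEC_TIME = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}
PRIORITY = {1: 4, 2: 2, 3: 3, 4: 2, 5: 1}
PREDECESSORS = {1: [], 2: [], 3: [1], 4: [3], 5: []}  # hypothetical edges
UNITS = 2


def cycle_schedule():
    """Return {operation: start time}, filling time slots in increasing order."""
    start, finish = {}, {}
    unit_free_at = [0] * UNITS
    time = 0
    while len(start) < len(EXEC_TIME):
        # Dependence-ready: unscheduled, and all predecessors have finished.
        ready = [op for op in EXEC_TIME
                 if op not in start
                 and all(finish.get(p, float("inf")) <= time
                         for p in PREDECESSORS[op])]
        for op in sorted(ready, key=lambda o: (-PRIORITY[o], o)):
            unit = next((u for u in range(UNITS)
                         if unit_free_at[u] <= time), None)
            if unit is None:
                break  # no free unit left in this time slot
            start[op] = time
            finish[op] = time + EXEC_TIME[op]
            unit_free_at[unit] = finish[op]
        time += 1
    return start


print(cycle_schedule())
```

With these inputs no unit stays idle while a ready operation exists, which is the non-delay property stated in the bullets.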

Operation Scheduling
• Consider operations in priority order, which must be a topological sort of the dependence graph
• Schedule an operation at the earliest time slot possible

[Figure: Gantt chart of the operation schedule of operations O1 to O5 on the two execution units]

Operation Scheduling produces 'Active Schedules'
• No operation can be completed earlier without delaying another operation
• Active Schedules contain Non-Delay Schedules and also optimal schedules (for Makespan, Max-Lateness, etc.)
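A minimal sketch of Operation Scheduling, written as an operation-driven loop: each operation, taken in priority order, is placed at the earliest slot that satisfies its dependences and finds a free unit. The dependence edges are hypothetical (O1 → O3 → O4, O2 and O5 independent), the two units are assumed identical and non-pipelined, and ties in priority are broken by operation number.

```python
EXEC_TIME = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}
PRIORITY = {1: 4, 2: 2, 3: 3, 4: 2, 5: 1}
PREDECESSORS = {1: [], 2: [], 3: [1], 4: [3], 5: []}  # hypothetical edges
UNITS = 2


def operation_schedule():
    """Return {operation: start time}, placing operations one at a time."""
    # Critical-path priorities give a topological order: a predecessor
    # always has a strictly higher priority than its successors.
    order = sorted(EXEC_TIME, key=lambda o: (-PRIORITY[o], o))
    start, finish = {}, {}
    busy = [set() for _ in range(UNITS)]  # occupied time slots per unit
    for op in order:
        time = max((finish[p] for p in PREDECESSORS[op]), default=0)
        while True:
            slots = set(range(time, time + EXEC_TIME[op]))
            unit = next((u for u in range(UNITS)
                         if not (busy[u] & slots)), None)
            if unit is not None:
                break
            time += 1  # resource conflict: try the next time slot
        busy[unit] |= slots
        start[op], finish[op] = time, time + EXEC_TIME[op]
    return start


print(operation_schedule())
```

Unlike a time-driven scheduler, this loop may place a later operation into an earlier slot left free by previously placed operations.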

Cases of Unit Execution Time (Pipelined Execution)
• Cycle Scheduling computes the same schedules as Operation Scheduling
• Optimality proved for various shapes of dependence graph
• Classic Graham performance bound for m resources: $2 - \frac{1}{m}$
• Performance bound for k types of resources, $m_i$ units of resource i, and z + 1 maximum latency: $(k + 1) - \frac{1}{(z+1)\,\max_{0 \le i < k} m_i}$
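As a worked illustration of the two bounds, the sketch below evaluates them with exact fractions. It takes the second bound to be (k + 1) − 1 / ((z + 1) · max m_i), which is an assumption about the slide's formula; the function names and the sample numbers are hypothetical.

```python
from fractions import Fraction


def graham_bound(m):
    """Classic Graham bound for m identical resources: 2 - 1/m."""
    return 2 - Fraction(1, m)


def multi_resource_bound(units, z):
    """Assumed bound for k = len(units) resource types with units[i] units
    of resource i and a maximum latency of z + 1."""
    k = len(units)
    return (k + 1) - Fraction(1, (z + 1) * max(units))


print(graham_bound(2))                    # two units
print(multi_resource_bound([2, 1], z=1))  # two resource types, latency 2
```

With a single resource type and z = 0, the second expression reduces to the Graham bound, which is a useful sanity check on the reading of the formula.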

Scoreboard Scheduling

Operation Scheduling within a time-window that never moves backward.
• Any operation is scheduled within a time window [window_start, window_start + window_size] of a constant number of time slots
• The window_start cannot decrease and is lazily increased
• Operation priority is the incoming order, either initial or as computed by the pre-pass scheduler ⇒ direct reuse of pre-pass scheduling efforts

Properties of Scoreboard Scheduling
• Bound on the number of resource checks proportional to window_size
• Same schedules as Operation Scheduling for window_size → ∞

Corollary 1. Schedules produced by Operation Scheduling or Cycle Scheduling are invariant under Scoreboard Scheduling.

Scoreboard Scheduler Implementation (1)
• The SABLE Java optimizer scheduler [Verbrugge 2002] replaces the dependence graph by an array of dependence lists (ADL), indexed by dependence record r, and constructed in O(n) time
• A register dependence latency(i→j) can be computed as follows:

RAW Dependence   latency(i→j) ≥ write_stage[i][r] − read_stage[j][r] + RAW[r]    (a)
                 latency(i→j) ≥ RAW[r]                                           (b)
WAW Dependence   latency(i→j) ≥ write_stage[i][r] − write_stage[j][r] + WAW[r]   (c)
                 latency(i→j) ≥ WAW[r]                                           (d)
WAR Dependence   latency(i→j) ≥ read_stage[i][r] − write_stage[j][r] + WAR[r]    (e)
                 latency(i→j) ≥ WAR[r]                                           (f)

• We omit explicit dependences and compute latencies as needed, based on the last access date access_actions[r] and the last write date write_actions[r], of each dependence record r
• We track scheduled resource uses with a classic resource table
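Each pair of inequalities (a)/(b), (c)/(d) and (e)/(f) can be read as a single maximum. The sketch below states that reading for one dependence record; the function names and the stage values used in the example are hypothetical, not taken from the ST200 pipeline description.

```python
def raw_latency(write_stage_i, read_stage_j, raw):
    """Smallest latency satisfying inequalities (a) and (b)."""
    return max(write_stage_i - read_stage_j + raw, raw)


def waw_latency(write_stage_i, write_stage_j, waw):
    """Smallest latency satisfying inequalities (c) and (d)."""
    return max(write_stage_i - write_stage_j + waw, waw)


def war_latency(read_stage_i, write_stage_j, war):
    """Smallest latency satisfying inequalities (e) and (f)."""
    return max(read_stage_i - write_stage_j + war, war)


# Hypothetical example: operation i writes r at stage 3, operation j reads r
# at stage 0, and the minimum RAW latency for r is 0.
print(raw_latency(write_stage_i=3, read_stage_j=0, raw=0))
```

The second inequality of each pair matters when the consumer accesses the record at a later pipeline stage than the producer: the stage difference then becomes negative, and the minimum latency for the record takes over.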

Scoreboard Scheduler Implementation (2)

try_schedule: Given an operation i, return the earliest dependence- and resource-feasible issue_date ≥ window_start; first, ensure:

Effect     Constraints
Read[r]    issue_date ≥ write_actions[r] − read_stage[i][r] + RAW[r]
Write[r]   issue_date ≥ write_actions[r] − write_stage[i][r] + WAW[r]
           issue_date ≥ access_actions[r]

then, increase issue_date while it conflicts with the resource table

add_schedule: Schedule an operation i at the issue_date returned by try_schedule; first, update access_actions and write_actions:

Effect     Updates
Read[r]    access_actions[r] ← max(access_actions[r], issue_date + WAR[r])
Write[r]   access_actions[r] ← max(access_actions[r], issue_date + WAW[r])
           write_actions[r] ← issue_date + write_stage[i][r]

window_start ← max(window_start, issue_date − window_size)

then, update the resource table by adding reservation_table[i]
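The two procedures can be sketched as follows. This is a simplified illustration, not the author's implementation: the resource table is a dictionary keyed by absolute date rather than a fixed-size window, each reservation table is reduced to the resources used in the issue cycle, and the RAW/WAW/WAR constants, operation descriptions, stage numbers and resource capacities are all hypothetical.

```python
from collections import defaultdict

# Hypothetical minimum latencies, identical for every dependence record.
RAW, WAW, WAR = 0, 1, 0


class Scoreboard:
    def __init__(self, capacity, window_size):
        self.capacity = capacity            # resource name -> units per cycle
        self.window_size = window_size
        self.window_start = 0
        self.access_actions = defaultdict(int)
        self.write_actions = defaultdict(int)
        self.resource_table = defaultdict(lambda: defaultdict(int))

    def try_schedule(self, op):
        """Earliest dependence- and resource-feasible date >= window_start."""
        date = self.window_start
        for r, stage in op["reads"].items():
            date = max(date, self.write_actions[r] - stage + RAW)
        for r, stage in op["writes"].items():
            date = max(date, self.write_actions[r] - stage + WAW,
                       self.access_actions[r])
        while any(self.resource_table[date][res] + n > self.capacity[res]
                  for res, n in op["uses"].items()):
            date += 1
        return date

    def add_schedule(self, op, issue_date):
        """Commit `op` at `issue_date` and update the scoreboard state."""
        for r in op["reads"]:
            self.access_actions[r] = max(self.access_actions[r],
                                         issue_date + WAR)
        for r, stage in op["writes"].items():
            self.access_actions[r] = max(self.access_actions[r],
                                         issue_date + WAW)
            self.write_actions[r] = issue_date + stage
        self.window_start = max(self.window_start,
                                issue_date - self.window_size)
        for res, n in op["uses"].items():
            self.resource_table[issue_date][res] += n


def schedule(ops, capacity, window_size):
    """Scoreboard-schedule `ops` in their incoming order."""
    sb = Scoreboard(capacity, window_size)
    dates = {}
    for op in ops:
        dates[op["name"]] = sb.try_schedule(op)
        sb.add_schedule(op, dates[op["name"]])
    return dates


# Hypothetical operations: stage numbers are relative to the issue date.
OPS = [
    {"name": "ld1", "reads": {}, "writes": {"r1": 3}, "uses": {"ISSUE": 1, "MEM": 1}},
    {"name": "ld2", "reads": {}, "writes": {"r2": 3}, "uses": {"ISSUE": 1, "MEM": 1}},
    {"name": "add", "reads": {"r1": 0, "r2": 0}, "writes": {"r3": 1}, "uses": {"ISSUE": 1}},
    {"name": "mov", "reads": {}, "writes": {"r4": 1}, "uses": {"ISSUE": 1}},
]
CAPACITY = {"ISSUE": 4, "MEM": 1}

print(schedule(OPS, CAPACITY, window_size=15))
print(schedule(OPS, CAPACITY, window_size=2))
```

In this example the independent `mov` is placed at date 0 with a window of 15 slots, but only at date 2 with a window of 2 slots, because scheduling `add` at date 4 has already advanced window_start. This is the sense in which a small window restricts Operation Scheduling.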

Inter-Block Scheduling

Scheduling of each basic block such that the resource and dependence constraints inherited from the predecessor basic blocks are also satisfied.

Inter-Block Scheduling Constraint Solver
Use a work-list algorithm to propagate the scoreboard scheduler states at the start and the end of each basic block until a fixed point is reached. For each basic block extracted from the work-list:
• From the start scoreboard scheduler state, scoreboard schedule the operations in non-decreasing order of their previous issue dates ⇒ new issue dates and new end scoreboard scheduler state
• For all successors of this basic block, meet their start scoreboard scheduler state with the end scoreboard scheduler state just obtained ⇒ if the start state changes, put the successor on the work-list
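The work-list propagation can be sketched on a deliberately reduced model. Here the scheduler state is only a map from register to the number of cycles after block entry at which its value becomes available, the block scheduler is single-issue and in-order, resource constraints are ignored, and the meet is an entry-wise maximum. The control-flow graph, the operations and their latencies are all hypothetical.

```python
from collections import deque

# Hypothetical CFG: A -> B, B -> B (loop back-edge), B -> C.
SUCCESSORS = {"A": ["B"], "B": ["B", "C"], "C": []}

# Each operation: (registers read, register written, result latency).
BLOCKS = {
    "A": [((), "r1", 3)],
    "B": [(("r1",), "r2", 1), (("r2",), "r1", 3)],
    "C": [(("r1",), "r3", 1)],
}


def schedule_block(ops, start_state, previous_dates):
    """Reschedule one block; issue dates never decrease."""
    ready = dict(start_state)
    dates, date = [], 0
    for (reads, write, latency), previous in zip(ops, previous_dates):
        date = max([date, previous] + [ready.get(r, 0) for r in reads])
        dates.append(date)
        ready[write] = date + latency
        date += 1  # simplification: one operation per cycle, in order
    length = dates[-1] + 1 if dates else 0
    # Keep only the latencies still pending after the end of the block.
    end_state = {r: t - length for r, t in ready.items() if t > length}
    return dates, end_state


def inter_block_schedule():
    dates = {b: [0] * len(ops) for b, ops in BLOCKS.items()}
    start = {b: {} for b in BLOCKS}
    worklist = deque(BLOCKS)
    while worklist:
        block = worklist.popleft()
        dates[block], end_state = schedule_block(
            BLOCKS[block], start[block], dates[block])
        for succ in SUCCESSORS[block]:
            merged = dict(start[succ])
            for reg, t in end_state.items():
                merged[reg] = max(merged.get(reg, 0), t)
            if merged != start[succ]:
                start[succ] = merged
                if succ not in worklist:
                    worklist.append(succ)
    return dates


print(inter_block_schedule())
```

In this example the load latency left pending at the end of block A delays the first operation of B, and the latency pending at the end of B delays the operation of C. The loop back-edge of B does not change B's start state, so the iteration stops after one visit per block.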

Scheduling Constraint Solver Meet Operator
Given the end scoreboard scheduler state e, the start scoreboard scheduler state s, and a delay along the control-flow edge:
• Elapse time so that e's window_start reaches the issue date of the last operation plus one (zero if e's basic block is empty) plus the delay
• Translate the time so that e's window_start becomes zero
• Update s by taking the maximum of the e and s entries in the resource table, the access_actions and the write_actions

The Non-Decrease Rule of Inter-Block Scheduling
Non-Decrease Rule: The operation issue dates must not decrease when rescheduling a basic block.
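The three steps of the meet operator can be sketched on dictionary-based states. The state layout (keys "access_actions", "write_actions" and "resource_table"), the function signature and the sample values are assumptions made for illustration; entries that fall at or before the new time origin are dropped here because they no longer constrain the successor.

```python
import copy


def meet(e, s, last_issue_date, delay):
    """Return s updated with the constraints carried over from e.

    `last_issue_date` is None when e's basic block is empty.
    """
    # Elapse: the successor starts after e's last issue date plus one
    # (zero for an empty block), plus the delay of the control-flow edge.
    origin = (0 if last_issue_date is None else last_issue_date + 1) + delay
    merged = copy.deepcopy(s)
    for key in ("access_actions", "write_actions"):
        for record, date in e[key].items():
            shifted = date - origin  # translate so the new origin is zero
            if shifted > 0:
                merged[key][record] = max(merged[key].get(record, 0), shifted)
    for date, uses in e["resource_table"].items():
        shifted = date - origin
        if shifted >= 0:
            row = merged["resource_table"].setdefault(shifted, {})
            for resource, units in uses.items():
                row[resource] = max(row.get(resource, 0), units)
    return merged


# Hypothetical states: dates in e are relative to the start of e's block.
e = {"access_actions": {"r1": 1}, "write_actions": {"r1": 3},
     "resource_table": {0: {"MEM": 1}, 2: {"MEM": 1}}}
s = {"access_actions": {}, "write_actions": {"r1": 1, "r2": 4},
     "resource_table": {1: {"ISSUE": 2}}}

print(meet(e, s, last_issue_date=0, delay=0))
```

Taking the maximum keeps every constraint that any predecessor imposes, so the updated start state is safe along all incoming edges.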

Convergence and Fixed-Points
Theorem 3. Inter-Block Scoreboard Scheduling converges in bounded time.
Theorem 4. Any locally scheduled program that satisfies the inter-block scheduling constraints is a fixed-point of Inter-Block Scoreboard Scheduling.
• Any pre-pass valid region schedule, which also satisfies the inter-block scheduling constraints at its boundary basic blocks, will be unchanged by Inter-Block Scoreboard Scheduling (pre-pass region schedules include: superblock schedules; trace schedules; wavefront schedules; software pipelines)
• Inter-Block Scoreboard Scheduling of a program with enough NOP padding to satisfy the inter-block scheduling constraints converges with only one pass on each basic block

ST200 VLIW Experiments

Scoreboard Scheduler Compared to Cycle Scheduler
• Compare the proposed scoreboard scheduler at window_size = 15 to the 'Cycle Scheduling' implementation of [Abraham 2000], improved with the linear-time dependence construction of [Verbrugge 2002] and a radix-4 heap unified ready queue
• Use the Open64-based ST200 VLIW production compiler at -O3 connected to the CLI-JIT code generator, and collect valgrind --tool=callgrind x86 instruction fetch profiles
• Code sent to the Scoreboard Scheduler is heavily optimized by Open64 and is also pre-pass scheduled / software pipelined
• Benchmarks from synthetic kernels and compute-intensive parts of the STMicroelectronics media processing firmware

Benchmark Selection and Scheduling Results (1)

Origin      Size   IPC    Cost Ratio   Perf. Ratio   Res. Query Ratio
mergesort     12   0.92         2.35          1.00               0.60
maxindex      12   2.00         2.52          1.00               0.67
fft32x32s     20   4.00         2.57          1.00               0.50
autcor        21   1.50         3.34          1.00               1.08
d6arith       27   0.87         2.78          1.00               0.60
sfcfilter     29   2.90         3.00          1.00               0.62
strwc         32   3.56         3.17          1.00               0.70
bitonic       34   3.78         3.55          1.00               1.00
floydall      52   1.41         3.62          1.00               0.67
pframe        59   1.59         3.82          1.00               0.63

Benchmark Selection and Scheduling Results (2)

Origin      Size   IPC    Cost Ratio   Perf. Ratio   Res. Query Ratio
polysyn       79   2.55         5.95          1.19               1.29
huffdec2      81   0.80         4.23          1.00               0.56
fft32x32s     83   3.61         5.21          1.09               1.00
dbuffer      108   3.18         5.67          1.03               1.00
polysyn      137   3.51         7.29          1.03               1.50
transfo      230   3.59         9.00          1.16               1.04
qplsf5       231   2.96         8.91          1.13               0.11
polysyn      256   1.63         8.79          1.00               0.57
polysyn      297   3.23         9.95          1.04               0.76
radial33     554   3.26        18.78          1.21               1.95

Scheduling Time as a Function of Basic Block Size

Time Breakdown for Cycle Scheduling and Scoreboard Scheduling

• The 'Resource' category cumulates the relative time spent in resource checking
• For a clean VLIW like the ST200, resource checking is not a scheduler bottleneck ⇒ finite-state automata resource checking [Proebsting Fraser 1994][Bala Rubin 1995] is not justified

Experiments with the STMicroelectronics CLI-JIT Compiler
• Expression Trees: The CLI expressions of the evaluation stack are typed and converted to a tree form
• Instruction Selection: Machine-level instructions are generated and the calling conventions (ABI) are implemented
• SSA Construction, SSA Destruction: Coalesce register copies and enforce ISA and ABI register operand constraints
• Register Allocation: Linear-scan register allocation [Wimmer 2005]
• Post-Pass Scheduling: Proposed inter-block scoreboard scheduler
• Instruction Encoding: Encode instructions, match bundle templates, encode instruction groups into bundles
• Create & Patch Code: Emit code, trampolines and relay jumps

STMicroelectronics CLI-JIT Compilation Time Breakdown

• Post-Pass Scheduling takes 10% of the compilation time (geometric average)

Observations and Conclusions

JIT Compilation of C Media Processing Applications
• New application domain for JIT and AOT compilation
• More instruction-level parallelism to exploit than with Java
• VLIW processors, possibly clustered, without interlocks

Inter-Block Scoreboard Scheduling Main Results
• Achieve efficiency (code quality) and speed (compilation time)
• Satisfy post-pass scheduling constraints along all program paths
• Preserve pre-pass region schedules that are still valid and that satisfy post-pass scheduling constraints at boundaries

Other Contributions of this Work
• Scoreboard Scheduling operates like the hardware scheduler of an out-of-order superscalar processor (without register renaming) ⇒ understand how Active Schedules produced by compilers can be reconstructed by hardware schedulers
• The non-decrease rule, which protects valid cyclic schedules from acyclic rescheduling, is more generally applicable
• Inter-Block Scheduling is mostly useful for post-pass scheduling

Extensions of this Work
• Scoreboard Scheduling for pre-pass scheduling on SSA form
• Scoreboard Scheduling for processors with register renaming
