Code Generator Optimizations for the ST120 DSP-MCU Core

B. Dupont de Dinechin, F. de Ferriere, C. Guillon, A. Stoutchinin

STMicroelectronics

ABSTRACT The ST120 Digital Signal Processor - Micro-Controller Unit (DSP–MCU) core was designed by STMicroelectronics in order to meet the ever-increasing digital signal processing requirements of portable and consumer applications. Like other recent high-end DSP–MCU cores, the ST120 blends traditional DSP features with modern Instruction-Level Parallelism (ILP) capabilities. Compiler management of the ST120 features presents a unique challenge to code generation. The ST120 Linear Assembly Optimizer (LAO) effectively exploits instruction-level parallelism, while enabling compact code size. In this paper, we focus on the LAO implementation of the SSA representation, the IF-conversion, the SLIW scheduling, and the LAO improvements to register allocation. This includes solutions to problems that arise when compiler optimizations are applied to assembly-level, already predicated code.

1. INTRODUCTION

Processors for portable applications are required to deliver high-performance digital signal processing at low power. In addition, the ROMs that contain application codes significantly contribute to system power consumption, so processors are also required to expose a compact instruction set. Traditional means of achieving high processor performance, such as superscalar implementation with branch prediction, out-of-order execution, and large register sets that consume instruction encoding space, cannot be used when both energy efficiency and code size are of primary concern. On the software side, use of Instruction-Level Parallelism (ILP) enhancing techniques such as loop unrolling, software pipelining [33] with modulo expansion [26], or branch straightening with tail duplication, is severely constrained by the near-zero tolerance of portable applications to code size increase. The ST120 core addresses these problems by decoupling the address computations from the data computations. Decoupled implementation consumes less energy and takes less die area than an out-of-order superscalar implementation, yet retains some of its important benefits. In particular,


access-execute decoupling makes loads to data registers execute with an apparent zero RAW latency, thus obviating the need for software pipelining and modulo renaming on most digital signal processing kernels. The ST120 core also supports an advanced predication model that includes branch shadow execution. Inside a branch shadow, predicated instructions may execute before the branch is resolved. This is especially effective for conditional branches that cannot be removed by predication, such as loop early exits. In this paper, we first introduce the ST120 architecture and micro-architecture features that challenge code generators. We then describe our approach to SSA optimization, predication, instruction scheduling, and register allocation, as implemented in our ST120 Linear Assembly Optimizer (LAO) tool. In the last section, we demonstrate the efficiency of our techniques by measuring how performance and code size are affected by running the LAO on the output of the best available ST120 production C compilers.

Figure 1: The ST120 architectural registers.

2. THE ST120 DSP-MCU CORE

2.1 ST120 Architecture Overview

The STMicroelectronics ST120 [34] is a Digital Signal Processor – Micro Controller Unit (DSP–MCU) core, designed for digital signal processing in telecommunication applications. In order to effectively support C/C++ compilers, DSP-MCUs combine traditional DSP features with VLIW and EPIC principles [16, 35, 12]. In particular, the ST120 Instruction Set Architecture (ISA) is a predicated load-store architecture that defines three register sets (figure 1):

Data Registers This set contains 16 40-bit registers, any of which can be used as accumulators by the multiply-accumulate instructions.

Address Registers This set contains 16 32-bit registers, whose main purpose is to provide the base and offset of memory accesses.

Control Registers This set includes all the specialized registers, including the Guard Register (GR), which contains the guards (predicate registers) used for instruction predication, and three hardware loop register sets.

Architectural DSP support includes multiply-accumulate instructions that multiply 16-bit sub-registers and accumulate into 40 bits using signed, unsigned, or fractional arithmetic [12]. Other data instructions uniformly support 16-bit packed, 32-bit, and 40-bit arithmetic for signed, unsigned, or fractional data types. Unlike older DSPs, the ST120 architecture does not introduce mode bits, as arithmetic types and rounding modes are explicit in the instructions. The ST120 addressing modes include auto-modification of base address pointers. Post-modification is the default case, although one addressing mode allows pre-modification for implementing software FIFOs. Other DSP addressing modes include modulo addressing and bit-reversed addressing. Hardware loops enable the ST120 to iterate a block of instructions as a counted loop without explicit decrementing, testing, and branching. The ST120 also supports uncounted (infinite) hardware loops.

The ST120 ISA defines two main instruction modes, respectively called GP16 and GP32. In the GP32 mode, instructions are encoded in 32 bits, can have up to four register operands, and can access all the architectural resources. In the GP16 mode, instructions are encoded in 16 bits, have two register operands (except ADD / SUB), and may only access limited architectural resources. The purpose of the GP16 mode is to enable compact code size on microcontroller tasks, following the principles pioneered by the ARM architecture with its Thumb instruction mode [1]. The two ST120 instruction modes share the same Application Binary Interface (ABI), and mode switching automatically takes place at function boundaries, depending on the alignment of the called function. A function byte address aligned 0 mod 4 triggers the GP32 instruction mode, while a function address aligned 2 mod 4 triggers the GP16 instruction mode. Within a function, explicit mode switching is also possible through dedicated instructions.
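As an aside, the alignment rule that drives automatic mode switching can be captured in a few lines. The sketch below is a hypothetical helper, not part of the ST120 toolchain; it merely restates the 0 mod 4 / 2 mod 4 convention described above.

```python
def instruction_mode(function_address: int) -> str:
    """Instruction mode selected at a call, from the byte alignment of
    the called function: 0 mod 4 selects GP32, 2 mod 4 selects GP16."""
    remainder = function_address % 4
    if remainder == 0:
        return "GP32"
    if remainder == 2:
        return "GP16"
    # Function entry points are at least 2-byte aligned in both modes.
    raise ValueError("misaligned function entry point")

assert instruction_mode(0x2000) == "GP32"
assert instruction_mode(0x2002) == "GP16"
```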

2.2 The ST120 Decoupled Implementation

The ST120 core implements access-execute decoupling [36], following the same design principles as in the floating-point unit of the MIPS R8000 [22]. In the GP16 and GP32 instruction modes, execution is in-order two-way superscalar. Aligned pairs of instructions are simultaneously decoded and expanded by the Control Unit (CU) into micro-instructions, which are sent to the dispatch queues of four parallel execution pipelines: two for the Address Unit / General Unit (AU/GU), and two for the Data Unit (DU) (figure 2). The ST120 decoupled implementation is best illustrated on loads to DU registers. For each DU load, the CU generates two micro-instructions: the effective address computation, which is dispatched to one of the AU/GU execution pipelines; the receive of the loaded value, which is dispatched

to one of the DU execution pipelines. (Selection of the execution pipelines inside the AU/GU and the DU only depends on the static alignment of the load instruction.) The receive of the loaded value, and the execution of the subsequent DU micro-instructions, wait for the data to return from memory. The DU dispatch queues are deep enough to prevent these delays of DU execution from stalling the CU. From the programmer's point of view, a DU load appears to return data in zero cycles to RAW-dependent instructions. This feature is especially useful on a DSP–MCU, as it decreases the need for architectural registers, and also reduces the code size. In particular, software pipelining is not required for most digital signal processing kernels. On the other hand, the DU loads are blocking, meaning that the DU instructions that follow a DU load execute after the loaded value returns from memory, even if they do not use this value. Loads to the AU or the CU registers are non-blocking, but they do expose the full memory latency.

Figure 2: The ST120 decoupled implementation.

In addition to the GP16 and GP32 instruction modes, the ST120 core implements a Scoreboarded Long Instruction Word (SLIW) instruction mode, where "instruction bundles" comprised of two AU/GU GP32 instructions and two DU GP32 instructions dispatch every cycle. In the SLIW mode, the data dependences are scoreboarded, provided they hold between instructions that belong to different bundles. Scoreboarding means that dynamic delays are inserted whenever required to enforce the dependences. Inside a bundle, the data dependences are not scoreboarded, but the zero-latency dependences are enforced, in particular the RAW dependences that originate from the DU loads. Both the inter-bundle scoreboarding and the intra-bundle zero-latency dependences significantly reduce the size of the SLIW code, compared to the equivalent VLIW code. In the memory layout of a SLIW bundle, the two AU/GU instructions always come first, followed by the two DU instructions. However, in order to make the SLIW mode friendlier in assembly language, the four GP32 instructions that make up a SLIW bundle do not always appear in this memory order. Precisely, the ST120 assembler syntax defines the following SLIW "groupings" (SLIW bundle templates):

Group  Instr. 1  Instr. 2  Instr. 3     Instr. 4
0      Load      Load      DU-op        DU-op
1      Load      DU-op     DU-op        Branch
2      Load      DU-op     DU-op        Store|AU-op
3      DU-op     DU-op     Store|AU-op  Store|AU-op
4      DU-op     DU-op     Store|AU-op  Branch

The purpose of these SLIW groupings is to enforce an order of the instructions within a SLIW bundle such that the intra-bundle zero-latency RAW dependences are ordered from left to right. For the compiler however, this assembly reordering of the instructions is merely cosmetic. The real challenge is to exploit the zero-latency intra-bundle dependences, which include RAW, WAR, and WAW data dependences.

2.3 Predication and Branch Shadow

Modern architectural support for predicated execution is defined by the IMPACT-EPIC architecture, itself generalized from the Cydra-5 and PlayDoh architectures [2]. This architectural support, later referred to as the "IMPACT" predication model, introduces Predicate Define Instructions (PDIs) with the syntax: Gy type0, Gz type1 = (Rn cond Rp) Gx. Here, Gx is the source predicate register, Gy and Gz are the destination predicate registers, and type0 and type1 are the predicate types defined as follows:

Figure 3: The LAO internal phases.

Unconditional Destination is written Gx ∧ (Rn cond Rp).
Parallel-AND Destination is written 0 if Gx ∧ ¬(Rn cond Rp).
Parallel-OR Destination is written 1 if Gx ∧ (Rn cond Rp).
Normal Destination is written (Rn cond Rp) if Gx.
Disjunctive Destination is written 1 if Gx ∨ (Rn cond Rp).
Conjunctive Destination is written 0 if ¬Gx ∨ ¬(Rn cond Rp).

The ST120 architecture allows almost all the GP32 instructions (including those in SLIW bundles) to be predicated. Predicate values are taken from the Guard Register, which contains 16 individual guards (guard G15 is always true). These guards are set by the Guard Modification Instructions (GMIs). Compared to the IMPACT model [2], the distinguishing features of the ST120 predication are:

• An ST120 GMI may only define one guard. This scheme is easier to manage by the compiler predication algorithms, while losing little expressive power under techniques like the Program Decision Logic predication [2].
• All the ST120 GMIs are guardable, that is, they correspond to the Normal IMPACT PDIs. Unconditional PDIs, which are the default on IMPACT, are not so useful on the ST120 because non-guardable effects are not compatible with branch shadow execution (see below).
• The ST120 GMIs include "Guard Combination Instructions" that perform logical operations (AND, OR, XOR, etc.) between individual guards. This fills the need for the Conjunctive and Disjunctive PDIs, which were recently added to the IMPACT model [2].
• The ST120 GMIs do not include the Parallel-AND and Parallel-OR instructions, due to instruction encoding space constraints, and given their limited performance benefits for the issue width of two DU instructions.
• The ST120 conditional branches are taken when their guard is false. Non-branch instructions are effectively executed when their guard is true. The main reason for this is the support of branch shadow execution.

Branch shadow execution is an innovative ST120 feature whose purpose is to eliminate the not-taken conditional branch penalty. Use of the ST120 branch shadow is as follows:

EQW      g0, R1, R2    // GMI sets g0 if R1 == R2
g0! GOTO label         // go to label if g0 false
g0? ADD  R3, R3, R2    // ADD into R3 if g0 true
g0? SUB  R0, R1, R3    // SUB into R0 if g0 true

That is, when a conditional branch is predicated on the false value of a guard (g0!), subsequent instructions execute while the branch is being resolved, provided they are predicated on the true value of the same guard (g0?). Branch shadow execution thus achieves the same effects as the (not-taken) branch prediction and hardware speculative execution of superscalar processors, without the complexity and the high power requirements of such implementations [28].
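To make the predicate-define semantics listed at the beginning of this section concrete, the following sketch restates the six PDI types as an executable table; returning None models a destination guard that is left unchanged. The function is purely illustrative and is not part of any ST120 tool.

```python
from typing import Optional

def pdi_write(pdi_type: str, gx: bool, cond: bool) -> Optional[bool]:
    """Value written to the destination guard by a PDI of the given type,
    with source predicate Gx and comparison result cond, or None when the
    destination is left unchanged (per the type definitions in the text)."""
    if pdi_type == "Unconditional":   # always written with Gx AND cond
        return gx and cond
    if pdi_type == "Parallel-AND":    # written 0 if Gx AND NOT cond
        return False if (gx and not cond) else None
    if pdi_type == "Parallel-OR":     # written 1 if Gx AND cond
        return True if (gx and cond) else None
    if pdi_type == "Normal":          # written cond if Gx
        return cond if gx else None
    if pdi_type == "Disjunctive":     # written 1 if Gx OR cond
        return True if (gx or cond) else None
    if pdi_type == "Conjunctive":     # written 0 if NOT Gx OR NOT cond
        return False if (not gx or not cond) else None
    raise ValueError(pdi_type)
```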

3. CODE GENERATOR OPTIMIZATIONS

3.1 The ST120 Linear Assembly Optimizer

A number of ST120 features challenge traditional compiler code generators. These features include the split AU – DU register file, the decoupled implementation, the SLIW instruction mode, the predication model with branch shadow, the DSP addressing modes, and the DSP arithmetic support. Compiler exploitation of these features is demonstrated by the ST120 Linear Assembly Optimizer (LAO). The purpose of the LAO is to convert a program written in Linear Assembly Input (LAI) language to the ST120 basic assembly language that is suitable for assembly, linking, and execution. Although initially designed for the ST120 assembly language programmers, the LAO has proved itself very effective at improving the ST120 C compiler output. The LAI language is a superset of the GP32 assembly language, extended by the ST120 assembly level intrinsic functions, and where symbolic register names can be freely used. From this input, the LAO produces a program where the GP32 and SLIW instruction-level parallelism is exploited, the ST120 intrinsic functions are expanded, and the symbolic registers are mapped to architectural registers. The different LAO processing phases are depicted in figure 3. After initial macro-processing and parsing of a LAI source file, an Assembler Intermediate Representation (AIR), which basically amounts to code and data streams, is built. This AIR is then lifted to a LAO Intermediate Representation (LIR), where the function boundaries and the data segments are identified, and where all the instruction effects are exposed. From there, some of the most important code generator optimizations that are applied are:

Loop Analysis Based on the DJ-graph technique [37].
Range Optimizations Collect integer ranges, and remove redundant sign extensions.
Global Optimizations Include constant propagation, dead code elimination, and expression simplification.
Loop Restructuring Unrolls loops, and maps inner loops to counted or uncounted hardware loops.
Address Optimizations Exploit the DSP addressing modes, and balance expressions between the DU and the AU.
Pattern Optimizations Recognize the DSP instruction patterns, and substitute more efficient code.
Intrinsic Expansion Context-sensitive macro expansion, including the optimization of division by constants.
IF-conversion Predication of single-entry control-flow regions into superblocks (§3.3).
Prepass Scheduling Superblock scheduling, and software pipelining with modulo scheduling (§3.4).
Register Allocation Effective register assignment and spilling of predicated code (§3.5).
Postpass Scheduling Superblock scheduling, branch shadow exploitation, and SLIW code linearization (§3.4).

Following these optimizations, the LIR is mapped back to the AIR, which is then emitted as assembly code.

In the remainder of this section, we focus on the LAO implementation of the Static Single Assignment (SSA) representation of the program, the IF-conversion phase, the instruction scheduling phase, and the register allocation phase. The main purpose of the SSA representation in the LAO is to enable efficient global and loop analyses, and to ease implementation of the related optimizations. SSA-based analyses include loop induction variables, integer range propagation, and simple alias analysis. SSA-based optimizations include constant propagation, dead code elimination, useless sign extension removal, expression simplification, generation of auto-modification addressing modes, and expression balancing between the DU and the AU. Another benefit of the SSA representation is that it renames many multiple register definitions, so the instruction scheduler is significantly less constrained by WAR and WAW register dependences.

Advanced IF-conversion is required on the ST120 core to optimize both code size and code performance. As the ST120 does not implement branch prediction, GOTO instructions execute in several cycles, even when the branch is unconditional, or conditional not-taken. By using the hardware loops, all performance-critical looping branches are converted to early exit branches that are mostly not taken. The cost of early exit branches is minimized thanks to branch shadow exploitation. On other conditional branches, IF-conversion also yields large improvements, as it eliminates GOTO instructions and enables the creation of larger scheduling regions.

Instruction scheduling is critical for code performance on the ST120 core, due to its issue width of 4 instructions in the SLIW mode, and because of the long register flow latencies that appear when moving values from the DU to the AU/GU. The decoupled implementation of the ST120 core creates other peculiarities. First, loads to the DU registers are blocking, a feature that challenges traditional instruction schedulers. Second, the zero latencies must be exploited in order to benefit from the decoupled implementation. Third, the instruction scheduler has to exploit the branch shadow.

Efficient register allocation is crucial on the ST120, which has a relatively small number of registers arranged in separate register files. The LAO register allocator is designed to efficiently allocate registers and generate spill code for predicated code, while reducing the number of MOVE operations through the application of repeated register coalescing. Repeated register coalescing is also used during the conversion out of the SSA representation (see §3.2).

3.2 The LAO SSA Representation

The Static Single Assignment (SSA) representation of programs enables efficient global analysis and optimizations [17, 39, 5]. In the LAO, a problem arises as the SSA representation has to deal with already predicated operations that may come from the LAI code. This requires supporting predicated code in the LAO SSA representation, even though the LAO IF-conversion is performed after the SSA optimizations. The problem is that the SSA representation does not express the merge points of predicated definitions.

To address this issue, on each predicated operation, we add an implicit use of the previous (perhaps predicated) definition of the defined register. We attach the following meaning to an implicit use: the defined register takes the result of the evaluation of the operation if this operation's predicate is true, otherwise it takes the result of the previous reaching definition. This allows definitions that affect the same register under different predicates to be renamed, and safe SSA optimizations to be performed on predicated code.
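As an illustration of this device, the sketch below attaches implicit uses on a hypothetical three-address IR; the Op class and its field names are ours, not the LAO data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Op:
    opcode: str
    defs: List[str]
    uses: List[str]
    predicate: Optional[str] = None            # guard, None if unguarded
    implicit_uses: List[str] = field(default_factory=list)

def add_implicit_uses(ops: List[Op]) -> None:
    """Before SSA construction, make every predicated definition also read
    the register it defines: if the predicate is false, the destination
    keeps the value of the previous reaching definition, so SSA renaming
    must keep that definition live up to this operation."""
    for op in ops:
        if op.predicate is not None:
            for reg in op.defs:
                op.implicit_uses.append(reg)

# Two definitions of r1 under complementary guards can now be renamed to
# distinct SSA names, the second implicitly using the first, instead of
# being treated as unrelated definitions of the same resource.
block = [Op("add", ["r1"], ["a", "b"], predicate="g0?"),
         Op("sub", ["r1"], ["c", "d"], predicate="g0!")]
add_implicit_uses(block)
```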
When translating the program back from the SSA representation, we have to express the fact that the two registers (the predicated definition and its implicit use) must be renamed to the same register. We implemented the SSA translation algorithm from [27], which provides a way to handle two-operand machine instructions, where the use and the definition share the same register. Another advantage of the algorithm from [27] is that it enables the LAO implementation to correctly enforce the register targeting constraints of the ST120 Application Binary Interface (ABI).

On the other hand, we found that such use of the algorithm from [27] generates too many MOVE operations: not only are MOVEs generated from translating the Φ-operations, but one MOVE is also generated for each predicated definition. These MOVEs are eliminated on the regions selected for IF-conversion and software pipelining by applying the repeated register coalescing algorithm (see §3.5) before the prepass scheduling. The MOVE elimination is important at this point, since otherwise both the IF-conversion and the prepass instruction scheduler would have to deal with them.

3.3 The LAO IF-Conversion

The LAO implementation of the IF-conversion is based on Fang's algorithm [15]. This algorithm allows the IF-conversion of single-entry acyclic control-flow regions, based on three steps: the first step assigns a control predicate to each basic block in the region to be predicated; the second step assigns predicates to the region's individual operations, with on-the-fly predicate promotion; in the third step, intra-region branches are removed, and the machine code that computes the control predicates is created.

foreach operation ∈ block
    Gp ← getAssignedPredicate(operation)
    if isNotPredicated(operation)
        operation.predicate ← Gp
    else
        Gx ← operation.predicate
        if existsLocalRenaming(Gx, Gp)
            G′x ← getLocalRenaming(Gx, Gp)
        else
            G′x ← newRegister()
            if existsLocalDefinition(Gx)
                localOptimize(operation, Gx, G′x, Gp)
            else
                if isConditionalExitBranch(operation)
                    insertOperation(G′x ← Gx ∨ ¬Gp)
                else
                    insertOperation(G′x ← Gx ∧ Gp)
                end if
            end if
            setLocalRenaming(Gx, Gp, G′x)
        end if
        operation.predicate ← G′x
    end if
    if isPredicateDefine(operation)
        setLocalDefinition(Gx, operation)
        killLocalRenaming(Gx)
    end if
end foreach

Figure 4: Local predicate assignment in the LAO.


Fang's algorithm has a number of interesting properties: it minimizes the number of control predicates; it inserts predicate definitions as high as possible in the dominator tree; and it control-speculates operations by predicate promotion. The IMPACT-1 compiler [29] achieves similar results through the application of predicate promotion (instruction promotion) on top of the seminal RK-algorithm [32].

The problems we had to address when implementing the LAO IF-conversion are (1) the identification of the control region to predicate, (2) the correct assignment of predicates to the conditional branches exiting the predicated region (these are not removed by IF-conversion), and (3) the correct processing of the predicated code supplied as input to the IF-conversion. The second problem is related to the fact that the ST120 branches are active on the guard value false. The third problem arises when processing LAI code that contains already predicated operations.

The LAO heuristic for choosing the region for predication calculates a threshold such that when the shortest execution path is penalized beyond this threshold, the IF-conversion is not authorized. In order to address the other two issues, we modified steps two and three of Fang's algorithm. The pseudo-code in figure 4 implements the modifications required to assign the new predicates Gp to the operations of each basic block in the predicated region. When an operation is previously predicated with Gx, which is locally defined in the block but has not been locally renamed (meaning that this is the first use of Gx as a predicate seen so far), function localOptimize is called to generate better code than the default code supplied to insertOperation in figure 4. Function localOptimize first identifies the Guard Modification Instruction (GMI) that defines the predicate operand Gx of a predicated operation, then applies one of the rewrite rules listed in figure 5, where LGD means Local Guard Definition, and DCE marks likely candidates for Dead Code Elimination. In case of a predicated GOTO operation (conditional exit branch), the first rule in figure 5 applies.

Figure 5: Rewrite rules of function localOptimize (input code pattern → optimized predicated code, with LGD and DCE annotations).
The second and the third rules apply to the operation patterns commonly found in already predicated LAI code. Finally, we reuse function localOptimize in the third step of Fang's algorithm, in order to optimize the code that computes the control predicates of the successors of the current basic block. Assuming the current basic block predicate is Gp, Fang's algorithm requires the generation of code that computes Gy ← Gx ∧ Gp and Gz ← ¬Gx ∧ Gp, where Gx is the predicate used by the current basic block tail conditional branch. If Gy has already been defined before, these contributions to the computation of Gy must be predicated with Gp. In case Gx is locally defined in the current basic block, one of the four last patterns in figure 5 applies.
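For completeness, the default code generation of this third step can be sketched as follows; the function and its calling convention are illustrative only, and the improved patterns of figure 5 replace this default output whenever localOptimize applies.

```python
from typing import List, Optional, Set, Tuple

def successor_predicate_code(gp: str, gx: str, gy: str, gz: str,
                             already_defined: Set[str]
                             ) -> List[Tuple[Optional[str], str]]:
    """Default step-three output for a block predicated on Gp whose tail
    conditional branch tests Gx: compute Gy = Gx & Gp and Gz = !Gx & Gp
    for the two successors.  A contribution to a predicate that was
    already defined earlier must itself be guarded by Gp.  Each result is
    a (guard, operation) pair, guard None meaning unguarded."""
    return [(gp if gy in already_defined else None, f"{gy} <- {gx} & {gp}"),
            (gp if gz in already_defined else None, f"{gz} <- !{gx} & {gp}")]

# Example: Gy (g5) was defined in an earlier block, Gz (g6) was not.
print(successor_predicate_code("g4", "g2", "g5", "g6", {"g5"}))
# [('g4', 'g5 <- g2 & g4'), (None, 'g6 <- !g2 & g4')]
```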

3.4 The LAO Instruction Scheduler

The LAO instruction scheduler performs two main functions: before register allocation (prepass scheduling), modulo scheduling software pipelining is applied to all the inner hardware loops whose body is a superblock, while block scheduling is applied to all the other superblocks; after register allocation (postpass scheduling), block scheduling is applied to all the previously scheduled superblocks that have been modified by spill code insertion.

Superblocks are single-entry control-flow regions whose basic blocks have one predecessor, and zero or one successor, in the region [23]. The LAO assumes that the order found in the LAI code is significant, with the most frequent paths already straightened. Accordingly, all the LAO phases try to maintain this order, so the superblocks are selected greedily from the incoming sequence of basic blocks. Unlike [21], we do not currently apply tail duplication to build larger superblocks, as this increases the code size. Since the LAO instruction scheduler runs after the IF-conversion, its superblocks actually contain predicated code, like the hyperblocks of [29].

The main issues we had to address when implementing the LAO instruction scheduler are (1) dealing with the blocking DU loads of the ST120 decoupled implementation, (2) dealing with the zero-latency dependences, (3) software pipelining of the hardware loops, in particular with regard to modulo expansion [26], and (4) branch shadow exploitation.

To address the blocking DU load problem, the LAO instruction scheduler tries to schedule these operations as close as possible to their first use, without enforcing a hard scheduling constraint. In order to achieve this, we used the lifetime-sensitive scheduling framework that minimizes the MinLife [8] of the instruction schedule. This lifetime-sensitive framework, as well as the related MinBuffer framework [30], have already been shown to be effective at lowering the register pressure in modulo scheduling software pipelining [14]. The best part however is that these frameworks lead to a simple and efficient network flow formulation over an extension of the dependence graph [10]. In particular, we addressed the blocking DU load problem in MinLife simply by increasing fourfold the weight of the register lifetimes that originate from these loads. As a result, virtually all the DU loads end up being scheduled by the LAO instruction scheduler zero or one cycle away from their first use.

Precise exploitation of the zero-latency dependences on the ST120 is important, as it reduces the register pressure, and also the code size, especially in the SLIW mode. In a superscalar processor instruction scheduler, all the operations that are scheduled together at any given cycle, called the issue group, are assumed independent. Later in the code generator, when the instruction schedule is linearized back to a sequence of instructions, reordering is applied inside the issue groups to ensure that the processor run-time instruction issue logic will actually recreate the same instruction schedule (issue dates, functional resource bindings) as planned by the instruction scheduler. Such issue group reordering is always legal, as an issue group contains independent operations. With some zero-latency dependences, the task of the instruction scheduler becomes more complex, as the operations inside an issue group are no longer independent. To solve this problem, we first use the fact that the ST120 run-time binding of reservation tables to an instruction only depends on the memory position ("slot") of that instruction inside an aligned instruction pair (GP32) or quadruple (SLIW). The LAO instruction scheduler then exploits an ST120 machine model where the reservation tables contain extra resources to prevent the occurrence of issue groups that could violate zero-latency dependences. For instance, these reservation tables enable the (DU-LOAD, DU-STORE) aligned instruction pair, but forbid the equivalent (DU-STORE, DU-LOAD) instruction pair. Last, schedule linearization emits the issue groups in memory order. As a result, the LAO instruction scheduler can exploit the ST120 zero-latency dependences by putting dependent operations into the same issue group, and be assured that schedule linearization will not violate these dependences.
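The resource encoding itself is machine specific; the sketch below only captures the invariant it guarantees, namely that inside an issue group every zero-latency dependence goes from an earlier slot to a later slot, so that emitting the group in memory order preserves it. The function is an illustration, not the LAO machine model.

```python
def issue_group_is_legal(group, zero_latency_deps):
    """group: operation names in slot (memory) order.
    zero_latency_deps: (producer, consumer) pairs that are only honoured
    with zero latency inside an aligned pair or SLIW bundle.
    The group is legal when every such pair appears producer first."""
    slot = {op: i for i, op in enumerate(group)}
    return all(slot[p] < slot[c]
               for (p, c) in zero_latency_deps
               if p in slot and c in slot)

# A DU load feeding a DU store may share an aligned pair in that order,
# but the reverse memory order would break the zero-latency dependence.
deps = {("DU-LOAD", "DU-STORE")}
assert issue_group_is_legal(["DU-LOAD", "DU-STORE"], deps)
assert not issue_group_is_legal(["DU-STORE", "DU-LOAD"], deps)
```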
The LAO software pipeliner is based on the Cray T3E modulo scheduling software pipeliner, which uses the Insertion Scheduling heuristic [9], the MinLife lifetime-sensitive scheduling framework [10], and a generalized software pipeline construction scheme that uniformly applies to the FOR and WHILE inner superblock loops [11]. In the LAO, software pipelining is only applied to the inner hardware loops, since otherwise the loop branching overhead dominates. Because the ST120 architecture does not include rotating register files, the software pipeliner relies on modulo expansion [26] to remove WAR and WAW register dependences.

The other alternative to modulo expansion, explicit register-to-register moves that keep the register lifetimes shorter than the loop iteration interval, does not apply to the long register RAW latencies that result from moving data or guard values from the DU to the AU/GU. Fortunately, such long latencies mostly appear in scalar code, as the digital signal processing loops have simple addressing patterns that can be supported by the AU alone.

A first concern with modulo expansion is the code size increase implied by kernel renaming and the epilog blocks. Another problem is that modulo expansion implies kernel unrolling, which naturally yields a software pipelined loop whose body contains multiple exits. However, multiple exit loops defeat the purpose of the ST120 counted hardware loops, where there is a single implicit exit at the end of the loop body. On such counted loops, the alternative is not to insert the early exits, and to create a remainder loop. Therefore, the LAO software pipeliner disables modulo renaming by default, unless explicitly overridden. Thanks to the decoupled implementation of the ST120 and its zero-latency DU loads, such disabling does not significantly impact the performance of the digital signal processing loops.

Finally, the LAO instruction scheduler is in charge of the branch shadow exploitation. To do so, we first apply a simple localized form of predication, where the non-predicated operations that must follow the branch are predicated with the same guard as the branch. For these operations, the control dependence from the branch is replaced by a data dependence from the GMI that produces this guard. Then, during instruction scheduling, a priority function on the candidate issue slots penalizes scheduling inside the branch shadow for the operations that would stall branch shadow execution.
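A sketch of this localized predication is given below on a minimal IR of our own; the guard syntax follows the g0?/g0! convention of §2.3, and the dependence representation is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Oper:
    name: str
    guard: Optional[str] = None          # e.g. "g0?" (true) or "g0!" (false)

@dataclass
class Dep:
    src: Oper
    dst: Oper
    kind: str                            # "control" or "data"

def predicate_branch_shadow(branch: Oper, gmi: Oper,
                            followers: List[Oper], deps: List[Dep]) -> List[Dep]:
    """The ST120 takes a conditional branch on guard false (g0!), so the
    unguarded operations kept behind it are guarded on the same guard with
    true polarity (g0?), and their control dependence on the branch is
    replaced by a data dependence on the GMI defining that guard."""
    reg = branch.guard.rstrip("!?")      # e.g. "g0"
    out = []
    for d in deps:
        if (d.kind == "control" and d.src is branch
                and d.dst in followers and d.dst.guard is None):
            d.dst.guard = reg + "?"      # executes while the branch resolves
            out.append(Dep(gmi, d.dst, "data"))
        else:
            out.append(d)
    return out
```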

3.5 The LAO Register Allocator

The LAO register allocator is designed (1) to effectively allocate registers, (2) to reduce the number of MOVE operations generated by the SSA translation at the control-flow and predicate merge points (§3.2), and (3) to efficiently handle spilling in user predicated code while not having full information on the predicate relationships. The LAO register allocator is effective thanks to the implementation of the graph coloring register allocation technique [6], and the provision of a Predicate Query System (PQS) based on the Predicate Partition Graph (PPG) [25]. The LAO register allocator includes optimistic register coloring [3], rematerialization of constants [4], iterated register coalescing [19], interference graph splitting by resource type (data, address, and control registers on the ST120), and implements the local spill optimizations of [7].

To reduce the number of MOVE operations, the LAO register allocator implements repeated coalescing, which introduces two improvements over the iterated register coalescing of George & Appel [19]: it incrementally updates the interference graph during coalescing, and it extends the iterated coalescing allocator automaton in order to take advantage of the more accurate interference information. During graph coloring register allocation, a register-to-register MOVE operation can be removed only if its source and destination live ranges do not interfere, i.e. there is no edge in the interference graph connecting these two live ranges.

(a)
    a1 := def
    a2 := a1 + 3
    if (!cond) goto L2
L1:
    a3 := a2 + b
    a1 := a3
    a2 := a3
    if (cond) goto L1
L2:
    return a1

(b)
    a1 := def
    a2 := a1 + 3
    if (!cond) goto L2
L1:
    a2 := a2 + b
    a1 := a2
    if (cond) goto L1
L2:
    return a1

Figure 6: Live range interferences.

In such cases, existing graph coloring register allocators coalesce the two live ranges into one, and discard the MOVE operation [7, 4, 19, 31]. However, while doing so, the interference information is not completely updated. For illustration, consider the code in figure 6. In (a), the live ranges of variables a1 and a2 interfere. After coalescing of the live ranges a2 and a3 in (b), this interference no longer exists. In the original interference graph however, the interference edge between these live ranges remains, thus preventing the coalescing of a1 and a2. The LAO coalescing phase uses a directed graph, and labels the interference arcs with an interference count. This allows the incremental update of the interference graph after each coalescing.

To take advantage of the more accurate interference information, the LAO implementation extends the iterated coalescing allocator automaton of [19]. In [19], Briggs's conservative coalescing test is applied to each MOVE operation in turn. As a result, each MOVE falls into one of two states: it is either coalesced, effectively merging the two live ranges, or, in case the two live ranges interfere, it is frozen and the live ranges are kept separate. Our algorithm adds another state to the iterated coalescing allocator automaton, such that when the live ranges interfere, the MOVE operation is kept in the coalescing list instead of being frozen. If another MOVE operation is coalesced, the coalescing phase repeatedly tries to recoalesce previously rejected MOVE operations, until no more coalescing is possible. Such repeated coalescing increases the opportunities for merging variable live ranges for two reasons: first, it reduces the side effects of the operation ordering in the coalescing process; second, upon coalescing two live ranges the interference graph is updated, thus creating more opportunities for the subsequent conservative coalescing tests.

To efficiently handle the generation of spill code, the LAO register allocator implements several advanced heuristics. As has already been shown in [20], special care must be taken to ensure the convergence of graph-based register allocation in the presence of predicated code. In order to handle this problem, the LAO register allocator inserts KILL pseudo-operations every time a predicated live range is spilled. This effectively limits the interference with the new temporary variables generated by the spill code. Finally, the LAO register allocator locally optimizes the spill code (RELOAD) placement when multiple uses of a variable are candidates for RELOAD optimization [7], but are predicated with different predicates. The LAO register allocator performs predicate promotion of the RELOAD operations. As a result, a single RELOAD is speculatively executed under the TRUE predicate for all the candidates. This type of speculation is always beneficial due to the implementation of the predicated LOADs on the ST120.
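The driver loop of repeated coalescing can be summarized as follows; try_coalesce stands for the conservative test plus the incremental interference-graph update described above, and the sketch is ours rather than the LAO implementation.

```python
def repeated_coalescing(moves, try_coalesce):
    """Keep rejected MOVE operations in the worklist instead of freezing
    them: every successful coalescing updates the interference graph, so
    previously rejected MOVEs are retried until a fixed point is reached.
    try_coalesce(move) returns True iff the move was coalesced."""
    worklist = list(moves)
    progress = True
    while progress:
        progress = False
        still_rejected = []
        for move in worklist:
            if try_coalesce(move):
                progress = True          # graph changed: retry the others
            else:
                still_rejected.append(move)
        worklist = still_rejected
    return worklist                      # moves left for the allocator
```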

Figure 7: LAO speedups on the LCC code.

Figure 8: LAO speedups on the CCC code.

4. EXPERIMENTAL RESULTS

4.1 LAO on C Benchmark Codes

In this section, we present the results of experiments where the LAO is applied on a C benchmark that includes: some basic digital signal processing kernels; integer Discrete Cosine Transform (DCT) variants; searching, sorting, and string searching algorithms; and some predication-intensive codes like the SGML attribute white space normalization (strtrim.1), and the UNIX utility wc (strwc.1). Effective exploitation of the ST120 architecture by the LAO is first demonstrated on a retargeted version of the LCC compiler [18]. In figure 7, the GP32 assembly code produced by LCC is compared to the LCC output using virtual registers and further optimized by the LAO. The best speedups obtained are above 6, while the geometric mean speedup is 2.45. These large speedups are explained in part by the hardware loop mapping and the IF-conversion, as these are not performed by the LCC GP32 code generator. We then experimented with the best ST120 Commercial C Compiler (CCC) available. In figure 8, we compare CCC with all its ST120 specific optimizations enabled, to CCC without the ST120 specific optimizations but further improved by the LAO. The CCC ST120 specific optimizations include IF-conversion, hardware loop mapping, and generation of auto-modification addressing modes.

Figure 9: LAO expansions on the CCC code.

Figure 11: LAO IF-conversion (LIC+NIC) compared to No IF-conversion (NIC).

Figure 10: LAO speedups using the SLIW mode.

Figure 12: LAO IF-conversion (LIC+SIC) compared to Simple IF-conversion (SIC).

Here again, the speedups obtained are significant, with a geometric mean of 1.28. The LAO effect on code size is a geometric mean increase of 1.05 (figure 9). This code size expansion remains reasonable, given the performance improvement. We also measured the efficiency of the LAO in extracting the SLIW instruction-level parallelism. In SLIW mode, up to four GP32 instructions may execute every cycle, compared to the two-way superscalar GP32 execution. In figure 10, we compare the performance of the CCC improved by the LAO with an option that maps inner loops to the SLIW mode, to the same CCC improved by the LAO generating GP32 code by default. The geometric mean speedup of SLIW on our benchmarks is 1.11, with 20% – 57% speedups on DSP loops, and small performance variations otherwise. The slowdowns can be explained by the more constraining SLIW slotting rules and the GP32/SLIW mode switching overhead. Further performance improvements of the SLIW mode are expected when the LAO implements memory access pairing to reduce memory bank conflicts [38].

4.2 Effects of the LAO Optimizations

In this section, we present experimental figures that focus on the two most specific features of the LAO implementation. The first feature is the LAO IF-conversion, which includes a specific region selection scheme and a modification of Fang's predication algorithm (§3.3). The second feature is repeated register coalescing, our improvement over the iterated register coalescing of George & Appel (§3.5). The effectiveness of the LAO IF-conversion (LIC) is demonstrated by applying it to two kinds of input: (1) LCC-generated, non-IF-converted code (NIC), and (2) CCC-generated code, where a simple structure-based IF-conversion (SIC) of IF-ELSE statements is performed. The SIC cases are especially interesting, as they stress the capability of the LAO to optimize already predicated code supplied as input.

Figure 11 displays the speedups obtained when LIC applied to NIC is compared to NIC alone, using a set of kernels where our region selection heuristic applies. The LAO IF-conversion (LIC) achieves a geometric mean speedup of 1.51, with a maximum of 3.31 for strwc.1. These results clearly demonstrate the importance of the IF-conversion to successfully exploit the ST120 instruction-level parallelism. Figure 12 displays the speedups obtained when LIC applied to SIC is compared to SIC alone. Here, the LAO IF-conversion achieves a geometric mean speedup of 1.72, with a maximum of 2.33 for strwc.1. This demonstrates that the LAO IF-conversion of regions significantly improves performance over the simple IF-ELSE conversion, and that the management of already predicated code by the LAO is very effective. Due to the conservative nature of the LAO IF-conversion region selection heuristic, none of the kernels are degraded by the optimization. Although we believe that most of the opportunities for profitable IF-conversion are exploited by the current region selection scheme, we still have to compare it to profile-guided region selection. However, the code expansion induced by a generalized region selection, which implies tail duplication, may limit the overall benefit.

                      DU MOVEs   AU MOVEs   All MOVEs   All SPILLs
Iterated Coalescing   36183      23109      59292       21546
Repeated Coalescing   29659      21074      50733       21508
Relative Variation    -18.0%     -8.8%      -14.4%      -0.2%

Table 1: Repeated Coalescing compared to Iterated Coalescing on dynamic instruction counts.

Figure 14: LAO expansions on application codes.

Figure 13: LAO speedups on application codes.

The behavior of the repeated coalescing allocator was measured on two major DSP applications, the efr 5.1.0 and the amr 2.0.0 vocoders from the ETSI [13]. In table 1, column "DU MOVEs" counts the dynamic number of DU MOVE instructions, column "AU MOVEs" counts the dynamic number of AU MOVE instructions, column "All MOVEs" sums these columns, and column "All SPILLs" counts the dynamic number of spill instructions. Repeated coalescing coalesces significantly more MOVE instructions than the iterated coalescing method, as 14.4% fewer dynamic MOVEs are executed. At the same time, the spill code instruction count remains stable, with a 0.2% decrease. Thus, the conservative coalescing test performed in the graph-based register allocation framework is still effective, as no additional spill code is introduced by the new LAO coalescing scheme.

4.3 LAO on Application Codes

The ST120 core is primarily targeted at portable and consumer applications, such as digital cellular phones, ADSL modems, hard disk drive control, etc. In this section, we focus on the LAO results on the following applications:

cxmodem Software modem control code.
servo Hard disk drive digital control loop.
efr 5.1.0 ETSI Enhanced Full Rate (EFR) vocoder [13].
amr 2.0.0 ETSI Adaptive Multi Rate (AMR) vocoder [13].
itu G723 ITU Dual Rate speech coder [24].

No modifications were applied to the source code of the ETSI and ITU reference applications, except for a change in an include file to remap the basic ETSI DSP operators to the ST120 DSP intrinsic functions. Figure 13 compares CCC under maximum optimization, to CCC improved by the LAO (CCC+LAO), and to CCC improved by the LAO with automatic SLIW mapping of the inner loops (CCC+LAO.SLIW). The best speedups are obtained on the amr 2.0.0: 1.52 for CCC+LAO, and on the itu G723: 1.77 for CCC+LAO.SLIW.

Note that cxmodem and servo are not affected by the LAO SLIW mapping scheme, thus giving no improvement over GP32. The code size (figure 14) of all the CCC+LAO cases shows little variation. On the other hand, code size expansions are in the 20% – 24% range with inner loop SLIW mapping (CCC+LAO.SLIW). Such expansions are acceptable, given the performance improvements. We also expect inner loop SLIW mapping to be less expensive in code size once we implement some kind of profiling feedback in the LAO.

5. SUMMARY AND CONCLUSIONS

The ST120 core implements access-execute decoupling, an innovative predication model, and multiple instruction modes, including the SLIW mode, an improved VLIW execution where four instructions issue every cycle. These features enable high performance, low power consumption, and compact code size on digital signal processing applications. Because of these features, however, the ST120 core presents unique challenges for compiler code generation. In this paper, we introduce the ST120 Linear Assembly Optimizer (LAO), a tool designed to meet these challenges, and used to fill the need for automatic optimization of the ST120 assembly code. We present solutions to several compiler optimization problems: the SSA translation and the IF-conversion in the presence of already predicated code; instruction scheduling for a decoupled implementation; and repeated register coalescing, our improvements to the graph-based coalescing register allocator. The 1.35 – 1.52 (GP32) and 1.57 – 1.77 (SLIW) speedups obtained on the industry-standard speech coders, when optimizing the assembly code output of the leading ST120 industry C compiler, demonstrate the efficiency of the LAO technology. We expect further improvements once these compilers are enhanced to emit Linear Assembly Input (LAI) code, instead of the basic ST120 assembly code.

6. REFERENCES

[1] Advanced RISC Machines: An Introduction to Thumb. ARM DVI-0001A, 1995.
[2] D. August, J. Sias, J. Puiatti, S. Mahlke, D. Connors, K. Crozier, W. Hwu: The Program Decision Logic Approach to Predicated Execution. 26th Annual International Symposium on Computer Architecture – ISCA'99, May 1999.
[3] P. Briggs, K. D. Cooper, K. Kennedy, L. Torczon: Coloring Heuristics for Register Allocation. SIGPLAN'89 Conference on Programming Language Design and Implementation – PLDI'89, July 1989.
[4] P. Briggs, K. D. Cooper, L. Torczon: Improvements to Graph Coloring Register Allocation. ACM Transactions on Programming Languages and Systems 16, 3, May 1994.
[5] P. Briggs, K. D. Cooper, T. J. Harvey, L. T. Simpson: Practical Improvements to the Construction and Destruction of Static Single Assignment Form. Software Practice and Experience 28, 8, July 1998.
[6] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, P. W. Markstein: Register Allocation via Coloring. Computer Languages 6, 47-57, 1981.
[7] G. J. Chaitin: Register Allocation and Spilling via Graph Coloring. SIGPLAN'82 Symposium on Compiler Construction, 1982.
[8] B. Dupont de Dinechin: Simplex Scheduling: More than Lifetime-Sensitive Instruction Scheduling. 1994 International Conference on Parallel Architecture and Compiler Techniques – PACT'94, 1994.
[9] B. Dupont de Dinechin: Insertion Scheduling: An Alternative to List Scheduling for Modulo Schedulers. 8th International Workshop on Languages and Compilers for Parallel Computing – LCPC'95, LNCS #1033, Columbus, Ohio, Aug. 1995.
[10] B. Dupont de Dinechin: Parametric Computation of Margins and of Minimum Cumulative Register Lifetime Dates. 9th International Workshop on Languages and Compilers for Parallel Computing – LCPC'96, LNCS #1239, San Jose, California, Aug. 1996.
[11] B. Dupont de Dinechin: A Unified Software Pipeline Construction Scheme for Modulo Scheduled Loops. PaCT'97 – 4th International Conference on Parallel Computing Technologies, LNCS #1277, Yaroslavl, Russia, Sep. 1997.
[12] B. Dupont de Dinechin, C. Monat, P. Blouet, C. Bertin: DSP-MCU Processor Optimization for Portable Applications. Elsevier Microelectronic Engineering Journal, MIGAS-2000 International Summer School on Advanced Microelectronics, Special Issue, 2001.
[13] European Telecommunications Standards Institute – ETSI: GSM Technical Activity, SMG11 (Speech) Working Group, http://www.etsi.org.
[14] A. E. Eichenberger, E. S. Davidson: Efficient Formulation for Optimal Modulo Schedulers. 1997 SIGPLAN Conference on Programming Language Design and Implementation – PLDI, 1997.
[15] J. Z. Fang: Compiler Algorithms on If-Conversion, Speculative Predicates Assignment and Predicated Code Optimizations. 9th International Workshop on Languages and Compilers for Parallel Computing – LCPC'96, LNCS #1239, Aug. 1996.
[16] P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, F. Homewood: Lx: a Technology Platform for Customizable VLIW Embedded Processing. 27th Annual International Symposium on Computer Architecture – ISCA'00, June 2000.
[17] J. Ferrante, B. K. Rosen, M. N. Wegman, F. K. Zadeck: An Efficient Method for Computing Static Single Assignment Form. SIGPLAN'89 Conference on Principles of Programming Languages (POPL), 1989.
[18] C. Fraser, D. R. Hanson: A Retargetable C Compiler: Design and Implementation. Addison-Wesley, 1995.
[19] L. George, A. W. Appel: Iterated Register Coalescing. ACM Transactions on Programming Languages and Systems 18, 3, May 1996.
[20] D. M. Gillies, D. R. Ju, R. Johnson, M. Schlansker: Global Predicate Analysis and its Application to Register Allocation. 29th International Symposium on Microarchitecture – MICRO-29, Dec. 1996.
[21] R. E. Hank, S. A. Mahlke, R. A. Bringmann, J. C. Gyllenhaal, W. W. Hwu: Superblock Formation Using Static Program Analysis. 26th International Symposium on Microarchitecture – MICRO-26, 1993.
[22] Peter Yan-Tek Hsu: Design of the R8000 Microprocessor. IEEE Micro, 1993.
[23] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, D. M. Lavery: The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7, 1993.
[24] International Telecommunication Union – ITU: DSP Group, http://www.itu.int.
[25] R. Johnson, M. Schlansker: Analysis Techniques for Predicated Code. 29th Annual International Symposium on Microarchitecture – MICRO-29, Dec. 1996.
[26] M. Lam: Software Pipelining: An Effective Scheduling Technique for VLIW Machines. SIGPLAN'88 Conference on Programming Language Design and Implementation – PLDI, 1988.
[27] A. L. Leung, L. George: Static Single Assignment Form for Machine Code. ACM SIGPLAN'99 Conference on Programming Language Design and Implementation – PLDI, 1999.
[28] S. Manne, D. Grunwald, A. Klauser: Pipeline Gating: Speculation Control for Energy Reduction. 25th International Symposium on Computer Architecture – ISCA'98, July 1998.
[29] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, R. A. Bringmann: Effective Compiler Support for Predicated Execution Using the Hyperblock. 25th International Symposium on Microarchitecture – MICRO-25, 1992.
[30] Q. Ning, G. R. Gao: A Novel Framework of Register Allocation for Software Pipelining. SIGPLAN'93 Symposium on Principles of Programming Languages, Jan. 1993.
[31] J. Park, S. M. Moon: Optimistic Register Coalescing. 1998 International Conference on Parallel Architecture and Compiler Techniques – PACT'98, Oct. 1998.
[32] J. Park, M. Schlansker: On Predicated Execution. Technical Report HPL-91-58, Hewlett-Packard Software and Systems Laboratory, May 1991.
[33] B. R. Rau, C. D. Glaeser: Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. 14th Annual Microprogramming Workshop, Oct. 1981.
[34] STMicroelectronics: ST120 DSP-MCU CORE Reference Guide. http://us.st.com/stonline/.
[35] M. S. Schlansker, B. R. Rau: EPIC: An Architecture for Instruction-Level Parallel Processors. HP Labs Technical Report HPL-1999-111, http://www.hpl.hp.com/techreports/, 1999.
[36] J. E. Smith: Decoupled Access/Execute Computer Architecture. ACM Transactions on Computer Systems, 2, 4, Nov. 1984.
[37] V. C. Sreedhar, G. R. Gao, Y.-F. Lee: Identifying Loops Using DJ Graphs. ACM Transactions on Programming Languages and Systems 18, 6, Nov. 1996.
[38] A. Stoutchinin: An Integer Linear Programming Model of Software Pipelining for the MIPS R8000 Processor. PaCT'97 – 4th International Conference on Parallel Computing Technologies, LNCS #1277, Yaroslavl, Russia, Sep. 1997.
[39] M. Wolfe: Beyond Induction Variables. SIGPLAN'92 Conference on Programming Language Design and Implementation, 1992.
