Software Pipelining on the Cray MPP: History, Status and Perspectives

Benoît Dupont de Dinechin

December 10, 1997

Abstract

This document describes the current status of the software pipeliner I am developing for the Cray MPP machines, its features, and the amount of work that remains to be done. For the sake of completeness, I also include a short overview of the software pipelining technique itself, and a history of the previous software pipelining experiments I have conducted while at Cray.

1 About Software Pipelining

What is software pipelining Software pipelining can be thought of as an extension of bottom loading. The bottom loading technique creates a new loop body where the LOADs fetch data for the next iteration, instead of the current one (see the sketch below). Software pipelining creates a new loop body, called a kernel, which contains instructions belonging not only to the current iteration i, but also to iterations i+1, i+2, i+3, ... of the former loop. In addition, these displaced instructions are not restricted to being LOADs. A (simplistic) survey of software pipelining techniques is available in [2].

What pieces of code does software pipelining apply to Current software pipelining techniques apply to innermost DO-loops with no early exits. Given the ability to execute some instructions speculatively, such as LOADs and register-register arithmetic operations, current techniques would apply to any loop whose body is a single basic block. Because dependency cycles in a loop prevent good overlap between iterations, it is expected that maximum benefits come from software pipelining when it is applied to vector loops.
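To make the bottom loading transformation concrete, here is a minimal source-level sketch in C. It is purely illustrative (the function names are hypothetical, and the actual transformation is applied to Alpha assembly instructions inside the back-end, not to source code): the LOAD for iteration i+1 is issued at the bottom of the body, so its latency is hidden behind the store of iteration i.

    /* Copy loop, before and after bottom loading (assumes n >= 1). */
    void copy_plain(double *y, const double *x, int n) {
        for (int i = 0; i < n; i++)
            y[i] = x[i];
    }

    void copy_bottom_loaded(double *y, const double *x, int n) {
        double t = x[0];                 /* prologue: load for iteration 0  */
        for (int i = 0; i < n - 1; i++) {
            y[i] = t;                    /* store for iteration i           */
            t = x[i + 1];                /* load for iteration i+1; its     */
        }                                /* latency overlaps the store      */
        y[n - 1] = t;                    /* epilogue: final store           */
    }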

How important is software pipelining In the SGI / MIPSpro compiler, software pipelining is the main back-end optimization. When turned on, it boosts performance by 35% averaged over the whole SPEC FP 92 test suite (which means that the inner loops themselves are sped up by a large degree). However, to reach such levels of performance increase, a full stack of software pipeliner related optimizations must be turned on. Unfortunately, these optimizations are not very useful without a software pipeliner, so they are not available in compilers not designed with software pipelining in mind.

Where does the software pipeliner run in a compiler Software pipelining is basically an instruction scheduling technique. As such, the best place to run this technique is in the compiler back-end, before register assignment. The pipeliner replaces bottom loading, if present, and also the block scheduling of the loop body which would normally occur. After register assignment, if spill code has been inserted in the new loop body created by the pipeliner, a block scheduling phase is likely to be beneficial. Figure 1 illustrates more precisely where the software pipeliner is inserted in the Cray MPP compiler back-end.

What kind of specific optimizations does a software pipeliner need According to John Ruttenberg, the single most important optimization is IF-conversion, which creates single basic block loops from loops with conditionals inside. Right after comes inter-iteration redundant STORE-LOAD elimination, and then “recurrence cycle squashing” techniques, which apply whenever a recurrence cycle happens to compute a sum or a product. Currently, redundant STORE-LOAD elimination across iterations is available in the Cray MPP compilers, while I plan to implement recurrence cycle squashing late in 1996.

What kind of information does a software pipeliner need The best software pipeliner in the world cannot achieve better overlap between iterations than its dependency graph (called the scheduling graph) allows. So computing accurate scheduling graphs is the main requirement before a software pipeliner can increase performance. To summarize, dependencies come from: a) control dependencies, which can be removed in the case of DO-loops with no early exits, or if instructions can be speculatively executed; b) memory dependencies, which only the middle-end knows enough about; c) register dependencies, which can be removed with register renaming techniques and modulo expansion, and also by knowing which registers are live upon loop exit. (A sketch of a scheduling-graph arc follows.)
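As a rough illustration of what a scheduling graph records, here is a generic sketch of the arc data commonly used in modulo scheduling practice (it is not the Cray data structure): each dependency becomes an arc carrying a latency and an iteration distance, and the arc kind determines which of the removal techniques listed above applies.

    /* Generic sketch of a scheduling-graph arc for modulo scheduling.
     * A dependence from instruction `from` to instruction `to`, with
     * latency `lat` and iteration distance `omega`, constrains any
     * schedule sigma built with initiation interval II by:
     *     sigma(to) - sigma(from) >= lat - II * omega
     */
    enum dep_kind { DEP_CONTROL, DEP_MEMORY, DEP_REGISTER };

    struct sched_arc {
        int from, to;         /* loop body instruction numbers            */
        int lat;              /* latency in clock periods                 */
        int omega;            /* iteration distance (0 = same iteration,  */
                              /* 1 = carried to the next iteration, ...)  */
        enum dep_kind kind;   /* decides which removal techniques apply   */
    };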

[Figure 1 is a flowchart of the back-end phases: Code Generation; Global Register Renaming (build the control tree, perform register flow analysis, single-assign pseudo-registers); Branch Optimization I; Peephole Optimization I (coalesce pseudo-registers, remove no-op instructions, optimize type conversions); Prepass Block Scheduling; the Software Pipeliner; Register Assignment (build the control tree, perform register-flow analysis, color pseudo-registers and spill); Branch Optimization II; Peephole Optimization II; Postpass Block Scheduling; Object Code Emission.]

Figure 1: Insertion of a software pipeliner in a typical back-end.

How much work does building a software pipeliner represent In John Ruttenberg’s PLDI’96 paper [20], it is mentioned that the MIPSpro software pipeliner is 33,000 lines long. It took 4 years to develop and tune this pipeliner, and the current SGI implementation still includes code from the Cydrome software pipeliner. Of these 33,000 lines, 6,000 are devoted to the modulo renaming techniques, and to the generation of software pipeline fill and drain code. Unfortunately, the corresponding algorithms are not described anywhere in the literature, even though they “account for a large part of implementing a working pipeliner”.

What are the recent developments in the field of software pipelining Although the modulo scheduling technique was described by Bob Rau in 1981 [17], and the modulo expansion technique by Monica Lam in 1987 [13], it is only since 1994 that papers describing working software pipeliners have made it clear that simple modulo scheduling techniques are not effective in practice [18, 12]. In fact, it seems that a backtracking capability of the scheduling engine is required in order to build good schedules. Simple and fast heuristics get trapped in “thrashing” behavior whenever they have to schedule instructions which belong to a large strongly connected component of the scheduling graph, even if the related recurrence constraints do not really constrain the schedule. It is interesting to note that the first industrial implementation of the modulo scheduling technique used backtracking [21], but at that time it was believed that backtracking was only required because of the irregular architecture of the FPS-164.

What are the current research topics in the field of software pipelining Computing optimal schedules, at the expense of very high compile times (more than 3 days for the SPEC FP 92 test suite), is a recent development which can be credited mainly to McGill University [20]. For the first time, it has been possible to evaluate how good all the software pipelining heuristics published so far really are. There are still no satisfactory solutions for software pipelining combined with register assignment, nor for software pipelining of conditional loops without IF-conversion. And no one knows how to schedule well for very dynamic machines, although there is no question that software pipelining is still useful there, albeit less than in the case of static machines. (A sketch of the initiation interval lower bound from which every modulo scheduler starts its search follows.)
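For reference, all the modulo schedulers discussed above search for an initiation interval starting from the same standard lower bound, MII = max(ResMII, RecMII). Below is a minimal sketch of the resource part of that bound; this is generic textbook material, not code from any of the pipeliners cited.

    /* ResMII: for each resource class, the number of times the loop body
     * uses the resource, divided by the number of available units,
     * rounded up.  The full bound is MII = max(ResMII, RecMII), where
     * RecMII maximizes ceil(latency_sum / distance_sum) over the cycles
     * of the scheduling graph. */
    int res_mii(const int *uses, const int *units, int nclasses) {
        int mii = 1;
        for (int c = 0; c < nclasses; c++) {
            int bound = (uses[c] + units[c] - 1) / units[c];  /* ceiling */
            if (bound > mii)
                mii = bound;
        }
        return mii;
    }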

2 Software Pipelining at Cray

I started software pipelining experiments four years ago, when I was sent by CEA to the Cray Research Park to follow the development of the Cray T3D. At that time (1992), it was believed that software pipelining was really no more difficult to implement than a block scheduler. Four years later, the software pipeliner is still not part of a released compiler, although this might happen Real Soon Now. To summarize how this time was spent:

1992 I spent 5 months at the Cray Research Park, designing an accurate processor modelization of the DEC Alpha 21064 (EV-4), and enhancing the DEC public-domain Alpha instruction scheduler so it would deal with pseudo-registers. I also selected the modulo scheduling technique discovered by Bob Rau [18] as the basis for the implementation of a forthcoming software pipeliner. While modulo scheduling is now widely recognized as the mainstream software pipelining technique, back then the choice was not so clear. The main competitors of modulo scheduling at the time were the “unroll and find pattern” techniques pioneered by Aiken [1], and global compaction methods such as Percolation Scheduling by Nicolau.

1993 During a 6-month stay at the Cray Research Park, I implemented modulo scheduling for the DEC Alpha 21064, within the “Simplex Scheduling” framework I formulated on that occasion [6, 7]. Because of the use of a simplex solver, the pipeliner ended up with very high running times (averaging O(mn^3), where n is the number of instructions and m is the number of arcs in the scheduling graph). Moreover, since I had access neither to accurate memory dependency information, nor to register lifetime information, from the Cray compiler, software pipelining hardly improved performance unless I turned on “unsafe” flags. These flags made the pipeliner ignore the most constraining dependencies in the scheduling graph, but because they had such a blanket effect, the resulting pipelined code was incorrect most of the time. In short, this first pipelining experiment was inconclusive.

1994 While in France, I continued developing the pipeliner at a reduced pace, due to my other duties at CEA, and because the telnet connections between France and the Cray Research Park were slow. Nevertheless, I reimplemented most of the software pipeliner, mainly to reduce its running time. In the process, I wrote a sparse simplex solver from scratch, and as a result the pipeliner running times went down to O(mn^2). The pipeliner was still slow, and I was experiencing heavy thrashing of the scheduling engine, because the scheduling graphs I was getting from the MPP compilers looked like single big strongly connected regions. A study of this phenomenon led me to the formulation of the “Insertion Scheduling” technique [8], which would avoid thrashing and thus enable good software pipelines to be built in O(mn) time.

This Insertion Scheduling technique was implemented during the 5 weeks I spent at the Cray Research Park in September 1994.

1995 After I presented Simplex Scheduling at the PACT’94 conference in Montreal in August 1994 [6], Prof. Gao, chairman of the conference and leader of the ACAPS laboratory at McGill University, offered me a position on his team. Meanwhile, the automatic parallelization project I was in charge of at CEA, started in 1992, was scheduled to end in late 1995. So I decided to apply to a scientist exchange program, in order to get funding for a 1-year stay at McGill University. This would give me the opportunity to work full time on the software pipeliner, as a member of a research group that was widely recognized in the field. After I was granted the 1-year support I needed, thanks to CEA, I terminated my other projects and joined Prof. Gao’s team at McGill University. By that time, Insertion Scheduling had demonstrated its excellent ability to build good software pipelines quickly, even though it still dragged Simplex Scheduling along [9]. In view of these results, Tom MacDonald decided that it was time to plan a final evaluation of software pipelining techniques, which, if successful, would eventually be integrated into Cray production compilers.

1996 While at McGill University, I started looking for a way to retain Simplex Scheduling’s unique abilities, namely building schedules parametrically in the software pipeline initiation interval, and minimizing cumulative register lifetimes, without having to maintain a complex and slow lexico-parametric simplex solver. After I discovered a fault in the Ning–Gao POPL’93 paper [16], acknowledged by Ning [15], I unified both Simplex Scheduling and the Ning–Gao technique (for scheduling while minimizing register lifetimes) under a new, simple, and very efficient network flow formulation based on the scheduling graph. The paper describing these results was published at the LCPC’96 conference in August 1996 [10]. Solving this network flow formulation is currently so fast that the slowest part of the software pipeliner is now calling the middle-end memory dependency query functions [10]. In the second half of 1996, I came to the Cray Research Park four times to implement from scratch a new software pipeliner in C (the two previous software pipeliners were written in C++).

A decision with far-reaching consequences, made back in 1992, was to develop the software pipeliner in C++. While this language is widely advertised to increase programmer productivity by a large factor, I had a hard time carrying my developments across machines and compilers.

Even though in France I could use the excellent xlC compiler for the IBM RS6K, at Cray I had to deal with the cfront-based C++ 1.x compilers. These compilers did not follow the emerging C++ Draft Standard closely enough, so I had to spend large amounts of time redesigning my algorithms and libraries in order to work around the “sorry, not implemented” cfront messages. Also, because the main program in the MPP compilers is not C++ code, I had to collect the names of the static object constructors and destructors in the .o files in order to call them myself, a process I had to rework from one C++ compiler release to the next. However, I believed there was no way I could develop the whole thing in C, because I had to design and implement sophisticated algorithms such as a lexico-parametric simplex solver.

Then I started my work at McGill, using an SGI workstation for software development. In the meantime, the Cray C++ compiler was upgraded to version 2.0, which features the Edison C++ front-end. Fortunately, the SGI C++ compiler also uses the Edison C++ front-end, which closely tracks the standard and is thus highly compatible with the IBM xlC compiler. As a result, the difficulties related to porting my C++ codes to Cray PVP machines vanished. The problem that C++ codes making heavy use of templates are one order of magnitude slower to compile than C code remained, though.

During the first months at McGill, I finalized three tools aimed at producing clean, documented, and maintainable C++ code. The first tool, called MyLib, is actually a C++ template library of container types. The second is a literate programming tool called MyWeb, which I implemented (in C++) after I exhausted the capabilities of the available public-domain literate programming tools. The third tool of my environment is a source code generator called MyGen, which translates an Object Description Language into C++ or C source code. Thanks to the upgrade of the Cray C++ compilers, I could at last benefit from these tools while developing on Cray PVP machines.

As I started writing the current version of the software pipeliner, I realized that I could develop it 100% in C, whereas my initial intent was to write only the glue code to the MPP back-end in C, and the rest in C++. Reverting to C only became possible after: a) the discovery of the network flow formulation of lifetime-sensitive scheduling, which eliminates the need to implement a lexico-parametric simplex solver; b) I extended the MyGen tool to generate even more C code; c) I translated some useful MyLib containers, such as Stack, Sequence, Heap, Mapping, and Graph, back to corresponding C packages, thus making it easier to develop in C while retaining a strong data-abstraction programming style.

3 Work performed at the Cray Park in 1996 – 1997

3.1 April 1st – May 5th 1996

The purpose of this first stay was to rewrite all the software pipeliner code which depended on the compiler or on the target architecture. The idea was to resolve these dependencies once and for all, in order to be able to continue other software pipelining developments remotely from McGill. However, thanks to the network flow formulation and two other new techniques described below, I ended up with a first working version of the software pipeliner by the time I left the Cray Park. To summarize, the work performed during this first stay consisted of:

• Port of the MyLib, MyWeb, and MyGen software engineering tools to PVP machines, and extensions of MyGen to generate more C code.

• Development of an upper interface with the MPP back-end data structures, including access to accurate memory dependency information, and to register liveness information on entry to the pipelinable loop exit blocks. This critical information was incomplete or missing during the previous software pipelining experiments. By this time it had become accurate enough to expose the potential of software pipelining.

• Redesign of the processor modelization, in order to make it simpler, and targeted to the DEC Alpha 21164 (EV-5). A processor modelization involves describing the processor architectural registers, the instruction set, the number and types of functional units in the processor, and the latencies associated with the different types of register dependencies.

• Construction of the scheduling graph from the loop body instruction sequence, and debugging of the network flow techniques. At that point the software pipeliner was able to compute free schedules, that is, schedules of the loop body assuming no functional unit conflicts. Fortunately, free schedules can be directly used to build software pipelines, as discovered by Gasperoni. As a result, it became possible to begin testing the software pipeliner for correctness.

• Reconstruction of the software pipeline from the loop body (free) schedule. Here I discovered a new way of reconstructing software pipelines, which simplified this task to a large extent. Indeed, older techniques require that one identify the counter and the trip count of the loop in the code stream, and then decrement the trip count by the degree of overlap of the software pipeline, in order to prevent unsafe behavior (the exact same problem exists with bottom loading). The new technique allows one to reconstruct completely safe software pipelines, without identifying the loop counter, nor decrementing the loop trip count.

• Implementation of a replacement for modulo expansion, called inductive relaxation, which applies whenever the register variable to expand is a simple induction. Simple inductions are additive induction variables with only one update in the loop body, and whose induction step is an immediate constant. When applicable, the inductive relaxation technique is better than modulo expansion, because the resulting schedule uses less register space, and requires less kernel unrolling. For more details about these two techniques, please refer to [11]. (A source-level analogy of modulo expansion appears after this list.)
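To give an idea of what modulo expansion produces, here is a hypothetical source-level analogy in C (the pipeliner works on Alpha instructions, so this is only the shape of the transformation, with invented names): a loaded value stays live for two initiation intervals, so the kernel is unrolled twice and the value receives two names. Inductive relaxation avoids this unrolling when the value to expand is a simple induction, by recomputing it from the single induction with a constant offset instead of keeping renamed copies.

    /* Source-level analogy of fill / kernel / drain code with modulo
     * expansion: the loaded value is live across 2 kernel slots, so the
     * kernel is unrolled 2x and the value gets two names, t0 and t1. */
    void copy_modulo_expanded(double *y, const double *x, int n) {
        int i = 0;
        if (n >= 4) {
            double t0 = x[0];                  /* fill: loads run two      */
            double t1 = x[1];                  /* iterations ahead         */
            for (i = 0; i + 3 < n; i += 2) {   /* kernel, unrolled 2x      */
                y[i]     = t0;  t0 = x[i + 2];
                y[i + 1] = t1;  t1 = x[i + 3];
            }
            y[i] = t0;  y[i + 1] = t1;         /* drain in-flight values   */
            i += 2;
        }
        for (; i < n; i++)                     /* remainder, not pipelined */
            y[i] = x[i];
    }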

3.2 August 19th – September 26th 1996

The purpose of this second stay at the Cray Park was to implement the main missing features of the software pipeliner:

• Scheduling with resource (functional unit) constraints. Here I reimplemented the Insertion Scheduling technique [9], which is simple and effective. By studying the software pipelines generated, I reworked the scheduling priority function in order to achieve a better balance between recurrence constraints and resource constraints.

• Modulo expansion. This technique is usually considered cumbersome to implement; however, by reusing the algorithms of inductive relaxation, I ended up with a very simple implementation (also described in [11]). During testing of modulo expansion, I encountered a serious problem: I had assumed that all the pseudo-registers were statically single-assigned, since the software pipeliner is run after the scalar renamer. This assumption does not hold on architectures with conditional move instructions, in particular on the Alpha AXP architecture.

• Scalar renaming. To correct the bugs triggered by conditional moves in the loop bodies, I could either ignore these loops, or add a scalar renaming capability to the software pipeliner. I chose the latter, and undertook deep modifications of the software pipeliner code. The end result is that the software pipeliner can be called on loops which have not been scalar renamed by the back-end. However, the scalar renaming phase of the back-end is still useful, as it computes as side effects the register liveness information at basic block boundaries, which is used by the software pipeliner to remove register dependencies.

• Various low-level changes, such as grabbing the loop info bits passed by the middle-end (PDGCS). As a result, many (expensive) calls to the memory dependency check functions of PDGCS are bypassed. Also, the “remainder” loops created by unrolling are no longer software pipelined. More calls to the memory dependency check functions could be bypassed if the middle-end were able to compute the ivdep (vector memory dependencies) bit precisely. (On non-parallel vector loops, the time spent in the memory dependency check functions currently dominates the time taken by the rest of the software pipeliner.) Currently, ivdep is only set when the loop is parallel (no memory loop-carried dependencies).

3.3 November 20th – December 6th 1996

The purpose of this third stay at the Cray Research Park was to integrate the software pipeliner sources and construction procedures into the 3.0 PL, to remove known bugs, and to collect performance data.

• The software pipeliner sources are maintained in a MyWeb source format, which consists of two files: Design.web and Implementation.web. The Design.web file describes the main data structures of the pipeliner. When processed by MyWeb, Design.web generates a series of .gen files, which are MyGen source files. Processing the .gen files with MyGen produces Objects.c and Objects.h. The Implementation.web file is expanded by MyGen into a series of .c, .h, and .i files. The .i files are included both by Objects.h and by the .c files generated from Implementation.web. After some discussion, it was decided that only the .c, .h, and .i files, and not the .web or .gen files, should be maintained in the 3.0 PL. Later, when the MyWeb and MyGen tools are replaced by new XML tools, the real sources of the software pipeliner will be maintained in the 3.0 PL.

• Detailed performance analysis of the software pipelined loops was conducted on the Livermore kernels.

3.4 January 13th – February 7th 1997

The purpose of this fourth stay was to address some of the performance problems exposed by the previous experiments:

• To reduce the very long compile times related to the use of the software pipeliner, I implemented a fix suggested by Tom MacDonald: reuse the pseudo-register names generated by renaming and modulo expansion from one loop to another. This resulted in significant decreases in compile times, because the main contributors to the back-end processing time are the bit vector operations performed during register assignment. Indeed, the length of the bit vectors is proportional to the number of global pseudo-registers, and the register assignment processing time is quadratic in the length of these vectors.

• Another improvement to reduce compile time consisted in performing most of the memory dependency computations inside the software pipeliner itself, instead of calling PDGCS. Using the PDGCS dependency check routines is slow for non-parallel loops, because these routines are called up to 8 times per pair of memory references, in order to find an accurate collision distance. Implementing dependency computations inside the software pipeliner was also required for the implementation of memory grouping techniques. (A sketch of a collision-distance computation appears after this list.)

• I implemented a first version of memory grouping, which resulted in some performance improvements, but also in increased register pressure. As a result, some loops which used to software pipeline were no longer software pipelined.

• I added code to compute the actual register pressure of a software pipeline. Whenever this pressure exceeds a given threshold, the original loop is left unchanged, as previous experiments demonstrated that software pipelines augmented with back-end generated spill code yield poor performance. The main reason spill code negates the benefits of software pipelining is that software pipelined loops are not rescheduled after register assignment: the back-end structures do not record that some memory instructions have moved from one loop iteration to another as a result of software pipeline construction, and rescheduling software pipelines without this knowledge would yield incorrect memory dependencies.

• The study of the spill code problems led to the discovery, by Jim Galarowicz and Tom MacDonald, of a bug in the back-end: the parameter registers were considered live across the whole program, while in fact they are only live in the code blocks which contain subroutine calls. This explains why spill code appeared in software pipelines with pressures as low as 21 floating-point registers. The bug was corrected by Jim Galarowicz.

• Last, I experimented with NOOP padding. While investigating why the performance improvements were inconclusive, I discovered a bug in the back-end block alignment algorithm. This bug was submitted to Bill Fulton, who later corrected it.
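As an illustration of the kind of computation involved, here is a hypothetical sketch (not the actual Cray code) of a collision-distance test for two affine references into the same array:

    /* Hypothetical collision-distance test for two affine references
     * stride*i + off1 and stride*i + off2 (in elements) to the same
     * array.  Returns k >= 0 such that reference 1 at iteration i+k
     * touches the element reference 2 touches at iteration i (k = 0
     * means a collision within the same iteration), or -1 if the
     * references never collide in that direction. */
    int collision_distance(int stride, int off1, int off2) {
        int diff = off2 - off1;
        if (stride == 0)                  /* both loop-invariant        */
            return diff == 0 ? 0 : -1;
        if (diff % stride != 0)           /* offsets never line up      */
            return -1;
        int k = diff / stride;
        return k >= 0 ? k : -1;           /* negative: other direction  */
    }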

3.5 April 2nd – April 16th 1997

The purpose of this fifth stay was to kill four software pipeliner SPRs:

• Two of the SPRs turned out to result from memory corruption originating outside the software pipeliner. The others were traced to incorrect information supplied by PDGCS; namely, the parallel bit was set for loops which contain loop-invariant reads and writes to the same memory location. Greg Fisher made the case that the problem was too complicated to correct in PDGCS, so I fixed it in the software pipeliner, by clearing the parallel bit in cases where loop-invariant memory accesses appear in the loop body. This “fix” is not general, and the bugs may reappear in some contrived cases as long as the problem is not taken care of in PDGCS.

• I worked with Dave Prigge, in charge of the production of the Cray programming environment documentation. I wrote the “Software Pipelining” section of the “Cray Fortran 90 Optimization Manual”, and I also edited for correctness some entries in the glossary of that document. The document sent to Dave Prigge is reproduced in appendix A.

• I proposed and discussed the specification of the CONCURRENT directive with Greg Fisher and Tom MacDonald. The need for a new directive appeared while evaluating the performance of the benchv.f code from CEA. This code, written by Gerard Meurant, implements eleven conjugate gradient algorithms to solve the Poisson problem.

After this stay, the software pipeliner appeared to perform well, and was reported by Jeanmarie Conner to be stable enough to warrant its distribution with the standard Fortran 90 and C / C++ compilers for the Cray T3E.

These compilers were released in June 1997, as part of the Programming Environment 3.0. In particular, in-depth performance studies performed by Jeff Brooks and Ed Anderson [3] of the benchmarking department led to the conclusion that software pipelining, in combination with loop unrolling, was on average the best performing setting of the Fortran 90 compiler.

3.6 May 27th – June 17th 1997

The purpose of this sixth stay was performance tuning:

• I retried NOOP padding, now that the alignment bug was fixed. Again, the results showed slight improvements on some loops, and noticeable degradations on other loops. This led to the discovery that in many loops, single LDA (LoaD Address) instructions were expanded into a series of arithmetic instructions by the back-end after software pipelining. Such expansion degrades performance for the same reason spill code does. Moreover, in the case of NOOP-padded code, LDA expansion almost certainly results in misaligned code.

• In the optimization guide by Ed Anderson, Jeff Brooks and Tom Hewitt [3], it appeared that the software pipeliner did not improve upon unrolling + bottom loading for the simple loop:

    DO I = 1, 256
       A(I) = B(I) + 2.5 * C(I)
    END DO

The reason for the lower performance of the software pipeliner was that it assumed a Dcache hit for loop-invariant LOADs, such as the load of the floating-point constant 2.5 above. In fact, on the Alpha 21164, floating-point data may never reach the Dcache in case of heavy STORE traffic. The fix was to assume Scache latency (9 cycles) for all floating-point LOADs.

• Subsequently I discovered a more general explanation of why unrolling + bottom loading sometimes outperforms software pipelining. In single-block loops, loop-invariant computations, such as loads from the floating-point constant pool and also LDAs, are flagged with a bit that enables the corresponding instructions to be hoisted out of the loop after scheduling, provided the register pressure is not too high. The hoisting algorithm only works on single-block loops, so it is not applicable to most software pipelines (because the kernel unrolling required by modulo expansion turns the software pipeline into a multi-block loop). So hoistable instructions were not pulled out of software pipelines, whereas they were hoisted out of bottom-loaded loops. Hoist candidates should be hoisted before a loop is submitted to the software pipeliner scheduler. Besides reducing the resource requirements of the loop, this would take care of the LDA expansion problem, since all LDAs in single-block loops are flagged as hoist candidates. Such hoisting was not implemented because time was running out.

• Last, a meeting was organized with Tom MacDonald, Marge Verstegen (MPP codegen group leader), Tim Keller (Cray lawyer), and me, about the extension of my work for Cray Research. Indeed, the current Cray – CEA agreement terminates in December 1997.

3.7 August 4th – August 13th 1997

The purpose of this seventh stay was performance tuning, and implementation of the CONCURRENT directive:

• Only those LDA instructions that are flagged as needing expansion are expanded into a series of hardware LDAs and shift instructions. Since expansion occurs after register assignment, there is no way the pipeliner can control it or anticipate its effects on instruction slotting. So I decided to turn pipelining off for loops containing such LDAs.

• I laid the groundwork necessary for the implementation of the CONCURRENT directive. However, the front-end and PDGCS were not ready at the time to supply the necessary information. So CONCURRENT ended up not being activated in the pipeliner.

3.8 October 27th – November 6th 1997

The initial goals of this eighth stay were to address some performance and functional regressions of the pipeliner, implement LDA hoisting for LDAs that need expansion, and complete the support of the CONCURRENT directive:

• Functional and performance regressions were fixed. Functional problems were again related to an incorrect parallel flag supplied by PDGCS. Performance regressions were related to PDGCS pattern matching of kernel loops being disabled, and to a bug in the LOAD grouping algorithm of the software pipeliner.

• Thanks to Sean Palmer, who wrote the appropriate code template, I implemented compiler messages for the software pipeliner. Compiling with the option -Omsgs now emits detailed messages about what the software pipeliner does and why.

• CONCURRENT is implemented and works, at last. What remains to be done is to find real codes where this directive makes a significant difference, so as to create performance test cases. One of the CEA codes qualifies (k.f), but more work (in France) is needed to find other applicable codes.

• Thanks to Neal Gaarder, NOOP padding now works. This also explains the performance anomaly I was experiencing with SAXPY: misaligning instructions used to increase performance. Latest SAXPY performances (MFLOPS):

    Iterations:                       n=100   n=500   n=1000
    No padding, unroll 8, pipeline      140     189      203
    Padding, unroll 8, pipeline         154     214      231
    Math library SAXPY                  162     233      244

These performance figures are quite satisfactory, given that the library SAXPY routine has been hand-coded in assembly language, and optimized by a search among many possible instruction schedules.

4 Conclusions

As of June 6th 1997, the software pipeliner is released with the default Fortran 90 and C / C++ Cray T3E compilers. As reported by the benchmarking department [3], the software pipeliner is a significant contributor to the Cray T3E performance, providing up to 30% speedups for some of the NAS kernels. Although impressive, these speedups are far from the 35% overall obtained by SGI’s software pipeliner on the SPEC FP codes [20]. The main reason for the difference is that SGI’s pipeliner results from more than 10 years of development, refinement, and tuning, as it was built on the foundations provided by the Cydrome loop optimizer and modulo scheduler. At SGI, John Ruttenberg spent four years on that software pipeliner.

The main features missing from the Cray T3E software pipeliner, as compared to SGI’s software pipeliner, are: IF-conversion, so that loops with conditionals can be software pipelined; incremental register spilling in the software pipelining process, so that any software pipelinable loop benefits from the technique, regardless of the register pressure generated; and squashing of associative recurrences and reductions, so that dot products, and tri-diagonal / penta-diagonal solvers, yield high-performance software pipelines.

Another contributing factor in favor of SGI’s software pipeliner performance is the ability of MIPS processors to compute square roots in hardware. The Alpha 21164 has no square root instruction, so loops with square roots call a subroutine; unfortunately, subroutine calls disable software pipelining. There is hope, however, since recent versions of the T3E are based on the Alpha 21164A, which implements the latest specification of the Alpha architecture, including square root instructions.

Let us recall that SGI’s software pipeliner was initially planned to be ported to the Cray T3E. These plans were later cancelled, due to the huge amount of work required by such a port. Thus, providing additional features to the Cray T3E software pipeliner is required in order to approach the efficiency of SGI’s software pipeliner. This will require further work, and a new agreement between Cray and CEA. Fortunately, the advanced theoretical foundations of the Cray T3E software pipeliner, as well as its robust architecture, will significantly reduce the complexity of adding new features.

Acknowledgements

Several persons at Cray were actively involved in making the software pipelining experiment a success. Steven Reinhardt initiated the project in 1992. David Judd and Tom MacDonald were the main supporters, as they faithfully provided much needed moral support, administrative help, and technical advice, for four years. I especially enjoyed being a member of the MPP codegen group while it was managed by Marjorie Verstegen. Many thanks to all the past and present members of that group, especially Andrew Meltzer, Jim Galarowicz, Janet Eberhart, Mark Cruciani, and Sean Palmer. Thanks to Greg Fisher, in charge of MPP optimizations in PDGCS, and to Don Ferry, in charge of compiler performance testing. Special thanks to Jeanmarie Conner, for her thorough work in software pipeliner testing, and for her faith in the success of my work.

On the CEA side, my deep thanks go to Gerard Meurant and to Patrick Lascaux, who were the prime supporters of the software pipelining experiment from the beginning. I also thank Daniel Verwaerde, François Robin, and Monique Patron, for their continued interest and support in my work at CEA. Thanks to the DGA, for allowing me to work full-time on the software pipelining techniques during an 18-month stay at McGill University, Montreal. Without such an opportunity, I would never have been able to advance my research fast enough, while being successful at delivering a commercial product.

References

[1] A. Aiken, A. Nicolau “Optimal Loop Parallelization” Proceedings of the SIGPLAN’88 Symposium, 1988.

[2] V. H. Allan, R. Jones, R. Lee, S. J. Allan “Software Pipelining” ACM Computing Surveys, Sep. 1995.

[3] E. Anderson, J. Brooks, T. Hewitt “The Benchmarker’s Guide to Single-processor Optimization for T3E Systems”, Cray Research document, May 1997.

[4] R. L. Sites “Alpha AXP Architecture” Digital Technical Journal, vol. 4, no. 2, 1992.

[5] J. C. Dehnert, R. A. Towle “Compiling for Cydra 5” Journal of Supercomputing, vol. 7, pp. 181–227, May 1993.

[6] B. Dupont de Dinechin “An Introduction to Simplex Scheduling” PACT’94, Montreal, Aug. 1994.

[7] B. Dupont de Dinechin “Simplex Scheduling: More than Lifetime-Sensitive Instruction Scheduling” PRISM research report 1994.22, available by anonymous ftp from ftp.prism.uvsq.fr, July 1994.

[8] B. Dupont de Dinechin “Fast Modulo Scheduling Under the Simplex Scheduling Framework” PRISM research report 1995.01, available by anonymous ftp from ftp.prism.uvsq.fr, Jan. 1995.

[9] B. Dupont de Dinechin “Insertion Scheduling: An Alternative to List Scheduling for Modulo Schedulers”, Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing, LNCS #1033, Columbus, Ohio, Aug. 1995.

[10] B. Dupont de Dinechin “Efficient Computation of Margins and of Minimum Cumulative Register Lifetime Dates”, Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing, San Jose, California, Aug. 1996.

[11] B. Dupont de Dinechin “A Unified Software Pipeline Construction Scheme for Modulo Scheduled Loops”, 4th International Conference on Parallel Computing Technologies (PaCT’97), LNCS #1277, Yaroslavl, Russia, Sept. 1997.

[12] R. A. Huff “Lifetime-Sensitive Modulo Scheduling” Proceedings of the SIGPLAN’93 Conference on Programming Language Design and Implementation, Albuquerque, June 1993.

[13] M. Lam “A Systolic Array Optimizing Compiler” Ph.D. Thesis, Carnegie Mellon University, May 1987.

[14] M. Lam “Software Pipelining: An Effective Scheduling Technique for VLIW Machines” Proceedings of the SIGPLAN’88 Conference on Programming Language Design and Implementation, 1988.

[15] Q. Ning “Re: Question about the POPL paper”, private communication, Feb. 1996.

[16] Q. Ning, G. R. Gao “A Novel Framework of Register Allocation for Software Pipelining” Proceedings of the ACM Symposium on Principles of Programming Languages (POPL’93), Jan. 1993.

[17] B. R. Rau, C. D. Glaeser “Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing” IEEE / ACM 14th Annual Microprogramming Workshop, Oct. 1981.

[18] B. R. Rau “Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops” IEEE / ACM 27th Annual Microprogramming Workshop, San Jose, California, Nov. 1994.

[19] B. R. Rau, M. S. Schlansker, P. P. Tirumalai “Code Generation Schemas for Modulo Scheduled Loops” MICRO-25 / 25th Annual International Symposium on Microarchitecture, Portland, Dec. 1992.

[20] J. Ruttenberg, G. R. Gao, A. Stoutchinin, W. Lichtenstein “Software Pipelining Showdown: Optimal vs. Heuristic Methods in a Production Compiler” Proceedings of the SIGPLAN’96 Conference on Programming Language Design and Implementation, Philadelphia, May 1996.

[21] R. F. Touzeau “A Fortran Compiler for the FPS-164 Scientific Computer” Proceedings of the ACM SIGPLAN’84 Symposium on Compiler Construction, 1984.

A Software Pipelining (contribution to the “Cray T3E Fortran 90 Optimization Manual”)

Software pipelining is an advanced scheduling technique that overlaps the execution of successive loop iterations in order to achieve 100% utilization of one of the processor’s scheduled resources (such as floating-point pipelines, integer pipelines, and cache memory bandwidth). Software pipelining applies to innermost DO-loops and (DO) WHILE-loops, provided the loops contain no conditional code or subroutine calls. In particular, loops that call intrinsic functions not supported by the processor hardware (such as SQRT) must be vectorized first in order to benefit from software pipelining.

A.1 How Software Pipelining Works

The way software pipelining works to achieve 100% utilization of one of the processor scheduled resources is easily understood with a simple example:

    DO I = 1, N
       Y(I) = X(I)
    END DO

When this loop is compiled, it is translated into Alpha assembly instructions, which are then block-scheduled. The resulting code would look like the following loop, in which each line corresponds to one processor clock period:

    I = 1
    DO
       T = X(I)
       Y(I) = T
       I = I+1
       IF (I.GT.N) EXIT
    END DO

Without software pipelining, the processor issues on average less than one instruction per clock period, because the execution of successive loop iterations is sequential (as opposed to overlapped). Indeed, the loop above takes 5 clock periods per iteration, assuming hits in the data cache, which has a latency of two clock periods.

However, by overlapping the execution of successive iterations and by creating a new loop body, software pipelining results in an average throughput

of one iteration started every 2 clock periods, as illustrated below. This initiation interval of 2 clock periods, along with the fact that every iteration now takes 6 clock periods to complete, implies that an overlap of ⌈6/2⌉ = 3 iterations has been achieved.

    I = 1
    T1 = X(I)
    I = I+1
    T2 = X(I)
    DO
       Y(I-1) = T1; I = I+1
       IF (I.GT.N) EXIT; T1 = X(I)
       Y(I-1) = T2; I = I+1
       IF (I.GT.N) EXIT; T2 = X(I)
    END DO

After the transformation, the new loop has multiple exits, uses twice as much register space, and reorders the update to the loop induction variable I = I+1 relative to its use in the store to Y(). But the throughput has increased by a factor of 2.5, and the two integer pipelines of the EV5 processor are kept 100% busy within the loop.

A.2 Candidate Loops for Software Pipelining

Theoretically, a software pipeliner is only guaranteed to achieve 100% utilization of one of the processor scheduled resources when there are no recurrences in the loop. (Updates of induction variables do not count as recurrences.) This means that parallel loops and vector loops should provide the best candidates for software pipelining. In practice, traditional instruction scheduling will already achieve a very good use of the processor scheduled resources whenever the loop body contains enough parallel instructions. Matching either of the following cases makes it likely that software pipelining will significantly increase the performance of a parallel or vector loop:

• The loop body is not too large. (An approximate value for large is a loop that translates to more than 64 Alpha assembly instructions.) On large loop bodies with many parallel instructions, the software pipeliner will exhaust the available processor registers sooner than the default block-scheduler.


• The loop body is not memory bound. Since most memory events, such as merging in the miss address file or write buffer, are difficult to predict at compile-time, the software pipeliner cannot manage the memory bandwidth resource accurately. In particular, unrolling parallel or vector loops, either manually or by using compiler flags and directives, will result in loops whose body is large and contains many parallel instructions. Software pipelining such loops will yield moderate performance improvements, if any.

In case a loop contains one or more recurrences, the maximum amount of overlap between loop iterations the software pipeliner can achieve decreases. However, software pipelining can still provide significant performance increases if one or more of the following conditions is satisfied:

• The recurrence can be ignored, because it holds between iterations that are too distant to be overlapped. A typical example is:

    DO I = P+1, N
       X(I) = A(I) + X(I-P)
    END DO

Here, if P is some variable with a value larger than about 3, the loop will be translated into a high-performance software pipeline, provided the compiler knows a lower bound on the value of P. Such information is provided by using a !DIR$ CONCURRENT directive. Please note that if P is a constant known at compile-time in the example above, the compiler will eliminate the LOAD instructions and carry the values across iterations in registers. In that case the !DIR$ CONCURRENT directive is no longer required nor useful.

• The loop body contains enough instructions that are not involved in a recurrence cycle. In that case, it is likely that the loop initiation interval will be constrained more by scheduled resource availability than by the recurrence itself. A typical example might look like:

    DO I = 2, N
       X(I) = A(I) + X(I-1)         ! recurrence
       Y(I) = B(I+1)*R + B(I-1)*S   ! unrelated work
       Z(I) = Z(I) - Y(I)*C(I)      ! unrelated work
       ...                          ! more work
    END DO

As a consequence, it is worthwhile to fuse parallel and vector loops with recurrent loops before software pipelining, provided the resulting loop body does not grow too large and does not become memory-bound.

• The recurrence cycle contains a string of operations that can be reassociated by the software pipeliner. A typical example is an unrolled dot product, which assumes the following form when it reaches the software pipeliner:

    S = 0.0
    DO I = 1,N,3
       T0 = X(I)*Y(I)
       S = S + T0
       T1 = X(I+1)*Y(I+1)
       S = S + T1
       T2 = X(I+2)*Y(I+2)
       S = S + T2
    END DO

By using expression reassociation, which is enabled for pipeline levels 2 and 3 (though deferred in early compiler releases), the loop will be transformed as follows before software pipelining:

    S = 0.0
    DO I = 1,N,3
       T0 = X(I)*Y(I)
       T1 = X(I+1)*Y(I+1)
       R1 = T0 + T1
       T2 = X(I+2)*Y(I+2)
       R2 = R1 + T2
       S = S + R2
    END DO

In the former loop, the recurrence constrains the software pipeline initiation interval to be greater than or equal to 12 clock periods: that is, 3 times the latency of the floating-point add operation. In the transformed loop, the lower bound on the initiation interval is reduced to 4 clock periods. Since there are 6 LOAD operations to execute for every iteration, with up to 2 of them issued every clock period, the maximum throughput the loop could achieve without the recurrence would be an initiation interval of 3. The recurrence has become much less constraining. (See the note after this list.)

• The recurrence cycle contains a control dependency, which can be removed by enabling speculative execution. A typical example of a control dependency recurrence can be found in a search loop:

    DO
       IF (A(I).LT.0.0) EXIT
       I = I+1
    END DO

Here the iterations cannot be overlapped, because it is not safe to start loading A(I) for the next iteration before the test is complete. By enabling speculative execution (pipeline level 3), the user asserts that it is safe to start executing the next few iterations before the exit condition of the current iteration is resolved. Note: speculative execution is inherently unsafe, and does not bring any performance increase on regular DO loops, which have a known trip count. Making a loop safer for speculative execution usually involves extending the array dimensions, and padding the extra elements with some defined value.
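For reference, the initiation-interval bounds quoted in the reassociation example above follow from the standard recurrence constraint of modulo scheduling: every dependence cycle of the scheduling graph forces the initiation interval II to satisfy II ≥ ⌈L/D⌉, where L is the total latency around the cycle and D its total iteration distance. In the original dot product, the cycle through S contains three floating-point adds of latency 4 and spans one iteration, hence II ≥ ⌈(3 × 4)/1⌉ = 12; after reassociation, only the single add S = S + R2 remains on the cycle, hence II ≥ ⌈4/1⌉ = 4.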

A.3 Optimizing a Program with Software Pipelining

A.3.1 Software Pipelining Optimization Levels

The -O pipelinen options specify various levels of software pipelining, ranging from no software pipelining, at -O pipeline0, to software pipelining combined with expression reassociation and speculative execution of loads and arithmetic operations, at -O pipeline3.

-O pipeline0 disables software pipelining. Default.

-O pipeline1 enables standard software pipelining. Numeric results obtained at this level do not differ from results obtained at pipeline0.

-O pipeline2 enables standard software pipelining, and expression reassociation. Numeric results here could differ from those obtained at pipeline1 because of expression reassociation. Implementation of expression reassociation is deferred, so -O pipeline2 is currently equivalent to -O pipeline1.

-O pipeline3 enables speculative software pipelining, and expression reassociation. Implementation of expression reassociation is deferred. Speculative software pipelining means that speculative execution of loads and arithmetic operations is enabled. Speculative execution could lead to floating-point exceptions and operand range errors. At the -O pipeline1, -O pipeline2, and -O pipeline3 levels, compile times will be longer, but execution times may be shorter. Speculative software pipelining is mainly useful for WHILE-loops. For DO-loops, the -O pipeline3 option does not yield shorter execution times than the -O pipeline2 option.

A.3.2 Use of Source-Level Directives

Currently, software pipelining levels cannot be specified with directives. Nevertheless, two source-level directives are important in many cases to enable the software pipeliner to deliver its full potential. These are:

• !DIR$ CONCURRENT SAFE_DISTANCE=n

The specifications are:

– Honor dependencies on scalar variables.
– Assume a collision distance of n or more iterations for non-scalar dependencies.
– In case SAFE_DISTANCE=n is not present, ignore non-scalar dependencies.

This directive enables the user to provide information about array dependencies that are hard for the compiler to disambiguate. Such information is useful to free the software pipeliner from the constraints normally associated with memory recurrences. A typical use is:

    !DIR$ CONCURRENT SAFE_DISTANCE=3
    DO I = P+1, N
       X(I) = A(I) + X(I-P)
    END DO

Here, the compiler will assume that the dependency between X(I) and X(I-P) has a collision distance greater than or equal to 3.


• !DIR$ IVDEP

This directive directs the compiler to ignore any vector dependency. A vector dependency is either a self-dependency on one of the loop body statements, or a dependency that goes from a statement to an earlier statement in the loop body. Ignoring vector dependencies removes all recurrences, so the software pipeliner may achieve better overlap between iterations. The main problem with !DIR$ IVDEP is that all vector dependencies are ignored, including those that obviously hold. Another problem is that !DIR$ IVDEP also triggers loop vectorization, which may or may not be what the user needs.

To summarize, the main difference between these two directives is that !DIR$ IVDEP ignores all vector dependencies, including scalar dependencies, while !DIR$ CONCURRENT assigns a collision distance to all non-scalar dependencies, whether they are vector or not.

A.3.3 Interactions with Other Optimizations

Vectorization Besides acting on memory dependencies that may constrain the software pipeliner, vectorization has the effect of splitting a loop that contains calls to intrinsic functions. The resulting loops make perfect candidates for software pipelining, since they no longer contain subroutine calls, are smaller than the original loop, and do not contain memory recurrences.

Loop Splitting The purpose of loop splitting, when invoked by itself, is to split a given loop into smaller loops that make more efficient use of the stream prefetching hardware. Obviously such loops cannot grow too large, because no more than six memory reference streams are supported by the hardware. As a result, loop splitting is quite likely to see its results improved when combined with software pipelining.

Loop Unrolling When combined with software pipelining, the results of loop unrolling usually show improvements, especially when it is applied to take further advantage of the stream prefetching hardware. However, in cases where unrolling creates large loop bodies containing many independent instructions, the software pipeliner may actually run out of register space without knowing for sure it does (because software pipelining is not tightly coupled to the register allocator). In such cases, software pipelining combined with loop unrolling may decrease the performance when compared to loop unrolling alone.

B Final Performances (PE 3.0.2)

B.1 Mail from Greg Fisher (November 17th 1997)

Thanks to Sean and Anh, the source of the regressions from 3.0.1 has been tracked down to the use of a different set of libraries in Frinnie’s directory. (Sean, could you please follow up with Frinnie and Neal and make sure they are aware of the subpar performance of these libraries? It may be that they are debug libraries, not intended for performance evaluation use).

With that resolved, it appears that the 3.0.2 pipeliner provides 3-6% across-the-board GMR improvement over unrolling, with the exception of linpack at 26%. The biggest loser is X42 in APR with a 45% slowdown; we will probably need to look more closely at this. The 3.0.2 pipeliner also provides a 2-6% across-the-board GMR improvement over the 3.0.1 pipeliner. Even X42 improved 10% over 3.0.1. In other words, Benoit’s latest mods did not cause an across-the-board regression.

With the functional regression tests returning positive as well, I have recommended that Sean integrate Benoit’s mods from his latest visit. Thanks to the pipeliner feature team for making Benoit’s latest visit a positive one!

Greg

PS. I have attached the specifics of Anh’s latest pipeliner performance runs: one comparing unroll vs unroll,pipeline with CG 338, the other comparing CG 337 vs 338 with unroll,pipeline.

B.2 Pipeliner Performance Result

Date: November 11, 1997
From: Anh Tran
Subject: Pipeliner Performance Result

Note: You can view this report anytime at:
http://wwwsdiv.cray.com/PUBLIC/cats/pe/tip

Please see the seven tables attached below. Non-Dedicated t3e hubble on November 11, 1997. The following tests were run with the same compiler (FE 19, PDGCS 51, MPPCCG 338), using the -Ounroll2 and -Ounroll2,pipeline2 options.

REGRESSION (with -Ounroll2,pipeline2, relative to -Ounroll2):

Livkern Loop 10          is 10.4% slower
Livkern Loop 14          is  9.4% slower
X42 (APR suite)          is 45.3% slower
ADM (Perfect suite)      is  6.0% slower
DYFESM (Perfect suite)   is 13.5% slower

IMPROVEMENT (with -Ounroll2,pipeline2, relative to -Ounroll2):

Livkern Loop 2           is 27.0% faster
Livkern Loop 3           is 10.4% faster
Livkern Loop 5           is  8.5% faster
Livkern Loop 11          is 14.7% faster
Livkern Loop 12          is 17.3% faster
Livkern Loop 18          is  5.1% faster
Linpack Test 1           is 12.4% faster
Linpack Test 2           is 15.6% faster
Linpack Test 3           is 29.2% faster
Linpack Test 4           is 23.3% faster
Linpack Test 5           is 33.7% faster
Linpack Test 6           is 36.8% faster
Linpack Test 7           is 29.2% faster
Linpack Test 8           is 17.0% faster
Naskern CFFT2D           is 13.2% faster
Naskern CHOLSKY          is 21.8% faster
BDNA (Perfect suite)     is 10.2% faster
TRFD (Perfect suite)     is 29.5% faster

Livermore Kernels (LIVKERN) — Vector Execution Times in Megaflops

                      A                    B              Ratio
Kernel #       t3e 195138_U2       t3e 195138_UP2         A / B
---------      -------------       --------------       -------
    1              210.302              230.211           0.914
    2               54.447               74.629           0.730
    3              116.900              130.445           0.896
    4              121.746              123.987           0.982
    5               65.362               71.396           0.915
    6               86.281               90.932           0.949
    7              197.193              194.801           1.012
    8              118.067              117.732           1.003
    9               78.622               81.184           0.968
   10               91.671               83.015           1.104  <<
   11               59.942               70.267           0.853
   12               99.344              120.086           0.827
   13               20.820               20.728           1.004
   14               20.187               18.460           1.094  <<
   15               35.061               34.839           1.006
   16               50.428               49.483           1.019
   17               58.984               59.470           0.992
   18               98.349              103.634           0.949
   19               70.713               71.938           0.983
   20               41.725               42.330           0.986
   21              238.506              242.422           0.984
   22               29.635               30.470           0.973
   23               99.524              101.723           0.978
   24               81.213               81.146           1.001
---------      -------------       --------------       -------
Averages            89.376               93.555           0.963
Geometric Mean Ratio                                      0.960
Harmonic Mean Ratio                                       0.956

Table 1: Livermore Kernels (LIVKERN) – Vector Execution Times in Megaflops

Linpack — Execution Times in Megaflops

                      A                    B              Ratio
Test #         t3e 195138_U2       t3e 195138_UP2         A / B
---------      -------------       --------------       -------
    1               80.246               91.556           0.876
    2               80.781               95.760           0.844
    3               69.620               98.381           0.708
    4               74.639               97.270           0.767
    5               67.322              101.569           0.663
    6               65.216              103.235           0.632
    7               61.002              100.397           0.608
    8               82.864               99.840           0.830
---------      -------------       --------------       -------
Averages            72.711               98.501           0.741
Geometric Mean Ratio                                      0.735
Harmonic Mean Ratio                                       0.728

Table 2: Linpack – Execution Times in Megaflops

NAS Kernel Benchmarks — Execution Times in Megaflops

                      A                    B              Ratio
Program        t3e 195138_U2       t3e 195138_UP2         A / B
---------      -------------       --------------       -------
BTRIX               48.500               48.460           1.001
CFFT2D              17.450               20.100           0.868
CHOLSKY             11.910               15.230           0.782
EMIT                64.400               63.750           1.010
GMTRY               45.310               45.310           1.000
MXM                296.750              300.100           0.989
VPENTA              32.810               33.450           0.981
---------      -------------       --------------       -------
Averages            73.876               75.200           0.947
Geometric Mean Ratio                                      0.944
Harmonic Mean Ratio                                       0.940

Table 3: NAS Kernel Benchmarks – Execution Times in Megaflops

APR Performance Suite — Execution Times in CPU Seconds

                      A                    B              Ratio
Test           t3e 195138_U2       t3e 195138_UP2         B / A
---------      -------------       --------------       -------
APPBT                7.591                7.495           0.987
APPSP                7.713                7.610           0.987
BARO                 3.042                3.084           1.014
EMBAR                1.582                1.564           0.989
FFT1                 1.290                1.325           1.027
GRID                 7.626                7.632           1.001
ORA                  7.901                7.891           0.999
PDE1                 2.460                2.497           1.015
SCALGAM              1.367                1.311           0.959
SHALLOW77           30.794               30.914           1.004
SHALLOW90           27.000               27.071           1.003
SWM256              19.302               19.854           1.029
TOMCATV              9.365                9.800           1.046
TRANS1               7.574                7.278           0.961
X42                  6.769                9.838           1.453
---------      -------------       --------------       -------
Averages             9.425                9.678           1.032
Geometric Mean Ratio                                      1.026
Harmonic Mean Ratio                                       1.022

Table 4: APR Performance Suite – Execution Times in CPU Seconds

APR Performance Suite — Compilation Times in CPU Seconds

                      A                    B              Ratio
Test           t3e 195138_U2       t3e 195138_UP2         B / A
---------      -------------       --------------       -------
APPBT               59.877               75.078           1.254
APPSP               54.251               71.421           1.316
BARO                18.968               23.092           1.217
EMBAR                2.448                2.539           1.037
FFT1                 5.425                5.770           1.064
GRID                 4.370                5.111           1.170
ORA                  2.838                2.789           0.983
PDE1                 6.322                6.534           1.034
SCALGAM              7.411                7.918           1.068
SHALLOW77            8.933               11.547           1.293
SHALLOW90            8.059                9.624           1.194
SWM256               8.300               10.596           1.277
TOMCATV              4.744                6.777           1.429
TRANS1               2.309                2.367           1.025
X42                 10.323               13.086           1.268
---------      -------------       --------------       -------
Averages            13.639               16.950           1.175
Geometric Mean Ratio                                      1.168
Harmonic Mean Ratio                                       1.161

Table 5: APR Performance Suite – Compilation Times in CPU Seconds

Perfect Suite — Execution Times in CPU Seconds

                      A                    B              Ratio
Test           t3e 195138_U2       t3e 195138_UP2         B / A
---------      -------------       --------------       -------
ADM                  8.745                9.272           1.060
ARC2D               74.191               70.831           0.955
BDNA                26.228               23.546           0.898
DYFESM               3.571                4.053           1.135
FLO52               18.254               18.108           0.992
MDG                 75.054               73.212           0.975
MG3D               330.099              330.200           1.000
OCEANP              68.450               66.660           0.974
QCD                  5.514                5.433           0.985
SPEC77              42.282               43.124           1.020
TRACK                3.301                3.156           0.956
TRFD                 8.463                5.967           0.705
---------      -------------       --------------       -------
Averages            55.346               54.464           0.971
Geometric Mean Ratio                                      0.966
Harmonic Mean Ratio                                       0.960

Table 6: Perfect Suite – Execution Times in CPU Seconds

Perfect Suite — Compilation Times in CPU Seconds

                      A                    B              Ratio
Test           t3e 195138_U2       t3e 195138_UP2         B / A
---------      -------------       --------------       -------
ADM                 72.263               84.862           1.174
ARC2D               50.238               65.191           1.298
BDNA                58.702               68.443           1.166
DYFESM              34.896               45.821           1.313
FLO52               51.541               67.500           1.310
MDG                 14.155               15.594           1.102
MG3D                42.850               50.779           1.185
OCEANP              38.572               58.213           1.509
QCD                 22.581               37.303           1.652
SPEC77              64.289               79.277           1.233
TRACK               18.116               18.255           1.008
TRFD                15.107               18.997           1.257
---------      -------------       --------------       -------
Averages            40.276               50.853           1.267
Geometric Mean Ratio                                      1.257
Harmonic Mean Ratio                                       1.247

Table 7: Perfect Suite – Compilation Times in CPU Seconds
