Non-Intrusive Dynamic Application Profiling for ...


Non-Intrusive Dynamic Application Profiling for Multitasked Applications
Karthik Shankar, Roman Lysecky
Department of Electrical and Computer Engineering
University of Arizona, Tucson, AZ
{karthik1, rlysecky}@ece.arizona.edu

ABSTRACT
Application profiling, the process of monitoring an application to determine the frequency of execution within specific regions, is an essential step within the design process for many software and hardware systems. Profiling is often a critical step within hardware/software partitioning, where it is utilized to determine the critical kernels of an application. In this paper, we present a non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application's short backwards branches, function calls, and function returns, as well as efficiently detecting context switches, to provide accurate characterization of the frequently executed loops within multitasked applications. DAProf can accurately profile multiple tasks within a software application with 98.5% accuracy using as little as 10% additional area compared to an ARM9 processor.

Categories and Subject Descriptors
C.4 [Computer Systems Organization] Performance of Systems – Measurement Techniques.

General Terms
Design, Performance.

Keywords
Profiling, multitasking, real-time embedded systems, dynamic optimizations, dynamic hardware/software partitioning.

1. INTRODUCTION
Application profiling, the process of monitoring an application to determine the frequency of execution within specific regions, is an essential step within the design process for many software and hardware systems. Profiling has long been utilized to identify the most frequently executed regions of a software application so that software developers can focus their efforts on optimizing those regions. While static (offline) profiling is feasible for some applications, dynamic profiling is essential for dynamic optimization techniques. Optimization techniques that leverage dynamic profiling include dynamic binary translation and optimization [3][7], creating multiple specialized software or hardware implementations that can be dynamically selected at runtime [13], storing frequently executed code regions within a low-power loop cache [9][14], and just-in-time compilation. Dynamic profiling is also an essential task in warp processing [15], a dynamic hardware/software partitioning approach in which critical kernels within an executing software application are re-implemented as custom hardware circuits in an on-chip FPGA. As with many dynamic optimization approaches, warp processing relies on accurate, dynamic profiling to determine which software kernels are potential candidates for hardware implementation.

Most previous profiling approaches, intended for desktop computing, introduce runtime overhead, either by inserting additional code into the application or by interrupting the processor at particular intervals to sample the processor's registers. A common software-based profiling approach involves instrumenting the application by adding code to count frequencies of the desired code regions [10][12]. For example, to count the frequency of execution of a subroutine, we can add code to the beginning of the subroutine that increments an associated counter variable. To reduce runtime overhead, other profiling approaches use statistical sampling techniques [1][6][19]. Such methods either interrupt the microprocessor at certain intervals or create an additional software task for profiling, and then read the program counter and other internal registers to statistically determine execution behavior.

For embedded systems, both instrumentation and statistical profiling approaches potentially change the behavior of the application and incur significant runtime overhead. In real-time systems, which are usually designed with very tight timing constraints, even the slightest runtime overhead can lead to missed deadlines and potential system failure. For further details and discussion of existing software-based and hardware-based profiling techniques, we refer the interested reader to [16][18]. Due to the limitations of these software profiling methodologies, designers have often resorted to hardware-based profiling approaches.
Of notable interest is the frequent loop detection profiler, which non-intrusively monitors the instruction addresses seen on the memory bus and profiles loop iterations by monitoring short backwards branches [8]. A short backwards branch is any branch instruction whose target address has a short negative offset; such branches are typically used at the end of a loop iteration. Whenever a short backwards branch occurs, the frequent loop detection profiler updates a small cache of short backwards branch frequencies and maintains a list of relative branch frequencies. For hardware/software partitioning approaches that use profiling to guide the partitioning process, the frequent loop detection profiler can provide a relative ranking of loops to guide the order in which loops are analyzed for hardware implementation. However, without further simulation or application analysis, such limited profiling information may lead to suboptimal hardware/software partitioning results, as performance improvements cannot be accurately estimated using only relative branch execution frequencies.

In [16], we presented an efficient, non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application's short backwards branches and providing detailed profiling statistics for characterizing loop execution behavior, including loop executions, average iterations per execution, and percentage of execution time. However, DAProf was originally designed to profile single-task applications. As many embedded systems are multitasked, consisting of several tasks executing within a lightweight kernel or operating system, the original DAProf design incorrectly interprets context switches as nonexistent loop execution behavior, thus leading to inaccurate profiling information.

In this paper, we present a non-intrusive, dynamic application profiler with support for profiling multitasked applications. In addition to monitoring short backwards branches, function calls, and function returns, the extended DAProf design dynamically and non-intrusively detects context switches to provide detailed and accurate loop execution statistics for multitasked embedded applications. DAProf also now provides designers with the option to selectively profile specific tasks, functions, library code, or system calls while filtering out those elements that are not of immediate interest, thereby providing greater profiling accuracy and detail for the profiled elements.

Figure 1. Overview of Dynamic Application Profiler (DAProf) integration with a microprocessor system, utilizing signals for detecting short backwards branches (sbb), function calls (func), and function returns (ret). The diagram shows the microprocessor (μP) with its instruction cache (I$) and data cache (D$), and DAProf receiving the sbb, func, ret, and iAddr signals.

2. MULTITASKED DYNAMIC APPLICATION PROFILER


Figure 1 presents an overview of DAProf for multitasked applications. DAProf non-intrusively monitors a microprocessor's instruction bus to detect short backwards branches, function calls, function returns, and context switches. DAProf considers a short backwards branch to be any branch instruction whose target address is a negative offset of less than 1024, which corresponds to small loops containing fewer than 256 instructions. While DAProf could decode the instructions on the instruction bus, we currently assume the microprocessor provides a one-bit output, sbb, indicating a short backwards branch has been executed, a one-bit output, func, indicating a function call is being executed, and a one-bit output, ret, indicating a function has returned. Such support would require minor modification to a microprocessor's decoding logic. To detect function calls and function returns, DAProf requires the address from which a function was called and the address to which the function returned.

Figure 2 presents the internal architecture of the extended DAProf design with support for multitasked applications. DAProf consists of a profiler task filter for specifying the tasks to profile and detecting context switches, a profiler FIFO to synchronize between the microprocessor and profile cache, a profile cache that stores all relevant profile statistics for those loops being profiled, and a profiler controller that analyzes the short backwards branches, function calls, function returns, and context switches to update the profiling statistics within the profile cache.
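The short backwards branch criterion described above can be illustrated with a minimal Python sketch. It assumes fixed 4-byte (32-bit ARM) instructions and byte addresses; the function names and constants are hypothetical and only mirror the 1024-byte threshold stated in the text, not the microprocessor's actual decoding logic.

```python
SBB_MAX_OFFSET_BYTES = 1024  # threshold stated in the text
INSTRUCTION_BYTES = 4        # assumption: fixed-width 32-bit ARM instructions


def is_short_backwards_branch(branch_addr: int, target_addr: int) -> bool:
    """Return True if the branch jumps backwards by less than 1024 bytes."""
    offset = branch_addr - target_addr  # positive when the target is earlier
    return 0 < offset < SBB_MAX_OFFSET_BYTES


def offset_in_instructions(branch_addr: int, target_addr: int) -> int:
    """Backwards distance expressed in instructions rather than bytes."""
    return (branch_addr - target_addr) // INSTRUCTION_BYTES
```

Under these assumptions the largest qualifying offset is 1020 bytes, or 255 instructions, which fits in an 8-bit field.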

2.1 Profiler Task Filter
The profiler task filter is primarily utilized to non-intrusively detect context switches between the tasks being profiled. It is implemented as a programmable array storing the starting and ending address of each task, or any region of code, to be profiled. The profiler task filter provides great flexibility in profiling a multitasked application by giving a designer the option to selectively profile specific tasks, functions, library code, system calls, etc., while filtering out those elements that are not of immediate interest. To detect context switches, the task filter monitors the processor's instruction bus to determine which task is currently executing, or that no profiled task is executing. Whenever a change in context is detected, either from one profiled task to another or between a profiled task and a non-profiled task, the profiler task filter asserts the cs output and outputs the address at which the change was detected. The profiler task filter also filters the sbb, func, ret, and iAddr signals from the processor for all non-profiled regions of code. In other words, if the currently executing instruction does not fall within a profiled task or code region, then the sbb, func, and ret inputs from the processor are ignored.
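The behavior of the task filter can be sketched in software as follows. This is an illustrative model only, assuming inclusive address ranges and one call per fetched instruction; the class name `TaskFilter` and its `step` method are hypothetical and do not describe the hardware implementation.

```python
from typing import List, Optional, Tuple


class TaskFilter:
    """Software model of a programmable array of (start, end) code regions."""

    def __init__(self, regions: List[Tuple[int, int]]):
        self.regions = regions               # inclusive address ranges to profile
        self.current: Optional[int] = None   # index of executing region, None if unprofiled

    def _lookup(self, iaddr: int) -> Optional[int]:
        for idx, (start, end) in enumerate(self.regions):
            if start <= iaddr <= end:
                return idx
        return None

    def step(self, iaddr: int, sbb: bool = False, func: bool = False,
             ret: bool = False) -> Tuple[bool, bool, bool, bool]:
        """Return (cs, sbb, func, ret) after filtering for one instruction address."""
        region = self._lookup(iaddr)
        cs = region != self.current          # any change of region counts as a context switch
        self.current = region
        if region is None:                   # events outside profiled regions are ignored
            sbb = func = ret = False
        return cs, sbb, func, ret
```

In this model the first instruction fetched inside a profiled region also raises cs, since it is a transition from non-profiled to profiled code.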

2.2 Profiler FIFO
The profiler FIFO monitors the sbb, func, ret, and cs signals from the profiler task filter. Whenever a profile event is detected, the profiler FIFO stores the Tag and Offset for short backwards branches, the originating address of function calls, the return address of function returns, or the instruction address immediately after a context switch, all of which are provided by the profiler task filter. The profiler FIFO includes a small FIFO that stores the address of interest, the short backwards branch offset when needed, and an encoding indicating whether the entry is a short backwards branch, function call, function return, or context switch.

In addition, the profiler FIFO is used to synchronize between the operating frequency of the microprocessor and profiler task filter and the operating frequency of the internal DAProf design, because the microprocessor and profiler task filter may operate at a higher clock frequency. As short backwards branches do not occur on every clock cycle, the internal DAProf design need not operate at the same frequency as the microprocessor. A typical loop of interest within a software application consists of at least two to three instructions in addition to the short backwards branch at the end of the loop. We experimentally determined that the smallest profiled loop within the applications considered consists of 4 instructions. Hence, it should be sufficient to assume that short backwards branches occur on average no more than once every four instructions, implying the internal DAProf design can efficiently operate at one fourth the operating frequency of the microprocessor. However, the profiler FIFO should be large enough to accommodate bursts of short backwards branch activity that may occur periodically as an application executes.

The profiler also needs to monitor function call, function return, and context switch events. Monitoring these additional events is not expected to increase the maximum expected frequency of profiling events. Both function calls and function returns require at least several instructions for maintaining the application's execution stack, which limits their overall frequency. Similarly, a context switch between tasks requires dozens of instructions to store and restore the tasks' contexts.

Figure 2. Architectural overview of Dynamic Application Profiler (DAProf) supporting multitasked applications, consisting of a Profiler Task Filter, Profiler FIFO, Profiler Controller, and Profile Cache (bit widths for profile cache entries shown in parentheses). Profile cache entry fields: Tag (30), Offset (8), CurrIter (14), AvgIter (17), Execs (16), InLoop (1), InFunc (1), InCS (1).
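To illustrate why the FIFO must absorb bursts even when the average event rate matches the slower profiler clock, the following sketch models FIFO occupancy. It is a simplified, hypothetical model: at most one event is enqueued per processor cycle, and the profiler dequeues one entry on every fourth processor cycle. The function name and the one-fourth clock ratio default are assumptions taken from the discussion above, not a description of the actual hardware.

```python
from collections import deque
from typing import Iterable


def max_fifo_occupancy(event_cycles: Iterable[int], clock_ratio: int = 4) -> int:
    """Peak number of queued events when the consumer runs at 1/clock_ratio speed.

    event_cycles lists the processor cycles at which a profile event is enqueued.
    """
    events = set(event_cycles)
    fifo = deque()
    peak = 0
    last_cycle = max(events, default=0)
    for cycle in range(last_cycle + 1):
        if cycle in events:
            fifo.append(cycle)                 # producer side: processor clock domain
            peak = max(peak, len(fifo))
        if cycle % clock_ratio == clock_ratio - 1 and fifo:
            fifo.popleft()                     # consumer side: one entry per profiler cycle
    return peak
```

With one event every four cycles the queue never holds more than one entry, while four back-to-back events briefly require four entries.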

2.3 Profile Cache
The profile cache is a small memory that maintains the current profiling results and the intermediate information needed for loop identification, iteration and execution profiling statistics, loop execution monitoring, and determining which profile cache entry should be replaced when new loops are executed. We currently consider a profile cache with 32 entries, which is sufficiently large to profile the embedded software applications considered within this paper, although a larger profile cache may be needed for significantly larger applications.

2.3.1 Loop Identification
Profiled loops are identified within the profile cache by the address of the loop's short backwards branch, which serves as the Tag entry for the cache, and by the loop's Offset determined by the profiler FIFO. Considering a 32-bit ARM processor and a byte-addressable memory, the lower two bits of all instruction addresses will be identical. Hence, the profile cache's Tag entry is a 30-bit entry that stores the most significant 30 bits of a loop's short backwards branch address. The profile cache's Offset entry is an 8-bit entry that corresponds to the size of the loop in number of instructions. As described earlier, both the Tag and Offset for a loop are calculated by the profiler FIFO and provide the mechanism for identifying loop bounds.
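The Tag and Offset encoding can be sketched as follows, assuming 32-bit byte addresses with word-aligned instructions. The helper names are hypothetical, and the bounds check mirrors the `Tag - Offset <= address <= Tag` comparison used in the controller pseudocode of Figure 3, applied here to word addresses.

```python
TAG_MASK = (1 << 30) - 1     # 30-bit Tag: upper 30 bits of a 32-bit address
OFFSET_MASK = (1 << 8) - 1   # 8-bit Offset: loop size in instructions


def make_tag(sbb_addr: int) -> int:
    """Drop the two low-order bits, which are identical for all instructions."""
    return (sbb_addr >> 2) & TAG_MASK


def make_offset(sbb_addr: int, target_addr: int) -> int:
    """Backwards branch distance in instructions, truncated to 8 bits."""
    return ((sbb_addr - target_addr) >> 2) & OFFSET_MASK


def loop_contains(tag: int, offset: int, iaddr: int) -> bool:
    """True if the instruction address lies within the loop bounds."""
    word = iaddr >> 2
    return tag - offset <= word <= tag
```

For a branch at 0x8104 targeting 0x80F4, this yields a Tag of 0x2041 and an Offset of 4.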

2.3.2 Iteration/Execution Statistics
The main profiling information stored within the profile cache includes loop executions, average iterations per loop execution, and loop iterations for the current execution.

Loop Executions provides the number of times a loop has been executed throughout the application execution. As DAProf is intended to monitor an application over extended execution periods, the number of loop executions will eventually become saturated regardless of the number of bits used to represent it. DAProf utilizes a 16-bit entry for loop executions, which allows up to 65,535 (2^16 - 1) loop executions to be counted without saturation. As discussed in the following section, whenever a loop's executions become saturated, DAProf's profiler controller adjusts the loop executions for all entries, thereby maintaining a list of relative executions and ensuring that entries do not remain saturated.

Current Iterations provides the number of times a loop has iterated during the current loop execution and is stored within the profile cache as a 14-bit entry. With a 14-bit entry, DAProf can accurately profile loops with a maximum of 16,383 (2^14 - 1) iterations per execution, which is well suited for most applications.

Average Iterations stores the average number of times a loop iterates per execution. As many loops do not iterate a fixed number of times per execution, the average iterations cannot be accurately stored as an integer value. Instead, the profile cache stores the average iterations as a 17-bit fixed-point number with 14 integer bits and 3 fractional bits.
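A possible software model of the 17-bit fixed-point average is sketched below. The update rule is taken from the pseudocode in Figure 3 (seven eighths of the previous average plus one eighth of the current iterations); the truncating shift and the function names are assumptions made for illustration, since the rounding behavior of the hardware is not stated.

```python
FRAC_BITS = 3                 # 3 fractional bits, as stated in the text
AVG_MASK = (1 << 17) - 1      # 17-bit AvgIter entry


def update_avg_iter(avg_fx: int, curr_iter: int) -> int:
    """Return the new fixed-point average after a loop execution ends."""
    curr_fx = curr_iter << FRAC_BITS               # integer count to fixed point
    return ((avg_fx * 7 + curr_fx) >> 3) & AVG_MASK  # divide by 8, truncating


def to_float(avg_fx: int) -> float:
    """Convert the fixed-point value to a float for display."""
    return avg_fx / (1 << FRAC_BITS)
```

Starting from an average of zero, a first execution with 8 iterations gives `to_float(update_avg_iter(0, 8)) == 1.0`, which shows that the weighted average needs several executions to approach the true iteration count.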

2.3.3 Loop/Function/Context Switch Monitoring
The profile cache contains a 1-bit InLoop flag that indicates a loop is currently being executed. The InLoop flag is essential in determining whether the execution of a short backwards branch corresponds to a new execution or to an additional iteration of the current execution. A 1-bit InFunc flag indicates that a loop has called a function that is currently being executed. In addition, a 1-bit InCS flag indicates that a loop's execution has been interrupted by a context switch. The InFunc and InCS flags are essential in ensuring that the InLoop flag for a loop that has called a function, or whose execution has been interrupted by a context switch, is not incorrectly reset.

2.3.4 Associativity
The associativity of the profile cache provides a potential tradeoff between cache size/performance and profiling accuracy. With a fully associative profile cache, the replacement policy must compare all entries within the cache to determine the entry with the smallest total iterations, thereby requiring large hardware resources and reducing the overall performance of DAProf. However, a fully associative profile cache may provide better profiling accuracy, as the replacement policy, which is discussed in more detail in the next section, can select from amongst all cache entries. Decreasing the associativity of the profile cache provides increased performance and smaller area requirements by reducing the number of entries the replacement policy must consider, but at the potential cost of reduced accuracy.

2.3.5 Freshness & Replacement Policy
The replacement policy incorporated within the profile cache uses total loop iterations to determine which entry will be replaced when a new loop is executed: the entry with the lowest total iterations is replaced. The total loop iterations are calculated as the product of the average iterations and executions. While this policy performs relatively well on its own, newly executed loops may not execute or iterate quickly enough to avoid being immediately replaced. To solve this problem, the profile cache includes a 3-bit loop Freshness value that represents how recently a loop has been executed or iterated, where a larger freshness indicates the loop has been executed more recently. The replacement policy only considers loops for replacement if they are not fresh, that is, if their freshness value is zero. A 3-bit freshness entry allows up to seven loops per task to be considered fresh and allows newly executed loops to be profiled for an extended duration before their profile cache entry is considered for replacement.

However, it is necessary to consider the relation between the profile cache's associativity, the freshness, and the number of tasks within the application being profiled. For a profile cache with a small associativity, a large maximum freshness, or an application with a large number of tasks, all entries within the same cache set may be considered fresh, leaving no entry eligible for replacement. To avoid this potential problem, the maximum Freshness value needs to be determined based on the profile cache associativity and the number of tasks being profiled.
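The replacement decision described above can be sketched as a selection over one cache set. This is a hedged illustration: the dictionary keys and the function name are hypothetical, and returning `None` when every entry is fresh is an assumption used only to make the problematic case visible, not a documented behavior of DAProf.

```python
from typing import Dict, List, Optional


def select_replace_index(entries: List[Dict[str, float]]) -> Optional[int]:
    """Pick the non-fresh entry with the lowest total iterations.

    Each entry has 'avg_iter', 'execs', and 'fresh' fields. Total iterations
    are the product of average iterations and executions.
    """
    candidates = [i for i, e in enumerate(entries) if e["fresh"] == 0]
    if not candidates:
        return None  # assumption: signal that every entry in the set is fresh
    return min(candidates, key=lambda i: entries[i]["avg_iter"] * entries[i]["execs"])
```

An entry with a small total but a nonzero freshness is skipped, so a newly detected loop survives until its freshness has decayed to zero.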

2.4 Profiler Controller
Figure 3 provides pseudocode for DAProf's profiler controller. The profiler controller interfaces with the profiler FIFO and updates the profiling results for the current loops within the profile cache. From the profiler FIFO, the profiler controller receives one of the sbb signal (along with the calculated branch offset, iOffset), the func signal, the ret signal, or the cs signal. From the profile cache, it receives the found, foundIndex, and replaceIndex signals. The found and foundIndex signals indicate whether the current short backwards branch is found within the profile cache and at what location. The replaceIndex signal provides the index of the loop entry that will be replaced if the current short backwards branch is not found. In all cases, the address of the instruction of interest is provided by the iAddr signal from the profiler FIFO.

Whenever a short backwards branch is detected, the profiler controller determines whether the loop is found within the cache. If the loop is found and is currently executing, as indicated by the loop's InLoop flag, the short backwards branch indicates a loop iteration and the loop's current iterations are incremented. Otherwise, if the loop is not currently being executed, a new loop execution is detected. For new loop executions, the profiler controller increments the loop's executions, sets the InLoop flag, sets the current iterations to one, decrements the freshness value for all other loops within the current task, and sets the freshness of the current loop to the maximum freshness. Finally, if the profiler controller detects that the loop's executions have become saturated, the executions for all loops are divided by two.

In addition to ensuring that the executions for all loops never remain saturated, this approach provides a mechanism for monitoring the dynamic nature of an application in which loops that were once considered important may no longer be executed as time progresses. Initially, a previously executed loop's high total iterations may ensure the loop is not replaced during profiling. However, after several saturations have been encountered, the loop's reported total iterations will have decreased relative to other loops, and the loop can be replaced if it is no longer executed.

If a loop's backwards branch is not found within the profile cache, the profiler controller replaces the entry within the cache indicated by replaceIndex. The profiler controller initializes this profile cache entry by setting the Tag and Offset to those of the newly profiled loop, setting the executions to one, setting the InLoop flag, setting the current iterations to one, decrementing the freshness value for all other loops within the current task, and setting the freshness of the newly executed loop to the maximum freshness.

    DAProf (iAddr, iOffset, sbb, func, ret, cs, found, foundIndex, replaceIndex):
      if ( cs ) {
        for all i, InCS[i] = InLoop[i]
        for all i, if ( InLoop[i] && (iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) )
          InCS[i] = 0
      }
      if ( func ) {
        for all i, InFunc[i] = InLoop[i]
      }
      else if ( ret ) {
        for all i, if ( (InFunc[i] || InCS[i]) && (iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) ) {
          InFunc[i] = 0
          InCS[i] = 0
        }
      }
      else if ( sbb ) {
        if ( found ) {
          if ( InLoop[foundIndex] )
            CurrIter[foundIndex] = CurrIter[foundIndex] + 1
          else {
            for all i, if ( !InCS[i] ) Fresh[i] = Fresh[i] - 1
            Execs[foundIndex] = Execs[foundIndex] + 1
            CurrIter[foundIndex] = 1
            InLoop[foundIndex] = 1
            Fresh[foundIndex] = MaxFresh
            if ( Execs[foundIndex] == MaxExecs )
              for all i, Execs[i] = Execs[i] >> 1
          }
        }
        else {
          for all i, if ( !InCS[i] ) Fresh[i] = Fresh[i] - 1
          Tag[replaceIndex] = iAddr
          Offset[replaceIndex] = iOffset
          CurrIter[replaceIndex] = 1
          AvgIter[replaceIndex] = 0
          Execs[replaceIndex] = 1
          InLoop[replaceIndex] = 1
          Fresh[replaceIndex] = MaxFresh
          InFunc[replaceIndex] = 0
          InCS[replaceIndex] = 0
        }
      }
      for all i, if ( InLoop[i] && !InFunc[i] && !InCS[i] && !(iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) ) {
        InLoop[i] = 0
        AvgIter[i] = (AvgIter[i]*7 + CurrIter[i])/8
      }

Figure 3. Pseudocode for DAProf profiler controller with support for multitasked applications.

Whenever a context switch is detected, the profiler controller first sets the InCS flags for all currently executing loops, i.e., those loops whose InLoop flag is still set. This can be efficiently implemented simply by copying all InLoop entries to the corresponding InCS entries within the profile cache. The profiler controller then determines which loops, if any, will resume execution as a result of the context switch. Thus, if the address after a context switch falls within the bounds of any loops whose InLoop flag is set, the InCS flag for those loops is reset.
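The controller pseudocode in Figure 3 can be exercised with a small software model, sketched here in Python. This is an illustrative translation rather than the hardware design: addresses and offsets are assumed to share one unit, the average is kept as a float instead of fixed point, freshness is clamped at zero (the pseudocode does not show a clamp), and the `Entry` class and `daprof_step` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

MAX_FRESH = 7        # largest value of a 3-bit freshness field
MAX_EXECS = 0xFFFF   # assumption: saturation value of the 16-bit Execs field


@dataclass
class Entry:
    tag: int = 0
    offset: int = 0
    curr_iter: int = 0
    avg_iter: float = 0.0
    execs: int = 0
    fresh: int = 0
    in_loop: bool = False
    in_func: bool = False
    in_cs: bool = False


def inside(e: Entry, iaddr: int) -> bool:
    """Loop bounds check used throughout Figure 3."""
    return e.tag - e.offset <= iaddr <= e.tag


def daprof_step(cache: List[Entry], iaddr: int, ioffset: int = 0,
                sbb: bool = False, func: bool = False, ret: bool = False,
                cs: bool = False, found: bool = False,
                found_index: int = 0, replace_index: int = 0) -> None:
    if cs:    # mark interrupted loops, except those that resume at iaddr
        for e in cache:
            e.in_cs = e.in_loop and not inside(e, iaddr)
    if func:  # every executing loop is now inside a function call
        for e in cache:
            e.in_func = e.in_loop
    elif ret:  # the loops containing the return address resume
        for e in cache:
            if (e.in_func or e.in_cs) and inside(e, iaddr):
                e.in_func = e.in_cs = False
    elif sbb:
        if found and cache[found_index].in_loop:
            cache[found_index].curr_iter += 1          # another iteration
        else:
            for e in cache:                            # age loops of the current task
                if not e.in_cs and e.fresh > 0:
                    e.fresh -= 1
            if found:                                  # new execution of a known loop
                e = cache[found_index]
                e.execs += 1
                e.curr_iter, e.in_loop, e.fresh = 1, True, MAX_FRESH
                if e.execs == MAX_EXECS:
                    for other in cache:
                        other.execs >>= 1
            else:                                      # allocate a new entry
                cache[replace_index] = Entry(tag=iaddr, offset=ioffset, curr_iter=1,
                                             execs=1, fresh=MAX_FRESH, in_loop=True)
    for e in cache:  # close loops that execution has left
        if e.in_loop and not e.in_func and not e.in_cs and not inside(e, iaddr):
            e.in_loop = False
            e.avg_iter = (e.avg_iter * 7 + e.curr_iter) / 8
```

In this model, a loop that is interrupted by a context switch or a function call keeps its InLoop flag and iteration count, and its average is only updated once an event is observed at an address outside the loop bounds.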

Whenever a function call is detected, the profiler controller sets the InFunc flags for all currently executing loops, i.e., those loops whose InLoop flag is still set. This can be efficiently implemented simply by copying all InLoop entries to the corresponding InFunc entries within the profile cache. Whenever a function return is detected, the profiler controller resets the InFunc and InCS flags for those loops that contain the address of the function return's destination, i.e., the loops from which the corresponding function was called. We note that if a function call is executed from the innermost loop of a nested loop, the InFunc flag for all loops within the nested loop structure will be set. On return from that function call, the profiler controller must reset the InFunc flags for all loops within the nested loops.

For all detected short backwards branches, function calls, function returns, and context switches, the profiler controller checks all entries of the profile cache whose InLoop flag is set and whose InFunc and InCS flags are not set to determine whether the application is still executing within those loops. The InFunc and InCS flags thereby ensure that the InLoop flag is not incorrectly reset during a function call or context switch. If a loop is no longer being executed, the profiler controller resets the InLoop flag and updates the loop's average iterations. DAProf's profiler controller utilizes a weighted average in which the previous average iterations account for 7/8 and the current iterations account for 1/8 of the calculated average iterations, as provided by the following equation:

$$\text{AvgIter}_{\text{new}} = \frac{7 \cdot \text{AvgIter}_{\text{old}} + \text{CurrIter}}{8}$$

3. EXPERIMENTAL RESULTS
We consider three alternative profiler implementations: fully associative, 16-way associative, and 8-way associative DAProf designs. DAProf was implemented in Verilog and synthesized using Synopsys Design Compiler targeting a UMC 0.18 µm technology. For a fully associative implementation, DAProf requires 132,714 gates (2.2 mm²) and can execute at a maximum operating frequency of 460 MHz. The area required for the fully associative DAProf design is approximately 19% of the area of an ARM9 with 32 KB cache implemented within a 0.18 µm technology. The 16-way associative DAProf design requires 93,194 gates (1.5 mm²) with a maximum operating frequency of 529 MHz. Finally, the 8-way associative DAProf design requires only 71,121 gates (1.2 mm²) with a maximum operating frequency of 600 MHz. The 8-way associative DAProf design requires only 10% of the area of an ARM9 processor. We note that DAProf's profile cache is currently implemented using registers. By re-implementing the profile cache using SRAM, we anticipate that a more area-efficient design can be created, although we leave this as future work.

To analyze the accuracy of the DAProf design, we compare the profiling results of DAProf with those of an accurate simulation/instrumentation-based profiling method. We created a set of 12 multitasked applications consisting of two to five tasks, where each task corresponds to an application from the MiBench benchmark suite [11]. Table 1 presents an overview of the various multitasked applications considered. All applications were executed within the RTEMS operating system [17].

Table 1. Overview of multitasked applications composed of applications from the MiBench benchmark suite. Multitasked applications: MT2.1, MT2.2, MT2.3, MT2.4, MT2.5, MT2.6, MT2.7, MT3.1, MT3.2, MT3.3, MT3.4, MT4.1, MT4.2, MT5.1. MiBench benchmarks: RAWDAUDIO, RAWCAUDIO, QSORT, STRINGSEARCH, BITCOUNT, DIJKSTRA, TIFF2RGBA, TIFF2BW, FFT, SUSAN, DJPEG, CJPEG.

We analyzed the profiling accuracy in terms of percent error in reported average iterations and executions for fully associative, 16-way associative, and 8-way associative DAProf designs for the most frequently executed loops within each multitasked application. For each application, the analysis includes the top ten loops overall within the application, in addition to the top two loops within each task, for a combined total of 10 to 16 loops depending on the application. Figure 4 presents the percentage error in average iterations of fully associative, 16-way associative, and 8-way associative DAProf designs for the multitasked applications. The percent error in average iterations is calculated as the sum of differences between the reported and actual average iterations divided by the sum of the actual average iterations.
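The error metric described above can be written out as follows. This is a reconstruction from the prose, assuming that the differences are taken as absolute values so that over- and under-estimates do not cancel, and that the sums run over the N analyzed loops of an application; the symbol names are chosen here for illustration.

```latex
\[
\text{Error}_{\text{AvgIter}} =
\frac{\sum_{i=1}^{N} \left| \text{AvgIter}^{\text{reported}}_{i} - \text{AvgIter}^{\text{actual}}_{i} \right|}
     {\sum_{i=1}^{N} \text{AvgIter}^{\text{actual}}_{i}} \times 100\%
\]
```

Because the denominator sums the actual average iterations, loops with many iterations per execution dominate the metric, while errors on loops with few iterations contribute comparatively little.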

On average, DAProf achieves excellent profiling results, with an error in reported average iterations of 1.3%, 1.3%, and 1.5% for the fully associative, 16-way associative, and 8-way associative implementations, respectively. In the best case, DAProf incurs an error of less than one thousandth of one percent for the application MT3.3, for all three implementations. DAProf produced a maximum error of 6% for the application MT3.4 with the 8-way associative design. This error is primarily due to the execution behavior of one loop within the application that iterates a single time per execution. In these instances, the backwards branch at the end of the loop is never executed, and our profiling approach is unable to detect the loop execution, thus leading to reduced profiling accuracy for loops with similar behavior.

Figure 4. Percentage error in average iterations of fully associative, 16-way associative, and 8-way associative DAProf for the multitasked applications presented in Table 1.

Figure 5 presents the percentage error in loop executions of fully associative, 16-way associative, and 8-way associative DAProf designs for the multitasked applications. Because of unavoidable execution saturations, the loop executions reported by DAProf may not directly correspond to the actual total number of loop executions. Thus, the percent error in reported loop executions is calculated using relative loop executions, in which the number of executions for each loop is expressed as the ratio of the reported loop executions of that loop to the total loop executions of the top loops.

Figure 5. Percentage error in loop executions of fully associative, 16-way associative, and 8-way associative DAProf for the multitasked applications presented in Table 1.

On average, DAProf achieves excellent profiling accuracy for reported loop executions, resulting in an average error of only 0.5% for all associativities. For the applications MT2.1 and MT3.2, the reported loop executions incur an error of 2.6% and 2.2%, respectively. This error can again be attributed to the execution behavior of a few loops that only iterate a single time per execution. For all other applications, a maximum error of 0.5% is achieved.

4. CONCLUSIONS
The dynamic application profiler (DAProf) provides an efficient, non-intrusive profiler capable of accurately profiling multitasked applications executing within an operating system. While a fully associative or 16-way associative design provides slightly improved accuracy, an 8-way associative DAProf can profile an application executing on a 600 MHz processor with an average profiling accuracy of 98.5% and 99.5% for average iterations and loop executions, respectively, while requiring only 10% area overhead. Thus, an 8-way DAProf provides an excellent balance between performance, profiling accuracy, and area.

5. ACKNOWLEDGMENTS
This research was supported in part by Toyota InfoTechnology Center and the National Science Foundation (CNS-0844565).

6. REFERENCES
[1] Anderson, J., L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S.-T. Leung, R. Sites, M. Vandevoorde, C. Waldspurger, W. Weihl. Continuous Profiling: Where Have All the Cycles Gone? ACM Trans. on Computer Systems, Vol. 15, No. 4, 1997.
[2] Arnold, M., B. G. Ryder. A Framework for Reducing the Cost of Instrumented Code. Conf. on Programming Language Design and Implementation (PLDI), 2001.
[3] Bala, V., E. Duesterwald, S. Banerjia. Dynamo: A Transparent Runtime Optimization System. Conf. on Programming Language Design and Implementation (PLDI), 2000.
[4] Ball, T., J. Larus. Efficient Path Profiling. Intl. Symp. on Microarchitecture (MICRO), 1996.
[5] Burger, D., T. M. Austin. The SimpleScalar Tool Set, Version 2.0. University of Wisconsin-Madison Computer Sciences Department Technical Report #1342, June 1997.
[6] Dean, J., J. Hicks, C. Waldspurger, G. Chrysos. ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. Intl. Symp. on Microarchitecture (MICRO), 1997.
[7] Ebcioglu, K., E. Altman, M. Gschwind, S. Sathaye. Dynamic Binary Translation and Optimization. IEEE Trans. on Computers, Vol. 50, 2001.
[8] Gordon-Ross, A., F. Vahid. Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. IEEE Trans. on Computers (TC), Vol. 54, 2005.
[9] Gordon-Ross, A., S. Cotterell, F. Vahid. Exploiting Fixed Programs in Embedded Systems: A Loop Cache Example. IEEE Computer Architecture Letters, January 2002.
[10] Graham, S. L., P. B. Kessler, M. K. McKusick. gprof: a Call Graph Execution Profiler. Symp. on Compiler Construction, 1982.
[11] Guthaus, M., J. Ringenberg, D. Ernst, T. Austin, T. Mudge, R. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. Workshop on Workload Characterization, 2001.
[12] Hazelwood, K., A. Klauser. A Dynamic Binary Instrumentation Engine for the ARM Architecture. Conf. on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 2006.
[13] Lakshminarayana, G., et al. Common-Case Computation: A High-Level Technique for Power and Performance Optimization. Design Automation Conference (DAC), 1999.
[14] Lee, L. H., B. Moyer, J. Arends. Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops. Intl. Symp. on Low Power Electronics and Design (ISLPED), 1999.
[15] Lysecky, R., G. Stitt, F. Vahid. Warp Processors. ACM Trans. on Design Automation of Electronic Systems (TODAES), Vol. 11, No. 3, 2006.
[16] Nair, A., R. Lysecky. Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization. Conf. on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 2008.
[17] Real-Time Operating System for Multiprocessor Systems (RTEMS), http://www.rtems.org, 2008.
[18] Tony, J., M. Khalid. Profiling Tools for FPGA-Based Embedded Systems: Survey and Quantitative Comparison. Journal of Computers, Vol. 3, No. 6, June 2008.
[19] Zhang, X., Z. Wang, N. Gloy, J. Chen, M. Smith. System Support for Automatic Profiling and Optimization. Intl. Symp. on Operating Systems Principles, 1997.
