Efficient Hardware-Based Non-Intrusive Dynamic Application Profiling


AJAY NAIR, KARTHIK SHANKAR, AND ROMAN LYSECKY
University of Arizona

Application profiling – the process of monitoring an application to determine the frequency of execution within specific regions – is an essential step within the design process for many software and hardware systems. Profiling is often a critical step within hardware/software partitioning utilized to determine the critical kernels of an application. In this paper, we present an innovative, non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application’s short backwards branches, function calls, and function returns. The resulting profile information provides an accurate characterization of the frequently executed loops within the application providing a breakdown of loop executions versus loop iterations per execution. DAProf achieves excellent profiling accuracy with an average accuracy of 98% for loop executions, 97% for average iterations per execution, and 95% for percentage of execution time. In addition, the presented dynamic application profiler incurs as little as 11% area overhead compared to an ARM9 microprocessor. DAProf is ideally suited for rapidly profiling software applications and dynamic optimization approaches such as dynamic hardware/software partitioning in which detailed loop execution information is needed to provide accurate performance estimates.

Categories and Subject Descriptors: C.4 [Computer Systems Organization] Performance of Systems – Measurement Techniques.
General Terms: Design, Performance
Additional Key Words and Phrases: Profiling, non-intrusive profiling, dynamic optimization, dynamic hardware/software partitioning, critical kernels, embedded systems.
ACM Reference Format: Nair, A., Shankar, K., and Lysecky, R. 2010. Efficient Hardware-Based Non-Intrusive Dynamic Application Profiling. ACM Trans. Embedd. Comput. Syst. ?, ?, Article ? (? 2010), 28 pages. DOI = ? http://doi.acm.org/?

________________________________________________________________________

1. INTRODUCTION Application profiling – the process of monitoring an application to determine the frequency of execution within specific regions – is an essential step within the design process for many software and hardware systems. Profiling has long been utilized to identify the most frequently executed regions of a software application such that software developers can focus their efforts on optimizing those regions [Graham et al. 1982].

This research was supported by the Toyota InfoTechnology Center and the National Science Foundation (CNS0844565). Authors' addresses: Ajay Nair, Karthik Shankar, and Roman Lysecky, Department of Electrical and Engineering, University of Arizona, Tucson, AZ 85721; email: [email protected], [email protected], [email protected]. Permission to make digital/hard copy of part of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date of appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 2010 ACM 1073-0516/01/0300-0034 $5.00

________________________________________________________________________

Binary translation and dynamic optimization techniques rely on dynamic profiling to determine frequently executed sequences of instructions to improve performance by either caching the binary translation results or by re-compiling the code sequences with higher optimization effort [Bala et al. 2000; Chernoff et al. 1998; Ebcioglu et al. 2001; Klaiber 2000]. Profiling has been utilized to create specialized software [Calder et al. 1997; Chung et al. 2001; Yellin 2003] or hardware implementations [Lakshminarayana et al. 1999], from which an application can statically or dynamically select to improve performance or reduce power consumption. A small low-power loop cache can be used to store frequently executed loops determined through application profiling [Bellas et al. 1999; Gordon-Ross et al. 2002; Lee et al. 1999]. Profiling is also a critical step within hardware/software partitioning approaches in which an application is partitioned into software executing on a microprocessor and one or more hardware coprocessors. Profiling is often utilized to determine the frequently executed regions – or critical kernels – of an application. Partitioning these critical kernels to hardware has been shown to provide application speedups of 10-100X [Bellas et al. 1999; Guo et al. 2005; Keane et al. 2004; Lysecky et al. 2006; Venkataramani et al. 2001] or reduce energy consumption by as much as 99% [Henkel 1999; Stitt et al. 2004; Stitt and Vahid 2002]. Such approaches are effective because many software applications follow the 90-10 rule of thumb that estimates 90% of an application’s execution time is spent executing 10% of the application’s code. The performance speedup ($S_{HW/SW}$) after partitioning one or more software loops to hardware can be estimated using the following equations:

$$S_{HW/SW} = \frac{T_{SW}}{T_{HW/SW}}$$

$$T_{HW/SW} = T_{SW} - T_{SW(Loop)} + T_{HW(Loop)} + T_{Comm}$$

$$T_{Comm} = Execs \times (T_{Init} + T_{Sync})$$

where $T_{SW}$ is the software only execution time, $T_{HW/SW}$ is the execution time of the partitioned application, $T_{SW(Loop)}$ is the software only execution time of the loops being partitioned to hardware, $T_{HW(Loop)}$ is the hardware execution time of the partitioned loops, and $T_{Comm}$ is the communication time required for initializing and synchronizing with the hardware implementation. The communication time can be calculated as the number of times the hardware is executed ($Execs$) multiplied by the sum of the initialization time ($T_{Init}$) and synchronization time ($T_{Sync}$), where $T_{Init}$ and $T_{Sync}$ are the time required to transfer any required data between the software and hardware before and after the hardware loop execution. While the hardware execution time can be estimated using total loop

iterations, the communication requirements depend on the number of times the loop is executed and can have a significant impact on overall speedup. Loops with more executions and fewer iterations per execution will have greater communication requirements compared to similar loops with fewer executions but more iterations per execution. Consider an application in which two potential loops, Loop A and Loop B, have been identified as candidates for partitioning to hardware, but hardware resources are only available to implement one of the partitioned loops. As reported by the frequent loop detection profiler [Gordon-Ross and Vahid 2005], Loop A has a total iteration count of 10,000 whereas Loop B has a total iteration count of 12,000. Furthermore, it is known that Loop A and Loop B account for 33% and 40% of the execution time, respectively. Without additional information, a hardware/software partitioning approach will select Loop B to implement in hardware, as dictated by Amdahl’s Law. However, if Loop A executes 5 times and iterates 2,000 times per execution, and Loop B executes 6,000 times and iterates 2 times per execution, the communication requirements of Loop B may severely impact the overall speedup, and Loop A may be a better candidate for partitioning to hardware. Hence, detailed loop execution information, including loop executions and iterations per execution, is essential to avoid suboptimal partitioning results. While static – or offline – profiling is feasible for many applications, dynamic profiling is essential for most dynamic optimization techniques. For example, warp processing is a recent computing technology that dynamically and autonomously reimplements critical software kernels as hardware coprocessors within an on-chip FPGA [Lysecky et al. 2006].
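The Loop A versus Loop B comparison above can be made concrete with a small numeric sketch of the speedup equations. The execution fractions and execution counts come from the example; the software-only time of 1,000,000 cycles, the 10X hardware speedup of the loop itself, and the 20-cycle initialization and synchronization times are hypothetical values chosen only for illustration.

```python
def partitioned_speedup(t_sw, loop_fraction, execs, hw_loop_speedup, t_init, t_sync):
    """Estimate S_HW/SW when a single loop is moved to hardware.

    All times are in cycles. hw_loop_speedup, t_init, and t_sync are
    hypothetical inputs for illustration, not measured values.
    """
    t_sw_loop = t_sw * loop_fraction          # T_SW(Loop)
    t_hw_loop = t_sw_loop / hw_loop_speedup   # T_HW(Loop)
    t_comm = execs * (t_init + t_sync)        # T_Comm
    t_hw_sw = t_sw - t_sw_loop + t_hw_loop + t_comm
    return t_sw / t_hw_sw


T_SW = 1_000_000  # assumed software-only execution time in cycles

# Loop A: 33% of execution time, 5 executions of 2,000 iterations each.
speedup_a = partitioned_speedup(T_SW, 0.33, execs=5,
                                hw_loop_speedup=10, t_init=20, t_sync=20)
# Loop B: 40% of execution time, 6,000 executions of 2 iterations each.
speedup_b = partitioned_speedup(T_SW, 0.40, execs=6000,
                                hw_loop_speedup=10, t_init=20, t_sync=20)

print(f"Loop A: {speedup_a:.2f}x, Loop B: {speedup_b:.2f}x")
```

Under these assumed values, Loop A yields the larger overall speedup even though Loop B accounts for more of the execution time, because Loop B pays the communication cost 6,000 times. Different communication costs would shift the balance, which is why the execution count must be profiled rather than guessed.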
A warp processor dynamically detects a software binary’s critical kernels, re-implements those kernels as a custom hardware circuit in an on-chip FPGA, and replaces the software kernel by a call to the new hardware implementation of that kernel, all without any designer effort or knowledge thereof. As with many dynamic optimization approaches, warp processing relies on accurate, dynamic profiling to determine which software kernels are potential candidates for hardware implementation. In this paper, we present an area efficient, non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application’s short backwards branches, function calls, and function returns to provide detailed loop execution statistics including loop executions and average iterations per execution for frequently executed loops. In Section 2, we provide a comprehensive overview of existing software and hardware based profiling methods, discuss how such approaches

can potentially change the behavior of the application and incur significant runtime overhead, and specifically highlight the non-intrusive, frequent loop detection profiler presented in [Gordon-Ross and Vahid 2005]. In Section 3, we present a detailed overview of our original DAProf design presented in [Nair and Lysecky 2008] that profiles an application by only monitoring short backwards branches and provides detailed information regarding loop execution behavior, including the breakdown of loop executions versus average iterations per execution – providing both additional profiling information and improved accuracy compared to the frequent loop detection profiler. In Section 4, we introduce the problem of function call interference that can result in inaccurate profiling and present our extended DAProf design with support for monitoring function calls and function returns that overcomes the problem of function call interference. In Section 5, we present details of various DAProf hardware implementations along with experimental results demonstrating the accuracy of DAProf – both without and with function support – for a significantly expanded set of applications within the MiBench benchmark suite [Guthaus et al. 2001]. Finally, in Section 6, we conclude and discuss future research directions, including a summary of current efforts in supporting multitasked applications.

2. RELATED WORK 2.1 Software-based Profiling Most previous profiling approaches, being intended for desktop computing systems, introduce runtime overhead. In particular, either they insert additional code into the application binary, or they interrupt the processor at particular intervals to sample the processor’s registers. However, for embedded systems, runtime overhead is often not acceptable, since very tight real-time constraints must be met. A common software-based profiling approach involves instrumenting the application by adding code to count frequencies of the desired code regions [Graham et al. 1982; Hazelwood and Klauser 2006]. For example, if we wish to count the frequency of execution of a subroutine, we can add code to the beginning of the subroutine that increments an associated counter variable. Edge profiling is a form of code instrumentation used to determine the frequency of branches, or edges, between basic blocks of the application [Pettis and Hansen 1990]. The frequency of edge transitions is tracked by incorporating simple counters within the resulting application code. [Ball and Larus 1996] proposed a similar instrumentation approach, but with reduced code and runtime overheads, that profiles paths spanning multiple basic blocks by inserting

counters at the end of each target path. Alternatively, a dynamic application instrumentation approach has been proposed that utilizes a virtual machine and just-in-time (JIT) compiler to dynamically insert profiling code within the target application [Hazelwood and Klauser 2006]. In general, instrumentation techniques are flexible and straightforward. However, the additional code added to the application can potentially change the execution behavior of the application and incur performance losses due to the execution of code required for instrumentation and due to the resulting undesired side effects, including cache pollution and register spills [Gordon-Ross and Vahid 2005]. To reduce runtime overhead, other profiling approaches use statistical sampling techniques [Anderson et al. 1997; Dean et al. 1997; Zhang et al. 1997]. Such methods either interrupt the microprocessor at certain intervals or create an additional software task for profiling and then read the program counter and other internal registers to statistically determine execution behavior. For example, ProfileMe [Dean et al. 1997] randomly selects instructions to record dynamic execution information, including the current program counter, status of the pipeline stage, cache misses, etc., by utilizing an interrupt to monitor the application and statistically creating an application profile. Statistical sampling techniques can achieve average profiling accuracy of 85% with less than 3% runtime overhead [Anderson et al. 1997]. For embedded systems, both instrumentation and statistical profiling approaches potentially change the behavior of the application and incur significant runtime overhead. In the case of real-time systems, which are usually designed with very tight timing constraints, the slightest runtime overhead can lead to missed deadlines and potential system failure.
Another approach to software based profiling is simulation, in which an application is run on an instruction set simulator – such as the SimpleScalar simulator [Burger and Austin 1997] – with the simulator keeping track of detailed profiling information. While accurate, simulation based profiling is extremely slow, especially when simulating a system-on-a-chip (SOC), where simulating an application for several hours may cover only a few seconds of real time, thereby limiting how much of an application’s execution can be realistically profiled. Furthermore, setting up such simulations can be difficult if not impossible for embedded systems, due to the complex external environments that may also need to be modeled. 2.2 Hardware-based Profiling Due to the limitations of these software profiling methodologies, designers have often resorted to using hardware based profiling approaches. Logic analyzers can be used to

profile an application by attaching the analyzer’s probes to the bus being monitored. While once effective, logic analyzers are no longer effective solutions because the high degree of integration within today’s SOCs prohibits direct access to a processor’s instruction bus. Many SOCs provide a JTAG [IEEE 2001] interface that can potentially be utilized to halt the processor’s execution and read internal registers within the microprocessor or SOC. While useful for validation, verification, and debugging, the JTAG interface is an inefficient method for profiling an application as it requires significant runtime overhead and changes the execution behavior of the application. Alternatively, many embedded processors provide trace/debug interfaces that can be utilized to monitor the execution of a software application in real-time [ARM Ltd. 2009]. While suitable for development purposes in which the trace/debug data can be processed by a host computer, processing trace/debug data in real-time requires significant processing capabilities that typically exceed the capabilities of the processor being profiled. Some microprocessors include on-chip hardware to assist software developers in profiling an executing program [Intel Corp. 2005; Sprunt 2002; Zagha et al. 1996]. Such hardware consists primarily of event counters to monitor events such as cache misses, pipeline stalls, branch mispredictions, etc. Some counters may be configured to detect a particular bus address, but there are typically only a few of these as having many of them would be costly in terms of size. Thus, the programmer must dynamically reconfigure those counters during program execution to obtain a more complete profile. Not only can this lead to inaccuracy, but reconfiguring those counters also requires additional software instructions. While several approaches have been proposed to simplify and standardize the application interface to available hardware counters [Berrendorf et al. 2003; Brown et al. 
2000], these approaches incur significant runtime overhead with similar ramifications as software-based profiling methods. Reconfigurable hardware counters [Schulz et al. 2005; Shannon and Chow 2004] can be directly synthesized along with processor cores to monitor communication between software and hardware processing components within multiprocessor systems. These reconfigurable hardware-based counters could be utilized to profile the software execution itself but would require additional instructions to communicate with and reconfigure the hardware counters. Such an approach would incur the same kinds of overhead as instrumentation based profiling methods or processor based performance counters – albeit with significantly reduced magnitude. ProMem [Lysecky et al. 2002] is a non-intrusive hardware approach that incorporates an efficient pipelined binary tree structure that can be utilized to non-intrusively observe

target patterns on a processor’s instruction bus. The ProMem design provides a profiling throughput of one pattern per cycle, allowing it to be integrated with any target processor. However, for profiling software applications, the target patterns of the profiler must be preloaded with the addresses of instructions one wishes to monitor, such as branch instructions or basic block entry points. Because these addresses are not known a priori, dynamically profiling a software application using ProMem would incur considerable overhead to determine the target patterns and sort those patterns into the required tree organization. Altera’s performance counter core [Altera, Inc. 2009; Tong and Khalid 2007] can be utilized to efficiently profile software executing on the Nios2 soft processor core with minimal insertion of additional instructions to profile user specified regions of code. For each region of code a designer wishes to profile, simple macros can be inserted to software applications to start and stop the hardware cycle counters. While providing minimally obtrusive behavior for enabling and disabling hardware counters, execution statistics beyond cycle counts cannot directly be monitored. Furthermore, all regions of code to be profiled must be determined a priori in order to insert the necessary code to profile those regions. Similarly, a programmable coprocessor for profiling has been proposed [Zilles and Sohi 2001] to offload common profiling tasks from the main processor thereby significantly decreasing profiling overhead. However, as with most coprocessor implementations, the main processor is directly responsible for controlling the profiling. In addition, the coprocessor executes instructions to perform the profiling tasks, potentially requiring dozens of instructions to process each event being profiled – potentially limiting the profiling coprocessor’s ability to profile detailed execution behavior of software loops. 
Of notable interest is the frequent loop detection profiler that non-intrusively monitors the instruction addresses seen on the memory bus and profiles loop iterations by monitoring short backwards branches [Gordon-Ross and Vahid 2005]. A short backwards branch (sbb) is any branch instruction whose target address has a short negative offset; such branches are typically used to branch backwards at the end of a loop iteration. Whenever a short backwards branch occurs, the frequent loop detection profiler updates a small cache – perhaps just 16 or 32 entries – that stores the frequencies of the short backwards branches. When any of the registers storing the branch frequencies become saturated, the profiler shifts all cache entries right by one bit, thereby maintaining a list of relative branch frequencies – or relative total loop iteration counts – while ensuring all branch

Fig. 1. Dynamic Application Profiler (DAProf) overview including microprocessor and DAProf integration and DAProf architecture consisting of short backwards branch FIFO (SBB FIFO), Profiler Controller, and Profile Cache. Profile cache entry fields, with bit widths shown in parentheses: Tag (30), Offset (8), CurrIter (14), AvgIter (17), Execs (16), InLoop (1), Fresh (3).

frequencies do not eventually become saturated. This profiling approach can identify the most frequently executed loops of embedded applications and was utilized with the warp processor design previously mentioned [Lysecky et al. 2006]. However, for hardware/software partitioning approaches utilizing profiling to guide the partitioning process, the frequent loop detection profiler only provides a relative ranking of loop execution – or more specifically total loop iterations – to guide the order in which loops are analyzed for hardware implementation without any a priori knowledge. Without further simulation or application analysis, such limited profiling information may lead to suboptimal hardware/software partitioning results, as performance improvements cannot be accurately estimated using only the relative total iteration counts for frequently executed loops. The breakdown between loop executions and iterations per execution can have a significant impact on performance due to communication and synchronization requirements, as described earlier.
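The saturate-and-shift behavior of the frequent loop detection profiler can be sketched as follows. This is a simplified illustration rather than the design in [Gordon-Ross and Vahid 2005]: the counter width is an assumed parameter, the class and method names are hypothetical, and the limited cache capacity and its replacement behavior are omitted.

```python
class FrequentLoopCounters:
    """Relative iteration counts keyed by short-backwards-branch address."""

    def __init__(self, counter_bits=8):
        # counter_bits is an assumed width chosen for illustration.
        self.max_count = (1 << counter_bits) - 1
        self.counts = {}

    def record_sbb(self, branch_addr):
        self.counts[branch_addr] = self.counts.get(branch_addr, 0) + 1
        if self.counts[branch_addr] >= self.max_count:
            # One counter saturated: shift every counter right by one bit
            # so that the relative ordering of the loops is preserved.
            for addr in self.counts:
                self.counts[addr] >>= 1


counters = FrequentLoopCounters(counter_bits=4)  # counters saturate at 15
for _ in range(20):
    counters.record_sbb(0x8000)
for _ in range(6):
    counters.record_sbb(0x8040)
```

After the shift, the counters no longer hold absolute iteration counts, only a relative ranking, and nothing distinguishes a loop that ran once for many iterations from one that ran many times for a few iterations each. This is the limitation that motivates tracking executions and iterations per execution separately.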

3. DYNAMIC APPLICATION PROFILER (DAPROF) We present an area efficient, non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application’s short backwards branches to provide detailed loop execution statistics including loop executions and average iterations per execution for frequently executed loops. Figure 1 presents an overview of our original dynamic application profiler [Nair and Lysecky 2008] design, highlighting its integration within a microprocessor based system and

internal profiling architecture. DAProf non-intrusively monitors the microprocessor’s instruction bus to determine the address of the currently executed instruction whenever a short backwards branch is executed. The DAProf design considers a short backwards branch as any branch instruction whose target address is a negative offset of less than 1024, which corresponds to small loops containing fewer than 256 instructions. While DAProf could decode the instructions seen on the instruction bus, we currently assume the microprocessor provides a one-bit output, sbb, indicating a short backwards branch has been executed. Such support would require minor modification to a microprocessor’s decoding logic, and we note that the detection of short backwards branches is already available within the M*CORE processor [Scott et al. 1999]. The DAProf design consists of a short backwards branch FIFO for synchronizing between the microprocessor and profiler, a profile cache that stores all relevant profile statistics for those loops currently being profiled, and a profiler controller that analyzes the short backwards branches to update the profiling statistics within the profile cache. We note that while DAProf monitors short backwards branches much in the same way as the frequent loop detector, the design of the profile cache, the profiling methodology, and the provided profiling statistics are significantly different. 3.1 Short Backwards Branch FIFO (SBB FIFO) The short backwards branch FIFO (SBB FIFO) monitors the microprocessor’s instruction bus and sbb output signal. Whenever a short backwards branch occurs, the SBB FIFO will determine the branch instruction’s address and offset and store both values within a small internal FIFO. The offset is the number of instructions within the identified loop and is used along with the branch instruction’s address to represent the beginning and end of each loop within the profiler.
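As a rough software illustration of the short backwards branch criterion and the offset recorded by the SBB FIFO, the sketch below classifies a branch from its own address and its target address. It assumes 4-byte instructions and measures the offset as the distance between the two addresses in instructions; whether the branch instruction itself is counted is not specified here, and the function name is hypothetical.

```python
INSTRUCTION_BYTES = 4        # assuming fixed-width 32-bit instructions
MAX_SBB_OFFSET_BYTES = 1024  # negative offsets below this count as "short"


def sbb_offset(branch_addr, target_addr):
    """Return the loop size in instructions if the branch is a short
    backwards branch, otherwise None."""
    distance = branch_addr - target_addr  # positive when branching backwards
    if 0 < distance < MAX_SBB_OFFSET_BYTES:
        return distance // INSTRUCTION_BYTES
    return None


print(sbb_offset(0x1010, 0x1000))  # backwards by 16 bytes
print(sbb_offset(0x1000, 0x1010))  # forward branch, not an sbb
```

With these assumptions the largest offset is 255 instructions, so any value the function returns fits in eight bits.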
In addition, the SBB FIFO is used to synchronize between the microprocessor and internal DAProf design because the microprocessor may operate at a higher clock frequency. As short backwards branches do not occur on every clock cycle, the internal DAProf profiler design need not operate at the same frequency of the microprocessor. A typical loop of interest within a software application consists of at least two to three instructions in addition to the short backwards branch at the end of the loop. We experimentally determined that the smallest profiled loop within the applications considered consists of 4 instructions. Hence, it should be sufficient to assume that short backwards branches on average occur no more than once every four instructions, implying the internal DAProf design can efficiently operate at one fourth the operating frequency of the microprocessor. However, the profiler FIFO should be large enough to

accommodate bursts of short backwards branch activity that may occur periodically as the application executes. We experimentally determined that a FIFO size of four entries is sufficient for the applications considered within this paper. 3.2 Profile Cache The profile cache is a small cache that maintains the current profiling results and intermediate information needed for loop identification, iteration and execution profiling statistics, loop execution monitoring, and determining which profile cache entry should be replaced when new loops are executed. We currently consider a profile cache with 32 entries, which is sufficiently large to accurately profile most of the embedded software applications considered within this paper – although a larger profile cache would have yielded better accuracy for one application and may be needed for other large applications. 3.2.1 Loop Identification. Profiled loops are identified within the profile cache by the address of the loop’s short backwards branch, which serves as the Tag entry for the cache, and by the loop’s Offset determined by the SBB FIFO. Considering a 32-bit ARM processor and a byte addressable memory, the lower two bits for all instructions will be identical. Hence, the profile cache’s Tag entry is a 30-bit entry that stores the most significant 30 bits of a loop’s short backwards branch address. The profile cache’s Offset entry is an 8-bit entry that corresponds to the size of the loop in number of instructions. As described earlier, both the Tag and Offset for a loop are calculated by the SBB FIFO and provide the mechanism for identifying loop bounds. 3.2.2 Iteration/Execution Statistics. The main profiling information stored within the profile cache includes loop executions, average iterations per loop execution, and loop iterations for the current execution. Loop Executions provide the number of times a loop has been executed throughout the application execution.
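The layout of a profile cache entry can be summarized in a short sketch. The field widths are those listed in Fig. 1; the class name, field names, and helper function are hypothetical and serve only to illustrate how the Tag is derived and how many storage bits the 32-entry cache would hold.

```python
from dataclasses import dataclass

# Field widths in bits, as listed in Fig. 1.
FIELD_BITS = {"tag": 30, "offset": 8, "curr_iter": 14,
              "avg_iter": 17, "execs": 16, "in_loop": 1, "fresh": 3}


@dataclass
class ProfileCacheEntry:
    tag: int = 0           # upper 30 bits of the sbb address
    offset: int = 0        # loop size in instructions
    curr_iter: int = 0     # iterations in the current execution
    avg_iter: int = 0      # 17-bit average iterations per execution
    execs: int = 0         # loop executions
    in_loop: bool = False  # loop currently executing
    fresh: int = 0         # freshness used by the replacement policy


def tag_from_address(branch_addr):
    """Drop the two low address bits, which are identical for every
    word-aligned 32-bit instruction."""
    return (branch_addr & 0xFFFFFFFF) >> 2


bits_per_entry = sum(FIELD_BITS.values())
total_bits = 32 * bits_per_entry
print(bits_per_entry, total_bits)
```

This counts storage bits only. It says nothing about the comparison and control logic, which also contribute to the area of the profiler.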
As the DAProf is intended to monitor an application over extended execution periods, regardless of the number of bits used to represent loop executions, the number of loop executions will eventually become saturated. The DAProf utilizes a 16-bit entry for loop executions that allows 65,536 loop executions to be profiled without saturations. As discussed in the following section, whenever a loop’s executions become saturated, DAProf’s profiler controller will adjust the loop executions for all entries, thereby maintaining a list of relative executions and ensuring all entries do not eventually become saturated. The Current Iterations provides the number of times a loop has iterated for the current loop execution and is stored within the profile cache as a 14-bit entry. As a 14-bit

entry, DAProf can accurately profile loops with a maximum of 16384 iterations per execution, which is well suited for most applications. The Average Iterations stores the average number of times a loop iterates per loop execution. As many loops do not iterate a fixed number of times per loop execution, the average iterations cannot be accurately stored as an integer value. Instead, the profile cache stores the average iterations as a 17-bit fixed point number using 14 bits to represent the integer portion and 3 bits for the fractional part. We note that the number of bits required to represent the integer portion of the fixed point number is equal to the number of bits used to store Current Iterations. 3.2.3 Loop Execution Monitoring. The profile cache contains a one-bit InLoop flag that is utilized to indicate a loop is currently being executed. The InLoop flag is essential in determining if the execution of a short backwards branch corresponds to a new loop execution or an additional iteration for the current execution. 3.2.4 Freshness and Replacement Policy. Although many replacement policies were considered and analyzed, including least recently used and estimated total instructions executed, the replacement policy incorporated within the profile cache uses total loop iterations to determine which entry within the profile cache will be replaced when a new loop is executed, where the entry with the lowest total iterations will be replaced. Total loop iterations are calculated as the product of the average iterations and executions. While this policy performs relatively well on its own, newly executed loops may not execute or iterate quickly enough to avoid being immediately replaced. To solve this problem, the profile cache includes a 3-bit loop Freshness value that represents how recently a loop has been executed or iterated, where a larger freshness indicates the loop has been more recently executed and is thus considered fresh.
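Returning briefly to the 17-bit fixed point Average Iterations of Section 3.2.2, the sketch below shows one way such a value could be encoded and updated with the weighted average used in Fig. 2, AvgIter = (AvgIter * 7 + CurrIter) / 8. The truncating shift and the helper names are assumptions made for illustration; the rounding behavior of the actual hardware is not specified here.

```python
FRAC_BITS = 3   # fractional bits of Average Iterations
INT_BITS = 14   # integer bits, the same width as Current Iterations
AVG_MAX = (1 << (INT_BITS + FRAC_BITS)) - 1


def to_fixed(value):
    """Encode a non-negative number as 14.3 fixed point, clamped to 17 bits."""
    return min(int(round(value * (1 << FRAC_BITS))), AVG_MAX)


def to_float(fixed):
    """Decode a 14.3 fixed point value."""
    return fixed / (1 << FRAC_BITS)


def update_average(avg_fixed, curr_iter):
    """Apply AvgIter = (AvgIter * 7 + CurrIter) / 8 in fixed point.

    curr_iter is an integer, so it is first aligned to the 3 fractional
    bits; the division by 8 is a truncating right shift (an assumption).
    """
    return (avg_fixed * 7 + (curr_iter << FRAC_BITS)) >> 3


avg = to_fixed(10.0)            # a loop that has averaged 10 iterations
avg = update_average(avg, 12)   # one more execution with 12 iterations
print(to_float(avg))
```

Three fractional bits give a resolution of 0.125 iterations. Because the divisor is 8, both the multiplication by 7 and the division reduce to shifts and one subtraction in hardware, which is presumably why this weighting is convenient.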
The freshness value is utilized within the replacement policy to only consider loops for replacement if the loops are not fresh – a loop that is not fresh has a freshness value of zero. A 3-bit freshness entry allows seven loops to be considered fresh and allows newly executed loops to be profiled for an extended duration before their profile cache entry will be considered during replacement. 3.2.5 Associativity. The Associativity of the profile cache potentially provides a tradeoff between cache size and cache performance and the accuracy of the profiling results. With a fully associative profile cache, the replacement policy must compare all entries within the cache to determine the entry with the smallest total iterations, thereby requiring large hardware resources and reducing the overall performance of DAProf. However, a fully associative profile cache may provide better profiling accuracy as the

DAProf (iAddr, iOffset, found, foundIndex, replaceIndex):
    if ( found ) {
        if ( InLoop[foundIndex] )
            CurrIter[foundIndex] = CurrIter[foundIndex] + 1
        else {
            for all i, Fresh[i] = Fresh[i] - 1
            Execs[foundIndex] = Execs[foundIndex] + 1
            CurrIter[foundIndex] = 1
            InLoop[foundIndex] = 1
            Fresh[foundIndex] = MaxFresh
            if ( Execs[foundIndex] == MaxExecs )
                for all i, Execs[i] = Execs[i] >> 1
        }
    }
    else {
        for all i, Fresh[i] = Fresh[i] - 1
        Tag[replaceIndex] = iAddr
        Offset[replaceIndex] = iOffset
        CurrIter[replaceIndex] = 1
        AvgIter[replaceIndex] = 0
        Execs[replaceIndex] = 1
        InLoop[replaceIndex] = 1
        Fresh[replaceIndex] = MaxFresh
    }
    for all i, if ( InLoop[i] && !(iAddr <= Tag[i] && iAddr >= Tag[i]-Offset[i]) ) {
        InLoop[i] = 0
        AvgIter[i] = (AvgIter[i] * 7 + CurrIter[i]) / 8
    }

Fig. 2. Pseudocode for initial DAProf’s profiler controller.

replacement policy can select from amongst all cache entries. Decreasing the associativity of the profile cache provides increased performance and smaller area requirements by reducing the number of entries the replacement policy must consider. However, a decreased associativity may result in an undesirable situation in which many of the most frequently executed loops are mapped into the same cache set. In this scenario, the replacement policy may be forced to replace an important loop, even though less important loops are present within the other cache sets. Finally, one must consider the relation between the profile cache’s associativity and freshness. For a profile cache with a small associativity and large maximum freshness, all entries within the same cache set may be considered fresh, leaving no entry eligible for replacement. To avoid this potential problem, the maximum freshness should be no greater than one half of the profile cache associativity. 3.3 Profiler Controller Figure 2 provides pseudocode for DAProf’s profiler controller. The profiler controller interfaces with the SBB FIFO and updates the profiling results for the current loops within the profile cache. The profiler controller receives the short backwards branch address (iAddr) and offset (iOffset) from the SBB FIFO in addition to the found, foundIndex, and replaceIndex signals from the profile cache. The found and foundIndex

signals indicate if the current short backwards branch is found within the profile cache and at what location. The replaceIndex provides the index for the loop entry that will be replaced if the current short backwards branch is not found within the profile cache. Whenever a short backwards branch is available from the SBB FIFO, the profiler controller will determine if the loop is found within the cache. If the loop is found and the loop is currently executing – as indicated by the loop’s InLoop flag – the short backwards branch execution indicates a loop iteration has been detected and the loop’s current iterations are incremented. Otherwise, if the loop is not currently being executed, a new loop execution is detected. For new loop executions, the profiler controller increments the loop’s executions, sets the InLoop flag, sets the current iterations to one, decrements the freshness value for all other loops, and sets the freshness of the current loop to the maximum freshness. Finally, if the profiler controller detects that the loop’s executions have become saturated, the executions for all loops will be divided by two. In addition to ensuring that the executions for all loops never become saturated, this approach provides a mechanism for monitoring the dynamic nature of an application in which loops that were once considered important may no longer be executed as time progresses. Initially, a previously executed loop’s high total iterations may ensure the loop is not replaced during profiling. However, after several saturations have been encountered, the loop’s reported total iterations will have decreased relative to other loops, and the loop can be replaced if it is no longer executed. If a loop’s short backwards branch is not found within the profile cache, the profiler controller will replace the entry within the cache as indicated by replaceIndex.
The profiler controller initializes this profile cache entry by setting the Tag and Offset to those of the newly profiled loop, setting the executions to one, setting the InLoop flag, setting the current iterations to one, decrementing the freshness value for all other loops, and setting the freshness of the newly executed loop to the maximum freshness. For all detected short backwards branches, the profiler controller checks all entries of the profile cache whose InLoop flag is set to determine if the application is still executing within those loops, which can be determined by checking if the currently detected short backwards branch is contained within an entry’s loop bounds. If a loop is no longer being executed, the profiler controller resets the InLoop flag and updates the loop’s average iterations. The average iteration calculation has a significant impact on the profiler’s accuracy, hardware requirements, and performance. For example, a straightforward method for calculating average iterations computes the exact average using the equation:

Fig. 3. Comparison of floating point exact average calculation, integer exact average calculation, 7/8 ratio average calculation, and real average over the past one hundred loop executions for a frequently executed loop within cjpeg.

\[ \mathit{AvgIter}_i = \frac{\mathit{AvgIter}_i \times (\mathit{Exec}_i - 1) + \mathit{CurrIter}_i}{\mathit{Exec}_i}, \]

in which the average iterations are calculated as the sum of the previous total iterations (the previous average iterations multiplied by the previous executions) and the current iterations, divided by the current executions. This method for calculating average iterations provides excellent accuracy across the entire application execution but requires floating point addition, multiplication, and division. Such an implementation would be too costly in terms of area and performance. Alternatively, the profiler controller could perform the same calculation using integer multiplication and division. However, this approach leads to inaccurate profiling results because of the truncation resulting from integer division. As the number of executions increases, the denominator of the calculation will increase to the point that regardless of a loop’s current iterations, the resulting average iterations will remain unchanged. Figure 3 presents the average iterations calculation for the floating point exact average iteration calculation, integer exact average iteration calculation, and the real average iterations over the past 100 loop executions for a frequently executed loop within the cjpeg application of the MiBench benchmark suite. As demonstrated by the real average iterations over the past 100 loop executions, the loop being profiled has significant variation in iterations per execution. While the floating point exact average calculation provides the correct average iterations across the entire application execution, it is unable to provide any means for detecting or adjusting to such dynamic changes in execution behavior. On the other hand, the integer exact average iterations calculation is completely inaccurate. After several loop executions, the calculated average iterations remain constant for the remainder of the application execution.
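The truncation effect described above can be reproduced with a minimal Python sketch. The function names and the synthetic trace (200 executions of 10 iterations followed by 200 executions of 40 iterations) are hypothetical and chosen only for illustration; they are not taken from the cjpeg measurements.

```python
def exact_avg_float(avg, execs, curr_iter):
    """Exact running average; execs includes the execution being added."""
    return (avg * (execs - 1) + curr_iter) / execs


def exact_avg_int(avg, execs, curr_iter):
    """Same update, but with truncating integer division."""
    return (avg * (execs - 1) + curr_iter) // execs


# Hypothetical trace: the loop's behavior changes halfway through.
trace = [10] * 200 + [40] * 200

f_avg, i_avg = 0.0, 0
for n, curr in enumerate(trace, start=1):
    f_avg = exact_avg_float(f_avg, n, curr)
    i_avg = exact_avg_int(i_avg, n, curr)

print(f"floating point exact average: {f_avg:.2f}")
print(f"integer exact average:        {i_avg}")
```

In this trace the integer version never leaves its early value, because each truncation discards the small per-execution increase, whereas the floating point version reports the whole-run average and therefore still hides the change in behavior.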
Instead of relying on an exact average iteration calculation, the DAProf’s profiler controller utilizes a weighted average in which the previous average iterations accounts

Main() {
  ...
  Loop A {
    ...
    Func B ();
    ...
  }
  ...
}

Func B() {
  ...
  Loop FuncB.1 { ... }   // short backwards branch outside of Loop A’s bounds
  ...
  return;
}

Fig. 4. Example of function call interference in which the execution of a loop’s short backwards branch within a function will lead to incorrectly resetting the InLoop flag of any loops in which that function is called.

for 7/8 and the current iterations account for 1/8 of the calculated average iterations, as provided by the following equation:

\[ \mathit{AvgIter}_i = \frac{7 \times \mathit{AvgIter}_i}{8} + \frac{\mathit{CurrIter}_i}{8}, \]

in which the average iterations are calculated using the fixed point representation described earlier. This ratio-based average iteration calculation can be efficiently implemented in hardware while providing excellent accuracy. Using the 17-bit fixed point representation to store the average iterations, this calculation is equivalent to:

\[ \mathit{AvgIter}_i = \frac{7 \times \mathit{AvgIter}_i + \mathit{CurrIter}_i}{8}. \]

Figure 3 further presents the 7/8 ratio average iteration calculation for the selected loop within the cjpeg application. By providing a weighted average, the ratio average iteration calculation is able to capture dynamic changes in loop executions – most closely tracking the real average iterations. Although a 7/8 ratio is utilized within the current DAProf design, other ratios for calculating the average iterations may be utilized to control how quickly or slowly the profiler will adapt to changing loop execution behavior.

4. DAPROF WITH FUNCTION SUPPORT Function call execution from within a loop currently being profiled can lead to profiling interference in which the InLoop flag of a currently executing loop is incorrectly reset. As illustrated in Figure 4, consider a loop, Loop A, that calls a function, Func B, where Func B also contains a loop, FuncB.1. During execution of Loop A, the InLoop flag will initially be set and will remain set until a short backwards branch is executed outside of Loop A’s loop bounds. When the loop FuncB.1 is executed, the initial DAProf design will interpret the execution of FuncB.1’s short backwards branch as indicating Loop A is no longer being executed. This execution behavior results in Loop A’s InLoop flag being reset and average iterations being updated. Furthermore, when Func B returns and Loop A iterates again, this iteration will be incorrectly interpreted as a new loop execution. Within

[Figure 5: block diagram. The microprocessor (µP, with I$ and D$) provides the instruction address (IADDR) and the SBB, FUNC, and RET signals to the dynamic application profiler, which consists of the SBB/FUNC FIFO, the Profiler Controller (IADDR, IOFFSET, SBB, FUNC, RET), and the Profile Cache (FOUND, FOUNDINDEX, REPLACEINDEX). Profile cache entry fields and widths in bits: TAG (30), OFFSET (8), CURRITER (14), AVGITER (17), EXECS (16), INLOOP (1), FRESH (3), INFUNC (1).]

Fig. 5. DAProf with function support overview including modified microprocessor and DAProf integration with additional signals for detecting function calls (func) and function returns (ret), and DAProf architecture consisting of short backwards branch and function FIFO (SBB/FUNC FIFO), Profiler Controller, and Profile Cache with added InFunc entry for monitoring function execution behavior.

the application mad, a loop that is affected by function call interference is incorrectly reported as executing 154 times with average iterations per execution of 252, whereas the correct loop executions and average iterations are 2772 and 14, respectively. Notice that the total loop iterations – calculated as the product of executions and average iterations – are identical (154 × 252 = 2772 × 14 = 38,808). To improve the accuracy of our dynamic application profiler, directly monitoring function call execution behavior is required to avoid function call interference. Figure 5 presents an overview of the extended DAProf design with support for monitoring function calls and function returns. DAProf non-intrusively monitors the instruction bus of the microprocessor to detect the address of short backwards branches, as well as the address of function calls and function returns. While it is possible to directly integrate support for detecting function calls and returns by decoding instructions present on the bus, we currently assume the microprocessor provides a func signal indicating a function call is being executed and provides a ret signal indicating a function has returned. To detect function calls and function returns, DAProf requires the address from which a function was called and the address to which the function returned. Apart from providing support for detecting function calls and returns, the DAProf with function support design also requires modifications to the short backwards branch FIFO, profile cache, and profiler controller to provide the necessary mechanisms for accurately profiling loops without function call interference.

The DAProf design with function support includes a short backwards branch and function FIFO (SBB/FUNC FIFO) that monitors the microprocessor’s instruction bus and sbb, func, and ret signals. Whenever a short backwards branch occurs, the SBB/FUNC FIFO stores the Tag value and Offset value into the FIFO as in the previous SBB FIFO. When a function call or function return is detected, the SBB/FUNC FIFO either stores the originating address of the function call or stores the return address of the function return. Internally, the SBB/FUNC FIFO includes a small FIFO that stores the address of interest, the short backwards branch offset when needed, as well as an encoding indicating if the entry is a short backwards branch, function call, or function return. Although the profiler needs to monitor more profiling events, the combined frequency of short backwards branches, function calls, and function returns is not expected to increase the maximum expected frequency of profiling events. This is evident in the fact that both function calls and function returns require at least several instructions for maintaining the application’s execution stack, which limits their overall frequency. However, the size of the SBB/FUNC FIFO has been increased to eight entries to accommodate small increases in bursts of profiling events. Compared to the initial DAProf design, the profile cache for the DAProf with function support incorporates an additional one-bit InFunc flag that is utilized to indicate a loop has called a function that is currently being executed. The InFunc flag is essential in ensuring that the InLoop flag for a loop that has called a function is not reset during the function’s execution. Figure 6 presents pseudocode detailing the operation of the extended DAProf profiler controller with support for monitoring function calls and function returns.
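As a rough software illustration of the FIFO entry format described above, the following Python sketch models an eight-entry queue whose entries carry an event encoding, the address of interest, and an offset that is only meaningful for short backwards branches. The class and field names are hypothetical, and the behavior on overflow (counting and discarding the event) is an assumption, since it is not specified here.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SBB = 0    # short backwards branch
    FUNC = 1   # function call
    RET = 2    # function return


@dataclass(frozen=True)
class Event:
    kind: Kind
    addr: int         # branch address, call origin, or return destination
    offset: int = 0   # branch distance, used only for SBB events


class SbbFuncFifo:
    """Bounded queue between the instruction bus monitor and the controller."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.entries = deque()
        self.dropped = 0          # assumed overflow handling: count and discard

    def push(self, event: Event) -> bool:
        if len(self.entries) >= self.depth:
            self.dropped += 1
            return False
        self.entries.append(event)
        return True

    def pop(self):
        return self.entries.popleft() if self.entries else None


fifo = SbbFuncFifo()
fifo.push(Event(Kind.FUNC, 0xE0))
fifo.push(Event(Kind.SBB, 0x200, 0x10))
fifo.push(Event(Kind.RET, 0xE4))
first = fifo.pop()
```

Events leave the queue in arrival order, so the controller observes calls, branches, and returns in program order.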
The profiler controller either receives the sbb signal along with the calculated branch offset, iOffset, the func signal, or the ret signal from the SBB/FUNC FIFO. In all cases, the address of the instruction of interest is provided by the iAddr signal from the SBB/FUNC FIFO. Compared to the initial profiler controller implementation presented in Section 3.3, the profiler controller for the DAProf with function support requires additional functionality for updating the InFunc flag of profiled loops and ensuring the InLoop flag for a loop that has called a function is not reset until at least after that function has returned. Whenever a function call is detected, the profiler controller checks the status of the InLoop flags for all loops within the profile cache to determine which loops are still currently being executed. If the function call originated from an address that does not fall within the bounds of a loop whose InLoop flag is set, then the InLoop flag for that loop is

DAProf (iAddr, iOffset, sbb, func, ret, found, foundIndex, replaceIndex):
  if ( func ) {
    for all i, if ( InLoop[i] && !(iAddr <= Tag[i] && iAddr >= Tag[i] - Offset[i]) )
      InLoop[i] = 0
    for all i, InFunc[i] = InLoop[i]
  }
  else if ( ret ) {
    for all i, if ( InFunc[i] && (iAddr <= Tag[i] && iAddr >= Tag[i] - Offset[i]) )
      InFunc[i] = 0
  }
  else if ( sbb ) {
    if ( found ) {
      if ( InLoop[foundIndex] ) CurrIter[foundIndex] = CurrIter[foundIndex] + 1
      else {
        for all i, Fresh[i] = Fresh[i] - 1
        Execs[foundIndex] = Execs[foundIndex] + 1
        CurrIter[foundIndex] = 1
        InLoop[foundIndex] = 1
        Fresh[foundIndex] = MaxFresh
        if ( Execs[foundIndex] == MaxExecs ) for all i, Execs[i] = Execs[i] >> 1
      }
    }
    else {
      for all i, Fresh[i] = Fresh[i] - 1
      Tag[replaceIndex] = iAddr
      Offset[replaceIndex] = iOffset
      CurrIter[replaceIndex] = 1
      AvgIter[replaceIndex] = 0
      Execs[replaceIndex] = 1
      InLoop[replaceIndex] = 1
      Fresh[replaceIndex] = MaxFresh
      InFunc[replaceIndex] = 0
    }
  }
  for all i, if ( InLoop[i] && !InFunc[i] &&
                  !(iAddr <= Tag[i] && iAddr >= Tag[i] - Offset[i]) ) {
    InLoop[i] = 0
    AvgIter[i] = (AvgIter[i] * 7 + CurrIter[i]) / 8
  }
Fig. 6. Pseudocode for DAProf profiler controller with function support.

reset. After updating the InLoop flag for those loops no longer being executed, the profiler controller sets the InFunc flags for all currently executing loops, i.e., those loops whose InLoop flag is still set. This can be efficiently implemented simply by copying all InLoop entries to corresponding InFunc entries within the profile cache. Whenever a function return is detected, the profiler controller resets the InFunc flag for those loops that contain the address of the function return’s destination, i.e., the loops, or nested loops, from which the corresponding function was called. We note that if a function call is executed from the innermost loop of a nested loop, the InFunc flag for all loops within the nested loop structure will be set. On return from that function call, the profiler controller must reset the InFunc flags for all loops within the nested loops.

Table I. Area and maximum operating frequency for the fully associative, 16-way associative, and 8-way associative DAProf designs without function support and with function support implemented within a UMC 0.18 µm technology.

                 DAProf without Function Support               DAProf with Function Support
Associativity    mm2     Gates      % of ARM9   Max. Freq.     mm2     Gates      % of ARM9   Max. Freq.
Fully            1.75    107,477    20%         553 MHz        1.76    107,902    20%         549 MHz
16-way           1.22    74,744     14%         584 MHz        1.25    76,810     14%         581 MHz
8-way            0.96    59,036     11%         660 MHz        0.98    60,314     11%         660 MHz

The profiler controller utilizes the InFunc flag to ensure that the InLoop flag is not incorrectly reset during a function’s execution and that the average iterations for loops are accurately calculated. For all detected short backwards branches, function calls, and function returns, the profiler controller checks all entries of the profile cache whose InLoop flag is set and whose InFunc flag is not set to determine if the application is still executing within those loops and is not the origin of a function currently being executed. If a loop is in fact no longer being executed, the profiler controller resets the InLoop flag and updates the loop’s average iterations.
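The controller behavior described above can be exercised with a small behavioral model. The following Python sketch mirrors the structure of the Figure 6 pseudocode under several assumptions that are labeled in the code: a `valid` bit and the `function_support` switch are modeling conveniences, loops are matched on the branch address alone, freshness is clamped at zero, and the addresses in the trace are invented to reproduce the Figure 4 scenario.

```python
from dataclasses import dataclass

MAX_FRESH = 7               # largest value of the 3-bit Freshness field
MAX_EXECS = (1 << 16) - 1   # assumed saturation point of the 16-bit Execs field


@dataclass
class Entry:
    tag: int = 0            # address of the short backwards branch
    offset: int = 0         # branch distance; the loop spans [tag - offset, tag]
    curr_iter: int = 0
    avg_iter: int = 0       # fixed point, stored as 8 * average (assumption)
    execs: int = 0
    in_loop: bool = False
    fresh: int = 0
    in_func: bool = False
    valid: bool = False     # modeling convenience, not a field in the paper

    def contains(self, addr: int) -> bool:
        return self.tag - self.offset <= addr <= self.tag


class DAProfModel:
    def __init__(self, entries: int = 32, function_support: bool = True):
        self.cache = [Entry() for _ in range(entries)]
        self.function_support = function_support

    def find(self, tag: int):
        return next((e for e in self.cache if e.valid and e.tag == tag), None)

    def _victim(self) -> Entry:
        # Unused entries first, then non-fresh entries with the fewest total iterations.
        unused = [e for e in self.cache if not e.valid]
        if unused:
            return unused[0]
        stale = [e for e in self.cache if e.fresh == 0] or self.cache
        return min(stale, key=lambda e: e.avg_iter * e.execs)

    def _start_execution(self, e: Entry) -> None:
        for other in self.cache:
            other.fresh = max(other.fresh - 1, 0)   # clamp at zero (assumption)
        e.curr_iter, e.in_loop, e.fresh = 1, True, MAX_FRESH

    def event(self, kind: str, addr: int, offset: int = 0) -> None:
        if kind in ("func", "ret") and not self.function_support:
            return                                  # initial design: events not seen
        if kind == "func":
            for e in self.cache:
                if e.in_loop and not e.contains(addr):
                    e.in_loop = False
                e.in_func = e.in_loop
        elif kind == "ret":
            for e in self.cache:
                if e.in_func and e.contains(addr):
                    e.in_func = False
        elif kind == "sbb":
            e = self.find(addr)
            if e is not None and e.in_loop:
                e.curr_iter += 1                    # another iteration
            elif e is not None:
                self._start_execution(e)            # new execution of a known loop
                e.execs += 1
                if e.execs == MAX_EXECS:
                    for other in self.cache:
                        other.execs >>= 1
            else:
                e = self._victim()                  # newly profiled loop
                e.tag, e.offset, e.avg_iter, e.execs = addr, offset, 0, 1
                e.in_func, e.valid = False, True
                self._start_execution(e)
        # Close loops that were left, unless a function they called is running.
        for e in self.cache:
            if e.in_loop and not e.in_func and not e.contains(addr):
                e.in_loop = False
                e.avg_iter = (7 * e.avg_iter) // 8 + e.curr_iter


def run_fig4_trace(function_support: bool) -> Entry:
    """Five iterations of Loop A, each calling Func B (hypothetical addresses)."""
    model = DAProfModel(function_support=function_support)
    for _ in range(5):
        model.event("func", 0xE0)                   # call from inside Loop A
        for _ in range(3):
            model.event("sbb", 0x200, 0x10)         # Loop FuncB.1
        model.event("ret", 0xE4)                    # return into Loop A
        model.event("sbb", 0x100, 0x40)             # Loop A's backwards branch
    return model.find(0x100)


print(run_fig4_trace(True).execs, run_fig4_trace(False).execs)
```

With function support the model records a single execution of Loop A with five detected iterations, whereas without it every iteration is counted as a new execution, which is the interference pattern described for Figure 4.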

5. EXPERIMENTAL RESULTS We consider three alternative profiler implementations: fully associative, 16-way associative, and 8-way associative DAProf designs. DAProf was implemented in Verilog and synthesized using Synopsys Design Compiler targeting a UMC 0.18 µm technology. All DAProf implementations have been extensively verified through gate-level simulations using instruction traces collected through simulation and runtime execution. However, the integration and testing of DAProf within a real platform is left as future work. Table I presents the area and maximum operating frequency for the fully associative, 16-way associative, and 8-way associative DAProf designs without and with function support. Area is reported in logic gates, mm2, and as the percentage additional area required compared to an ARM9 processor. For all implementations, the SBB/FUNC FIFO has a maximum operating frequency of 934 MHz. Because the SBB FIFO can only execute four times faster than the profile cache and controller, DAProf’s overall operating frequency is limited by the profile cache and profiler controller. For a fully associative implementation without function support, DAProf requires 107,477 gates (1.75 mm2) and can execute at a maximum operating frequency of 553 MHz. The area required for the fully associative DAProf design is approximately 20% of

the area of an ARM9 processor implemented within the same UMC 0.18 µm technology. Alternatively, with function support, a fully associative implementation requires only an additional 425 gates but executes at a slightly reduced maximum operating frequency of 549 MHz. The 16-way associative DAProf design without function support requires 74,744 gates (1.22 mm2) with a maximum operating frequency of 584 MHz, whereas the 16-way associative DAProf with function support requires 76,810 gates (1.25 mm2) with a maximum operating frequency of 581 MHz. Finally, the 8-way associative DAProf design without function support requires only 59,036 gates (0.96 mm2) with a maximum operating frequency of 660 MHz. For comparison, the 8-way associative DAProf design requires only 11% of the area of an ARM9 processor and is 45% smaller and 19% faster than the fully associative DAProf implementation. Adding support for functions only requires a marginal increase in area of 1278 gates, or 2%, while still achieving a maximum operating frequency of 660 MHz. We additionally developed a software based implementation of our dynamic application profiler. The software based implementation was developed in C and integrated within the ARM port of the popular SimpleScalar simulator [Burger and Austin 1997]. Although our focus is on developing a non-intrusive hardware based profiling method, the software based implementation provides an efficient profiler that can be readily integrated within any simulation environment with minimal profiling overhead. 5.1 Profiling Accuracy without Function Support To analyze the accuracy of our DAProf design, we compare the profiling results of DAProf with that of an accurate simulation based profiling method capable of fully profiling nested loop executions and iterations, function calls and executions, as well as recursive function calls [Villarreal et al. 2001].
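The relative area and frequency figures quoted above follow directly from Table I, as the short Python check below shows. The dictionary layout and helper name are illustrative only; the gate counts and frequencies are those reported in the table.

```python
# Gate counts and maximum operating frequencies (MHz) as reported in Table I.
TABLE_I = {
    "fully":  {"gates": 107_477, "gates_func": 107_902, "mhz": 553, "mhz_func": 549},
    "16-way": {"gates": 74_744,  "gates_func": 76_810,  "mhz": 584, "mhz_func": 581},
    "8-way":  {"gates": 59_036,  "gates_func": 60_314,  "mhz": 660, "mhz_func": 660},
}


def percent(fraction: float) -> int:
    """Round a fraction to a whole percentage."""
    return round(100 * fraction)


full, eight = TABLE_I["fully"], TABLE_I["8-way"]

# 8-way versus fully associative, both without function support.
area_saving = percent(1 - eight["gates"] / full["gates"])
speedup = percent(eight["mhz"] / full["mhz"] - 1)

# Additional gates needed for function support in each design.
func_overhead = {name: d["gates_func"] - d["gates"] for name, d in TABLE_I.items()}
func_overhead_8way_pct = percent(func_overhead["8-way"] / eight["gates"])

print(area_saving, speedup, func_overhead, func_overhead_8way_pct)
```

The same comparison made between the two designs with function support gives an area saving of about 44%, since the baseline gate counts differ slightly.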
We profiled various consumer electronics applications provided within the MiBench benchmark suite using the fully associative, 16-way associative, and 8-way associative DAProf designs. For the top ten loops of each application, we analyzed the profiling accuracy in terms of percent error in reported average iterations, percent error in reported executions, and percent error in estimated percentage of execution time. Figure 7 presents the results for the fully associative, 16-way associative, and 8-way associative DAProf designs without function support. The percent error in average iterations is calculated as the sum of differences between the reported average iterations and actual average iterations divided by the sum of the actual average iterations as follows:

\[ \mathrm{AvgIter\%\,Error} = \frac{\sum_{i=1}^{10} \left| \mathit{AvgIter}_{i(\mathrm{DAProf})} - \mathit{AvgIter}_{i(\mathrm{Actual})} \right|}{\sum_{i=1}^{10} \mathit{AvgIter}_{i(\mathrm{Actual})}}. \]

Fig. 7. Percent error in average iterations, executions, and percentage of execution time for (a) fully associative, (b) 16-way associative, and (c) 8-way associative DAProf without function support for the top ten loops of the MiBench consumer electronics applications.

On average, the initial DAProf design achieves good profiling results with an error in reported average iterations of 20%, 19%, and 18% for the fully associative, 16-way associative, and 8-way associative implementations, respectively. In the best case, a 16-way associative DAProf design has an error of only 0.5% in reported average iterations for the application tiffdither. Across all applications, the fully associative DAProf design has the lowest accuracy in average iterations, which can be primarily attributed to function call interference. In addition, the reported average iterations of the applications tiffmedian, mad, tiff2rgba, and tiff2bw exhibit an error of more than 30%. This profiling inaccuracy is caused by function call interference, in which certain function and loop execution behavior results in the InLoop flag being incorrectly reset. Because of unavoidable execution saturations, the loop executions reported by DAProf may not directly correspond to the total number of loop executions. As such, the percent error in reported loop executions is calculated using the following equation:

\[ \mathrm{Exec\%\,Error} = \frac{\sum_{i=1}^{10} \left| \dfrac{\mathit{Exec}_{i(\mathrm{DAProf})}}{\sum_{j=1}^{10} \mathit{Exec}_{j(\mathrm{DAProf})}} - \dfrac{\mathit{Exec}_{i(\mathrm{Actual})}}{\sum_{j=1}^{10} \mathit{Exec}_{j(\mathrm{Actual})}} \right|}{10}, \]

in which the executions of each loop are expressed as the ratio of the reported, or actual, executions of that loop to the total loop executions of the top ten loops. On average, the DAProf design has an error in reported loop executions of 4% for the fully associative design and only 3% for both the 16-way associative and 8-way associative designs. Finally, as the percentage of execution time is often utilized to determine the critical kernels of an application, we estimated the relative percentage of execution time for each profiled loop using the DAProf profiling results. The percent error in percentage of execution time is simply the average absolute difference between the estimated percentage of execution time and actual percentage of execution time across the top ten loops, calculated using the following equation:

\[ \mathrm{\%ExecTime\%\,Error} = \frac{\sum_{i=1}^{10} \left| \mathit{\%ExecTime}_{i(\mathrm{DAProf})} - \mathit{\%ExecTime}_{i(\mathrm{Actual})} \right|}{10}. \]
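For readers who wish to apply the three error metrics to their own profiles, the following Python sketch shows one possible interpretation. The use of absolute differences in all three metrics and the averaging over the number of loops in `exec_error` are assumptions made for this sketch, as are the function names and the sample values.

```python
def avg_iter_error(reported, actual):
    """Sum of absolute differences divided by the sum of the actual values."""
    return sum(abs(r - a) for r, a in zip(reported, actual)) / sum(actual)


def exec_error(reported, actual):
    """Compare each loop's share of total executions (averaging is assumed)."""
    r_total, a_total = sum(reported), sum(actual)
    diffs = (abs(r / r_total - a / a_total) for r, a in zip(reported, actual))
    return sum(diffs) / len(actual)


def exec_time_error(reported_pct, actual_pct):
    """Average absolute difference between estimated and actual percentages."""
    diffs = (abs(r - a) for r, a in zip(reported_pct, actual_pct))
    return sum(diffs) / len(actual_pct)


# Hypothetical three-loop profile in which saturation halved every Execs counter.
actual_execs = [100, 50, 50]
reported_execs = [50, 25, 25]
print(exec_error(reported_execs, actual_execs))
print(avg_iter_error([10, 20], [10, 25]))
print(exec_time_error([50.0, 50.0], [40.0, 60.0]))
```

Because executions are compared as shares of the total, halving every counter on saturation leaves this error unchanged, which is the motivation given above for normalizing the reported executions.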

While function call interference may lead to errors in reported average iterations, the combined accuracy of average iterations and executions results in only a 5% error in percentage of execution time for all DAProf implementations. 5.2 Profiling Accuracy with Function Support To evaluate the effectiveness of our DAProf design with function support we considered an expanded set of applications within the MiBench benchmark suite, including consumer electronics, automotive, office, networking, security, and telecommunications applications. Figure 8 presents the profiling accuracy in terms of percent error in reported average iterations, percent error in reported executions, and percent error in reported percentage of execution time for the fully associative, 16-way associative, and 8-way associative DAProf designs with function support. The DAProf design with function support achieves an average error in average iterations of 3%, 4%, and 3% for the fully associative, 16-way associative, and 8-way associative designs, respectively, thereby substantially improving on the initial DAProf design. For the applications tiffmedian, mad, tiff2rgba, and tiff2bw, which previously exhibited an error in average iterations of greater than 30% without function support (as shown in Figure 7), the extended DAProf design with function support has a maximum error of only 9.5%. In addition, while all three designs performed similarly well on average, for some applications, the tradeoffs between associativity and profiling accuracy are clearly evident. For the application cjpeg, the fully associative DAProf implementation achieves an accuracy of 97%. However, for the 16-way associative and 8-way associative implementations, DAProf only achieves an accuracy of 90%.

Fig. 8. Percent error in average iterations, executions, and percentage of execution time for (a) fully associative, (b) 16-way associative, and (c) 8-way associative DAProf with function support for the top ten loops of various MiBench consumer electronics, automotive, office, networking, security, and telecommunications applications.

The application fft is one of the largest applications profiled, in which much of the application execution consists of brief execution of many small loops. Due to this extensive execution of small loops, the current profile cache size of 32 entries is not sufficiently large to accurately profile all loops. During profiling, two of the top loops within the application are replaced, thus resulting in incorrectly reported average iterations that subsequently lead to incorrect estimates for percentage of execution time. However, we note that these two loops correspond to only 5% of the application execution time. To improve the overall profiling accuracy for fft, a larger profile cache could be utilized. On average, the DAProf design with function support has an error in reported loop executions of 2% for all implementations. For a fully associative design, this corresponds to a marginal improvement in loop execution accuracy from 96% to 98% compared to the initial DAProf design. In addition, the percent error in estimated percentage of execution time is on average only 5% for all implementations. In addition, for all applications

except bitcount and fft, the error in estimated percentage of execution time is less than 10%. In the case of bitcount, the majority of application execution time is spent executing small functions that do not contain loops. The profiling results from DAProf do not capture these functions – but rather only the loops within the application – and thus do not contain sufficient data to provide an accurate estimate of the execution time, thereby resulting in a 15% error across all three DAProf designs. However, with the ability to detect function execution behavior, DAProf could be extended to directly profile function execution as well, although we leave this as future work. Overall, the three DAProf implementations with function support performed similarly well and showed a marked improvement in average iteration values compared to the initial DAProf design. The fully associative DAProf design provides an average profiling accuracy of greater than 95% for all profiling metrics with only minimal increase in area compared to the initial design. Although recursive function calls may not be as common in embedded systems compared to other computing domains, the structure of a recursive function can impact the resulting profiling accuracy. Figure 9 presents two possible recursive function implementations. Consider a recursive function RecFunc1 that contains a loop, Loop A, and a recursive function call from outside of Loop A, as depicted in Figure 9 (a). For this recursive function implementation, DAProf will provide accurate profiling results. Each time RecFunc1 is called, Loop A will be executed and will complete its execution before another recursive call to RecFunc1. Thus, the loop executions and average iterations for Loop A will be recorded accurately. On the other hand, consider a recursive function RecFunc2 that contains a loop, Loop B, which includes a recursive call to RecFunc2, as depicted in Figure 9 (b).
When DAProf first detects Loop B’s short backwards branch, Loop B’s InLoop flag will be set. Subsequently, on detecting the recursive function call, the InFunc flag for Loop B will also be set because the function call is from within Loop B’s bounds. However, during those subsequent recursive function calls to RecFunc2, all of Loop B’s short backwards branches will be treated as iterations of the current execution, rather than new executions. Furthermore, during this recursive function execution, each function return will reset the InFunc flag only to set InFunc again during the next recursive function call. While this recursive function implementation may lead to inaccurate profiling results, we believe this case to be relatively rare.

RecFunc1() {
  ...
  Loop A { ... }
  ...
  RecFunc1();
  ...
}
(a)

RecFunc2() {
  ...
  Loop B {
    ...
    RecFunc2();   // Loop B’s InLoop and InFunc flags will remain set
                  // during successive recursive function calls
    ...
  }
  ...
}
(b)

Fig. 9. Example of recursive function execution behavior for (a) recursive function calls not contained within a loop and (b) recursive function calls within loops.

As none of the MiBench applications considered include recursive function calls, we additionally profiled the ucbqsort application from the Powerstone benchmark suite [Scott et al. 1999]. This recursive quicksort implementation includes recursive function calls from within the outermost loop of the recursive function – similar to the example presented in Figure 9 (b). As expected, for all inner loops not affected by the recursive function call, the recursive execution does not affect DAProf’s ability to provide accurate profiling results. For these loops, DAProf achieves an accuracy of greater than 97% for all three metrics. However, for the outermost loop in which the recursive function call is made, the loop’s executions are reported as the number of non-recursive calls to the function, i.e., those calls to the function originating from outside the function. During the recursive execution of the loop, all executions and iterations of the loop are reported as iterations within a single execution. While the reported average iterations and executions are incorrect, the percentage of execution time for this loop can be accurately estimated with an accuracy of 99%. 5.3 Comparison with Frequent Loop Detection Profiler While DAProf provides additional profiling information beyond that available with the frequent loop detection profiler, the results from both profilers can be utilized to estimate each loop’s percentage of total application execution. Figure 10 (a) presents the percent error in estimated percentage of total application execution time of an 8-way associative DAProf design compared to the frequent loop detection profiler for the MiBench consumer electronics applications. Although the frequent loop detection profiler as originally presented utilized a short backwards branch distance of 256, we consider a short backwards branch distance of 1024 in these results to provide a fair comparison between the two approaches.
We further note that utilizing a short backwards branch distance of only 256 for the frequent loop detection profiler would lead to reduced accuracy for the benchmarks considered. On average, DAProf provides a marginally higher profiling accuracy of 95%, compared to the frequent loop detection profiler’s accuracy of 94%, with DAProf providing better profiling accuracy for all but one application.

Figure 10 (b) presents the percentage of total application execution time captured by the top ten profiled loops for an 8-way associative DAProf design compared to the frequent loop detection profiler for the various embedded applications considered. On average, DAProf is able to capture 78% of the total application execution time, compared to 62% captured by the frequent loop detection profiler. This increase is largely the result of DAProf’s freshness calculation, which ensures that recently executed loops are not immediately replaced. For example, for the application djpeg, the frequent loop detection profiler does not capture the second most frequently executed loop, which accounts for 23% of the total application execution time. As a result, whereas DAProf captures 94% of the total execution time for this application, the frequent loop detection profiler only captures 55% of the total execution time.

Fig. 10. (a) Percent error in percentage of total application execution time and (b) percentage of total execution time captured for DAProf versus the frequent loop detection profiler for MiBench consumer electronics applications.
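The two comparison metrics used in Figure 10 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: percent error is taken to be the absolute difference relative to the true value, the helper names `captured_percentage` and `percent_error` are hypothetical, and the loop shares below are invented illustrative numbers rather than measurements from any benchmark.

```python
def captured_percentage(loop_shares, top_n=10):
    """Sum the execution-time shares (in percent) of the top_n profiled loops."""
    return sum(sorted(loop_shares.values(), reverse=True)[:top_n])


def percent_error(estimated, actual):
    """Assumed definition: absolute error relative to the true value, in percent."""
    return abs(estimated - actual) / actual * 100.0


# Illustrative (made-up) true execution-time shares of four loops, in percent.
true_shares = {"loop_a": 40.0, "loop_b": 23.0, "loop_c": 12.0, "loop_d": 5.0}

# One profile retains every loop; the other has evicted the second hottest loop.
profile_complete = dict(true_shares)
profile_evicted = {name: share for name, share in true_shares.items() if name != "loop_b"}

complete_capture = captured_percentage(profile_complete)
evicted_capture = captured_percentage(profile_evicted)
error_loop_a = percent_error(38.0, true_shares["loop_a"])  # hypothetical estimate of 38%
print(complete_capture, evicted_capture, error_loop_a)
```

Losing a single frequently executed loop lowers the captured percentage by exactly that loop's share, which is the kind of effect described for djpeg above.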

5. CONCLUSIONS AND FUTURE WORK

The dynamic application profiler provides an efficient, non-intrusive profiler capable of accurately profiling an application’s execution. By monitoring both loop and function execution behavior, a fully associative DAProf design can be utilized to profile the application execution of a 553 MHz processor while requiring only 20% additional area compared to an ARM9 processor. For the applications considered, DAProf achieves a profiling accuracy of 97%, 98%, and 95% for average iterations, loop executions, and estimated percentage of execution time, respectively. Alternatively, an 8-way DAProf design achieves similar profiling accuracy using 44% less area and executing at 660 MHz. Thus, the proposed DAProf design is ideally suited for many dynamic optimization approaches, providing accurate and detailed profiling results without incurring any runtime performance overhead or affecting an application’s execution behavior.
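The area and frequency figures above can be related to the 11% area overhead quoted in the abstract with a short calculation. The sketch below assumes that the 44% reduction is relative to the fully associative design's area and that both overheads are expressed relative to the same ARM9 baseline; the variable names are hypothetical.

```python
# Figures quoted in the text.
fully_assoc_overhead_pct = 20.0   # additional area versus an ARM9 processor
area_reduction = 0.44             # 8-way design uses 44% less area (assumed relative to fully associative)
fully_assoc_mhz = 553.0
eight_way_mhz = 660.0

# Overhead of the 8-way design relative to the same ARM9 baseline.
eight_way_overhead_pct = round(fully_assoc_overhead_pct * (1.0 - area_reduction), 1)

# Relative clock frequency of the 8-way design over the fully associative design.
frequency_ratio = round(eight_way_mhz / fully_assoc_mhz, 2)

print(eight_way_overhead_pct, frequency_ratio)
```

Under these assumptions the 8-way design's overhead works out to roughly 11% of the ARM9 area, consistent with the abstract, at a clock rate about 1.19 times that of the fully associative design.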

While DAProf with function support is area efficient and provides excellent profiling accuracy, it is designed to profile single-threaded applications. However, many embedded systems consist of several tasks, or threads, executing within a lightweight kernel or operating system. In the case of dynamically scheduled multitasked applications, context switches can be detrimental to the accuracy of DAProf, leading to context switch interference with detrimental effects similar to those of the function call interference discussed in this paper. As such, additional research is needed to investigate efficient, non-intrusive or minimally intrusive profiling methods for multitasked applications. Early efforts in this direction show promising results, indicating that context switches can be non-intrusively detected and that DAProf can be extended to support multitasked applications with similar profiling accuracy. In addition, with the current ability to detect function execution behavior, future work also includes incorporating methods for directly profiling the execution behavior of functions themselves.

REFERENCES

ALTERA, INC. 2009. Performance Counter Core. http://www.altera.com.
ARM LTD. 2009. RealView Profiler. http://www.arm.com/products/DevTools/RVP.html.
ANDERSON, J., BERC, L., DEAN, J., GHEMAWAT, S., HENZINGER, M., LEUNG, S.-T., SITES, R., VANDEVOORDE, M., WALDSPURGER, C., AND WEIHL, W. 1997. Continuous Profiling: Where Have All the Cycles Gone? ACM Transactions on Computer Systems, 15, 4, 357-390.
BALA, V., DUESTERWALD, E., AND BANERJIA, S. 2000. Dynamo: A Transparent Runtime Optimization System. Conference on Programming Language Design and Implementation (PLDI), 1-12.
BALL, T., AND LARUS, J. 1996. Efficient Path Profiling. International Symposium on Microarchitecture (MICRO), 46-57.
BELLAS, N., HAJJ, I., POLYCHRONOPOULUS, C., AND STAMOULIS, G. 1999. Energy and Performance Improvements in Microprocessor Design Using a Loop Cache. International Conference on Computer Design (ICCD), 378-383.
BERRENDORF, R., ZIEGLER, H., AND MOHR, B. 2003. Performance Counter Library (PCL). http://www.fz-juelich.de/jsc/PCL/.
BROWN, S., DONGARRA, J., GARNER, N., LONDON, K., AND MUCCI, P. 2000. A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters. ACM Conference on Supercomputing (SC), 42-54.
BURGER, D., AND AUSTIN, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. University of Wisconsin-Madison Computer Sciences Department, Technical Report 1342.
CALDER, B., FELLER, P., AND EUSTACE, A. 1997. Value Profiling. International Symposium on Microarchitecture (MICRO), 259-269.
CHUNG, E.Y., BENINI, L., AND DE MICHELI, G. 2001. Automatic Source Code Specialization for Energy Reduction. International Symposium on Low-Power Electronics and Design (ISLPED), 80-83.
CHERNOFF, A., HERDEG, M., HOOKWAY, R., REEVE, C., RUBIN, N., TYE, T., BHARADWAJ YADAVALLI, S., AND YATES, J. 1998. FX!32 A Profile-Directed Binary Translator. IEEE Micro, 18, 2, 56-64.
DEAN, J., HICKS, J., WALDSPURGER, C., WEIHL, W., AND CHRYSOS, G. 1997. ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. International Symposium on Microarchitecture (MICRO), 292-302.
EBCIOGLU, K., ALTMAN, E., GSCHWIND, M., AND SATHAYE, S. 2001. Dynamic Binary Translation and Optimization. IEEE Transactions on Computers (TC), 50, 6, 529-548.
GORDON-ROSS, A., COTTERELL, S., AND VAHID, F. 2002. Exploiting Fixed Programs in Embedded Systems: A Loop Cache Example. IEEE Computer Architecture Letters, 1, 1, 2-5.
GORDON-ROSS, A., AND VAHID, F. 2005. Frequent Loop Detection using Efficient Non-Intrusive On-Chip Hardware. IEEE Transactions on Computers (TC), 54, 10, 1203-1215.
GRAHAM, S.L., KESSLER, P.B., AND MCKUSICK, M.K. 1982. gprof: a Call Graph Execution Profiler. Symposium on Compiler Construction, 120-126.
GUO, Z., BUYUKKURT, B., NAJJAR, W., AND VISSERS, K. 2005. Optimized Generation of Data-Path from C Codes. Design Automation and Test in Europe Conference (DATE), 112-117.
GUTHAUS, M., RINGENBERG, J., ERNST, D., AUSTIN, T.M., MUDGE, T., AND BROWN, R. 2001. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. IEEE Workshop on Workload Characterization, 3-14.
HAZELWOOD, K., AND KLAUSER, A. 2006. A Dynamic Binary Instrumentation Engine for the ARM Architecture. Conference on Compiler, Architecture and Synthesis for Embedded Systems (CASES), 261-270.
HENKEL, J. 1999. A Low Power Hardware/Software Partitioning Approach for Core-based Embedded Systems. Design Automation Conference (DAC), 122-127.
INTEL CORP. 2005. Vtune Environment. http://developer.intel.com/vtune.
KEANE, J., BRADLEY, C., AND EBELING, C. 2004. A Compiled Accelerator for Biological Cell Signaling Simulations. International Symposium on Field-Programmable Gate Arrays (FPGA), 233-241.
IEEE. 2001. IEEE 1149.1 Standard Test Access Port and Boundary Scan Architecture.
KLAIBER, A. 2000. The Technology Behind Crusoe Processors. Transmeta Corporation, http://www.transmeta.com.
LAKSHMINARAYANA, G., RAGHUNATHAN, A., KHOURI, K., JHA, N., AND DEY, S. 1999. Common-Case Computation: A High-Level Technique for Power and Performance Optimization. Design Automation Conference (DAC), 56-61.
LEE, L.H., MOYER, B., AND ARENDS, J. 1999. Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops. International Symposium on Low Power Electronics and Design (ISLPED), 267-269.
LYSECKY, R., STITT, G., AND VAHID, F. 2006. Warp Processors. ACM Transactions on Design Automation of Electronic Systems (TODAES), 11, 3, 659-681.
LYSECKY, R., COTTERELL, S., AND VAHID, F. 2002. A Fast On-Chip Profiler Memory. Design Automation Conference (DAC), 28-33.
NAIR, A., AND LYSECKY, R. 2008. Non-Intrusive Dynamic Application Profiler for Detailed Loop Execution Characterization. International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 23-30.
PETTIS, K., AND HANSEN, R.C. 1990. Profile Guided Code Positioning. Conference on Programming Language Design and Implementation (PLDI), 16-27.
SCHULZ, M., WHITE, B. S., MCKEE, S. A., LEE, H. S., AND JEITNER, J. 2005. Owl: Next Generation System Monitoring. Conference on Computing Frontiers (CF), 116-124.
SCOTT, J., LEE, L.H., CHIN, A., ARENDS, J., AND MOYER, W. 1999. Designing the M*CORE M3 CPU Architecture. International Conference on Computer Design (ICCD), 94-101.
SHANNON, L., AND CHOW, P. 2004. Maximizing System Performance: Using Reconfigurability to Monitor System Communication. International Conference on Field-Programmable Technology (FPT), 231-238.
SPRUNT, B. 2002. Pentium 4 Performance Monitoring Features. IEEE Micro, 22, 72-82.
STITT, G., VAHID, F., AND NEMATBAKHSH, S. 2004. Power Savings and Speedups from Partitioning Critical Loops to Hardware in Embedded Systems. ACM Transactions on Embedded Computing Systems (TECS), 3, 1, 218-232.
STITT, G., AND VAHID, F. 2002. The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. IEEE Design and Test of Computers, 19, 6, 36-43.
TONG, J., AND KHALID, M. 2007. A Comparison of Profiling Tools for FPGA-based Embedded Systems. Canadian Conference on Electrical and Computer Engineering (CCECE), 1687-1690.
VENKATARAMANI, G., NAJJAR, W., KURDAHI, F., BAGHERZADEH, N., AND BOHM, W. 2001. A Compiler Framework for Mapping Applications to a Coarse-grained Reconfigurable Computer Architecture. International Conference on Compiler, Architecture and Synthesis for Embedded Systems (CASES), 116-125.
VILLARREAL, J., LYSECKY, R., COTTERELL, S., AND VAHID, F. 2001. Loop Analysis of Embedded Applications. University of California Riverside, Technical Report UCR-CSE-01-03.
YELLIN, D. M. 2003. Competitive Algorithms for the Dynamic Selection of Component Implementations. IBM Systems Journal, 42, 1, 85-97.
ZAGHA, M., LARSON, B., TURNER, S., AND ITZKOWITZ, M. 1996. Performance Analysis Using the MIPS R10000 Performance Counters. Supercomputing, 16-35.
ZHANG, X., WANG, Z., GLOY, N., CHEN, J., AND SMITH, M. 1997. System Support for Automatic Profiling and Optimization. International Symposium on Operating Systems Principles, 15-26.
ZILLES, C., AND SOHI, G. 2001. A Programmable Co-processor for Profiling. International Symposium on High-Performance Computer Architectures (HPCA), 241-252.

Received January 2008; revised March 2009; revised June 2009; accepted November 2009.
