Improving Memory Performance of Embedded Java Applications by Dynamic Layout Modifications

F. Li, P. Agrawal, G. Eberhardt, E. Manavoglu, S. Ugurel, M. Kandemir
Department of Computer Science and Engineering
The Pennsylvania State University
University Park, PA 16802, USA
{feli, pagrawal, eberhard, manavogl, ugurel, kandemir}@cse.psu.edu

Abstract
Unlike desktop systems, embedded systems use different user interface technologies; have significantly smaller form factors; use a wide variety of processors; and have very tight constraints on power/energy consumption, user response time, and physical space. With its platform independence and secure execution environment, Java is fast becoming the language of choice for programming embedded systems. To extend its use to array based applications from embedded image and video processing, however, Java programmers need to employ several optimizations. Unfortunately, due to Java's precise exception mechanism and bytecode distribution form, it is not generally possible to use classical loop based optimization techniques for array based embedded Java applications. Motivated by this observation, this paper proposes a dynamic memory layout optimization strategy for Java applications. The strategy is based on observing the cache behavior dynamically and transforming the memory layouts of arrays, when necessary, during the course of execution. This is in contrast to many previously proposed memory layout optimization strategies, which are static in nature (i.e., they are applied at compile time). Our results indicate large performance improvements on a suite of seven array based applications.

1. Introduction

Combining Java technology tailored for the embedded market, industry support to bring Java technology to real-time systems, and engineering services to help customers put that technology to work, many companies are delivering the ingredients necessary to help developers create powerful Java-based products and environments. The primary features of Java that make it attractive for embedded systems are its platform independence (write-once, run-everywhere), dynamic loading capability, strong typing, and safe pointers. As Java is expected to become the language of choice for embedded platforms, its performance is fast becoming an important optimization target. While Java is preferable from a platform independence viewpoint, previous research has demonstrated that its performance can be much worse than that of C and C++. For example, Moreira et al. [2] report that the Java version of a matrix multiplication routine executes 130 times more slowly (on the same machine) than the C version of the same code. This is unfortunate, because many embedded image and video processing applications are array based [4] and use routines similar to matrix multiplication for manipulating multidimensional data sets of signals such as images and video sequences. Consequently, to employ Java efficiently in coding array based embedded applications, we need to develop new optimization strategies. Figure 1 gives the execution cycle breakdown for seven array intensive Java applications (the details of our benchmarks and our experimental setup will be discussed later).

Figure 1. Execution cycles breakdown for Java versions of seven array based applications.


For each application, the total number of cycles is broken down into the number of cycles expended during computation (datapath operations and cache references) and those spent during memory accesses (stalls). One can see from this graph that, on average, 45.44% of execution cycles are spent on memory accesses, indicating that an optimization strategy that focuses on memory performance can be very effective in practice. Unfortunately, in many cases it is not possible to use classical loop transformations (employed in parallelizing and optimizing compilers) for improving the data locality of Java applications. There are two major reasons for this. First, such transformations modify the execution order of loop iterations; this may violate the precise exception rule of Java (that is, the order of exceptions must not be modified). While a recent work [1] has shown that in some cases it is possible to relax this rule, the proposed techniques are quite involved and may not be applicable to large array based applications with complex access patterns. Second, Java programs are generally distributed in bytecode form, not as Java sources. Consequently, as discussed in Cierniak's thesis [6], applying classical loop optimizations requires (i) transforming the bytecode representation to a high-level representation (where loop structures are visible), (ii) applying the transformation, and (iii) converting the representation back to bytecode form. This is typically a costly process and may not be applicable in general. Based on these observations, this paper proposes a new strategy for improving the performance of array based embedded Java applications. The proposed approach keeps track of cache misses at runtime and, when the miss rate goes beyond a pre-set threshold, transforms memory layouts in a user-transparent manner to improve cache behavior.

Since keeping track of miss rates and transforming memory layouts dynamically incur performance overhead, such layout transformations should be applied with care. Thus, our approach adopts a conservative strategy in which layout changes are applied incrementally and only when they are necessary (i.e., when the cache behavior is really poor). It should be emphasized that data transformations (when applicable) are particularly suitable for Java, since Java is pointer-safe and, as mentioned earlier, loop transformations are in general not applicable to Java codes. Since Java uses a row-pointer layout for arrays, our layout transformation mechanism differs from those previously used in languages such as C and Fortran. The remainder of this paper is organized as follows. Section 2 reviews the memory layout of Java arrays and data (memory layout) transformations. Section 3 presents the details of our approach to data cache optimization. Section 4 introduces our experimental setup and presents data that demonstrate the effectiveness of our strategy. Section 5 concludes the paper by summarizing our major results and discussing future work on this topic.

2. Java arrays and layout transformations

In implementations of many programming languages, arrays are stored in contiguous memory locations. Specifically, most languages use either row-major or column-major memory layouts. In a row-major layout, consecutive locations in memory hold elements that differ by one in the last (array) subscript position. As an example, for a two-dimensional C array U, element U[5][12] is followed by element U[5][13].


In contrast, in a column-major layout, consecutive memory locations hold elements that differ by one in the first subscript position. Java, on the other hand, employs a different strategy: it allows the different rows of a given array to be stored anywhere in memory and links the rows through arrays of pointers. Figure 2(a) illustrates the structure of a three-dimensional Java array.

Figure 2(a). Original array layout (U[N1][N2][N3]).

This layout representation is referred to as the row-pointer memory layout and, as noted in [14], it has two potential advantages. First, it sometimes allows individual elements of an array to be accessed more quickly (if the multiplications involved in the address computation required for classical row-major and column-major layouts happen to be costly on the target architecture). Second, it allows the rows to have different lengths. A data transformation (also called a memory layout transformation [6]) transforms the memory layout of a multi-dimensional array from one form to another to improve the data cache behavior. An example is to transform a row-major layout to column-major if the majority of the accesses (references) to the array in question are column-wise. In C and Fortran, layout transformations are typically applied statically (at compile time) by re-writing array references and modifying array declarations [6]. In the row-pointer representation, on the other hand, transforming a memory layout requires manipulating the pointers used for implementing the layout; this can be done within the Java Virtual Machine. For example, Figure 2(b) shows a transformed version of the layout depicted in Figure 2(a).

Figure 2(b). Transformed array layout (U[N1][N3][N2]).

The remainder of this paper discusses how such layout transformations can be applied to enhance the data cache behavior of array based embedded Java applications. Note that since Java applications are, in general, distributed in bytecode form, it is very difficult to apply static layout optimizations; dynamic layout optimizations are therefore more promising for enhancing the cache behavior of Java applications.
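As a concrete illustration of the row-pointer layout, the following is a minimal, self-contained Java sketch (our example, not the paper's code). It shows that rows are independent heap objects that may have different lengths, and that the two outermost levels of a three-dimensional array can be re-ordered purely by re-linking pointers.

```java
// Minimal sketch of Java's row-pointer array layout (our example).
public class RowPointerDemo {
    public static void main(String[] args) {
        // A 2D Java array is an array of row references: rows are
        // separate heap objects and may have different lengths.
        int[][] u = new int[3][];
        u[0] = new int[5];
        u[1] = new int[2];
        u[2] = u[0];                       // rows may even share storage
        System.out.println(u[0].length + " " + u[1].length); // 5 2

        // Re-ordering the two outermost levels of a 3D array only
        // shuffles pointers; no element data is copied.
        int[][][] a = new int[3][2][4];    // a[N1][N2][N3]
        int[][][] t = new int[2][3][];     // t[N2][N1][N3]
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 2; j++)
                t[j][i] = a[i][j];         // re-link row pointers
        System.out.println(t[1][2].length); // 4: same rows, new order
    }
}
```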

3. Our approach

Our approach makes use of two tables: the Array Information Table (AIT) and the Profile Data Table (PDT). The AIT stores static information about the arrays used in the application; for each array, its entries record the number of dimensions and the extent of each dimension. This table is populated when the bytecode is generated and is stored as annotations for use by our data transformations. The PDT, on the other hand, tracks the array access patterns during the course of execution and is described further in Section 3.2 (a sketch of both tables follows the phase list below). The task of optimizing memory layouts for cache locality is partitioned into four major phases:

• Detection Phase: Detecting degradation in the cache performance.

• Selection Phase: Selecting the arrays to be transformed.

• Application Phase: Deciding how to transform the selected arrays and applying the selected data transformation(s).

• Re-writing Phase: Changing the array references in the code to reflect the new memory layout(s).

The next four subsections discuss these four phases in detail.
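As promised above, here is a minimal sketch of how the AIT and PDT might be represented inside the JVM. The class and field names (AitEntry, PdtEntry, LayoutTables) are our assumptions; the paper specifies only the tables' contents, not their implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Array Information Table entry: static shape information,
// populated from bytecode annotations (names are ours).
final class AitEntry {
    final int numDimensions;   // e.g., 3 for U[N1][N2][N3]
    final int[] extents;       // e.g., {N1, N2, N3}
    AitEntry(int... extents) {
        this.numDimensions = extents.length;
        this.extents = extents;
    }
}

// Profile Data Table entry: per-array dynamic statistics.
final class PdtEntry {
    long accesses;             // accesses since the last sampling period
    long misses;               // data cache misses charged to this array
}

// Both tables, keyed by the array object itself.
final class LayoutTables {
    final Map<Object, AitEntry> ait = new HashMap<>();
    final Map<Object, PdtEntry> pdt = new HashMap<>();
}
```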

3.1. Detection phase

A common way of measuring the performance of a program is its cache miss rate, which can be defined as the percentage of memory references that cannot be satisfied from the cache. Most previous work on data cache locality enhancement focuses on loops (as they consume most of the execution cycles) and applies either loop (iteration space) or data (memory space) transformations [7]. Note that these transformations are applied statically (i.e., during compilation). In contrast, the data transformations in our work are triggered by the data cache miss ratio; i.e., they are dynamic. To accomplish this, the miss rate is tracked while the program is running by sampling hardware counters at periodic intervals. Most architectures provide a set of hardware counters that allow different processor/memory activities to be measured dynamically (at runtime) [10]. Considering current trends, we can expect even more architectures to support hardware counters in the future. Typically, the hardware provides multiple counters, counting events such as the number of cycles, the number of cache misses, the number of graduated instructions, and taken/untaken branches. Normally there exists an interface through which these counters can be sampled. In our context, the JVM uses only two statistics (counter values): the number of data accesses and the number of cache misses. Using these measures, it calculates (at periodic intervals) the miss rate (denoted misscurrent) and compares it with a pre-set threshold (denoted missmax). If misscurrent ≥ missmax, the current cache behavior is not acceptable and a data transformation should be applied. A sketch of this sampling logic is given below.
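In the sketch, the two counter-reading methods stand in for a platform-specific interface such as Perfmon and are assumptions, not real APIs; only the miss-rate computation and the threshold test follow the text above.

```java
// Minimal sketch of the detection phase (threshold value from Section 4).
abstract class MissRateMonitor {
    static final double MISS_MAX = 0.30;     // pre-set threshold (missmax)
    private long lastAccesses, lastMisses;

    // Invoked at periodic intervals; returns true when the current
    // interval's miss rate indicates a transformation is needed.
    boolean samplingTick() {
        long accesses = readAccessCounter();
        long misses = readMissCounter();
        long dAcc = accesses - lastAccesses; // accesses in this interval
        long dMiss = misses - lastMisses;    // misses in this interval
        lastAccesses = accesses;
        lastMisses = misses;
        if (dAcc == 0) return false;
        double missCurrent = (double) dMiss / dAcc;
        return missCurrent >= MISS_MAX;      // misscurrent >= missmax
    }

    // Assumed hardware-counter hooks (e.g., backed by Perfmon).
    abstract long readAccessCounter();
    abstract long readMissCounter();
}
```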

3.2. Selection phase

The selection of the arrays whose layouts are to be transformed plays an important role in the performance of our approach. First, the arrays chosen for layout transformation should not be rarely accessed ones (even if their data cache behavior is poor), as there is very little to gain in transforming the layout of a rarely accessed array. Second, the layout of an array with good cache behavior should not be modified.

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)

0-7695-2132-0/04/$17.00 (C) 2004 IEEE

Thus, only the frequently accessed arrays with poor cache locality should be transformed. To achieve this, our approach uses the PDT, the profile data table. Specifically, each entry of this table keeps a pair that gives the number of accesses to the array in question and the number of data cache misses incurred by its references in the code. Whenever the cache miss ratio exceeds the threshold value, all the entries of the PDT are traversed. First, the arrays with a large number of accesses are identified. Our current approach is to consider an array frequently accessed if the number of its accesses is above a threshold; in other words, if the accesses to an array X constitute, say, 20% of all memory accesses since the last sampling period, the array is considered frequently accessed. In the rest of the paper, we refer to this value as the frequency threshold (denoted fthreshold). Then, among the frequently accessed arrays (i.e., those whose access frequency is larger than fthreshold), the ones with a high miss rate are identified; we assume that an array has poor cache behavior if its miss rate is larger than or equal to missmax. A sketch of this selection logic is given below.
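The sketch reuses the hypothetical PdtEntry class from the beginning of this section; the threshold values are the defaults given in Section 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch of the selection phase over the PDT (our code).
final class ArraySelector {
    static final double F_THRESHOLD = 0.20;  // fthreshold (20%)
    static final double MISS_MAX = 0.30;     // missmax (30%)

    // Returns the frequently accessed arrays with poor cache locality.
    static List<Object> selectArrays(Map<Object, PdtEntry> pdt) {
        long totalAccesses = 0;
        for (PdtEntry e : pdt.values()) totalAccesses += e.accesses;

        List<Object> candidates = new ArrayList<>();
        for (Map.Entry<Object, PdtEntry> kv : pdt.entrySet()) {
            PdtEntry e = kv.getValue();
            boolean frequent =
                e.accesses >= F_THRESHOLD * totalAccesses;
            boolean poorLocality = e.accesses > 0
                && (double) e.misses / e.accesses >= MISS_MAX;
            if (frequent && poorLocality) candidates.add(kv.getKey());
        }
        return candidates;
    }
}
```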

3.3. Application phase

Our array layout transformations are designed to handle multi-dimensional arrays as two-dimensional arrays of arrays. This is suitable for Java since, as mentioned earlier, a Java array can be viewed as a combination of two-dimensional pointer arrays. Column-to-row and row-to-column transformations in any layer of a multidimensional array can be handled properly, and by combining such simple transformations it is possible to accomplish more sophisticated layout modifications than those that can be achieved in C or C++. Figure 2 gives an example that illustrates our approach to layout transformation; Figure 2(a) shows the original memory layout for a three-dimensional Java array.

Figure 2(c). Transformed array layout (U[N2][N3][N1]).

Note that this layout is very different from the conventional memory layouts used in languages such as C or Fortran, as the different array sections are connected to each other using pointers. Let us assume that the array shown in Figure 2(a) corresponds to U[N1][N2][N3], where N1, N2, and N3 are the dimension extents (here, N1=3, N2=2, and N3=3). Note that, in this default memory layout, the fastest changing array subscript position is the last one. Now, suppose that after some time the sampled counter values indicate that the data cache behavior is not good (i.e., misscurrent ≥ missmax), meaning that we need a dynamic layout transformation. Assuming that the array whose layout is shown in Figure 2(a) needs to be transformed, we consider two options: U[N1][N3][N2] (shown in Figure 2(b)) or U[N2][N3][N1] (shown in Figure 2(c)). In the first alternative, the fastest changing subscript position is the second one; in the second, it is the first one. The reason we focus only on these alternatives can be explained as follows. In many cases, it is sufficient to optimize a layout such that the innermost loop traverses the array along the fastest changing subscript position. For example, for the layout shown in Figure 2(a), an array access (reference) such as U[i][j][k] (where i, j, and k are the loop indices from the outermost to the innermost position) is good, because the innermost loop index traverses the fastest changing subscript position. On the other hand, an array reference such as U[i][k][j] is not good for the layout in Figure 2(a); for this reference, the layout form shown in Figure 2(b) is preferable. Since a given array may be accessed using multiple references, and different loops may use different references for accessing the same array, deciding the most suitable layout for a given loop is best achieved using dynamic transformations. Specifically, if our approach detects that the default layout in Figure 2(a) is not good, it tries the two alternatives shown in Figures 2(b) and 2(c) in turn.


One can expect that the performance difference between U[N1][N3][N2] and, say, U[N3][N1][N2] would not be large, since in both cases the fastest changing subscript position is the same. Therefore, for a given array, our application phase proceeds as follows: given an array of m dimensions, we try to bring each dimension, one by one, to the innermost (fastest changing) position until the cache behavior becomes good. It should be noted that our approach does not attempt to modify the execution order of loop iterations; therefore, we are not restricted by the inherent data dependences in the code. A sketch of one such pointer-level step is given below.
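The following sketch illustrates such steps for a three-dimensional array (our code; the paper performs the equivalent operation inside the JVM). Bringing the first subscript inward only re-links row pointers, whereas a transformation that moves the innermost (contiguous) dimension, such as U[N1][N2][N3] to U[N1][N3][N2], must also copy leaf elements.

```java
// Minimal sketch of pointer-level layout transformation steps.
final class LayoutTransformer {
    // U[N1][N2][N3] -> U[N2][N1][N3]: pointer shuffling only,
    // no element data is copied.
    static int[][][] swapOuterDims(int[][][] u) {
        int n1 = u.length, n2 = u[0].length;
        int[][][] v = new int[n2][n1][];
        for (int i = 0; i < n1; i++)
            for (int j = 0; j < n2; j++)
                v[j][i] = u[i][j];       // re-link existing rows
        return v;
    }

    // Transposing a leaf-level plane ([N2][N3] -> [N3][N2]) moves the
    // contiguous dimension, so the elements themselves must be copied.
    static int[][] transposeLeaf(int[][] plane) {
        int rows = plane.length, cols = plane[0].length;
        int[][] t = new int[cols][rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                t[c][r] = plane[r][c];   // element copy
        return t;
    }
}
```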

3.4. Re-writing phase

Changing the memory layout of arrays requires the references to these arrays to be transformed as well. This task could be accomplished either by editing the bytecode or by accessing the transformed arrays through a transformation table. However, using one more level of indirection to access the array elements would add extra overhead at runtime, which could in turn offset the benefits of layout optimization. Changing the bytecode, on the other hand, is done only once per transformation. Although parsing and processing the bytecode may seem more complex, building a transformation table would require a similar effort, given that the elements of Java arrays are accessed via pointers. Therefore, in our current implementation, we focus only on bytecode modification.
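To make the effect of the re-writing concrete, the following source-level fragment shows what the bytecode edit must achieve for the outer-dimension swap sketched above (reusing our hypothetical LayoutTransformer; the paper edits bytecode directly rather than source).

```java
// After U[N1][N2][N3] -> U[N2][N1][N3], every reference U[i][j][k]
// must be re-written as V[j][i][k] (our source-level analogy).
public class RewriteDemo {
    public static void main(String[] args) {
        int[][][] u = new int[3][2][4];
        u[2][1][3] = 42;
        int[][][] v = LayoutTransformer.swapOuterDims(u);
        System.out.println(v[1][2][3]);  // prints 42: same element,
    }                                    // swapped subscript order
}
```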

3.5. Other data transformations

While the data transformation framework presented so far can be quite effective (as will be shown in Section 4), there are still cases where it may not be very useful in optimizing the data cache behavior. As is well known, there are three major sources of cache misses: cold misses, capacity misses, and conflict misses. While the data transformation strategy presented above may eliminate a large percentage of capacity misses, it may not be particularly successful in reducing conflict misses. This is particularly true for inter-array conflict misses, that is, the conflict misses that occur between references belonging to different arrays. For example, suppose that two array references (belonging to different arrays) appear in a loop with perfect spatial locality. In such a case, optimizing the layout of each array independently hardly helps. In fact, if the base addresses of the arrays in question happen to conflict in the cache, then, depending on the subscript functions of the references, each iteration of the loop can create two conflict misses (one per array). One solution to this problem is to transform the arrays in question together, that is, to interleave them.

Previous research used interleaving for optimizing performance (e.g., [11]) and energy consumption (e.g., [3]). These approaches, however, were static; that is, interleaving was applied based on a compile-time decision. In contrast, here we consider dynamic array interleaving; i.e., we apply array interleaving based on dynamic cache statistics collected via performance counters. An important question we need to address is when and how to apply array interleaving. It should be noted that array interleaving can also interact with the data transformations considered earlier, as all of these transformations modify memory layouts. In our approach, we apply array interleaving only after the data transformations discussed earlier, and only if those transformations fail to improve the cache behavior. Figure 2(d) illustrates how two Java arrays are interleaved.

Figure 2(d). Array interleaving.

In applying interleaving, our approach proceeds as follows. After determining that the data transformations applied so far do not improve performance, array interleaving is considered. In order to qualify for interleaving, an array should belong to a compatibility set. Two arrays are said to belong to the same compatibility set if their access frequencies as well as the numbers of misses they incur are similar to each other (e.g., one is within 10% of the other). This is because inter-array conflict misses are most problematic when they occur between arrays that are accessed with the same frequency. For example, if the innermost loop index (in a given nest) appears in the subscript expressions of both arrays, the arrays are accessed with the same frequency, so the conflict misses between them (when they occur) can be devastating. In this case, we can expect the two arrays to experience similar numbers of accesses and similar numbers of misses. On the other hand, if the innermost loop index appears in only one of the arrays (i.e., the other array has temporal reuse in the innermost loop), the conflict misses between them (if any) will not be very frequent, so there is little to gain in dynamically interleaving such arrays. It should be emphasized that while we talk about two arrays being interleaved, it is possible to interleave more than two arrays as well; if, for example, three arrays belong to the same compatibility set, they can all be interleaved. In the remainder of this paper, we use the term compatibility range (denoted comprange) to denote the range within which the access frequencies and the numbers of misses of the arrays in the same compatibility set must fall. For example, a compatibility range of 10% means that the numbers of accesses/misses generated by any two arrays in the compatibility set are within 10% of each other (i.e., they are very close). A sketch of row-level interleaving is given below.
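The following is a minimal sketch of row-level interleaving for two equal-shape arrays, in the spirit of Figure 2(d). The in-JVM details are our assumption: the rows of the two arrays are re-allocated in alternating order and their data copied, so that (to the extent the allocator places consecutive allocations near each other) rows used together also sit together in memory.

```java
// Minimal sketch of dynamic array interleaving (our code).
final class ArrayInterleaver {
    // Re-allocate the rows of a and b in alternating order so that
    // row i of a and row i of b end up adjacent on the heap,
    // reducing inter-array conflict misses between them.
    static void interleaveRows(int[][] a, int[][] b) {
        for (int i = 0; i < a.length; i++) {
            int[] ra = a[i].clone();   // allocate + copy row i of a
            int[] rb = b[i].clone();   // ... then row i of b
            a[i] = ra;                 // re-link both arrays to the
            b[i] = rb;                 // freshly co-located rows
        }
    }
}
```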


4. Experimental setup and results

All results reported in this section have been obtained using KVM [12] and Perfmon [13]. The K Virtual Machine (KVM) is Sun's newest Java virtual machine technology, designed for products with approximately 128K of available memory. It specifically targets small-memory, limited-resource, connected devices such as cellular phones, pagers, PDAs, set-top boxes, and point-of-sale terminals. The KVM is engineered and specified to support the standardized incremental deployment of the Java virtual machine features and the Java APIs included in the Java 2 ME (Micro Edition) architecture. To collect the reference and cache miss statistics, we employed Perfmon, a tool-set that allows user-level code to access the performance counters present in the Ultra-series workstations and servers produced by Sun Microsystems. This is accomplished by a loadable driver that re-programs the devices with performance counters so that user-level code can access them (normally, access to these counters is restricted to code running in privileged mode). We experimented with four different layout transformation strategies. Levels 1, 2, and 3 correspond to the transformations depicted in Figure 2. In addition to these, we also experimented with dynamic array interleaving (see Section 3.5). Figure 3 lists the benchmarks used in this study and their important characteristics.

Benchmark   Dataset Size   Cycles       Miss Rate
BTRIX       226.2KB        36,856,394   68.3%
CHOLESKY    1,187.4KB      81,424,695   55.5%
FFT         663.5KB        51,095,181   60.8%
LU          680.1KB        63,337,890   72.1%
MATADD      724.0KB        63,021,857   47.4%
MATMULT     1,086.0KB      96,818,770   61.4%
MEDIAN      412.8KB        27,064,342   46.8%

Figure 3. Benchmarks used in our experiments.


These benchmarks were originally written in C or Fortran, and we re-wrote them in Java. The second, third, and fourth columns of this table give the data set size, the number of cycles, and the data cache miss rate, respectively, for each benchmark. These numbers have been obtained using a 16KB, two-way set-associative data cache with a block size of 32 bytes, and a 16KB, direct-mapped instruction cache. The impact of our optimizations on instruction cache performance was minimal; in fact, the original codes already have excellent instruction cache performance (less than a 2% miss rate on average), so our focus here is on data cache behavior. The default values used in our experiments for missmax, fthreshold, and comprange are 30%, 20%, and 10%, respectively. We also modify these values to measure the robustness of our strategy. We first present in Figure 4 the normalized execution time (execution cycles) for the different data transformations. Each bar in this figure is normalized with respect to the number of execution cycles of the corresponding original (unoptimized) code. Note that the interleaved version already includes the benefits of Levels 1, 2, and 3. One can see from this graph that the average normalized execution cycles for Level 1, Level 2, and Level 3 are 0.81, 0.75, and 0.74, respectively. We see that, except for MATADD, all our benchmarks take advantage of dynamic layout change (either Level 1, Level 2, Level 3, or a combination of these).

Figure 4. Normalized execution times.

Figure 5. Time per iteration (CHOLESKY).

Figure 6. Time per iteration (LU).

Obviously, not all benchmarks exploit all three levels (of transformations), as not all the arrays used are three-dimensional. We also observe from Figure 4 that three of our benchmarks benefit from array interleaving, which reduces execution time by 32.4% on average. To illustrate how our data transformations improve cache behavior, we present in Figure 5 the time per iteration for CHOLESKY. To obtain this graph, the execution profile of the benchmark has been divided into forty-five epochs, and the time (cycles) taken by each iteration is normalized to 0.5. We present the time-per-iteration variation for both the original (unoptimized) and the optimized codes. From this graph, one can clearly identify the three levels of data transformations performed by our approach (corresponding to Levels 1, 2, and 3). The combined effect of these transformations is that the per-iteration time is reduced significantly. Figure 6 presents the same graph for LU; we observe a trend similar to the one for CHOLESKY. We next quantify the overheads incurred by our approach. There are two different sources of overhead: profiling the code (which corresponds to the detection phase), and selecting the arrays to be transformed, transforming the memory layouts, and modifying the code (which correspond, respectively, to the selection, application, and re-writing phases). Figure 7 shows the breakdown of execution cycles of the optimized codes into (i) profiling overhead (detection phase), (ii) layout transformation overhead (the remaining phases), and (iii) the remaining (useful) execution cycles (denoted Execution in the graph). One can observe from these results that, on average, the profiling overhead and the layout transformation overhead constitute, respectively, 6.56% and 11.51% of the overall execution cycles. In other words, compared to the time spent doing useful computation, these overheads are modest. Finally, to evaluate the robustness of our strategy, we performed another set of experiments where we changed missmax, fthreshold, and comprange.

Figure 7. Execution time breakdown of the optimized codes.
Figure 8. Sensitivity analysis.

Each point on the x-axis of the graph in Figure 8 represents a specific (missmax, fthreshold, comprange) triple, and each point on the y-axis corresponds to the normalized execution cycles (averaged over all benchmark codes in our suite) under the associated triple. We see from these results that, while the net savings due to our approach vary from one configuration (triple) to another, we obtain performance benefits with all the configurations experimented with. In more detail, we observe an important trend in the graph in Figure 8: for each parameter, there is an optimal value that gives the best performance, and using a value smaller or larger than this optimal value leads to a sub-optimal result. For example, the best value for missmax (among the values we experimented with) is 20%. A smaller value triggers layout transformations early (and, most probably, unnecessarily), whereas a larger value delays layout transformations, thereby leading to performance loss. Similar observations can be made for the other parameter values as well.

5. Conclusion and future work

Java is a semi-compiled language and is therefore inherently slower than native machine-code execution. This problem is exacerbated in embedded environments by the fact that embedded processors are less powerful than desktop ones. The easiest, and most expensive, way to enhance Java performance is to use a faster processor. Fortunately, there are other (and more cost-effective) solutions. This paper explores one such solution and presents a runtime layout transformation strategy for array based embedded Java applications.

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)

0-7695-2132-0/04/$17.00 (C) 2004 IEEE

Our results indicate that the proposed strategy is very successful in practice. We plan to extend the framework presented in this paper as follows. First, we would like to extend our transformations to include more sophisticated transformations such as dimension-changing optimizations (e.g., converting a two-dimensional layout to a three-dimensional one). Second, we would like to develop strategies that evaluate the future usefulness of a layout transformation; the goal would be to skip a transformation if the transformed array will not be used frequently in the future. Third, we would like to port our approach to different JVMs and test its effectiveness on different applications.

6. References

[1] P. V. Artigas, M. Gupta, S. P. Midkiff, and J. E. Moreira. High Performance Numerical Computing in Java: Language and Compiler Issues. In Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, San Diego, CA, 1999.
[2] J. E. Moreira, S. P. Midkiff, and M. Gupta. From Flop to Megaflops: Java for Technical Computing. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, Chapel Hill, NC, August 1998.
[3] R. Athavale, N. Vijaykrishnan, and M. Kandemir. Annotation Based Energy Optimization Using Array Interleaving. In Proceedings of the Second Annual Workshop on Hardware Support for Objects and Microarchitectures for Java, 2000.
[4] F. Catthoor, S. Wuytack, E. D. Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle. Custom Memory Management Methodology - Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer, June 1998.
[5] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-Conscious Structure Layout. In Proceedings of the ACM SIGPLAN'99 Conference on Programming Language Design and Implementation, May 1999.
[6] M. Cierniak. Optimizing Programs by Data and Control Transformations. Ph.D. Dissertation, University of Rochester, Rochester, NY, 1997.
[7] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[8] R. Jones and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, 1996.
[9] H. B. Lee and B. G. Zorn. BIT: A Tool for Instrumenting Java Bytecodes. In Proceedings of the 1997 USENIX Symposium on Internet Technologies and Systems (USITS'97), pages 73-82, Monterey, CA, December 1997.
[10] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance Analysis Using the MIPS R10000 Performance Counters. In Proceedings of Supercomputing '96, Pittsburgh, PA, November 1996.
[11] M. Kandemir. Array Unification: A Locality Optimization Technique. In Proceedings of the International Conference on Compiler Construction, Genova, Italy, April 2001.
[12] KVM white paper. http://java.sun.com/products/cldc/wp/.
[13] Perfmon. http://www.cps.msu.edu/~enbody/perfmon/.
[14] M. L. Scott. Programming Language Pragmatics. Morgan Kaufmann, 2000.

