Simulated Dynamic Selection of Block-Sizes in Level 1 Data Cache
Vincent Peng, Tyler Bischel
University of California at Riverside
CS-203A Computer Architecture

1 Abstract

Cache block size has a significant effect on miss rate in a fixed size cache. A static block size may not suit an individual program, whose dynamic behavior can create a constantly changing optimum block size. This project evaluates a promising new approach: hardware-based dynamic block size selection, in which the block size is chosen during a data miss based on parameters that reflect the program's locality. This method has the potential to improve the miss rate for individual programs, even over the optimal static selection.

2 Introduction

Cache performance is strongly influenced by the optimum selection of its configuration settings. For a fixed capacity cache, perhaps the most important features of the configuration are the number of sets, the cache associativity, and the block size. Current processors rely on fixed settings chosen through a lengthy optimization process that compares the effect of various settings on performance over a known set of benchmarks. However, for individual programs these settings may be far from optimal. This is particularly evident in the selection of the block size. Variations in a program's specific temporal and spatial locality properties can make larger or smaller block sizes become

more attractive. Larger blocks decrease the compulsory misses inherent in spatially local data, while smaller blocks decrease capacity and conflict misses for programs that are temporally but not spatially local, given a fixed size cache. The dynamic behavior of a program dictates a dynamic solution. We have developed an algorithm that selects the block size at runtime in hardware, based on the proximity of a cache miss address to a neighbor already in the cache, and the dynamic size of that neighbor. Since the miss penalty changes only marginally when larger contiguous data chunks are requested, it is reasonable to evaluate this algorithm by the miss rates it produces. For the extreme cases of only spatially local or only temporally local data, we would expect our algorithm to be the second-best performer, as an optimal static block size exists for each particular case. For programs of mixed locality, it is possible that our algorithm could outperform any static block size. Across a fully explored range of fixed block sizes, our algorithm should deliver second-best performance in the worst case, and the best performance regularly. The rest of this paper is organized as follows. Section 3 discusses related efforts in dynamic cache enhancements. Section 4 details the specific features of our algorithm. Section 5 describes the benchmarks we compared against, and evaluates the

performance of our algorithm. Lastly, possible areas of further research using our algorithm will be presented, followed by conclusions.

3 Related Work

The following section describes work related to our project, namely data prefetching and other independently researched variable block size efforts.

3.1 Data Prefetching

A close relative of our algorithm, instruction and data prefetching, has been studied for a long time. In the L1 data cache of the Pentium 4, prefetching is invoked when there are two successive misses on the same page less than 256 bytes apart. The data can then be placed into a separate buffer, where it is only promoted to the main cache if it is hit within a specified time window; alternatively, the prefetched data can be added directly into the cache. The primary difference between our algorithm and prefetching is that the size of our fetch window is selected dynamically. Prefetching relies on memory bandwidth that would otherwise go unused, but if it interferes with demand misses it can actually lower performance. Results on the SPEC benchmark suite have shown a 35% increase in performance on the Intel Pentium 4.

3.2 Variable Data Block Size

Independent efforts have been made on the problem of variable block size caches. Veidenbaum et al. developed a similar mechanism for growing the block size based on neighboring blocks. They also focused on compiler-controlled block size changes, using software profiling to select the appropriate block size in a variety of ways. Tambat et al. proposed sub-tagged caches to provide variable block sizes: blocks whose replacement conflicts involve similar tags can co-occupy a block by breaking it into sub-blocks, each with a small tag denoting the difference between the similar tags. Beg et al. demonstrated a method of compiler-controlled block sizes that exploits variable basic blocks in instruction cache traces.

4 Design

4.1 Block Size vs. Temporal and Spatial Locality

Due to the diverse needs of end users, individual applications and algorithms have strikingly different hardware needs. Despite their dissimilarities, we can consistently separate them into several categories based on two major principles: temporal locality and spatial locality. When high temporal locality is present, recently accessed data is more likely to be needed soon than other data. Similarly, high spatial locality means that data close to the currently accessed memory location is likely to be requested soon. Each of these principles is strongly related to the optimum block size for a program, in terms of its impact on miss rate. Ideally, a different strategy should be chosen for each locality combination. When a program has high temporal locality, a smaller block size is better: the cache can hold more individual data items, reducing capacity misses. On the other hand, a larger block size performs better for applications with high spatial locality: since nearby data will probably be accessed soon, fetching the neighboring data into the cache at the same time efficiently reduces compulsory misses. A fixed block size therefore may not always perform well, while dynamic selection of the block size adapts to the locality of the program.
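To make this tradeoff concrete, the following sketch simulates a small direct-mapped cache of fixed total capacity under two synthetic access patterns. The cache size, the traces, and all names here are our own illustration, not parameters from our experiments:

```python
import random

def misses(trace, block_size, cache_bytes=1024):
    """Count misses for a byte-address trace in a direct-mapped cache."""
    num_sets = cache_bytes // block_size
    resident = [None] * num_sets      # one resident block number per set
    miss_count = 0
    for addr in trace:
        block = addr // block_size    # block number containing this address
        s = block % num_sets          # direct-mapped set index
        if resident[s] != block:      # miss: install the new block
            resident[s] = block
            miss_count += 1
    return miss_count

# High spatial locality: a sequential byte-by-byte sweep of 4 KB.
spatial = list(range(4096))

# High temporal locality: a small set of scattered nodes reused many times.
random.seed(0)
nodes = [random.randrange(2**20) for _ in range(32)]
temporal = nodes * 50

# Large blocks win on the sequential sweep (fewer compulsory misses) ...
assert misses(spatial, 256) < misses(spatial, 32)
# ... while small blocks win on scattered reuse (fewer conflict/capacity misses).
assert misses(temporal, 32) < misses(temporal, 256)
```

The sweep touches every block exactly once, so its miss count is just the number of blocks (16 at 256 bytes versus 128 at 32 bytes), while the scattered working set thrashes the few sets available when blocks are large.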

Figure 1: Physical address space currently residing in the cache.

4.2 Size Selection Strategy

Since different locality needs suggest different block sizes to enhance cache performance, we designed a strategy to decide the block size dynamically. Two factors are considered in this decision. Before introducing them, we first define the term "best block". A best block is a block currently residing in the cache that contains the data whose physical address is closest to the missed address we are currently trying to fetch from the next-level cache (or main memory). For instance, consider Figure 1, which indicates the relationship between physical memory and the entries currently in the cache. Words B1~B3 are already in the cache, but B4~B7 are not, and so on. This means accessing B1 will hit, while accessing B4 will miss.

When B4 is requested, the best block is B3, which is one block away from the missed block. If B5 is requested instead, the best block is still B3, but the distance between them is now two blocks. Once we find the best block, we obtain the distance between it and the missed data. We also extract a second property of the best block: its "group size". This term describes how many blocks were fetched together when the best block was brought into the cache. For example, the group size of B8 is one. The group size is conceptually equivalent to the dynamic block size that was chosen for the miss that fetched the data; four consecutive blocks fetched together can be treated as a single large block in the cache. The new block size is determined from the distance (between the missed data and the best block) and the group size of the best block as follows: if the distance is larger than the group size, fetch the missed data only; otherwise fetch double the group size, unless the maximum block size is reached. Thus, if we miss on B5 and the group size of B3 is one, we fetch B5 only; but we fetch B4~B7 if B3 has a group size of 2 or larger. The resulting policy is summarized in Table 1:

Distance   Group size = 1    2    4    8
    1             2          4    8    8
    2             1          4    8    8
    4             1          1    8    8
    8             1          1    1    8
   >8             1          1    1    1

Table 1: Replacement block size policy.

Every time we want to fetch missed data, we first find the best block, get its group size, calculate the separation distance between the missed data address and the best block address, and then perform a table lookup to determine how many blocks to fetch.
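The steps above can be sketched as follows (an illustrative software model, not the hardware implementation; the function and variable names are ours):

```python
MAX_GROUP = 8  # maximum group size, in base blocks

def find_best_block(resident, miss_block):
    """Return the resident block number closest to the missed block.

    `resident` maps block number -> group size recorded when it was fetched.
    """
    return min(resident, key=lambda b: abs(b - miss_block))

def fetch_size(resident, miss_block):
    """Number of base blocks to fetch on a miss, following Table 1."""
    if not resident:
        return 1                       # no neighbor at all: fetch one block
    best = find_best_block(resident, miss_block)
    distance = abs(best - miss_block)
    group = resident[best]
    if distance > group:
        return 1                       # neighbor too far: fetch missed data only
    return min(2 * group, MAX_GROUP)   # otherwise double, capped at the maximum

# Examples from the text: B3 resident, miss on B5 (distance 2).
assert fetch_size({3: 1}, 5) == 1      # group 1 < distance 2 -> fetch B5 only
assert fetch_size({3: 2}, 5) == 4      # group 2 -> fetch four blocks (B4~B7)
```

Note that the entire lookup reduces to one comparison and one capped doubling, which is why it is cheap enough to perform on every miss.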

4.3 Block Address Alignment

One very important nuance to consider is the alignment of the blocks. Since we opted to store blocks of various sizes in the same cache, we must ensure that large blocks still align correctly with smaller blocks. Additionally, when a fetch is serviced by the next-level cache, it cannot retrieve data that spans multiple blocks (or pages, when the next level is main memory). There is therefore a requirement that the minimum block size of the next-level cache be at least large enough to hold the base block size of the dynamic cache times its maximum group size. This also means each of our larger blocks is aligned exactly when its starting address modulo (the block size times the group size) equals zero.
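As a quick sanity check, the alignment condition can be written directly (the 32-byte base block matches the example in the next section; all names here are our own):

```python
BASE_BLOCK = 32  # base block size in bytes (assumed)

def is_aligned(start_addr, group_size):
    """A group of `group_size` base blocks must start on a group-sized boundary."""
    return start_addr % (BASE_BLOCK * group_size) == 0

assert is_aligned(256, 8)       # 256 % (32 * 8) == 0: a valid 8-block group start
assert not is_aligned(288, 8)   # 288 % 256 == 32: not a valid group start
assert is_aligned(288, 1)       # any base-block boundary works for group size 1
```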

4.4 Group Size Storage

For each entry in the cache, we must record its group size at the time of the fetch. This can be done by adding extra bits to each block: to distinguish group sizes of {1, 2, 4, 8}, two additional bits suffice. The more group sizes supported, the more overhead is introduced; for a cache with 32 byte blocks and storage space for a tag, 2 extra bits amount to only about 0.7% additional space. After a large single block is fetched, it is separated into the smallest possible blocks and stored in the cache as if consecutive misses had been handled. Even though the sub-blocks are consecutive in the physical address space, they need not be stored consecutively in the cache. The point is that larger blocks reduce compulsory misses, and this depends only on how they are fetched, not on how they are stored. We therefore do not need to worry about fragmentation.
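The 0.7% figure can be reproduced with a quick calculation, under the assumption of roughly 30 tag and status bits per entry (the exact tag width is not given in the text, so this is our assumption):

```python
block_bits = 32 * 8       # 32-byte data block
tag_bits = 30             # assumed tag + status bits per entry
extra = 2                 # group-size bits for sizes {1, 2, 4, 8}

overhead = extra / (block_bits + tag_bits + extra)
print(f"{overhead:.1%}")  # ~0.7% additional space per cache entry
```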

4.5 Replacement Strategy

We may fetch several blocks at the same time in the form of one large block, which also means we may need to evict several blocks at the same time. Here we do not evict one large block, but several of the smallest-sized blocks. Since the number of sets in our cache is typically much larger than the maximum group size, there is no conflict in treating the fetch as several normal consecutive cache misses. Hence we choose the blocks to replace using the existing replacement policy, applying the FIFO / LRU / random method repeatedly to select the victims. In this way we get better performance in caches of higher associativity, as individual evictions target the oldest possible data, and thus the data least likely to be reused: frequently accessed data is kept in the cache, while rarely used data in the same group is replaced. The one disadvantage of this approach is that fully associative caches are no longer possible, because a sub-block containing the requested data could be evicted by the next sub-block brought in during the same large-block fetch.
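A minimal sketch of this repeated-eviction idea for a single set (in the real design consecutive sub-blocks land in consecutive sets; we collapse them into one set here for brevity, LRU is chosen arbitrarily, and all names are our own):

```python
from collections import OrderedDict

class CacheSet:
    """One set of a set-associative cache with LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()    # tag -> group size, least recent first

    def touch(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)   # mark most recently used

    def install(self, tags, group):
        """Install each sub-block of a large fetch as its own line,
        evicting the LRU line whenever the set is full."""
        for tag in tags:
            if tag not in self.lines and len(self.lines) == self.ways:
                self.lines.popitem(last=False)   # repeated LRU eviction
            self.lines[tag] = group
            self.lines.move_to_end(tag)

s = CacheSet(ways=4)
s.install([0, 1], group=2)     # one large fetch becomes two sub-block lines
s.install([10], group=1)
s.install([20, 21], group=2)   # set overflows: LRU line (tag 0) is evicted
assert list(s.lines) == [1, 10, 20, 21]
```

Because each sub-block is a first-class line, a later hit on one sub-block refreshes only that line, letting the rest of its group age out naturally.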

5 Results

5.1 Benchmarks

The following section outlines the benchmarks we chose to simulate our algorithm with. Each test run had a specific L1 data cache setting, described below, while all L2 cache settings were fixed such that the L2 cache block size was at least as large as the largest dynamic block. After describing the setup, we present our results.

5.1.1 Locality

Our primary benchmark was a test set we developed to demonstrate different types of locality. The program has four states: high spatial locality, high temporal locality, high combined locality, and low locality. The high spatial locality program simply iterated through a large array, calculating a stored value based on fixed offsets within the array. The high temporal locality program performed a merge sort on a very large linked list, where each list node was 64 bits. The high combined locality program was a merge sort on a large array. The low locality program generated a large linked list, which was promptly thrown away. Each of these modes of operation was simulated in SimpleScalar using a variable L1 data cache, along with fixed block-size L1 data caches with 32, 64, 128, and 256 byte blocks. All caches had a fixed total size of 8KB, and all were direct mapped.

For programs with very high spatial locality but low temporal locality, we expected the largest block sizes to dominate the miss rate performance, since the primary type of miss would be compulsory. We would then expect the dynamic block sizes to quickly converge on the larger block size, as consecutive spatial accesses would drive up the size selected for new blocks. In such circumstances, we would expect the dynamic block size cache to achieve second-best performance over all the selected caches. For programs with very high temporal locality but low spatial locality, we expected the smaller block sizes to dominate, since misses were likely to be caused by capacity problems. Since each node of the linked list was only 64 bits, using 32 byte blocks meant we could hold the greatest number of non-consecutive nodes in the cache. Once again, we expected the dynamic block size cache to converge on the smaller block sizes, as memory accesses would typically be far apart in memory. For programs with no locality, and programs with mixed spatial and temporal locality, the results are harder to predict. For low locality programs we would expect the miss rate to go up, while mixed locality programs may be the best representation of how a program really works.

Figure 2: Spatial locality block size comparison (miss rate vs. 32, 64, 128, 256 byte, and dynamic blocks).

Figure 3: Temporal locality block size comparison (miss rate vs. 32, 64, 128, 256 byte, and dynamic blocks).

Figure 4: Both localities block size comparison (miss rate vs. 32, 64, 128, 256 byte, and dynamic blocks).

Figure 5: Low locality block size comparison (miss rate vs. 32, 64, 128, 256 byte, and dynamic blocks).

The actual results were somewhat surprising. As Figure 2 shows, there was a minimum point in the miss rate at 128 byte blocks for high spatial locality programs. While we expected dynamic block sizes to show second-best performance, in fact they only beat the smallest fixed block size. Programs with only temporal locality, as illustrated in Figure 3, once again had a minimum at a fixed 128 byte block size. This is somewhat perplexing, as nodes should have been able to fit in individual blocks. It is possible that this reflects a heap that begins mostly contiguous, so that multiple nodes lie on the same page in memory, and thus multiple nodes lie in the same block. The results for dynamic block sizes were much more encouraging here, as they performed second best. Dynamic block sizes again show good results in the mixed locality case, approaching the minimum miss rate achieved by the best fixed block cache, as shown in Figure 4. Here, we again see both the largest and smallest fixed block sizes show the worst performance. This is a reasonable result, as small blocks are poor choices for spatial locality, and large blocks are poor choices for strictly temporal locality. Figure 5 shows the results for poor locality. The major surprise is that the largest block sizes have a remarkably poor impact on miss rate, while the others remain fairly low. Note that even here, dynamic block sizes show the best performance.

5.1.2 Dijkstra

The next benchmark we chose was from the MiBench suite: an implementation of Dijkstra's algorithm, specifically applied to networking problems. The results are shown in Figure 6. We varied the block size of the L1 data cache while fixing its total size at 8KB and its associativity at 1. The replacement policy is LRU.

Figure 6: Block size vs. replacement rates for the Dijkstra benchmark (miss rate, replace rate, and write-back rate for 32, 64, 128, 256 byte, and dynamic blocks).

Our algorithm works better than a fixed small or large block size. The replacement rate, however, is much higher than the others, because a single miss can result in multiple replacements in our algorithm. The average group size can be obtained from the ratio of the replacement rate to the miss rate; in this benchmark it is about 3.65. Theoretically our results should achieve better performance, but they did not work out as we expected: although the average group size (3.65) corresponds to a block size between 64 bytes (2 blocks) and 128 bytes (4 blocks), the miss rate is higher than that of both the 64 and 128 byte caches. We suspect this is because our prediction mechanism, the table, is not accurate enough, i.e., we may mispredict the group size. We discuss this issue in the future work section.

6 Future Work

We identified a couple of areas of research that could further improve the performance of variable block-size data caches, namely optimized use of the data bus to reduce the miss penalty, and an optimized block sizing policy for fetches. These are discussed in further detail below.

6.1 Optimized Block Fetch

One area of future work is optimized use of the data bus to lower levels of memory. Currently, when a larger block is requested on a miss, we do not check whether any of the new sub-blocks brought in were already in the cache until after the data is returned from memory; at that point, if one of the sub-blocks was already in the cache, the new data is disregarded. If a mechanism were put in place to fetch only data not already in the cache, we would expect a possible reduction in the miss penalty.

Figures 1a and 1b: cache before and after retrieval of a four sub-block super-block. Green blocks are already in the cache, while red blocks represent the new data brought in during the fetch.

There are two possible types of fetches that must be considered. First, there is the case where one or more sub-blocks already in the cache lie at the start or end of the fetch region. In this case, exhibited in Figure 1a, the bandwidth miss penalty could be reduced by requesting a smaller piece of contiguous data. Second, there is the case where one or more sub-blocks already in the cache lie in the middle of the fetch region. In this case, exhibited in Figure 1b, it might be more beneficial to request all the data during the fetch and disregard the extra sub-block after it arrives. This is because memory I/O cost is typically a function of both latency and bandwidth: receiving the first bit of data is expensive in time, while contiguous data after it comes at a fraction of the cost, so sending separate fetches may in fact increase the miss penalty. For this project, our primary concern was reducing the miss rate, so implementing this optimization would have had no effect on our results.

6.2 Learned Table Parameters

As mentioned in Section 4.2, we built a lookup table that decides the group size based on the distance to the nearest neighbor and the group size of that neighbor. This lookup table can be predetermined, user defined, or self-adapting. In this project we used a naive, intuitive definition to fill in the table. Performance could be even better if the table heuristically refined its values on the fly according to hit / miss status. However, this idea requires a more complicated block size selection mechanism than we have now. We would have to store not only the group size of each entry, but also the distance between the missed data and the best block at the time that particular entry was fetched into the cache (because the closest block address may change as time passes). This introduces more spatial overhead than our algorithm currently adds, so the tradeoff between performance in time and in space should be investigated.
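One possible shape for such a self-adapting table is a simple hill-climbing update; this is entirely speculative, and the update rule, reward signal, and names below are our own illustration, not part of the current design:

```python
SIZES = [1, 2, 4, 8]  # allowed fetch sizes, in base blocks

def init_table():
    """Start from the static policy of Table 1 (16 stands in for '>8')."""
    table = {}
    for d in [1, 2, 4, 8, 16]:
        for g in SIZES:
            table[(d, g)] = min(2 * g, 8) if d <= g else 1
    return table

def update(table, key, was_useful):
    """Nudge an entry up if the extra sub-blocks were hit before eviction,
    down if they were evicted unused."""
    idx = SIZES.index(table[key])
    if was_useful and idx < len(SIZES) - 1:
        table[key] = SIZES[idx + 1]
    elif not was_useful and idx > 0:
        table[key] = SIZES[idx - 1]

table = init_table()
assert table[(2, 2)] == 4
update(table, (2, 2), was_useful=True)    # reward: grow the fetch size
assert table[(2, 2)] == 8
update(table, (2, 2), was_useful=False)   # penalty: shrink it back
assert table[(2, 2)] == 4
```

The extra per-entry state described above (the recorded distance at fetch time) would supply the `key` under which each reward or penalty is applied.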

7 Conclusion This paper presented a cache design in which block size adjusts dynamically based on locality principles occurring in individual programs. On a data miss, the hardware calculates displacement from the closest block already in the cache, and the group size of that item, both of which are used to choose a new block size to fetch from the next level of memory. The performance results indicate some promise to this technique. As implemented, our technique approaches optimum settings for static block selection, which otherwise must be custom tailored for each individual program. While the actual improvement did not match our theoretical understanding, we believe we can close this gap by improving the policy for selection of replacement block sizes.

8 Bibliography

[1] Beg, Azam, and Yul Chu. "Utilizing Block Size Variability to Enhance Instruction Fetch Rate". Journal of Computer Science and Technology, Springer, 2006.
[2] Dubnicki, C., and T.J. LeBlanc. "Adjustable Block Size Coherent Caches". Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992.
[3] Inoue, K., et al. "High Bandwidth, Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs". IEICE Transactions on Electronics, Vol. E81-C, No. 9, pp. 1438-1477, September 1999.
[4] Patterson, David A., and John L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2007.
[5] Tambat, Siddhartha V., et al. "Sub-tagged Caches: Study of Variable Cache-Block Size Emulation". IIS-CSA Technical Report, July 2001.
[6] Tse, John, and Alan Jay Smith. "CPU Cache Prefetching: Timing Evaluation of Hardware Implementations". IEEE Transactions on Computers, Vol. 47, Issue 5, May 1998.
[7] Veidenbaum, Alexander V., et al. "Adapting Cache Line Size to Application Behavior". Proceedings of ICS '99, June 1999.
