Reuse Distance Based Probabilistic Cache Replacement

CVA MEMO 134

Subhasis Das, Tor M. Aamodt, and William J. Dally

Abstract

This paper proposes the Probabilistic Replacement Policy (PRP), a novel replacement policy that estimates the probability a cache line would receive a hit under optimal replacement. PRP evicts the line with the minimum estimated probability of receiving a hit, instead of the line with the maximum expected reuse distance. The latter is optimal under the independent reference model of program behavior, which does not hold for the last level cache (LLC). To efficiently calculate the probability of a hit, PRP augments the TLB to track reuse distance distributions at page granularity using narrow bitwidth histogram counters. To capture long reuse distances, metadata is stored alongside the page table for a small subset of lines per page. PRP improves LLC hit rates in two ways. First, by tracking reuse distances across insertions in the cache, it improves hit rates for moderately long reuse distance accesses by separating them from very long reuse distance accesses. Second, PRP performs accurate dead block removal, because once a block becomes dead its probability of receiving a hit under the optimal policy drops sharply. PRP outperforms DRRIP [7], a state-of-the-art LLC replacement algorithm, by 6.7% and reduces LLC misses by 9.6%. PRP requires 34b of metadata per line (7%) in the cache and 1.3B (2%) of metadata per line in DRAM. A sampling scheme reduces the DRAM overhead to 1.4b (0.3%) per line while still outperforming DRRIP by 5.2%. PRP improves system throughput for multiprogrammed workloads by 9.1% on average versus TA-DRRIP.

1. Introduction

Last level cache misses cause off-chip accesses that consume significant energy and impact performance via higher latency and limited bandwidth. Conventional replacement policies, such as LRU, perform poorly on LLCs because the easy references, with short reuse distances, have been filtered by the lower level caches, leaving a reference stream dominated by moderate and long reuse distances. Even scan-resistant replacement algorithms, such as DRRIP [7], perform poorly on these LLC reference streams because they are unable to discriminate references with moderate reuse distances from those with long reuse distances. Many conventional cache replacement strategies, such as LRU, are based on an "informal principle of optimality" [1] that states that hit rate is maximized by replacing the block with maximum expected time to reuse. This principle holds for a simplified model of program behavior known as the independent reference model, which is a poor approximation to the access stream at the LLC.

Figure 1: Reuse distances and hit rates for 429.mcf, cache size = 4MB, block size = 64B, associativity = 16. (a) Fraction of accesses in each reuse distance bin (< 2 up to >= 256); (b) hit rates of accesses having various reuse distances under LRU, DRRIP, OPT, and PRP.

In this paper we show it is better to replace the block with the minimum estimated probability of receiving a hit before being evicted. To demonstrate the practical benefits of this approach we introduce two techniques that, combined, improve hit rate over current hardware cache replacement algorithms. First, to enable calculation of hit probability, we record a coarse-grained reuse distance distribution rather than a scalar proxy for expected reuse distance (such as LRU stack distance). Second, to discriminate reuse distances longer than the size of the cache, we retain these distributions for blocks not currently in the cache. To greatly reduce overhead, PRP retains metadata at page granularity. While some prior work has explored using distributions [25], it has high implementation complexity. An implementation of PRP using these two techniques improves performance by 6.7% and reduces misses by 9.6% compared to DRRIP. This performance increase requires storing 34 bits (7% overhead) with each resident cache line and just 16 bits (3% overhead) with each non-resident line in DRAM. Additionally, PRP requires a probability computation unit that adds 6 pJ per miss and 0.5% to the LLC area. With a sampling technique, the DRAM overhead can be brought down to 1.4b (0.3%) per line, with only a 1.5% performance loss relative to full PRP.


We observe that an optimal replacement policy enables cache hits to blocks with reuse distances too large for current replacement policies to track. For example, Figure 1a shows the fraction of LLC accesses having different reuse distances in the benchmark 429.mcf. Here, reuse distance is defined as the number of accesses, not necessarily unique [4, 23], to the set containing a cache block between consecutive accesses to that cache block. Figure 1b shows the fraction of accesses of different reuse distances that hit in the cache when using LRU, Belady's optimal cache replacement algorithm (OPT, called MIN in Belady's paper) [2], DRRIP [7], and PRP in a 4MB, 16-way associative LLC. From this figure, we observe that most accesses have a high reuse distance and tend to miss under the LRU algorithm. DRRIP provides more hits than LRU at higher reuse distances. However, the OPT algorithm has a much higher hit rate in these reuse distance bins than either LRU or DRRIP. Maintaining a coarse-grained reuse distance distribution improves cache replacement by enabling the policy to discriminate between moderate reuse distance lines and very long reuse distance lines. This allows the last level cache to keep a working set with moderate reuse distance in the cache, protecting it from a working set with a long reuse distance.

The contributions of this paper are as follows:
• It argues for using probability of a hit, instead of expected reuse distance, as the principle of optimality for causal replacement algorithms.
• It observes that improving cache replacement requires maintaining enough information to distinguish reuse distances larger than the capacity of the last level cache.
• It introduces a novel cache replacement algorithm, PRP, that employs detailed reuse distance distributions and metadata for non-resident lines.
• It introduces optimizations to reduce the cost of off-chip and on-chip metadata storage.
• It shows PRP significantly reduces last level cache misses across a wide range of workloads.

The rest of the paper is organized as follows. Section 2 provides more background and motivation behind PRP. PRP is described in Section 3 and its implementation details in Section 4, followed by examples of access patterns which PRP handles better in Section 5. Simulation methodology and results are described in Sections 6 and 7. Related work is discussed in Section 8 before concluding.
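To make the reuse distance measurement of Figure 1a concrete, the following sketch (ours, not the paper's tooling; the line and set constants are illustrative for the evaluated 4MB, 16-way LLC) computes a per-set binned reuse distance histogram from an address trace:

```python
import bisect
from collections import defaultdict

LINE_BYTES = 64
NUM_SETS = 4096                               # 4MB / (64B lines x 16 ways)
BIN_EDGES = [2, 4, 8, 16, 32, 64, 128, 256]   # Figure 1's bin boundaries

def reuse_histogram(addresses):
    """Bin per-set reuse distances: the number of accesses (not
    necessarily unique) to a block's set between consecutive
    accesses to that block, as defined above."""
    set_count = defaultdict(int)   # accesses observed per set so far
    last_seen = {}                 # block -> set_count at its last access
    hist = defaultdict(int)        # bin index -> number of accesses
    for addr in addresses:
        block = addr // LINE_BYTES
        s = block % NUM_SETS
        if block in last_seen:
            d = set_count[s] - last_seen[block]
            hist[bisect.bisect_right(BIN_EDGES, d)] += 1
        last_seen[block] = set_count[s]
        set_count[s] += 1
    return hist                    # bin 0 is "< 2", bin 8 is ">= 256"
```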

Figure 2: Dominant line reuse distribution profiles for 429.mcf. The five left panels (Types A–E) plot fraction of accesses versus reuse distance (< 16 up to >= 256); the right panels show the fraction of accesses to each profile that hit or miss under PRP and DRRIP.

2. Motivation

This section looks more closely at the motivation for employing reuse distance distributions for replacement decisions. Belady [2] proposed an optimal replacement algorithm under the assumption that the future reference stream is known, in which case the optimal replacement candidate is the line which is referenced furthest in the future. Denning et al. [3] distinguish optimal replacement algorithms based upon whether they require future information that is unknown, which they call "unrealizable", versus "realizable" optimal algorithms that make the best possible replacement decision given a statistical model that is assumed to accurately reflect future program behavior. They propose the independent reference model (IRM) of program behavior, in which at each time the probability of accessing block i is given by a stationary probability λ_i. They argue the optimal replacement algorithm, A_0, will evict the block j with maximum expected reuse distance 1/λ_j. Aho et al. [1] provide a formal proof of the optimality of A_0 under the independent reference model.

A cache line with reuse given by the independent reference model tends to have a geometric (i.e., exponential) reuse distance distribution. In practice, the access sequence observed at the LLC for individual lines does not follow this model. For example, Figure 2 illustrates five dominant reuse distributions for individual cache lines from the SPEC 2006 mcf benchmark. The reuse profile for each individual memory block was found by profiling and then clustered using K-means. The five bar charts on the left plot access frequency (y-axis) versus reuse distance (x-axis). The bar charts on the right show the relative access frequencies of the different line distributions, broken down into hits and misses when employing DRRIP and PRP. Most lines have reuse distance profiles that are multimodal.

Evicting the line with maximum expected reuse distance can lead to poor replacement decisions when lines have multimodal reuse distributions. Consider a fully associative cache with a capacity of 16 blocks and two replacement candidates A and B. Block A is predicted to be accessed 1024 references in the future with probability P = 1. Block B, on the other hand, is predicted to be accessed either 8 references in the future with P = 0.5 or 8192 references in the future with P = 0.5. Block B has the higher expected reuse distance: 0.5 · 8 + 0.5 · 8192 = 4100, versus 1024 for A.


However, it is better to replace block A because it is almost certain to be evicted before it is reused 1024 references in the future. Block B, on the other hand, has a 50% chance of being hit after just 8 references. In this example, replacing the block with the largest expected reuse distance leads to a poor replacement decision.

If employing the expected reuse distance to select the replacement candidate can lead to poor choices of which block to evict, the question arises as to "what is a better alternative?" Before introducing OPT, Belady [2] informally argued, "To minimize the number of replacements, we attempt to first replace those blocks that have the lowest probability of being used again." PRP builds upon this notion of using probability. Below we sketch a brief theoretical argument for replacing the block with the minimum estimated probability of receiving a hit before eviction under OPT. Since future accesses are unknown, we will later assume that recent reuse distance distributions are a good predictor of upcoming reuse distance statistics.

Consider a reference stream consisting of accesses to lines (S_1, S_2, S_3, ...). Consider a single set of the cache, and assume at time t the current contents of the set are the lines (x_1, x_2, x_3, ..., x_W), where W is the associativity of the cache. Let us denote the line evicted at time t by x_e^t. We compare a particular policy F to OPT, the optimal replacement policy assuming the future reference stream is known. A difference between the miss rates of OPT and policy F arises for two reasons: (a) references that hit under OPT but miss under policy F, and (b) references that miss under OPT but hit under policy F. For "reasonable" replacement policies, Figure 1b shows that there are relatively few references of type (b). Hence we focus only on the number of references which miss in policy F but hit in OPT. We denote this number by Δ_F. If a reference S_i hits in OPT but misses in F, then F must have evicted that line earlier. We define an indicator random variable I_{x_e^t} which is 1 if the evicted line x_e^t receives a hit at some time t' > t before being evicted under the OPT policy, and 0 otherwise. The outcome of this random variable depends upon the actual future reference sequence, which we assume is drawn from some (unspecified) probability distribution. Then,

\Delta_F = \sum_t I_{x_e^t} \qquad (1)

Taking the expectation of both sides and using the linearity of expectation, we get

E[\Delta_F] = \sum_t P_{x_e^t}(\mathrm{hit}) \qquad (2)

where P_{x_e^t}(hit) is the probability that x_e^t receives a hit before being evicted under OPT. Hence, we can minimize Δ_F, and thus the miss rate, by replacing the line x that has the lowest P_x(hit), i.e., the lowest probability of a hit under the OPT policy. To employ this approach we require a practical method for estimating the probability of a hit. The following section describes our approach.

3. Probabilistic Replacement Policy

A cache controller implementing PRP chooses a victim line from a set by selecting the candidate line L with the lowest P_L^hit, the probability that line L would receive a hit under optimal replacement. The cache controller computes P_L^hit by estimating the following distributions:
1. the line distribution P_L(t): the probability that the next reuse distance for line L will be t, and
2. the cache distribution P^hit(t): the probability that any line with reuse distance t would receive a hit under OPT.
Using these quantities, the hit probability is estimated as

P_L^{hit} = \frac{\sum_{t > T_L} P_L(t)\, P^{hit}(t)}{\sum_{t > T_L} P_L(t)} \qquad (3)

where T_L is the age of line L. The sum is over reuse distances t greater than T_L, since the next reuse distance must be greater than the line's current age. We note that a similar formula was used in Takagi et al.'s [25] Inter-Reference Gap Distribution Replacement (IGDR) policy to compute a "weight" used to select a victim line, except that instead of P^hit(t) they use 1/t. Section 8 discusses IGDR in more detail.

In Equation 3, P_L(t) depends on the line, while P^hit(t) is independent of the line. Thus, a representation of P_L(t) is stored for each line in the cache, but only one copy of P^hit(t) is maintained. Section 4 discusses the implementation of these distributions.

3.1. Estimating the Line Distribution

Policy FREQ: If we assume the next reuse distance for a line is independent of the prior reuse distance for the same line, we can estimate P_L(t) by recording the frequency N_L(i) with which reuse distance i is observed for each line L. The line distribution P_L(t) is then estimated as

P_L(t) = \frac{N_L(t)}{\sum_i N_L(i)} \qquad (4)

If reuse distance t is binned into K bins, then K counts, one for each bin, must be stored per line.

Policy CONDFREQ: It has been observed that reuse distances follow patterns [20]. For example, the reuse distance sequence of one line from the benchmark 450.soplex is (22, 4, 3, 1, 22, 4, 3, 1, 23, 4, 3, 1). Given that the previous reuse distance for L was T_prev, it may be possible to predict the next reuse distance with even greater accuracy by recording the conditional frequency N_L(T_prev, i). The CONDFREQ policy estimates P_L(t) as

P_L(t) = \frac{N_L(T_{prev}, t)}{\sum_i N_L(T_{prev}, i)} \qquad (5)

If reuse distance is binned into K bins, K^2 different counts are stored, leading to higher overhead for CONDFREQ.
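To make the two estimators concrete, here is a minimal sketch (our illustration, not the authors' hardware) of Equations 4 and 5 over K-bin histograms:

```python
K = 6  # number of reuse distance bins (Section 4.1)

def freq_estimate(n_l, t):
    """FREQ (Equation 4): P_L(t) = N_L(t) / sum_i N_L(i).
    n_l is the K-entry histogram for line L."""
    total = sum(n_l)
    return n_l[t] / total if total else 0.0

def condfreq_estimate(n_l_cond, t_prev, t):
    """CONDFREQ (Equation 5): P_L(t) = N_L(T_prev, t) / sum_i N_L(T_prev, i).
    n_l_cond is a K x K table indexed by the previous reuse bin."""
    row = n_l_cond[t_prev]
    total = sum(row)
    return row[t] / total if total else 0.0
```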


3.2. Estimating the Cache Distribution

To estimate the cache distribution P^hit(t), we use the average hit rate of OPT for each reuse distance t over the SPEC 2006 suite. The exact distribution we use in our evaluations is described in Section 4.6. In Section 7.1.3 we show that PRP works well for a wide range of synthetic cache distributions.

4. Implementation

Figure 3 illustrates our implementation of PRP, highlighting a single set of a 2-way LLC. On an LLC miss, the line is fetched from DRAM and, in parallel, a victim is selected. To select a victim, an array of hit probability calculators (1) computes P_Li^hit for each candidate line L_i using its age T_Li and reuse distribution N_Li(t). The candidate with the lowest probability of a hit is evicted to make space for the incoming line. The reuse distribution for the incoming line is initialized with the reuse profile N_L(t) that was stored alongside the page table translations in the TLB (2) and brought to the LLC alongside the memory request that initiated the LLC access (3). Below, each component is described in detail along with the representation of line timestamps, reuse distance bins, and reuse distance frequencies.

4.1. Representing Reuse Distance Histograms

To minimize space, we use a logarithmic spacing of reuse-distance histogram bins focused on the range where hit rate varies with reuse distance. We group reuse distances into H histogram bins (we found H = 6 was sufficient). Bin 0 records reuse distances in the interval [1, W), where W is the way size of the cache. Bins i = 1, 2, ..., H − 2 record reuse distances that fall in the interval [Wα^(i−1), Wα^i), where α is a constant (we found α = 2 works well). The last bin (i = H − 1) records reuse distances in the range [Wα^(H−2), ∞), i.e., all reuse distances at least α^(H−2) times the way size. In our evaluation, using a 4MB, 16-way associative cache with 64B lines, the intervals are [1,15], [16,31], ..., [256, ∞). For the optimal policy, we observed the hit rate for accesses with reuse distance ≥ 16W is almost 0, whereas for reuse distance ≤ W it is always 1. We also show in Section 7.1 that using smaller bins for the reuse distances does not improve performance significantly.

4.2. Encoding Reuse Distance Frequencies

We store a reuse distance profile N_L(t) with each line L in the cache. This profile has one entry for each of the bins described above. To represent an indefinite count with a small, finite-precision counter, we halve all of the counter values for one line's histogram whenever any counter in the line overflows. For example, suppose the counter precision is 4 bits and the current counter values for the different reuse bins are [7, 9, 2, 10, 15, 8]. Once an access with reuse distance in interval 4 is observed, that counter overflows, leading to the halving of all the counter values; the new counter values are [3, 4, 1, 5, 8, 4]. This method has the added benefit that it weights recent references more heavily than older references, allowing the distribution to adapt more quickly to non-stationary behavior. In our implementation, we use 4-bit precision for all the counters. We show in Section 7.1 that this does not lead to any significant degradation in performance.

4.3. Computing Reuse Distance

To compute reuse distances, we keep a count M of accesses to each set of the LLC and a timestamp M_L for each line L. The age of a line, T_L, is computed as T_L = M − M_L. When a line is reused, we increment the histogram bin N_L(T_L) associated with its age and reset its timestamp to the current count, M_L = M. To save space, we encode timestamps in units of W/2, half the way size (i.e., for our 16-way cache we discard the low 3 bits of M when recording a timestamp M_L). Aliasing occurs if the reuse distance is greater than the range of the timestamp. However, the effect of this aliasing is small, in part because, due to the geometric bin sizing, aliased timestamps tend to fall in the ≥ 16W bin. In practice we found a 10-bit timestamp was sufficient.

4.4. Efficiently Storing Reuse Histograms

We observed that the frequency vectors of adjacent lines in a page are similar. We leverage this observation to reduce the overhead of storing reuse distance frequency vectors by associating a single vector with a profile block of multiple consecutive lines. We found that a profile block consisting of 64 consecutive lines, or 4KB (equal to a page size), works well. Section 7.1 shows the impact of profile block size; we found larger profile blocks tend to be better even when ignoring the bandwidth overhead savings.

Reuse distance histograms are collected online as an application runs and stored adjacent to the page translation in the TLB (2 in Figure 3). Upon a TLB eviction, the histogram is stored in memory in a structure parallel to the page table called the PRP Metadata Table. This PRP Metadata Table is cached in the LLC; to avoid recursively fetching frequency vectors for the lines holding this data, we assign them a uniform N_L(t). On a TLB access that misses (4), the PRP metadata is loaded into the TLB (5) alongside the page translation (6). After accessing the LLC and computing T_L, the reuse histogram in the TLB is updated together with the response to the original memory request (7). The metadata associated with each page is as follows:

Metadata for FREQ policy: For FREQ, each page contains a frequency vector N_L(t) for the page. This frequency vector is 24 bits long (6 bins × 4 bits per frequency). In addition, a 10b last access timestamp needs to be stored for each line in DRAM. Thus, the total DRAM storage overhead is 10b + 24/64b ≈ 1.3B per line. Below we discuss reducing this overhead using a sampling technique. The timestamp and a frequency vector also need to be stored along with each line in the cache, resulting in an overhead of 10b + 24b = 34b per line in the cache.
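The bookkeeping of Sections 4.1–4.3 can be summarized in a few lines. The sketch below is our software model, using the paper's parameters (W = 16, α = 2, H = 6 bins, 4-bit counters, 10-bit timestamps in W/2 units); it is not the RTL:

```python
W, ALPHA, H = 16, 2, 6            # associativity, bin growth factor, bin count
CTR_MAX = 15                      # 4-bit saturating histogram counters
TS_BITS, TS_UNIT = 10, W // 2     # 10b timestamps stored in units of W/2

def bin_index(d):
    """Map a reuse distance onto the H geometric bins of Section 4.1:
    [1, W), [W, 2W), [2W, 4W), [4W, 8W), [8W, 16W), [16W, inf)."""
    i, edge = 0, W
    while i < H - 1 and d >= edge:
        i, edge = i + 1, edge * ALPHA
    return i

def on_reuse(hist, m, m_l):
    """Update line L's histogram on a reuse: age T_L = M - M_L, where M_L
    is stored truncated to W/2 units (the 10b wraparound aliasing noted
    above is ignored here). Halve all counters on overflow (Section 4.2)."""
    age = m - m_l * TS_UNIT
    b = bin_index(max(age, 1))
    if hist[b] == CTR_MAX:
        hist[:] = [c // 2 for c in hist]    # halving weights recent reuse more
    hist[b] += 1
    return (m // TS_UNIT) % (1 << TS_BITS)  # new truncated timestamp M_L
```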

Figure 3: Overall hardware for PRP, shown for a single set of a 2-way LLC. The TLB holds each page translation alongside its reuse histogram N_L(t) and sampled line IDs (LIDs), backed by the PRP Metadata Table in memory, parallel to the page table. Per-way probability calculators (Calc.1, Calc.2) combine each line's histogram N_Li(t) and timestamp M_Li with the set counter M to produce P_L1 and P_L2, and a minimum selects the victim line.

Metadata for CONDFREQ policy: For CONDFREQ, each page contains a conditional frequency vector of length 144b (6 × 6 = 36 frequencies, each 4 bits wide). A 10b last access timestamp and a 3b last reuse bin also need to be stored for each line in DRAM. Thus, a total of 10b + 3b + 144/64b ≈ 2B needs to be stored per line in DRAM. The complete conditional frequency vector does not need to be stored along with each LLC line: in the LLC, each line needs only the portion of the frequency vector corresponding to its last reuse distance, so the overhead per line remains 10b + 24b = 34b. However, we note this space optimization does incur some additional traffic between the LLC and the TLB.

Table 1: Cache Distribution

t bin:        0-15   16-31   32-63   64-127   128-255   256-∞
2^4 · P^hit:  15     14      12      10       9         1
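As a sanity check on the storage figures above (and on the sampled variant of Section 4.5 below), the quoted per-line DRAM overheads can be recomputed directly; all constants below are taken from the text:

```python
# Reproducing the per-line DRAM overhead arithmetic
# (bit counts from Sections 4.2-4.5; 64 lines per 4KB page).
LINES_PER_PAGE = 64

# FREQ: one shared 24b histogram (6 bins x 4b) per page, plus a 10b
# last-access timestamp per line in the PRP Metadata Table.
freq_bits = 10 + 24 / LINES_PER_PAGE            # ~10.4b ~= 1.3B per line

# CONDFREQ: a 144b conditional histogram (36 x 4b) per page, plus a
# 10b timestamp and a 3b last-reuse-bin per line.
condfreq_bits = 10 + 3 + 144 / LINES_PER_PAGE   # ~15.3b ~= 2B per line

# Sampling (Section 4.5): 4 sampled lines per page, each with a 6b line
# ID and a 10b timestamp, plus the shared 24b histogram.
sampled_bits = (4 * (6 + 10) + 24) / LINES_PER_PAGE  # ~1.4b per line

print(freq_bits, condfreq_bits, sampled_bits)   # 10.375 15.25 1.375
```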

4.5. Sampling


To measure long reuse distance intervals for lines not currently in the LLC, PRP requires storing the 10-bit last access timestamp M_L for each line in the PRP Metadata Table. To further reduce the overhead of PRP, we propose sampling a subset of lines per page; this subset consists of the first four lines touched in each page. Enabling this optimization requires adding a set of line IDs, labeled LIDs in Figure 3. Each LID needs only to encode the offset of the line within the page, so for 64B lines and a 4KB page size LIDs are 6 bits. As shown in Section 7, sampling only four lines per page is sufficient to obtain most of the benefits of PRP. Since a 6-bit LID and a 10-bit timestamp need to be stored per sampled line, sampling 4 lines constitutes an overhead of 64 bits per page, bringing the DRAM storage overhead down to 1.4 bits per cache line.
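A minimal sketch of the per-page sampling bookkeeping follows (our illustration; it assumes, per the text, that the first four lines touched in a page are the ones tracked):

```python
SAMPLED = 4     # lines tracked per page (Section 4.5)

class PageMeta:
    """Per-page PRP metadata with sampling: the shared reuse histogram
    plus LID/timestamp pairs for only the first four lines touched."""
    def __init__(self):
        self.hist = [0] * 6    # shared 6-bin reuse histogram
        self.lids = {}         # line offset (6b) -> last timestamp (10b)

    def on_access(self, line_offset, timestamp):
        # Record a timestamp only for sampled lines; the remaining lines
        # in the page contribute nothing to long-reuse tracking.
        if line_offset in self.lids or len(self.lids) < SAMPLED:
            self.lids[line_offset] = timestamp
```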

4.6. Cache Distribution

The cache distribution P^hit(t) is the probability that a line of reuse distance t will hit in the cache. This distribution does not depend on the line. As described in Section 3.2, we use a fixed cache distribution: the average hit rate of the OPT policy in the selected reuse bins, measured using the training input set. These probabilities are quantized to 4 bits like the line distribution. For a 16-way, 4MB cache we use the probabilities in Table 1.

4.7. Probability Calculator Unit

Given the line distribution of a particular line, the hit probability calculator unit (Calc.i in Figure 3) calculates the hit probability using Equation 3. The schematic of this unit is shown in Figure 4. The unit exploits the fact that Equation 3 can be rewritten as

P_L^{hit} = \frac{\sum_{t > T_L} N_L(t)\, P^{hit}(t)}{\sum_{t > T_L} N_L(t)} \qquad (6)

where N_L(t) is the frequency of occurrence of reuse distance t for line L. First, the frequencies of all bins below the current age bin T_L are zeroed out. Then a dot product is taken between this truncated frequency vector and the cache distribution, and the result is divided by the sum of the elements in the frequency vector to obtain P_L^hit. All arithmetic is low precision, so the energy consumed is much lower than that of a memory access. The energy of these operations is included in our evaluation.

Figure 4: Probability Calculator Unit.
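Putting Sections 4.6 and 4.7 together, the following is a small software model of one probability calculator: our sketch of Figure 4's datapath using Table 1's quantized cache distribution, not the synthesized unit itself:

```python
# Table 1's cache distribution, quantized to 4 bits (2^4 * P_hit).
PHIT_Q = [15, 14, 12, 10, 9, 1]

def p_hit(hist, age_bin):
    """Equation 6: zero out bins below the line's current age bin,
    take a dot product with the cache distribution, and divide by
    the sum of the remaining frequencies."""
    kept = [(n if b >= age_bin else 0) for b, n in enumerate(hist)]
    total = sum(kept)
    if total == 0:
        return 0.0             # no mass above the current age: sure victim
    dot = sum(n * p for n, p in zip(kept, PHIT_Q))
    return dot / (total * 16)  # undo the 2^4 quantization

def select_victim(lines):
    """Evict the candidate with the lowest estimated hit probability.
    `lines` is a list of (histogram, age_bin) pairs, one per way."""
    return min(range(len(lines)), key=lambda i: p_hit(*lines[i]))
```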

5. PRP Example

This section considers an example to provide insight into PRP.

5.1. Distinguishing reuse distances

Modern replacement policies try to protect short reuse distance accesses from scan patterns induced by long reuse distance accesses. However, we find such policies are not able to protect moderate reuse distance accesses from long reuse distance accesses. An example access pattern to a single set of a 4-way LLC from the benchmark 429.mcf is shown in Figure 5. Accesses to A1, A2, ... have a moderate reuse distance of 80, and accesses to S1, S2, ... have a long reuse distance of 1000. This pattern is created by two scans performed by two loops inside the function price_out_impl() in the file implicit.c, which is called repeatedly from global_opt() in the file mcf.c. The working set of the first loop (lines 263-264 in implicit.c) fits in the cache, whereas that of the second loop (especially line 269) is larger.

The middle columns of Figure 5 show the behavior of a state-of-the-art replacement algorithm, DRRIP [7], on this pattern. DRRIP was introduced by Jaleel et al. [7], who start by considering the "LRU chain" used for LRU replacement as providing predictions of re-reference intervals. They point out that the LRU chain makes poor predictions for workloads that contain scans and thrashing interspersed with LRU-friendly access patterns. To avoid evicting useful data, they propose several re-reference interval prediction (RRIP) mechanisms. Static RRIP (SRRIP) employs an M-bit re-reference interval prediction value (RRPV) to replace the metadata employed by pseudo-LRU algorithms. Missing references are initially inserted with an RRPV encoding a "long" re-reference interval prediction. Under SRRIP-HP, a hit changes the RRPV encoding to a "near-immediate" prediction. Upon a miss, the first block with a "distant" re-reference interval prediction is selected for eviction. Bimodal RRIP (BRRIP) inserts a small fraction ε of references with a "long" RRPV and the rest with a "distant" RRPV; the effect is to enable scan resistance. Dynamic RRIP (DRRIP) employs set-dueling [21] to select between SRRIP and BRRIP. (A minimal model of the SRRIP-HP mechanics appears in the sketch below.)

For the access sequence in the example, at the point when a replacement candidate is required, none of the lines A1, A2, ... have received any hits, so SRRIP is unable to distinguish the lines belonging to the moderate scan from those of the long scan. On the other hand, the scan-resistant BRRIP policy chooses a small random fraction of a scan to retain in the cache. Since this selection does not depend on the past behavior of the lines, only a small fraction of the moderate reuse distance scan ends up being retained in the cache. PRP can distinguish between the long and moderate reuse distance lines because it stores the reuse distance histograms of the different lines. Thus, under PRP, S2 evicts S1, S3 evicts S2, and so on, while A2, A3, etc. remain in the cache.

To gauge the importance of moderate reuse distance lines, we collected the hit rates of PRP and previous policies for moderate reuse distance accesses. The results are shown in Figure 6. Figure 6a shows the fraction of moderate reuse distance accesses for various benchmarks, and Figure 6b shows the hit rates of LRU, DRRIP, PRP and OPT on these accesses. It can be observed that, on average, 38% of accesses are of moderate reuse distance. LRU achieves a hit rate of 1.4%, while DRRIP achieves a hit rate of 32%, on this category of accesses. PRP is better than either and achieves a hit rate of 49% for moderate reuse distance accesses.

Necessity of storing information in DRAM: Policies such as DRRIP or DGIPPR only store information about lines that are present in the cache. The key insight is that such policies can make a replacement decision based only on the behavior of a line since its last insertion. We call this class of policies non-discriminating. A discriminating policy, on the other hand, stores metadata to differentiate between lines with the same behavior since last insertion. All variants of PRP are discriminating policies. We now argue why a discriminating policy is necessary to get hits to moderate reuse distance lines. As can be observed in Figure 5, the lines A2, A3, ... have not yet received hits when the replacement candidate for S2 must be chosen. Thus, non-discriminating policies such as DRRIP and DGIPPR cannot differentiate between S1 and the Ai lines, since they only look at the behavior of a line since its last insertion. A discriminating policy such as PRP, on the other hand, can recognize the Ai lines as having moderate reuse, and thus correctly replace the S1 line instead. To quantify this point, we looked at lines which suffered two misses in a row under DRRIP, i.e., lines evicted before receiving any hits. We then counted the cases where the second miss was converted to a hit by PRP (∆NH, where NH stands for "No Hit"), and compared this number to the total number of misses reduced by PRP (∆ALL). Figure 7 plots the ratio ∆NH/∆ALL. About 80% of the additional hits in PRP arise from PRP's ability to know the reuse distribution of lines which have not yet received a hit. This is a strong indication that cache replacement policies should be designed to be discriminating, i.e., to take the past behavior of a cache line into account when making the replacement decision.
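For reference, the SRRIP-HP mechanics described above can be modeled in a few lines (our sketch, using 3-bit RRPVs as in Figure 5, where insertions use RRPV 6 and "distant" is RRPV 7):

```python
RRPV_BITS = 3
DISTANT = (1 << RRPV_BITS) - 1   # 7: eviction candidate
LONG = DISTANT - 1               # 6: prediction used on insertion

def srrip_access(rrpv, tags, tag):
    """One access to a set under SRRIP-HP: a hit promotes the line to
    RRPV 0; a miss ages the set until some line reaches DISTANT, then
    evicts it and inserts the new line with a "long" prediction."""
    if tag in tags:
        rrpv[tags.index(tag)] = 0            # hit: near-immediate reuse
        return
    while DISTANT not in rrpv:               # age everyone until a
        rrpv[:] = [v + 1 for v in rrpv]      # distant candidate exists
    victim = rrpv.index(DISTANT)
    tags[victim], rrpv[victim] = tag, LONG

# Example: a 4-way set starts empty (all ways distant).
# tags, rrpv = [None] * 4, [DISTANT] * 4
```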

5.2. Evicting dead blocks early

Since PRP computes the probability of a hit given the current age of a line, blocks become available for replacement immediately after their "live period" has passed. At the same time, PRP does not evict blocks that still have some possibility of receiving a hit at a moderately long reuse distance.

5.3. Effect of accounting for non-temporality

So far we have described PRP such that the incoming line is not bypassed [14] even if the victim line has a higher probability of a hit than the incoming line. We have observed that extending PRP to use bypassing, based upon computing P_L^hit for the incoming line, provides only a 0.2% performance benefit over PRP without bypass. The reason behind this can be better understood from Figure 5.

Figure 5: Scan pattern handling for SRRIP, BRRIP, and PRP (set associativity = 4 lines; BRRIP ε = 1/32, so lines S3, S27, and S45 are inserted with RRPV = 6). The PRP example uses P_S(t) = [0, 0, 0, 0, 0, 1], P_A(t) = [0, 0, 0, 1, 0, 0], and P^hit(t) = [1, 0.9, 0.7, 0.5, 0.3, 0.1].

In this case, it can be observed that line S1 evicts line A1 even though A1 has a higher hit probability. This results in only 3 hits instead of the 4 hits that bypassing would have provided. However, note that in the case of a 16-way cache, the total number of hits for a similar pattern would only decrease from 16 to 15, which is minor. Thus, as associativity increases, the benefit of using bypassing with PRP diminishes.

6. Methodology

We use MARSSx86 [18], an x86-64 full system simulator, to compare different cache policies. We compare PRP to DRRIP. We validated our implementation of DRRIP by comparing it with the implementation of DRRIP provided with CMP$im [6]. We also simulate WN1-4-DGIPPR to obtain the energy consumption of a low overhead replacement policy. For all policies we use the parameters provided in the respective papers. The system parameters we use are shown in Table 2. All line sizes are 64B. The access energies for all the caches were computed using Cacti [16]. We use SPEC CPU2006 benchmarks to evaluate the various cache replacement policies. We use Pinpoints [19] to obtain up to 10 simpoints of length 500M instructions each, which together represent more than 90% of program execution. We simulate the subset of the SPEC benchmarks for which performance increases by at least 5% when the LLC size is increased from 4MB to 8MB.

To evaluate the latency and energy costs of the computation involved in obtaining the hit probability, we synthesized the logic in a 45 nm TSMC node and obtained its latency and energy. One probability calculator unit takes 5 processor cycles to compute P_L^hit and consumes 374 fJ per operation. The area of one unit is 0.00129 mm². Thus, 16 parallel probability calculator units, one for each way of the LLC, consume 6 pJ to compute all the hit probabilities and occupy an area of 0.0207 mm², which is less than 0.5% of the LLC area.

Figure 6: Moderate reuse distance fractions and hit rates. (a) Fraction of moderate reuse distance accesses per benchmark; (b) hit rates of moderate reuse distance accesses.

Table 2: System parameters

Timing parameters:
Core: 4-way out-of-order, 128 ROB entries, 2.4 GHz
L1 data cache: 32 KB, 4-way, 1 cycle
L1 instruction cache: 32 KB, 4-way, 1 cycle
L2 cache: 256 KB, 8-way, 10 cycles
L3 cache: 4 MB, 16-way, 30 cycles
DRAM latency: 200 cycles

PRP parameters:
Reuse distance vector: 6 bins, 4b precision
Timestamp precision: 10b
Profile block size: 4 KB

Energy parameters:
Core energy / instruction: 1 nJ
L1 (I/D) cache access: 20 pJ
L2 cache access: 70 pJ
L3 cache access: 438 pJ
DRAM access: 25.6 nJ
Timestamp access (all ways): 30 pJ
Profile access (all ways): 40 pJ
P_L^hit calculation (all ways): 6 pJ


Figure 7: Fraction of hits to lines evicted without hits in DRRIP (∆NH/∆ALL per benchmark).


7. Results

In this section, we present the performance and energy results of the PRP policy. We evaluate PRP with the FREQ policy (PRP-Freq), PRP with the CONDFREQ policy (PRP-CondFreq), and PRP with 4 sampled lines per page (PRP-Sample4). The performance of these policies is summarized in Figure 8. PRP-Freq has a performance advantage of 12.5% over the baseline LRU policy, as opposed to 5.8% for DRRIP. PRP-CondFreq has a performance advantage of 12.0%, while PRP-Sample4 has a performance advantage of 11.0%. Thus, PRP-Freq performs 6.7% better than DRRIP over the chosen set of workloads. Note that PRP-Sample4, despite sampling only 1/16 of the lines in each page, performs almost as well as PRP-Freq.

PRP-CondFreq, despite being a more sophisticated policy, performs worse than PRP-Freq for benchmarks such as mcf, xalanc and leslie. The reason behind this degradation is the extra metadata fetch overhead of PRP-CondFreq. To quantify the overhead of metadata fetch, we also simulated an ideal CondFreq policy (PRP-Ideal) which does not involve any metadata fetch. The performance of the various PRP variants is shown in Figure 10a: PRP-Ideal achieves a 12.8% speedup over LRU, which is 0.8% better than PRP-CondFreq. Figure 10a also shows the relative performance of various granularities of line sampling in PRP. Even sampling only 2 of the 64 lines in a page leads to a 10.2% improvement in performance, while sampling 8 lines leads to an 11.5% gain. Thus, the performance of the sampling policies is close to that of PRP-Freq.

Figure 9 shows the LLC demand miss rates of the various policies relative to a baseline LRU policy. PRP-Freq decreases misses by 15.2% over LRU, which is 9.6% better than DRRIP. Compared to OPT, PRP is worse by 16.8%.

Figure 10b shows the full system dynamic energy savings of the various PRP variants, as well as DRRIP and DGIPPR, over LRU. The system energy includes dynamic energy consumption by the core, caches and DRAM. Both PRP-Freq and PRP-CondFreq save 3% of the full system energy. DRRIP consumes 2% more energy than the LRU baseline: DRRIP does not allocate lines for writebacks and does not update the RRPVs of lines on writeback requests, which leads to a higher writeback miss rate and increased DRAM traffic. DGIPPR, on the other hand, saves 1.2% dynamic energy, which is 1.8% worse than PRP.

Figure 8: PRP performance (speedup over LRU for DRRIP, PRP-Freq, PRP-CondFreq, and PRP-Sample4).

Figure 9: PRP miss rates (miss rate relative to LRU for DRRIP, PRP-Freq, PRP-CondFreq, PRP-Sample4, and OPT).

7.1. Sensitivity of design parameters

Below we study the sensitivity of PRP's design parameters.

7.1.1. Sensitivity to profile block size: Figure 11a shows the sensitivity of PRP-Freq to the size of the profile block, i.e., the group of lines whose frequency vectors are accumulated together. The performance of PRP-Freq increases as the size of the profile block increases. This is because larger profile blocks can collect reuse distances from more lines and are thus trained faster.

7.1.2. Sensitivity to frequency vector precision: Figure 11b shows the performance of PRP for various precisions of the frequency vector. PRP with a frequency vector precision of 4 bits gives almost the same performance as higher precisions. Even with 1-bit reuse frequencies, PRP-Freq achieves a 5.3% performance gain over LRU, which is similar to DRRIP.

7.1.3. Sensitivity to OPT cache distribution Phit(t): Above, we used the OPT hit rate distribution averaged across the SPEC benchmarks as the cache distribution Phit(t). However, such profile information may not always be available. Thus, we also evaluate the performance of PRP using other empirical distributions, generated by the following method: the Phit values for Bin 0 and Bin 5 are fixed at 15/16 and 1/16 respectively, and the Phit value for Bin i, where 0 < i < 5, is set to 15/16 − Ki (see the sketch below, after Section 7.1.4). The performance of PRP with varying K is shown in Figure 11d. These empirical distributions do not perform significantly worse: performance is within 0.5% of PRP for all values of K. Thus, in situations where obtaining Phit for OPT is not feasible, such empirical distributions may be used. While different Phit distributions do not have a significant effect on the average performance of PRP, we have observed that benchmarks such as omnetpp and leslie3d can benefit by up to 2% from a Phit distribution tailored to the benchmark. Thus, when optimizing the performance of a particular application is paramount, an application-specific Phit distribution could be encoded into the cache distribution.

7.1.4. Sensitivity to frequency vector binning: We varied α, the reuse distance bin size multiplier, to evaluate the sensitivity of PRP to the reuse vector representation. The total number of bins was scaled appropriately to cover the span from W to 16W, W being the cache associativity. The results are shown in Figure 11c. Any α in the range 1.5–2.5 performs similarly to PRP's default; outside that range, the performance of PRP degrades. With a higher value of α, the bins become too few to accurately discriminate between reuse distances, while with lower values of α, the reuse distance frequencies are spread over a larger number of bins and thus become noisier.
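For completeness, the empirical cache distributions of Section 7.1.3 can be generated directly from the stated rule (a one-line sketch):

```python
def empirical_phit(k):
    """Bin 0 fixed at 15/16, bin 5 at 1/16, bins 1-4 set to 15/16 - K*i."""
    return [15/16] + [15/16 - k * i for i in range(1, 5)] + [1/16]

# e.g. empirical_phit(0.1) -> [0.9375, 0.8375, 0.7375, 0.6375, 0.5375, 0.0625]
```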

7.2. Performance for multiprogrammed workloads

To evaluate the effectiveness of PRP in a multiprocessor environment, we also simulated PRP for 51 random workload mixes on a 4-processor system. Each mix was created by first choosing 4 random benchmarks, and then choosing a single simpoint from each chosen benchmark with a probability equal to the weight of the simpoint. We used CMP$im [6] for this study, which models a processor system with the same parameters as given in Table 2. Note that since we cannot use CMP$im to model TLB traffic data, the metadata overhead is not present in these simulations. However, as seen in Figure 8, the metadata overhead is not significant for PRP-Freq.

Figure 12 shows the s-curve of the system throughput metric [5] of PRP-Freq with respect to TA-DRRIP [7]. PRP-Freq improves system throughput by a maximum of 26.9% over TA-DRRIP, and degrades throughput by at most 4.5%; a throughput degradation is observed in 2 of the 51 simulated cases. On average, PRP-Freq improves system throughput by 9.1%.

Figure 10: Performance and energy savings for various PRP variants. (a) Speedup over LRU for PRP-Freq, PRP-CondFreq, PRP-Ideal, and PRP-Sample8/4/2; (b) full system energy savings over LRU for DRRIP, DGIPPR, and the PRP variants.

Figure 11: Sensitivity to various parameters of PRP (speedup over LRU). (a) Profile block size; (b) frequency vector precision; (c) bin size multiplier α; (d) cache distribution Phit (K = 0.03, 0.05, 0.1, 0.2).

8. Related Work

8.1. Scan and Thrash Resistance

A scan is a sequence of accesses that does not repeat. A sequence is said to be thrashing when it is repeated but is larger than the cache. PRP helps improve LLC performance by addressing both patterns. Qureshi et al. [21] propose a dynamic insertion policy (DIP) that improves performance by selectively inserting lines into the least instead of the most recently used position in the recency stack. DIP adaptively selects between a bimodal insertion policy (BIP) and traditional LRU. BIP improves hit rate for thrashing workloads by inserting most lines into the least recently used position and hence retaining some of the working set; it adapts to changing workloads by occasionally inserting some lines into the most recently used position. Rather than finding the right blocks to keep by statistical sampling, PRP tracks reuse distances, which helps it identify lines that will be reused.

Jiménez [8] introduced Dynamic Genetic Insertion and Promotion for PseudoLRU Replacement (DGIPPR). Building upon DRRIP [7] and other works, Jiménez reinterprets the positions in the LRU stack. The position a block moves to upon initial insertion or a subsequent hit is governed by a generic insertion/promotion vector (IPV) that indicates the next location to move to upon a subsequent reference. To reduce storage overhead, the approach is then applied to a Pseudo-LRU encoding. While policies such as DRRIP and DGIPPR are designed to tolerate scans and thrashing, as noted in Section 5 they lack reliable information about a block when it is first inserted.

Figure 12: S-curve of system throughput of PRP-Freq relative to TA-DRRIP across the 51 workload mixes.


8.2. Optimal Replacement

Rajan and Govindarajan [22] propose the Shepherd Cache, which attempts to emulate optimal replacement for a subset of ways in the cache by using the remaining shepherd ways to "look ahead" in the access stream. This look-ahead distance is limited by the number of shepherd ways to being much smaller than the reuse distances that PRP can consider.

8.3. Reuse Distance and Distributions

Takagi et al. [25] propose Inter-Reference Gap Distribution Replacement (IGDR). The concepts underlying IGDR are somewhat similar to PRP. Rather than grouping lines based upon pages, IGDR categorizes each line into one of five generic reuse classes and maintains reuse distributions for each class. The class of a line is determined by the number of references the line has received as well as their regularity. These distributions are used along with the time of last reference to compute a weight for each replacement candidate. A significant practical difference from PRP is the size of the histograms: IGDR maintains histograms with 256 uniformly spaced bins, versus PRP's 6 geometrically sized bins. This difference means that where PRP can compute hit probabilities within a few cycles, IGDR takes much longer to compute the weights, which are stored in a table for each class and periodically updated in the background. IGDR also requires a two-step process of classifying blocks and then finding their reuse distance profiles. Given that it requires several large data structures, including multiple queues and tables, we believe it would be more complicated to implement.

Kaxiras et al. [9] propose a reuse distance prediction based mechanism for cache replacement. The authors predict the next reuse distance of an access based on the reuse distance patterns observed by the PC that last touched the line. This work also uses a log2-based reuse distance bucketing similar to what we propose. However, since the reuse distance prediction has a confidence associated with it, the policy falls back to LRU when the confidence is low. This can lead to problems for workloads where the reuse distances are not predictable. Since PRP functions on the basis of the probability distribution of reuse distances, it does not suffer from this problem.

Duong et al. [4] employ online profiling of an application's overall reuse distance distribution to determine a protecting distance (PD) value used for cache replacement decisions. Each line contains a counter that is set to the protecting distance value upon insertion. Each access to a cache set decrements the counters for each line in the set until they saturate at zero. Only lines with a protecting distance value of zero are eligible for replacement. Since PDP computes only a single protecting distance, a PD that is too long might leave dead blocks in the cache for too long, whereas a PD that is too short will lead to useful lines being evicted. PRP, by storing a distribution for every line, avoids this problem.

8.4. Dead Block Prediction

A significant amount of work has been done towards predicting when blocks become dead, i.e., when they will receive no hits in the future [10–13]. Lai et al. [11] proposed a dead block predictor that uses the last PC that touched a block to predict when a block becomes dead. Liu et al. [13] propose Cache Bursts, which uses the number of bursts instead of the number of raw references to a cache block to predict dead blocks. Khan et al. [10] use only a sample of the LLC accesses to train a smaller and more accurate dead block predictor based on the program PC. These works declare a block to be dead when it does not receive any hits under the LRU replacement policy. However, as shown in Section 5, even blocks with very high reuse distance, which will certainly be evicted under the LRU policy, can receive hits under the OPT policy. In this respect, the closest work to PRP is by Lin et al. [12], who predict when a block becomes dead under the OPT policy by running a collected address trace through an OPT policy simulator and detecting the last-touch PCs. Since OPT is an offline policy, this method has the significant overhead of profiling an application. PRP, on the other hand, characterizes the OPT algorithm only by the distribution Phit, which can then be used for all programs.

8.5. Metadata for evicted blocks

Several virtual memory and database buffer replacement policies retain metadata for nonresident pages or blocks to improve replacement decisions. Examples include EELRU [24], LRU-K [17], and ARC [15]. One challenge in adopting this practice for cache replacement is the additional storage and bandwidth cost it implies. PRP mitigates these overheads by storing reuse distributions at page granularity.

9. Conclusion

In this paper we introduce the probabilistic replacement policy (PRP), a novel LLC replacement policy. On a miss, PRP estimates the probability that each block in the cache set would receive a hit under optimal replacement if it were retained, and then evicts the block with the lowest probability of a hit. We argue that a probability calculation is more robust under the varying reuse distance intervals observed at the LLC. A key challenge in employing PRP is implementing this probability calculation efficiently and with low complexity. To achieve this we introduced several optimizations that are found to be effective. Reuse distances are tracked using low precision, narrow bitwidth counters for a small number of geometrically spaced reuse distance bins, and are tracked at page granularity. To reduce off-chip storage costs, reuse distances are tracked for a small subset of lines per page. PRP outperforms DRRIP [7], a state-of-the-art LLC replacement algorithm, by 6.7%, reduces LLC misses by 9.6%, and naturally adapts to multiprogrammed workloads.


References

[1] A. V. Aho, P. J. Denning, and J. D. Ullman, "Principles of optimal page replacement," J. ACM, vol. 18, no. 1, pp. 80–93, Jan. 1971. Available: http://doi.acm.org/10.1145/321623.321632
[2] L. A. Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Systems Journal, vol. 5, no. 2, pp. 78–101, 1966. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5388441
[3] P. J. Denning, Y. C. Chen, and G. S. Shedler, "A model for program behavior under demand paging," IBM Research, Tech. Rep. RC-2301, 1968.
[4] N. Duong et al., "Improving cache management policies using dynamic reuse distances," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45), 2012, pp. 389–400. Available: http://dx.doi.org/10.1109/MICRO.2012.43
[5] S. Eyerman and L. Eeckhout, "System-level performance metrics for multiprogram workloads," IEEE Micro, vol. 28, no. 3, pp. 42–53, 2008.
[6] A. Jaleel et al., "CMP$im: A Pin-based on-the-fly multi-core cache simulator," in Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, 2008, pp. 28–36. Available: https://ece.umd.edu/~blj/papers/mobs2008.pdf
[7] A. Jaleel et al., "High performance cache replacement using re-reference interval prediction (RRIP)," in ACM SIGARCH Computer Architecture News, vol. 38, 2010, pp. 60–71. Available: http://dl.acm.org/citation.cfm?id=1815971
[8] D. A. Jiménez, "Insertion and promotion for tree-based PseudoLRU last-level caches," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 2013, pp. 284–296. Available: http://doi.acm.org/10.1145/2540708.2540733
[9] G. Keramidas, P. Petoumenos, and S. Kaxiras, "Cache replacement based on reuse-distance prediction," in 25th International Conference on Computer Design (ICCD 2007), 2007, pp. 245–250.
[10] S. M. Khan, Y. Tian, and D. A. Jiménez, "Sampling dead block prediction for last-level caches," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), 2010, pp. 175–186. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5695535
[11] A.-C. Lai, C. Fide, and B. Falsafi, "Dead-block prediction & dead-block correlating prefetchers," in Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA), 2001, pp. 144–154. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=937443
[12] W.-F. Lin and S. K. Reinhardt, "Predicting last-touch references under optimal replacement," University of Michigan, Tech. Rep. CSE-TR-447-02, 2002. Available: http://www.eecs.umich.edu/techreports/cse/2002/CSE-TR-447-02.pdf
[13] H. Liu et al., "Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41), 2008, pp. 222–233. Available: http://dx.doi.org/10.1109/MICRO.2008.4771793
[14] S. McFarling, "Cache replacement with dynamic exclusion," in Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA '92), 1992, pp. 191–200. Available: http://doi.acm.org/10.1145/139669.139727
[15] N. Megiddo and D. S. Modha, "ARC: A self-tuning, low overhead replacement cache," in FAST, vol. 3, 2003, pp. 115–130.
[16] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), 2007, pp. 3–14. Available: http://dx.doi.org/10.1109/MICRO.2007.30
[17] E. J. O'Neil, P. E. O'Neil, and G. Weikum, "The LRU-K page replacement algorithm for database disk buffering," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), 1993, pp. 297–306. Available: http://doi.acm.org/10.1145/170035.170081
[18] A. Patel et al., "MARSS: A full system simulator for multicore x86 CPUs," in Proceedings of the 48th Design Automation Conference, 2011, pp. 1050–1055. Available: http://dl.acm.org/citation.cfm?id=2024954
[19] H. Patil et al., "Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation," in Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-37), 2004, pp. 81–92. Available: http://dl.acm.org/citation.cfm?id=1038933
[20] V. Phalke and B. Gopinath, "An inter-reference gap model for temporal locality in program behavior," in Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '95/PERFORMANCE '95), 1995, pp. 291–300. Available: http://doi.acm.org/10.1145/223587.223620
[21] M. K. Qureshi et al., "Adaptive insertion policies for high performance caching," in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA '07), 2007, pp. 381–391. Available: http://doi.acm.org/10.1145/1250662.1250709
[22] K. Rajan and G. Ramaswamy, "Emulating optimal replacement with a shepherd cache," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), 2007, pp. 445–454. Available: http://dx.doi.org/10.1109/MICRO.2007.14
[23] X. Shen et al., "Locality approximation using time," in Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '07), 2007, pp. 55–61. Available: http://doi.acm.org/10.1145/1190216.1190227
[24] Y. Smaragdakis, S. Kaplan, and P. Wilson, "EELRU: Simple and effective adaptive page replacement," in Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '99), 1999, pp. 122–133. Available: http://doi.acm.org/10.1145/301453.301486
[25] M. Takagi and K. Hiraki, "Inter-reference gap distribution replacement: An improved replacement algorithm for set-associative caches," in Proceedings of the 18th Annual International Conference on Supercomputing (ICS '04), 2004, pp. 20–30. Available: http://doi.acm.org/10.1145/1006209.1006213
