Cache Management for Multi-Core Architecture

BY
ABHISHEK JAIN, BITS PILANI
ABHINAV AGRAWAL, BITS PILANI

Supervisor: Dr. Sudarshan TSB, Group Leader (Head), CS & IS Group, BITS PILANI

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI (RAJASTHAN) 333031

Contents

1. Introduction
2. Multicore Architectural Overview
3. Cache Memory Architecture
4. Proposed Scheme
5. References

Managing Shared Caches in Chip Multiprocessors

Introduction

A multi-core CPU, or Chip Multiprocessor (CMP), combines two or more independent cores on a single chip. Each "core" independently implements optimizations such as superscalar execution, pipelining, and multithreading. The most commercially significant multi-core processors are those used in computers from Intel (e.g., the Quad-Core processors) and AMD (e.g., the Tri-Core processors) and in game consoles (e.g., the Cell processor in the PS3). However, the technology is also widely used in other areas, especially embedded processors such as network processors, digital signal processors, and multimedia processors.

One fundamental aspect of multi-core processors and systems is the way the memory is organized. Memory architecture and performance influence both the performance of tasks running on the processors and the communication between tasks and processors [1]. Especially when task performance depends on the locality of data in caches, a smart memory architecture and an appropriate cache size have a profound impact on performance, and their efficiency in multi-core architectures is yet to be fully explored.

Many CMP architectures share the last level of cache among different processors for a better performance-cost ratio and improved resource allocation. Shared cache management is therefore a crucial CMP design aspect for the performance of the system. Qureshi et al. [2] investigated the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application with the higher demand and fewer to the application with the lower demand. However, a higher demand for cache resources does not always translate into higher performance from additional cache resources. It is better for performance to invest cache resources in the application that benefits more from them rather than in the application that merely demands more of them.

Speight et al. [3] proposed simple architectural extensions and adaptive policies for managing the L2 and L3 cache hierarchy in a CMP system. In particular, they evaluated two mechanisms that improve cache effectiveness; the first uses a small history table to provide hints to the L2 caches as to which lines are resident in the L3 cache. Chang et al. [4] presented Cooperative Cache Partitioning (CCP) to allocate cache resources among threads running concurrently on a CMP. Unlike cache partitioning schemes that use a single spatial partition repeatedly throughout a stable program phase, CCP resolves cache contention with multiple time-sharing partitions. Time-sharing cache resources among partitions allows each thrashing thread to speed up dramatically in at least one partition by unfairly shrinking the capacity allocations of other threads, while fairness is improved by giving the different partitions an equal chance to execute. Quality of Service (QoS) is guaranteed over the long term by orchestrating the shrinking and expansion of each thread's capacity across partitions to bound the average slowdown [5][6].

A new classification of cache misses in the context of CMPs with a shared cache has been proposed in [7]. The authors classify the cache misses in a CMP with a shared L2 cache as compulsory, inter-processor, and intra-processor misses.
A novel technique called set pinning associates cache sets with owner processors (ownership here refers to the right of a processor to evict blocks within the set on a cache miss) and redirects blocks that would lead to inter-processor misses to a small Processor Owned Private (POP) cache; each core has its own POP cache. Set pinning eliminates inter-processor misses and reduces intra-processor misses in shared caches. As an improvement over set pinning, a technique called adaptive set pinning was proposed, which increases the benefits of set pinning by adaptively relinquishing ownership of pinned sets. Adaptive set pinning mitigates the domination of set ownership by a few processors that is observed with plain set pinning.

Multicore Architectural Overview

Multi-core is a design in which a single physical processor contains the core logic of more than one processor. The multi-core design takes several such processor "cores" and packages them as a single physical processor. The goal of this design is to enable a system to run more tasks simultaneously and thereby achieve greater overall system performance. The presence of multiple cores can also introduce greater design complexity. For instance, to cooperate with one another, applications running on different cores may require efficient inter-process communication (IPC) mechanisms, a shared-memory data infrastructure, and synchronization primitives to protect shared resources [8]. Our goal is a cache management policy that partitions the cache according to the processors' requirements; hence we aim to design a multi-core architecture with efficient cache management.

1. A Taxonomy of Cache Misses in Chip Multiprocessors

The motivation for our design comes from the example transactions depicted in Figure 1 of the reference paper [7]. Consider a CMP with two processors, P1 and P2, and a fully associative shared L2 cache. Examples (a) and (b) in Figure 1 show two possible types of transactions that could result in a miss in a shared cache. Example (a) depicts a traditional capacity miss, where the same processor P1 is responsible for both the first reference to and the eviction of the memory element X. Example (b) also depicts a miss by P1 to a memory element X. The difference here is that X was brought into the cache by an earlier reference by P1, but was evicted only because of a reference by P2 to a different memory element Y that mapped to the same cache block as X. The authors classify misses similar to that shown in (a) of Figure 1 as intra-processor misses and ones similar to that shown in (b) as inter-processor misses.

To present a more formal understanding of the CII (Compulsory, Inter-processor, Intra-processor) classification, the authors represent the life cycle of a memory element in the state diagram of Figure 2. The diagram depicts the life cycle of a memory element in the shared cache during the execution of a program accessing it, assuming the program executes on a dual-core CMP; the same idea is easily extensible to any number of processors. As seen in Figure 2, the memory element under consideration is initially in the Never Referenced state. The first access by P1 or P2 causes a compulsory (cold) miss, and the memory element enters the Referenced state for the first time in its life cycle. Any subsequent reference (by any processor) to a memory element in the Referenced state leads to a cache hit. A replacement of the cache block takes the memory element into the Replaced state. A memory element is tagged with the id of the processor that replaced it; for instance, a memory element evicted from the cache as a result of a reference from P1 is in the Replaced (P1) state. It is evident that all non-compulsory cache misses to a memory element occur when it is in the Replaced state. The classification of the non-compulsory misses is based on whether the cache miss occurs because the block was replaced (at an earlier point in time) by the same processor or by a different processor. This is determined by comparing the processor facing the miss with the tag of the memory element in the Replaced state.
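As a concrete illustration of this classification, the following is a minimal sketch (our own bookkeeping, not code from [7]; the names CiiClassifier, ElemState, and the map-based table are our assumptions) of how a simulator might track the life cycle of each memory element and label every miss as compulsory, intra-processor, or inter-processor.

```cpp
// Minimal sketch of the CII miss classification: each memory element is
// tracked through the Never Referenced -> Referenced -> Replaced(Pi)
// life cycle described for Figure 2.
#include <cstdint>
#include <unordered_map>

enum class MissType { Hit, Compulsory, IntraProcessor, InterProcessor };

// State of one memory element with respect to the shared L2 cache.
struct ElemState {
    bool resident = false;   // currently in the cache (Referenced state)
    bool replaced = false;   // has been evicted at least once (Replaced state)
    int  replacer = -1;      // id of the processor that last evicted it
};

class CiiClassifier {
    std::unordered_map<uint64_t, ElemState> table_;  // keyed by block address
public:
    // Called on every reference by processor `pid` to block `addr`.
    MissType access(uint64_t addr, int pid) {
        ElemState &s = table_[addr];
        if (s.resident) return MissType::Hit;
        MissType t;
        if (!s.replaced)            t = MissType::Compulsory;      // Never Referenced
        else if (s.replacer == pid) t = MissType::IntraProcessor;  // replaced by same processor
        else                        t = MissType::InterProcessor;  // replaced by a different processor
        s.resident = true;          // block is (re)fetched into the cache
        return t;
    }
    // Called when processor `pid` evicts block `addr` from the cache.
    void evict(uint64_t addr, int pid) {
        ElemState &s = table_[addr];
        s.resident = false;
        s.replaced = true;
        s.replacer = pid;           // tag with the replacing processor, as in Figure 2
    }
};
```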

2. Set Pinning

CMPs with private caches for each processor have the advantage that there are no inter-processor misses. However, because a much smaller fraction of the aggregate L2 cache capacity is available to each individual processor, there can be a significant increase in intra-processor misses (all cache misses in a private cache are intra-processor misses) compared to a shared L2 cache of much larger (aggregate) capacity. This motivates shared caches in CMPs.

Set pinning is based on two crucial observations about the characteristics of non-compulsory misses in the shared cache. Experimental observations show that a small fraction of distinct memory addresses, a few hot blocks, is responsible for most inter-processor misses, and that these hot blocks are accessed over and over again, i.e., they are also frequently accessed. Set pinning exploits these two observations by preventing the large number of references to the few hot blocks responsible for inter-processor misses from evicting L2 cache blocks. Instead, these hot blocks are stored in very small regions of the L2 cache, each confined to be written by an individual processor, called Processor Owned Private (POP) caches.

Set pinning is a cache management scheme in which every processor acquires replacement ownership of a certain number of sets in the shared cache. Only the processor that has replacement ownership of the set being accessed can replace entries in that set. Therefore, under set pinning, references that could potentially cause an inter-processor miss, i.e., references from different processors, can never evict each other's blocks even if they index to the same set. The small number of hot blocks that would have been responsible for inter-processor misses is stored instead in the POP cache of the processor that first references them.

The block diagram of the proposed set pinning architecture for the L2 cache is shown in Figure 3. As seen in the figure, a field in each cache set stores the identifier of the current owner processor of the set; this field is log n bits per set for n processors. The large shared L2 cache is organized into a large set-pinned cache along with small POP caches, one per processor. On a cache hit, the set-pinned L2 cache behaves exactly like a traditional shared cache. Lookup also happens in parallel in all the POP caches; on a cache miss in the set-pinned L2 cache, if there is a hit in any of the POP caches, the cache request is satisfied from the POP cache.
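The organization described above, a per-set owner field plus one small POP cache per core probed alongside the indexed set, can be sketched as follows. This is our own illustrative model, not the paper's hardware design; the structure names, the associativity (kWays), and the POP cache size (kPopEntries) are assumptions.

```cpp
// Sketch of the set-pinned L2 organization: each set carries an owner field
// (-1 means unowned) and each processor has a small Processor Owned Private
// (POP) cache that is looked up in parallel with the indexed set.
#include <cstdint>
#include <vector>

constexpr int kWays = 8;          // assumed associativity
constexpr int kPopEntries = 16;   // assumed POP cache size per core

struct CacheLine { uint64_t tag = 0; bool valid = false; };

struct PinnedSet {
    CacheLine ways[kWays];
    int owner = -1;               // identifier of the owning processor, -1 if unowned
};

struct SetPinnedL2 {
    std::vector<PinnedSet> sets;                  // the large set-pinned cache
    std::vector<std::vector<CacheLine>> pop;      // one POP cache per core

    SetPinnedL2(int num_sets, int num_cores)
        : sets(num_sets), pop(num_cores, std::vector<CacheLine>(kPopEntries)) {}

    // In hardware the indexed set and all POP caches are probed in parallel;
    // here we simply check the set first, then the POP caches.
    bool lookup(uint64_t set_index, uint64_t tag) const {
        for (const CacheLine &l : sets[set_index].ways)
            if (l.valid && l.tag == tag) return true;       // hit in the set-pinned L2
        for (const auto &pop_cache : pop)
            for (const CacheLine &l : pop_cache)
                if (l.valid && l.tag == tag) return true;   // hit in a POP cache
        return false;                                       // L2 cache miss
    }
};
```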

Figure 3. Block diagram explaining the working of the set pinning architecture.

3. Cache Miss Policy

When there is no tag match in either the indexed set of the set-pinned L2 cache or any of the POP caches, it is an L2 cache miss. There are three possible cases under which a cache miss can occur:

1. The indexed set in the L2 cache is not owned by any processor.
2. The indexed set in the L2 cache is owned by the processor responsible for the current reference.
3. The indexed set in the L2 cache is owned by a processor other than the one responsible for the current reference.

In case 1, the indexed set is in its pristine state and is not owned by any processor. Therefore, in addition to bringing the referenced data from memory, the owner field also needs to be set. Consider the reference to A3 by P3 in Figure 4. Since the address indexes to a block in a set owned by no processor (indicated by o(xx)), the ownership of the set is claimed by P3 and the referenced data is brought into the set from memory.

In case 2, the indexed set is used as in traditional cache schemes, i.e., the data from memory evicts the least recently used cache block in that set. Intra-processor misses can be significantly reduced because the number of references eligible to evict data is reduced (limited to references from the owner processor). The reference to A2 by P2 in Figure 4 is an example of this case. Since the address indexes to a set owned by P2, an LRU block is evicted from the indexed set and replaced by the referenced data.

In case 3, where the indexed set is owned by a processor different from the one making the reference, data cannot be evicted from the indexed set. This reduces inter-processor misses, which could only occur if such an eviction had taken place. The POP cache owned by the referencing processor is instead used to store the referenced data, by evicting the LRU entry from that processor's POP cache. Considering Figure 4 once more, let us focus on A1 referenced by P1, which indexes to a set owned by P3. In this case, P1 stores A1 in the POP-1 cache, by evicting a block selected using LRU from the POP-1 cache.
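The three cases can be summarized in code. The following is a minimal, self-contained sketch with our own names and a deliberately simplified LRU (a per-line timestamp); it is not the paper's hardware implementation.

```cpp
// Sketch of the set-pinning miss policy: claim an unowned set, replace
// normally in an owned set, or fall back to the requester's POP cache.
#include <cstdint>
#include <vector>

struct Line { uint64_t tag = 0; bool valid = false; uint64_t last_use = 0; };

struct Set { std::vector<Line> ways; int owner = -1; };   // -1 = unowned
struct Pop { std::vector<Line> entries; };                // one per core

// Pick a free way if any, otherwise the least recently used one.
static int victim_index(const std::vector<Line> &lines) {
    int victim = 0;
    for (size_t i = 0; i < lines.size(); ++i) {
        if (!lines[i].valid) return static_cast<int>(i);
        if (lines[i].last_use < lines[victim].last_use) victim = static_cast<int>(i);
    }
    return victim;
}

// Handle an L2 miss by processor `pid` for block `tag` indexing into `set`.
void handle_l2_miss(Set &set, std::vector<Pop> &pop, int pid,
                    uint64_t tag, uint64_t now) {
    if (set.owner == -1) {
        // Case 1: pristine set -- claim ownership, then fill as usual.
        set.owner = pid;
        set.ways[victim_index(set.ways)] = {tag, true, now};
    } else if (set.owner == pid) {
        // Case 2: the requester owns the set -- evict the set's LRU block.
        set.ways[victim_index(set.ways)] = {tag, true, now};
    } else {
        // Case 3: the set is owned by another processor -- leave it intact and
        // place the block in the requester's own POP cache (LRU replacement).
        pop[pid].entries[victim_index(pop[pid].entries)] = {tag, true, now};
    }
}
```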

4. Adaptive Set Pinning

The simple first-come, first-served ownership policy favors the processors that first access a set. This can lead to the acquisition of a large number of sets by the first few active processors. Such domination of set ownership by a few processors leaves few sets for the other processors and hence overstresses the POP caches of those processors. In the adaptive set pinning scheme, the authors overcome this limitation with a policy in which processors dynamically relinquish ownership of sets. The relinquishing of a set by its owner processor is based on a confidence counter for each set, which indicates the confidence of the system that the current processor should remain the owner of the set. The confidence counter is a saturating counter that is incremented on every cache hit to the set. It is decremented for every reference that (i) indexes into the set, (ii) is made by a processor that is not the owner of the set, and (iii) is a cache miss (the reference misses in the indexed cache set as well as in the POP caches). A cache miss due to a reference indexing into a set owned by the processor making the reference does not change the value of the confidence counter. When the confidence counter becomes zero, the processor identity bits are cleared and the set again becomes available to the next processor whose reference indexes into it. The flow chart for the logic of the adaptive set pinning scheme is shown in Figure 5.

Figure 5. Flowchart explaining the logic of the adaptive set pinning scheme.
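The counter updates summarized in the flowchart might look like the following sketch. This is our own rendering of the rules described above; the counter width (kMaxConfidence) is an assumed value, not one specified in the paper.

```cpp
// Sketch of the saturating confidence counter that drives ownership
// relinquishment in adaptive set pinning.
struct AdaptiveSet {
    int owner = -1;          // -1 = unowned, otherwise the owning processor id
    int confidence = 0;      // saturating confidence counter
};

constexpr int kMaxConfidence = 7;   // assumed 3-bit saturating counter

// Cache hit in this set: reinforce confidence in the current ownership.
void on_set_hit(AdaptiveSet &s) {
    if (s.confidence < kMaxConfidence) ++s.confidence;
}

// Miss by processor `pid` that indexes into this set and also misses in the
// POP caches. Misses by the owner leave the counter unchanged; misses by
// non-owners weaken it, and at zero the set is released.
void on_set_miss(AdaptiveSet &s, int pid) {
    if (s.owner == -1) { s.owner = pid; return; }   // unowned: claim it, as in set pinning
    if (pid == s.owner) return;                     // owner miss: counter unchanged
    if (s.confidence > 0) --s.confidence;
    if (s.confidence == 0) s.owner = -1;            // relinquish: the next indexing reference may claim it
}
```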

5. Proposed Scheme

Adaptive set pinning does not consider the identity of the processor to which ownership should be transferred when a set is relinquished. We propose to work on a scheme that gives priority, when transferring ownership, to the processor expected to make the largest number of references to the set. This should improve the performance of the shared cache architecture by reducing the overall number of cache misses and lead to fairer cache partitioning among processors.
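As a rough illustration of the direction we intend to explore (entirely our own sketch, not a finished design), ownership transfer could be guided by per-processor reference counts kept for each set, so that a relinquished set is handed to its heaviest recent user rather than to whichever processor happens to index into it next.

```cpp
// Sketch: when a set's confidence counter reaches zero, transfer ownership to
// the processor that referenced the set most often in the recent window.
#include <array>
#include <algorithm>

constexpr int kNumCores = 4;   // assumed core count

struct TrackedSet {
    int owner = -1;
    int confidence = 0;
    std::array<int, kNumCores> ref_count{};   // recent references per processor
};

// Record a reference by `pid`; counts could be periodically halved to age them.
void note_reference(TrackedSet &s, int pid) { ++s.ref_count[pid]; }

// Called when the confidence counter hits zero: prefer the heaviest user.
void transfer_ownership(TrackedSet &s) {
    auto it = std::max_element(s.ref_count.begin(), s.ref_count.end());
    s.owner = static_cast<int>(it - s.ref_count.begin());
    s.confidence = 1;        // give the new owner a fresh start
    s.ref_count.fill(0);     // reset the counting window
}
```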

References

[1] L. Hsu, R. Iyer, S. Makineni, S. Reinhardt, and D. Newell. Exploring the cache design space for large scale CMPs. SIGARCH Computer Architecture News, 33(4):24-33, 2005.
[2] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance Runtime Mechanism to Partition Shared Caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, Orlando, USA, December 2006, pp. 423-432.
[3] E. Speight, H. Shafi, L. Zhang, and R. Rajamony. Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA'05), Madison, Wisconsin, USA, June 2005, pp. 346-356.
[4] J. Chang and G. S. Sohi. Cooperative Cache Partitioning for Chip Multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing (ICS'07), Seattle, USA, 2007, pp. 242-252.
[5] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, S. Yan, L. Hsu, and S. Reinhardt. QoS Policies and Architecture for Cache/Memory in CMP Platforms. In Proceedings of the ACM SIGMETRICS 2007 International Conference on Measurement and Modeling of Computer Systems, San Diego, USA, June 2007, pp. 23-24.
[6] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, Paris, 2004.
[7] B. M. Beckmann and D. A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, Orlando, 2006.
[8] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, Fourth Edition. Morgan Kaufmann, 2006.
[9] S. Srikantaiah and M. Kandemir. Adaptive Set Pinning: Managing Shared Caches in Chip Multiprocessors. 2008.
