MBZip: A Case for Compressing Multiple Data Blocks
Raghavendra K, Biswabandan Panda, and Madhu Mutyam
Department of CSE, IIT Madras
Problem Definition
To propose a framework that compresses multiple data blocks into one single block (a zip) at both the LLC and the DRAM.
Our Solution: MBZip, in which multiple consecutive blocks that share a common data pattern are compressed together into a single zipped block using BDI [1], so that the whole group needs only one set of encoding bits and one common base in total.
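The sharing idea above can be sketched in a few lines. This is a software model, not the authors' hardware: it treats a 64B block as eight 8-byte words and tries to express every word of every candidate block as one shared base plus a small delta, in the BDI [1] style. The function names and the choice of the first word as the base are illustrative assumptions.

```python
# Sketch (not the authors' RTL): BDI-style zipping of consecutive blocks.
# If every word of every candidate block fits in (base + small delta),
# the whole group shares ONE base and ONE encoding, as MBZip proposes.

def bdi_deltas(words, base, delta_bytes):
    """Return per-word deltas if each fits in delta_bytes (signed), else None."""
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return deltas
    return None

def try_zip(blocks, delta_bytes=1):
    """Try to zip blocks (each a list of 8-byte words) around one shared base."""
    base = blocks[0][0]                     # shared base (an assumption)
    zipped = []
    for blk in blocks:
        deltas = bdi_deltas(blk, base, delta_bytes)
        if deltas is None:
            return None                     # this group is not zippable
        zipped.append(deltas)
    return {"base": base, "delta_bytes": delta_bytes, "deltas": zipped}

def zipped_size(z):
    # one shared 8-byte base + one delta array per member block
    return 8 + sum(len(d) * z["delta_bytes"] for d in z["deltas"])

# Two neighbouring blocks whose values cluster near one base:
b0 = [1000 + i for i in range(8)]
b1 = [1050 + i for i in range(8)]
z = try_zip([b0, b1])
```

With 1-byte deltas, two 64B blocks collapse to 8 + 16 = 24 bytes here, i.e. the pair fits in a single 64B block, which is exactly the opportunity MBZip exploits.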
[Figure: address bit layouts. Generic indexing: tag | index (log2(# sets)) | offset (log2(block size)). Zipped indexing: tag | index (log2(# sets)) | zip (log2(# zb)) | offset (log2(block size)). t: generic tag; T: zipped tag.]
• On average, more than 30% of the columns residing in a single page, when grouped together in groups of two to six columns, can be compressed into a single column.
[Figures: (a) generic cache, (b) compressed cache (BDI), and (c) zipped cache (MBZip), with 64-byte blocks; a generic 8KB DRAM page (columns b0, b1, b2, ...) versus a zipped DRAM page (MBZip).]
1. A block of data is stored either in uncompressed or in zipped format (at most 6 consecutive columns zipped into a single column).
Benchmark groups:
• bwaves, GemsFDTD, h264ref, mesa, zeusmp, calculix, gromacs, sjeng
• bzip2, soplex, omnetpp, ammp, galgel
• leslie3d, mgrid, twolf, vortex2
• hmmer, lbm, mcf, milc
2. For each column, 8 bits of metadata (3 encoding bits and 5 valid bits) are stored in a reserved DRAM space; a metadata cache holds the metadata of frequently used rows.
3. Using MBZip-M, multiple block requests can be serviced with a single read, which improves performance.
4. The same block of data might be present in 6 different columns. This replication does not change the generic DRAM address mapping, apart from reserving space for metadata.
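The per-column metadata byte described in point 2 can be modeled directly. The 3-bit encoding and 5-bit valid fields come from the poster; the field order within the byte is an assumption for illustration.

```python
# Sketch of the per-column metadata byte: 3 encoding bits plus 5 valid
# bits packed into 8 bits. The [enc:3 | valid:5] layout is an assumption.

def pack_meta(encoding, valid_bits):
    assert 0 <= encoding < 8 and len(valid_bits) == 5
    v = 0
    for i, bit in enumerate(valid_bits):
        v |= (bit & 1) << i
    return (encoding << 5) | v          # high 3 bits: encoding

def unpack_meta(meta):
    return meta >> 5, [(meta >> i) & 1 for i in range(5)]

m = pack_meta(0b101, [1, 0, 1, 1, 0])
enc, valid = unpack_meta(m)
```

One such byte per 64B column means the reserved metadata region costs well under 2% of DRAM capacity, which is why caching only the metadata of hot rows is plausible.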
Workloads are classified as Zip Friendly (ZF), Cache Sensitive (CS), both ZF and CS, or neither.
• ZF: more than 20% of the blocks can be zipped into a single block.
• CS: the performance improvement from growing the LLC from 1MB to 2MB is greater than 10%.
• 70 4-core and 25 8-core workload mixes.
• gem5 simulator; LLC of 4MB/8MB for 4-/8-cores; DDR3 with 8KB page size; cache block size: 64B.
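The classification rules can be restated as a tiny helper. The 20% and 10% thresholds are the poster's; the function name and argument names are hypothetical.

```python
# Hypothetical helper restating the workload classification rules;
# thresholds come from the poster, everything else is illustrative.

def classify(zippable_fraction, speedup_1mb_to_2mb):
    zf = zippable_fraction > 0.20        # Zip Friendly
    cs = speedup_1mb_to_2mb > 0.10       # Cache Sensitive
    if zf and cs:
        return "ZF and CS"
    if zf:
        return "ZF"
    if cs:
        return "CS"
    return "Neither ZF nor CS"
```

For example, a workload whose blocks are 30% zippable but gains only 5% from a larger LLC is ZF only; one that is 10% zippable but gains 20% is CS only.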
Results
• Harmonic Speedup (HS) compared to a system with no compression. HS for 4-core: 15.4% and 21.9% improvement by MBZip-C and MBZip-CM, respectively.
• Bandwidth reduction for 4-core (in terms of DRAM reads): 29.6% and 39.7% reduction by MBZip-C and MBZip-CM, respectively.
MBZip life cycle (MBZip-CM)
[Bar charts per benchmark: ammp, bwaves, bzip2, calculix, galgel, GemsFDTD, gromacs, h264ref, hmmer, lbm, leslie3d, mcf, mesa, mgrid, milc, omnetpp, sjeng, soplex, twolf, vortex2, zeusmp, and average. Figure: a generic DRAM page with 8-byte words.]
Opportunity at the DRAM
Zipping at the DRAM (MBZip-M)
• On average, around 25% of the cache blocks, when grouped together in groups of two to eight blocks, can be compressed into a single cache block.
1. Similar to BDI, MBZip doubles the number of tags per set. For zipped lookups, the index bits are shifted up by log2 of the maximum number of blocks a zipped block is allowed to hold.
2. Half of the tags retain the generic index function, whereas the remaining tags employ the zipped index function.
3. Each block in a zipped block has its own set of coherence bits.
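The two index functions implied by point 1 can be written down concretely. This is a sketch assuming a 64B block, 1024 sets, and up to 8 blocks per zipped block (the last two values are illustrative, not taken from the poster).

```python
# Sketch of the generic vs. zipped set-index functions.
BLOCK_BITS = 6    # 64-byte blocks
ZIP_BITS   = 3    # up to 8 blocks per zipped block -- assumption
SET_BITS   = 10   # 1024 sets -- illustrative

def generic_index(addr):
    # Index bits sit directly above the block offset.
    return (addr >> BLOCK_BITS) & ((1 << SET_BITS) - 1)

def zipped_index(addr):
    # Index bits are shifted up by ZIP_BITS, so an aligned group of 8
    # consecutive blocks maps to one set and can share a zipped block.
    return (addr >> (BLOCK_BITS + ZIP_BITS)) & ((1 << SET_BITS) - 1)

group = [0x12400 + i * 64 for i in range(8)]   # 512B-aligned group
```

Under generic indexing the eight consecutive blocks land in eight different sets; under zipped indexing they all land in one, which is what lets a single zipped block (and a single tag T) cover the group.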
ZF: Zip Friendly; CS: Cache Sensitive (t: generic tag, T: zipped tag)
Opportunity at the LLC
Zipping at the LLC (MBZip-C)
Evaluation
1. Handling multiple types of cache blocks: uncompressible (64B), compressible (< 64B), and zipped (8B to 64B).
2. Accessing these blocks (including the ones residing in a zipped block) without incurring additional latency.
3. Mapping fixed-size virtual pages to variable-size compressed DRAM pages.
Motivation
1. Existing compression techniques, such as BDI [1] and LCP [2], compress a single cache block or DRAM column independently.
2. Applications exhibit data locality that spreads across multiple consecutive data blocks.
Challenges
[Figure: MBZip-CM life cycle. A request reaches the LLC and a zipped/uncompressed block is returned in response; zipped/uncompressed dirty blocks flow from the LLC to a write buffer and then to the DRAM write queue; DRAM valid bits are updated; reads are serviced through the read queue; pages enter DRAM from secondary storage.]
1. A page brought into DRAM is stored in the zipped format.
2. Either an uncompressed block or a zipped block is transferred to the cache along with the corresponding valid bits, and a generic or zipped indexing function is chosen accordingly.
3. When a zipped block containing dirty data is evicted from the cache, the entire block is written to the write buffer and from there to the write queue. (The zipped block might contain other clean blocks.)
4. This dirty zipped block is written back to the DRAM and the valid bits are updated. The valid bits of the previous five columns are also updated so that stale data is not serviced.
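Step 4's valid-bit maintenance can be sketched as follows. Since a block may be replicated in up to 6 columns, a writeback must also invalidate stale copies in the preceding columns. The data structure (one 5-bit valid mask per column) and the blanket invalidation policy are simplifying assumptions for illustration, not the authors' exact mechanism.

```python
# Sketch of step 4: install a dirty zipped block at column `col` and
# invalidate the up-to-5 preceding columns that may hold stale copies
# of its member blocks. Names and structures are illustrative.

def writeback_zipped(page_valid, col, new_valid):
    """page_valid: one 5-bit valid mask per 64B column of a DRAM page."""
    page_valid[col] = new_valid
    for prev in range(max(0, col - 5), col):
        page_valid[prev] = 0       # stale copies must not be serviced
    return page_valid

pv = writeback_zipped([0b11111] * 8, 7, 0b10111)
```

Earlier columns outside the 5-column window keep their valid bits, so only the region that could actually alias the written-back blocks is touched.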
[Plots: normalized WS and HS for the 4-core and 8-core systems, comparing BDI, MBZip-C, and MBZip-CM.]
References
[1] Pekhimenko et al., "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches", PACT 2012, pp. 377-388.
[2] Pekhimenko et al., "Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework", MICRO 2013, pp. 172-184.