MBZip: A Case for Compressing Multiple Data Blocks
Raghavendra K, Biswabandan Panda and Madhu Mutyam, Department of CSE, IIT Madras

Problem Definition

To propose a framework that can compress multiple data blocks into one single block (zip) at the LLC and at the DRAM.

Our Solution: MBZip, where multiple consecutive blocks that share a common data pattern are compressed together into a single zipped block using BDI [1], and thus need only one set of encoding bits and one common base in total.
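The core check can be sketched as follows. This is a minimal illustration of BDI-style zipping, not the paper's exact encoding; the 8-byte word granularity and the 1-byte delta width are assumptions for the sketch:

```python
def try_zip(blocks, delta_bytes=1):
    """Attempt to zip consecutive 64-byte blocks into one compressed block:
    one shared 8-byte base plus one small delta per 8-byte word (BDI-style).
    Returns the zipped size in bytes, or None if the blocks do not fit."""
    words = []
    for blk in blocks:
        assert len(blk) == 64, "64-byte cache blocks assumed"
        words += [int.from_bytes(blk[i:i + 8], "little") for i in range(0, 64, 8)]
    base = words[0]                      # the single common base
    limit = 1 << (8 * delta_bytes - 1)   # signed delta range
    if all(-limit <= w - base < limit for w in words):
        # one 8-byte base + one delta per word; encoding bits not counted here
        return 8 + delta_bytes * len(words)
    return None
```

Two blocks of nearby values, for example, zip from 128 bytes down to 24, while blocks with widely spread values return None and stay uncompressed.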

[Figure: address decomposition. Generic cache: address = tag | index (log2(# sets) bits) | offset (log2(block size) bits). Zipped cache (MBZip): address = tag | index | zip (log2(# zb) bits) | offset, where the zip field selects a block within a zipped block. t: generic tag, T: zipped tag.]

• On average, more than 30% of the columns residing in a single page, when grouped together in groups of two to six columns, can be compressed into a single column.

[Figure: block layouts with 64-byte blocks. (a) Generic cache, (b) Compressed cache (BDI), (c) Zipped cache (MBZip); below, a generic 8kB DRAM page vs a zipped 8kB DRAM page (MBZip).]

Zipping at the DRAM (MBZip-M)

1. A block of data is stored either in uncompressed or in zipped format (at most 6 consecutive columns zipped into a single column).

Workloads, grouped by their ZF/CS classification: bwaves, GemsFDTD, h264ref, mesa, zeusmp, calculix, gromacs, sjeng | bzip2, soplex, omnetpp, ammp, galgel | leslie3d, mgrid, twolf, vortex2 | hmmer, lbm, mcf, milc.

2. For each column, 8 bits of metadata (3 encoding bits and 5 valid bits) are stored in a reserved DRAM space; a metadata cache stores the metadata of frequently used rows.
3. Using MBZip-M, we can service multiple block requests with a single read, and hence improve performance.
4. The same block of data might be present in 6 different columns. This replication does not change the generic DRAM address mapping, apart from reserving space for metadata.
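The per-column metadata byte can be packed as in this sketch. The bit ordering (encoding bits in the high 3 bits) is an assumption for illustration; the poster only fixes the 3+5 split:

```python
def pack_meta(encoding, valid):
    """Pack one column's metadata byte: 3 encoding bits and 5 valid bits.
    The high-bits-first layout is assumed, not taken from the poster."""
    assert 0 <= encoding < 8 and 0 <= valid < 32
    return (encoding << 5) | valid

def unpack_meta(meta):
    """Split a metadata byte back into (encoding, valid_bits)."""
    return meta >> 5, meta & 0x1F
```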

Evaluation

• ZF (Zip Friendly): more than 20% of the blocks can be zipped into a single block.
• CS (Cache Sensitive): performance improves by more than 10% when the LLC grows from 1MB to 2MB.
• Workloads fall into four groups: ZF only, CS only, both ZF and CS, and neither ZF nor CS.
• 70 4-core and 25 8-core workload mixes.
• gem5 simulator; LLC of 4MB/8MB for 4-/8-cores; DDR3 with 8KB page size; cache block size of 64B.

Results


• Harmonic Speedup (HS) compared to a system with no compression, for 4-core: 15.4% and 21.9% improvement by MBZip-C and MBZip-CM, respectively.
• Bandwidth reduction for 4-core (in terms of DRAM reads): 29.6% and 39.7% reduction by MBZip-C and MBZip-CM, respectively.




Opportunity at the LLC

• On average, around 25% of the cache blocks, when grouped together in groups of two to eight blocks, can be compressed into a single cache block.


Zipping at the LLC (MBZip-C)

1. Similar to BDI, MBZip doubles the number of tags per set. The index bits are shifted by log2 of the maximum number of blocks a zipped block is allowed to hold.
2. Half of the tags retain the generic index function, whereas the remaining tags employ the zipped index function.
3. Each block in a zipped block has its own set of coherence bits.
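The two index functions can be sketched as follows; the set count, block size, and a maximum of 8 blocks per zipped block are illustrative parameters, not the paper's configuration:

```python
def generic_index(addr, num_sets=8192, block_size=64):
    """Generic set index: the bits just above the block offset."""
    off_bits = (block_size - 1).bit_length()
    return (addr >> off_bits) & (num_sets - 1)

def zipped_index(addr, num_sets=8192, block_size=64, max_zb=8):
    """Zipped set index: shifted further by log2(max_zb) 'zip' bits, so up
    to max_zb consecutive blocks map to the same set and can share a tag."""
    off_bits = (block_size - 1).bit_length()
    zip_bits = (max_zb - 1).bit_length()
    return (addr >> (off_bits + zip_bits)) & (num_sets - 1)
```

With these parameters, eight consecutive 64-byte blocks share one zipped index even though their generic indices all differ.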


Challenges

1. Handling multiple types of cache blocks: uncompressible (64B), compressible (< 64B), and zipped (8 to 64B).
2. Accessing these blocks (including the ones residing in a zipped block) without incurring additional latency.
3. Mapping fixed-size virtual pages to variable-size compressed DRAM pages.

Motivation

1. Existing compression techniques such as BDI [1] and LCP [2] compress a single cache block/DRAM column independently.
2. Applications exhibit data locality that spreads across multiple consecutive data blocks.


MBZip life cycle (MBZip-CM)

[Figure: life-cycle flow. A request reaches the LLC and a zipped/uncompressed block response is returned; a zipped/uncompressed dirty block passes through the write buffer to the DRAM write queue; DRAM valid bits are updated; pages are filled from secondary storage via the read queue.]

1. A page brought into the DRAM is stored in the zipped format.
2. Either an uncompressed block or a zipped block is transferred to the cache along with the corresponding valid bits, and a generic or zipped indexing function is chosen accordingly.
3. When a zipped block containing dirty data is evicted from the cache, the entire block is written to the write buffer and from there to the write queue. (The zipped block might contain other clean blocks.)
4. This dirty zipped block is written back to the DRAM and the valid bits are updated. Also, the valid bits of the previous five columns are updated so that stale data is not serviced.
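Step 4's stale-copy handling can be sketched as below. Clearing the overlapping columns' valid bits wholesale is a simplifying assumption for illustration; the poster only states that the previous five columns' valid bits are updated:

```python
def writeback_zipped(valid, col, new_valid):
    """On writing a dirty zipped block back to column `col`, install its
    valid bits and clear those of the previous five columns, which may
    hold stale replicas of the same blocks (simplified sketch)."""
    valid[col] = new_valid
    for c in range(max(0, col - 5), col):
        valid[c] = 0  # assumed: invalidate any possibly stale copy
    return valid
```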

[Charts: normalized WS and HS for the 4-core and 8-core systems, comparing BDI, MBZip-C, and MBZip-CM.]

References

[1] Pekhimenko et al., “Base-delta-immediate compression: practical data compression for on-chip caches”, PACT 2012, pp. 377-388.
[2] Pekhimenko et al., “Linearly compressed pages: a low-complexity, low-latency main memory compression framework”, MICRO 2013, pp. 172-184.
