Block Size Optimization in Deduplication Systems

Cornel Constantinescu, Jan Pieper and Tiancheng Li
{cornel, jhpieper}@almaden.ibm.com, [email protected]
IBM Almaden Research Center, San Jose, California, USA

Data deduplication is a popular dictionary-based compression method in storage archival and backup. Deduplication efficiency improves with smaller chunk sizes; however, files then become highly fragmented, requiring many disk accesses during reconstruction or causing chattiness in a client-server architecture.

Within the sequence of chunks that an object (file) is decomposed into, sub-sequences of adjacent chunks tend to repeat. We exploit this insight to optimize chunk sizes by joining repeated sub-sequences of small chunks into new super chunks, under the constraint of achieving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence.

With super chunks we significantly reduce fragmentation, improving reconstruction time and the overall deduplication ratio by lowering the amount of metadata. Fewer chunks are used to represent a file, which reduces the number of disk accesses needed to reconstruct the file and requires fewer entries in the chunk dictionary and fewer hashes to encode a file.

To encode a sequence of chunks we proved two facts: (1) any sub-sequence that repeats is part of some super chunk (a supermaximal in Figure 1), so the dictionary contains just supermaximals and non-repeats; and (2) maximals are the only repeats not covered (overlapped) by the containing supermaximals, so once we discover the maximals with the suffix array and encode them, there is no need to maintain auxiliary data structures such as bit masks to guarantee the coverage (encoding) of the entire object.
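The idea of finding repeated chunk sub-sequences with a suffix array can be illustrated with a minimal sketch. It is not the authors' supermaximal/maximal encoding: it builds a naive suffix array over a sequence of chunk identifiers, uses the longest-common-prefix values of adjacent suffixes to pick the single longest repeat, and joins its non-overlapping occurrences into one super chunk. The function names and the token "S0" are hypothetical, chosen only for illustration.

```python
def suffix_array(seq):
    # Naive construction: sort suffix start positions by suffix content.
    # Fine for illustration; real systems would use a linear-time builder.
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def lcp_array(seq, sa):
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b, n = sa[k - 1], sa[k], 0
        while a + n < len(seq) and b + n < len(seq) and seq[a + n] == seq[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def longest_repeat(seq):
    # The longest repeated sub-sequence is the largest LCP between
    # two suffixes that are adjacent in the suffix array.
    if not seq:
        return ()
    sa = suffix_array(seq)
    lcp = lcp_array(seq, sa)
    k = max(range(len(sa)), key=lambda i: lcp[i])
    return tuple(seq[sa[k]:sa[k] + lcp[k]])


def join_into_super_chunk(seq, repeat, name):
    # Replace non-overlapping, left-to-right occurrences of `repeat`
    # with a single super chunk token `name` (hypothetical identifier).
    out, i, m = [], 0, len(repeat)
    while i < len(seq):
        if m > 0 and tuple(seq[i:i + m]) == repeat:
            out.append(name)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out


# Each letter stands for the hash of one small chunk.
chunks = list("ABCDXABCDYABCD")
repeat = longest_repeat(chunks)
encoded = join_into_super_chunk(chunks, repeat, "S0")
print(repeat, len(chunks), len(encoded))
```

In this toy sequence the repeat A B C D becomes one super chunk, so the file is described by 5 references instead of 14. A full encoder would repeat the process for all supermaximal and maximal repeats, as the two facts above describe.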
Our experimental evaluation (Figure 2) yields a reduction in fragmentation of 89% to 97% and a reduction in dictionary metadata (number of entries) of 80% to 97%, without increasing the total amount of storage required for unique chunks.

[Figure 1: Suffix Tree Example. Nodes are annotated with position (offset) and left-value (l); node markers: SM = Supermaximal, M = Maximal, NR = Non-Repeat.]

[Figure 2: Disk Fragmentation Reduction. x-axis: Chunk Size (Bytes): 128, 512, 4096, 16384; y-axis: Relative Improvement (percent); series: Original Disk Accesses (Reference), New % of Disk Accesses.]
