Block Size Optimization in Deduplication Systems

Cornel Constantinescu, Jan Pieper and Tiancheng Li
{cornel, jhpieper}@almaden.ibm.com, [email protected]
IBM Almaden Research Center, San Jose, California, USA

Data deduplication is a popular dictionary-based compression method in storage archival and backup. Deduplication efficiency improves for smaller chunk sizes; however, the files become highly fragmented, requiring many disk accesses during reconstruction, or excessive round trips ("chattiness") in a client-server architecture. Within the sequence of chunks that an object (file) is decomposed into, sub-sequences of adjacent chunks tend to repeat. We exploit this insight to optimize chunk sizes by joining repeated sub-sequences of small chunks into new "super chunks", under the constraint of achieving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence.

With super chunks we significantly reduce fragmentation, improving reconstruction time, and we improve the overall deduplication ratio by lowering the amount of metadata. As a result, fewer chunks are used to represent a file, reducing the number of disk accesses needed to reconstruct it, and fewer entries are required in the chunk dictionary, along with fewer hashes to encode a file.

To encode a sequence of chunks we prove two facts: (1) any subsequence that repeats is part of some super chunk (a supermaximal repeat in Figure 1), so the dictionary contains only supermaximals and non-repeats; and (2) maximal repeats are the only repeats not covered (overlapped) by their containing supermaximals, so once we discover the maximals with the suffix array and encode them, there is no need to maintain auxiliary data structures, such as bit masks, to guarantee coverage (encoding) of the entire object.

Our experimental evaluation (Figure 2) yields a reduction in fragmentation between 89% and 97% and a reduction of dictionary metadata (number of entries) between 80% and 97%, without increasing the total amount of storage required for unique chunks.
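The suffix-array approach described above can be illustrated with a minimal sketch. This is not the authors' implementation: it treats a file as a sequence of chunk IDs, builds a (toy, quadratic-time) suffix array and LCP array over that sequence to find repeated sub-sequences of adjacent chunks, and then keeps only repeats not contained in a longer repeat as a simple proxy for the supermaximal repeats that become super chunks. All function names are illustrative.

```python
# Illustrative sketch (not the paper's code): a file is viewed as a sequence
# of chunk IDs; a suffix array over that sequence exposes repeated
# sub-sequences of adjacent chunks, the candidates for super chunks.

def suffix_array(seq):
    """Indices of all suffixes of seq in lexicographic order (toy O(n^2 log n) version)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def lcp_array(seq, sa):
    """lcp[k] = length of the common prefix of the (k-1)-th and k-th sorted suffixes."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        i, j, n = sa[k - 1], sa[k], 0
        while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
            n += 1
        lcp[k] = n
    return lcp

def repeated_runs(seq, min_len=2):
    """All repeated sub-sequences of at least min_len adjacent chunks."""
    sa = suffix_array(seq)
    lcp = lcp_array(seq, sa)
    return {tuple(seq[sa[k]:sa[k] + lcp[k]])
            for k in range(1, len(sa)) if lcp[k] >= min_len}

def super_chunks(runs):
    """Keep only repeats not contained in a longer repeat (a proxy for supermaximals)."""
    def contained(a, b):
        return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))
    return {r for r in runs if not any(r != s and contained(r, s) for s in runs)}
```

For example, with chunk-ID sequence `list("abcabcxabc")`, `repeated_runs` finds both `('a','b','c')` and its contained repeat `('b','c')`, and `super_chunks` keeps only `('a','b','c')`; only the longer run would enter the dictionary as a super chunk.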
[Figure 1: Suffix Tree Example — a suffix tree over the chunk sequence, with each leaf recording its position (offset) and left-value, and internal nodes classified as Supermaximal (SM), Maximal (M), or Non-Repeat (NR).]
[Figure 2: Disk Fragmentation Reduction — relative improvement of the new number of disk accesses as a percentage of the original (reference) disk accesses, plotted against chunk size (128, 512, 4096, and 16384 bytes).]