Compressing Encrypted Data
ECE 559RB: Cryptography
Siva Theja Maguluri
Outline
• Introduction
• Distributed Source Coding
  – Lossless Compression: Slepian-Wolf
  – Compression with Fidelity Criterion: Wyner-Ziv
• Information Theoretic Security
• Compression of Encrypted Data
• Computer Simulations
  – Lossless compression of binary data
  – Lossy compression of real-valued data
• Conclusions

5/5/2009  Siva Theja Maguluri
Introduction
• Goal: transmit redundant data over an insecure, bandwidth-constrained channel
• Idea: reverse the usual order of encryption and compression
Introduction
• The compressor does not have access to the key
• At first glance it appears that little gain is possible, because encrypted data looks quite random
• But decompression and decryption are performed jointly, so the decrypter has access to the key
• It turns out that significant compression gains are achievable, using ideas from distributed source coding theory
• In some cases, the gains equal those of conventional compression followed by encryption
• Application: a scenario where data is being distributed over a network
Distributed Source Coding: Lossless
• Goal: compress sources Y and K that are correlated but cannot communicate with each other
• Lossless case: discrete sources
• Special case: K is available at the decoder and is correlated with Y
• Slepian-Wolf result – rate H(Y|K) suffices in both cases
Lossless Source Coding Example
• K known both at encoder and decoder
• Y and K uniformly distributed binary strings of length 3
• Y and K differ in at most one position, i.e., Hamming distance ≤ 1
• Encoder transmits the index of the error e = Y + K ∈ {000, 001, 010, 100} – 2 bits
Example Continued…
• K known only at the decoder
• The encoder can't find e, but that is not necessary
• Do not differentiate between words such as 000 and 111
• The cosets of the repetition code cover the entire space
• Use the index of the coset as the encoding – 2 bits again
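The coset structure above can be enumerated directly. A minimal sketch (the coset-leader names are our choice, not from the slides):

```python
# Enumerate the four cosets of the length-3 repetition code {000, 111}.
# Each coset pairs a word with its complement, so the encoder need not
# distinguish e.g. 000 from 111: the coset index (2 bits) suffices.

code = [0b000, 0b111]                        # the repetition code C
coset_leaders = [0b000, 0b001, 0b010, 0b100]

cosets = {leader: sorted({leader ^ c for c in code}) for leader in coset_leaders}
for leader, words in cosets.items():
    print(f"coset {leader:03b}: {[f'{w:03b}' for w in words]}")

# The four cosets partition all 8 binary triples.
covered = sorted(w for words in cosets.values() for w in words)
assert covered == list(range(8))
```

Four cosets means a 2-bit index, matching the rate achieved when K was known at both ends.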
Example Continued…
• Suppose X is a random variable taking values in {000, 001, 010, 100}, K is a one-time pad, and Y = X + K
• Then the Hamming distance between Y and K is at most 1
• The construction above compresses Y to 2 bits; since the decoder has access to K, it can recover Y and compute X = Y + K
• In the general case, partition the space into the cosets associated with the syndromes of an underlying channel code (the repetition code here)
• Encoding: compute the syndrome with respect to the channel code
• Channel code: chosen according to the correlation between Y and K
• Decoding: identify the codeword closest to K in the coset corresponding to the transmitted syndrome
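The full encode/decode pipeline for this toy case fits in a few lines. A sketch, assuming the standard parity-check matrix of the length-3 repetition code (the exhaustive check at the end is ours):

```python
# Syndrome-based compression with the length-3 repetition code. The
# encoder sends only the 2-bit syndrome of Y; the decoder, which knows
# K with Hamming distance(Y, K) <= 1, recovers Y as the word in that
# syndrome's coset closest to K.

H = [(1, 1, 0), (0, 1, 1)]        # parity-check matrix of {000, 111}

def bits(x):                       # 3-bit vector of integer x
    return [(x >> 2) & 1, (x >> 1) & 1, x & 1]

def syndrome(y):
    b = bits(y)
    return tuple(sum(h * v for h, v in zip(row, b)) % 2 for row in H)

def hamming(a, b):
    return bin(a ^ b).count("1")

def decode(s, k):
    # closest word to K within the coset that has syndrome s
    coset = [y for y in range(8) if syndrome(y) == s]
    return min(coset, key=lambda y: hamming(y, k))

# Check: for every Y and every K within Hamming distance 1, decoding is exact.
for y in range(8):
    for k in range(8):
        if hamming(y, k) <= 1:
            assert decode(syndrome(y), k) == y
print("all (Y, K) pairs with distance <= 1 decoded correctly")
```

Decoding is unambiguous because the two words in a coset are complements (distance 3 apart), while K is within distance 1 of Y.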
Distributed Source Coding: Lossy
• Wyner-Ziv extends Slepian-Wolf to lossy coding with a distortion measure
• Applies to discrete or continuous sources
• We will focus on the real line with mean-squared-error distortion
Compression with Fidelity Criterion – Example
• Y uniformly distributed on [−9δ/2, 9δ/2]
• Side information K such that |Y − K| < δ
• The encoder quantizes Y to Y′ with step size δ, so |Y − Y′| ≤ δ/2
• This can be thought of as three interleaved quantizers (cosets), each with step size 3δ
• Encoder transmits the label of the coset – log 3 bits
• By the triangle inequality, |Y′ − K| ≤ |Y′ − Y| + |Y − K| < δ/2 + δ = 3δ/2
Example Continued…
• The decoder finds the reconstruction level with the same label closest to K, which recovers Y′ (levels sharing a label are 3δ apart, and |Y′ − K| < 3δ/2, so the choice is unambiguous)
• log 3 bits give reconstruction within δ/2; in the absence of K, it would have taken log 9 bits
• Performance can be improved using more complex alternatives
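A numerical sketch of this scalar example, assuming δ = 1 and a cyclic 0/1/2 labeling of the nine levels (both assumptions ours, not from the slides):

```python
import random

# Nine quantizer levels at i*delta for i in -4..4; labels cycle 0,1,2,
# so levels sharing a label are 3*delta apart. The encoder sends only
# the label (log 3 bits); the decoder picks the level with that label
# nearest to the side information K.

delta = 1.0
levels = [i * delta for i in range(-4, 5)]           # step-size-delta quantizer
label = {lv: i % 3 for i, lv in enumerate(levels)}   # 3 interleaved cosets

def encode(y):
    y_q = min(levels, key=lambda lv: abs(lv - y))    # |Y - Y'| <= delta/2
    return label[y_q], y_q

def decode(lab, k):
    return min((lv for lv in levels if label[lv] == lab),
               key=lambda lv: abs(lv - k))

random.seed(0)
for _ in range(10_000):
    y = random.uniform(-4.5 * delta, 4.5 * delta)
    k = y + random.uniform(-delta, delta)            # side info: |Y - K| < delta
    lab, y_q = encode(y)
    assert decode(lab, k) == y_q                     # decoder recovers Y'
print("10000 trials: decoder always recovered the quantized value Y'")
```

The assertion holds for every trial precisely because of the triangle-inequality bound |Y′ − K| < 3δ/2 derived above.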
Information Theoretic Security
• General secret-key cryptosystem
• WLOG, a discrete i.i.d. source
• Block length n
• Key independent of the source, uniformly distributed
• Noiseless, insecure public channel
• Rate R in bits per symbol
Performance Measures
• Measures of secrecy against an eavesdropper:
  – Shannon-sense perfect secrecy: I(X; B) = 0
  – Wyner-sense perfect secrecy: lim_{n→∞} I(X; B)/n = 0
  – Maurer-sense perfect secrecy: lim_{n→∞} I(X; B) = 0
• Measure of fidelity at the decoder, i.e., expected distortion
• Number of bits per source symbol, R
• Number of bits of secret key per source symbol – cardinality of the key space
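The strongest notion, Shannon-sense perfect secrecy, can be verified numerically for a one-time pad. A toy sketch, assuming a skewed 1-bit source (the 0.9/0.1 distribution is an illustrative choice of ours):

```python
from collections import Counter
from itertools import product
from math import log2

# Empirical check that a one-time pad achieves I(X; B) = 0: with a
# uniform key K independent of X and ciphertext B = X xor K, the
# ciphertext distribution is independent of the source distribution.

p_x = {0: 0.9, 1: 0.1}            # skewed source
p_k = {0: 0.5, 1: 0.5}            # uniform key, independent of X

joint = Counter()
for (x, px), (k, pk) in product(p_x.items(), p_k.items()):
    joint[(x, x ^ k)] += px * pk  # B = X xor K

p_b = Counter()
for (x, b), p in joint.items():
    p_b[b] += p

mi = sum(p * log2(p / (p_x[x] * p_b[b])) for (x, b), p in joint.items())
print(f"I(X; B) = {mi:.9f} bits")  # numerically ~0
```

Replacing the uniform key with a biased one makes the mutual information strictly positive, which is exactly what the Shannon-sense criterion rules out.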
Tradeoff between performance parameters
• R_X(D) is the rate-distortion function
• The Shannon cryptosystem achieves these bounds
Compression of Encrypted Data
• Define a XOR operation ⊕ on a general alphabet, satisfying
  – commutativity: x ⊕ y = y ⊕ x
  – cancellation: x ⊕ z = y ⊕ z ⇒ x = y
• Reversed cryptosystem
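One concrete realization of such a ⊕ (an illustrative assumption, not specified on the slides) is addition modulo the alphabet size. A sketch checking both required properties exhaustively:

```python
# Addition modulo q as a "XOR" on the alphabet {0, ..., q-1}.
# Exhaustively verify commutativity and cancellation for a small q.

q = 5
xor = lambda x, y: (x + y) % q

for x in range(q):
    for y in range(q):
        assert xor(x, y) == xor(y, x)          # commutativity
        for z in range(q):
            if xor(x, z) == xor(y, z):
                assert x == y                  # cancellation
print(f"mod-{q} addition satisfies both XOR axioms")
```

Cancellation is what makes decryption possible: given B = X ⊕ K and K, the plaintext X is uniquely determined.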
15
Performance Limit
• It can also be shown that this is the best possible performance for any system with this structure
Performance Limit
• For finite alphabets, the stronger notion of Shannon-sense perfect secrecy can be guaranteed by sacrificing key efficiency (R′): let K be distributed uniformly over the alphabet of X
• How much compression can be achieved if the encryption scheme is pre-specified?
• When the source must be reproduced losslessly at the decoder, the Slepian-Wolf theorem shows one can compress down to the entropy rate of the unencrypted source without compromising security
Special Cases
Simulations – Lossless Compression of Binary Data
• Binary source with empirical entropy 0.37 bits per pixel
• Encrypted using a pseudorandom Bernoulli(1/2) string
• The encrypted data has empirical entropy 1 bit/pixel
• Incompressible without side information
• Compressed by computing the syndrome with respect to a rate-1/2 LDPC code
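A toy sketch of the encoder side of this pipeline. The random dense parity-check matrix below is a stand-in for the paper's LDPC code, and Bernoulli(0.07) (entropy ≈ 0.37 bits) stands in for the image source; both are assumptions for illustration:

```python
import random

# Encrypt a redundant binary source with a pseudorandom Bernoulli(1/2)
# pad, then compress the ciphertext to half length by keeping only its
# syndrome under a rate-1/2 parity-check matrix H.

random.seed(1)
n = 16
x = [1 if random.random() < 0.07 else 0 for _ in range(n)]  # redundant source
k = [random.randint(0, 1) for _ in range(n)]                # Bernoulli(1/2) pad
y = [xi ^ ki for xi, ki in zip(x, k)]                       # ciphertext

H = [[random.randint(0, 1) for _ in range(n)] for _ in range(n // 2)]
syndrome = [sum(h * yi for h, yi in zip(row, y)) % 2 for row in H]

print(f"{n} encrypted bits -> {len(syndrome)} syndrome bits (rate 1/2)")
```

The decoder (next slide) would recover y from this syndrome plus the key k, exploiting the fact that y ⊕ k = x is highly redundant.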
Simulations
• Modify the iterative LDPC decoding algorithm:
  – check nodes take the syndrome into account
  – messages are initialized with knowledge of the key and its correlation with the encrypted string
• Decryption is trivial after decoding
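The check-node modification can be sketched in isolation. This is our notation, not the paper's code: in log-likelihood-ratio belief propagation, a check node whose syndrome bit is 1 expects odd parity, which flips the sign of the standard update:

```python
from math import atanh, tanh

# Check-node update for syndrome decoding: the usual tanh-rule output,
# with its sign flipped when the transmitted syndrome bit is 1.

def check_node_update(incoming_llrs, syndrome_bit):
    prod = 1.0
    for llr in incoming_llrs:
        prod *= tanh(llr / 2.0)
    prod = max(min(prod, 0.999999), -0.999999)   # clamp for atanh stability
    out = 2.0 * atanh(prod)
    return -out if syndrome_bit else out          # sign flip when s = 1

# With confident (large, positive) inputs, s = 0 keeps the sign, s = 1 flips it.
llrs = [4.0, 5.0, 3.5]
print(check_node_update(llrs, 0) > 0, check_node_update(llrs, 1) < 0)
```

The variable-node side is unchanged; only the check constraints move from "even parity" to "parity equal to the syndrome bit."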
Simulations – Lossy Compression of Real-Valued Data
• i.i.d. Gaussian sequence with unit variance
• Encrypted using a stream cipher
• Key: i.i.d. Gaussian, independent of the data
• Each sample is quantized and the levels are labeled with 4 labels, giving a binary sequence of twice the original length
• Compressed by computing the syndrome with respect to a rate-1/2 trellis code – effectively 1 bit/sample
• The decoder finds the real-valued sequence closest to the key that yields the same syndrome
• It then combines this sequence with the key sequence to form an optimal estimate of the encrypted data
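A sketch of just the quantize-and-label front end of this pipeline (the step size and additive-Gaussian encryption model are illustrative assumptions; the trellis-code syndrome step is omitted):

```python
import random

# Each encrypted real sample maps to its nearest quantizer level, whose
# 2-bit label cycles through 4 values. This doubles the length in bits
# (2 bits/sample) before a rate-1/2 trellis-code syndrome would halve it
# back to 1 bit/sample.

random.seed(2)
delta = 0.5
samples = [random.gauss(0, 1) for _ in range(8)]      # unit-variance source
pad = [random.gauss(0, 1) for _ in range(8)]          # stream-cipher key
encrypted = [s + p for s, p in zip(samples, pad)]     # additive encryption

def quantize_label(v):
    idx = round(v / delta)        # nearest level idx * delta
    return idx % 4                # 4 cyclic labels -> 2 bits/sample

labels = [quantize_label(v) for v in encrypted]
bits = [b for lab in labels for b in divmod(lab, 2)]  # 2 bits per sample
print(f"{len(samples)} samples -> {len(bits)} label bits before syndrome")
```

The cyclic labeling plays the same role as the three interleaved quantizers in the scalar example earlier: levels sharing a label are far apart, so side information resolves the ambiguity.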
Conclusions
• We have seen the possibility of compressing encrypted data without knowledge of the key
• The approach is inspired by distributed source coding principles
• In some cases, encrypted data can be compressed to the same extent as the original unencrypted data
References
• M. Johnson, P. Ishwar, V. Prabhakaran, D. Schonberg, and K. Ramchandran, "On compressing encrypted data," IEEE Trans. Signal Processing, vol. 52, no. 10, pp. 2992–3006, Oct. 2004.
• S. S. Pradhan and K. Ramchandran, "Distributed source coding using syndromes (DISCUS): Design and construction," IEEE Trans. Inform. Theory, vol. 49, pp. 626–643, Mar. 2003.
• D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. Inform. Theory, vol. IT-19, pp. 471–480, July 1973.
• A. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. Inform. Theory, vol. IT-22, pp. 1–10, Jan. 1976.
Questions?
Linear Error Correcting Codes
• A code C is a linear subspace of a vector space over a finite field
• G, the generator matrix of the code, G = [I_k | A], of size k × n
• Maps any length-k vector x to the length-n codeword x^T G in C
• H, the parity-check matrix, H = [−A^T | I_{n−k}]
• Hx = 0 for x in C
• If z = x + e, then Hz = He, called the syndrome of z
• d: minimum Hamming distance between codewords
• Syndrome decoding of linear codes is efficient
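These definitions can be checked on a small code. A sketch for an [n = 6, k = 3] binary code; the matrix A below is an arbitrary illustrative choice, and over GF(2) the −A^T in H is just A^T:

```python
# Build G = [I_k | A] and H = [A^T | I_{n-k}] for a toy binary code,
# then verify Hx = 0 on every codeword and Hz = He for an error pattern.

A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
k, r = 3, 3                                   # r = n - k

G = [[int(i == j) for j in range(k)] + A[i] for i in range(k)]       # [I_k | A]
H = [[A[i][j] for i in range(k)] + [int(i == j) for i in range(r)]
     for j in range(r)]                                              # [A^T | I_r]

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def encode(msg):                              # codeword x = msg * G
    return [sum(msg[i] * G[i][j] for i in range(k)) % 2 for j in range(k + r)]

for m in range(8):                            # all 2^k messages
    msg = [(m >> 2) & 1, (m >> 1) & 1, m & 1]
    x = encode(msg)
    assert mat_vec(H, x) == [0, 0, 0]         # Hx = 0 on every codeword

e = [0, 0, 0, 0, 1, 0]                        # a single-bit error pattern
z = [(xi + ei) % 2 for xi, ei in zip(x, e)]
assert mat_vec(H, z) == mat_vec(H, e)         # syndrome depends only on e
print("Hx = 0 on all codewords; Hz = He verified")
```

The last assertion is the fact the whole talk rests on: the syndrome of z = x + e depends only on the coset (the error pattern), never on which codeword was sent.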