IMPLEMENTABLE SCHEMES FOR LOSSY COMPRESSION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Shirin Jalali December 2009

© Copyright by Shirin Jalali 2010

All Rights Reserved


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Tsachy Weissman)

Principal Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Thomas Cover)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Andrea Montanari)

Approved for the University Committee on Graduate Studies.


Dedicated to my parents.


Preface

In spite of all the achievements of information theory over the last 60 years, some of its basic questions, especially in the area of source coding, have not yet found satisfying answers. Finding an implementable scheme for universal lossy compression, which is part of the focus of this work, is a good example. Unlike universal lossless compression, for which several well-known practical algorithms exist, e.g., the famous 1978 work of Lempel and Ziv whose modified versions are still in use today, the situation for universal lossy compression is not so positive.

We propose two new paradigms for universal lossy compression. One takes advantage of Markov chain Monte Carlo combined with simulated annealing, and the other has the Viterbi algorithm at its core. Both algorithms try to approximate the solution of a universal exhaustive search scheme that has enormous computational complexity. Theoretically, the performance of both algorithms gets as close as desired to that of the original scheme, though we do not have a bound on the required computation as a function of the desired precision. However, our simulations on memoryless binary and first-order binary Markov sources show that the performance of both algorithms gets close to the rate-distortion curve with a number of computations linear in the block size.

We also consider the Wyner-Ziv (WZ) problem of lossy compression where the decompressor observes a noisy version of the source, whose statistics are unknown. A new family of WZ coding algorithms is proposed and their universal optimality is proven. Compression consists of sliding-window processing followed by Lempel-Ziv (LZ) compression, while the decompressor is based on a modification of the discrete universal denoiser (DUDE) algorithm to take advantage of side information. The new

algorithms not only universally attain the fundamental limits, but also suggest a paradigm for practical WZ coding. The effectiveness of our approach is illustrated with experiments on binary images and English text using a low-complexity algorithm motivated by our class of universally optimal WZ codes.

Finally, we investigate the problem of Multiple Description (MD) coding of discrete ergodic processes. We introduce the notion of MD stationary coding and characterize its relationship to conventional block MD coding. In stationary coding, in addition to the two rate constraints normally considered in the MD problem, we consider a third rate constraint which reflects the conditional entropy of the process generated by the third decoder given the reconstructions of the two other decoders. The relationship that we establish between stationary and block MD coding enables us to devise a universal algorithm for MD coding of discrete ergodic sources, based on simulated annealing ideas that were recently proven useful for the standard rate-distortion problem.


Acknowledgement

It is a pleasure for me to wholeheartedly thank those who made this thesis possible. First and foremost, I would like to thank my advisor, Prof. Tsachy Weissman. In my first quarter at Stanford, I took a course with Tsachy, and it took just a few sessions before I started hoping to join his research group. He has been a great mentor for me both in academia and in real life. His positive attitude, and his encouragement and support throughout my studies, made getting a Ph.D. an enjoyable experience for me. As someone not used to thinking out loud, it was while discussing my work with Tsachy that I gradually learned how to express my ideas in a coherent way. He was always patient, motivating, and encouraging.

My associate advisor, Prof. Thomas Cover, serves as a role model for me from different perspectives. Tom is one of the best teachers I have ever had. His information theory classes have been very enjoyable and helpful to me, and I have learned a lot from them, both about information theory and about excellent teaching style. Something I have always admired about Tom is that, despite all his honors and awards, he is genuinely humble. His fun weekly group meetings have been an inspiration for me throughout these years, and a place to get insight into a wide range of problems. I am grateful to him for all of this, and also for always being kind to me.

I would also like to thank Prof. Andrea Montanari. In my last year at Stanford, I had the opportunity of working with Andrea. It was an enjoyable experience, and I learned a lot from him. My gratitude also goes to Prof. Abbas El Gamal for always being kind to me, and also for letting me be the TA of his famous course on multi-user information theory. His well-structured course provided a great opportunity for me to study fundamental problems in information theory.

I would like to thank the current and former students of Tsachy and Tom: Shahriar, Kamakshi, Taesup, Stephani, Navid, Styrmir, George, Paul, Haim, Lei, and Gowtham. They made working at the office much more fun, and each of them helped me at some point during these years. I am also thankful to the ISL administrators Rashmi, Kelley, and Denise for always helping me out whenever I had a problem.

Studying at Stanford is a great opportunity for Persian students because of its warm, friendly, and supportive Persian community. They helped me endure being thousands of kilometers away from my family and country, and focus on my work. I made many good friends in this community; naming all of them would probably take half a page. I am grateful to all of them, especially to Shadi, Bita, Vahideh, Maryam, Narges, Neda, Mahdieh, Hedi, Parastoo, Neda, and Negin. To me, friends are gifts we receive throughout our lives, and I feel truly lucky to have had the opportunity to make a number of great friends at Stanford.

I would like to express how indebted I am to my parents for their unconditional love and support, and also for their everyday sacrifices throughout my life. I deeply appreciate how they always trusted me and let me experience whatever I decided to. Words cannot describe my love and respect for them. I would also like to thank my sister, Nooshin, who has always been there for me and listened patiently to my complaints. Her emotional support always eased my hard moments at Stanford. Although she is younger than me, I remember her encouraging me to study even when I was preparing for the university entrance exams in Iran. I am also thankful to her for cheering up our parents during these years that I have been away. I am also thankful to Arian's family for being helpful and supportive.

There are many wonderful things that I will never forget about Stanford. However, what makes Stanford extremely special to me is that it is the place where I got to know Arian. It was here that we became each other's best friends, started our journey of life together, and made many of our first pleasant memories. It is very hard for me to express how grateful I am to him, and how his love and companionship have transformed my life at Stanford.


Contents

Preface  vi
Acknowledgement  viii
1 Introduction  1
  1.1 Motivation and background  1
  1.2 Organization of this dissertation  3
2 Definitions and notation  5
3 Rate-distortion via MCMC  8
  3.1 An exhaustive search scheme for fixed-slope compression  10
  3.2 Universal lossy coding via MCMC  11
  3.3 SB coding via MCMC  14
  3.4 Simulation results  19
    3.4.1 Block coding  20
    3.4.2 Sliding-block coding  21
  3.5 Application: optimal denoising  27
    3.5.1 Experiments  29
4 Rate-distortion via Viterbi algorithm  35
  4.1 Linearized cost function  36
  4.2 Computing the coefficients  38
  4.3 Viterbi coder  40
  4.4 Approximating the coefficients  42
  4.5 Simulation results  44
5 Wyner-Ziv coding  51
  5.1 DUDE with fidelity boosting information  53
  5.2 Sliding block WZ coding  57
    5.2.1 Block coding  57
    5.2.2 Sliding-Block WZ compression  58
    5.2.3 WZ-DUDE algorithm  60
    5.2.4 Experimental results  63
  5.3 MCMC WZ-DUDE  77
    5.3.1 Count vectors and DUDE operator  79
    5.3.2 MCMC-based WZ encoder  80
    5.3.3 Simulation results  82
6 Multiple description coding  86
  6.1 Simple example  88
  6.2 Multiple description problem  89
    6.2.1 Block coding  90
    6.2.2 Stationary coding  90
  6.3 Universal multiple description coding  92
  6.4 Simulation results  95
7 Conclusions and future directions  99
A Proof of Theorem 1  102
B Proof of Theorem 2  105
C Proof of Theorem 3  114
D Stationarity condition  116
E Concavity of H(m)  118
F Proof of Theorem 4  120
G Proof of Theorem 6  127
H Proof of Theorem 7  136
I Bounding R(D)  139
J Proof of Theorem 11  147
Bibliography  149

List of Tables


List of Figures

1.1 Basic data compression setup  2
3.1 Each pixel's 6th-order 2-D context  21
3.2 Comparing Algorithm 1 performance with the optimal rate-distortion tradeoff for a Bern(p) i.i.d. source, p = 0.1 and β_t = n√⌈t/n⌉  22
3.3 Sample paths demonstrating evolutions of the empirical conditional entropy, average distortion, and energy function when Algorithm 1 is applied to the output of a Bern(p) i.i.d. source, with p = 0.1 and n = 5 × 10^4. The algorithm parameters are k = 10, α = 4, and β_t = 1.33n√⌈t/n⌉  23
3.4 Comparing the algorithm rate-distortion performance with the Shannon lower bound for a BSMS with q = 0.2, β_t = nαt^{1.5}/3  24
3.5 Applying Algorithm 1 to a 2-D binary image using the context shown in Fig. 3.1  25
3.6 Comparing the algorithm rate-distortion performance with the Shannon lower bound for a BSMS with q = 0.2. Algorithm parameters: n = 5 × 10^4, k = 8, k_f = 5 (K_f = 2^11), β_t = K_f α log(t + 1), and slope values α = 5.25, 5, 4.75, and 4.5  26
3.7 The 4th-order grid used for de-randomization in MCMC-based denoising  30
3.8 The 20th-order grid used for de-randomization in MCMC-based denoising  30
3.9 Comparing the denoiser based on MCMC coding plus derandomization with DUDE and the optimal non-universal Bayesian denoiser implemented via forward-backward dynamic programming. The source is a BSMS(p), and the channel is assumed to be a DMC with transition probability δ = 0.1. The DUDE parameters are k_left = k_right = 4, and the MCMC coder uses α = 0.9, β_t = 0.5 log t, r = 10n, n = 10^4, k = 7. The derandomization window length is 2 × 4 + 1 = 9  32
3.10 Applying the MCMC-based denoiser to a 2-D binary image  34
4.1 Iterative Viterbi lossy coder with I = 10 iterations [Bern(q) i.i.d. source, q = 0.25, α = 2, n = 2 × 10^4, k = 8, and L = 50]  46
4.2 Iterative Viterbi encoder [Bern(q) i.i.d. source, q = 0.3, n = 5 × 10^4, k = 8]  47
4.3 Reduction in the final cost after I = 4 iterations versus α [Bern(q) i.i.d. source, q = 0.25, n = 2 × 10^4, k = 8, and L = 50]  48
4.4 Viterbi encoder with the coefficients computed at m(x^n) [BSMS(q), q = 0.2, n = 2 × 10^4, k = 9, and L = 20]  49
4.5 Demonstrating the reduction in the cost, H_k(x̂^n) + αd_n(x^n, x̂^n), resulting from the iterative approach [BSMS(q), q = 0.2, n = 2 × 10^4, k = 8, and L = 20]  50
5.1 The basic setup of WZ coding  53
5.2 Original binary image, 0.59 b.p.p.  64
5.3 Noise-corrupted image, generated by passing the original signal through a BSC(0.25)  65
5.4 The FB image, y, generated by lossy JPEG coding of the original image, 0.36 b.p.p., d_n(x^n, y^n) = 0.0749  66
5.5 Output of DUDE for l_z = 2, d_n(x^n, x̂^n) = 0.1048  67
5.6 Output of the WZ-DUDE decoder for l_z = 2, l_y = 1, and R = 0.36 b.p.p., d_n(x^n, x̂^n) = 0.0444  68
5.7 Percentage of erasures that are recovered by the WZ-DUDE decoder versus the size of the FB alphabet Y, for k = 1 and k = 2, and e = 0.1  75
5.8 (100/e) P(X̂_i ≠ X_i) versus e, where P(X̂_i ≠ X_i) is the probability of error of the optimal denoiser that knows the source distribution, computed using (5.18), for a BSMS with q = 0.25 passing through an erasure channel with erasure probability e  76
5.9 Noisy image generated by the binary symmetric channel with q = 0.15  83
5.10 FB signal generated by Algorithm 4, d_n(x^n, y^n_FB) = 0.0441 [α = 0.01, k_z = 2, k_y = 1, and k = 7]  84
5.11 Reconstructed image by the WZ-DUDE algorithm, d_n(x^n, x̂^n) = 0.0279  85
6.1 Example setup  87
6.2 MD coding setup  89
6.3 Reduction in the cost. At the end of the process, the final achieved point is (H_k(y^n), H_k(z^n), H_{k,k_1}(w^n|y^n, z^n), d_n(x^n, y^n), d_n(x^n, z^n), d_n(x^n, w^n)) = (0.5503, 0.5586, 0.0038, 0.0505, 0.0483, 0.0036)  98

Chapter 1

Introduction

1.1 Motivation and background

Data compression, or, as it is known in information theory, source coding, deals with encoding data so that it can be represented with a smaller number of bits and retrieved later by a decoder upon request. Compression algorithms are categorized as lossless or lossy depending on whether the data is reconstructed perfectly or some discrepancies are introduced between the original and reconstructed versions. In today's digital world, data compression algorithms are ubiquitous; the applications range from sending voice and data over wireless cellular networks to storing images on hard drives. While a typical application of lossless compression algorithms is in compressing text files, lossy compression is not usually useful for coding text. Because the human visual and auditory systems are incapable of distinguishing many subtleties, a classic application of lossy compressors is in coding multimedia data. Video, image, and audio files are often coded with lossy compressors that, by introducing some hard-to-detect errors, significantly reduce the required rate.

The basic setup of a simple point-to-point source coding system is shown in Fig. 1.1. X = {X_i}_{i=1}^∞, a discrete-valued stationary ergodic process, denotes the source to be encoded, and hence compressed. Each source output block of length n, X^n, is mapped to an index of nR bits, f(X^n) ∈ {1, 2, . . . , 2^{nR}}. The index is sent to the decoder, which decodes it to a reconstruction block X̂^n = g(f(X^n)).

[Figure 1.1: Basic data compression setup. X^n → Encoder → i(X^n) ∈ {1, . . . , 2^{nR}} → Decoder → X̂^n]

The performance of a coding scheme C = (f, g, n, R) is measured by its expected average distortion between source and reconstruction blocks, i.e.,
\[
D = \mathrm{E}[d_n(X^n, \hat{X}^n)] \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}[d(X_i, \hat{X}_i)], \tag{1.1}
\]
where d : X × X̂ → R^+ is a per-letter distortion measure. Here X and X̂ denote the source and reconstruction alphabets respectively, which we assume are finite. For a given source X and D ≥ 0, the minimum required rate (cf. [1] for the exact definition of achievability) for achieving distortion D is denoted by R(X, D), and is characterized as [2], [3], [4]
\[
R(\mathbf{X}, D) = \lim_{n \to \infty} \min_{p_{\hat{X}^n|X^n}(\cdot|\cdot):\, \mathrm{E}\, d_n(X^n, \hat{X}^n) \le D} \frac{1}{n} I(X^n; \hat{X}^n). \tag{1.2}
\]

The initial achievability proofs in lossless and lossy source coding, as proposed by Shannon in [5], were mainly based on the following observation: if a source is compressible, then among the set of all possible sequences there is a small subset, called the set of typical sequences, in which the source output sequence lies with very high probability. More precisely, for a source of entropy rate H̄(X), which in almost all cases of interest is smaller than log₂|X|, for large enough n there is a set T_n of size ≈ 2^{nH̄(X)} such that, with probability close to one, the source output sequence belongs to this set. In other words, P(X^n ∈ T_n) → 1 as n → ∞. Therefore, the encoder can encode only the typical sequences and ignore the rest. The procedure of dealing only with typical sequences works for both the lossless and the lossy compression cases. The problem with this method is that it assumes that the encoder knows the distribution under which the source is generated. This assumption does not hold in many practical situations. Therefore, it is reasonable to ask how the lack of this information affects the achievable rate-distortion performance. Interestingly, it can be shown that, asymptotically, there is no loss in the rate-distortion performance due to the unavailability of this knowledge [6]. This result implies the existence of algorithms that perform optimally, i.e., close to the rate-distortion curve, for any stationary ergodic source; such schemes are called universal.

For the case of lossless compression (assuming a non-degenerate distortion measure), we know that the minimum required rate is the entropy rate of the source, i.e., R(X, 0) = H̄(X) ≜ lim_{k→∞} H(X_0|X_{-k}^{-1}). In this case, there are known implementable universal schemes, such as Lempel-Ziv coding [7] or arithmetic coding [8], that are able to losslessly describe any stationary ergodic source at rates as close to the entropy rate of the source as desired. In contrast to the situation for lossless compression, for D > 0, neither the explicit solution of (1.2) is known for a general source (not even for a first-order binary Markov source [9]), nor are there known practical schemes that universally achieve the rate-distortion curve. There has been a lot of work in the literature on designing universal lossy compressors [10–15]. While some of the proposed algorithms are interesting only from a theoretical point of view because of their enormous computational complexity [10–13], others are implementable yet sub-optimal [14, 15]. On the other hand, for the non-universal setting, specifically the case of lossy compression of an i.i.d. source with a known distribution, there is ongoing progress towards designing codes that get very close to the optimal performance [16–20].
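For an i.i.d. source, the limit in (1.2) reduces to the single-letter Shannon rate-distortion function, which can be evaluated numerically. The following sketch is not part of the original text: it uses the standard Blahut-Arimoto iteration to trace one point of the curve per slope value, and the function name and constants are illustrative only.

```python
import numpy as np

def blahut_arimoto_rd(p_x, dist, slope, n_iter=200):
    """Approximate one point of the i.i.d. rate-distortion curve.

    p_x   : source pmf over X (1-D array)
    dist  : |X| x |Xhat| distortion matrix d(x, xhat)
    slope : negative parameter s < 0; each s traces one (R, D) point
    """
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])   # reconstruction pmf
    for _ in range(n_iter):
        # conditional pmf q(xhat|x) proportional to q(xhat) * exp(s * d(x, xhat))
        w = q[None, :] * np.exp(slope * dist)
        w /= w.sum(axis=1, keepdims=True)
        q = p_x @ w                                    # update output marginal
    D = np.sum(p_x[:, None] * w * dist)
    R = np.sum(p_x[:, None] * w * np.log2(w / q[None, :] + 1e-300))
    return R, D

# Example: Bern(0.1) with Hamming distortion; the result should lie on
# R(D) = h(0.1) - h(D).
p = np.array([0.9, 0.1])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
print(blahut_arimoto_rd(p, d, slope=-4.0))
```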

1.2 Organization of this dissertation

Chapter 2 introduces the definitions and notation used widely in this dissertation. In Chapters 3 and 4 we propose two new schemes for universal lossy compression of discrete stationary ergodic sources. Both algorithms approximate the solution of a fixed-slope universal lossy compression scheme that has exponential (in the block size) computational complexity. The first one achieves this goal by using simulated annealing, a well-known method for discrete-valued optimization. It is an iterative approach that, in the limit of an infinite number of iterations, achieves the global optimum. Of course, in practice the algorithm is used for a finite number of iterations. We show through our simulation results that, for a number of iterations equal to a small factor (≈ 10 to 15) times the block size, we can get close to the rate-distortion curve for binary memoryless and first-order binary Markov sources. Our second algorithm takes a completely different path: it first forms a linear approximation of the cost used in the original scheme, and then applies the Viterbi algorithm. Using the second algorithm requires computing some parameters. We derive a concave minimization problem whose solution gives the desired parameters, but since solving a non-convex optimization is hard, we present some heuristic methods for estimating them. We show the effectiveness of our approach through simulations.

Chapter 5 investigates the problem of lossy compression with side information, known as Wyner-Ziv (WZ) coding [21]. We propose an alternative view of the problem, as denoising a sequence corrupted through a memoryless channel with the help of a fidelity boosting sequence described through a bit-pipe of rate R. This alternative viewpoint leads us to a new algorithm for universal WZ coding which inspires a new paradigm for implementable WZ coding. In addition to the theoretical justification, we provide several simulation results that demonstrate the effectiveness of the new approach.

Chapter 6 is on the multiple-description (MD) coding problem. The notion of stationary MD coding is introduced, which instead of two rates and three distortions comprises three rates and three distortions. It is shown how ideas similar to those used in Chapter 3 are applicable to the multi-user source coding problem of multiple descriptions. We are able to devise an implementable iterative algorithm for universal MD coding which, as the block length and the number of iterations grow, achieves the stationary MD coding rate-distortion function of any stationary ergodic source. Finally, Chapter 7 presents some open problems for future work.

Most of the work presented in this dissertation is published in the following conference and journal papers: [22–29].

Chapter 2

Definitions and notation

Let X = {X_i; i ∈ N^+} be a stochastic process defined on a probability space (X, Σ, µ), where Σ denotes the σ-algebra generated by the cylinder sets C, and µ is a probability measure defined on it. For a process X, let X denote the alphabet of X_i, which is assumed to be finite. The shift operator T : X^∞ → X^∞ is defined by
\[
(Tx)_n = x_{n+1}, \qquad x \in \mathcal{X}^\infty,\ n \ge 1.
\]
Moreover, for a stationary process X, let H̄(X) denote its entropy rate, defined as
\[
\bar{H}(\mathbf{X}) = \lim_{n \to \infty} H(X_{n+1}|X^n).
\]

Let X and X̂ denote the source and reconstruction alphabets respectively. For y^n ∈ Y^n, define the matrix m(y^n) ∈ R^{|Y|} × R^{|Y|^k} to be the (k + 1)-th order empirical count of y^n, i.e., its (β, b)-th element is defined as
\[
m_{\beta,\mathbf{b}}(y^n) = \frac{1}{n-k}\,\bigl|\{\,k+1 \le i \le n : y_{i-k}^{i-1} = \mathbf{b},\ y_i = \beta \,\}\bigr|, \tag{2.1}
\]
where b ∈ Y^k and β ∈ Y. Let H_k(y^n) denote the conditional empirical entropy of order k induced by y^n, i.e.,
\[
H_k(y^n) = H(Y_{k+1}|Y^k), \tag{2.2}
\]
where Y^{k+1} on the right-hand side of (2.2) is distributed according to
\[
\mathrm{P}(Y^{k+1} = [\mathbf{b}\,\beta]) = m_{\beta,\mathbf{b}}(y^n). \tag{2.3}
\]

For a vector v = (v_1, . . . , v_ℓ)^T with non-negative components, we let H(v) denote the entropy of the random variable whose probability mass function (pmf) is proportional to v. Formally,
\[
H(v) =
\begin{cases}
\displaystyle\sum_{i=1}^{\ell} \frac{v_i}{\|v\|_1} \log \frac{\|v\|_1}{v_i} & \text{if } v \ne (0, \ldots, 0)^T \\[1ex]
0 & \text{if } v = (0, \ldots, 0)^T,
\end{cases} \tag{2.4}
\]
where 0 log(0) = 0 by convention. The conditional empirical entropy in (2.2) can be expressed as a function of m(y^n) as follows:
\[
H_k(y^n) = \frac{1}{n} \sum_{\mathbf{b}} H\bigl(m_{\cdot,\mathbf{b}}(y^n)\bigr)\, \mathbf{1}^T m_{\cdot,\mathbf{b}}(y^n), \tag{2.5}
\]
where 1 and m_{·,b}(y^n) denote the all-ones column vector of length |Y| and the column of m(y^n) corresponding to b, respectively.
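As a concrete rendering of (2.1)-(2.3), the following Python sketch (not from the original text; the helper names are mine) builds the empirical count matrix as a nested dictionary and evaluates H_k(y^n) directly from the definition (2.2), i.e., as the conditional entropy of the empirical (k + 1)-st order distribution.

```python
import numpy as np
from collections import defaultdict

def empirical_counts(y, k):
    """Counts behind m(y^n) in (2.1): counts[context][symbol] over positions
    i = k+1, ..., n (0-based: i = k, ..., n-1), with context = y[i-k:i]."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(y)):
        counts[tuple(y[i - k:i])][y[i]] += 1
    return counts

def conditional_empirical_entropy(y, k):
    """H_k(y^n) of (2.2): H(Y_{k+1} | Y^k) under the empirical distribution (2.3)."""
    counts = empirical_counts(y, k)
    total = sum(sum(col.values()) for col in counts.values())  # = n - k
    h = 0.0
    for col in counts.values():
        c = np.array(list(col.values()), dtype=float)
        p = c / c.sum()                       # normalized column, cf. (2.4)
        h += (c.sum() / total) * float(-(p * np.log2(p)).sum())
    return h

# Example: a perfectly periodic sequence has zero first-order conditional entropy.
print(conditional_empirical_entropy([0, 1] * 50, k=1))   # 0.0 bits
```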

Let m(w^n|y^n, z^n) denote the conditional k-th order empirical distribution of w^n given y^n and z^n, whose (β, b_0, b_1, b_2)-th element is defined as
\[
m_{\beta,\mathbf{b}_0,\mathbf{b}_1,\mathbf{b}_2} = \frac{1}{n - k_1 - \max\{k, k_1\}}\,\bigl|\{\, i : w_i = \beta,\ w_{i-k}^{i-1} = \mathbf{b}_0,\ y_{i-k_1}^{i+k_1} = \mathbf{b}_1,\ z_{i-k_1}^{i+k_1} = \mathbf{b}_2 \,\}\bigr|, \tag{2.6}
\]
where β ∈ W, b_0 ∈ W^k, b_1 ∈ Y^{2k_1+1}, and b_2 ∈ Z^{2k_1+1}. Now define the conditional empirical entropy of w^n given y^n and z^n, H_{k,k_1}(w^n|y^n, z^n), in terms of m(w^n|y^n, z^n) as
\[
H_{k,k_1}(w^n|y^n, z^n) = \sum_{\mathbf{b}_0,\mathbf{b}_1,\mathbf{b}_2} \mathbf{1}^T m_{\cdot,\mathbf{b}_0,\mathbf{b}_1,\mathbf{b}_2}\, H\bigl(m_{\cdot,\mathbf{b}_0,\mathbf{b}_1,\mathbf{b}_2}\bigr). \tag{2.7}
\]

For vectors u and v, both in R^n, let ‖u − v‖_1 denote the ℓ_1 distance between u and v, defined as
\[
\|u - v\|_1 = \sum_{i=1}^{n} |u_i - v_i|. \tag{2.8}
\]
The total variation distance between the two vectors is defined as
\[
\|u - v\|_{\mathrm{TV}} = \frac{1}{2}\|u - v\|_1. \tag{2.9}
\]

In Chapter 5, let X, X̂, and Z denote the source, reconstructed signal, and channel output alphabets respectively, and for simplicity restrict attention to
\[
\mathcal{X} = \hat{\mathcal{X}} = \mathcal{Z} = \{\alpha_1, \ldots, \alpha_N\}, \tag{2.10}
\]
though our derivations and results carry over directly to nonidentical finite sets X, X̂, and Z. The discrete memoryless channel is described by its transition matrix Π, where π_{i,j} denotes the probability of getting α_j at the output of the channel when the input is α_i. For α_i and α_j in {α_1, . . . , α_N}, let π_{α_i} and d_{α_j} denote the columns of Π and of the loss matrix Λ corresponding to α_i and α_j respectively, i.e.,
\[
\Pi = [\pi_{\alpha_1} | \ldots | \pi_{\alpha_N}], \qquad \Lambda = [d_{\alpha_1} | \ldots | d_{\alpha_N}].
\]
For N-dimensional vectors u and v, u ⊙ v denotes the N-dimensional vector that results from componentwise multiplication of u and v, i.e.,
\[
(u \odot v)[i] = u_i v_i. \tag{2.11}
\]
As in (2.11), we denote the i-th component of a vector either by a subindex or, when that could lead to confusion, in square brackets.

Chapter 3

Rate-distortion via Markov chain Monte Carlo

In this chapter, a new approach to implementable lossy source coding is presented. It borrows two well-known tools from statistical physics and computer science, namely Markov chain Monte Carlo (MCMC) and simulated annealing [30], [31]. MCMC methods refer to a class of algorithms designed to generate samples of a given distribution by constructing a Markov chain having the desired distribution as its stationary distribution. MCMC methods include a large class of algorithms; for our application, we use the Gibbs sampler [32], also known as the heat bath algorithm, which is well suited to the case where the desired distribution is hard to compute, but the conditional distributions of each variable given the rest are easy to work out.

The second required tool is simulated annealing, a well-known optimization method. Its goal is to find the minimum value f_min ≜ min_{s∈S} f(s), along with a minimizing state s_min, over a possibly huge set of states S. In order to do simulated annealing, a sequence of probability distributions p_1, p_2, . . ., corresponding to temperatures T_1 > T_2 > . . . with T_i → 0 as i → ∞, and a sequence of positive integers N_1, N_2, . . . are considered. For the first N_1 steps, the algorithm runs one of the relevant MCMC methods in an attempt to sample from the distribution p_1. Then, for the next N_2 steps, the algorithm, using the output of the previous part as the initial point, aims to sample from p_2, and so on. The probability distributions are designed such that: 1) their output, with high probability, is the minimizing state s_min, or one of the states close to it; 2) the probability of getting the minimizing state increases as the temperature drops. The probability distribution that satisfies these characteristics, and is often used, is the Boltzmann distribution p_β(s) ∝ e^{−βf(s)}, where β ∝ 1/T. It can be proved that, using the Boltzmann distribution, if the temperature drops slowly enough, the probability of ultimately getting the minimizing state as the output of the algorithm approaches one [32].

Simulated annealing has been suggested before in the context of lossy compression, either as a way of approximating the rate-distortion function (i.e., the optimization problem involving minimization of the mutual information) or as a method for designing the codebook in vector quantization [33], [34], as an alternative to the conventional generalized Lloyd algorithm (GLA) [35]. In contrast, in this thesis we use the simulated annealing approach to obtain a particular reconstruction sequence, rather than a whole codebook.

A brief description of how the new algorithm codes a sequence x^n is as follows. First, to each reconstruction block y^n it assigns an energy, E(y^n), which is a linear combination of its conditional empirical entropy, to be defined formally in the next section, and its distance from the source sequence x^n. Then, it assumes a Boltzmann probability distribution over the reconstruction blocks as p(y^n) ∝ e^{−βE(y^n)}, for some β > 0, and tries to generate x̂^n from this distribution using Gibbs sampling [32]. As we will show, for β large enough, with high probability the reconstruction block of our algorithm satisfies E(x̂^n) ≈ min_{y^n} E(y^n). The encoder then outputs LZ(x̂^n), the Lempel-Ziv [7] description of x̂^n. The decoder, upon receiving LZ(x̂^n), reconstructs x̂^n perfectly.

Instead of working at a fixed rate or at a fixed distortion, we fix the slope. The idea of using fixed-slope algorithms for universal lossy compression has been used before in the literature [36]. A fixed-slope rate-distortion scheme, for a fixed slope s = −α < 0, looks for the coding scheme that minimizes R + αD, where, as usual, R and D denote the rate and the average expected distortion respectively. In comparison to a given coding scheme of rate R and expected distortion D, for any 0 < δ < R − R(X, D) there exists a code which works at rate R(X, D) + δ and has the same average expected distortion, and consequently a lower cost. Therefore, it follows that any point that is optimal in the fixed-slope setup corresponds to a point on the rate-distortion curve.

3.1 An exhaustive search scheme for fixed-slope compression

Consider the following scheme for lossy source coding at fixed slope α > 0. For each source sequence x^n, let the reconstruction block x̂^n be
\[
\hat{x}^n = \arg\min_{y^n} \left[ H_k(y^n) + \alpha d_n(x^n, y^n) \right]. \tag{3.1}
\]
The encoder, after computing x̂^n, losslessly conveys it to the decoder using LZ compression.

Theorem 1. Let X be a stationary ergodic source, let R(X, D) denote its rate-distortion function, and let X̂^n denote the reconstruction obtained by applying the above scheme to X^n. Then
\[
\mathrm{E}\left[ \frac{1}{n}\ell_{LZ}(\hat{X}^n) + \alpha d_n(X^n, \hat{X}^n) \right] \xrightarrow{n \to \infty} \min_{D \ge 0}\,[R(\mathbf{X}, D) + \alpha D]. \tag{3.2}
\]

The proof of Theorem 1 is given in Appendix A. In words, the above scheme universally attains the optimum rate-distortion performance at slope α for any stationary ergodic process. The drawback of the described algorithm is its computational complexity: it involves an exhaustive search over the set of all possible reconstructions, whose size is |X̂|^n and hence grows exponentially with n.
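For very small block lengths the minimization in (3.1) can be carried out literally. The toy sketch below is not from the original text: it performs the brute-force search for binary sequences under Hamming distortion, reusing the conditional_empirical_entropy helper sketched in Chapter 2.

```python
from itertools import product

def exhaustive_fixed_slope(x, alpha, k, alphabet=(0, 1)):
    """Brute-force minimizer of H_k(y^n) + alpha * d_n(x^n, y^n) in (3.1).
    The search is over |alphabet|^n candidates, so it is feasible only for
    toy block lengths; it is shown purely as a reference point."""
    n = len(x)
    best_y, best_cost = None, float("inf")
    for y in product(alphabet, repeat=n):
        d = sum(int(a != b) for a, b in zip(x, y)) / n     # Hamming d_n
        cost = conditional_empirical_entropy(list(y), k) + alpha * d
        if cost < best_cost:
            best_y, best_cost = list(y), cost
    return best_y, best_cost

# Example: with a large slope the reconstruction stays close to the source.
print(exhaustive_fixed_slope([0, 0, 1, 0, 1, 1, 0, 0], alpha=4.0, k=1))
```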

Remark: The exhaustive search algorithm described above is similar to the generic algorithm proposed in [36] which works for any length function resulting from some universal lossless compression algorithm. Here we use the specific length function corresponding to conditional empirical entropy, and for completeness provide our own proof of the theorem which is tailored to this specific cost function.


3.2 Universal lossy coding via MCMC

In this section, we will show how simulated annealing Gibbs sampling enables us to get close to the performance of the impractical exhaustive search coding algorithm described in the previous section. Throughout this section we fix the slope α > 0. Associate with each reconstruction sequence y^n the energy
\[
E(y^n) \triangleq n\left[ H_k(y^n) + \alpha d_n(x^n, y^n) \right]
= \sum_{\mathbf{u} \in \hat{\mathcal{X}}^k} \mathbf{1}^T m_{\cdot,\mathbf{u}}(y^n) H\bigl(m_{\cdot,\mathbf{u}}(y^n)\bigr) + \alpha \sum_{i=1}^{n} d(x_i, y_i). \tag{3.3}
\]
The Boltzmann distribution can now be defined as the pmf on X̂^n given by
\[
p_\beta(y^n) = \frac{1}{Z_\beta} \exp\{-\beta E(y^n)\}, \tag{3.4}
\]
where Z_β is the normalization constant (partition function). Note that, though this dependence is suppressed in the notation for simplicity, E(y^n), and therefore also p_β and Z_β, depend on x^n and α, which are fixed until further notice. When β is large and Y^n ∼ p_β, then with high probability
\[
H_k(Y^n) + \alpha d_n(x^n, Y^n) \approx \min_{y^n} \left[ H_k(y^n) + \alpha d_n(x^n, y^n) \right]. \tag{3.5}
\]
Thus, for large β, using a sample from the Boltzmann distribution p_β as the reconstruction sequence would yield performance close to that of an exhaustive search scheme that would use the achiever of the minimum in (3.5). Unfortunately, it is hard to sample from the Boltzmann distribution directly. We can, however, get approximate samples via MCMC, as we describe next.

As mentioned earlier, the Gibbs sampler [32] is useful in cases where one is interested in sampling from a probability distribution which is hard to compute, but the conditional distribution of each variable given the rest of the variables is accessible. In our case, the conditional probability under p_β of Y_i given the other variables


Y^{n\setminus i} ≜ {Y_j : j ≠ i} can be expressed as
\[
p_\beta(Y_i = a | Y^{n\setminus i} = y^{n\setminus i}) = \frac{p_\beta(Y_i = a, Y^{n\setminus i} = y^{n\setminus i})}{\sum_{b \in \hat{\mathcal{X}}} p_\beta(Y_i = b, Y^{n\setminus i} = y^{n\setminus i})} \tag{3.6}
\]
\[
= \frac{\exp\{-\beta E(y^{i-1} a y_{i+1}^n)\}}{\sum_{b \in \hat{\mathcal{X}}} \exp\{-\beta E(y^{i-1} b y_{i+1}^n)\}} \tag{3.7}
\]
\[
= \frac{\exp\{-\beta n [H_k(y^{i-1} a y_{i+1}^n) + \alpha d_n(x^n, y^{i-1} a y_{i+1}^n)]\}}{\sum_{b \in \hat{\mathcal{X}}} \exp\{-\beta n [H_k(y^{i-1} b y_{i+1}^n) + \alpha d_n(x^n, y^{i-1} b y_{i+1}^n)]\}} \tag{3.8}
\]
\[
= \frac{1}{\sum_{b \in \hat{\mathcal{X}}} \exp\{-\beta [\, n\Delta H_k(y^{i-1} b y_{i+1}^n, a) + \alpha \Delta d(b, a, x_i) \,]\}}, \tag{3.9}
\]
where ΔH_k(y^{i-1} b y_{i+1}^n, a) and Δd(b, a, x_i) are defined as
\[
\Delta H_k(y^{i-1} b y_{i+1}^n, a) \triangleq H_k(y^{i-1} b y_{i+1}^n) - H_k(y^{i-1} a y_{i+1}^n), \tag{3.10}
\]
and Δd(b, a, x_i) ≜ d(b, x_i) − d(a, x_i), respectively.

n n n Hk (y i−1 ayi+1 )}a,b∈Xˆ and {d(xi , a)}a∈Xˆ . In turn, {Hk (y i−1byi+1 ) − Hk (y i−1 ayi+1 )}a,b

n depends on y n only through {m(y i−1 byi+1 )}b .

n Note that, given m(y n ), the number of operations required to obtain m(y i−1 byi+1 ), for any b ∈ Xˆ is linear in k, since the number of contexts whose counts are affected

by the change of one component of y n is at most 2k + 2. To be more specific, letting Si (y n , b) denote the set of contexts whose counts are affected when the ith component

3.2. UNIVERSAL LOSSY CODING VIA MCMC

13

of y n is flipped from yi to b, we have |Si (y n , b)| ≤ 2k + 2. Further, since n n n[Hk (y i−1byi+1 ) − Hk (y i−1ayi+1 )] =



X

u∈Si (y i−1 byi+1 ,a)

  n n n n 1T m·,u (y i−1byi+1 )H m.u (y i−1 byi+1 ) − 1T m·,u (y i−1ayi+1 )H m·,u (y i−1 ayi+1 ) ,

(3.11)

n n it follows that, given m(y i−1 byi+1 ) and Hk (y i−1 byi+1 ), the number of operations ren n quired to compute m(y i−1 ayi+1 ) and Hk (y i−1ayi+1 ) is linear in k (and independent of
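The bookkeeping behind this observation can be made explicit. The sketch below is not from the original text: it simply lists the at most 2k + 2 (context, symbol) cells of the count matrix that move when a single component of y^n is changed, which is exactly what an efficient implementation of the Gibbs step would update incrementally.

```python
def count_updates_on_flip(y, k, i, b):
    """Return the (context, symbol, +1/-1) updates to the counts of Chapter 2
    when y[i] is changed to b; at most 2k + 2 cells are touched."""
    if y[i] == b:
        return []
    y_new = list(y)
    y_new[i] = b
    updates = []
    # only positions j with i inside their context window, or j == i, change
    for j in range(max(i, k), min(i + k, len(y) - 1) + 1):
        updates.append((tuple(y[j - k:j]), y[j], -1))           # old cell
        updates.append((tuple(y_new[j - k:j]), y_new[j], +1))   # new cell
    return updates
```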

Now consider the following algorithm (Algorithm 1 below) based on Gibbs sampling from p_β, and let X̂^n_{α,r}(X^n) denote its (random) outcome when taking k = k_n and β = {β_t}_t to be deterministic sequences satisfying k_n = o(log n) and β_t = (1/T_0^{(n)}) log(⌊t/n⌋ + 1), for some T_0^{(n)} > nΔ, where
\[
\Delta = \max_i \max_{\substack{u^{i-1} \in \hat{\mathcal{X}}^{i-1},\ u_{i+1}^n \in \hat{\mathcal{X}}^{n-i},\\ a, b \in \hat{\mathcal{X}}}} \bigl| E(u^{i-1} a u_{i+1}^n) - E(u^{i-1} b u_{i+1}^n) \bigr|, \tag{3.12}
\]
applied to the source sequence X^n as input.¹ By the previous discussion, the computational complexity of the algorithm at each iteration is independent of n and linear in k.

Theorem 2. Let X be a stationary ergodic source. Then
\[
\lim_{n \to \infty} \lim_{r \to \infty} \mathrm{E}\left[ \frac{1}{n}\ell_{LZ}\bigl(\hat{X}^n_{\alpha,r}\bigr) + \alpha d_n\bigl(X^n, \hat{X}^n_{\alpha,r}\bigr) \right] = \min_{D \ge 0}\,[R(\mathbf{X}, D) + \alpha D]. \tag{3.13}
\]
Proof. The proof is presented in Appendix B.

¹ Here and throughout it is implicit that the randomness used in the algorithms is independent of the source, and the randomization variables used at each drawing are independent of each other.

Algorithm 1 Generating the reconstruction sequence
Require: x^n, k, α, {β_t}_t, r
Ensure: a reconstruction sequence x̂^n
 1: y^n ← x^n
 2: for t = 1 to r do
 3:   Draw an integer i ∈ {1, . . . , n} uniformly at random
 4:   For each y ∈ X̂ compute q(y) = p_{β_t}(Y_i = y | Y^{n\setminus i} = y^{n\setminus i}) given in (3.9)
 5:   Update y^n by replacing its i-th component y_i by Z, where Z ∼ q
 6:   Update m(y^n) and H_k(y^n)
 7: end for
 8: x̂^n ← y^n
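A direct, unoptimized Python rendering of Algorithm 1 is sketched below; it is not from the original text. For readability it recomputes H_k from scratch for every candidate symbol via the Chapter 2 helper, rather than applying the incremental count-matrix update of (3.11), and it uses Hamming distortion; the constants in the example schedule are illustrative only.

```python
import numpy as np

def mcmc_lossy_encode(x, k, alpha, betas, r, alphabet=(0, 1), seed=0):
    """Sketch of Algorithm 1: annealed Gibbs sampling over reconstructions."""
    rng = np.random.default_rng(seed)
    y = list(x)
    n = len(x)
    for t in range(r):
        i = int(rng.integers(n))
        # energy n*(H_k + alpha*d_n) for each candidate value of y_i, cf. (3.3)
        energies = []
        for a in alphabet:
            y[i] = a
            d = sum(int(u != v) for u, v in zip(x, y)) / n
            energies.append(n * (conditional_empirical_entropy(y, k) + alpha * d))
        e = np.array(energies)
        p = np.exp(-betas(t) * (e - e.min()))   # Boltzmann weights, cf. (3.9)
        p /= p.sum()
        y[i] = alphabet[rng.choice(len(alphabet), p=p)]
    return y

# Illustrative annealing schedule and call (parameters are hypothetical):
# x_hat = mcmc_lossy_encode(x, k=4, alpha=4.0,
#                           betas=lambda t: 0.5 * np.log(t + 2), r=10 * len(x))
```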

3.3 Sliding-window rate-distortion coding via MCMC

The classical approach to lossy source coding is block coding, initiated by Shannon [2]. In this method, each possible source block of length n is mapped into a reconstruction block of the same length. One of the disadvantages of this method is that applying a block code to a stationary process converts it into a non-stationary reconstruction process. Another approach to the rate-distortion coding problem is sliding-block (SB), or stationary, coding, introduced by R.M. Gray, D.L. Neuhoff, and D.S. Ornstein in [37] and independently by K. Marton in [38], both in 1975. In this method, a fixed SB map of a certain order 2k_f + 1 slides over the source sequence and generates a reconstruction sequence that has lower entropy rate than the original process. The advantage of this method with respect to the block coding technique is that, while the achievable rate-distortion regions of the two methods provably coincide, the stationarity of the source is preserved by a SB code [39]. Although SB codes seem to be a good alternative to block codes, there has been very little progress in constructing good such codes since their introduction in 1975, and to date there is no known practical method for finding good SB codes. In this section we show how our MCMC-based approach can be applied to finding good entropy-constrained SB codes of a certain order 2k_f + 1.

Remark: There is a slight difference between SB codes as proposed in [37] and our entropy-constrained SB codes. In [37], it is assumed that after the encoder converts the source process into the coded process, it can be sent directly, with no further encoding, to the decoder via a channel that has a capacity of R bits per transmission. The decoder then, using another SB code, converts the coded process into the reconstruction process. In our setup, on the other hand, the encoder directly converts the source process into the reconstruction process, which has lower entropy, and then employs a universal lossless coder to describe the coded sequence to the decoder. The decoder then applies the universal lossless decoder that corresponds to the lossless encoder used at the encoder to retrieve the reconstruction sequence.

A SB code of window length 2k_f + 1 is a function f : X^{2k_f+1} → X̂ which is applied to the source process {X_n} to construct the reconstruction process as
\[
\hat{X}_i = f\bigl(X_{i-k_f}^{i+k_f}\bigr). \tag{3.14}
\]
The total number of (2k_f + 1)-tuples taking values in X is K_f = |X|^{2k_f+1}. Therefore, for specifying a SB code of window length 2k_f + 1, there are K_f values to be determined, and f can be represented as a vector f^{K_f} = [f_0, f_1, . . . , f_{K_f−1}], where f_i ∈ X̂ is the output of the function f for the input vector b equal to the expansion of i in 2k_f + 1 symbols modulo |X|, i.e., i = Σ_{j=0}^{2k_f} b_j |X|^j.
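The indexing convention just described is easy to render in code. The sketch below is not from the original text (the helper names are mine): it maps a window to its index and applies a SB code given as the vector f^{K_f}, treating the input cyclically as is done later in the text.

```python
def window_index(window, alphabet_size):
    """i = sum_j b_j |X|^j for a (2k_f+1)-symbol window, least significant first."""
    return sum(b * alphabet_size ** j for j, b in enumerate(window))

def apply_sb_code(f, x, kf, alphabet_size=2):
    """Apply the sliding-block code given by the length-|X|^(2k_f+1) vector f
    to x, with cyclic extension at the boundaries."""
    n = len(x)
    y = []
    for i in range(n):
        window = [x[(i + j) % n] for j in range(-kf, kf + 1)]
        y.append(f[window_index(window, alphabet_size)])
    return y

# Example: k_f = 1, binary alphabet, so f has 2^3 = 8 entries (majority rule).
f_majority = [0, 0, 0, 1, 0, 1, 1, 1]
print(apply_sb_code(f_majority, [0, 1, 1, 1, 0, 0, 1, 0], kf=1))
```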

For coding a source output sequence x^n by a SB code of order 2k_f + 1, among the |X̂|^{|X|^{2k_f+1}} possible choices, similar to the exhaustive search algorithm described in Section 3.1, here we look for the one that minimizes the energy function assigned to each possible SB code as
\[
E(f^{K_f}) \triangleq n\left[ H_k(y^n) + \alpha d_n(x^n, y^n) \right], \tag{3.15}
\]
where y^n = y^n[x^n, f^{K_f}] is defined by y_i = f(x_{i−k_f}^{i+k_f}). Like before, we consider a cyclic rotation, i.e., x_i = x_{i+n} for any i ∈ N. Again, we resort to the simulated annealing

Gibbs sampling method in order to find the minimizer of (3.15). Unlike in (3.4), instead of the space of possible reconstruction blocks, here we define the Boltzmann distribution over the space of possible SB codes. Each SB code is represented by a unique vector f^{K_f}, and p_β(f^{K_f}) ∝ exp(−βE(y^n)), where y^n = y^n[x^n, f^{K_f}]. The conditional probabilities required at each step of the Gibbs sampler can be written as
\[
p_\beta(f_i = \theta | f^{K_f \setminus i}) = \frac{p_\beta(f^{i-1} \theta f_{i+1}^{K_f})}{\sum_{\vartheta} p_\beta(f^{i-1} \vartheta f_{i+1}^{K_f})} \tag{3.16}
\]
\[
= \frac{1}{\sum_{\vartheta} \exp\bigl(-\beta\bigl(E(f^{i-1} \vartheta f_{i+1}^{K_f}) - E(f^{i-1} \theta f_{i+1}^{K_f})\bigr)\bigr)}. \tag{3.17}
\]

Therefore, for computing the conditional probabilities we need to find out by how much changing one entry of f^{K_f} affects the energy function. Compared to the previous section, finding this difference is more involved and should be handled with more care. To achieve this goal, we first categorize the different positions in x^n into |X|^{2k_f+1} different types and construct the vector s^n such that the label of x_i, α_i, is defined to be
\[
\alpha_i \triangleq \sum_{j=-k_f}^{k_f} x_{i+j}\, |\mathcal{X}|^{k_f + j}. \tag{3.18}
\]
In other words, the label of each position is defined to be the symmetric context of length 2k_f + 1 embracing it, i.e., x_{i−k_f}^{i+k_f}. Using this definition, applying a SB code f^{K_f} to a sequence x^n can alternatively be expressed as constructing a sequence y^n where
\[
y_i = f_{\alpha_i}. \tag{3.19}
\]

From this representation, changing f_i from θ to ϑ while leaving the other elements of f^{K_f} unchanged only affects the positions of the y^n sequence that correspond to the label i in the s^n sequence, and we can write the difference between the energy functions appearing in (3.17) as
\[
E(f^{i-1} \vartheta f_{i+1}^{K_f}) - E(f^{i-1} \theta f_{i+1}^{K_f}) = n\left[ H_k(m(y^n)) - H_k(m(\hat{y}^n)) \right] + \alpha \sum_{j:\, \alpha_j = i} \bigl( d(x_j, \vartheta) - d(x_j, \theta) \bigr), \tag{3.20}
\]
where y^n and ŷ^n represent the results of applying f^{i-1} ϑ f_{i+1}^{K_f} and f^{i-1} θ f_{i+1}^{K_f} to x^n respectively, and, as noted before, the two vectors differ only at the positions {j : α_j = i}. Flipping each position in the y^n sequence in turn affects at most 2(k + 1) columns

of the count matrix m(y^n). Here, at each pass of the Gibbs sampler, a number of positions in the y^n sequence are flipped simultaneously. Algorithm 2 describes how we can keep track of all these changes and update the count matrix. Then, in analogy to Algorithm 1, Algorithm 3 runs the Gibbs sampling method to find the best SB code of order 2k_f + 1, employing Algorithm 2 at each iteration.

Let f^{K_f^{(n)}}_{β,α,r} denote the output of Algorithm 3 for the input vector x^n at slope α after r iterations, with annealing process β; K_f^{(n)} = 2^{2k_f^{(n)}+1} denotes the length of the vector f representing the SB code. The following theorem, proved in Appendix C, states that Algorithm 3 is asymptotically optimal for any stationary ergodic source; i.e., coding a source sequence by applying the SB code f^{K_f^{(n)}}_{β,α,r} to it, and then describing the output to the decoder using the Lempel-Ziv algorithm, asymptotically achieves the rate-distortion curve as the number of iterations and the window length k_f grow to infinity.

Algorithm 2 Updating the count matrix of y^n = f(x^n), when f_i changes from θ to ϑ
Require: x^n, k_f, k, m(y^n), i, ϑ, θ
Ensure: m(ŷ^n)
 1: a^n ← 0
 2: ŷ^n ← y^n
 3: for j = 1 to n do
 4:   if α_j = i then
 5:     ŷ_j ← θ
 6:   end if
 7: end for
 8: m(ŷ^n) ← m(y^n)
 9: for j = k_f + 1 to n − k_f do
10:   if α_j = i then
11:     a_j^{j+k} ← 1
12:   end if
13: end for
14: for j = k + 1 to n − k do
15:   if a_j = 1 then
16:     m_{y_j, y_{j−k}^{j−1}} ← m_{y_j, y_{j−k}^{j−1}} − 1
17:     m_{ŷ_j, ŷ_{j−k}^{j−1}} ← m_{ŷ_j, ŷ_{j−k}^{j−1}} + 1
18:   end if
19: end for

Theorem 3. Let {k_f^{(n)}} be a sequence such that k_f^{(n)} → ∞, let the annealing schedule be β_t^{(n)} = (1/T_0^{(n)}) log(⌊t/K_f^{(n)}⌋ + 1) for some T_0^{(n)} > K_f^{(n)}Δ, where
\[
\Delta = \max_i \max_{\substack{f^{i-1} \in \hat{\mathcal{X}}^{i-1},\ f_{i+1}^{K_f} \in \hat{\mathcal{X}}^{K_f - i},\\ \vartheta, \theta \in \hat{\mathcal{X}}}} \bigl| E(f^{i-1} \vartheta f_{i+1}^{K_f}) - E(f^{i-1} \theta f_{i+1}^{K_f}) \bigr|, \tag{3.21}
\]
and let k = o(log n). Then, for any stationary ergodic source X, we have
\[
\lim_{n \to \infty} \lim_{r \to \infty} \mathrm{E}\left[ \frac{1}{n}\ell_{LZ}\bigl(\hat{X}^n\bigr) + \alpha d_n\bigl(X^n, \hat{X}^n\bigr) \right] = \min_{D \ge 0}\,[R(\mathbf{X}, D) + \alpha D], \tag{3.22}
\]

where X̂^n is the result of applying the SB code f^{K_f}_{β,α,r} to X^n.

Proof. The proof is presented in Appendix C.

Note that in Algorithm 3, for a fixed k_f, the SB code is a vector of length K_f = |X|^{2k_f+1}. Hence, the size of the search space is |X̂|^{K_f}, which is independent of n. Moreover, the transition probabilities of the SA as defined by (3.17) depend on differences of the form presented in (3.20), which, for a stationary ergodic source and fixed k_f, if n is large enough, scale linearly with n. That is, for a given f^{i−1}, f_{i+1}^{K_f}, ϑ, and θ,
\[
\lim_{n \to \infty} \frac{1}{n}\left[ E(f^{i-1} \vartheta f_{i+1}^{K_f}) - E(f^{i-1} \theta f_{i+1}^{K_f}) \right] = q \quad \text{a.s.}, \tag{3.23}
\]
where q ∈ [0, 1] is some fixed value depending only on the source distribution. This is an immediate consequence of the ergodicity of the source plus the fact that SB coding of a stationary ergodic process results in another process which is jointly stationary with the initial process and is also ergodic. On the other hand, similar reasoning shows that Δ defined in (3.21) scales linearly with n. Therefore, combining these two observations, for large values of n and fixed k_f, the transition probabilities of the nonhomogeneous Markov chain defined by the SA algorithm incorporated in Algorithm 3 are independent of n. This does not mean that the convergence rate of the algorithm is independent of n, because achieving the rate-distortion function requires increasing k_f and n simultaneously to infinity.

Algorithm 3 Universal SB lossy coder based on simulated annealing Gibbs sampler
Require: x^n, k_f, k, α, β, r
Ensure: f^{K_f}
 1: for t = 1 to r do
 2:   Draw an integer i ∈ {1, . . . , K_f} uniformly at random
 3:   For each θ ∈ X̂ compute p_{β_t}(f_i = θ | f^{K_f \setminus i}) using Algorithm 2 and equations (3.17) and (3.20)
 4:   Update f^{K_f} by replacing its i-th component f_i by θ drawn from the pmf computed in the previous step
 5: end for

Simulation results

We dedicate this section to the presentation of some initial experimental results obtained by applying the schemes presented in the previous sections on simulated and real data. The Sub-section 3.4.1 demonstrates the performance of Algorithm 1 on simulated 1-D and real 2-D data. Some results on the application of Algorithm 3 on simulated 1-D data is shown in Sub-section 3.4.2.

20

CHAPTER 3. RATE-DISTORTION VIA MCMC

3.4.1

Block coding

In this sub-section, some of the simulation results obtained from applying Algorithm 1 of Section 3.2 to real and simulated data are presented. The algorithm is easy to apply, as is, to both 1-D and 2-D data. As a first example, consider a Bernoulli(p) i.i.d source with p = 0.1. Fig. 1 compares the performance of Algorithm 1 against the optimal rate-distortion tradeoff given by R(D) = h(p) − h(D), where h(θ) =

−θ log(θ) − (1 − θ) log(1 − θ), for a source sequence of length n = 5 × 104 . Here, in order to get different points, α has been linearly decreased from α = 5 to α = 3. To

illustrate the encoding process, Fig. 2 depicts the evolutions of Hk (ˆ xn ), dn (xn , xˆn ), and E(ˆ xn ) = Hk (ˆ xn ) + αdn (xn , x ˆn ) during the encoding process of a source block of length n = 5 × 104 at α = 2.5, k = 10, and βt = 1 + t. As another example, Fig. 3.4 compares the performance of Algorithm 1 when applied to a binary symmetric Markov source (BSMS) with transition probability q = 0.2 against the Shannon lower bound (SLB) which sates that for a BSMS R(D) ≥ RSLB (D) , h(p) − h(D).

(3.24)

There is no known explicit characterization of the rate-distortion tradeoff for a BSMS except for a low distortion region. It has been proven that for D < Dc , where for p < 0.5 1 Dc = 2



1−

r

 p 2 ) , 1−( 1−p

(3.25)

the SLB holds with equality, and for D > Dc , we have strict inequality, i.e., R(D) > RSLB [40]. In our case Dc = 0.0159. For distortions beyond Dc , a lower bound and an upper bound on the rate-distortion function, derived based on the results presented in Appendix I, are also shown for comparison. The parameters here are: n = 5 × 104 ,

k = 8 and βt = nαt1.5 /3.

Finally, consider applying the algorithm to a n × n binary image, where n = 252.

Fig. 3.5(a) shows the original image, while Fig. 3.5(b) shows the coded version after r = 50n2 iterations. The parameters are: α = 0.1, and βt = 0.1 log t. The empirical

conditional entropy of the image has decreased from Hk = 0.1025 to Hk = 0.0600 in

21

3.4. SIMULATION RESULTS

the reconstruction image, while an average distortion of D = 0.0337 per pixel is introduced. Fig. 3.1 shows the pixels that form the 2-D ‘history’ (causal neighborhood) of each pixel in our experiments, i.e., the pixels whose different configurations form the columns of the count vector m(y m×n ).

111 000 000 111 000 111 000 000111 111 000111 111 000 000 111 000 111 000 111 000 111 000 111 000 000111 111 000 111 000111 111 000 000111 111 000 Figure 3.1: Each pixels 6th order 2-D context

In Fig. 3.1, the black square depicts the location of the current pixel, and the squares with patterns denote its 6th order context. Comparing the required space for storing the original image as a PNG file with the amount required for the coded image reveals that in fact the algorithm has not only reduced the conditional empirical entropy of the image by 41.5%, but also has cut the size of the file by around 39%.

3.4.2

Sliding-block coding

Consider applying Algorithm 3 of Section 3.3 to the output of a BSMS with q = 0.2. Fig. 3.6 shows the algorithm output along with Shannon lower bound and lower/upper bounds on R(D) from Appendix I. Here the parameters are: n = 5 × 104 , k = 8, SB

window length of kf = 11 and βt = Kf |α| log(t + 1).

In all of the presented simulation results, it is the empirical conditional entropy of the final reconstruction block that we are comparing to the rate-distortion curve. It should be noted that, though this difference vanishes as the block size grows, for finite values of n there would be an extra (model) cost for losslessly describing the reconstruction block to the decoder.

22

CHAPTER 3. RATE-DISTORTION VIA MCMC

0.5 R(D) = h(p) − h(D) Algorithm 1 output: n = 5 × 104 , k = 10, α = 3 : 0.2 : 5

0.45

0.4

0.35

Hk (ˆ xn )

0.3

0.25

0.2

0.15

0.1

0.05

0

0

0.01

0.02

0.03

0.04

0.05 D

0.06

0.07

0.08

0.09

0.1

Figure 3.2: Comparing Algorithm 1 performance withqthe optimal rate-distortion tradeoff for a (Bern(p)) i.i.d source, p = 0.1 and βt = n ⌈ nt ⌉.

[Figure 3.3: Sample paths demonstrating evolutions of the empirical conditional entropy H_k(y^n), the average distortion d_n(x^n, y^n), and the energy function H_k(y^n) + αd_n(x^n, y^n) versus the iteration number t, when Algorithm 1 is applied to the output of a Bern(p) i.i.d. source with p = 0.1 and n = 5 × 10^4. The algorithm parameters are k = 10, α = 4, and β_t = 1.33n√⌈t/n⌉.]

[Figure 3.4: Comparing the algorithm rate-distortion performance with the Shannon lower bound for a BSMS with q = 0.2, β_t = nαt^{1.5}/3. Curves shown: R_SLB(D) = h(q) − h(D), the lower and upper bounds on R(D), and the Algorithm 1 output for n = 5 × 10^4, k = 8, slopes s = −5 : 0.1 : −4; the critical distortion D_c = 0.0159 is marked.]

[Figure 3.5: Applying Algorithm 1 to a 2-D binary image using the context shown in Fig. 3.1. (a) Original image with empirical conditional entropy 0.1025. (b) Reconstruction image with empirical conditional entropy 0.0600 and average distortion 0.0337 per pixel (r = 50n^2, α = 0.1, β_t = 0.1 log t).]

[Figure 3.6: Comparing the algorithm rate-distortion performance with the Shannon lower bound for a BSMS with q = 0.2. Curves shown: R(D) = h(q) − h(D), the lower and upper bounds on R(D), and the Algorithm 3 output. Algorithm parameters: n = 5 × 10^4, k = 8, k_f = 5 (K_f = 2^11), β_t = K_f α log(t + 1), and slope values α = 5.25, 5, 4.75, and 4.5.]

3.5 Application: optimal denoising via MCMC-based lossy coding

Consider the problem of denoising a stationary ergodic source X with unknown distribution, corrupted by additive white noise V. Compression-based denoising algorithms have been proposed before by a number of researchers, cf. [41], [42], [43] and references therein. The idea of using a universal lossy compressor for denoising was proposed in [42], and then refined in [43] into a universal denoising algorithm. In this section, we show how our new MCMC-based lossy encoder turns the denoising algorithm proposed in [43] into an implementable universal denoiser. In [43], it is shown how a universally optimal lossy coder, tuned to the right distortion measure and distortion level and combined with some simple "post-processing", results in a universally optimal denoiser. In what follows we first briefly review the compression-based denoiser of [43], and then show how our lossy coder can be embedded in it to perform the lossy compression part.

Throughout this section we assume that the source, noise, and reconstruction alphabets are M-ary, A = {0, 1, . . . , M − 1}, and that the noise is additive modulo-M with P_V(a) > 0 for every a ∈ A, i.e., Z_i = X_i + V_i (addition modulo M).

As mentioned earlier, in the denoising scheme outlined in [43], the denoiser first lossily compresses the noisy signal appropriately, which partly removes the additive noise. Consider a sequence of good lossy coders characterized by encoder/decoder pairs (E_n, D_n) of block length n working at distortion level H(V) under the difference distortion measure defined as
\[
d(x, y) = \log \frac{1}{P_V(x - y)}. \tag{3.26}
\]

By good, it is meant that for any stationary ergodic source X, as n grows, the rate-distortion performance of the sequence of codes converges to a point on the rate-distortion curve. The next step is a simple "post-processing" as follows. For a fixed m, define the following count vector over the noisy signal Z^n and its quantized version Y^n = D_n(E_n(Z^n)):
\[
\hat{Q}^{(2m+1)}_{[Z^n, Y^n]}(z^{2m+1}, y) \triangleq \frac{1}{n}\bigl|\{\, 1 \le i \le n : (Z_{i-m}^{i+m}, Y_i) = (z^{2m+1}, y) \,\}\bigr|. \tag{3.27}
\]

After constructing these count vectors, the denoiser output is generated through the "post-processing" or "derandomization" process
\[
\hat{X}_i = \arg\min_{\hat{x} \in \mathcal{A}} \sum_{y \in \mathcal{A}} \hat{Q}^{(2m+1)}_{[Z^n, Y^n]}\bigl(Z_{i-m}^{i+m}, y\bigr)\, d(\hat{x}, y), \tag{3.28}
\]
where d(·, ·) is the original loss function under which the performance of the denoiser is to be measured. The described denoiser is shown to be universally optimal [43]. The basic theoretical justification is that the rate-distortion function of the noisy signal Z under the difference distortion measure satisfies the Shannon lower bound with equality, and it is proved in [43] that for such sources²

2

for a fixed k, the

k th order empirical joint distribution between the source and reconstructed blocks defined as ˆ (k)n n (xk , y k ) , 1 {1 ≤ i ≤ n : (X i+k−1 , Y i+k−1 ) = (xk , y k )} , Q i i [X ,Y ] n

(3.29)

resulting from a sequence of good codes converge to PX k ,Y k in distribution, i.e., d ˆ (k)n n ⇒ Q PX k ,Y k , [X ,Y ]

where PX k ,Y k is the unique joint distribution that achieves the k th order rate-distortion function of the source. In the case of quantizing the noisy signal under the distortion measure defined in (3.26), at level H(V ), PX k ,Y k is the k th order joint distribution 2m+1 ˆ (2m+1) between the source and noisy signal. Hence, the count vector Q , y) den n (z [Z ,Y ]

fined in (3.27) asymptotically converges to PXi |Z n which is what the optimal denoiser would base its decision on. After estimating PXi |Z n , the post-processing step is just making the optimal Bayesian decision at each position. 2

In fact it is shown in [43] that this is true for a large class of sources including i.i.d sources and those satisfying the Shannon lower bound with equality.

3.5. APPLICATION: OPTIMAL DENOISING

29

The main ingredient of the described denoiser is a universal lossy compressor. Note that the MCMC-based lossy compressor described in Section V is applicable to any distortion measure. The main problem is choosing the parameter α corresponding to the distortion level of interest. To find the right slope, we run the quantization MCMC-based part of the algorithm independently from two different initial points α1 and α2 . After convergence of the two runs we compute the average distortion between the noisy signal and its quantized versions. Then assuming a linear approximation, we find the value of α that would have resulted in the desired distortion, and then run the algorithm again from this starting point, and again compute the average distortion, and then find a better estimate of α from the observations so far. After a few repetitions of this process, we have a reasonable estimate of the desired α. Note that for finding α it is not necessary to work with the whole noisy signal, and one can consider only a long enough section of data first, and find α from it, and then run the MCMC-based denoising algorithm on the whole noisy signal with the estimated parameter α. The outlined method for finding α is similar to what is done in [44] for finding appropriate Lagrange multiplier.

3.5.1

Experiments

In this section we compare the performance of the proposed denoising algorithm against discrete universal denoiser, DUDE [45], introduced in [46]. DUDE is a practical universal algorithm that asymptotically achieves the performance attainable by the best n-block denoiser for any stationary ergodic source. The setting of operation of DUDE

is more general than what is described in the previous section, and in fact in

DUDE

the additive white noise can be replaced by any known discrete memoryless

channel. As a first example consider a BSMS with transition probability p. Fig. 3.9 compares the performance of DUDE with the described algorithm. The slope α is chosen such that the expected distortion between the noisy image and its quantized version using Algorithm 1 is close to the channel probability of error which is δ = 0.1 in our case. Here we picked α = 0.9 for all values of p and did not tune it specifically

30

111 000 000 111 000111 000111 111 000 000 111 000 111 000111 111 000 000111 000 111 000 111

CHAPTER 3. RATE-DISTORTION VIA MCMC

Figure 3.7: The 4th order grid used for de-randomization in MCMC-based denoising

each time. Though, it can be observed that, even without optimizing the MCMC parameters, the two algorithms performances are very close, and for small values of p the new algorithm outperforms DUDE. As another example consider denoising a binary image. The channel a is DMC with error probability of 0.04. Fig. 3.10(b) shows the noisy image. Fig. 3.10(c) shows the reconstructed image generated by DUDE and 3.10(d) depicts the reconstructed image using the described algorithm. In this experiment the DUDE context structure is set as Fig. 3.7. The 2-D MCMC coder employs the same context as the one used in the example of

000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 111 000 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 000 000 111 000111 111 000111 000111 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000111 111 000 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000 000 000 111 000 000111 111 000111 111 000 111 000111 111 000 111 000 111 000 111 000 111 000 111 000 111 000 111 000111 111 000111 000

Section 3.4.1 shown in Fig. 3.1, and the derandomization block is chosen as Fig. 3.8.

Figure 3.8: The 20th order grid used for de-randomization in MCMC-based denoising

Discussion: The new proposed approach which is based on MCMC coding plus de-randomization is an alternative not only to the DUDE, but also to MCMC-based denoising schemes that have been based on and inspired by the Geman brothers’ work [32]. While algorithmically, this approach has much of the flavor of previous MCMC-based denoising approaches, ours has the merit of leading to a universal scheme, whereas the previous MCMC-based schemes guarantee, at best, convergence to something which is good according to the posterior distribution of the original

3.5. APPLICATION: OPTIMAL DENOISING

31

given the noisy data, but as would be induced by the rather arbitrary prior model placed on the data. It is clear that here no assumption about the distribution/model of the original data is made.

32

CHAPTER 3. RATE-DISTORTION VIA MCMC

0.12 0.11 0.1 0.09

dn (xn , x ˆn (zn ))

0.08 0.07 0.06 0.05 0.04 0.03 0.02

MCMC coding+ derandomization DUDE Forward−backward

0.01 0

0

0.05

0.1

p

0.15

0.2

0.25

Figure 3.9: Comparing the denoiser based on MCMC coding plus derandomization with DUDE and optimal non-universal Bayesian denoiser which is implemented via forward-backward dynamic programming. The source is a BSMS(p), and the channel is assumed to be a DMC with transition probability δ = 0.1. The DUDE parameters are: kleft = kright = 4, and the MCMC coder uses α = 0.9, βt = 0.5 log t, r = 10n, n = 104 , k = 7. The derandomization window length is 2 × 4 + 1 = 9.

3.5. APPLICATION: OPTIMAL DENOISING

(a) Original image.

(b) Noisy image corrupted by a BSC(0.04).

33

34

CHAPTER 3. RATE-DISTORTION VIA MCMC

(c) DUDE reconstruction image with dn (xn , x ˆn ) = 0.0081: kletf = kright = 4.

(d) MCMC coder + derandomization reconstruction image with dn (xn , x ˆn ) = 0.0128: α = 2, βt = 5 log t, r = 10n2 ,

Figure 3.10: Applying MCMC-based denoiser to a 2-D binary image

Chapter 4

Lossy compression via Viterbi Algorithm

In Chapter 3 a new implementable algorithm for lossy compression of discrete stationary ergodic sources was proposed. The drawback of the proposed scheme is that although its computational complexity per iteration is independent of the block length n and linear in a parameter kn = o(log n), there is no useful bound on the number of iterations required for convergence. In this chapter, inspired by the previous method, we propose yet another approach for lossy compression which universally achieves optimum rate-distortion performance. We start by assigning the same cost that was assigned in the previous method. The cost of each sequence is a linear combination of two terms: its empirical conditional entropy and its distance to the source sequence to be coded. We show that there exists proper linear approximation of the first term such that minimizing the linearized cost results in the same performance as minimizing the original cost. But the advantage is that minimizing the modified cost can be done via Viterbi algorithm in lieu of simulated annealing which was used for minimizing the original cost. 35

36

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

4.1

Linearized cost function

In Section 3.1, we proposed a universal algorithm for rate-distortion coding. The suggested algorithm was not appealing from an application point of view since it had exponential complexity. In Section 3.2 we showed that the exhaustive search required by this algorithm can be tackled through simulated annealing Gibbs sampling. Here assuming the source is a discrete Markov source, we propose another method for finding a sequence achieving the minimum in (3.1). The advantage of the new method is that its computational complexity is linear in n for fixed k. Before describing the new scheme, consider the problems (P1) and (P2) described below.

(P1) :

min [Hk (m(y n )) + αdn (xn , y n )] , n

(4.1)

y

and (P2) :

min n y

" XX β

#

λβ,b mβ,b (y n ) + αdn (xn , y n ) .

b

(4.2)

Comparing (P1) with (3.1) reveals that it is the optimization required by the exhaustive search coding scheme described before. The question is whether it is possible to choose a set of coefficients {λβ,b }β,b , β ∈ Xˆ and b ∈ Xˆ k , such that (P1) and (P2) have the same set of minimizers or at least, the set of minimizers of (P2) is a subset of minimizers of (P1). If the answer to this question is affirmative, then instead of solving (P1) one can solve (P2), which, as we describe in Section 4.3, can be done simply via the Viterbi algorithm. Let S1 and S2 denote the set of minimizers of (P1) and (P2). Consider some

z n ∈ S1 , and let m∗n = m(z n ). Since H(m) is concave in m,1 for any empirical count 1

As proved in Appendix B.

37

4.1. LINEARIZED COST FUNCTION

matrix m, we have ∗

H(m) ≤ H(m ) + ˆ , H(m).

X β,b

∂ H(m) (mβ,b − m∗β,b ) ∂mβ,b m∗n

(4.3) (4.4)

Now assume that in (P2), the coefficients are chosen as follows λβ,b

∂ = H(m) . ∂mβ,b m∗n

(4.5)

Lemma 1. (P1) and (P2) have the same minimum value, if the coefficients are chosen according to (4.5). Moreover, if all the sequences in S1 have the same type, then S1 = S2 .

Proof. For any y n ∈ Xˆ n , n ˆ H(m(y n )) + αdn (xn , y n ) ≤ H(m(y )) + αdn (xn , y n ).

(4.6)

n ˆ min [H(m(y n )) + αdn (xn , y n )] ≤ min [H(m(y )) + αdn (xn , y n)] n n

(4.7)

Therefore,

y

y

n ˆ ≤ H(m(z )) + αdn (xn , z n )

= min [H(m(y n )) + αdn (xn , y n )]. n y

(4.8) (4.9)

This shows that (P1) and (P2) have the same minimum values. For any sequence y n with m(y n ) 6= m∗n , by strict concavity of H(m), n ˆ H(m(y )) + αdn (xn , y n) > H(m(y n)) + αdn (xn , y n ),

≥ min [H(m(y n )) + αdn (xn , y n )]. n y

(4.10) (4.11)

As a result all the sequences in S2 should have the empirical count matrix equal to ˆ m∗n . Since for these sequences H(m) = H(m), we also conclude that S2 ⊂ S1 . If

38

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

there is a unique minimizing type m∗n , then S1 = S2 . This shows that if we knew the optimal type m∗n , then we could compute the optimal coefficients via (4.5), and solve (P2) instead of (P1). The problem is that m∗n is not known to the encoder (since knowledge of m∗n requires solving (P1) which is the problem we are trying to avoid). In the next section, we describe a method for approximating m∗n , and hence the coefficients {λβ,b}.

4.2

Computing the coefficients

For a given sequence xn and order k, define the set M(k) to be the set of all jointly ˆ k ) with marginal distributions with respect to X stationary distributions on (X k , X coinciding with the k th order empirical distribution induced by xn defined as follows (k)

|{k + 1 ≤ i ≤ n : (xi−k , . . . , xi−1 ) = ak }| , n−k n X 1 = 1 i−1 k , n − k i=k+1 xi−k =a

pˆ[xn ] (ak ) ,

(4.12)

where ak ∈ X k . More specifically a distribution p(k) in M(k) should satisfy the following constraints

1. Stationarity condition: as described in Appendix D, X

p(k) (ak , bk ) =

ak ∈X ,bk ∈Xˆ

X

p(k) (ak ak−1 , bk bk−1 ).

(4.13)

(k)

(4.14)

ak ∈X ,bk ∈Xˆ

2. Consistency condition: for each ak ∈ X k , X

bk ∈Xˆ k

p(k) (ak , bk ) = pˆ[xn ] (ak ).

39

4.2. COMPUTING THE COEFFICIENTS

For given xn , k and ℓ >> k, consider the following optimization problem ˆ k+1 |X ˆ k ) + α E d(X1 , X ˆ1) min H(X ˆ ℓ ) ∼ p(ℓ) s.t. (X ℓ , X p(ℓ) ∈ M(ℓ) .

(4.15)

An alternative representation of the above optimization problem is as follows min

X

H(m) + α

d(a, b)q(a, b)

a∈X ,b∈Xˆ

s.t.

∀ αℓ ∈ X ℓ , bℓ ∈ Xˆ ℓ , ∀ aℓ ∈ X ℓ , bℓ ∈ Xˆℓ ,

0 ≤ p(ℓ) (aℓ , bℓ ) ≤ 1, X p(ℓ) (aℓ , bℓ ) = 1, aℓ ,bℓ

X

p(ℓ) (aℓ , bℓ ) =

aℓ ∈X ,bℓ ∈Xˆ

X

bℓ ∈Xˆ ℓ

X

p(ℓ) (aℓ aℓ−1 , bk bℓ−1 ),

aℓ ∈X ,bℓ ∈Xˆ

(ℓ)

∀ aℓ−1 ∈ X ℓ−1 , bℓ−1 ∈ Xˆ ℓ−1,

p(ℓ) (aℓ , bℓ ) = pˆ[xn ] (aℓ ) ∀ aℓ ∈ X ℓ ,

q(a, b) =

X

p(ℓ) (aaℓ−1 , bbℓ−1 )

aℓ−1 ∈X ℓ−1 ,bℓ−1 ∈Xˆ ℓ−1

mβ,b =

X

aℓ ∈X ℓ ,bℓ−k ∈Xˆ ℓ−k

p(ℓ) (aℓ , bβbℓ−k ),

∀ β, b.

(4.16)

Note that the optimization in (4.16) is done over the joint distributions p(ℓ) of ˆ ℓ ). Let Pˆ ∗ denote the set of minimizers of (4.16), and Sˆ∗ be their k th order (X ℓ , X n n ˆ ˆ marginalized versions with respect to X. Let {λβ,b}β,b be the coefficients evaluated

ˆ ∗n ∈ Eˆn∗ using (4.5). Let X be a stationary ergodic Markov source, and at some m ˆ n be the reconstruction R(X, D) denote its rate distortion function. Moreover, let X sequence obtained by solving (P2) at the evaluated coefficients.

Theorem 4. If k = kn = o(log n), ℓ = ℓn = o(n1/4 ) and k = o(ℓ) such that kn , ℓn →

40

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

∞, as n → ∞, then h

i n→∞ n n ˆn ˆ E Hk (m(X )) + αdn (X , X ) −→ min [R(X, D) + αD] . D≥0

(4.17)

The proof of Theorem 4 is presented in Appendix F. Remark: Theorem 4 implies the fixed-slope universality of the scheme which does the lossless compression of the reconstruction by first describing its count matrix (costing a number of bits which is negligible for large n) and then doing the conditional entropy coding.

4.3

Viterbi coder

After finding the optimal coefficients, {λb,β }, as described in Section 4.1, Encoder

needs to solve (P2). In this section, we show how this can be done efficiently via Viterbi algorithm. Moreover, we propose a heuristic iterative approach for finding the coefficients. Note that the linearized cost can be written as X

n

[λβ,b mβ,b (y n ) + αdn (xn , y n )] =

b∈Xˆ k β∈Xˆ

i 1 Xh λyi ,yi−1 + αd(xi , yi ) . i−k n i=1

(4.18)

The advantage of this alternative representation is that, as it will be described, instead of using simulated annealing, we can find the sequence that minimizes (6.6) via Viterbi algorithm. Viterbi algorithm is a dynamic programming optimization method which finds the path of minimum weight in a Trellis diagram efficiently. For i = k + 1, . . . , n, let si := y i be the state at time i, and S be the set of all |Xˆ |k+1 possible states. i−k

From this definition, si is a function of si−1 and yi , i.e., si = g(si−1, yi), for some g : S × Xˆ → S. This representation leads to a Trellis diagram corresponding to the evolution of the states {si }ni=k+1 in which each state has |Xˆ | states leading to it and |Xˆ | states branching from it. Now to each edge in this graph, we assign a weight which depends on the tail state and the input symbol, i.e., to the edge e = (s′ , s)

41

4.3. VITERBI CODER

connecting states s′ and s = bk+1 at time i we assign weight wi (e) as wi (e) := λbk+1 ,bk + αd(xi , bk+1 ).

(4.19)

Note that these edge weights depend on time through the input sequence xn . In this representation, there is a 1-to-1 correspondence between sequences y n ∈ Xˆ n , and

sequences of states {si}ni=k+1 , and minimizing (6.6) is equivalent to finding the path of minimum weight in the corresponding Trellis diagram, i.e., the path {si }ni=k+1 that P minimizes ni=k+1 wi (ei ), where ei = (si−1 , si). Solving this minimization can readily be done by Viterbi algorithm which can be described as follows. For each state s, let L(s) be the |Xˆ | states leading to it, and for any i > 1, define Ci (s) := ′min [wi ((s′ , s)) + Ci−1 (s′ )]. s ∈L(s)

(4.20)

For i = 1 and s = bk+1 , let C1 (s) := λbk+1 ,bk +αdk+1(xk+1 , bk+1 ). Using this procedure, each state s at each time j has a path of length j − k − 1 which is the minimum path among all the possible paths between the states from time i = k + 1 to i = j such

that sj = s. After computing {Ci (s)} for all s ∈ S and all i ∈ {k + 1, . . . , n}, at time i = n, let

s∗ = arg min Cn (s).

(4.21)

s∈S

It is not hard to see that the path leading to s∗ is the path of minimum weight among all possible paths. Note that the computational complexity of this procedure is linear in n but exponential in k because the number of states increases exponentially with k. Therefore, given the coefficients {λb,β }, solving (P2) is straightforward using

Viterbi algorithm. The problem is finding an approximation of the optimal coefficients. The procedure outlined in Section 4.1 for finding the coefficients involves solving a non-convex optimization. To bypass this process, an alternative heuristic method is as follows: Start with m(xn ). Compute the coefficients from (4.5) at m(xn ). Employ Viterbi algorithm to solve (P2) at the computed coefficients. Let y n

42

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

denote the output sequence. Compute m(y n ), and recalculate the coefficients using (4.5) at m(y n ). Again, use Viterbi algorithm to solve (P2) at the updated coefficients. Iterate. The effectiveness of this approach is discussed in the next section through some simulations.

4.4

Approximating the coefficients

In Section 4.2 it was proposed that for finding the coefficients required by the new ˆ ∗ . Then the coefficients {λβ,b } can algorithm, first we should solve (4.16) to find m be computed by evaluating (4.5) at m∗ . But 4.16 is a concave minimization problem with high dimensions, and therefore hard to solve. In this section, we discuss de tours for approximating the coefficients which have moderate computational complexity. First assume that the given α is large, corresponding to some point in the low distortion region. In this case, the distance between xn and xˆn is small. Therefore, their types, i.e., (k + 1)th order empirical distributions, are close too. Therefore, m(xn ) provides a reasonable approximation for m∗ . This implies that if our desired distortion is small, we can simply compute the type of the input sequence, and evaluate the coefficients at m(xn ). In the case where the desired distortion is not very small, we can use an iterative approach as follows. Start with m(xn ). Compute the coefficients from (4.5) at m(xn ). Employ Viterbi algorithm to solve (P2) at the computed coefficients. Let xˆn denote the output sequence. Compute m(ˆ xn ), and recalculate the coefficients using (4.5) at m(ˆ xn ). Again, use Viterbi algorithm to solve (P2) at the updated coefficients. Iterate. In a nutshell, the iterative approach works as follows ˆ β,b} −→ x˜n −→ . . . . ˆ xn ) −→ {λ xn −→ m(xn ) −→ {λβ,b } −→ xˆn −→ m(ˆ In the following we show that this iterative approach converges at least to a local minimum of the cost function. ˆ xn ), Lemma 2. Assume that the algorithm starts by computing the coefficients at m(ˆ

43

4.4. APPROXIMATING THE COEFFICIENTS

where xˆn is a given arbitrary sequence, and by running the Viterbi algorithm using the computed coefficients gets x˜n . Then we have E(˜ xn ) ≤ E(ˆ xn ).

(4.22)

Proof. From the concavity of Hk (m) in m ˆ ⊙ (m ˜ ≤ Hk (m) ˆ +Λ ˜ − m), ˆ Hk (m) where A ⊙ B with A and B two matrices of the same dimensions is equal to On the other hand ˆ ⊙m ˆ = Λ =

X

ˆ β,b m λ ˆ β,b

X

 β ′ ∈X m ˆ β,b log  m ˆ β,b

(4.23) P

ai,j bi,j .

i,j

β,b

β,b

 P

ˆ = Hk (m).

m ˆ β ′ ,b

  

(4.24) (4.25)

Therefore, combining (4.23) and (4.24) yields ˆ ⊙ m. ˜ ≤Λ ˜ Hk (m)

(4.26)

Adding a constant term to the both sides of (4.27), we get ˆ ⊙m ˜ + αd(xn , x˜n ) ≤ Λ ˜ + αd(xn , x E(˜ xn ) = Hk (m) ˜n ).

(4.27)

But, since x˜n is assumed to be a minimizer of (P2) for the computed coefficients, ˆ ⊙m ˆ ⊙m ˜ + αd(xn , x ˆ + αd(xn , xˆn ), Λ ˜n ) ≤ Λ

ˆ + αd(xn , xˆn ), = Hk (m) = E(ˆ xn )

(4.28)

44

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

Therefore, combining (4.27) and (4.28) yields the desired result, i.e., E(˜ xn ) ≤ E(ˆ xn ).

(4.29)

Lemma 2 indicates that through iterations the cost is decreasing, and we are converging to a local minimum. Since the number of different costs are finite, after a finite number of steps the algorithm converges.

4.5

Simulation results

As a first example, consider an i.i.d. Bern(q) source with q = 0.25. Fig. 4.1 shows the rate-distortion curve corresponding to this source, i.e., R(D) = h(q) − h(D), where

h(p) = −p log p−(1−p) log(1−p), and the point on the curve corresponding to α = 2. Moreover, the succession of points derived from the iterative approach after I = 10 iterations of the Viterbi encoder toward the optimal point is shown in the figure. Each point is the average performance of L = 50 independent runs of the algorithm. Fig. 2 shows how the iterative approach reduces the cost Hk (ˆ xn ) + αdn (xn , xˆn ) after I = 4 iterations. It can be observed that increasing α results in less reduction. One explanation is the following: larger values of α correspond to smaller values of desired distortion. Therefore, since xn and optimal y n are close together, so are m(xn ) and m(y n ). Hence, in such cases, there is not much gain in repeating the process of estimating the coefficients since they are close to the optimal from the beginning. Fig. 4.2 corresponds to an i.i.d. Bern(q) source with q = 0.3. Each curve shows the reduction of cost through the iterations. The algorithm starts by computing the coefficients at m(xn ), and then iterating. Here n = 5 × 104 and k = 8. It

can be observed that after a few number of iterations the algorithms reaches a local minimum, and the cost stays constant after that. As a next example, consider a binary symmetric Markov source (BSMS) with transition probability q = 0.2. Fig. 6.1 compares the average performance of the Viterbi encoder against R(D) and Shannon lower bound for different values of α. Note

4.5. SIMULATION RESULTS

45

that the rate-distortion of this source is not known in general except up to D = 0.0159. Beyond that there are upper and lower bounds on R(D), the most famous of which is Shannon lower bound: R(D) ≥ h(q)−h(D). The average is taken over L = 20 runs of

the algorithm. The coefficients are computed at m(xn ). As described before, for large values of α, or equivalently small values of D, m(xn ) is a reasonable approximation of m∗ . But as α decreases, it becomes a poor approximation. Fig. 4.5 shows the reduction in the cost obtained by using the heuristic iterative approach. Note that for large and small values of α the gain is not significant, but for moderate values there is a significant gain in using the iterations. There are different explanations for the same effect observed at the two extreme cases of small and large values of α. While for large values, the iterations does not help because the initial point is already a reasonable approximation of the optimal value, for the other case, iterations does not help, because m(xn ) is a too poor approximation of m∗ and leads the algorithm into a local minima which it cannot escape through more iterations.

46

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

Iterative Viterbi encoder applied to an i.i.d. Bern.(q) source, q = 0.25 0.8 R(D) 0.7

progression towards the desired point after 10 iteratioons

0.6

point on R(D) corresponding to α = 2

Hk (ˆ xn )

0.5

0.4

0.3

0.2

0.1

0

0

0.05

0.1

0.15

0.2

0.25

d(xn , x ˆn )

Figure 4.1: Iterative Viterbi lossy coder with I = 10 iterations. [Bern(q) i.i.d. source, q = 0.25, α = 2, n = 2 × 104 , k = 8, and L = 50]

47

4.5. SIMULATION RESULTS

Bern(q) i.i.d. source, q = 0.3 0.95

0.9

0.85 α=3 α = 2.8 α = 2.6

Hk (ˆ xn ) + αd(xn , x ˆn )

0.8

α = 2.4 α = 2.2 α=2

0.75

0.7

0.65

0.6

0.55

0

10

20

30

40 I

50

60

70

80

Figure 4.2: Iterative Viterbi encoder. [Bern(q), i.i.d. source q = 0.3, n = 5 × 104 , k = 8]

48

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

Iterative Viterbi coder applied to an i.i.d. Bern.(q) source, q = 0.25 0.85

Hk (ˆ xn ) + αd(xn , x ˆn )

0.8

0.75

0.7

1 iteration 2 iterations

0.65

3 iterations 4 iterations 2

2.5

α

3

3.5

Figure 4.3: Reduction in the final cost after I = 4 iterations versus α. [Bern(q) i.i.d. source, q = 0.25, n = 2 × 104 , k = 8, and L = 50]

49

4.5. SIMULATION RESULTS

Binary symmetric Markov source, q = 0.2 R(D) Shannon lower bound α=4 α=4 α=3 α = 2.5 α=2

0.7

0.65

Hk (ˆ xn )

0.6

0.55

0.5

0.45

0.4

0

0.01

0.02

0.03

0.04

0.05

d(xn , x ˆn )

Figure 4.4: Viterbi encoder with the coefficients computed at m(xn ). [BSMS(q), q = 0.2, n = 2 × 104 , k = 9, and L = 20]

50

CHAPTER 4. RATE-DISTORTION VIA VITERBI ALGORITHM

Iterative approach applied to a binary Markov source with q = 0.2 0.72 0.71 0.7

Hk (ˆ xn ) + αd(xn , x ˆn )

0.69 0.68 0.67 0.66 1 iteration

0.65

2 iterations 3 iterations

0.64 0.63

4 iterations 3

3.2

3.4

3.6

3.8

4 α

4.2

4.4

4.6

4.8

5

Figure 4.5: Demonstrating the reduction in the cost, Hk (ˆ xn ) + αdn (xn , xˆn ), resulting from the iterative approach. [BSMS(q), q = 0.2, n = 2 × 104 , k = 8, and L = 20]

Chapter 5 Wyner-Ziv Coding of stationary ergodic sources Consider the basic setup shown in Fig. 5.1 consisting of a source with unknown statistics driving a known discrete memoryless channel (DMC), and a decoder that receives a compressed version of the source in addition to the noisy channel output. The goal is to minimize the distortion between the source and the reconstructed signal by optimally designing the encoder and decoder. This is the problem of rate-distortion coding with decoder side information, commonly known as Wyner-Ziv compression after the seminal paper [21]. Even without side information, the problem of finding universal practical schemes that get arbitrarily close to a given point on the ratedistortion curve is notoriously challenging (see [27] and [47] for recently proposed practical schemes). Even when the discrete source distribution is known, no practical scheme is currently known to approach the rate-distortion function when the source has memory. Other than the region of low distortion, the rate-distortion function is not known even for a binary Markov source (see [48],[49],[23]). As an example of practical motivation for the setup shown in Fig. 1, consider the problem of audio/video broadcasting where in addition to the analog signal, the decoder has access to some additional information transmitted in a digital channel for instance. In such setup, a legacy receiver only observes the output of the channel, while a more sophisticated receiver in addition has access to coded information which 51

52

CHAPTER 5. WYNER-ZIV CODING

helps boosting reproduction fidelity. Thus, we can view the setup as one of universal systematic channel coding where the added “redundancy” is received error-free. An alternative view of this problem is as a denoising problem where the denoiser, in addition to the noise-corrupted data, has access to a fidelity-boosting (FB) sequence conveyed to it via a channel of capacity R. Both viewpoints are equivalent because the source/channel separation theorem [50] guarantees that there is no loss in separating the source coding and channel coding operations at least under certain sufficient conditions on source and channel [51]. Therefore, the encoder is able to send any information with entropy less than the channel capacity almost losslessly to the decoder. Consequently, although in practice we would often have a channel of capacity R, we simply consider the encoder-channel-decoder chain as a noiseless bit pipe of rate R. Note that in these two viewpoints the role of the main signal and fidelity-boosting signal is interchanged. In this chapter, we adopt the latter, and suggest a new algorithm for WZ coding of a source with unknown statistics. We show that, for stationary ergodic sources, the algorithm is asymptotically optimal in the sense that its average expected loss per symbol converges to the minimum attainable expected loss. The encoder of the proposed algorithm consists of a sliding-block (SB) coder followed by Lempel-Ziv (LZ) compression [7]. SB lossy compression is shown in [52] and [53] to be able to perform as well as conventional lossy block compression. We extend this result to the WZ coding setup, and show that the same result holds in this case as well. The reason we use SB codes instead of block codes in our algorithm is the special type of decoder we employ. The decoder is based on a modification of the discrete universal denoiser (DUDE) algorithm [46], DUDE with FB (fbDUDE), to take advantage of the FB sequence. We prove that the optimality results of the original DUDE carry over to fbDUDE as well. As mentioned before, in our setting we always assume that the channel transition matrix is known both to the encoder and the decoder. As argued in [46], the assumption of a known channel and an unknown signal is realistic in many practical scenarios. Furthermore, unlike the DUDE setting [46], in the setup of this thesis the decoder can easily learn the channel, e.g., by having the encoder dedicate a negligibly

53

5.1. DUDE WITH FIDELITY BOOSTING INFORMATION

small portion of its rate to describing the first few components of the source sequence which then act as a training sequence. Further still, if a modicum of feedback from decoder to encoder is allowed, then the encoder too can be informed of the channel arbitrarily precisely. Therefore, unlike in the DUDE setting where knowledge of the channel plays a key role [54],[55], in our setting, at least decoder knowledge of the channel is not crucial, and our schemes can be easily modified to accommodate channel uncertainty. Some progress towards practical WZ coding schemes has been made in recent years, as seen, e.g. in [56–61]. The proposed schemes, however, operate under specific assumptions of a known (usually memoryless) source and side information channel. Practical schemes for more general source and/or channel characteristics have yet to be developed and, a fortiori, no practical universal schemes for this problem are known. The problem of WZ coding of a source with unknown statistics was recently considered in [62], where existence of universal schemes in a setting similar to ours is established. In contrast, our schemes suggest a paradigm for WZ coding of discrete sources which is not only practical but is justified through universal optimality results. Unknown Source

xn

DMC

Encoder

Zn

Decoder/ Denoiser

ˆn X

yn

Figure 5.1: The basic setup of WZ coding

5.1

DUDE with fidelity boosting information

The DUDE algorithm was proposed in [46] for universal noncausal denoising of a discrete signal corrupted by a known DMC. The DUDE is described by (5.1) and (5.2).

54

CHAPTER 5. WYNER-ZIV CODING

ˆ n,lz (z n )[i] = arg min rT (z n , z i−1 , z i+lz )π −1 [dxˆ ⊙ π z ], X i i−lz i+1 x ˆ∈Xˆ

(5.1)

where, for β ∈ Z,  i+lz i−1 r(z n , alz , blz )[β] = lz + 1 ≤ i ≤ n − lz : zi−l = alz , zi+1 = blz , zi = β z

(5.2)

Remark: Although in (5.1) it is implicitly assumed that the transition matrix π is a square matrix, this assumption is not necessary. As noted in [46], as long as the rows of π are linearly independent, the results can be generalized to non-square matrices by replacing π −1 with π T (π π T )−1 in (5.1). The linear independence of the rows of π requires the channel inputs to be identifiable, i.e., it is not possible to fake the output distribution corresponding to any of them using some input distribution over the other input symbols. This property, which holds for most channels of interest, will be assumed throughout this chapter. In short, DUDE works as follows. In its first pass through the noisy data, it estimates the conditional marginal distributions of the clean data given their noisy observation of PXi |Z n (·|·) by first estimating the bidirectional conditional probabilities PZi |Z n\i (·|·) through counting, and then using the invertibility of the DMC. Then in the second pass, it finds xˆi based on these estimations. The DUDE denoising algorithm is non-causal and therefore each output depends on the whole noisy sequence. The following optimality results have been shown for DUDE [46]. 1. Stochastic setting: the source is assumed to be stationary, and no further constraints are imposed on its distribution. Asymptotically DUDE performs as well as the best denoiser that knows the source distribution provided that the context length ℓ grows adequately with the data size. 2. Semi-stochastic setting: the source is assumed to be an individual sequence, and the only randomness is assumed to originate from the channel. Asymptotically,

5.1. DUDE WITH FIDELITY BOOSTING INFORMATION

DUDE

55

performs as well as the optimal denoiser in the class of sliding-window

denoisers for that particular sequence. Better experimental performance than the DUDE has been achieved by alternative methods to estimate the bidirectional conditional probabilities [63], [64] and [65]. As we discussed in Section I, decoding for the WZ problem can also be considered as a denoising problem where the denoiser, in addition to the noisy signal, has access to a FB sequence

designed by the source encoder to be as helpful as possible to the decoder.

From this perspective, we are motivated to generalize DUDE so as to handle not only the output of the DMC, but the fidelity boosting information. A desirable feature of such a generalization is the optimality in senses analogous to those described above. A natural way to accomplish this, which we refer to as fbDUDE is described in (5.3) and (5.4).

ˆ n,lz ,ly (z n , y n )[i] = arg min rT (z n , y n , z i−1 , z i+lz , y i+ly )π −1 [dxˆ ⊙ π z ], X i i−lz i+1 i−ly x ˆ∈Xˆ

(5.3)

where, for β ∈ Z, and t = max{lz , ly } l

y r(z n , y n, alz , blz , c−l )[β] = y n o i+ly ly i+lz i−1 lz lz = a , z = b , y = c , z = β . t + 1 ≤ i ≤ n − t : zi−l i+1 i−ly −ly i z

(5.4)

Note that the counting process is done simultaneously in both the noisy and FB sequences. Although seemingly more involved, the denoising algorithm described in (5.3) and (5.4), is simply the DUDE algorithm working on an enlarged context. For the fbDUDE,

the context of each symbol in the noisy signal in addition to the conventional

DUDE context of noisy neighboring symbols, FB

consists of the same context window of the

sequence. Note that in contrast to the conventional context of noisy neighboring

symbols, in the context window of the fidelity boosting sequence there is no “hole in the middle”. It should be noted that our proposed generalization of the DUDE will not be effective with reasonable computational complexity unless the fidelity boosting

56

CHAPTER 5. WYNER-ZIV CODING

sequence depends on the original clean signal in a sequential manner, such as a slidingwindow. An example of a non-sequential dependence is a fidelity boosting sequence generated by an arbitrary linear block code. In order to show that the optimality results of [46] carry over to this case, consider ˜ with input x˜i = (xi , yi ), and output z˜i = (zi , yi ), where zi is the output of a channel π, the original channel π, when the input is xi . Note that this channel does not disturb the second component of its input vector (xi , yi). As shown in the next result, since ˜ inherits its invertibility from the original channel π, the the newly defined channel π results of [46] concerning asymptotic optimality of DUDE can be applied to this case as well. Theorem 5. Provided that tn |X |2tn = o(n/ log n), ∀ x, y lim

n→∞

"

# n−t Xn 1 d(xi , xˆi ) = lim Dlz,n ,ly,n (xn , y n , Z n ), a.s., n→∞ n − 2tn i=t +1

(5.5)

n

where, Dlz,n ,ly,n (xn , y n , z n ) =

min

f :Y 2lz,n +1 ×Z 2ly,n +1 →X

n−t  Xn  1 i+lz,n i+ly,n d xi , f (yi−lz,n , zi−ly,n ) , n − 2tn i=t +1 n

and xˆi is the output of the denoiser in (5.3) and (5.4) with parameters lz,n , ly,n , and tn , max{lz,n , ly,n }. Remark 1: Here the class of decompressors is restricted to sliding-window decoders of finite-window length on both noisy data and FB sequence. Theorem 5 states that in the semi-stochastic setting where both the source and the FB sequence are individual sequences, with probability one, the asymptotic accumulated loss of the fbDUDE

decoder is no more than the loss incurred by the best decoder of the same

order in this class. Remark 2: Although Theorem 5 is stated for the semi-stochastic setting, as in [46], there is a counterpart in the stochastic setting where the source and FB sequences are jointly stationary processes.

5.2. SLIDING BLOCK WZ CODING

5.2

57

Approach 1: Sliding-Window Wyner-Ziv coding

The majority of achievability proofs in the information theory literature are based on the idea of random block coding. Shannon pioneered this technique for proving his coding theorems for both lossy compression and channel coding. In rate-distortion coding, besides block codes, another type of codes are sliding block codes which were introduced by R. Gray, and shown to achieve the (block-coding) rate-distortion function in [52]. SB encoders apply a function with a finite number of arguments to the source sequence, outputting another sequence that has lower entropy, but resembles the original sequence as much as the designer desires. In the rest of the section, we show that sliding block WZ coding achieves the Wyner-Ziv rate-distortion function for stationary sources.

5.2.1

Block coding

A WZ block code of length n and rate R consists of encoding and decoding mappings, fn and gn respectively, which are defined as follows: fn : X n → {1, 2, . . . , ⌈2nR ⌉}, gn : Z n × {1, 2, . . . , ⌈2nR ⌉} → Xˆ n . The performance of such code is defined as the expected average distortion per symbol between the source and reconstruction sequences, i.e., " n # X 1 ˆ n )] , E ˆ i) , E[dn (X n , X d(Xi , X n i=1 ˆ n = gn (Z n , fn (X n )). where X The rate distortion pair (R, D) is said to be achievable if for any given ǫ > 0,

58

CHAPTER 5. WYNER-ZIV CODING

there exists fn , and gn , such that E[dn (X n , gn (Z n , fn (X n )))] ≤ D + ǫ, for all sufficiently large n. For a given source X, and memoryless channel described by transition matrix π, the infimum of all achievable distortions at rate R is called DX,π , i.e., DX,π (R) = inf{D : (R, D) is achievable}. More explicitly, DX,π (R) is the distortion-rate function of our WZ coding setting.

5.2.2

Sliding-Block WZ compression

An extension of the idea of SB rate distortion coding is SB WZ coding. In this section, using the techniques of [53], we show that in WZ coding any performance that is achievable by block codes is also achievable by SB codes. A SB-WZ code consists of two time-invariant encoding and decoding mappings f and g. The encoding mapping f with constraint length of 2k + 1 maps every 2k + 1 source symbols into a symbol of Y which is the alphabet of the FB sequence; In other words

f : X 2k+1 → Y.

(5.6)

This encoder moves over the source sequence and generates the FB sequence Y by letting i+k Yi = f (Xi−k ).

(5.7)

On the other hand, the decoding mapping g with the constraint length of max{2lz + 1, 2ly + 1} maps a block of length 2lz + 1 of the noise corrupted signal and a block of length 2ly + 1 of Y sequence to a reconstruction symbol, i.e., g : Z 2lz +1 × Y 2ly +1 → Xˆ .

(5.8)

The decoder slides over the noisy and the FB sequences in a synchronous manner, and

5.2. SLIDING BLOCK WZ CODING

59

generates the reconstruction sequence by letting ˆ i = g(Z i+lz , Y i+ly ). X i−lz i−ly

(5.9)

The following theorem states that SB-WZ codes perform at least as well as WZ block codes. Theorem 6. Let (R, D) be an interior point in the (block) WZ rate-distortion region of a jointly stationary processes X and Z representing the source and FB sequences respectively. For any given ǫ1 > 0, there exists a SB-WZ encoder f : X 2k+1 → Y, where log |Y| ≥ R, and a SB decoder g with parameters lz and ly , such that h i i+ly i+lz i+k 1. E d(Xi, g(Zi−l , Y )) ≤ D + ǫ1 , where Yi = f (Xi−k ), i−ly z 1 H(Y1, . . . , Yn ) n→∞ n

2. H(Y) = lim

≤ R − ǫ2 , for some ǫ2 > 0.

Proof. The complete proof is presented in Appendix A; a sketch of the main idea follows. The proof is an extension of the proof given in [53] for showing that SB codes achieve the same performance of block codes in the rate-distortion problem, which in our scenario corresponds to the case where the decoder has only access to the FB sequence and there is no channel output. Since (R, D) is assumed to be an interior point of the achievable region in the R − D plane, it is possible to find a point (R1 , D) such that R1 < R, but still the new point is an interior point of the achievable region. Since (R1 , D) is an interior point,

there exists a block WZ encoder/decoder, (fn , gn ) of rate R1 and block length n, and average expected distortion less than D + ǫ, for any ǫ > 0. Instead of considering the initial point of (R, D), we consider this new point with R1 < R because, according to the theorem, our goal is to show that there exists a SB encoder resulting in a FB sequence with entropy rate lower than R. In order to achieve this goal we follow techniques similar to those in [53]. To derive a SB code from a block code, the most challenging part is dividing the source sequence into blocks of fixed length such that it is possible to apply the block code to these sub-blocks. This is a demanding task first because we are looking for a stationary SB code, and second because the decoder is also a SB decoder which should be able to discriminate between different coded

60

CHAPTER 5. WYNER-ZIV CODING

blocks concatenated by the encoder. The main tool in our proof, as in [53], is the Rohlin-Kakutani Theorem of ergodic theory. This theorem enables us to define a SB encoder which finds blocks of length n to apply the block code to them, and puts a tag sequence of negligible length ⌈nǫ⌉ after each encoded block. This tag sequence is not included in any of the codewords of the WZ block coder (fn , gn ) (the existence of such

tag sequence is shown in the proof). This would enable the decoder to discriminate between different coded blocks, while letting the encoder to generate a stationary FB sequence. The rest of the proof is devoted to showing that the SB code defined in this way would satisfy our desired constraints.

To conclude this section, note that the two-step achievability proof of Wyner and Ziv in [21] for proving their WZ theorem (rate-distortion with side information) is extended in [66] to devise a method which is used to prove a few SB source coding theorems (theorems of Berger, Kaspi and Tung [67], [68], [69]) for a general finitealphabet ergodic multiterminal source. The focus in [66] is on multiterminal sources for which we can no longer use the method used in [53] to derive SB codes because the tag sequences in the coded version received by different terminal are not synchronized. In our case, since we have only one terminal to code, and the side information is just the output of the DMC due to the source, it is still possible to use the method used in [53].

5.2.3

WZ-DUDE algorithm

In Section 5.1, we introduced the fbDUDE, a natural extension of the DUDE algorithm to the case where in addition to the noisy signal the denoiser has access to encoded side information. As described in Section III, this extension could easily be obtained by considering a larger context for denoising each symbol which comes from working on both signals simultaneously. Then Theorem 5 expressed the asymptotic optimality of the fbDUDE denoiser. Section IV introduced SB-WZ coding, and in Theorem 6 it was shown that using SB-WZ codes instead of WZ block codes incurs no loss of optimality. Motivated by the results established so far, in this section, we propose a new WZ

61

5.2. SLIDING BLOCK WZ CODING

coding scheme, and prove its asymptotic optimality. For any given block length n, let fn∗ and gn∗ denote the encoder and the decoder of the scheme respectively. The scheme has a number of parameters, namely lzn , kn , lyn and δ, that their meaning is made explicit in the following description of the algorithm. 1. Encoder: For a given source sequence xn define S(xn , k, R) to be the set of all SB

mappings of window length 2kn + 1 with the property that their output is

a sequence whose Lempel-Ziv compressed version LZ(·) is not longer than nR, i.e., n

S(x , kn , R) ,



f :X

 1 n → Y : ℓLZ (f (x )) ≤ R . n

2kn +1

(5.10)

n Note that f (xn ) is assumed to be equal to y n , where yi = f (xi+k i−kn ) for kn + 1 ≤

i ≤ n − kn , and yi = 0 otherwise. For each f ∈ S, and integers ln and lyn define V (f, lzn , lyn ) = min E

"

n−k Xn

i=kn +1

#   i+l i+l d xi , g(Zi−lzznn , yi−lyynn ) ,

(5.11)

where the minimization is over all decoding mappings g : Z 2lzn +1 × {1, . . . , ⌈2R ⌉}2lyn +1 → Xˆ . Let f ∗ (lzn , lyn ) be the mapping in S that minimizes V (f, lzn , lyn ), i.e., f ∗ (lzn , lyn ) = arg min V (f, lzn , lyn ). f ∈S

(5.12)

Then, the FB encoded sequence is the LZ compression of fn∗ (xn ) which is sent to the decoder. 2. Decoder: Upon obtaining fn∗ (xn ) with an LZ decompressor, the decoder employs the fbDUDE described in Section III, i.e.,the reconstructed signal is ˆ n,kn,lyn (Z n , f ∗ (xn )). X

62

CHAPTER 5. WYNER-ZIV CODING

The main result of this chapter is the following theorem, which shows that the described WZ-DUDE coding algorithm is asymptotically optimal. Theorem 7. Let kn , lzn , and lyn increase without bound with n, but sufficiently slowly that tn |X |tn = o(n/ log n), where tn = max{lzn , lyn }. Then, for any R ≥ 0, and any

stationary ergodic source X,

h  i ˆ n,lz,n,lyn (Z n , Y n )) = DX,π (R). lim E dn X n , X

n→∞

(5.13)

Proof. The full proof can be found in Appendix H. A brief outline of the proof is as follows. The first step is using Theorem 6, to find a SB-WZ code with mappings f and g which results in a final expected distortion less that D + 2ǫ , and a FB sequence of entropy rate less than R−ǫ2 (ǫ2 goes to zero as ǫ1 does). The second step uses the fact that for any stationary ergodic process, the LZ coding algorithm is an asymptotically optimal lossless compression scheme. Therefore, by choosing sufficiently large block length, the difference between the bit per symbol resulting from LZ compression of the FB sequence, and its entropy rate could be made sufficiently small. The third step is using the asymptotic optimality of fbDUDE decoding algorithm which guarantees that by choosing decoding window lengths properly, there is no loss in using fbDUDE decoder instead of any other possible sliding-window decoder.

Note that the only part of our scheme of questionable practicality is its encoding which requires listing all mappings of some finite window length which generate a FB sequence with LZ description length less than some fixed value depending on block length and coding rate. This is a huge number, e.g. for the binary case there are 2k+1

22

mappings having a window length of 2k + 1 (for k = 1 there are 256 mappings).

Therefore, we cannot implement the algorithm as described, but as shown in Section IV, this new scheme inspires pragmatic universal WZ coding schemes attaining good performance. Finally, it is worth mentioning the relationship between our encoding algorithm and the Yang-Kieffer fixed-rate lossy coding algorithm described in [13]: For block length n, the encoder constructs a codebook Cn consisting of all reconstruction blocks

63

5.2. SLIDING BLOCK WZ CODING

having LZ description length less than nR. The xn is represented by the nearest codeword xˆn in Cn . Yang and Kieffer [13] show that for any stationary ergodic source, this conceptually simple (but not implementable) scheme achieves the rate-distortion function as n goes to infinity. In our case, we construct our codebook in a similar way, but since the encoder knows that the decoder has access to the output of the DMC as well, the codeword that results in minimum average expected loss is chosen (the expectation is taken over all possible channel outputs for the best possible SB decoder).

5.2.4

Pragmatic approaches and experimental results

As mentioned earlier, the demanding aspect of the WZ-DUDE algorithm is finding the optimal mapping f ∗ . In all of the following cases, instead of looking for the optimal mapping, we choose a not-necessarily optimal mapping along with the WZDUDE

decoder. Furthermore, in all of the following cases, the distortion measure is

Hammig distortion, i.e., for x, xˆ ∈ X d(x, xˆ) =

(

0, x = xˆ; 1, x 6= xˆ.

(5.14)

Binary Image with BSC In this experiment, instead of looking for the optimal mapping, we use a lossy JPEG encoder. Since except for the encoding of the DC component, JPEG works on the 8×8 blocks separately, it can be considered as a SB encoder of window length 1 working on the super-alphabets formed by 8 × 8 binary blocks. Fig. 5.2.4 and Fig. 5.2.4 show the original binary image and its noise-corrupted version under a binary symmetric

channel with crossover probability 0.25. Fig. 5.2.4 shows the JPEG encoded image which requires 0.36 bit per pixel (b.p.p.) after PNG (Portable Network Graphics) lossless compression, compared to 0.59 b.p.p. required by the original image. The average Hamming distortion between the original image and lossy compressed version is 0.0749. Fig. 5.2.4 shows the result of denoising the noise corrupted image with DUDE ignoring the FB sequence. In this case the resulting average distortion is 0.1048. On

64

CHAPTER 5. WYNER-ZIV CODING

Figure 5.2: Original binary image, 0.59 b.p.p.

the other hand, Fig. 5.2.4 shows the result of denoising the noisy signal when the FB

sequence is also taken into account. The decoder/denoiser in this case is WZ-

DUDE

with parameters lz = 2 and ly = 1. The final average distortion between the

reconstructed image and the original image is 0.0444.

Text with erasure channel In this section we consider the case where our source is an English text document, and the DMC is a memoryless erasure channel that erases each symbol with probability e. To construct the FB sequence, we use a method which is similar to the first run of

the DUDE algorithm in which it tries to estimate PZi |Z n\i (·|·), to estimate PXi |X n\i (·|·). For a given window length of 2k + 1, the encoder generates the count matrix renc as follows

5.2. SLIDING BLOCK WZ CODING

65

Figure 5.3: Noise corrupted image, generated by passing the original signal through a BSC(0.25)

66

CHAPTER 5. WYNER-ZIV CODING

Figure 5.4: The FB image, y, generated by lossy JPEG coding of the original image , 0.36 b.p.p., dn (xn , y n ) = 0.0749

5.2. SLIDING BLOCK WZ CODING

Figure 5.5: Output of DUDE for lz = 2. dn (xn , xˆn ) = 0.1048

67

68

CHAPTER 5. WYNER-ZIV CODING

Figure 5.6: Output of the WZ-DUDE decoder for lz = 2, ly = 1, and R = 0.36 b.p.p., dn (xn , x ˆn ) = 0.0444

69

5.2. SLIDING BLOCK WZ CODING

k i+k k renc (xn , ak , bk )[β] = |{i : xi−1 i−k = a , xi+1 = b , xi = β}|.

(5.15)

For each left and right contexts ak and bk , the vector renc (xn , ak , bk ) is a 1 × |X |

vector with β th component being equal to the number of times the β th element of X

have appeared in xn sequence with its right and left contexts being equal to ak and i+k bk respectively. Therefore, from the count vector renc (xn , xi−1 i−k , xi+1 ) corresponding to

the right and left contexts of xi , the MAP estimation of xi is i+k arg max renc (xn , xi−1 i−k , xi+1 )[β],

(5.16)

β∈X

which is the symbol in X is the most frequent symbol in xn among those with the

same right and left contexts of zi . Similarly, for given right and left contexts, we can rank all the symbols in X according to their repetition frequency in our text within

the given contexts. Now for a FB alphabet cardinality of N, define Y = {1, . . . , N}.

The encoding function f is as follows

f (xi+k i−k ) =

    ℓ,

i+k th largest element, if xi = β, where renc (xn , xi−1 i−k , xi+1 )[β] is ℓ

and ℓ < N;    N, otherwise.

(5.17)

After constructing the sequence y n by sliding f over the original text, the LZ description of the resulting sequence is transmitted to the decoder. As mentioned in [46], the DUDE denoising rule for an erasure channel is equivalent to a majorityvote of the context counts, i.e., replacing each erasure with the most frequent symbol with the same context. WZ-DUDE decodes the erased symbol xi by first computing i−1 i+k rdec (z n , zi−k , zi+1 ). For moderate values of e, one would expect renc , and rdec to rank

the symbols similarly. Therefore, based on rdec count vector, and yi, xˆi is the source i−1 i+k alphabet corresponding to the yith largest element of rdec (z n , zi−k , zi+1 ). Note that

in this case, the window length of the SB encoder and decoder should be the same, otherwise y n does not help the decoder. Fig. 7 shows the percentage of erased symbols that are recovered by our WZ-DUDE

70

CHAPTER 5. WYNER-ZIV CODING

decoder for different values of N. For our experiments we have used the English translation of Don Quixote de La Mancha, by Miguel de Cervantes Saavedra (15471616) 1 . The text consists of approximately 2.3 × 106 characters. The channel is

assumed to have erasure probability 0.1. In Fig. 7, N = 1, corresponds to the case where there is no FB available to the decoder, or in other words, it corresponds to the performance of the DUDE. As it can observed, for N = 1 using larger context size k improves performance; As N increases, k = 1 outperforms k = 2. In addition

both curves seem to eventually saturate as N increases. Although one might expect that increasing N would always improve the performance, and tending it to |X |, one should be able to recover all erased symbols, we see in Fig. 8, this does not hold for our scheme. The reason is that, once one of the symbols in the context of an erased symbol is erased, the decoder is not able to construct the count vector i−1 i+k , zi+1 ), which is crucial in interpreting the FB sequence. In such cases we rdec (z n , zi−k

let xˆi to be the space character which has the largest frequency in the text. Therefore, the best error-correction performance that can be achieved by our scheme is upper bounded by the probability that none of the symbols in the context of an erased symbol are erased, which is equal to (1 − e)2k . In our example, for k = 1 the upper bound is 0.92 = 0.81, and for k = 2, it is 0.94 ≈ 0.66, which coincides with our curves.

To illustrate the performance of the algorithm, a small excerpt of length 154 of the

original text, its noise-corrupted version, and the outputs of its DUDE and WZ-DUDE decoded versions, for different values of k, and N are presented. • Clean text:

. . . methodising with rare patience and judgment what had been previously

brought to light, he left, as the saying is, no stone unturned under which anything . . . • Erasure-corrupted source: (12 erasures)

. . .*et*odising with ra*e pati*nce and judgment what had been previously brought to ligh*, he l*ft* as the s*yin* is, no stone untu*ned under which any*hin*. . .

1

which is available online from the Project Gutenberg web-site at http:promo.netpg

71

5.2. SLIDING BLOCK WZ CODING

• DUDE denoiser with no FB sequence, k = 1 (7 errors + 1 erasure)

. . .bethodising with rave pationce and judgment what had been previously brought to ligho, he lifto as the sayind is, no stone unturned under which any*hing . . .

• WZ-DUDE denoiser, k = 1, N = 2, R = 0.16 b.p.s. (2 errors)

. . .mhethodising with rare patience and judgment what had been previously brought to light, he lefth as the saying is, no stone unturned under which anything. . .

• DUDE denoiser with no FB, k = 2 (3 errors)

. . .bethodising with rate patience and judgment what had been previously brought to light, he lefts as the saying is, no stone unturned under which anything . . .

• WZ-DUDE denoiser, k = 2, N = 2, R = 0.12 b.p.s. (2 errors)

. . .rethodising with race patience and judgment what had been previously brought to light, he left, as the saying is, no stone unturned under which anything . . .

Binary Markov Source with Binary Erasure Channel

Consider a binary symmetric Markov source with transition probability $q$, denoted by BSMS($q$), which goes through a memoryless binary erasure channel (BEC) with erasure probability $e$. Let $X^n$ and $Z^n$ be the input and output of the channel, respectively. Note that in this case $\mathcal{X} = \{0,1\}$ and $\mathcal{Z} = \{0,1,e\}$, where $e$ denotes an erased symbol. Without having access to any other information, the optimal denoising rule which minimizes the probability of error is $\hat{X}_i = \arg\max_{\alpha \in \mathcal{X}} P(X_i = \alpha \mid Z^n)$. The probability of error of this optimal denoiser is
$$
P(\hat{X}_i \neq X_i) = P(\hat{X}_i \neq X_i, Z_i \neq e) + \sum_{l=1}^{\infty}\sum_{r=1}^{\infty} P\big(\hat{X}_i \neq X_i,\ Z_j = e \text{ for } j = i-l+1,\ldots,i+r-1,\ Z_{i-l} \neq e,\ Z_{i+r} \neq e\big) = \sum_{l=1}^{\infty}\sum_{r=1}^{\infty}\sum_{\alpha \in \mathcal{X}}\sum_{\beta \in \mathcal{X}} f(l, r, \alpha, \beta), \qquad (5.18)
$$


where
$$
f(l, r, \alpha, \beta) \triangleq P\big(\hat{X}_i \neq X_i,\ Z_{i-l+1}^{i+r-1} = e^{r+l-1},\ Z_{i-l} = \alpha,\ Z_{i+r} = \beta\big) = \min\big\{P(X_i = 0, Z_{i-l+1}^{i+r-1} = e^{r+l-1}, Z_{i-l} = \alpha, Z_{i+r} = \beta),\ P(X_i = 1, Z_{i-l+1}^{i+r-1} = e^{r+l-1}, Z_{i-l} = \alpha, Z_{i+r} = \beta)\big\}, \qquad (5.19)
$$
where $e^m$ denotes a vector of length $m$ with all elements equal to $e$. Note that $f(l, r, \alpha, \beta)$ is an easily computable function. For example, for $q < \frac{1}{2}$,
$$
f(1,1,0,0) = f(1,1,1,1) = P(\hat{X}_i \neq X_i, Z_i = e, Z_{i-1} = Z_{i+1} = 1) = \frac{q^2}{2}\, e (1-e)^2, \qquad (5.20)
$$
and
$$
f(1,1,0,1) = f(1,1,1,0) = P(\hat{X}_i \neq X_i, Z_i = e, Z_{i-1} = 1 - Z_{i+1} = 1) = \frac{1}{2}\, q(1-q)\, e (1-e)^2, \qquad (5.21)
$$
$$
f(1,2,0,0) = f(1,2,1,1) = f(2,1,1,1) = f(2,1,0,0) = P(\hat{X}_i \neq X_i, Z_i = Z_{i+1} = e, Z_{i-1} = Z_{i+2} = 1) = q^2 (1-q)\, e^2 (1-e)^2, \qquad (5.22)
$$
$$
f(1,2,0,1) = f(1,2,1,0) = f(2,1,0,1) = f(2,1,1,0) = P(\hat{X}_i \neq X_i, Z_i = Z_{i+1} = e, Z_{i-1} = 1 - Z_{i+2} = 1) = \frac{1}{2}\big(q^3 + (1-q)^2 q\big)\, e^2 (1-e)^2, \qquad (5.23)
$$


$$
f(2,2,0,0) = f(2,2,1,1) = P(\hat{X}_i \neq X_i, Z_{i-1} = Z_i = Z_{i+1} = e, Z_{i-2} = Z_{i+2} = 1) = 2 q^2 (1-q)^2\, e^3 (1-e)^2, \qquad (5.24)
$$
$$
f(2,2,0,1) = f(2,2,1,0) = P(\hat{X}_i \neq X_i, Z_{i-1} = Z_i = Z_{i+1} = e, Z_{i-2} = 1 - Z_{i+2} = 1) = \frac{1}{2}\big(2q(1-q)^3 + 2q^3(1-q)\big)\, e^3 (1-e)^2. \qquad (5.25)
$$
Similar expressions can be derived for higher values of $l$ and $r$. Note that
$$
\lim_{e \to 0} \frac{1}{e}\, P(\hat{X}_i \neq X_i) = q, \qquad (5.26)
$$
and
$$
\lim_{e \to 1} \frac{1}{e}\, P(\hat{X}_i \neq X_i) = 0.5. \qquad (5.27)
$$
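To make the expressions above concrete, the following small sketch (ours, not part of the original development) sums the explicitly computed terms (5.20)-(5.25), i.e., it truncates (5.18) at $l, r \le 2$, and numerically checks the limit (5.26):

```python
# Sketch: truncate (5.18) at l, r <= 2 using the closed forms (5.20)-(5.25).
def low_order_error(q, e):
    """Approximate P(Xhat_i != X_i) for a BSMS(q) through a BEC(e)."""
    f_11_same = 0.5 * q**2 * e * (1 - e)**2                               # (5.20)
    f_11_diff = 0.5 * q * (1 - q) * e * (1 - e)**2                        # (5.21)
    f_12_same = q**2 * (1 - q) * e**2 * (1 - e)**2                        # (5.22)
    f_12_diff = 0.5 * (q**3 + (1 - q)**2 * q) * e**2 * (1 - e)**2         # (5.23)
    f_22_same = 2 * q**2 * (1 - q)**2 * e**3 * (1 - e)**2                 # (5.24)
    f_22_diff = 0.5 * (2*q*(1 - q)**3 + 2*q**3*(1 - q)) * e**3 * (1 - e)**2  # (5.25)
    # each (l, r) pair contributes two "equal boundary" and two "different boundary" terms
    total = 2 * (f_11_same + f_11_diff)           # (l, r) = (1, 1)
    total += 2 * 2 * (f_12_same + f_12_diff)      # (l, r) = (1, 2) and (2, 1)
    total += 2 * (f_22_same + f_22_diff)          # (l, r) = (2, 2)
    return total

q = 0.25
for e in (0.1, 0.01, 0.001):
    print(e, low_order_error(q, e) / e)   # ratio approaches q = 0.25 as e -> 0, cf. (5.26)
```

For $q = 0.25$ the printed ratio tends to $0.25$ as $e \to 0$, consistent with (5.26).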

Fig. 5.8 shows the percentage of erased symbols decoded erroneously versus $e$, for a BSMS with $q = 0.25$. The points in Fig. 5.8 are computed by evaluating (5.18), and therefore reflect the performance of an optimal denoiser that knows the source distribution. It can be observed that as $e$ increases this percentage increases as well. The reason is that for small values of $e$ each erased symbol is surrounded by non-erased symbols with high probability, and therefore, since $q$ is relatively small, the erased symbol can be recovered correctly with high probability. On the other hand, as $e$ increases, the probability of having two consecutive erased symbols, which are harder to recover, increases as well. Now consider the WZ setup, where in addition to the output of the BEC, the decoder has access to a FB sequence of rate $R$ designed by the encoder to improve the decoder's performance. For generating this FB sequence we again use DUDE counts. For a fixed $k$, the encoder first forms the count matrix consisting of $r_{\text{enc}}(a^k, b^k)$ for all


$2^{2k}$ possible right and left contexts, each of length $k$. Then the FB sequence is
$$
Y_i = \begin{cases} 0, & \text{if } r_{\text{enc}}(X_{i-k}^{i-1}, X_{i+1}^{i+k})[X_i] \geq r_{\text{enc}}(X_{i-k}^{i-1}, X_{i+1}^{i+k})[1 - X_i];\\ 1, & \text{otherwise}. \end{cases} \qquad (5.28)
$$

In other words, $Y_i$ is 1 whenever $X_i$ differs from what its context predicts. As mentioned above, the DUDE decision rule for the BEC is majority-vote decoding. In the case of a binary source, unlike the text example, if there are erased symbols in the context of an erased bit we do not simply set $\hat{X}_i$ to some pre-fixed symbol. When, in addition to $Z_i$, some other bits of $Z_{i-k}^{i+k}$ are erased, the decoder's count vector $r_{\text{dec}}(Z_{i-k}^{i-1}, Z_{i+1}^{i+k})$ is the average of the count vectors corresponding to all possible binary contexts coinciding with $Z_{i-k}^{i+k}$ at the non-erased positions. If $k_e$ bits out of the $2k$ context bits are erased, then there are $2^{k_e}$ such contexts that agree with the original context in the $2k - k_e$ non-erased bits. Let $b = \arg\max_{\beta \in \{0,1\}} r_{\text{dec}}(Z_{i-k}^{i-1}, Z_{i+1}^{i+k})[\beta]$; then
$$
\hat{X}_i = \begin{cases} b, & \text{if } Y_i = 0;\\ 1 - b, & \text{if } Y_i = 1. \end{cases} \qquad (5.29)
$$
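As an illustration of how (5.28) and (5.29) fit together, here is a minimal sketch (ours, not the thesis' implementation) of the binary WZ-DUDE encoder and decoder described above. Symbols are assumed to be integer-coded, with a dedicated erasure value; the decoder in this simplified rendering accumulates counts only over fully observed contexts, and averages them over all binary contexts consistent with the non-erased positions.

```python
from collections import defaultdict

ERASED = 2  # channel output alphabet {0, 1, 2}, with 2 denoting an erasure

def context_counts(seq, k):
    """Count, for each two-sided context of length k, how often each bit appears in the middle."""
    counts = defaultdict(lambda: [0, 0])
    for i in range(k, len(seq) - k):
        if seq[i] in (0, 1):
            ctx = (tuple(seq[i-k:i]), tuple(seq[i+1:i+k+1]))
            counts[ctx][seq[i]] += 1
    return counts

def fb_sequence(x, k):
    """Feedback bits of (5.28): Y_i = 1 iff x_i disagrees with the majority vote of its context."""
    r = context_counts(x, k)
    y = [0] * len(x)
    for i in range(k, len(x) - k):
        ctx = (tuple(x[i-k:i]), tuple(x[i+1:i+k+1]))
        y[i] = 0 if r[ctx][x[i]] >= r[ctx][1 - x[i]] else 1
    return y

def wz_dude_decode(z, y, k):
    """Decoder of (5.29): average count vectors over all clean contexts consistent with
    the non-erased positions, take the majority vote, and flip it whenever Y_i = 1."""
    r = context_counts(z, k)
    xhat = list(z)
    for i in range(k, len(z) - k):
        if z[i] != ERASED:
            continue
        window = list(z[i-k:i]) + list(z[i+1:i+k+1])
        candidates = [[]]
        for b in window:                      # enumerate the 2^{k_e} consistent binary contexts
            candidates = [c + [v] for c in candidates
                          for v in ([b] if b != ERASED else [0, 1])]
        vote = [0, 0]
        for c in candidates:
            ctx = (tuple(c[:k]), tuple(c[k:]))
            vote[0] += r[ctx][0]
            vote[1] += r[ctx][1]
        b_hat = 0 if vote[0] >= vote[1] else 1
        xhat[i] = b_hat if y[i] == 0 else 1 - b_hat
    return xhat
```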

Consider again the BSMS with $q = 0.25$ passed through a BEC with $e = 0.1$. From Fig. 5.8, an optimal denoiser that only has access to the output of the BEC will decode at least 25.13% of the erased bits wrongly. On the other hand, the DUDE denoiser decodes 25.44% of the erased bits erroneously, which is almost equal to the performance of the optimal non-universal denoiser that knows the statistics of the source. Now assume that the encoder also sends to the decoder the FB sequence $Y^n$ constructed as described in (5.28). From our simulation results, for this case, the entropy of the FB sequence is around $R = 0.3$ b.p.s., and applying the described WZ-DUDE decoder to $(Y^n, Z^n)$ reduces the probability of error to 19.3%.

Figure 5.7: Percentage of erasures that are recovered by the WZ-DUDE decoder versus the size of the FB alphabet $|\mathcal{Y}|$, for $k = 1$ and $k = 2$, and $e = 0.1$.

Figure 5.8: $100\, P(\hat{X}_i \neq X_i)/e$ versus $e$, where $P(\hat{X}_i \neq X_i)$ is the probability of error of the optimal denoiser that knows the source distribution, computed using (5.18), for a BSMS with $q = 0.25$ passing through an erasure channel with erasure probability $e$.


5.3 Approach 2: MCMC-encoder plus fbDUDE decoder

As elaborated before, the problem with the scheme described in Section 5.2 is its encoder computational complexity. In this section, based on the ideas used in Chapter 3, we propose an alternative encoder which is implementable. Again, the idea is to use simulated annealing to find an approximation of the solution. We start by proposing yet another exhaustive-search-based WZ coding scheme. Given $x^n$ and $\alpha > 0$, the encoder constructs the FB sequence $Y^n$ as follows:
$$
Y^n = Y^n(x^n, \tilde{Z}^n) = \arg\min_{y^n} \Big[ H_k(y^n) + \alpha\, d_n\big(x^n, \hat{X}^{n, l_z, l_y}(y^n, \tilde{Z}^n)\big) \Big], \qquad (5.30)
$$
where $\tilde{Z}^n$ is a simulated output of the channel $\Pi$ for the input $x^n$ generated by the encoder, and $\hat{X}^{n, l_z, l_y}(y^n, \tilde{Z}^n)$ is the output of a WZ-DUDE decoder with parameters $l_z$ and $l_y$ as defined in Section 5.2.3. More specifically, $\tilde{Z}^n$ is a random sequence sampled from the following conditional distribution implied by the DMC:
$$
P(\tilde{Z}^n = \tilde{z}^n \mid X^n = x^n) = \prod_{i=1}^{n} \pi(x_i, \tilde{z}_i). \qquad (5.31)
$$
Then the encoder sends the LZ description of $Y^n$ to the decoder. The decoder retrieves $Y^n$ by decoding its LZ description, and also receives the output of the DMC, $Z^n$. It reconstructs $\hat{X}^n$ by applying fbDUDE to $(Y^n, Z^n)$, i.e., $\hat{X}^n = \hat{X}^{n, l_z^n, l_y^n}(Y^n, Z^n)$, where $l_z^n$ and $l_y^n$ grow without bound with $n$, but sufficiently slowly that $t_n |\mathcal{X}|^{t_n} = o(n/\log n)$, where $t_n = \max\{l_z^n, l_y^n\}$.
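Since the empirical conditional entropy $H_k(\cdot)$ is the quantity being traded off against distortion in (5.30) (and again in (5.41) and (6.13) below), the following short sketch shows one way to compute it. It is an illustration under our own conventions rather than the thesis' implementation.

```python
import math
from collections import defaultdict

def empirical_cond_entropy(y, k):
    """H_k(y^n): empirical entropy of a symbol given its k left neighbours, in bits per symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(y)):
        counts[tuple(y[i-k:i])][y[i]] += 1
    total = len(y) - k
    h = 0.0
    for dist in counts.values():
        n_ctx = sum(dist.values())
        for c in dist.values():
            h -= (c / total) * math.log2(c / n_ctx)
    return h

# a perfectly predictable sequence has H_k close to 0
print(empirical_cond_entropy([0, 1] * 500, 2))
print(empirical_cond_entropy([0, 1, 1, 0, 1] * 200, 2))
```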

Theorem 8. Let $\mathbf{X}$ be a stationary and ergodic source, and let $R_\Pi(\mathbf{X}, D)$ denote its associated WZ rate-distortion function. Let $\tilde{Y}^n$ refer to the FB signal generated by the above WZ coding scheme. Then
$$
\limsup_{n \to \infty} \Big[ H_k(\tilde{Y}^n) + \alpha\, d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(Z^n, \tilde{Y}^n)\big) \Big] = \min_{D \geq 0}\big[R_\Pi(\mathbf{X}, D) + \alpha D\big], \quad \text{w.p.\ 1}. \qquad (5.32)
$$

Proof. Let $(D_\alpha, R_\alpha)$ be the point on the WZ rate-distortion function of the source $\mathbf{X}$ that minimizes $R + \alpha D$. For a given $\epsilon > 0$, consider a process $\mathbf{Y}$ jointly stationary ergodic with $\mathbf{X}$ such that

1. the processes $\mathbf{Y}$ and $\mathbf{Z}$ are conditionally independent given $\mathbf{X}$;

2. $\bar{H}(\mathbf{Y}) \leq R_\alpha + \epsilon$;

3. $\mathrm{E}\, d(X_1, \hat{X}_1) \leq D_\alpha + \epsilon$, where the process $\hat{\mathbf{X}}$ is formed by applying some finite-length sliding-window WZ decoder $g$ to the processes $\mathbf{Z}$ and $\mathbf{Y}$.

The existence of such a process is proved in Theorem 6. Based on the process $\mathbf{Y}$, consider the WZ coding scheme that uses $Y^n$ as the FB sequence and runs fbDUDE at the decoder. The assumption is that the encoder has access to the process $\mathbf{Y}$. By ergodicity of $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$, and conditional independence of the processes $\mathbf{Y}$ and $\mathbf{Z}$ given the process $\mathbf{X}$,
$$
\lim_{n} \Big[ d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(\tilde{Z}^n, Y^n)\big) - d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(Z^n, Y^n)\big) \Big] = 0, \quad \text{w.p.\ 1}. \qquad (5.33)
$$

On the other hand, by (i) our assumptions about $\mathbf{Y}$, (ii) the asymptotic optimality of the fbDUDE algorithm, and (iii) the fact that
$$
\limsup_{n} \big| H_k(Y^n) - \bar{H}(\mathbf{Y}) \big| \to 0, \quad \text{w.p.\ 1}, \qquad (5.34)
$$
we have
$$
\limsup_{n \to \infty} \Big[ H_k(Y^n) + \alpha\, d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(Z^n, Y^n)\big) \Big] \leq R_\alpha + \alpha D_\alpha + \epsilon(1 + \alpha) = \min_{D \geq 0}\big[R_\Pi(\mathbf{X}, D) + \alpha D\big] + (1 + \alpha)\epsilon, \quad \text{w.p.\ 1}. \qquad (5.35)
$$


But
$$
H_k(\tilde{Y}^n) + \alpha\, d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(Z^n, \tilde{Y}^n)\big) \leq H_k(Y^n) + \alpha\, d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(Z^n, Y^n)\big). \qquad (5.36)
$$
Therefore, since $\epsilon$ is arbitrary, combining (5.36) and (5.35) yields the desired result.

In the next section, inspired by the MCMC-based lossy quantizer described earlier, we put forward an implementable MCMC-based WZ encoder to replace the exhaustive-search encoder of the WZ coding algorithm just described. We show that asymptotically the new encoder incurs no loss compared to the original one.

5.3.1 Count vectors and DUDE operator

Let $u = u_{-l_z}^{l_z} \in \mathcal{Z}^{2l_z+1}$ and $v = v_{-l_y}^{l_y} \in \mathcal{Y}^{2l_y+1}$, and define $\mathbf{f}$ to be a $|\mathcal{Z}|$-dimensional vector whose $u_0$-th component, for $u_0 \in \mathcal{Z}$, is defined as
$$
\mathbf{f}(z^n, y^n, u_{-l_z}^{-1}, u_1^{l_z}, v)[u_0] = \frac{1}{n}\Big|\big\{ i : z_i = u_0,\ z_{i-l_z}^{i-1} = u_{-l_z}^{-1},\ z_{i+1}^{i+l_z} = u_1^{l_z},\ y_{i-l_y}^{i+l_y} = v \big\}\Big|. \qquad (5.37)
$$

Likewise, let $\mathbf{g}$ be a $|\mathcal{X}|$-dimensional vector defined as
$$
\mathbf{g}(z^n, y^n, x^n, u, v)[x] = \frac{1}{n}\Big|\big\{ i : x_i = x,\ z_{i-l_z}^{i+l_z} = u,\ y_{i-l_y}^{i+l_y} = v \big\}\Big|, \qquad (5.38)
$$

where $x \in \mathcal{X}$. Note that $\mathbf{g}$ represents a more refined count vector compared to $\mathbf{f}$, and they are related as follows:
$$
\mathbf{f}(z^n, y^n, u_{-l_z}^{-1}, u_1^{l_z}, v)[u_0] = \mathbf{1}^T \mathbf{g}(z^n, y^n, x^n, u, v), \qquad (5.39)
$$
where $\mathbf{1}$ represents a $|\mathcal{X}|$-dimensional vector of all ones.
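For concreteness, here is a small sketch (ours) of how the count vectors $\mathbf{f}$ and $\mathbf{g}$ of (5.37)-(5.38) can be accumulated in a single pass over the data; symbols are assumed to be integer-coded and positions too close to the boundary are simply skipped.

```python
from collections import defaultdict

def count_vectors(z, y, x, lz, ly, Z_size, X_size):
    """Return dictionaries f and g keyed by the joint context:
    f is keyed by (left z-context, right z-context, y-window) and counts the centre noisy symbol;
    g is keyed by (full z-window, y-window) and counts the co-located clean symbol.
    Both are normalized by n, as in (5.37)-(5.38)."""
    n = len(z)
    f = defaultdict(lambda: [0.0] * Z_size)
    g = defaultdict(lambda: [0.0] * X_size)
    lo, hi = max(lz, ly), n - max(lz, ly)
    for i in range(lo, hi):
        u = tuple(z[i-lz:i+lz+1])           # window of length 2*lz + 1, centre u[lz] = z_i
        v = tuple(y[i-ly:i+ly+1])           # window of length 2*ly + 1
        f[(u[:lz], u[lz+1:], v)][u[lz]] += 1.0 / n
        g[(u, v)][x[i]] += 1.0 / n
    return f, g
```

Summing the entries of $\mathbf{g}$ over the clean symbol for a fixed $(u, v)$ reproduces the corresponding entry of $\mathbf{f}$, which is exactly the relation (5.39).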


For a $|\mathcal{Z}|$-dimensional vector $\theta$, define the DUDE operator as
$$
\hat{X}_{\text{DUDE}}(\theta, z) = \arg\min_{\hat{x}}\ \theta^T \Pi^{-1} \big[\mathbf{d}_{\hat{x}} \odot \pi_z\big]. \qquad (5.40)
$$
To interpret the definition of (5.40), note that when $\theta$ corresponds to the count vector obtained by a DUDE denoiser [46], such that for $\alpha \in \mathcal{Z}$, $\theta_\alpha$ is the number of appearances of $\alpha$ within some specific left and right contexts along the noisy signal, then (5.40) represents the DUDE decision rule for denoising the symbol $z$ in the noisy signal having the same left/right contexts.
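A minimal sketch of the operator in (5.40) is given below. It assumes a square and invertible channel matrix $\Pi$ (as in a BSC); the function name and the use of numpy are our own choices, not part of the thesis.

```python
import numpy as np

def dude_operator(theta, z, Pi, Lambda):
    """Evaluate the decision rule (5.40) for one noisy symbol.

    theta  : count vector over Z for the current context
    z      : index of the noisy symbol to be denoised
    Pi     : square, invertible channel matrix, Pi[x, z'] = pi(x, z')
    Lambda : loss matrix, Lambda[x, xhat] = d(x, xhat)
    """
    Pi_inv = np.linalg.inv(Pi)
    pi_z = Pi[:, z]                       # column of the channel matrix for output z
    scores = [theta @ Pi_inv @ (Lambda[:, xhat] * pi_z)    # theta^T Pi^{-1} [d_xhat (.) pi_z]
              for xhat in range(Lambda.shape[1])]
    return int(np.argmin(scores))
```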

5.3.2 MCMC-based WZ encoder

Like in the lossy compression case, in order to solve the minimization required to find the FB sequence by the exhaustive-search WZ coder, we resort to simulated annealing. Again we assign an energy to each possible FB sequence as follows:
$$
\mathcal{E}(y^n) \triangleq H_k(y^n) + \alpha\, d_n\big(x^n, \hat{X}^{n, l_z, l_y}(y^n, \tilde{Z}^n)\big) = \sum_{b} \mathbf{1}^T \mathbf{m}_{\cdot,b}(y^n)\, H\big(\mathbf{m}_{\cdot,b}(y^n)\big) + \alpha \sum_{u, v} \mathbf{g}(\tilde{Z}^n, y^n, x^n, u, v)^T\, \mathbf{d}_{\hat{X}_{\text{DUDE}}\left(\mathbf{f}(\tilde{Z}^n, y^n, u_{-l_z}^{-1}, u_1^{l_z}, v),\, u_0\right)}, \qquad (5.41)
$$
and then assume the Boltzmann probability distribution over the space of FB sequences, $p_\beta(y^n) \propto e^{-\beta \mathcal{E}(y^n)}$. The advantage of decomposing the energy function as in (5.41) is that now we can run the simulated annealing Gibbs sampling algorithm in a bid to eventually sample from the defined Boltzmann distribution for large values of $\beta$. The energy of the output FB sequence is then, with high probability, very close to the energy of the minimizer of $\mathcal{E}(y^n)$, which in turn is what the exhaustive-search WZ

coder finds at its first step. From (5.41), in order to compute the change in $\mathcal{E}(y^n)$ resulting from updating the value of some $y_i$ chosen at random, as required by the Gibbs sampler, we only need to update the count vectors $\mathbf{m}$, $\mathbf{f}$ and $\mathbf{g}$, which requires a number of calculations at most linear in $\max(k, l_z, l_y)$. After finding the FB sequence through this process, the rest of the procedure is similar to the WZ


coder described in the previous section. Like before, let $\beta_t = \frac{1}{T_0^{(n)}} \log\big(\lfloor \tfrac{t}{n} \rfloor + 1\big)$, for some $T_0^{(n)} > n\Delta$, where
$$
\Delta = \max_{x^n \in \mathcal{X}^n}\ \max_{\tilde{z}^n \in \mathcal{Z}^n}\ \max_{i}\ \max_{\substack{u^{i-1} \in \mathcal{Y}^{i-1},\ u_{i+1}^{n} \in \mathcal{Y}^{n-i},\\ a, b \in \mathcal{Y}}} \big| \mathcal{E}(u^{i-1} a u_{i+1}^{n}) - \mathcal{E}(u^{i-1} b u_{i+1}^{n}) \big|. \qquad (5.42)
$$

Since the energy assigned to a FB sequence depends both on the sequence to be coded and on the simulated noisy sequence $\tilde{Z}^n$ generated by the encoder, the maximization in (5.42) is also over the choice of the noisy sequence $\tilde{Z}^n$ and the input sequence $x^n$. Again, let $l_z^n$ and $l_y^n$ grow without bound with $n$, but sufficiently slowly that $t_n |\mathcal{X}|^{t_n} = o(n/\log n)$, where $t_n = \max\{l_z^n, l_y^n\}$. Also let $k = o(\log n)$.

Theorem 9. For a stationary ergodic source $\mathbf{X}$,
$$
\limsup_{n \to \infty}\ \lim_{r \to \infty}\ \mathrm{E}\Big[ H_{k_n}(Y^n) + \alpha\, d_n\big(X^n, \hat{X}^{n, l_z^n, l_y^n}(Z^n, Y^n)\big) \Big] = \min_{D \geq 0}\big[R_\Pi(\mathbf{X}, D) + \alpha D\big]. \qquad (5.43)
$$

Algorithm 4 Generating the FB sequence via simulated annealing
Require: $x^n$, $\tilde{Z}^n$, $k$, $l_y$, $l_z$, $\alpha$, $\{\beta_t\}_{t=1}^{r}$, $r$
Ensure: a FB sequence $y_{\text{FB}}^n$
1: $y^n \leftarrow x^n$; compute $\{\mathbf{g}(\tilde{Z}^n, y^n, x^n, u, v)\}$ for all $u = u_{-l_z}^{l_z} \in \mathcal{Z}^{2l_z+1}$ and $v = v_{-l_y}^{l_y} \in \mathcal{Y}^{2l_y+1}$, and $\mathbf{m} = \mathbf{m}(y^n)$
2: for $t = 1$ to $r$ do
3:   Draw an integer $i \in \{1, \ldots, n\}$ uniformly at random
4:   For each $y \in \mathcal{Y}$ compute $q(y) = p_{\beta_t}(Y_i = y \mid Y^{n \setminus i} = y^{n \setminus i})$
5:   Update $y^n$ by letting $y_i = V$, where $V \sim q$
6:   Update $\{\mathbf{g}(\tilde{Z}^n, y^n, x^n, u, v)\}$ and $\mathbf{m}$
7: end for
8: $y_{\text{FB}}^n \leftarrow y^n$
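The structure of Algorithm 4 is that of a standard heat-bath (Gibbs) sampler with a logarithmic annealing schedule. The sketch below (ours) shows that skeleton with a generic energy function; the actual scheme evaluates the energy change incrementally through the count vectors $\mathbf{m}$, $\mathbf{f}$ and $\mathbf{g}$ rather than recomputing $\mathcal{E}(y^n)$ from scratch as done here.

```python
import math, random

def gibbs_anneal(x, energy, alphabet, n_iters, T0):
    """Simulated-annealing Gibbs sampler in the spirit of Algorithm 4 (a simplified sketch)."""
    n = len(x)
    y = list(x)                                     # initialize y^n <- x^n
    for t in range(1, n_iters + 1):
        beta = math.log(t // n + 1) / T0            # annealing schedule beta_t
        i = random.randrange(n)                     # coordinate chosen uniformly at random
        weights = []
        for a in alphabet:                          # conditional law of y_i given the rest
            y[i] = a
            weights.append(math.exp(-beta * energy(y)))
        y[i] = random.choices(alphabet, weights=weights)[0]
    return y
```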

5.3.3 Simulation results

Consider the binary image shown in Fig. 3.4.2 passed through a binary symmetric channel with crossover probability $q = 0.15$. Fig. 5.10 shows the FB sequence generated by Algorithm 4 when the input parameters are $\beta_t = 0.08\, t \log t$, $\alpha = 0.01$, $k_z = 2$, $k_y = 1$, and $k = 7$. Here the image is first converted to a 1-D signal by a Peano-Hilbert scan of the 2-D signal, and then the algorithm is applied to the transformed signal. Fig. 5.9 shows the noisy image received by the decoder, and Fig. 5.11 shows the image reconstructed by the WZ-DUDE algorithm. The average distance between the original signal and the FB signal is $d_n(x^n, y_{\text{FB}}^n) = 0.0441$, and $H_k(y_{\text{FB}}^n) = 0.1135$ versus $H_k(x^n) = 0.1742$. Finally, the average distortion between the reconstructed image and the original image is $d_n(x^n, \hat{x}^n) = 0.0279$.
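The Peano-Hilbert scan mentioned above can be implemented with the standard iterative index-to-coordinate construction for the Hilbert curve. The sketch below (ours) assumes a square binary image whose side length is a power of two; it is only meant to illustrate the linearization step, not the thesis' actual code.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve filling a 2^order x 2^order grid to (x, y)."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def peano_hilbert_scan(image):
    """Turn a square binary image (side = power of two) into a 1-D signal
    by visiting pixels in Hilbert-curve order, as done before running Algorithm 4."""
    order = (len(image) - 1).bit_length()
    return [image[yy][xx] for xx, yy in
            (hilbert_d2xy(order, d) for d in range(len(image) ** 2))]
```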


Figure 5.9: Noisy image generated by the binary symmetric channel with q = 0.15


Figure 5.10: FB signal generated by Algorithm 4, $d_n(x^n, y_{\text{FB}}^n) = 0.0441$ [$\alpha = 0.01$, $k_z = 2$, $k_y = 1$, and $k = 7$].


Figure 5.11: Reconstructed image by the WZ-DUDE algorithm, dn (xn , xˆn ) = 0.0279.

Chapter 6

Multiple description of stationary ergodic sources

Consider a packet network where a signal is to be described to several receivers. In a basic setup, the source is coded by a lossy encoder, and several copies of the packet containing the source description are sent over the network to make sure that each receiver gets at least one copy. Receiving more than one copy of these packets is not advantageous, because all the packets contain the same information. In contrast to this setup, one can think of a more reasonable scenario where the packets flooded into the network are not exactly the same; they are designed such that receiving any one of them is sufficient for recovering the source, but receiving more packets improves the quality of the reconstructed signal. The described scenario is referred to as multiple description (MD) coding. The information-theoretic statement of the MD problem and early results on it can be found in [70-73]. Even for the seemingly simple case where there are only two receivers and the source is i.i.d., the characterization of the achievable rate-distortion region is not known in general. For this case, there are two well-known inner bounds due to El Gamal-Cover [74] and Zhang-Berger [75]. There is also a combined region, introduced in [76], which includes both regions, but was recently shown to be no better than the Zhang-Berger region [77]. In any case, a full characterization of the achievable region is not yet known.


Since even for i.i.d. sources the single-letter characterization of the achievable rate-distortion region is not known in general, little work has been done on the MD coding of non-i.i.d. sources. The rate-distortion region of Gaussian processes is derived in [78], and is shown to be achievable using a scheme based on transform lattice quantization. In [79], a multi-letter characterization of the achievable weighted rate-distortion region of discrete stationary ergodic sources is derived. In this chapter, we consider the MD coding of discrete ergodic processes where the distribution of the source is not known to the encoder and decoder. We introduce a universal algorithm which can asymptotically achieve any point in the achievable rate-distortion region. In order to get this result, we start by defining two notions of MD coding, namely, (i) conventional block coding, and (ii) stationary coding. In conventional block-coding MD, there are two rates but three reconstruction processes. In the stationary coding setup, there are three rates and three reconstruction processes. The additional rate corresponds to the conditional entropy rate of the ergodic process reconstructed by the privileged decoder, which receives two descriptions of the source, given the two other ergodic reconstruction processes. We show that these two setups are closely related and, in fact, characterize each other. The benefit of the new definition is that it enables us to devise a universal MD algorithm. The introduced algorithm takes advantage of simulated annealing, as used in Chapter 3 to design an asymptotically optimal universal algorithm for lossy compression of discrete ergodic sources.

Figure 6.1: Example setup


6.1 Simple example

Before formally defining the MD problem, consider the setup shown in Fig. 6.1. This example is meant to provide some insight into the MD problem. Also, the results of this section will be used in the proof of Theorem 11 in Appendix J. Here $S_1 \in \mathcal{S}_1$, $S_2 \in \mathcal{S}_2$ and $S_0 \in \mathcal{S}_0$ denote three correlated discrete random variables, and $(S_1, S_2, S_0) \sim P(s_1, s_2, s_0)$. The encoder's goal is to send $R_1$ bits to Decoder 1 and $R_2$ bits to Decoder 2 such that Decoders 1 and 2 are able to reconstruct $S_1$ and $S_2$, respectively. Moreover, the transmitted bits are required to be such that receiving both of them enables Decoder 0 to reconstruct $S_0$. In all three cases, the probability of error is assumed to be zero. Let $M_1 \in \{1, \ldots, 2^{R_1}\}$ and $M_2 \in \{1, \ldots, 2^{R_2}\}$ denote the messages sent to Decoders 1 and 2, respectively. The question is to find the set of achievable rate pairs $(R_1, R_2)$. The following theorem states some necessary conditions for $(R_1, R_2)$ to be achievable. It is very similar to Theorem 2 of [74], and the two theorems are in fact easily seen to prove each other. The version we give here is most suited for our later needs.

Theorem 10. For any achievable rate pair $(R_1, R_2)$ for the setup shown in Fig. 6.1,
$$
R_1 \geq H(S_1), \qquad R_2 \geq H(S_2), \qquad R_1 + R_2 \geq H(S_1) + H(S_2) + H(S_0 \mid S_1, S_2). \qquad (6.1)
$$

Proof. R1 ≥ H(M1 ) and R2 ≥ H(M2 ) follow from Shannon’s lossless coding Theorem. It is also clear that we should have

R1 + R2 ≥ H(S1 , S2 , S0 ) = H(S1 , S2 ) + H(S0 |S1 , S2 ).

(6.2)

But, perhaps somewhat counterintuitively, (6.2) is just an outer bound, and is not enough. R1 + R2 in fact satisfies the tighter condition stated in (6.1), as can be seen


Figure 6.2: MD coding setup

via the following chain of inequalities:
$$
\begin{aligned}
R_1 + R_2 &\geq H(M_1) + H(M_2)\\
&= H(M_1, S_1) + H(M_2, S_2)\\
&= H(S_1) + H(M_1 \mid S_1) + H(S_2) + H(M_2 \mid S_2)\\
&\geq H(S_1) + H(S_2) + H(M_1 \mid S_1, S_2) + H(M_2 \mid S_1, S_2)\\
&\geq H(S_1) + H(S_2) + H(M_1, M_2 \mid S_1, S_2)\\
&\geq H(S_1) + H(S_2) + H(M_1, M_2, S_0 \mid S_1, S_2)\\
&\geq H(S_1) + H(S_2) + H(S_0 \mid S_1, S_2).
\end{aligned} \qquad (6.3)
$$

6.2 Multiple description problem

Consider the basic setup of the MD problem shown in Fig. 6.2. In this figure, $X^n$ is generated by a stationary ergodic source $\mathbf{X}$.

Remark: In order to see the connection between the example described in Section 6.1 and the MD problem, note that letting $S_i = \hat{X}_i^n$, $i \in \{1, 2\}$, and $S_0 = \hat{X}_0^n$, the MD problem can be described as the problem of describing $(S_1, S_2, S_0)$ to the respective receivers error-free. In other words, for each code design, we have a problem equivalent


to the one described in Section 6.1.

6.2.1 Block coding

The MD coding problem can be described in terms of encoding mappings $(f_1, f_2)$ and decoding mappings $(g_1, g_2, g_0)$ as follows:

1. $f_i : \mathcal{X}^n \to [1 : 2^{nR_i}]$, for $i = 1, 2$,
2. $g_i : [1 : 2^{nR_i}] \to \hat{\mathcal{X}}^n$, for $i = 1, 2$,
3. $g_0 : [1 : 2^{nR_1}] \times [1 : 2^{nR_2}] \to \hat{\mathcal{X}}^n$,
4. $M_i = f_i(X^n)$, for $i = 1, 2$,
5. $\hat{X}_i^n = g_i(M_i)$, for $i = 1, 2$,
6. $\hat{X}_0^n = g_0(M_1, M_2)$.

$(R_1, R_2, D_1, D_2, D_0)$ is said to be achievable for this setup if there exists a sequence of codes $(f_1^{(n)}, f_2^{(n)}, g_1^{(n)}, g_2^{(n)}, g_0^{(n)})$ such that
$$
\limsup_{n}\ \mathrm{E}\, d_n(X^n, \hat{X}_i^n) \leq D_i, \quad \text{for } i = 1, 2, \qquad \limsup_{n}\ \mathrm{E}\, d_n(X^n, \hat{X}_0^n) \leq D_0.
$$
Let $\mathcal{R}_B$ be the set of all $(R_1, R_2, D_1, D_2, D_0)$ that are achievable by block MD coding of source $\mathbf{X}$.

6.2.2 Stationary coding

Define $(R_{11}, R_{22}, R_0, D_1, D_2, D_0)$ to be achievable by stationary coding of source $\mathbf{X}$ if, for any $\epsilon > 0$, there exist processes $\hat{\mathbf{X}}_1^{(\epsilon)}$, $\hat{\mathbf{X}}_2^{(\epsilon)}$ and $\hat{\mathbf{X}}_0^{(\epsilon)}$ such that $(\mathbf{X}, \hat{\mathbf{X}}_1^{(\epsilon)}, \hat{\mathbf{X}}_2^{(\epsilon)}, \hat{\mathbf{X}}_0^{(\epsilon)})$ are jointly stationary ergodic processes, and
$$
\bar{H}(\hat{\mathbf{X}}_1^{(\epsilon)}) \leq R_{11} + \epsilon, \qquad (6.4)
$$
$$
\bar{H}(\hat{\mathbf{X}}_2^{(\epsilon)}) \leq R_{22} + \epsilon, \qquad (6.5)
$$
$$
\bar{H}(\hat{\mathbf{X}}_0^{(\epsilon)} \mid \hat{\mathbf{X}}_1^{(\epsilon)}, \hat{\mathbf{X}}_2^{(\epsilon)}) \leq R_0 + \epsilon, \qquad (6.6)
$$
$$
\mathrm{E}\, d(X_0, \hat{X}_{1,0}^{(\epsilon)}) \leq D_1 + \epsilon, \qquad (6.7)
$$
$$
\mathrm{E}\, d(X_0, \hat{X}_{2,0}^{(\epsilon)}) \leq D_2 + \epsilon, \qquad (6.8)
$$
$$
\mathrm{E}\, d(X_0, \hat{X}_{0,0}^{(\epsilon)}) \leq D_0 + \epsilon. \qquad (6.9)
$$

Let $\mathcal{R}_P$ denote the set of all $(R_{11}, R_{22}, R_0, D_1, D_2, D_0)$ that are achievable by stationary MD coding of source $\mathbf{X}$. The following theorem characterizes $\mathcal{R}_B$ in terms of $\mathcal{R}_P$.

Theorem 11. Let $\mathbf{X}$ be a stationary ergodic source. For any $(R_1, R_2, D_1, D_2, D_0) \in \mathcal{R}_B$, there exists $(R_{11}, R_{22}, R_0, D_1, D_2, D_0) \in \mathcal{R}_P$ such that
$$
R_{11} \leq R_1, \qquad (6.10)
$$
$$
R_{22} \leq R_2, \qquad (6.11)
$$
$$
R_{11} + R_{22} + R_0 \leq R_1 + R_2. \qquad (6.12)
$$
On the other hand, if $(R_{11}, R_{22}, R_0, D_1, D_2, D_0) \in \mathcal{R}_P$, then any point $(R_1, R_2, D_1, D_2, D_0)$ satisfying (6.10)-(6.12) belongs to $\mathcal{R}_B$.

The proof of Theorem 11 is presented in Appendix J.

Remark: The theorem implies that $\mathcal{R}_B$ can be characterized as the set of $(R_1, R_2, D_1, D_2, D_0)$ such that
$$
\bar{H}(\hat{\mathbf{X}}_1) \leq R_1, \qquad \bar{H}(\hat{\mathbf{X}}_2) \leq R_2, \qquad \bar{H}(\hat{\mathbf{X}}_1) + \bar{H}(\hat{\mathbf{X}}_2) + \bar{H}(\hat{\mathbf{X}}_0 \mid \hat{\mathbf{X}}_1, \hat{\mathbf{X}}_2) \leq R_1 + R_2,
$$
for some jointly stationary ergodic processes $(\mathbf{X}, \hat{\mathbf{X}}_1, \hat{\mathbf{X}}_2, \hat{\mathbf{X}}_0)$ which satisfy (6.7)-(6.9).


6.3 Universal multiple description coding

Equipped with the characterization of the achievable region established in the previous section, we now turn to our construction of a universal scheme for this problem. Consider the following MD algorithm for the setup shown in Fig. 6.2. Let
$$
(\hat{x}_1^n, \hat{x}_2^n, \hat{x}_0^n) \triangleq \arg\min_{(y^n, z^n, w^n)} \big[ \gamma_1 H_k(y^n) + \gamma_2 H_k(z^n) + \gamma_0 H_{k, k_1}(w^n \mid y^n, z^n) + \alpha_1 d_n(x^n, y^n) + \alpha_2 d_n(x^n, z^n) + \alpha_0 d_n(x^n, w^n) \big]. \qquad (6.13)
$$

Assume that γi ≥ 0 and αi ≥ 0, for i ∈ {0, 1, 2}, are given Lagrangian coefficients.

Also, assume that k1 ≤ k = o(log n) such that k1 → ∞ as n → ∞.

Theorem 12. Let $\mathbf{X}$ be a stationary ergodic process, and let $(\hat{X}_1^n, \hat{X}_2^n, \hat{X}_0^n)$ denote the output of the above algorithm for the input sequence $X^n$. Then,
$$
\limsup_{n} \Big[ \gamma_1 H_k(\hat{X}_1^n) + \gamma_2 H_k(\hat{X}_2^n) + \gamma_0 H_{k, k_1}(\hat{X}_0^n \mid \hat{X}_1^n, \hat{X}_2^n) + \alpha_1 d_n(X^n, \hat{X}_1^n) + \alpha_2 d_n(X^n, \hat{X}_2^n) + \alpha_0 d_n(X^n, \hat{X}_0^n) \Big] = \min \big[ \gamma_1 R_{11} + \gamma_2 R_{22} + \gamma_0 R_0 + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_0 D_0 \big] \qquad (6.14)
$$
almost surely, where the minimization is over all $(R_{11}, R_{22}, R_0, D_1, D_2, D_0) \in \mathcal{R}_P$.

Proof. For an ergodic source $\mathbf{X}$, let
$$
\mu(\gamma, \alpha) := \min_{(R_{11}, R_{22}, R_0, D_1, D_2, D_0) \in \mathcal{R}_P(\mathbf{X})} \big[ \gamma_1 R_{11} + \gamma_2 R_{22} + \gamma_0 R_0 + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_0 D_0 \big]. \qquad (6.15)
$$
No coding strategy can beat $\mu(\gamma, \alpha)$ on a set of non-zero probability. Therefore, the left-hand side of (6.14) is lower bounded by its right-hand side, and we only need to prove the other direction. By definition, for any $(R_{11}, R_{22}, R_0, D_1, D_2, D_0) \in \mathcal{R}_P(\mathbf{X})$, there exist processes $\hat{\mathbf{X}}_1$, $\hat{\mathbf{X}}_2$ and $\hat{\mathbf{X}}_0$ such that (6.4)-(6.9) are satisfied. On the other hand, if $(\hat{X}_1^n, \hat{X}_2^n, \hat{X}_0^n)$ is generated by jointly ergodic processes $(\hat{\mathbf{X}}_1, \hat{\mathbf{X}}_2, \hat{\mathbf{X}}_0)$,


then for $k = o(\log n)$ and $k_1 = o(\log n)$ such that $k, k_1 \to \infty$ as $n \to \infty$, $H_k(\hat{X}_i^n) \to \bar{H}(\hat{\mathbf{X}}_i)$ for $i \in \{1, 2\}$, and moreover $H_{k, k_1}(\hat{X}_0^n \mid \hat{X}_1^n, \hat{X}_2^n) \to \bar{H}(\hat{\mathbf{X}}_0 \mid \hat{\mathbf{X}}_1, \hat{\mathbf{X}}_2)$. This implies that
$$
\limsup_{n}\ \min\Big[ \gamma_1 H_k(\hat{X}_1^n) + \gamma_2 H_k(\hat{X}_2^n) + \gamma_0 H_{k, k_1}(\hat{X}_0^n \mid \hat{X}_1^n, \hat{X}_2^n) + \alpha_1 d_n(X^n, \hat{X}_1^n) + \alpha_2 d_n(X^n, \hat{X}_2^n) + \alpha_0 d_n(X^n, \hat{X}_0^n) \Big] \qquad (6.16)
$$
is upper-bounded by $\mu(\gamma, \alpha) + \epsilon_n$, where $\epsilon_n \to 0$. Combining these two results yields the desired conclusion.

After finding $(\hat{x}_1^n, \hat{x}_2^n, \hat{x}_0^n)$, the sequences $\hat{x}_1^n$ and $\hat{x}_2^n$ are described to Decoders 1 and 2, respectively, using one of the well-known universal lossless compression algorithms, e.g., the Lempel-Ziv algorithm. Then the encoder forms a description of $\hat{x}_0^n$ conditioned on knowing $\hat{x}_1^n$ and $\hat{x}_2^n$, using the conditional Lempel-Ziv algorithm or some other universal algorithm for lossless coding with side information [80]. A portion $0 \leq \theta \leq 1$ of these bits is included in the message $M_1$ and the rest in the message $M_2$.

To find an approximate solution of (6.13), instead of doing the required exhaustive search directly, as before we can employ simulated annealing. To do this, we assign a cost to each $(y^n, z^n, w^n) \in \hat{\mathcal{X}}^n \times \hat{\mathcal{X}}^n \times \hat{\mathcal{X}}^n$ as follows:
$$
\mathcal{E}(y^n, z^n, w^n) := \gamma_1 H_k(y^n) + \gamma_2 H_k(z^n) + \gamma_0 H_{k, k_1}(w^n \mid y^n, z^n) + \alpha_1 d_n(x^n, y^n) + \alpha_2 d_n(x^n, z^n) + \alpha_0 d_n(x^n, w^n),
$$
and then define the Boltzmann probability distribution at temperature $T = 1/\beta$ as
$$
p_\beta(y^n, z^n, w^n) := \frac{1}{Z}\, e^{-\beta \mathcal{E}(y^n, z^n, w^n)}, \qquad (6.17)
$$

where $Z$ is a normalizing constant. Sampling from this distribution at a very low temperature yields $(\hat{X}_1^n, \hat{X}_2^n, \hat{X}_0^n)$ with energy close to the minimum possible energy, i.e.,
$$
\mathcal{E}(\hat{X}_1^n, \hat{X}_2^n, \hat{X}_0^n) \approx \min_{(y^n, z^n, w^n)} \mathcal{E}(y^n, z^n, w^n). \qquad (6.18)
$$
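To make the cost concrete, the sketch below (ours) assembles $\mathcal{E}(y^n, z^n, w^n)$ from empirical conditional entropies and Hamming distortions. The exact window sizes used for the side-information context of $w_i$ are our assumption for illustration, not a definition taken from the text.

```python
import math
from collections import defaultdict

def _cond_entropy(symbols, contexts):
    """Empirical entropy (bits/symbol) of symbols[i] given contexts[i]."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, ctx in zip(symbols, contexts):
        counts[ctx][s] += 1
    total = len(symbols)
    h = 0.0
    for dist in counts.values():
        n_ctx = sum(dist.values())
        for c in dist.values():
            h -= (c / total) * math.log2(c / n_ctx)
    return h

def md_cost(x, y, z, w, k, k1, gammas, alphas):
    """E(y^n, z^n, w^n): weighted empirical entropies plus weighted Hamming distortions."""
    idx = range(k, len(x) - k)
    Hy = _cond_entropy([y[i] for i in idx], [tuple(y[i-k:i]) for i in idx])
    Hz = _cond_entropy([z[i] for i in idx], [tuple(z[i-k:i]) for i in idx])
    Hw = _cond_entropy([w[i] for i in idx],
                       [(tuple(w[i-k1:i]), tuple(y[i-k:i+k+1]), tuple(z[i-k:i+k+1]))
                        for i in idx])
    d = lambda a, b: sum(ai != bi for ai, bi in zip(a, b)) / len(a)
    g1, g2, g0 = gammas
    a1, a2, a0 = alphas
    return (g1 * Hy + g2 * Hz + g0 * Hw
            + a1 * d(x, y) + a2 * d(x, z) + a0 * d(x, w))
```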

Since sampling from (6.17) at low temperatures is almost as hard as doing the exhaustive search, we turn to simulated annealing (SA), which is a known method for solving discrete optimization problems. The SA procedure works as follows: it first defines a Boltzmann distribution over the optimization space, and then tries to sample from the defined distribution while gradually decreasing the temperature from some high $T$ to zero according to a properly chosen annealing schedule. Given $\mathcal{E}(y^n, z^n, w^n)$, using a similar argument to the one used in Chapter 3, the number of computations required for calculating $\mathcal{E}(y^{i-1} a y_{i+1}^n,\, z^{i-1} b z_{i+1}^n,\, w^{i-1} c w_{i+1}^n)$, when only one of the following is true: $a \neq y_i$, $b \neq z_i$, or $c \neq w_i$, for some $i \in \{1, \ldots, n\}$ and $a, b, c \in \hat{\mathcal{X}}$, is linear in $k$ and $k_1$, and is independent of $n$. Therefore, this energy function lends itself to a heat-bath type algorithm as simply and naturally as the one in the original setting of Chapter 3.

Now consider Algorithm 5, which is based on the Gibbs sampling method for sampling from $p_\beta$, and let $(\hat{X}_{1,r}^n, \hat{X}_{2,r}^n, \hat{X}_{0,r}^n)$ denote its random outcome for the input sequence $X^n$ after $r$ iterations¹, when taking $k_1 = k_{1,n}$, $k = k_n$ and $\beta = \{\beta_t\}_t$ to be deterministic sequences satisfying $k_{1,n} = o(\log n)$, $k_n = o(\log n)$ such that $k, k_1 \to \infty$ as $n \to \infty$, and $\beta_t = \frac{1}{T_0^{(n)}} \log\big(\lfloor \tfrac{t}{n} \rfloor + 1\big)$, for some $T_0^{(n)} > n \max(\Delta_1, \Delta_2, \Delta_0)$, where
$$
\Delta_1 = \max_{\substack{i \in \{1, \ldots, n\},\ y^{i-1} \in \hat{\mathcal{X}}^{i-1},\ y_{i+1}^{n} \in \hat{\mathcal{X}}^{n-i},\\ a, b \in \hat{\mathcal{X}},\ z^n \in \hat{\mathcal{X}}^n,\ w^n \in \hat{\mathcal{X}}^n}} \big| \mathcal{E}(y^{i-1} a y_{i+1}^{n}, z^n, w^n) - \mathcal{E}(y^{i-1} b y_{i+1}^{n}, z^n, w^n) \big|, \qquad (6.19)
$$
$$
\Delta_2 = \max_{\substack{i \in \{1, \ldots, n\},\ z^{i-1} \in \hat{\mathcal{X}}^{i-1},\ z_{i+1}^{n} \in \hat{\mathcal{X}}^{n-i},\\ a, b \in \hat{\mathcal{X}},\ y^n \in \hat{\mathcal{X}}^n,\ w^n \in \hat{\mathcal{X}}^n}} \big| \mathcal{E}(y^n, z^{i-1} a z_{i+1}^{n}, w^n) - \mathcal{E}(y^n, z^{i-1} b z_{i+1}^{n}, w^n) \big|, \qquad (6.20)
$$
$$
\Delta_0 = \max_{\substack{i \in \{1, \ldots, n\},\ w^{i-1} \in \hat{\mathcal{X}}^{i-1},\ w_{i+1}^{n} \in \hat{\mathcal{X}}^{n-i},\\ a, b \in \hat{\mathcal{X}},\ y^n \in \hat{\mathcal{X}}^n,\ z^n \in \hat{\mathcal{X}}^n}} \big| \mathcal{E}(y^n, z^n, w^{i-1} a w_{i+1}^{n}) - \mathcal{E}(y^n, z^n, w^{i-1} b w_{i+1}^{n}) \big|. \qquad (6.21)
$$

¹ Here and throughout it is implicit that the randomness used in the algorithms is independent of the source, and the randomization variables used at each drawing are independent of each other.

As discussed before, the computational complexity of the algorithm at each iteration is independent of $n$ and linear in $k$ and $k_1$. Following exactly the same steps as in the proof of Theorem 2, we can prove the following theorem, which establishes the universal optimality of Algorithm 5.

Theorem 13. For any ergodic process $\mathbf{X}$,
$$
\lim_{n \to \infty} \lim_{r \to \infty} \mathcal{E}(\hat{X}_1^n, \hat{X}_2^n, \hat{X}_0^n) = \min \big[ \gamma_1 R_{11} + \gamma_2 R_{22} + \gamma_0 R_0 + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_0 D_0 \big] \qquad (6.22)
$$
almost surely, where the minimization is over all $(R_{11}, R_{22}, R_0, D_1, D_2, D_0) \in \mathcal{R}_P(\mathbf{X})$.

Simulation results

In this section, we present some results showing the actual implementation of the algorithm described in Section 6.3. The simulated source here is a symmetric binary Markov source with transition probability p = 0.2. The considered block length is

96

CHAPTER 6. MULTIPLE DESCRIPTION CODING

Algorithm 5 Generating the reconstruction sequences Require: xn , k1 , k, {αi }2i=0 , {βi }2i=0 {βt }rt=1 , r Ensure: a reconstruction sequences (ˆ xn1 , x ˆn2 , x ˆn0 ) 1: y n ← xn 2: z n ← xn 3: w n ← xn 4: for t = 1 to r do 5: Draw an integer i ∈ {1, . . . , n} uniformly at random 6: For each y ∈ Xˆ compute q1 (y) = pβt (Yi = y|Y n\i = y n\i , Z n = z n , W n = w n ) 7: Update y n by letting yi = V1 , where V1 ∼ q1 8: For each z ∈ Xˆ compute q2 (z) = pβt (Zi = z|Y n = y n , Z n\i = z n\i , W n = w n ) 9: Update z n by letting zi = V2 , where V2 ∼ q2 10: For each y ∈ Xˆ compute pβt (Yi = y|Y n\i = y n\i ) 11: Update w n by letting wi = V0 , where V0 ∼ q0 12: Update m(y n ), m(z n ) and m(w n |y n , z n ) 13: end for 14: x ˆn ← y n n = 10, 000, and the context sizes are k = 5 and k1 = 1. The annealing schedule was chosen according to T (t) =

1 , 2nt1/10

where t is the iteration number. The number of iterations, r, is equal to 50n. The algorithm with the specified parameters, for β1 = β2 = β0 = α1 = α2 = a0 = 1, achieves the following set of rates and distortions: Hk (y n ) = 0.5503, Hk (z n ) = 0.5586, Hk,k1 (w n |y n , z n ) = 0.0038, dn (xn , y n ) = 0.0505, dn (xn , z n ) = 0.0483, and dn (xn , w n ) = 0.0036. Fig. 6.3 shows how the total cost is reducing in this case, as the number of iterations increases. One interesting thing to note here is that although the sequences y n and z n have almost the same distance from the original sequence xn , they are far from being equal. In fact, dn (y n , z n ) = 0.0966, which, given dn (xn , y n ) = 0.0505 and dn (xn , z n ) = 0.0483, suggests that they are almost maximally distant. As another example, consider the case where n = 5, 000 and α1 = α2 = 2. The

6.4. SIMULATION RESULTS

97

rest of the parameters are left unchanged. The achieved point in this case is going to be Hk (y n ) = 0.6091, Hk (z n ) = 0.5951, Hk,k1 (w n |y n , z n ) = 0, dn (xn , y n) = 0.0200, dn (xn , z n ) = 0.0240, and dn (xn , w n ) = 0.0010. Here, Hk,k1 (w n |y n , z n ) = 0 implies that wi is a deterministic function of its context,

i+k1 i+k1 i−1 (wi−k , yi−k , zi−k1 ). This of course does not mean that no additional rate is required for 1

describing w n when the decoder already knows y n and z n , because this deterministic mapping itself is not known to the decoder beforehand. Here again y n and z n are almost maximally distant because dn (y n , z n ) = 0.0436. Note that the fundamental performance limits are unknown even for memoryless sources and, a fortiori, for the Markov source in our experiment. Thus the performance of our algorithm cannot be compared to the corresponding optimum performance. The results of the preceding section, however, imply that our algorithm attains that performance in the limit of many iterations and large block length. Thus, the performance attained by our algorithm, can alternatively be viewed as approximating the unknown optimum.

98

β1 Hk (yn ) + β2 Hk (zn ) + β0 Hk,k1 (wn |yn , zn ) + α1 dn (xn , yn ) + α2 dn (xn , z n ) + α0 dn (xn , wn )

CHAPTER 6. MULTIPLE DESCRIPTION CODING

1.45

1.4

1.35

1.3

1.25

0

0.2

0.4

0.6

0.8 1 1.2 t (the # of iterations)

1.4

1.6

1.8

2 5

x 10

Figure 6.3: Reduction in the cost. At the end of the process, the final achived point is: (Hk (y n ), Hk (z n ), Hk,k1 (w n |y n , z n ), dn (xn , y n ), dn (xn , z n ), dn (xn , w n )) = (0.5503, 0.5586, 0.0038, 0.0505, 0.0483, 0.0036)

Chapter 7 Conclusions and future directions In Chapter 3, a new universal lossy source coding algorithm based on simulated annealing Gibbs sampling is proposed. The proposed algorithm in the limit of infinite number of iterations can get arbitrarily close to the rate-distortion curve of any stationary ergodic source. It gives rise to a new paradigm for universal lossy source coding which its effectiveness is justified through simulations. For coding a source sequence xn , the algorithm starts from some initial reconstruction block, and updates one of its coordinates at each iteration. The algorithm can be viewed as a process of systematically introducing ‘noise’ into the original source block, but in a biased direction that results in a decrease of its description complexity. We further developed the application of this new method to universal denoising. In practice, the proposed algorithms 1 and 3, in their present form, are only applicable to the cases where the size of the reconstruction alphabet, |Xˆ |, is small. The reason is twofold: first, for larger alphabet sizes the contexts will be too sparse to give a true estimate of the empirical entropy of the reconstruction block, even for small values of k. Second, the size of the count matrix m grows exponentially with |Xˆ | which makes storing it for large values of |Xˆ | impractical. Despite this

fact, there are practical applications where this constraint is satisfied. An example

is lossy compression of binary images, like the one presented in Section 3.4. Another application for lossy compression of binary data is shown in [81] where one needs to compress a stream of 0 and 1 bits with some distortion. 99

100

CHAPTER 7. CONCLUSIONS AND FUTURE DIRECTIONS

The convergence rate of the new algorithms and the effect of different parameters on it is a topic for further study. As an example, one might wonder how the convergence rate of the algorithm is affected by choosing an initial point other than the source output block itself. Although our theoretical results on universal asymptotic optimality remain intact for any initial starting point, in practice the choice of the starting point might significantly impact the number of iterations required. In the non-universal setup, where the optimal achievable rate-distortion tradeoff is known in advance, this extra information can be used as a stopping criterion for the algorithm. For example, we can set it to stop after reaching optimum performance to within some fixed distance. When the optimal coefficients are known, the Viterbi-based algorithm proposed in Chapter 4 is a very elegant compression algorithm, but the daunting part is finding those parameters. For the low-distortion region, we can have very good approximation of them. One heuristic for estimating them in the other regions is to start from the low-distortion region and gradually increase the desired distortion. Establishing the optimality of this procedure, as our experimental results suggest, is another interesting line of research for future work. In Chapter 5 WZ coding of a source with unknown statistics was studied; A new WZ

coding algorithm, WZ-DUDE, was presented and its asymptotic optimality was

established. In order to optimize the scheme one would list all possible mappings that have a certain property and look for the one that gives minimum expected loss. However, we saw that even a simple encoding mapping, namely an off-the-shelf lossy compressor, achieves considerable improvement compared to the case where either the FB sequence or the noisy signal are not present at the decoder. The original DUDE is tailored to discrete-alphabet sources going through a DMC, and making it applicable to continuous alphabet sources entails more than a trivial extension, which has been accomplished in [82] and [83]. As mentioned in Section III, since our fbDUDE decoder is a special case of the original DUDE algorithm, one would expect that by following the same methods used in [82],[83], it might be possible to devise a decoder which works on continuous data. The non-trivial part is finding a proper encoder. In this case it is not possible to list all SB encoders of some finite block

101

length, because there are an infinite number of them even for window length of one. One simple solution is to look into all mappings which map to a quantized version of the source alphabet. How to choose this quantized alphabet, and whether this would result in a scheme that asymptotically achieves the performance bounds is a question that requires further study. Finding a sequential version of the described scheme, where the decoder is subject to a delay constraint is another interesting open avenue. Adapting our WZ-DUDE algorithm to perform effectively with non-stationary data is another open avenuee. For example, often real data is more accurately modeled as a piecewise stationary source. In the recent work [84], the sDUDE denoising algorithm is described which, unlike DUDE, tries to compete with a genie-aided SB denoiser that can switch between SB denoisers up to m times, where m is sub-linear in the block length n. When the clean data is emitted by a piecewise stationary process, the sDUDE algorithm achieves the optimum distribution-dependent performance.

Appendix A Proof of Theorem 1 First we show that for any n ∈ N, 

 1 n n ˆn ˆ E ℓLZ (X ) + αd(X , X ) ≥ min [R(X, D) + αD] . D≥0 n

(A.1)

(n) ˆ n ). For any fixed n, the expected average loss of our scheme would be D0 , E d(X n , X

For this expected average distortion, the rate of our code can not be less than (n)

(n)

R(X, D0 ) which is the minimum required rate for achieving distortion D0 . Therefore, 

 1 (n) (n) (n) n ˆ E ℓLZ (X ) + αD0 ≥ R(X, D0 ) + αD0 n ≥ min [R(X, D) + αD] . D≥0

(A.2)

Now letting n go to infinity we get 

 1 n n ˆn ˆ lim inf E ℓLZ (X ) + αd(X , X ) ≥ min [R(X, D) + αD] . n→∞ D≥0 n

(A.3)

On the other hand, in order to obtain the upper bound, we first split the cost 102

103

function into two terms as follows   1 n n n ˆ ) + αd(X , X ˆ ) E ℓLZ (X n   1 n n n n ˆn ˆ ˆ ˆ = E ℓLZ (X ) − Hkn (X ) + Hkn (X ) + αd(X , X ) , n   h i 1 n n ˆ ˆ ˆ n ) + αd(X n , X ˆ n) . = E ℓLZ (X ) − Hkn (X ) + E Hkn (X n

(A.4) (A.5) (A.6)

From [85], for kn = o(log n) and any given ǫ > 0, there exists Nǫ ∈ N such that for any individual infinite-length sequence x ˆ = (ˆ x1 , xˆ2 , . . .) and any n ≥ Nǫ  Therefore, for n ≥ Nǫ

 1 n n ℓLZ (ˆ x ) − Hkn (ˆ x ) ≤ ǫ. n



 1 n n ˆ ˆ E ℓLZ (X ) − Hkn (X ) ≤ ǫ. n

(A.7)

(A.8)

Consider an arbitrary point (R(X, D), D) on the rate-distortion curve of source ˜ such that (X, X) ˜ are X. Then we know that for any δ > 0 there exists a process X jointly stationary ergodic, and moreover ˜ ≤ R(X, D), 1. H(X) ˜ 0 ) ≤ D + δ, 2. E d(X0 , X −1 ˜ = H(X ˜ 0 |X ˜ −∞ ˜ [86]. Now since for where H(X) ) is the entropy rate of process X ˆ n is chosen to minimize Hk (X ˆ n) + each source block X n , the reconstruction block X ˆ n ), we have αd(X n , X

ˆ n ) + αd(X n , X ˆ n ) ≤ H k n (X ˜ n ) + αd(X n , X ˜ n ). H k n (X

(A.9)

For a fixed k, from the definition of the k th order entropy, we have   1 X T n n n ˜ ˜ ˜ 1 m·,u (X )H m·,u (X ) , H k (X ) = n ˆk u∈X

(A.10)

104

APPENDIX A. PROOF OF THEOREM 1

where n

X 1 ˜ n) = 1 muk+1 ,u(X 1 ˜i k+1 n n i=1 Xi−k =u   n→∞ ˜ 0 = uk+1 , −→ P X −k

(A.11) w.p.1.

(A.12)

˜ n ) converges to Therefore, combining (A.10) and (A.12), as n goes to infinity, Hk (X ˜ 0 |X ˜ −1 ) with probability one. It follows from the monotonicity of Hk (ˆ H(X xn ) in k, −k

(A.9), and the convergence we just established that for any xˆn and any k, ˆ n ) + αd(X n , X ˆ n ) ≤ H(X ˜ 0 |X ˜ −1 ) + ǫ + αd(X n , X ˜ n ), H k n (X −k

eventually a.s. (A.13)

On the other hand n

˜ n , X n) = d(X

1X ˜ i ) n→∞ d(Xi, X −→ E d(X˜0 , X0 ) ≤ D + δ. n i=1

(A.14)

Combining (A.7) and (A.13) yields 

 1 n n n ˆ ) + αd(X , X ˆ ) ≤ H(X ˜ 0 |X ˜ −1) + 2ǫ + α(D + δ). lim sup E ℓLZ (X −k n n→∞

(A.15)

The arbitrariness of k, ǫ and δ implies  1 n n ˆn ˆ lim sup E ℓLZ (X ) + αd(X , X ) ≤ R(X, D) + αD, n n→∞ 

(A.16)

for any D ≥ 0. Since D is also arbitrary, it follows that 

 1 n n ˆn ˆ lim sup E ℓLZ (X ) + αd(X , X ) ≤ min[R(X, D) + αD], D≥0 n n→∞

(A.17)

Finally, combining (A.3), and (A.17) we get the desired result:  1 n n n ˆ ) + αd(X , X ˆ ) = min[R(X, D) + αD]. lim E ℓLZ (X n→∞ D≥0 n 

(A.18)

Appendix B Proof of Theorem 2 Our proof follows the results presented in [87]. Throughout this section, An = Xˆ N

denotes the state space of our Markov chain (MC), P defines a stochastic transition matrix from An to itself, and π defines a distribution on An satisfying πP = π.

Moreover,



   P=  

p1 p2 .. . pN



   ,  

(B.1)

where N := |X |n is the size of the state space. From this definition, each pi is a row P N vector of length N in R+ such that pij = 1. j

Definition 1 (Ergodic coefficient). Dobrushin’s ergodic coefficient of P , δ(P), is defined to be δ(P) = max kpi − pj kTV .

(B.2)

0 ≤ δ(P) ≤ 1.

(B.3)

1≤i,j≤N

From the definition,

105

106

APPENDIX B. PROOF OF THEOREM 2

Moreover, since N

kpi − pj kTV

1X = |pik − pjk | 2 k=1 N

=

 1 X (pik − pjk )1pik ≥pjk + (pjk − pik )1pjk >pik 2 k=1

N N X 1X 1 = (1 − pik 1pik
N

k=1

k=1

X 1X 1 + (1 − pjk 1pjk ≤pik ) − pik 1pik
=1− =1−

N X k=1 N X



pik 1pik


min(pik , pjk ),

(B.4)

k=1

the ergodic coefficient can alternatively be defined as δ(P) = 1 − min

1≤i,j≤N

N X

min(pik , pjk ).

(B.5)

k=1

The following theorem states the connection between the ergodic coefficient of a stochastic matrix and its convergence rate to the stationary distribution. Theorem 14 (Convergence rate in terms of Dobrushin’s coefficient). Let µ and ν be two probability distributions on An . Then kµPt − νPt k1 ≤ kµ − νk1 δ(P)t

(B.6)

Corollary 1. By substituting ν = π in (B.6), we get kµPt − πk1 ≤ kµ − πk1 δ(P)t .

(B.7)

Thus far, we talked about homogenous MCs with stationary transition matrix.

107

However, in simulated annealing we deal with a nonhomogeneous MC. The transition probabilities of a nonhomogeneous MC depend on time and vary as time proceeds. Let P(t) denote the transition Matrix of the MC at time t, and for 0 ≤ n1 < n2 ∈ N,

define

P

(n1 ,n2 )

:=

nY 2 −1

P(t) .

(B.8)

t=n1

By this definition, if at time n1 the distribution of the MC on the state space An

is µn1 , at time n2 , the distribution evolves to µn2 = µn1 P(n1 ,n2 ) . The following two definitions characterize the steady state behavior of a nonhomogeneous MC.

Definition 2 (Weak Ergodicity). A nonhomogeneous MC is called weakly ergodic if for any distributions µ and ν over An , and any n1 ∈ N, lim sup kµP(n1 ,n2 ) − νP(n1 ,n2 ) k1 = 0.

(B.9)

n2 →∞

Definition 3 (Strong Ergodicity). A nonhomogeneous Markov chain is called strongly ergodic if there exists a distribution over the state space An such that for any distributions µ and n1 ∈ N,

lim sup kµP(n1 ,n2 ) − πk1 = 0.

(B.10)

n2 →∞

Any strongly ergodic MC is also weakly ergodic, because by triangle inequality kµP(n1 ,n2 ) − νP(n1 ,n2 ) k1 ≤ kµP(n1,n2 ) − πk1 + kνP(n1 ,n2 ) − πk1 .

(B.11)

The following theorem states a necessary and sufficient condition for weak ergodicity of a nonhomogeneous MC.

Theorem 15 (Block criterion for weak ergodicity). A MC is weakly ergodic iff there

108

APPENDIX B. PROOF OF THEOREM 2

exists a sequence of integers 0 ≤ n1 < n2 < . . ., such that ∞ X i=1

(1 − δ(P(ni ,ni+1) )) = ∞.

(B.12)

Theorem 16 (Sufficient condition for strong ergodicity). Let the MC be weakly ergodic. Assume that there exists a sequence of probability distributions, {π (i) }∞ i=1 , on

An such that

π (i) P(i) = π (i) .

(B.13)

Then the MC is strongly ergodic, if ∞ X i=1

kπ (i) − π (i+1) k1 < ∞.

(B.14)

After stating all the required definitions and theorems, finally we get back to our main goal which was to prove that by the mentioned choice of the {βt } sequence,

Algorithm 1 converges to the optimal solution asymptotically as block length goes to infinity. Here P(j) , the transition matrix of the MC at the j th iteration, depends on

βj . Using Theorem 15, first we prove that the MC is weakly ergodic. Lemma 3. The ergodic coefficient of P(jn,(j+1)n) , for any j ≥ 0 is upper-bounded as

follows

¯

δ(P(jn,(j+1)n) ) ≤ 1 − e−n(βj ∆+ǫn ) ,

(B.15)

∆ = max δi ,

(B.16)

where

i∈{1,...,n}

for δi = max{|E(ui−1auni+1 ) − E(ui−1buni+1 )|; ui−1 ∈ Xˆ i−1 , uni+1 ∈ Xˆ n−i , a, b ∈ Xˆ }.

109

and ǫn = log n −

log(n!) . n

(B.17)

Proof. Let y1n and y2n be two arbitrary sequences in Xˆ n . Since the Hamming distance

between these two sequence is at most n, starting from any sequence y1n , after at most n steps of the Gibbs sampler, it is possible to get to any other sequence y2n . On the other hand at each step the transition probabilities of jumping from one state to a neighboring state, i.e., P(t) (ui−1 buni+1 |ui−1 auni+1 ) =

exp(−βt E(ui−1auni+1 )) P , n exp(−βt E(ui−1buni+1 ))

(B.18)

b∈Xˆ

can be upper bounded as follows. Dividing both the numerator and denominator of (B.18) by exp(−βt Emin,i (ui−1 , uni+1)), where Emin,i (ui−1, uni+1 ) = min E(ui−1buni+1 ), we b∈Xˆ

get

P(t) (ui−1buni+1 |ui−1auni+1 ) =

exp(−βt (E(ui−1auni+1 ) − Emin,i (ui−1 , uni+1 ))) P , n exp(−βt (E(ui−1 buni+1 ) − Emin,i (ui−1, uni+1 ))) b∈Xˆ

(B.19)

−βt ∆



e

n|Xˆ |

.

(B.20)

Therefore, min

y1n ,y2n ∈Xˆn

where β¯j =

1 n

P

(jn,(j+1)n)

jn+n−1 P t=jn

βt .

(y1n , y2n )

jn+n−1 ¯ e−n(βj ∆+ǫn ) n! Y e−βt ∆ ≥ n = , n t=jn |Xˆ | |Xˆ |n

(B.21)

110

APPENDIX B. PROOF OF THEOREM 2

Using the alternative definition of the ergodic coefficient given in (B.5), δ(P(jn,(j+1)n) ) = 1 −

min

y1n ,y2n ∈Xˆ n

X

min(P(jn,(j+1)n) (y1n , z n ), P(jn,(j+1)n) (y2n , z n ))

z n ∈Xˆ n

1 −n(β¯j ∆+ǫn) e ≤ 1 − |Xˆ |n |Xˆ |n

(B.22)

¯

= 1 − e−n(βj ∆+ǫn ) .

Corollary 2. Let βt =

log(⌊ nt ⌋+1) (n)

T0

(n)

, where T0

(B.23)

= cn∆, for some c > 1, and ∆ is defined

in (B.16), in Algorithm 1. Then the generated MC is weakly ergodic. Proof. For proving weak ergodicity, we use the block criterion stated in Theorem 15. Let nj = jn, and note that β¯j = log(j+1) in this case. Observe that T0 ∞ X j=0

(1 − δ(P

(nj ,nj+1 )

)) = ≥ =

∞ X

(1 − δ(P(jn,(j+1)n) ))

j=1 ∞ X

j=0 ∞ X

¯

e−n(βj ∆+ǫn ) −n(∆

e

log(j+1) +ǫn ) T0

(B.24) (B.25)

j=0

−nǫn

=e

∞ X 1 = ∞. 1/c j j=1

(B.26)

This yields the weak ergodicity of the MC defined by the simulated annealing and Gibbs sampler.

Now we are ready to prove the result stated in Theorem 2. Using Theorem 16, we prove that the MC is in fact strongly ergodic and the eventual steady state distribution of the MC as the number of iterations converge to infinity is a uniform distribution over the sequences that minimize the energy function.

111

At each time t, the distribution defined as π (t) (y n ) =

1 −βt E(yn ) e Zβ t

(B.27)

satisfies π (t) P(t) = π (t) . Therefore, if we prove that ∞ X t=1

kπ (t) − π (t+1) k1 < ∞,

(B.28)

by Theorem 16, the MC is also strongly ergodic. But it is easy to show that π (t) converges to a uniforms distribution over the set of sequences that minimize the energy function, i.e., lim π (t) (y n ) =

t→∞

(

0; 1 ; |H|

yn ∈ / H,

y n ∈ H,

(B.29)

where H , {y n : E(y n) = min E(z n )}. z n ∈Xˆ n

ˆ n denote the output of Algorithm 1 after t iterations, then Hence, if we let X t ˆ n ) = min E(y n ), lim E(X t

t→∞

y n ∈Xˆ n

(B.30)

which combined with Theorem 1 yields the desired result.

In order to prove (B.28), we prove that π (t) (y n ) is increasing on H, and eventually

112

APPENDIX B. PROOF OF THEOREM 2

decreasing on Hc , hence there exists t0 such that for any t1 > t0 , t1 X t=t0



(t1 )

−π

(t+1)

k1 =

t1 X 1 X t=t0

2

y n ∈Xˆn

|π (t) (y n ) − π (t+1) (y n )|,

(B.31)

t1 1 XX = (π (t+1) (y n ) − π (t) (y n )) 2 yn ∈H t=t 0

+

1 2

t1 X

X

y n ∈Xˆ n \H t=t0

(π (t) (y n ) − π (t+1) (y n )),

1 X (t1 +1) n = (π (y ) − π (t0 ) (y n )) 2 yn ∈H 1 X (π (t0 ) (y n ) − π (t1 +1) (y n )) + 2 n n

(B.32)

(B.33)

y ∈Xˆ \H

1 < (1)(|Xˆ n|). 2

(B.34)

Since the right hand side of (B.34) of does not depend on t1 , ∞ X t=0

kπ (t) − π (t+1) k1 < ∞.

(B.35)

Finally, in order to prove that π (t) (y n ) is increasing for y n ∈ H, note that n

e−βt E(y ) π (y ) = P −βt E(z n ) e (t)

n

z n ∈Xˆ n

= P

1

e−βt (E(z n )−E(yn ))

.

(B.36)

z n ∈Xˆ n

Since for y n ∈ H and any z n ∈ Xˆ n , E(z n ) − E(y n) ≥ 0, if t1 < t2 , X

z n ∈Xˆ n

e−βt1 (E(z

n )−E(y n ))

>

X

z n ∈Xˆ n

e−βt2 (E(z

n )−E(y n ))

,

113

and hence π (t1 ) (y n ) < π (t2 ) (y n ). On the other hand, if y n ∈ / H, then n

e−βt E(y ) π (t) (y n ) = P −βt E(z n ) e z n ∈Xˆ n

=

P

z n :E(z n )≥E(y n )

e−βt (E(z n )−E(yn ))

1 +

P

eβt (E(yn )−E(z n ))

.

(B.37)

z n :E(z n )
For large β the denominator of (B.37) is dominated by the second term which is increasing in βt and therefore π (t) (y n ) will be decreasing in t. This concludes the proof.

Appendix C Proof of Theorem 3 First we need to prove that a result similar to Theorem 1 holds for SB codes. I.e., (n)

(n)

we need to prove that for given sequences {kf }n and {kn }n such that lim kf = ∞ n→∞

and kn = o(log n), finding a sequence of SB codes according to (n)

(n)

fˆKf = arg min E(f Kf ),

(C.1)

(n) K f f

(n)

(n)

where E(f Kf ) is defined in (3.15) and Kf

(n)

= 22kf

+1

, results in a sequence of

asymptotically optimal codes for any stationary ergodic source X at slope α. In other words,  1 n n n ˆ ) + αdn (X ˆ , Y ) = min [R(X, D) + αD] , lim E ℓLZ (X n→∞ D≥0 n 

a.s.

(C.2)

(n)

ˆn = X ˆ n [X n , fˆKf ]. After proving this, the rest of the proof follows from the where X proof of Theorem 2 by just redefining δi as   (n) (n) (n) (n) Kf Kf Kf Kf −i i−1 i−1 i−1 i−1 δi = max E(f afi+1 ) − E(f afi+1 ) ; f ∈ Y , fi+1 ∈ Y , a, b ∈ Y . For establishing the equality stated in (C.2), like the proof of Theorem 1, we prove

consistent lower and upper bounds which in the limit yield the desired result. The 114

115

lower bound,  1 n n ˆn ˆ lim inf E ℓLZ (X ) + αd(X , X ) ≥ min [R(X, D) + αD] , n→∞ D≥0 n 

(C.3)

follows from an argument similar to the one given in the Appendix A. For proving the upper bound, we split the cost into two terms, as done in the equation (A.6). The convergence to zero of the first term again follows from a similar argument. The only difference is in upper bounding the second term. Since, asymptotically, for any stationary ergodic process X, SB codes have the same rate-distortion performance as block codes, for a point (R(X, D), D) on the ǫ

rate-distortion curve of the source, and any ǫ > 0, there exits a SB code f 2κf +1 of ˜ some order κǫ such that coding the process X by this SB code results in a process X f

which satisfies ¯ X) ˜ ≤ R(X, D), 1. H( ˜ 0 ) ≤ D + ǫ. 2. E d(X0 , X On the other hand, for a fixed n, E(f Kf ) is monotonically decreasing in Kf . Therefore, (n)

for any process X and any δ > 0, there exists nδ such that for n > nδ and kf ≥ κǫf h i ˆ n ) + αdn (X n , X ˆ n ) ≤ R(X, D) + α(D + ǫ) + δ, lim sup Hkn (X

w.p. 1.

(C.4)

n→∞

Combining (C.3) and (C.4), plus the arbitrariness of ǫ, δ and D yield the desired result.

Appendix D Stationarity condition Assume that we are given a |Xˆ | × |Xˆ |k matrix m with all elements positive and

summing up to one. The question is under what condition(s) this matrix can be the (k + 1)th order stationary distribution of a stationary process. For the ease of notations, instead of matrix m consider p(xk+1 ) as a distribution defined on Xˆ k+1 . We

show that a necessary and sufficient condition is the so-called stationarity condition which is X

β∈Xˆ

p(βxk ) =

X

p(xk β).

(D.1)

β∈Xˆ

- Necessity: The necessity of (D.1) is just a direct result of the definition of stationarity of a process. If p(xk+1 ) is to represent the (k + 1)th order marginal distribution of a stationary process, then it should be consistent with the k th order marginal distribution as satisfy (D.1).

- Sufficiency: In order to prove the sufficiency, we assume that (D.1) holds, and build a stationary process with the (k + 1)th order marginal distribution of p(xk+1 ). Consider a k th order Markov chain with transition probabilities of q(xk+1 |xk ) = 116

p(xk+1 ) . p(xk )

(D.2)

117

Note that p(xk ) is well-defined by (D.1). Moreover, again from (D.1), p(xk+1 ) is the stationary distribution of the defined Markov chain, because X x1

q(xk+1 |xk )p(xk ) =

X

p(xk+1 ) = p(xk+1 2 ).

(D.3)

x1

Therefore we have found a stationary process that has the desired marginal distribution. Finally we show that if m is the count matrix of a sequence y n , then there exist a stationary process with the marginal distribution coinciding with m. From what we just proved, we only need to show that (D.1) holds, i.e., X β

mβ,b =

X

mbk ,[β,b1 ...,bk−1 ] .

β

i+k But this is true because both sides of (D.4) are equal to |{i : yi+1 = b}|/n.

(D.4)

Appendix E Concavity of H(m) For simplicity assume that X = Xˆ = {0, 1}. By definition H(m) =

X

(m0,b + m1,b )h(

b∈{0,1}k

m0,b ), m0,b + m1,b

(E.1)

where h(α) = α log α + α ¯ log α ¯ and α ¯ = 1 −α. We need to show that for any θ ∈ [0, 1],

and empirical count matrices m(1) and m(2) ,

(2) ¯ ¯ (2) ). θH(m(1) ) + θH(m ) ≤ H(θm(1) + θm

(E.2)

From the concavity of h, it follows that (1)

(1) θ(m0,b

+

(1) m1,b )h(

(2)

m0,b

m0,b ¯ (2) + m(2) )h( ) + θ(m ) 0,b 2,b (1) (1) (2) (2) m0,b + m1,b m0,b + m2,b

(1) (1) ¯ (2) + m(2) )) = (θ(m0,b + m1,b ) + θ(m 0,b 1,b (i)

(i)

(i)

θi (m0,b + m1,b ) m0,b h( (i) ) (1) (1) (2) (2) (i) ¯ (θ(m + m ) + θ(m + m )) m + m 0,b 1,b 0,b 1,b 0,b 1,b i∈{1,2} X



(1) (θ(m0,b

¯ θm0,b + θm 0,b ), (1) (1) ¯ (2) + m(2) ) θ(m0,b + m1,b ) + θ(m 0,b 1,b (1)

+

(1) m1,b )

¯ (2) + m(2) ))h( + θ(m 0,b 1,b

118

(2)

(E.3)

119

where θ1 = 1 − θ2 = θ. Now summing up both sides of (E.3) over all b ∈ Xˆ k , yields

the desired result.

Appendix F Proof of Theorem 4 Proof. By rearranging the terms, the cost that is to be minimized in (P1) can alternatively be represented as follows n−k

1X d(xi , yi), Hk (y ) + αdn (x , y ) = Hk (m(y )) + α n i=1 n

n

n

n

n X 1 X d(xi , yi) = Hk (m(y )) + α 1xi =a,yi =b , n n

i=k+1

= Hk (m(y n )) + α

a∈X ,b∈Xˆ

n 1 X X d(a, b)1xi =a,yi =b , n i=k+1 ˆ a∈X ,b∈X

n

= Hk (m(y )) + α

X

a∈X ,b∈Xˆ

= Hk (m(y n )) + α

X

n 1 X d(a, b) 1xi =a,yi =b , n i=k+1

(1)

d(a, b)ˆ p[xn ,yn ] (a, b),

a∈X ,b∈Xˆ

= Hpˆ(k+1) (Yk+1 |Y k ) + α Epˆ(1) [y n ]

[xn ,y n ]

(·,·)

d(X1 , Y1 ).

(F.1)

This new representation reveals the close connection between (P1) and (4.15). Although the costs we are trying to minimize in the two problems are equal, there is a fundamental difference between them: (P1) is a discrete optimization problem, while the optimization space in (4.15) is continuous. 120

121

Let En∗ and Pn∗ be the sets of minimizers of (P1), and joint empirical distribu(ℓ)

tions of order ℓ, pˆ[xn ,yn ] (·, ·), induced by them respectively. Also let Sn∗ be the set of marginalized distributions of order k in P ∗ with respect to Y . Finally, let C ∗ and Cˆ ∗ n

n

n

be the minimum values achieved by (P1) and (4.16) respectively.

In order to make the proof more tractable, we break it down into several steps as follows. (ℓ)

I. Let y n ∈ En∗ , and pˆ[xn ,yn ] (·, ·) be the induced joint empirical distribution. It is (ℓ)

easy to check that pˆ[xn ,yn ] (·, ·) satisfies all the constraints mentioned in (4.16). The only condition that might need some thought is the stationarity constraint, which also holds because X

1  ℓ−1 i−1 ℓ−1 i : xi−1 = a , y = b , i−ℓ+1 i−ℓ+1 n−ℓ X (ℓ) = pˆ[xn ,yn ] (aℓ aℓ−1 , bk bℓ−1 ). (F.2)

(ℓ)

pˆ[xn ,yn ] (aℓ , bℓ ) =

aℓ ∈X ,bℓ ∈Xˆ

aℓ ∈X ,bℓ ∈Xˆ

Therefore, since Cˆn∗ is the minimum of (4.16), we have Cˆn∗ ≤ Hk (m(y n )) + α Epˆ(1)

[xn ,y n ]

(·,·)

(Xk+1, Yk+1 )

= Hk (m(y n )) + αdn (xn , y n) = Cn∗ .

(F.3)

II. Let p∗(ℓ) ∈ Pˆn∗ . Based on this joint probability distribution and xn , we construct ˜ n as follows: divide xn into r = ⌈ n ⌉ consecutive a reconstruction sequence X ℓ

blocks:

(r−1)ℓ

n xℓ , x2ℓ ℓ+1 , . . . , x(r−2)ℓ+1 , x(r−1)ℓ+1 ,

where except for possibly the last block, the other blocks have length ℓ. The new sequence is constructed as follows ˜ ℓ, X ˜ 2ℓ , . . . , X ˜ (r−1)ℓ , X ˜n X ℓ+1 (r−1)ℓ+1 , (r−2)ℓ+1

122

APPENDIX F. PROOF OF THEOREM 4

˜ iℓ where for i = 1, . . . , r − 1, X (i−1)ℓ+1 is a sample from the conditional distribution ∗(ℓ) ˆ n n ˆ ℓ |X ℓ = xiℓ ˜n p∗(ℓ) (X (X(r−1)ℓ+1 |X(r−1)ℓ+1 = xn(r−1)ℓ+1 ). (i−1)ℓ+1 ), and X(r−1)ℓ+1 ∼ p

∗(k+1) III. Assume that x = {xi }∞ i=1 is a given individual sequence. For each n, let p be the (k + 1)th order marginalized version of the solution of (4.16) on Xˆ (k+1) .

˜ n be the constructed as described in the previous item, and pˆ(k+1) Moreover, let X ˜ n] [X th n ˜ . We now prove that be the (k + 1) order empirical distribution induced by X (k+1)

kp∗(k+1) − pˆ[X˜ n ] k1 → 0, a.s.,

(F.4)

˜ n. where the randomization in (F.4) is only in the generation of X Remark: Since p∗(ℓ) satisfies stationarity condition, its (k +1)th order marginalized distribution, p∗(k+1) , is well-defined and can be computed with respect to any of the (k + 1) consecutive positions in 1, . . . , ℓ. In other words for ak+1 ∈ Xˆ k+1 , p∗(k+1) (ak+1 ) =

X

p∗(k+1) (bj ak+1 bℓ−k−1 j+1 ),

(F.5)

bℓ−k−1 ∈Xˆ n

for any j ∈ {0, . . . , ℓ − k − 1}, and the result does not depend on the choice of j.

(k+1)

In order to show that the difference between pˆ[X˜ n ] (ak+1 ) and p∗(k+1) (ak+1 ) is (k+1)

going to zero almost surely, we decompose pˆ[X˜ n ] (ak+1 ) into the average of ℓ − k

terms each of which is converging to p∗(k+1) (ak+1 ). Then using the union bound (k+1)

we get the desired result which is the convergence of pˆ[X˜ n ] (ak+1 ) to p∗(k+1) (ak+1 ).

123

For ak+1 ∈ Xˆ k+1 , (k+1) k+1 ∗(k+1) k+1 (a ) pˆ[X˜ n ] (a ) − p n X 1 ∗(k+1) k+1 1X˜i−k (a ) = i =ak+1 − p n − k − 1 i=k+1 ℓ−k−1 r−1 X X 1 ∗(k+1) k+1 = 1X˜iℓ−j−k iℓ−j + δ − p (a ) , ℓ =ak+1 n − k − 1 j=0 i=1 # " r−1 ℓ−k−1 X X r 1 ∗(k+1) k+1 1 ˜ iℓ−j = + δℓ − p (a ) , k+1 n − k − 1 r i=1 Xiℓ−j−k =a j=0 " # ℓ−k−1 r−1 X X 1 1 ′ ∗(k+1) k+1 = 1X˜iℓ−j−k iℓ−j + δ − p (a ) , n,ℓ =ak+1 ℓ − k − 1 r j=0

(F.6)

i=1

′ where δℓ accounts for the edge effects between the blocks, and δn,ℓ is defined such ′ that δn,ℓ − δℓ takes care of the effect of replacing

0 ≤ δℓ ≤

4k , ℓ

r n−k−1

with

1 . ℓ−k−1

Therefore,

′ δℓ → 0 as ℓ → ∞, and finally |δn,ℓ − δℓ | = o(k/ℓ).

The interesting point about the new representation is that it decomposes a sequence of correlated random variables, {1X˜ i

k+1 i−k =a

}ni=k+1 , into ℓ − k sub-

sequences where each of them is an independent process. For achieving this

some counts that lie between two blocks are ignored, i.e., if 1X˜ i =ak+1 is such i−k ˜ iℓ that it depends on more than one block of the form X (i−1)ℓ+1 , we ignore it. The effect of such ignored counts will be no more than δr which goes to zero as k, ℓ → ∞ because the Theorem requires k = o(ℓ). More specifically in (F.6), {1X˜ iℓ−j

k+1 iℓ−j−k =a

}ri=1 is a sequence of independent random variables for each

j ∈ {0, . . . , ℓ − k − 1}.

′ For n large enough, |δn,ℓ | < ǫ/2. Therefore, by Hoeffding inequality [88], and

124

APPENDIX F. PROOF OF THEOREM 4

the union bound,   (k+1) k+1 ∗(k+1) k+1 P pˆ[X˜ n ] (a ) − p (a ) > ǫ , " # ! ℓ−k−1 r−1 ǫ X 1X 1 ∗(k+1) k+1 ≤P , 1X˜iℓ−j−k iℓ−j (a ) > =ak+1 − p ℓ − k − 1 2 r j=0 i=1 r−1 ! ℓ−k−1 ǫ X 1 X 1 ∗(k+1) k+1 ≤P 1X˜iℓ−j−k iℓ−j (a ) > , =ak+1 − p 2 ℓ − k − 1 j=0 r i=1 r−1 ! ℓ−k−1 1 X ǫ X ∗(k+1) k+1 1X˜iℓ−j−k , ≤ P iℓ−j (a ) > =ak+1 − p r 2 i=1 j=0 ≤ 2(ℓ − k)e−rǫ

2 /2

.

(F.7)

Now again by applying the union bound, P



(k+1) kˆ p[X˜ n ]

−p



X

∗(k+1)

ak+1 ∈Xˆk+1

 k1 > ǫ

(k+1) k+1 ∗(k+1) k+1 P pˆ[X˜ n ] (a ) − p (a ) > nǫ2

− ≤ |Xˆ |k+1 2(ℓ − k)e 2ℓ|Xˆ|2(k+1) .

ǫ |Xˆ |k+1

!

, (F.8)

Our choices of k = kn = o(log n), ℓ = ℓn = o(n1/4 ), k = o(ℓ), and kn , ℓn → ∞,

as n → ∞ now guarantee that the right hand side of (F.8) is summable on n which together with Borel-Cantelli Lemma yields the desired result of (F.4).

IV. Using similar steps as above we can prove that (1)

kq ∗ − qˆ[xn ,X˜ n ] k → 0, a.s.

(F.9)

(1) Again we first prove that |q ∗ (a, b) − qˆ[xn ,X˜ n ] (a, b)| → 0 for each a ∈ X and b ∈ Xˆ .

For doing this we again need to decompose

{1xi =a,X˜i =b }ni=1

125

into ℓ sub-sequences each of which is a sequence of independent random variables, and then apply Hoeffding inequality plus the union bound. Finally we apply the union bound again in addition to the Borel-Cantelli Lemma to get our desired result. V. Combing the results of the last two parts, and the fact that Hk (m) and Eq d(X, Y ) are bounded continuous functions of m and q(·, ·) respectively, we conclude that ˜ n ) + αdn (xn , X ˜ n ) = H (k+1) (Yk+1 |Y k ) + α E (1) H k (X pˆ qˆ ˜ n] [X

˜ n] [xn ,X

d(X1 , Y1)

= Hp∗(k+1) (Yk+1|Y k ) + α Eq∗ d(X1 , Y1 ) + ǫn = Cˆn∗ + ǫn ,

(F.10)

where ǫn → 0 with probability 1. VI. Since Cn∗ is the minimum of (P1), we have ˜ n ) + αdn (xn , X ˜ n ), Cn∗ ≤ Hk (X = Cˆn∗ + ǫn .

(F.11)

On the other hand, as shown in (F.3), Cˆn∗ ≤ Cn∗ . Therefore, |Cn∗ − Cˆn∗ | → 0

(F.12)

as n → ∞. VII. For a given set of coefficients λ = {λβ,b }β,b computed at some m according to (4.5), define

f (λ) = min

y n ∈Xˆn

"

X β,b

#

λβ,b mβ,b (y n ) + αdn (xn , y n ) .

(F.13)

It is easy to check that f is continuous, and bounded by 1 + α. Therefore, since

126

APPENDIX F. PROOF OF THEOREM 4

λ is in turns a continuous function of m, and as proved in (F.4), (k+1)

kp∗(k+1) − pˆ[X˜ n ] k1 → 0, we conclude that, ˆ → 0, |f (λ∗ ) − f (λ)|

(F.14)

ˆ are the coefficients computed at p∗(k+1) and pˆ(k+1) respectively. where λ∗ and λ ˜ n] [X ¯ n be the output of (P2) when the coefficients are computed at m(X ˜ n ). VIII. Let X Then, by Lemma (2), ¯ n ) + αdn (xn , X ¯ n ) ≤ H k (X ˜ n ) + αdn (xn , X ˜ n) H k (X = Cˆn∗ + ǫn .

(F.15)

˜ n ), we Since, ǫn → 0, this shows that haven computed the coefficients at m(X

would get a universal lossy compressor. But instead, we want to compute the coefficients at m∗ . From (F.14), the difference between the performances of these two algorithms goes to zero. Therefore, we finally get our desired result which is h i ˆ n ) + αdn (X n , X ˆ n ) n→∞ E H k (X −→ min [R(X, D) + αD] . D≥0

(F.16)

Appendix G Proof of Theorem 6 The proof is an extension of the proof given in [39] which is for the case where there is no FB sequence. Let X = {Xi ; ∀ i ∈ N+ } be a stochastic process defined on a probability space (X, Σ, P), where Σ denotes the σ-algebra generated by cylinder

sets, and P is a probability measure defined on it. The shift operator T : X ∞ → X ∞ is defined by

(T x)n = xn+1 ,

x ∈ X ∞ , n ≥ 1.

Let X and Xˆ denote the source and reconstruction alphabets respectively, which are both assumed to be finite.

Since (R, D) is assumed to be an interior point of the achievable region in the R − D plane, there exists δ0 > 0, such that (R − δ, D) is also an interior point for any 0 < δ ≤ δ0 . Define R1 , R − δ. Since (R1 , D) is an achievable point, for any given ǫ > 0, there exists a block WZ encoder/decoder, (fn , gn ) of rate R1 and block length n, which is sufficiently large based on ǫ, and average expected distortion less than D + ǫ. Assume that among our infinite choices, we pick a WZ code whose block length n is large enough such that 1 log n max{ √ , } < ǫ. n n log |Y|

(G.1)

This constraint will be useful in our future analysis. In order to prove that there exists a SB code satisfying the constraints given 127

128

APPENDIX G. PROOF OF THEOREM 6

in Theorem 6, the given block code (fn , gn ) should somehow be embedded in the SB encoder/decoder mappings. For defining a SB code based on a block code, the natural question is how to define blocks in the infinite length source sequence. Moreover, after finding a way for distinguishing blocks in the input sequence, the next problem is how the decoder is going to detect the coded blocks in the infinite length received FB

sequence. To answer the first question, as usually done in the literature, we resort

to Rohlin-Kakutani Theorem, or shortly the R-K Theorem, of the ergodic theory [89], which is stated below. Theorem 17 (Rohlin-Kakutani Theorem). Given the ergodic source [A, µ, U], integers L and n ≤ L, and ǫ > 0, there exists an event F (called the base) such that 1. F, T F, . . . , T L−1 F are disjoint,  L−1 S i T F ≥ 1 − ǫ, 2. P i=0

3. P (S(an )|F ) = P (S(an )), where S(an ) = {x : xn = an }. This theorem states that for any given L, and any n less than L, there exists a base event F , such that the base and its L disjoint shifts, basically cover the event space, i.e., any given sequence X with high probability belongs to T iF for some 0 ≤ i ≤ L − 1. The last property states that the probability distribution of the n-tuples is the same both in the base and in the whole space.

For a given ǫ > 0, n, the block length of the block encoder/decoder (fn , gn ), and Ln , n + ⌈nǫ⌉, let F be the base event given by the R-K theorem for these

parameters. Define G to be everything in the event space which is not included in SLn −1 i i=0 T F . Note that by the R-K theorem P(G) ≤ ǫ. To show the existence of a

finite length SB encoder, we first prove the existence of an infinite length SB encoder, f (∞) , and then show that it can be truncated appropriately such that the resulting finite window length code also satisfies our desired properties.

Note that f (∞) maps every infinite length sequence x into a symbol in the FB sequence alphabet Y, and defines the FB sequence as yˆi = f (∞) (T i x). As mentioned

earlier, one problem is enabling the decoder to discriminate between the encoded

129

blocks embedded in the FB sequence. One simple solution is requiring the SB encoder to interject a pre-defined synchronization sequence, which is not contained in any of the codewords, between the encoded blocks. Let s = 1⌈nǫ⌉ , where 1r is a vector of length r with all of its elements equal to 1, denote the synchronization block. From now on, symbols 0 and 1 denote two arbitrary distinct symbols in Y. The following

lemma shows that as long as |Y| > 2R−ǫn , it is possible to construct a codebook of 2nR distinct codewords none of them containing s. Lemma 4. If |Y| > 2R−ǫn , where ǫn = n

C ⊂ Y with 2

nR

log(1−n|Y|−⌈nǫ⌉ ) , n

it is possible to find a codebook

codewords such that s = 1r is not contained in any of them.

Proof. Let Ns denote the number of sequences in Y n that contain s as part of them.

There are n − ⌈nǫ⌉ + 1 positions that might be the start of the s. For each of them

it is possible to construct |Y|n−⌈nǫ⌉ sequences that contain s starting at that certain position. Therefore, Ns is upper-bounded as follows,

Ns < (n − ⌈nǫ⌉ + 1)|Y|n−⌈nǫ⌉.

(G.2)

On the other hand, if |Y|n − Ns > 2nR , then it is possible to choose 2nR codewords

as desired. Combining this with (G.2), it is sufficient to have |Y|n − 2nR > n|Y|n−⌈nǫ⌉,

(G.3)

|Y|n (1 − n|Y|−⌈nǫ⌉) > 2nR ,

(G.4)

log |Y| > R − ǫn ,

(G.5)

or

or

where ǫn is as defined in the statement of the lemma. Therefore, using the previous lemma, it is possible to construct a codebook C˜ with

2nR1 codewords, such that none of them contain s. Further, we can assume that the codewords in C˜ are chosen such that the first and the last symbols of all of them are equal to 0. Note that if |Y| satisfies (G.5), then the number of codewords that satisfy

130

APPENDIX G. PROOF OF THEOREM 6

the requirement of Lemma 4 is exponentially more than 2nR ; Therefore, it is possible to choose such a codebook. This assumption makes sure that for any y n ∈ C, the synchronization sequence can uniquely be detected in y1 , . . . , yn , s1 , . . . , s⌈nǫ⌉ and also in s1 , . . . , s⌈nǫ⌉ , y1 , . . . , yn with no ambiguity. Now each codeword in the codebook C can be mapped into a unique codeword in

˜ The role of each element in C in the coding is then played by the corresponding C. vector in C˜ that it is mapped to. Since in the WZ coding, the codebook is only an indexing of the input blocks, such a mapping only acts as a renaming of the vectors in the codebook, and does not have any other effect. Now we define an infinite length encoder f (∞) based on the partitioning of the event space given by the R-K Theorem as follows,

1. x ∈ T i F , for some 0 ≤ i ≤ n − 1: let f (∞) (x) = yˆi where [ˆ y0 , yˆ1, . . . , yˆn−1 ] , fn (xn−i−1 ). −i

2. x ∈ T i F , for some n ≤ i ≤ Ln − 1: let f (∞) (x) = s[i − n + 1], 3. x ∈ G: let f (∞) (x) = y0 , where y0 is an element in Y which is not used in s. After defining the SB encoder, we can define the SB decoder, g, which generates ˆ i = g(Z i+M , Y i+M ) with M = 2(n + ⌈nǫ⌉) + 1. The the reconstruction process as X i−M

decoder g searches the block

i+M Yi+1

i−M

for a synchronization sequence. At most there

will be one such sequence. If it detects one string s, which starts at position i + r, ˆ i = Un−r+1 , where [Uˆ1 , Uˆ2 , . . . , Uˆn ] , gn (Z i+r−n , Y i+r−n ). 1 ≤ r ≤ n + 1, then it lets X i+r−1

i+r−1

If it detects no synchronization sequence, the decoder outputs some fixed arbitrary symbol. In order to compute the expected average distortion between the source and reconstruction sequences, note that since the original process and its reconstruction are jointly stationary, the average expected distortion between them is equal to

131

ˆ0) E d(X0 , X ˆ0) = E d(X0 , X

n−1 X i=0

h i ˆ 0 )|T iF P(T i F )+ E d(X0 , X "

ˆ 0 )| E d(X0 , X

L[ n −1

#

T iF P

i=n h i ˆ E d(X0 , X0 )|G P(G).

L[ n −1 i=n

T iF

!

+ (G.6)

By the stationarity of the source, P

L[ n −1

T iF

i=0

L

n −1

and P

S

i=n

T iF



!

= (n + ⌈nǫ⌉) P(F ),

(G.7)

= ⌈nǫ⌉ P(F ). Therefore, P

L[ n −1

T iF

i=n

!

⌈nǫ⌉ = P n + ⌈nǫ⌉ ≤ (a)

⌈nǫ⌉ n + ⌈nǫ⌉

L[ n −1 i=0

T iF

!

≤ ǫ,

(G.8)

where (a) follows from (G.1). Moreover, from the R-K Theorem, P(G) < ǫ, which together with (G.8) shows that ˆ0) < E d(X0 , X

n−1 X i=0

h i i ˆ E d(X0 , X0 )|T F P(T i F ) + 2dmax ǫ.

(G.9)

132

APPENDIX G. PROOF OF THEOREM 6

For bounding the first term in (G.9), note that n−1 X i=0

n−1 h i h i X i i ˆ ˆ i )|F P(F ) E d(X0 , X0 )|T F P(T F ) = E d(Xi , X i=0

h i ˆ n )|F , < E dn (X n , X

(G.10)

where the last line follows from thei fact that P(F ) < n1 . On the other hand, by h ˆ n )|F = E[dn (X n , gn (Z n , fn (X n )))]. Consequently, the R-K theorem, E dn (X n , X combining all of the previous results,

ˆ 0 )] < E[dn (X n , gn (Z n , fn (X n )))] + 2dmax ǫ, E[d(X0 , X < D + (1 + 2dmax )ǫ.

(G.11)

From the finite SB code approximation theorem (Theorem 3.1 in [52]), for any σ > 0, and any infinite SB code f (∞) , there exists k = k(σ, f (∞) ), and finite SB code f (k) of window-length 2k +1, such that the outputs of the codes f (∞) and f (k) coincide except on a set of probability no larger than σ. This result enables us to truncate  f (∞) , and get a finite code f (k) such that P f (∞) (X) 6= f (k) (X) < σ. Now assume that σ = ǫ/(2M + 1) and k = k(σ, f (∞) ), and define {Yi} and {Y˜i } as Yi = f (k) (X i+k ) i−k

and Y˜i = f (∞) (X). Then, from Lemma 3.2 of [52],

    i+M i+M i+M ˜ i+M P g(Zi−M , Yi−M ) 6= g(Zi−M , Yi−M ) ≤ (2M + 1) P Yi 6= Y˜i , < (2M + 1)σ, = ǫ.

(G.12)

(G.12) states that from the truncation of f ∞ the expected distortion will not increase by more than λmax ǫ. Therefore, combing this with our previous results, we conclude that ˆ 0 )] < D + (1 + 3dmax )ǫ. E[d(X0 , X

(G.13)

So far we have shown the existence of a SB code which generates a reconstruction sequence within maximum distance of D + (1 + 3dmax )ǫ to the source. Now we show

133

i+k that the entropy rate of the {Yi} sequence, where Yi = f (k) (Xi−k ), is as close to R as desired. In order to do this, we first bound the entropy rate of the {Y˜i } process,

where Y˜i = f (∞) (X). Then

1 1 H(Y m ) ≤ H(Y m , Y˜ m ), m m 1 1 = H(Y˜ m ) + H(Y m |Y˜ m ), m m 1 1 = H(Y˜ m ) + H(Y m ⊕ Y˜ m |Y˜ m ), m m 1 1 ≤ H(Y˜ m ) + H(Y m ⊕ Y˜ m ), m m 1 ≤ H(Y˜ m ) + hb (σ), m

(G.14)

where for y, y˜ ∈ Y, y ⊕ y˜ = 0 if y = y˜ and 1 otherwise, also for 0 ≤ α ≤ 1, hb (α) = −α log α − (1 − α) log(1 − α). Therefore, letting m grow to infinity, we conclude that

¯ ¯ Y) ˜ + hb (σ). H(Y) ≤ H(

(G.15)

ˆ Y). ˜ Let {θi } denote a sequence defined as follows, Now we turn to bounding H(  th ˜    j, Yi is the j letter of a codeword, θi = where j ∈ {1, . . . , Ln };    0, otherwise.

(G.16)

134

APPENDIX G. PROOF OF THEOREM 6

˜ can be upper-bounded as follows Then the entropy rate of the generated FB process Y ¯ Y) ˜ = lim 1 H(Y˜ m ), H( m→∞ m 1 ≤ lim H(Y˜ m , θm ), m→∞ m i 1 h m m m ˜ = lim H(θ ) + H(Y |θ ) , m→∞ m   1 m m−1 m ˜ ˜ = lim H(θ ) + H(Ym|Y ,θ ) , m→∞ m   1 m m−1 ˜ ˜ ≤ lim H(θ ) + H(Ym |Y , θm ) , m→∞ m

(G.17)

where H(Y˜m|Y˜ m−1 , θm ) = (a)

Ln X j=0

≤ ǫ log |Y| +

≤ ǫ log |Y| +

H(Y˜m |Y˜ m−1 , θm = j) P(θm = j),

n X

m−1 H(Y˜m |Y˜m−j+1 , θm = j) P(θm = j),

j=1 n X

1 n

j=1

m−1 , θm = j), H(Y˜m |Y˜m−j+1

1 H(fn (X n )), n ≤ ǫ log |Y| + R1 ,

= ǫ log |Y| +

(G.18)

where (a) follows from the facts that, for n + 1 ≤ j ≤ Ln , H(Y˜m|Y˜ m−1 , θm = j) = 0,

and P(θm = 0) < ǫ. Moreover, we show that the entropy rate of the {θi } process can be made arbitrary small:

1 H(θm ) = lim H(θm |θm−1 ), m→∞ m→∞ m ≤ lim H(θm |θm−1 ), lim

=

m→∞ Ln X j=0

P(θ0 = j)H(θ1 |θ0 = j),

(G.19)

135

where the last step is a result of stationarity. By the definition of the {θi } sequence,

H(θ1 |θ0 = j) = 0, for 1 ≤ j ≤ n − 1, P(θ0 = 0) = P(G) ≤ ǫ, and P(θ0 = Ln ) =

P(F ) ≤ n1 . Given θ0 = 0, θ1 can either be zero or one, therefore H(θ1 |θ0 = 0) ≤ 1. Similarly, conditioned on θ0 = Ln , θ1 can only be zero or one, and H(θ1 |θ0 = n)

computes the uncertainty that one has in determining whether a sequence belongs to L Sn i T F or not when it is known that X ∈ T −Ln F . Since conditioning can only reduce i=0

entropy, and P(G) < ǫ, it follows that H(θ1 |θ0 = Ln ) < hb (ǫ). Consequently, 1 H(θm ) = P(θ0 = 0)H(θ1 |θ0 = 0)+ m→∞ m lim

P(θ0 = Ln )H(θ1 |θ0 = Ln ) ≤ ǫ +

1 hb (ǫ). n

(G.20)

Combining (G.15), (G.17) and (G.20), it follows that 1 ¯ H(Y) ≤ R1 + (1 + log |Y|)ǫ + hb (ǫ) + hb (σ), n where as defined before σ=

(G.21)

ǫ . 2(n + ⌈nǫ⌉) + 1

Note that (1 + log |Y|)ǫ + n1 hb (ǫ) + hb (σ) goes to zero as ǫ goes to zero. Therefore,

there exists ǫ′ > 0, such that ( n1 + log |Y|)ǫ + n1 hb (ǫ) + hb (σ) < δ2 , for any ǫ < ǫ′ . By

definition R1 = R − δ. Consequently, by choosing ǫ < min{ǫ′ , ǫ′′ }, where ǫ′′ , and ǫ2 ,

δ 2

ǫ1 , 1+3dmax

> 0, we get a SB encoder f (k) and SB decoder g generating FB and

reconstruction sequences satisfying ˆ 0 )] < D + ǫ1 , 1. E[d(X0 , X ¯ 2. H(Y) < R − ǫ2 .

Appendix H Proof of Theorem 7 First, we prove that for any given ǫ > 0, there exists Nǫ > 0, such that for n > Nǫ , E[dn (X n , gn∗ (Z n , fn∗ (X n )))] < DX,π (R) + ǫ.

(H.1)

By definition, DX,π (R) denotes the infimum of all distortions achievable by WZ coding of source X at rate R when the DMC is described by π. Therefore, for any ǫ > 0, (D + 4ǫ , R) would be an interior point of the rate-distortion region. Hence, by theorem 6 for ǫ1 =

ǫ 4

> 0, there exist some ǫ2 > 0, and a sliding-block WZ code with mappings

f and g, each one having a finite window length, such that   i+l i+m i+k 1. E d(Xi , g(Zi−l , Yi−m )) ≤ D + 2ǫ , where Yi = f (Xi−k ), 1 H(Y1 , . . . , Yn ) n→∞ n

2. H(Y) = lim

≤ R − ǫ2 , for some ǫ2 > 0.

On the other hand, the FB process {Yi } generated by sliding windowing a station-

ary ergodic process {Xi } with a time invariant mapping f , is also a stationary ergodic process. Consequently, since for any stationary ergodic process Lempel-Ziv coding algorithm is an asymptotically optimal lossless compression scheme [1], for any given σ > 0, there exists Nσ > 0, such that for n > Nσ , 1 LZ(Y1 , . . . , Yn ) ≤ H(Y) + σ. n 136

(H.2)

137

Letting σ =

ǫ2 , 2

and choosing n greater than the corresponding Nσ , yields 1 LZ(Y1 , . . . , Yn ) ≤ R − ǫ2 + σ, n ǫ2 ≤R− , 2 < R.

Therefore, for any given ǫ > 0, and any source output sequence, by choosing the block length n sufficiently large, the mapping f would belong to S(xn , k, R). On the other hand, since for any individual source sequence xn , f ∗ is the mapping in S that defines the FB sequence minimizing the expected distortion, it follows that V (f ∗ , l, m) < V (f, l, m).

(H.3)

Moreover, since V (f, l, m) is the minimum accumulated loss attainable by the mappings in S(n, l, m), when the decoder is constrained to be a sliding window decoder with parameters l and m, it is in turn less than the expected distortion obtained by the specific mapping g given by theorem 6, i.e., E

"

n−k X

d xi , g



i+m i+l , y˜i−m ) (Zi−l

i=k+1



#

≤E

"

n−k X

d

i+m i+l , yi−m ) xi , g(Zi−l

i=k+1

# 

,

(H.4)

˜i = f ∗ (xi+k where yi = f (xi+k i−k ). i−k ) and y The final step is applying Theorem 5 which is the asymptotic optimality of WZ DUDE algorithm in the semi-stochastic setting. From this result, when the parameter l and m are such that tn = max{l, m} = o(n/ log n), the difference between the performance of the WZ DUDE decoding algorithm and the optimal sliding-window decoder of the same order goes to zero as the block length goes to infinity. In other words, for any given ǫ > 0, there exists Nǫ′ > 0, such that for n > Nǫ′ , 1 E n − 2k

"

n−k X

#

1 d (xi , xˆi ) ≤ E n − 2k i=k+1

"

n−k X

i=k+1

i+l i+m d xi , g ∗(Zi−l , y˜i−m )



#

ǫ + , 4

(H.5)

138

APPENDIX H. PROOF OF THEOREM 7

where xˆn = gn∗ (Z n , fn∗ (X n )). Note that the only uncertainty in (H.5) is due to the channel noise, and the source and FB sequence are assumed to be individual sequences. Combining (H.4) and (H.5), it follows that with probability one n E [dn (xn , gn∗ (Z n , fn∗ (X n )))] ≤ n − 2k " n−k # X  ǫ 1 i+l i+m E d xi , g(Zi−l , yi−m ) + . n − 2k 4 i=k+1

(H.6)

On the other hand, since {(Xi , Yi )}∞ −∞ is also a stationary ergodic process with

super-alphabet X × Y, by the ergodic theory, with probability one, n−k X   1 i+l i+m lim E d xi , g(Zi−l , yi−m ) , n→∞ n − 2k i=k+1   l m = E d X0 , g(Z−l , Y−m ), ǫ ≤ DX,π + . 2

This means that with probability one, there exists Nǫ′′ > 0, such that for n > Nǫ′′ , n−k X   1 ǫ ǫ i+l i+m E d xi , g(Zi−l , yi−m ) ≤ DX,π + + . n − 2k i=k+1 2 4

(H.7)

Finally, combining (H.6) and (H.7), and taking n > Nǫ , where Nǫ = max{Nσ , Nǫ′ , Nǫ′′ },

yields the desired result as follows

n E [dn (X n , gn∗ (Z n , fn∗ (X n )))] ≤ DX,π(R) + ǫ. n − 2k

(H.8)

Appendix I Bounding the rate-distortion function of a first order binary Markov process As mentioned before, the rate distortion of any stationary ergodic source can be computed using (1.2). Although it might not be clear at first glance, this formula does not yield the rate-distortion function of a given stationary ergodic source explicitly. The problem is that the computational complexity required for solving the convex optimization problem given in (1.2) grows exponentially with n. Moreover, the convergence rate of Rn (D) to R(D) is typically slow, and at each step it is not clear how far we are from the optimal solution. In this appendix we present some upper and lower bounds on the R(D) of first order Markov sources. In this Appendix, based on some ideas presented in [90], new upper and lower bounds on the rate-distortion of a BSMS(q) is derived. Let {Xn }∞ n=−∞ be the output sequence of a BSMS(q) source . Consider every

k + 1 source output symbols {Xi(k+1) }∞ i=−∞ . Define Si , Xi(k+1) , and super-symbol i(k+1)+k

Yi , Xi(k+1)+1 = (Xi(k+1)+1 , . . . , Xi(k+1)+k ):

139

140

APPENDIX I. BOUNDING R(D)

. . . , X−2 , X−1 , X0 , X1 , . . . , Xk , Xk+1 , . . . . |{z} | {z } | {z } ↓



S0

Y0



S1

Given the {Si }∞ i=−∞ sequence, from the Markovity of the source, Yi ’s are independent random variables which are not identically distributed, and  P Yi |{Si}∞ i=−∞ = P (Yi |Si , Si+1 ) .

Since the source is binary, (Si , Si+1 ) can take on only 2x2 = 4 different values. As a result, given {Si }∞ −∞ , Yi ’s consist of four types of i.i.d. processes. Now given

all these, a simple scheme for describing the source within average distortion D, is as follows: 1. describe the side information sequence {Si }∞ −∞ losslessly, ˜ 2. describe the {Yi }∞ −∞ sequence within distortion D =

1 D, rk

where rk =

k . k+1

The total expected per symbol distortion of this scheme is 0

1 ˜ k = D, +D k+1 k+1

(I.1)

which satisfies the desired average distortion constraint. Therefore, the rate required for this scheme gives an upper bound on R(D) of a BSMS(q). The average number of bits per source symbol required for the first step is equal to H(Xk |X0 ) . k+1

The reason is that Si is a first-order Markov process with a different transi-

tion probability than the original sequence, and consequently its lossless description requires H(Xk |X0 ) bits per Si symbol, which in turn is divided by k + 1 to give the cost per source symbol. For the second step, where we should describe a mixture ˜ the average required of four types of i.i.d. processes within average distortion D,

number of bits per source symbol is ˜ k (D) ˜ = rk rk R

X

(i,j)∈{0,1}2

(i,j)

˜ P (S0 = i, S1 = j) Rk (D),

(I.2)

141

(i,j)

˜ is the k th order rate-distortion when X k is distributed according to where, Rk (D)  P X k |X0 = i, Xk+1 = j . Each of the k th order rate-distortion functions in (I.2) is

multiplied by the corresponding P (S0 = i, S1 = j) which denotes the proportion of

the Yi sequence that is allocated to the (i, j)th mode. For each Yi , the two source symbols that embrace it, namely Xi(k+1) and X(i+1)(k+1) , determine its mode. Note ˜ 1 (D) coincides with the erasure rate that for the Markov source considered here, R distortion function of Verdu and Weissman [91]. Combining the rates required for the first and second steps, one gets the following upper bound on the rate-distortion function of a BSMS(q), ˜ k (D) ˜ + H(Xk |X0 ) . R(D) ≤ rk R k+1

(I.3)

For proving the lower bound, consider the following problem where the decoder desires to describe the {Yi } sequence within a given fidelity constraint, while the {Si } sequence is known by both the encoder and the decoder losslessly beforehand. Any

given coding scheme that achieves average distortion D for the BSMS(q), can also be considered as a scheme for encoding the underlying {Yi } sequence as well. In the sequel, it is shown how the average expected loss incurred by the {Yi} sequence using

this technique can be upper bounded. Note that by definition,

1 n(k + 1)

n(k+1)−1

X

i=0 n−1 X

k = n(k + 1) ≤ D.

ˆ i )] E[d(Xi , X n−1

E[d(Yi, Yˆi )] +

i=0

X 1 E[d(Si, Sˆi )] n(k + 1) i=0

Since both terms in the above sum are greater than zero, each of them individually should be less than D as well. This implies that n−1

X k E[d(Yi , Yˆi)] ≤ D, n(k + 1) i=0

142

APPENDIX I. BOUNDING R(D)

or, n−1

k+1 1X ˜ E[d(Yi , Yˆi)] ≤ D = D. n i=0 k Thus, the given scheme describes the {Yi } sequence within an average distortion ˜ As a result, from the converse in the rate-distortion coding theorem for less than D. ˜ k (D) ˜ is the mixed i.i.d. sources with known state at both encoder and decoder, since R infimum of the rates required for describing the Yi sequence with average distortion ˜ at the decoder, one gets the following lower bound on R(D), D ˜ k (D) ˜ ≤ R(D). rk R

(I.4)

Combining (I.3) and (I.4), we can bound the rate-distortion function of a BSMS(q) as follows

˜ k (D) ˜ ≤ R(D) ≤ rk R ˜ k (D) ˜ + H(Xk |X0 ) . rk R k+1

(I.5)

A point to note about these bounds is that the difference between the upper bound and lower bound given in (I.5) goes to zero as k goes to infinity, and can separately be computed without computing the bounds themselves. For instance, for a given δ, ˜ k (D) ˜ − R(D)| < δ. choosing k > ⌈1/δ − 1⌉, guarantees |rk R Although the difference between the two sides of (I.5) goes to zero as k goes to

infinity, still the lower bound seems to be a loose bound, because in its derivation, it was assumed that the decoder has access to the so-called side information sequence losslessly without spending any rate. In the sequel, an alternative lower bound is derived which will be seen in the next section to be tighter than the previous lower bound, and even outperforming Berger lower bound for moderate values of k. Define the conditional rate distortion function R− (D) as follows R− (D) ,

min

ˆ 1 |X1 ,X0 ):E d(X1 ,X ˆ 1 )≤D p(X

  ˆ 1 |X0 I X1 ; X

(I.6)

and similarly, Rk− (D) ,

 1  k ˆk I X ; X |X0 . ˆ k |X k ,X0 ):E dk (X k ,X ˆ k )≤D k p(X min

(I.7)

143

In the following, it will first be shown that for any first order Markov source R(D) is lower bounded by R− (D), and then by a simple argument it will be proven that for a BSMS(q), the rate distortion function would also be lower bounded by Rk− (D). For a given first-order finite-alphabet Markov source, consider C to be a source

code of length n, rate R, and average expected distortion per symbol less than D. Similar to the inverse proof of the rate-distortion theory, the following series of inequalities holds,

ˆ n ), nR ≥ H(X

ˆ n ), = I(X n ; X n h i X ˆ n) , = H(Xi |X i−1 ) − H(Xi|X i−1 , X i=1

n h i X ˆi) , ≥ H(Xi |Xi−1 ) − H(Xi |Xi−1 , X i=1

n   X ˆ ≥ I Xi ; Xi |Xi−1 , i=1

ˆ 1) + ≥ I(X1 ; X ≥

n X i=1

n X i=2

  ˆi) , R− E d(Xi, X

  ˆi) , R− E d(Xi , X

≥ nR−

n

1X ˆi) E d(Xi, X n i=2

!

≥ nR− (D).

(I.8)

Therefore, R(D) is lower bounded by R− (D). Now define the sequence {Zi }, where

ik Zi , X(i−1)k+1 . Since the two {Xi } and {Zi } sequences are essentially the same, they

should have the same rate-distortion function. But since the defined sequence is also a first-order Markov source, the lower bound given by (I.6) applies to the new source as well. Consequently, since the two source have the same rate-distortion function, we conclude that

144

or,

APPENDIX I. BOUNDING R(D)

 1  ˆ R(D) ≥ min I Z1 ; Z1 |Z0 , p(Zˆ1 |Z1 ,Z0 ):E d(Z1 ,Zˆ1 )≤D k 1  k ˆk 0  I X ; X |X−k+1 k ˆk ˆ k |X k ,X 0 k p(X −k+1 ):E dk (X ,X )≤D  1  k ˆk = min I X ; X |X0 , ˆ k |X k ,X0 ):E dk (X k ,X ˆ k )≤D k p(X

R(D) ≥

min

= Rk− (D).

(I.9) (I.10)

Finally, we present another upper bound for the rate-distortion function of a binary Markov source. Again divide the source output bit sequence into consequent blocks of length k as shown below 3k X2k+1

Xk

z }| { z }| { . . . , X0 , X1 , . . . , Xk , Xk+1 , . . . , X2k , X2k+1 , . . . , X3k , X3k+1 . . . . | {z } 2k Xk+1

(i+1)k

Each sub-block, Xik+1 , conditioned on its preceding bit Xik assumes two different distributions: P (X k |0) and P (X k |1). Based on this, we can label each block by zero

or one, depending on wether it is preceded by a zero or a one. Then, a method for describing the source output within distortion level D to the decoder is as follows. First concatenate all zero blocks, and one blocks separately, to divide the original sequence into two subsequences. Then code each subsequence within distortion D such that the last bits of all sub-blocks are preserved. The average bit rate required for describing the source in this way is as follows Rk+ (D) =

min 

ˆ k |X k ,X0 ): P (X

ˆ Xk = X k ˆ k) ≤ D Ed(X k , X

1 ˆ k |X0 ). I(X k ; X k

(I.11)

The decoder is able to reconstruct the source sequence within distortion D from the coded version of the two described sub-sequences as follows. First it decodes each sub-sequence separately, and then merges them. Since the last bits of the sub-blocks

145

are intact, this is a straightforward task. Below, an example of this process is shown:

1. Labeling: 1

0

1

}| { z }| { z }| { z 0, 1, 0, 1, . . . , 0, 0, 1, 1, . . . , 1, 0, 1, 0, . . . , 0, 0, 1, 1, . . . , 1, 1, 1, 0, . . . , 1, 0, 0, 1, . . . , 1 . | {z } | {z } | {z } 0

0

1

2. Forming the two sub-sequences:

- zero blocks: 1, 0, 1, . . . , 01 , 0, 1, 1, . . . , 12 , 0, 1, 1, . . . , 14 - one blocks: 0, 1, 0, . . . , 03 , 1, 1, 0, . . . , 15 , 0, 0, 1, . . . , 16 3. Describing each sub-sequence such that the last bits of its sub-blocks are carried over losslessly:

- zero blocks: 1, 1, 1, . . . , 01 , 0, 1, 0, . . . , 12 , 1, 1, 1, . . . , 14 - one blocks: 1, 1, 1, . . . , 03 , 0, 1, 0, . . . , 15 , 0, 1, 1, . . . , 16 4. Merging the two sub-sequences (assume that the first bit is known to be zero): 0, 1,1,1,. . . ,01 , 0,1,0,. . . ,12 , 1, 1, 1, . . . , 03 , 1,1,1,. . . ,14 ,

Since this is an achievable scheme for describing the source within distortion level D, we have R(D) ≤ Rk+ (D).

(I.12)

146

APPENDIX I. BOUNDING R(D)

Note that 1 ˆ k |X0 ) I(X k ; X k

(I.13)

min

i 1h ˆ k |X0 ) − H(X ˆ k |X0 , X k ) H(X k

(I.14)

min

1h ˆ k |X0 , X ˆ k−1) + H(X ˆ k−1|X0 )− H(X k

Rk+ (D) =

min 

ˆ k |X k ,X0 ): P (X

= ˆ k |X k ,X0 ): P (X

= ˆ k |X k ,X0 ): P (X







ˆ Xk = X k ˆ k) ≤ D Ed(X k , X

ˆ Xk = X k ˆ k−1 ) ≤ k D Ed(X k−1 , X k−1

ˆ Xk = X k ˆ k−1 ) ≤ k D Ed(X k−1 , X k−1

1 + k P (Xˆ k |X k ,X

0 ):



min

ˆ Xk = X k ˆ k−1 ) ≤ k D Ed(X k−1 , X k−1

i k−1 k ˆ H(X |X0 , X ) (I.15) h i 1 ˆ k−1|X0 ) − H(X ˆ k−1|X0 , X k ) (I.16) H(X k

In the simulation results presented in this thesis, the bounds used in the plots are Rk− (D) and Rk+ (D).

Appendix J Proof of Theorem 11 Proof of the first part: Let (R1 , R2 , D1 , D2 , D0 ) ∈ R B . We need to find (R11 , R22 , R0 ) such that (R11 , R22 , R0 , D1 , D2 , D0 ) ∈ R P , and (6.10) -(6.12) are satisfied. (n)

(n)

(n)

Let (f (n) , g1 , g2 , g0 ) be a sequence of codes at rate (R1 , R2 ) that achieves ˆ n, X ˆ n, X ˆ n ) is the point (R1 , R2 , D1 , D2 , D0 ) ∈ R B . Note that for a given code, (X 1

2

0

n

a deterministic function of X . Using the same method used in [39], we can generˆ (n) , X ˆ (n) , X ˆ (n) ) by appropriately embedding ate jointly stationary ergodic processes (X 1

2

0

these block codes into ergodic processes. Here the superscript (n) indicates the dependence of the constructed processes on n. In order to code an ergodic process into another ergodic process using a block code of length n, we need to cover an infinite length sequence by non-overlapping blocks of length n up to a set of negligible measure, and then replace each block by its reconstruction generated by the block code. The challenging part is the partitioning which should preserve the ergodicity. This can be done using R-K Theorem 17. Since the sequence of MD block codes were assumed to achieve the point ˆ (n) ˆ (n) ˆ (n) (R1 , R2 , D1 , D2 , D0 ), the constructed process (X 1 , X2 , X0 ) satisfies the distortion constraints given in (6.7)-(6.9) at (D1 + ǫn , D2 + ǫn , D0 + ǫn ), where ǫn → 0 as n → ∞. P ¯ n (X ¯ n (X ¯ ˆ (n) ˆ (n) ˆ (n) ˆ 1(n) ), H ˆ (n) Therefore, (H 2 ), Hn (X0 |X1 , X2 ), D1 + ǫn , D2 + ǫn , D0 + ǫn ) ∈ R . 147

148

APPENDIX J. PROOF OF THEOREM 11

Let (n)

1 ˆ n ), H(X 1 n 1 ˆ n ), := H(X 2 n 1 ˆ n |X ˆ n, X ˆ n ), := H(X 0 1 2 n

R11 := (n)

R22

(n)

R0

(J.1) (J.2) (J.3)

ˆ n = g (n) (Mi ), for i ∈ {1, 2} and X ˆ n = g (n) (M1 , M2 ). Note that since where X i 0 0 i (n) (n) n n n ˆ ,X ˆ ,X ˆ ), by Theorem 10, R11 the encoder knows (X ≤ R1 , R22 ≤ R2 , and 1 2 0 (n) (n) (n) ¯ n (X ¯ n (X ˆ (n) ), H ˆ (n) ), R + R + R ≤ R1 + R2 . Finally, the relationship between (H 11

22

0

1

2

(n) (n) (n) ¯ n (X ˆ (n) ˆ (n) ˆ (n) H 0 |X1 , X2 )) and (R11 , R22 , R0 ) can be established by following similar

steps as what is done in the proof of Theorem 6.

Proof of the second part: Let (R11 , R22 , R0 , D1 , D2 , D0 ) ∈ R P . This means that ˆ 1, X ˆ 2 and X ˆ 0 jointly stationary and ergodic with X which there exist processes X satisfy (6.4)-(6.9). Based on these processes, for block length n, we use the following ˆ n and X ˆ n losslessly to the block coding strategy: For coding sequence X n , describe X 1

2

ˆ 1 ) + ǫn ) and n(H( ˆ 2 ) + ǫn ) bits respectively. Given ¯ X ¯ X decoders 1 and 2 using n(H( ˆ n and X ˆ n , n(H( ¯ X ˆ 0 |X ˆ 1, X ˆ 2 ) + ǫn ) bits suffice to describe X ˆ n losslessly to Decoder X 1

2

0

0. These bits can be divided into two parts: the first part will be included in the message M1 , and the rest in the message M2 . Decoders 1 and 2 just ignore these

extra bits, but Decoder 0 combines them with the two other messages to reconstruct ˆ n . Since R1 and R2 satisfy (6.10)-(6.12), it is possible to do this. X 0

Bibliography [1] T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 2nd edition, 2006. [2] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423 and 623–656, 1948. [3] R.G. Gallager. Information Theory and Reliable Communication. NY: John Wiley, 1968. [4] T. Berger. Rate-distortion theory: A mathematical basis for data compression. NJ: Prentice-Hall, 1971. [5] C. Shannon. Coding theorems for a discrete source with fidelity criterion. In R. Machol, editor, Information and Decision Processes, pages 93–126. McGrawHill, 1960. [6] J.C. Kieffer. A survey of the theory of source coding with a fidelity criterion. Information Theory, IEEE Transactions on, 39(5):1473–1490, Sep 1993. [7] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. Information Theory, IEEE Transactions on, 24(5):530–536, Sep 1978. [8] I. H. Witten, R. M. Neal, , and J. G. Cleary. Arithmetic coding for data compression. Commun. Assoc. Comp. Mach., 30(6):520–540, 1987. [9] S. Jalali and T. Weissman. New bounds on the rate-distortion function of a binary Markov source. In Information Theory, 2007. ISIT 2007. IEEE International Symposium on, pages 571–575, June 2007. 149

150

BIBLIOGRAPHY

[10] J. Ziv. Distortion-rate theory for individual sequences. Information Theory, IEEE Transactions on, 26(2):137–143, Mar 1980. [11] Donald S. Ornstein and Paul C. Shields. Universal almost sure data compression. The Annals of Probability, 18(2):441–452, Apr. 1990. [12] E.-H. Yang. Universal almost sure data compression for abstract alphabets and arbitrary fidelity criterions. Probl. Contr. Inform. Theory, 20:397–408, 1991. [13] En hui Yang and J.C. Kieffer. Simple universal lossy data compression schemes derived from the lempel-ziv algorithm. Information Theory, IEEE Transactions on, 42(1):239–245, Jan 1996. [14] H. Morita and K. Kobayashi. An extension of lzw coding algorithm to source coding subject to a fidelity criterion. Proc. 4th Joint Swedish-Soviet Int. Workshop on Information Theory, page 105109, 1989. [15] Y. Steinberg and M. Gutman. An algorithm for source coding subject to a fidelity criterion, based on string matching. Information Theory, IEEE Transactions on, 39(3):877–886, May 1993. [16] M.J. Wainwright and E. Maneva. Lossy source encoding via message-passing and decimation over generalized codewords of ldgm codes. In Information Theory, 2005. ISIT 2005. Proceedings. International Symposium on, pages 1493–1497, Sept. 2005. [17] A. Gupta and S. Verdu. Nonlinear sparse-graph codes for lossy compression of discrete nonredundant sources. In Information Theory Workshop, 2007. ITW ’07. IEEE, pages 541–546, Sept. 2007. [18] J. Rissanen and I. Tabus. Rate-distortion without random codebooks. In Workshop on Information Theory and Applications (ITA), Sep 2006. [19] A. Gupta, S. Verd´ u, and T. Weissman. Linear-time near-optimal lossy compression. In Information Theory, 2008. ISIT 2008. Proceedings. International Symposium on, 2008.

151

BIBLIOGRAPHY

[20] I. Kontoyiannis. An implementable lossy version of the lempel ziv algorithm-part i: optimality for memoryless sources. Information Theory, IEEE Transactions on, 45(7):2293–2305, Nov 1999. [21] A. Wyner and J. Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inform. Theory, 1(22):1–10, January 1976. [22] S. Jalali, S. Verd´ u, and T. Weissman. A universal Wyner-Ziv scheme for discrete sources. In Proc. IEEE Int. Symp. Inform. Theory, Nice, France, July 2007. [23] S. Jalali and T. Weissman. New bounds on the rate-distortion function of a binary Markov source. In Proc. IEEE Int. Symp. Inform. Theory, Nice, France, July 2007. [24] S. Jalali and T. Weissman. Lossy coding via Markov chain Monte Carlo. In Proc. IEEE Int. Symp. Inform. Theory, Toronto, Canada, July 2008. [25] S. Jalali, S. Verd´ u, and T. Weissman. A universal Wyner-Ziv scheme for discrete sources. IEEE Trans. Inform. Theory. accepted to be published. [26] S. Jalali and T. Weissman.

Rate-distortion via Markov Chain Monte Carlo.

IEEE Trans. Inform. Theory. submitted to. [27] S. Jalali, A. Montanari, and T. Weissman. An implementable scheme for universal lossy compression of discrete Markov sources. In Proc. Data Compression Conference, Snowbird, UT, March 2009. [28] S. Jalali, A. Montanari, and T. Weissman. An iterative scheme for near optimal and universal lossy compression. In Information Theory Workshop (ITW), Volos, Greece, 2009. [29] S. Jalali and T. Weissman.

Multiple description coding of discrete ergodic

sources. In Proc. of Allerton Conference on Communication, Control and Computing, 2009.

152

BIBLIOGRAPHY

[30] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983. [31] V. Cerny. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, Jan 1985. [32] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, Nov 1984. [33] Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, Nov 1998. [34] J. Vaisey and A. Gersho. Simulated annealing and codebook design. In Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on, pages 1176–1179 vol.2, Apr 1988. [35] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. Communications, IEEE Transactions on, 28(1):84–95, Jan 1980. [36] En-Hui Yang, Zhen Zhang, and T. Berger. Fixed-slope universal lossy data compression. Information Theory, IEEE Transactions on, 43(5):1465–1476, Sep 1997. [37] Robert M. Gray, David L. Neuhoff, and Donald S. Ornstein. Nonblock source coding with a fidelity criterion. The Annals of Probability, 3(3):478–491, Jun 1975. [38] K. Markon. On the rate distortion function of stationary sources. Probl. Contr. Inform. Theory, 4:289–297, 1975. [39] R. M. Gray. Block, sliding-block, and Trellis codes. In Janos Bolyai Colloquiem on Info. Theory, Keszthely, Hungary, August 1975.

BIBLIOGRAPHY

153

[40] R. Gray. Information rates of autoregressive processes. Information Theory, IEEE Transactions on, 16(4):412–421, Jul 1970. [41] B. Natarajan, K. Konstantinides, and C. Herley. Occam filters for stochastic sources with application to digital images. Signal Processing, IEEE Transactions on, 46(5):1434–1438, May 1998. [42] D. Donoho. The Kolmogorov sampler, Jan 2002. [43] T. Weissman and E. Ordentlich. The empirical distribution of rate-constrained source codes. Information Theory, IEEE Transactions on, 51(11):3718–3733, Nov 2005. [44] K. Ramchandran and M. Vetterli. Best wavelet packet bases in a rate-distortion sense. Image Processing, IEEE Transactions on, 2(2):160–175, Apr 1993. [45] E. Ordentlich, G. Seroussi, S. Verdu, M. Weinberger, and T. Weissman. A discrete universal denoiser and its application to binary images. In Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, volume 1, pages I–117–20 vol.1, Sep 2003. [46] T. Weissman, Erik Ordentlich, G. Seroussi, S. Verd´ u, and M. Weinberger. Universal discrete denoising: Known channel. IEEE Trans. Inform. Theory, 51(1):5–28, 2005. [47] A. Gupta, S. Verd´ u, and T. Weissman. Rate-distortion in near-linear time. submitted to IEEE Trans. on Info. Theory. [48] R. M. Gray. Rate distortion functions for finite-state finite-alphabet markov sources. IEEE Trans. Inform. Theory, 17(2):127–134, March 1971. [49] T. Berger. Explicit bounds to R(D) for a binary symmetric Markov source. Information Theory, IEEE Transactions on, 23(1):52–59, Jan 1977. [50] S. Shamai, S. Verd´ u, and R. Zamir. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inform. Theory, 44(2):564– 579, March 1998.

154

BIBLIOGRAPHY

[51] S. Vembu, S. Verd´ u, and Y. Steinberg. The source-channel separation theorem revisited. IEEE Trans. Inform. Theory, 41(1):44–54, January 1995. [52] R. Gray. Sliding-block source coding. Information Theory, IEEE Transactions on, 21(4):357–368, Jul 1975. [53] R. M. Gray, D. L. Neuhoff, and D. S. Ornstein. Nonblock source coding with a fidelity criterion. The Annals of Probability, 3:478–491, 1975. [54] G. M. Gemelos, S. Sigurjonsson, and T. Weissman. Universal minimax discrete denoising under channel uncertainty. IEEE Trans. Inform. Theory, 52(8):3476– 3497, 2006. [55] G. M. Gemelos, S. Sigurjonsson, and T. Weissman. Algorithms for discrete denoising under channel uncertainty. IEEE Trans. Signal Processing, 54(6):2263– 2276, 2006. [56] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero. Distributed video coding. Proceedings of the IEEE, 93(1):71–83, January 2005. [57] S. S. Pradhan and K. Ramchandran. Distributed source coding using syndromes (DISCUS). IEEE Trans. Inform. Theory, 49(3), 2003. [58] D. Rebollo-Monedero, R. Zhang, and B. Girod. Design of optimal quantizers for distributed source coding. In Proc. Data Compression Conference, Snowbird, UT, March 2003. [59] S. D. Servetto. Lattice quantization with side information. In Proc. Data Compression Conference, Snowbird, UT, March 2000. [60] Y. Yang, S. Cheng, Z. Xiong, and W. Zhao. Wyner-Ziv coding based on TCQ and LDPC codes. In Proc. Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, November 2003. [61] R. Zamir, S. Shamai, , and U. Erez. Nested linear/lattice codes for structured multiterminal binning. IEEE Trans. Inform. Theory, 48(6):1250–1276, June 2002.

BIBLIOGRAPHY

155

[62] N. Merhav and J. Ziv. On the wyner-ziv problem for individual sequences. IEEE Trans. Inform. Theory, 52(3):867–873, March 2006. [63] J. Yu and S. Verd´ u. Schemes for bi-directional modeling of discrete stationary sources. IEEE Trans. Inform. Theory, 52(11):4789–4807, 2006. [64] E. Ordentlich, M.J. Weinberger, and T. Weissman. Multi-directional context sets with applications to universal denoising and compression. In Information Theory, 2005. ISIT 2005. Proceedings. International Symposium on, pages 1270– 1274, Sept. 2005. [65] E. Ordentlich, M.J. Weinberger, and T. Weissman. Efficient pruning of bidirectional context trees with applications to universal denoising and compression. In Proc. IEEE Int. Symp. Inform. Theory, San Antonio, TX, October 2004. [66] J. Kieffer. A method for proving multiterminal source coding theorems. IEEE Trans. Inform. Theory, 27(5), 1981. [67] T. Berger. Multiterminal source coding. New York: Springer-Verlag, 1977. [68] A. Kaspi and T. Berger. Rate-distortion for correlated sources with partially separated encoders. IEEE Trans. Inform. Theory, 28(6):828–840, November 1982. [69] S. Y. Tung. Multiterminal source coding. PhD thesis, Cornell University, Ithaca, NY, 1977. [70] L. Ozarow. On a source coding problem with two channels and three receivers. Bell Syst. Tech. J., 59(10):1909–1921, Dec. 1980. [71] H. Witsenhausen. On source networks with minimal breakdown degradation. Bell Syst. Tech. J., 59(6):1083–1087, July-Aug. 1980. [72] H. S. Witsenhausen and A. D. Wyner. Source coding for multiple descriptions ii: A binary source. In Bell Lab. Tech. Rep. TM-80-1217, Dec. 1980.

156

BIBLIOGRAPHY

[73] J. Wolf, A. Wyner, and J. Ziv. Source coding for multiple descriptions. Bell Syst. Tech. J., 59(8):1417–1426, Oct. 1980. [74] A.E. Gamal and T. Cover. Achievable rates for multiple descriptions. Information Theory, IEEE Transactions on, 28(6):851–857, Nov 1982. [75] Zhen Zhang and T. Berger. New results in binary multiple descriptions. Information Theory, IEEE Transactions on, 33(4):502–521, Jul 1987. [76] R. Venkataramani, G. Kramer, and V.K. Goyal. Multiple description coding with many channels. Information Theory, IEEE Transactions on, 49(9):2106–2114, Sept. 2003. [77] Lei Zhao, Paul Cuff, and Haim Permuter. Consolidating achievable regions of multiple descriptions. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on, pages 51–54, 28 2009-July 3 2009. [78] J. Chen, C. Tian, and S. Diggavi. Multiple description coding for stationary and ergodic sources. In Proc. Data Compression Conference (DCC), pages 73–82, 2007. [79] M. Fleming and M. Effros. The rate distortion region for the multiple description problem. In Information Theory, 2000. Proceedings. IEEE International Symposium on, page 208, 2000. [80] Haixiao Cai, S.R. Kulkarni, and S. Verdu. An algorithm for universal lossless compression with side information. Information Theory, IEEE Transactions on, 52(9):4008–4016, Sept. 2006. [81] G. Motta, E. Ordentlich, and M.J. Weinberger. Defect list compression. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 1000–1004, July 2008. [82] K. Sivaramakrishnan and T. Weissman. Universal denoising of discrete-time continuous-amplitude signals. IEEE Trans. Inform. Theory, 54(12):5632–5660, December 2008.

157

BIBLIOGRAPHY

[83] K. Sivaramakrishnan and T. Weissman. A context quantization approach to universal denoising. IEEE Trans. Signal Processing, (12). to appear in June 2009 issue of IEEE Trans. Sig. Proc. [84] T. Moon and T. Weissman. mitted

to

IEEE

Trans.

Discrete denoising with shifts.

Inform.

Theory,

Aug.

2007,

sub-

available

at

http://arxiv.org/abs/0708.2566v1. [85] E. Plotnik, M.J. Weinberger, and J. Ziv. Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the lempel-ziv algorithm. Information Theory, IEEE Transactions on, 38(1):66–72, Jan 1992. [86] R. Gray, D. Neuhoff, and J. Omura. Process definitions of distortion-rate functions and source coding theorems. Information Theory, IEEE Transactions on, 21(5):524–532, Sep 1975. [87] P. Bremaud. Markov chains, Gibbs fields, Monte Carlo simulation, and queues. Springer, New York, 1991. [88] W. Hoeffding. Probability inequalities for sums of bounded random vaiables. Journal of the American Statistical Association, 58(301):13–30, March 1963. [89] Paul C. Shields. The Ergodic Theory of Discrete Sample Paths. Amer Mathematical Society, July 1996. [90] T. Weissman and A. El Gamal. Source coding with limited-look-ahead side information at the decoder. Information Theory, IEEE Transactions on, 52(12):5218– 5239, Dec. 2006. [91] S. Verdu and T. Weissman. The information lost in erasures. Information Theory, IEEE Transactions on, 54(11):5030–5058, Nov. 2008.

implementable schemes for lossy compression a ...

both algorithms gets close to the rate-distortion curve for a number of ..... on hard drives. While a typical application of lossless compression algorithms is in compressing text files, lossy compression is not usually useful in coding texts. .... as denoising a sequence corrupted through a memoryless channel with the help of a.

975KB Sizes 0 Downloads 210 Views

Recommend Documents

Information Rates and Data-Compression Schemes for ...
The author is with the Department of System Science, University of. California, Los .... for R(D), and study practical data-compression schemes. It is worthwhile to ...

an approach to lossy image compression using 1 ... - Semantic Scholar
In this paper, an approach to lossy image compression using 1-D wavelet transforms is proposed. The analyzed image is divided in little sub- images and each one is decomposed in vectors following a fractal Hilbert curve. A Wavelet Transform is thus a

Denoising via MCMC-based Lossy Compression
denoising, for the setting of a stationary ergodic source corrupted by additive .... vides an alternative and more implicit method for learning the required source ..... To each reconstruction sequence yn ∈ ˆXn, assign energy. E(yn). [Hk(yn) + ...

an approach to lossy image compression using 1 ... - Semantic Scholar
images are composed by 256 grayscale levels (8 bits- per-pixel resolution), so an analysis for color images can be implemented using this method for each of ...

Lossy Compression of Discrete Sources via The Viterbi ...
California Institute of Technology, Pasadena, CA 91125 USA (e-mail: [email protected]), ...... [30] S.B. Korada and R.L. Urbanke. Polar codes are optimal for ...

Deconstructing the Frankenpredictor for Implementable ...
reinterpreting perceptron weights with non-linear translation functions and using different sized tables with different widths of counters shows promise and is ...

Implementable Quantitative Research
puting, since quantitative tools are very efficient in locating outliers ... approach based on statistics and data mining. The new field of .... Big Is Not Always Better!

A predictive modeling schemes for wear in tribometers
High surface to volume ratio. ❖ High operating speeds. ❖ Wear is a critical factor which can limit their life span. SFB 499 Model System: Micro-planetary gear ...

Entanglement-Enhanced Sensing in a Lossy and Noisy ...
Mar 20, 2015 - Here, we experimentally demonstrate an entanglement-enhanced sensing system that is resilient to quantum decoherence. We employ ... loss degrades to 1 dB in a system with 6 dB of loss. Under ideal conditions, N00N .... pair of DMs that

Characterizing Optimal Rates for Lossy Coding ... - Semantic Scholar
degree, for reasons described in the proof below. Although this result is ..... IEEE Symposium on Foundations of Computer Science, 2003. [11] N. H. Bshouty, Y.

Color Schemes
Name. Period ______. Color Schemes. Define Color Scheme: 1. The first color schemes is: Definition: Examples of colors: 2. The second color scheme is:.

A Scheme for Attentional Video Compression
of low level features pertaining to the degree of dissimilarity between a .... rameters of salient and non-salient MBs to achieve compression, i.e, reduce rate.

A Scheme for Attentional Video Compression
In this paper an improved, macroblock (MB) level, visual saliency algorithm ... of low level features pertaining to degree of dissimilarity between a region and.

A Scheme for Attentional Video Compression
ity, to yield probabalistic values which form the saliency map. These ... steps for computing the saliency map. ... 2.4 Learning to Integrate the Feature Maps.

Reconfigurable Path Restoration Schemes for MPLS ... - CiteSeerX
(Received November 09, 2008 / Accepted April 26, 2009). 1 Introduction. The Internet is based on a connectionless, unreliable service, which implies no delivery ...

NUMERICAL DISPERSIVE SCHEMES FOR THE ...
To recover the dispersive properties of the solutions at the discrete level, we ... nonlinear problems with L2-initial data, without additional regularity hypotheses. ... Project CIT-370200-2005-10 in the PROFIT program and the SIMUMAT project ...

Discretization schemes for fractional-order ...
This work was supported in part by U.S. Army Automo- ... (CSOIS), Department of Electrical and Computer Engineering, College of .... 365. Fig. 1. Recursive Tustin discretization of s at T = 0:001 s. A. Al-Alaoui Operator Based Discretization.

Reconfigurable Path Restoration Schemes for MPLS ... - CiteSeerX
(Received November 09, 2008 / Accepted April 26, 2009). 1 Introduction. The Internet is based on a connectionless, unreliable service, which implies no delivery ...

Discretization schemes for fractional-order ...
fractional order in the differentiator or integrator. It should be pointed ... Here we introduce the so-called Muir-recursion originally used in geophysical data.

Data Compression
Data Compression. Page 2. Huffman Example. ASCII. A 01000001. B 01000010. C 01000011. D 01000100. E 01000101. A 01. B 0000. C 0001. D 001. E 1 ...

lifting-based paraunitary filterbanks for lossy/lossless ...
speech, audio and video compression, statistical signal processing, discrete ... 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, ...

Data Compression Algorithms for Energy ... - Margaret Martonosi
Data Compression Algorithms for Energy-Constrained Devices in .... 1000000. 10000000. CC2420. CC1000. XTend. Radio. In s tru c tio n. C y c le s fo r. S a m e.