
Lossy Compression of Discrete Sources via The Viterbi Algorithm Shirin Jalali, Member, IEEE, Andrea Montanari, and Tsachy Weissman, Senior Member, IEEE

Abstract—We present a new lossy compressor for finite-alphabet sources. For coding a sequence $x^n$, the encoder starts by assigning a certain cost to each possible reconstruction sequence. It then finds the one that minimizes this cost and describes it losslessly to the decoder via a universal lossless compressor. The cost of each sequence is a linear combination of its distance from the sequence $x^n$ and a linear function of its $k$th order empirical distribution. The structure of the cost function allows the encoder to employ the Viterbi algorithm to find the sequence with minimum cost. We identify a choice of the coefficients used in the cost function which ensures that the algorithm universally achieves the optimum rate-distortion performance for any stationary ergodic source, in the limit of large $n$, provided that $k$ diverges as $o(\log n)$. Iterative techniques for approximating the coefficients, which alleviate the computational burden of finding the optimal coefficients, are proposed and studied.

Index Terms—Rate-distortion coding, universal compression, Viterbi algorithm.

I. INTRODUCTION

Consider the problem of universal lossy compression of stationary ergodic sources described as follows. Let $\mathbf{X} = \{X_i;\ i \in \mathbb{N}^+\}$ be a stochastic process. Consider a family of source codes $\{\mathcal{C}_n\}_{n \geq 1}$. Each code $\mathcal{C}_n$ in this family consists of an encoder $f_n$ and a decoder $g_n$ such that
$$f_n: \mathcal{X}^n \to \{0,1\}^*, \quad \text{and} \quad g_n: \{0,1\}^* \to \hat{\mathcal{X}}^n,$$
where $\mathcal{X}$ and $\hat{\mathcal{X}}$ denote the source and reconstruction alphabets, respectively. Both $\mathcal{X}$ and $\hat{\mathcal{X}}$ are assumed to be finite throughout the paper. Here, $\{0,1\}^*$ denotes the set of all finite-length binary sequences. Encoder $f_n$ maps each source block $X^n$ to a finite-length binary sequence, and the decoder $g_n$ maps the coded bits back to the signal space as $\hat{X}^n = g_n(f_n(X^n))$. Let $l_n(f_n(X^n))$ denote the length of the binary sequence assigned to sequence $X^n$ by encoder $f_n$. The performance of a code in this family is measured by its expected rate, i.e., the expected number of coded bits per source symbol, and by the expected average distortion between source and reconstruction sequences. For a given source $\mathbf{X}$ and coding scheme $\mathcal{C}_n$, the expected rate $R_n$ and expected average distortion $D_n$ of $\mathcal{C}_n$ in coding the process $\mathbf{X}$ are defined as
$$R_n \triangleq \mathrm{E}\left[\frac{l_n(f_n(X^n))}{n}\right],$$
and
$$D_n \triangleq \mathrm{E}[d_n(X^n, \hat{X}^n)],$$
where $\hat{X}^n = g_n(f_n(X^n))$,
$$d_n(x^n, \hat{x}^n) \triangleq \frac{1}{n}\sum_{i=1}^n d(x_i, \hat{x}_i),$$
and $d: \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ is a per-letter distortion measure. For a given stationary ergodic process $\mathbf{X}$ and a rate $R \geq 0$, the minimum achievable distortion (cf. [1] for the exact definition of achievability) is characterized as [2], [3], [4]
$$D(R, \mathbf{X}) = \lim_{n \to \infty} \min_{p(\hat{X}^n \mid X^n):\ I(X^n;\hat{X}^n) \leq nR} \mathrm{E}[d_n(X^n, \hat{X}^n)].$$
Similarly, for each distortion $D \geq 0$, the minimum required rate for achieving distortion $D$, $R(D, \mathbf{X})$, can be characterized as
$$R(D, \mathbf{X}) = \min_{D(r,\mathbf{X}) \leq D} r.$$

Universal lossy compression codes are usually defined in the literature in one of the following modes [5]:

I. Fixed-rate: A family of universal lossy compression codes at rate $R$ consists of a family of codes $\{\mathcal{C}_n\}$ such that, for every stationary ergodic process $\mathbf{X}$, $R_n \leq R$ for all $n \geq 1$, and $\limsup_n D_n = D(R, \mathbf{X})$.

II. Fixed-distortion: A family of universal lossy compression codes at distortion $D$ consists of a family of codes $\{\mathcal{C}_n\}$ such that, for every stationary ergodic process $\mathbf{X}$, $D_n \leq D$ for all $n \geq 1$, and $\limsup_n R_n = R(D, \mathbf{X})$.

This paper was presented in part at DCC 2009, Snowbird, Utah, and ITW 2009, Volos, Greece. S. Jalali is with the Center for the Mathematics of Information, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: [email protected]). A. Montanari is with the Departments of Electrical Engineering and Statistics, Stanford University, Stanford, CA 94305 USA (e-mail: [email protected]). T. Weissman is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA (e-mail: [email protected]).

III. Fixed-slope: A family of universal lossy compression codes at slope $-\alpha$, $\alpha > 0$, consists of a family of codes $\{\mathcal{C}_n\}$ such that, for every stationary ergodic process $\mathbf{X}$,
$$\limsup_n\,[R_n + \alpha D_n] = \min_{D \geq 0}[R(D, \mathbf{X}) + \alpha D].$$

Existence of universal lossy compression codes for all these cases has already been established in the literature a long time



ago [6], [7], [8], [9], [10], [11]. The unsettled and challenging step is to design universal lossy compression algorithms that are implementable and appealing from a practical viewpoint.

A. Related prior work

Unlike lossless compression, where there exist a number of well-known universal algorithms which are also attractive from a practical perspective (cf. the Lempel-Ziv algorithm [12] or the arithmetic coding algorithm [13]), in lossy compression, despite all the progress in recent years, no such algorithms are yet known. In this section, we briefly review some of the related literature on universal lossy compression, with the main emphasis on the progress towards the design of practically appealing algorithms.

There have been different approaches towards designing universal lossy compression algorithms. Among them, the one with the longest history is that of tuning the well-known universal lossless compression algorithms to work for the lossy case as well. For instance, Cheung and Wei [14] extended the move-to-front transform to the case where the reconstruction is not required to perfectly match the original sequence. One basic tool used in LZ-type compression algorithms is the idea of string matching, and hence there have been many attempts to find optimal approximate string matching. Morita and Kobayashi [15] proposed a lossy version of the LZW algorithm, and Steinberg and Gutman [16] suggested a fixed-database lossy compression algorithm based on string matching. Although these extensions can all be implemented efficiently, they were later proved to be sub-optimal, even for memoryless sources [17]. Another related example is the work of Luczak and Szpankowski, who proposed another suboptimal compression algorithm that again uses the ideas of approximate pattern matching [18]. For some other related work, see [19], [20], [21].

Another well-studied approach to lossy compression is trellis coded quantization [22] and, more generally, vector quantization (cf. [23], [24] and the references therein). Codes of this type are usually designed for a given distribution encountered in a specific application. For example, such codes are used in image compression (JPEG) or video compression (MPEG). Nevertheless, there have been attempts at extending such codes to more general settings. For instance, Kasner, Marcellin, and Hunt proposed universal trellis coded quantization, which is used in the JPEG2000 standard [25].

There has been a lot of progress in recent years in designing non-universal lossy compression algorithms for finite-alphabet memoryless sources. Wainwright and Maneva [26] proposed a lossy compression algorithm based on message-passing ideas. The effectiveness of the scheme was shown by simulations. Gupta and Verdú proposed an algorithm based on non-linear sparse-graph codes [27]. Another algorithm with near-linear complexity was suggested by Gupta, Verdú and Weissman in [28]. The algorithm is based on a 'divide and conquer' strategy. It breaks the source sequence into sub-blocks and codes the subsequences separately using a random codebook. Finally, the capacity-achieving polar codes proposed by Arikan for channel coding [29] are shown to be optimal for lossy compression of binary-symmetric memoryless sources in [30]. More recently,

Mao et al. proposed a trellis-encoded source coding algorithm for i.i.d. sources [31].

Chou, Lookabaugh and Gray proposed the idea of fixed-slope rate-distortion coding in [32], and later this idea was employed by Yang, Zhang and Berger in [5] to construct a universal lossy compression algorithm. They proposed a generic fixed-slope universal algorithm which leads to specific coding algorithms based on different universal lossless compression algorithms. Although the constructed algorithms are all universal, they involve computationally demanding minimizations, and hence are impractical. In [5], the authors considered lowering the search complexity by choosing appropriate lossless codes which allow replacing the required exhaustive search by a low-complexity sequential search scheme that approximates the solution of the required minimization. However, these schemes only find an approximation of the optimal solution.

In a recent work [33], a new implementable algorithm for fixed-slope lossy compression of finite-alphabet sources was proposed. Although the algorithm involves a minimization which resembles a specific realization of the generic cost proposed in [5], it is somewhat different. The reason is that the cost used in [33] cannot be derived directly from a lossless compression algorithm. The advantage of the new cost function is that, rather naturally, it lends itself to Gibbs simulated annealing in that the computational effort involved in each iteration is modest. It was shown that using a universal lossless compressor to describe the reconstruction sequence found by the annealing process to the decoder results in a scheme which is universal in the limit of many iterations and large block length. The drawback of the proposed scheme is that, although its computational complexity per iteration is independent of the block length $n$ and linear in a parameter $k_n = o(\log n)$, there is no useful bound on the number of iterations required for convergence.

In this paper, motivated by the algorithm proposed in [33], we propose another approach to fixed-slope lossy compression of finite-alphabet sources. We start by making a linear approximation of the cost function used in [33]. The new cost assigned to each possible reconstruction sequence consists of a linear combination of two terms: a linear function of its empirical distribution plus its distance from the source sequence. We show that there exist proper coefficients such that minimizing the linearized cost function results in the same performance as minimizing the original cost. The advantage of the modified cost is that its minimizer can be found efficiently using the Viterbi algorithm. Application of the Viterbi algorithm to lossy compression of stationary ergodic sources with continuous alphabets has been explored in [34]. The proposed algorithm and the analysis approach in [34] are different from the approach and the algorithm suggested in this paper.

B. Organization of this paper

The organization of the paper is as follows. In Section II, the empirical distribution and the empirical conditional entropy of a sequence are defined, and some of their properties are studied. Section III reviews the fixed-slope universal lossy compression algorithm which is based on exhaustive search and


is used in [33]. Section IV describes a new fixed-slope lossy compression algorithm derived by replacing part of the cost function used in the mentioned exhaustive-search algorithm with a linear function. We prove that, using appropriate coefficients for the linear function, the two algorithms have identical performance. The advantage of this modified cost is discussed in Section V, where we show that the minimizer of the new cost can be found using the Viterbi algorithm. In Section VI, a heuristic low-complexity iterative approach for approximating the optimal coefficients is presented. In Section VII, we propose another optimization problem whose solution gives rise to asymptotically tight approximations of the optimal coefficients. This method, along with the results of Section IV, results in a fixed-slope universal lossy compression algorithm that achieves the rate-distortion performance for any finite-alphabet stationary ergodic source. The proposed optimization for approximating the coefficients, however, is computationally demanding, and hence impractical. Section VIII presents some simulation results and, finally, Section IX concludes the paper with a discussion of some future directions.

C. Notation

$\mathbb{N}^+$ and $\mathbb{R}^+$ denote the set of positive integers and positive reals, respectively. Calligraphic letters such as $\mathcal{X}$ and $\mathcal{Y}$ represent sets. The size of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. Random variables are represented by capital letters such as $X$ and $Y$. For a random variable $X$, $\mathcal{X}$ denotes its alphabet. Bold letters such as $\mathbf{X}$ and $\mathbf{x}$ refer to vectors and matrices. The $i$th element of vector $\mathbf{x}$ is denoted by $x_i$. For a vector $\mathbf{x}$ of length $n$, and $1 \leq i_1 \leq i_2 \leq n$, $x_{i_1}^{i_2} \triangleq (x_{i_1}, x_{i_1+1}, \ldots, x_{i_2})$, and $x^{i_1} \triangleq x_1^{i_1}$. For vectors $\mathbf{u}$ and $\mathbf{v}$, both in $\mathbb{R}^n$, $\|\mathbf{u}-\mathbf{v}\|_1$ denotes the $\ell_1$ distance between $\mathbf{u}$ and $\mathbf{v}$, defined as $\|\mathbf{u}-\mathbf{v}\|_1 \triangleq \sum_{i=1}^n |u_i - v_i|$. If vectors $\mathbf{u}$ and $\mathbf{v}$ represent probability distribution vectors ($u_i \geq 0$, $v_i \geq 0$, and $\sum_i u_i = \sum_i v_i = 1$), the total variation distance between $\mathbf{u}$ and $\mathbf{v}$ is defined as $\|\mathbf{u}-\mathbf{v}\|_{\mathrm{TV}} \triangleq 0.5\|\mathbf{u}-\mathbf{v}\|_1$. For length-$n$ vectors $\mathbf{u}$ and $\mathbf{v}$, $\langle \mathbf{u}, \mathbf{v} \rangle$ denotes their inner product, i.e., $\langle \mathbf{u}, \mathbf{v} \rangle \triangleq \sum_{i=1}^n u_i v_i$. For two matrices $\mathbf{A} = [a_{i,j}]$ and $\mathbf{B} = [b_{i,j}]$ of the same dimensions $m \times n$, define the Frobenius inner product of $\mathbf{A}$ and $\mathbf{B}$ as
$$\mathbf{A} \odot \mathbf{B} \triangleq \sum_{i=1}^m \sum_{j=1}^n a_{i,j}\, b_{i,j}.$$

II. CONDITIONAL EMPIRICAL ENTROPY AND ITS PROPERTIES

For each $y^n \in \mathcal{Y}^n$, the $|\mathcal{Y}| \times |\mathcal{Y}|^k$ matrix $\mathbf{m}(y^n)$ denotes its $(k+1)$th order empirical distribution. For $b^k \in \mathcal{Y}^k$ and $\beta \in \mathcal{Y}$, the element in the $\beta$th row and the $b^k$th column of matrix $\mathbf{m}$, $m_{\beta,b^k}$, is defined as
$$m_{\beta,b^k}(y^n) \triangleq \frac{1}{n}\left|\{1 \leq i \leq n : y_{i-k}^{i-1} = b^k,\ y_i = \beta\}\right|,$$
where here and throughout the paper we assume a cyclic convention whereby $y_i = y_{i+n}$ for $i \leq 0$. Based on the distribution induced by $\mathbf{m}(y^n)$, define the $k$th order conditional empirical entropy of $y^n$, $H_k(y^n)$, as
$$H_k(y^n) \triangleq H(Z_{k+1} \mid Z^k), \tag{1}$$
where $Z^{k+1}$ is distributed according to $\mathbf{m}$, i.e.,
$$\mathrm{P}\left(Z^{k+1} = [b_1, \ldots, b_k, \beta]\right) = m_{\beta,b^k}(y^n).$$

For a probability mass function (pmf) $\mathbf{v} = (v_1, \ldots, v_\ell)$, let $h(\mathbf{v}) \triangleq \sum_{i=1}^{\ell} -v_i \log v_i$ denote the entropy of $\mathbf{v}$. For a vector $\mathbf{v}$ with non-negative components, let $H(\mathbf{v})$ denote the entropy of the random variable whose pmf is proportional to $\mathbf{v}$. Formally,
$$H(\mathbf{v}) = \begin{cases} h\!\left(\dfrac{\mathbf{v}}{\|\mathbf{v}\|_1}\right) & \text{if } \mathbf{v} \neq \mathbf{0}, \\ 0 & \text{if } \mathbf{v} = \mathbf{0}, \end{cases} \tag{2}$$
where $0 \log 0 = 0$ by convention. With this notation, the conditional empirical entropy $H_k(y^n)$ defined in (1) is readily seen to be expressible in terms of $\mathbf{m}(y^n)$ as
$$H_k(y^n) = H(\mathbf{m}(y^n)) = \sum_{b^k} \langle \mathbf{1}, \mathbf{m}_{\cdot,b^k} \rangle\, H(\mathbf{m}_{\cdot,b^k}),$$
where $\mathbf{1}$ and $\mathbf{m}_{\cdot,b^k}$ denote an all-one vector of length $|\mathcal{Y}|$ and the column of $\mathbf{m}$ indexed by $b^k$, i.e.,
$$\mathbf{m}_{\cdot,b^k} \triangleq \left(\{m_{\beta,b^k}\}_{\beta \in \mathcal{Y}}\right),$$
respectively.

Remark 1: Note that $H_k(\cdot)$ has a discrete domain, while the domain of $H(\cdot)$ is continuous. In other words, if we define $\mathcal{M}_k$ to denote the set of all $|\mathcal{Y}| \times |\mathcal{Y}|^k$ matrices with positive real entries adding up to one, then $H_k: \mathcal{Y}^n \to [0,1]$, but $H: \mathcal{M}_k \to [0,1]$.

The conditional empirical entropy of sequences, $H_k(\cdot)$, plays a key role in our results. Hence, in the following two subsections, we focus on this function and study some of its properties.
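To make the definitions above concrete, the following is a minimal Python sketch (ours, not part of the paper) of the count matrix $\mathbf{m}(y^n)$ with the cyclic convention, and of $H_k(y^n)$ accumulated column by column as in the expression above; the function names and the nested-dictionary representation of $\mathbf{m}$ are our own choices.

```python
# Sketch of the (k+1)-th order empirical distribution m(y^n) and of H_k(y^n).
from collections import defaultdict
from math import log2


def empirical_counts(y, k):
    """Return m as {context b^k (tuple): {symbol beta: probability}}."""
    n = len(y)
    m = defaultdict(lambda: defaultdict(float))
    for i in range(n):
        context = tuple(y[(i - k + j) % n] for j in range(k))  # y_{i-k}^{i-1}, cyclic
        m[context][y[i]] += 1.0 / n
    return m


def conditional_empirical_entropy(y, k):
    """H_k(y^n): sum over contexts of <1, m_col> times the column entropy."""
    m = empirical_counts(y, k)
    hk = 0.0
    for col in m.values():
        col_sum = sum(col.values())              # <1, m_{.,b^k}>
        for p in col.values():
            if p > 0:
                hk += -p * log2(p / col_sum)     # weighted column entropy (bits)
    return hk


if __name__ == "__main__":
    y = [0, 1, 1, 0, 1, 0, 0, 1]
    print(conditional_empirical_entropy(y, k=2))
```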

A. Concavity

We prove that, like the standard entropy function, the conditional empirical entropy is a concave function. By definition,
$$H(\mathbf{m}) = \sum_{b^k \in \mathcal{Y}^k} \langle \mathbf{1}, \mathbf{m}_{\cdot,b^k} \rangle\, H(\mathbf{m}_{\cdot,b^k}),$$
where $H(\cdot)$ is defined in (2). We need to show that for any $(\theta_1, \theta_2) \in [0,1]^2$ such that $\theta_1 + \theta_2 = 1$, and matrices $\mathbf{m}^{(1)}$ and $\mathbf{m}^{(2)}$ with non-negative components adding up to one,
$$\theta_1 H(\mathbf{m}^{(1)}) + \theta_2 H(\mathbf{m}^{(2)}) \leq H(\theta_1 \mathbf{m}^{(1)} + \theta_2 \mathbf{m}^{(2)}).$$
For each $b^k \in \mathcal{Y}^k$,
$$\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle H(\mathbf{m}^{(1)}_{\cdot,b^k}) + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle H(\mathbf{m}^{(2)}_{\cdot,b^k}) = \left(\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle\right) \sum_{i \in \{1,2\}} \frac{\theta_i \langle \mathbf{1}, \mathbf{m}^{(i)}_{\cdot,b^k} \rangle}{\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle}\, H(\mathbf{m}^{(i)}_{\cdot,b^k}). \tag{3}$$
For $i \in \{1,2\}$, let
$$\theta_{i,b^k} \triangleq \frac{\theta_i \langle \mathbf{1}, \mathbf{m}^{(i)}_{\cdot,b^k} \rangle}{\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle}.$$


Note that $0 \leq \theta_{i,b^k} \leq 1$, and $\theta_{1,b^k} + \theta_{2,b^k} = 1$. Hence, by the concavity of the entropy function $h(\cdot)$, it follows that
$$\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle H(\mathbf{m}^{(1)}_{\cdot,b^k}) + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle H(\mathbf{m}^{(2)}_{\cdot,b^k}) = \left(\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle\right) \sum_{i \in \{1,2\}} \theta_{i,b^k}\, h\!\left(\frac{\mathbf{m}^{(i)}_{\cdot,b^k}}{\langle \mathbf{1}, \mathbf{m}^{(i)}_{\cdot,b^k} \rangle}\right) \leq \left(\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle\right) h\!\left(\sum_{i \in \{1,2\}} \frac{\theta_{i,b^k}\, \mathbf{m}^{(i)}_{\cdot,b^k}}{\langle \mathbf{1}, \mathbf{m}^{(i)}_{\cdot,b^k} \rangle}\right). \tag{4}$$
But,
$$\sum_{i \in \{1,2\}} \frac{\theta_{i,b^k}\, \mathbf{m}^{(i)}_{\cdot,b^k}}{\langle \mathbf{1}, \mathbf{m}^{(i)}_{\cdot,b^k} \rangle} = \frac{\sum_{i \in \{1,2\}} \theta_i\, \mathbf{m}^{(i)}_{\cdot,b^k}}{\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle}. \tag{5}$$
Combining (3), (4), (5) and the fact that the function $H$ is scale-independent, we get
$$\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle H(\mathbf{m}^{(1)}_{\cdot,b^k}) + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle H(\mathbf{m}^{(2)}_{\cdot,b^k}) \leq \left(\theta_1 \langle \mathbf{1}, \mathbf{m}^{(1)}_{\cdot,b^k} \rangle + \theta_2 \langle \mathbf{1}, \mathbf{m}^{(2)}_{\cdot,b^k} \rangle\right) H(\theta_1 \mathbf{m}^{(1)}_{\cdot,b^k} + \theta_2 \mathbf{m}^{(2)}_{\cdot,b^k}). \tag{6}$$
Summing up both sides of (6) over all $b^k \in \mathcal{Y}^k$ yields the desired result.

B. Stationarity condition

Let $p(y^{k+1})$ be a given pmf defined on $\mathcal{Y}^{k+1}$. Under what condition(s) does there exist a stationary process whose $(k+1)$th order distribution is equal to $p$?

Lemma 1: The necessary and sufficient condition for $\{p(y^{k+1})\}_{y^{k+1} \in \mathcal{Y}^{k+1}}$ to represent the $(k+1)$th order marginal distribution of a stationary process is
$$\sum_{\beta \in \mathcal{Y}} p(\beta, y^k) = \sum_{\beta \in \mathcal{Y}} p(y^k, \beta), \tag{7}$$
for all $y^k \in \mathcal{Y}^k$.

Proof: i. Necessity: The necessity of (7) is just a direct result of the stationarity of the process. If $p(y^{k+1})$ is to represent the $(k+1)$th order marginal distribution of a stationary process $\mathbf{Y} = \{Y_i\}$, then it should be consistent with the $k$th order marginal distribution. Hence, (7) should hold.
ii. Sufficiency: In order to prove the sufficiency, we assume that (7) holds, and build a stationary process with $(k+1)$th order marginal distribution equal to $p(y^{k+1})$. Let $\mathbf{Y} = \{Y_i\}_i$ be a Markov chain of order $k$ whose transition probabilities are defined as
$$\mathrm{P}(Y_{k+1} = y_{k+1} \mid Y^k = y^k) \triangleq q(y_{k+1} \mid y^k) \triangleq \frac{p(y^{k+1})}{p(y^k)},$$
where
$$p(y^k) \triangleq \sum_{\beta \in \mathcal{Y}} p(y^k, \beta).$$
Now, given (7), it is easy to check that $p(y^{k+1})$ is the $(k+1)$th order stationary distribution of the defined Markov chain. Therefore, $\mathbf{Y}$ is a stationary process with the desired marginal distribution.

Throughout the paper, we refer to the condition stated in (7) as the stationarity condition.

Corollary 1: For any $|\mathcal{Y}| \times |\mathcal{Y}|^k$ matrix $\mathbf{m}$ corresponding to the $(k+1)$th order empirical distribution of some $y^n \in \mathcal{Y}^n$, there exists a stationary process whose marginal distribution coincides with $\mathbf{m}$.

Proof: From Lemma 1, we only need to show that (7) holds, i.e.,
$$\sum_{\beta \in \mathcal{Y}} m_{\beta,b^k} = \sum_{\beta \in \mathcal{Y}} m_{b_k, [\beta, b_1, \ldots, b_{k-1}]}, \tag{8}$$
for all $b^k \in \mathcal{Y}^k$. But this holds because both sides of (8) are equal to $|\{i : 1 \leq i \leq n,\ y_{i-k+1}^{i} = b^k\}|/n$.

III. EXHAUSTIVE SEARCH ALGORITHM

Consider the following lossy source coding algorithm. Given $\alpha > 0$, for encoding sequence $x^n \in \mathcal{X}^n$, find $\hat{x}^n$ such that
$$\hat{x}^n = \arg\min_{y^n \in \hat{\mathcal{X}}^n} [H_k(y^n) + \alpha d_n(x^n, y^n)], \tag{9}$$
and describe $\hat{x}^n$ using the Lempel-Ziv coding algorithm. As proved in [5], [33], the described compressor is a fixed-slope universal lossy compression algorithm. That is, if $X^n$ is generated by the source $\mathbf{X}$, and $\hat{X}^n$ is the minimizer of (9) for the input $X^n$,
$$\frac{1}{n}\ell_{\mathrm{LZ}}(\hat{X}^n) + \alpha d_n(X^n, \hat{X}^n) \to \min_{D \geq 0}[R(D, \mathbf{X}) + \alpha D], \tag{10}$$
almost surely. Here $\ell_{\mathrm{LZ}}$ denotes the length of the codeword assigned to $\hat{X}^n$ by the Lempel-Ziv algorithm [12]. Clearly, given the size of the search space, this is not an implementable algorithm.

An approach for approximating the solution of (9) using Markov chain Monte Carlo methods has been suggested in [33]. One problem with the MCMC-based algorithms is that no useful bound is yet known on the required number of iterations. Moreover, the performance of the algorithm depends on the selected cooling process. There exist cooling schedules with guaranteed convergence, but they are very slow, and hence impractical. On the other hand, if we use faster cooling processes, there is a risk of getting stuck in a local minimum and missing the optimum solution.

The goal of this paper is to propose a new approach for approximating the solution of (9). This new approach, as we show later, suggests a new implementable algorithm for lossy compression. The main idea is to use a linear approximation of the conditional entropy function, $H(\mathbf{m})$, at some point $\mathbf{m}_0$, and to prove that if $\mathbf{m}_0$ is chosen appropriately, then we can replace the exhaustive-search algorithm with the Viterbi algorithm without changing the final performance.
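As an illustration of the scale of the search in (9), the following is a small self-contained Python sketch (ours, with Hamming distortion and illustrative names) that evaluates the cost $H_k(y^n) + \alpha d_n(x^n, y^n)$ for every candidate $y^n$; it is only feasible for toy block lengths, which is precisely the issue the rest of the paper addresses.

```python
# Brute-force illustration of the exhaustive-search rule (9) for very small n.
from itertools import product
from math import log2


def h_k(y, k):
    """Conditional empirical entropy of y^n (cyclic convention), in bits."""
    n = len(y)
    counts = {}
    for i in range(n):
        ctx = tuple(y[(i - k + j) % n] for j in range(k))
        counts.setdefault(ctx, {}).setdefault(y[i], 0)
        counts[ctx][y[i]] += 1
    hk = 0.0
    for col in counts.values():
        tot = sum(col.values())
        for c in col.values():
            hk += -(c / n) * log2(c / tot)
    return hk


def exhaustive_encode(x, k, alpha, alphabet=(0, 1)):
    """arg min over all y^n of H_k(y^n) + alpha * d_n(x^n, y^n); O(|alphabet|^n)."""
    n = len(x)
    best, best_cost = None, float("inf")
    for y in product(alphabet, repeat=n):
        d = sum(a != b for a, b in zip(x, y)) / n        # Hamming distortion
        cost = h_k(y, k) + alpha * d
        if cost < best_cost:
            best, best_cost = y, cost
    return best, best_cost


if __name__ == "__main__":
    x = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
    print(exhaustive_encode(x, k=2, alpha=1.0))
```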


IV. LINEARIZED COST FUNCTION

Consider the problems (P1) and (P2) described in (11) and (12), respectively. (P1) corresponds to the optimization required by the exhaustive-search lossy compression scheme described in (9). The difference between (P1) and (P2) is that the term corresponding to the conditional empirical entropy, a highly non-linear function of $\mathbf{m}$, in (P1) is replaced by a linear function of $\mathbf{m}$ in (P2):
$$\text{(P1)}: \quad \min_{y^n \in \hat{\mathcal{X}}^n} \left[H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)\right], \tag{11}$$
and
$$\text{(P2)}: \quad \min_{y^n \in \hat{\mathcal{X}}^n} \left[\sum_{\beta \in \hat{\mathcal{X}}} \sum_{b^k \in \hat{\mathcal{X}}^k} \lambda_{\beta,b^k}\, m_{\beta,b^k}(y^n) + \alpha d_n(x^n, y^n)\right], \tag{12}$$
where $\{\lambda_{\beta,b^k}\}_{\beta,b^k}$ is a set of real-valued coefficients.

We are interested in answering the following question: Is it possible to choose the set of coefficients $\{\lambda_{\beta,b^k}\}$, $(\beta, b^k) \in \hat{\mathcal{X}} \times \hat{\mathcal{X}}^k$, such that (P1) and (P2) have the same set of minimizers, or such that the set of minimizers of (P2) is a subset of the set of minimizers of (P1)? The reason we are interested in answering this question is that if the answer is affirmative, then instead of solving (P1) one can solve (P2), which, as we describe in Section V, can be done efficiently via the Viterbi algorithm.

Let $\mathcal{S}_1$ and $\mathcal{S}_2$ denote the sets of minimizers of (P1) and (P2), respectively. Consider some $\hat{x}^n \in \mathcal{S}_1$, let $\mathbf{m}^*_n = \mathbf{m}(\hat{x}^n)$, and define the coefficients used in (P2) as
$$\lambda_{\beta,b^k} = \left.\frac{\partial H(\mathbf{m})}{\partial m_{\beta,b^k}}\right|_{\mathbf{m}^*_n} = \log\left(\frac{\sum_{\beta' \in \hat{\mathcal{X}}} m^*_{\beta',b^k}}{m^*_{\beta,b^k}}\right). \tag{13}$$
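For concreteness, the following hedged Python sketch (ours) evaluates (13) from a count matrix represented, as in the earlier sketch, as a nested dictionary {context $b^k$: {symbol $\beta$: probability}}; the flooring of zero entries is our implementation choice, since (13) is only evaluated at strictly positive entries.

```python
# Sketch of the coefficients (13): lambda_{beta,b^k} = log(column sum / entry).
from math import log2

FLOOR = 1e-12  # avoid log of zero for symbols never observed in a context


def coefficients(m):
    """Return {context: {symbol: lambda}} computed from the count matrix m."""
    lam = {}
    for ctx, col in m.items():
        col_sum = sum(col.values())
        lam[ctx] = {}
        for beta, p in col.items():
            lam[ctx][beta] = log2(col_sum / max(p, FLOOR))
    return lam
```

With these coefficients, the linear term in (P2) is the tangent (first-order) approximation of $H(\mathbf{m})$ at $\mathbf{m}^*_n$, which is exactly what the proof of Theorem 1 below exploits.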

Theorem 1: If the coefficients used in (P2) are chosen according to (13), then (P1) and (P2) have equal minimum costs. Moreover, $\mathcal{S}_2 \subseteq \mathcal{S}_1$, and $\mathcal{S}_2$ contains all the sequences $w^n \in \mathcal{S}_1$ with $\mathbf{m}(w^n) = \mathbf{m}^*_n$.

Proof: Since, as proved earlier, $H(\mathbf{m})$ is concave in $\mathbf{m}$, for any empirical count matrix $\mathbf{m}$ we have
$$H(\mathbf{m}) \leq H(\mathbf{m}^*_n) + \sum_{\beta,b^k} \left.\frac{\partial H(\mathbf{m})}{\partial m_{\beta,b^k}}\right|_{\mathbf{m}^*_n} (m_{\beta,b^k} - m^*_{\beta,b^k}) = H(\mathbf{m}^*_n) + \sum_{\beta,b^k} \lambda_{\beta,b^k}\,(m_{\beta,b^k} - m^*_{\beta,b^k}) \triangleq \hat{H}(\mathbf{m}). \tag{14}$$
Replacing $\mathbf{m}$ in (14) with $\mathbf{m}(y^n)$, and adding $\alpha d_n(x^n, y^n)$ to both sides of (14), we conclude that for any $y^n \in \hat{\mathcal{X}}^n$,
$$H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n) \leq \hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n). \tag{15}$$
Assume that $z^n \in \mathcal{S}_1$ is such that $\mathbf{m}(z^n) = \mathbf{m}^*_n$ (for instance, $z^n = \hat{x}^n$). Taking the minimum of both sides of (15) yields
$$\min_{y^n} [H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] \leq \min_{y^n} [\hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] \leq \hat{H}(\mathbf{m}(z^n)) + \alpha d_n(x^n, z^n) = H(\mathbf{m}(z^n)) + \alpha d_n(x^n, z^n) = \min_{y^n} [H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)].$$
Therefore,
$$\min_{y^n} [H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] = \min_{y^n} [\hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)],$$
i.e., (P1) and (P2) have the same minimum costs. For any sequence $w^n$ with $\mathbf{m}(w^n) \neq \mathbf{m}^*_n$, by strict concavity of $H(\mathbf{m})$,
$$\hat{H}(\mathbf{m}(w^n)) + \alpha d_n(x^n, w^n) > H(\mathbf{m}(w^n)) + \alpha d_n(x^n, w^n) \geq \min_{y^n} [H_k(y^n) + \alpha d_n(x^n, y^n)].$$
Hence, the empirical count matrices of all the sequences in $\mathcal{S}_2$, i.e., all the minimizers of (P2) for the selected coefficients, are equal to $\mathbf{m}^*_n$. Let $w^n \in \mathcal{S}_2$. We prove that $w^n \in \mathcal{S}_1$ as well. As we just proved, $\mathbf{m}(w^n) = \mathbf{m}(z^n) = \mathbf{m}^*_n$. Moreover, since both $z^n$ and $w^n$ belong to $\mathcal{S}_2$,
$$\min_{y^n} [\hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] = \hat{H}(\mathbf{m}(w^n)) + \alpha d_n(x^n, w^n) = \hat{H}(\mathbf{m}(z^n)) + \alpha d_n(x^n, z^n).$$
Therefore, $d_n(x^n, w^n) = d_n(x^n, z^n)$, and consequently,
$$H_k(w^n) + \alpha d_n(w^n, x^n) = H_k(z^n) + \alpha d_n(z^n, x^n) = \min_{y^n} [H_k(y^n) + \alpha d_n(y^n, x^n)],$$
which proves that $w^n \in \mathcal{S}_1$ and concludes the proof.

Theorem 1 states that if the optimal type $\mathbf{m}^*_n$ is known, then the desired coefficients can be computed according to (13), and solving (P2) instead of (P1) using the computed coefficients finds a minimizer of (P1). In Section V, we describe how, for a given set of coefficients, (P2) can be solved efficiently using the Viterbi algorithm. The problem of course is that the optimal type $\mathbf{m}^*_n$ required for computing the desired coefficients is not known to the encoder (since knowledge of $\mathbf{m}^*_n$ seems to require solving (P1), which is the problem we are trying to avoid). In Section VII, we introduce another optimization problem whose solution is a good approximation of $\mathbf{m}^*_n$ and hence, when substituted in (13), of the desired coefficients $\{\lambda_{\beta,b^k}\}$.

V. VITERBI CODER

In this section, we show that for a given set of coefficients $\{\lambda_{\beta,b^k}\}$, (P2) can be solved efficiently via the Viterbi algorithm [35], [36].


Consider rewriting the linearized cost used in (P2) as follows:
$$\sum_{b^k \in \hat{\mathcal{X}}^k,\ \beta \in \hat{\mathcal{X}}} \lambda_{\beta,b^k}\, m_{\beta,b^k}(y^n) + \alpha d_n(x^n, y^n) = \frac{1}{n}\sum_{i=1}^{n} \left[\lambda_{y_i,\, y_{i-k}^{i-1}} + \alpha d(x_i, y_i)\right] = \frac{1}{n}\sum_{i=1}^{n-k} \left[\lambda_{y_i,\, y_{i-k}^{i-1}} + \alpha d(x_i, y_i)\right] + \frac{1}{n}\sum_{i=n-k+1}^{n} \left[\lambda_{y_i,\, y_{i-k}^{i-1}} + \alpha d(x_i, y_i)\right]. \tag{16}$$
The advantage of this alternative representation is that, fixing the last $k = o(\log n)$ terms of $y^n$ as $y_{n-k+1}^n = x_{n-k+1}^n$, we can find the minimizer of the first term in (16) efficiently using the Viterbi algorithm. The Viterbi algorithm is a well-known dynamic programming optimization method for finding the path of minimum weight in a trellis diagram.

For $i = 1, 2, \ldots, n-k$, let
$$s_i \triangleq y_{i-k}^{i}$$
represent the state at time $i$. (Note that by our convention, $y_i = y_{i+n}$ for $i \leq 0$.) Also, let $\mathcal{S} \triangleq \{b^{k+1} : b^{k+1} \in \hat{\mathcal{X}}^{k+1}\}$ denote the set of all possible states. From this definition, the state at time $i$, $s_i$, is a function of the state at time $i-1$, $s_{i-1}$, and $y_i$. In other words, $s_i = g(s_{i-1}, y_i)$, where $g: \mathcal{S} \times \hat{\mathcal{X}} \to \mathcal{S}$. This representation leads to a trellis diagram corresponding to the evolution of the states $\{s_i\}_{i=1}^{n-k}$ in which each state has $|\hat{\mathcal{X}}|$ states leading to it and $|\hat{\mathcal{X}}|$ states branching out from it. In this representation, there is a one-to-one correspondence between sequences $y^n \in \hat{\mathcal{X}}^n$ and sequences of states $\{s_i\}_{i=1}^{n-k}$. To an edge $e = (s', s)$ connecting states $s'$ and $s = b^{k+1}$ at time $i$, we assign the weight $w_i(e) = w_i(s', s)$ defined as
$$w_i(e) \triangleq \lambda_{b_{k+1},\, b^k} + \alpha d(x_i, b_{k+1}).$$
The path of minimum weight in the corresponding trellis diagram, i.e., the path $\{s_i\}_{i=1}^{n-k}$ that minimizes $\sum_{i=1}^{n-k} w_i(e_i)$, where $e_i = (s_{i-1}, s_i)$, minimizes the first term in (16), i.e.,
$$\sum_{i=1}^{n-k} \left[\lambda_{y_i,\, y_{i-k}^{i-1}} + \alpha d(x_i, y_i)\right].$$

Solving this minimization can readily be done by the Viterbi algorithm, which can be described as follows. For each state $s \in \mathcal{S}$, let $\mathcal{L}(s)$ denote the $|\hat{\mathcal{X}}|$ states leading to $s$, and for each $i \in \{1, 2, \ldots, n-k\}$, let
$$C_i(s) \triangleq \min_{s' \in \mathcal{L}(s)} \left[w_i(s', s) + C_{i-1}(s')\right].$$
At time $i \in \{1, 2, \ldots, n-k\}$, for each state $s$, this procedure defines a path of length $i-1$ leading to $s$ which is the path of minimum weight among all possible paths from time $1$ to time $i$ ending at $s$. After computing $\{C_i(s)\}$ for all $s \in \mathcal{S}$ and $i \in \{1, 2, \ldots, n-k\}$, at time $i = n-k$, let
$$s^* = \arg\min_{s \in \mathcal{S}} C_{n-k}(s).$$
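The following self-contained Python sketch (ours, not the paper's implementation) carries out this dynamic program for a binary reconstruction alphabet with Hamming distortion. To keep it short, it uses a free initial context and does not apply the cyclic convention or fix the last $k$ symbols, so its output matches the minimizer of (P2) only up to boundary effects; the coefficient dictionary has the same format as the earlier sketch of (13), and `default_lambda`, used for context-symbol pairs absent from $\mathbf{m}^*_n$, is our own device.

```python
# Sketch of a Viterbi solver for (P2) with states = k most recent symbols.
from itertools import product


def viterbi_encode(x, lam, k, alpha, alphabet=(0, 1), default_lambda=20.0):
    """Minimize sum_i [lambda_{y_i, y_{i-k}^{i-1}} + alpha * d(x_i, y_i)]; assumes k >= 1."""
    n = len(x)
    states = list(product(alphabet, repeat=k))     # all contexts b^k
    cost = {s: 0.0 for s in states}                # free initial context (simplification)
    back = []                                      # backpointers, one dict per time step
    for i in range(n):
        new_cost, step_back = {}, {}
        for prev in states:
            for b in alphabet:                     # candidate reconstruction symbol y_i
                w = lam.get(prev, {}).get(b, default_lambda) + alpha * (x[i] != b)
                s = prev[1:] + (b,)                # next context
                c = cost[prev] + w
                if s not in new_cost or c < new_cost[s]:
                    new_cost[s], step_back[s] = c, (prev, b)
        cost = new_cost
        back.append(step_back)
    # backtrack from the cheapest final state
    s = min(cost, key=cost.get)
    total = cost[s]
    y = []
    for step_back in reversed(back):
        s, b = step_back[s]
        y.append(b)
    return list(reversed(y)), total / n            # reconstruction, per-symbol linearized cost
```

Its running time is $O(n|\hat{\mathcal{X}}|^{k+1})$, in line with Remark 2 below.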

It is not hard to see that the path leading to $s^*$ is the path of minimum weight among all possible paths, assuming that $y_{n-k+1}^n = x_{n-k+1}^n$. Therefore, given the set of coefficients $\{\lambda_{\beta,b^k}\}$, solving (P2) is straightforward using the Viterbi algorithm. The problem is finding an approximation of the optimal coefficients. In the next section we propose a heuristic low-complexity iterative method for approximating the coefficients. The effectiveness of this approach is discussed in our simulation results. Then, in Section VII, we propose another approach for approximating the coefficients which is asymptotically optimal. However, it involves solving a concave minimization problem of high dimension, and hence is only interesting from a theoretical point of view.

Remark 2: The number of states in the trellis diagram, $|\hat{\mathcal{X}}|^{k+1}$, grows exponentially in $k$. Therefore, the computational complexity of the Viterbi algorithm is of order $O(n|\hat{\mathcal{X}}|^{k+1})$.

Remark 3: Here, to apply the Viterbi algorithm, we fixed the last $k = o(\log n)$ positions in $y^n$. The reason is that we are using a cyclic convention in our definitions of empirical distributions. We adopted this definition because it simplifies the expression of some of our results (for instance Corollary 1). However, with regular, non-cyclic counting, i.e., $m_{\beta,b^k}(y^n) = |\{k+1 \leq i \leq n : y_{i-k}^{i} = b^k\beta\}|/(n-k)$, the results of the paper continue to hold, which shows that fixing these $k$ positions does not change the asymptotic performance of the algorithm.

VI. APPROXIMATING THE OPTIMAL COEFFICIENTS

As we discussed in Section IV, if the optimal coefficients are known, solving (P2) is equivalent to solving (P1). But while finding the minimizer of (P1) requires exponential computational complexity, (P2) can be solved efficiently via the Viterbi algorithm. In this section we propose a heuristic method with moderate computational complexity for approximating the optimal coefficients.

First, assume that the desired distortion is small, or equivalently that $\alpha$ is large. In that case, the distance between the original sequence $x^n$ and its quantized version $\hat{x}^n$ should be small. Therefore, their types, i.e., their $(k+1)$th order empirical distributions, are close. Hence, the coefficients computed based on $\mathbf{m}(x^n)$ provide a reasonable approximation of the coefficients calculated at $\mathbf{m}^*_n$. This implies that if our desired distortion is small, one possibility is to compute the type of the input sequence and evaluate the coefficients at $\mathbf{m}(x^n)$.

In the case where the desired distortion is not very small, we can use an iterative approach as follows. Start with $\mathbf{m}(x^n)$. Evaluate the coefficients from (13) at $\mathbf{m}(x^n)$. Employ the Viterbi algorithm to solve (P2) at the computed coefficients. Let $\hat{x}^n$ denote the output sequence. Compute $\mathbf{m}(\hat{x}^n)$, and recalculate the coefficients using (13), this time at $\mathbf{m}(\hat{x}^n)$. Again, use the Viterbi algorithm to solve (P2) at the updated coefficients. Iterate.


For a conditional empirical distribution matrix $\mathbf{m}$, define its corresponding $|\hat{\mathcal{X}}| \times |\hat{\mathcal{X}}|^k$ coefficient matrix $\Lambda(\mathbf{m})$, where $\lambda_{\beta,b^k}$ is defined as in (13). Succinctly, the iterative approach can now be described as follows. For $t = 0$, let $y^{n,(0)} = x^n$. For $t = 1, 2, \ldots$,
$$\Lambda^{(t)} = \Lambda(\mathbf{m}(y^{n,(t-1)})),$$
$$y^{n,(t)} = \arg\min_{z^n \in \hat{\mathcal{X}}^n} \left[\Lambda^{(t)} \odot \mathbf{m}(z^n) + \alpha d_n(x^n, z^n)\right].$$
Stop as soon as $y^{n,(t)} = y^{n,(t-1)}$.

For a given sequence $x^n$ and slope $\alpha$, assign to each sequence $y^n \in \hat{\mathcal{X}}^n$ the energy
$$E(y^n) = H_k(y^n) + \alpha d_n(x^n, y^n). \tag{17}$$
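A compact Python sketch of this loop (ours), reusing the hypothetical helpers `empirical_counts`, `coefficients`, and `viterbi_encode` from the earlier sketches; the optional `y0` argument anticipates the warm-starting used in Remark 4 below.

```python
# Sketch of the fixed-slope iteration: re-linearize the cost at the current
# reconstruction via (13) and solve (P2) with the Viterbi algorithm, stopping
# at a fixed point. Assumes the helper sketches from Sections II, IV and V.
def iterative_encode(x, k, alpha, y0=None, max_iter=100):
    y = list(x) if y0 is None else list(y0)         # y^{n,(0)}
    for _ in range(max_iter):
        lam = coefficients(empirical_counts(y, k))  # Lambda^{(t)}
        y_next, _ = viterbi_encode(x, lam, k, alpha)
        if y_next == y:                             # y^{n,(t)} == y^{n,(t-1)}
            return y
        y = y_next
    return y
```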

As mentioned before, the goal is to find the sequence with minimum energy. Theorem 2 below gives some justification of how the described approach serves this purpose. It shows that, through the iterations, the energy level of the output decreases at each step. Moreover, since the number of energy levels is finite, it proves that the algorithm converges in a finite number of iterations.

Theorem 2: Let $\hat{x}^n \in \hat{\mathcal{X}}^n$, $\hat{\mathbf{m}} = \mathbf{m}(\hat{x}^n)$, $\hat{\Lambda} = \Lambda(\hat{\mathbf{m}})$, and
$$\tilde{x}^n = \arg\min_{z^n \in \hat{\mathcal{X}}^n} \left[\hat{\Lambda} \odot \mathbf{m}(z^n) + \alpha d_n(x^n, z^n)\right].$$
Then,
$$E(\tilde{x}^n) \leq E(\hat{x}^n). \tag{18}$$

Proof: Let $\tilde{\mathbf{m}} = \mathbf{m}(\tilde{x}^n)$. From the concavity of $H(\mathbf{m})$ in $\mathbf{m}$,
$$H(\tilde{\mathbf{m}}) \leq H(\hat{\mathbf{m}}) + \hat{\Lambda} \odot (\tilde{\mathbf{m}} - \hat{\mathbf{m}}). \tag{19}$$
On the other hand,
$$\hat{\Lambda} \odot \hat{\mathbf{m}} = \sum_{\beta,b^k} \hat{\lambda}_{\beta,b^k}\, \hat{m}_{\beta,b^k} = \sum_{\beta,b^k} \hat{m}_{\beta,b^k} \log\left(\frac{\sum_{\beta' \in \hat{\mathcal{X}}} \hat{m}_{\beta',b^k}}{\hat{m}_{\beta,b^k}}\right) = H(\hat{\mathbf{m}}). \tag{20}$$
Therefore, combining (19) and (20) yields
$$H(\tilde{\mathbf{m}}) \leq \left(H(\hat{\mathbf{m}}) - \hat{\Lambda} \odot \hat{\mathbf{m}}\right) + \hat{\Lambda} \odot \tilde{\mathbf{m}} = \hat{\Lambda} \odot \tilde{\mathbf{m}}. \tag{21}$$
Adding a constant term to both sides of (21), we get
$$E(\tilde{x}^n) = H(\tilde{\mathbf{m}}) + \alpha d_n(x^n, \tilde{x}^n) \leq \hat{\Lambda} \odot \tilde{\mathbf{m}} + \alpha d_n(x^n, \tilde{x}^n). \tag{22}$$
But, since $\tilde{x}^n$ is assumed to be a minimizer of (P2) for the computed coefficients $\hat{\Lambda}$, we have
$$\hat{\Lambda} \odot \tilde{\mathbf{m}} + \alpha d_n(x^n, \tilde{x}^n) \leq \hat{\Lambda} \odot \hat{\mathbf{m}} + \alpha d_n(x^n, \hat{x}^n) = H(\hat{\mathbf{m}}) + \alpha d_n(x^n, \hat{x}^n) = E(\hat{x}^n). \tag{23}$$
Therefore, combining (22) and (23) yields the desired result, i.e., $E(\tilde{x}^n) \leq E(\hat{x}^n)$.

Corollary 2: For the described iterative algorithm, at each $t \geq 0$, $E(y^{n,(t+1)}) \leq E(y^{n,(t)})$.

Proof: In Theorem 2, let $\hat{x}^n = y^{n,(t)}$ and $\tilde{x}^n = y^{n,(t+1)}$.

Remark 4: In the described iterative algorithm, for any slope $\alpha$, we assumed that the algorithm starts at $y^{n,(0)} = x^n$. However, as mentioned earlier, only for large values of $\alpha$ does $\mathbf{m}(x^n)$ provide a reasonable approximation of the desired type $\mathbf{m}^*_n$. Hence, in order to address this issue, we can slightly modify the algorithm as follows. The idea is that instead of starting at $y^{n,(0)} = x^n$ for all values of $\alpha$, we can gradually decrease the slope to our desired value and use the final output of each step as the initial point for the next step. More explicitly, for any given $\alpha_0$, start from some large slope $\alpha_{\max}$ (corresponding to very low distortion). Run the previous iterative algorithm and find $\hat{x}^n(\alpha_{\max})$. Pick some integer $N_\alpha$, and define
$$\Delta\alpha \triangleq \frac{\alpha_{\max} - \alpha_0}{N_\alpha}.$$
Again run the iterative algorithm, but this time at $\alpha = \alpha_{\max} - \Delta\alpha$. Now, instead of starting from $y^{n,(0)} = x^n$, initialize $y^{n,(0)} = \hat{x}^n(\alpha_{\max})$. Repeat this process $N_\alpha$ times. That is, at the $r$th step, $r = 1, \ldots, N_\alpha$, run the algorithm at $\alpha = \alpha_{\max} - r\Delta\alpha$, and initialize $y^{n,(0)}$ at $\hat{x}^n(\alpha_{\max} - (r-1)\Delta\alpha)$. At the final step $\alpha = \alpha_0$, and we have a reasonable quantized version of $x^n$ for initialization.
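A sketch (ours) of the slope schedule described in Remark 4, expressed in terms of the hypothetical `iterative_encode` helper above; `alpha_max` and the number of steps are illustrative parameters.

```python
# Gradually decrease the slope from alpha_max to alpha0 in N_alpha steps,
# warm-starting each run with the output of the previous one, as in Remark 4.
def annealed_encode(x, k, alpha0, alpha_max=4.0, n_alpha=8):
    delta = (alpha_max - alpha0) / n_alpha      # Delta_alpha
    y = iterative_encode(x, k, alpha_max)       # x-hat^n(alpha_max)
    for r in range(1, n_alpha + 1):
        y = iterative_encode(x, k, alpha_max - r * delta, y0=y)
    return y
```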

A. Further insight on (P2)

To gain more insight on (P2), for the given sequence $x^n$ and the coefficient matrix $\Lambda = [\lambda_{\beta,b^k}]_{\beta,b^k}$, define
$$\phi(x^n, \Lambda) \triangleq \min_{y^n \in \hat{\mathcal{X}}^n} \left[\sum_{\beta,b^k} \lambda_{\beta,b^k}\, m_{\beta,b^k}(y^n) + \alpha d_n(x^n, y^n)\right] = \min_{y^n \in \hat{\mathcal{X}}^n} \left[\Lambda \odot \mathbf{m}(y^n) + \alpha d_n(x^n, y^n)\right]. \tag{24}$$
Since $\phi(x^n, \Lambda)$ is the minimum of $|\hat{\mathcal{X}}|^n$ affine functions of $\Lambda$, it is a concave function. To each sequence $y^n \in \hat{\mathcal{X}}^n$, assign a coefficient matrix $\Lambda(y^n) = \Lambda(\mathbf{m}(y^n))$, such that for $\beta \in \hat{\mathcal{X}}$ and $b^k \in \hat{\mathcal{X}}^k$, as in (13),
$$\lambda_{\beta,b^k} = \left.\frac{\partial H(\mathbf{m})}{\partial m_{\beta,b^k}}\right|_{\mathbf{m}(y^n)}. \tag{25}$$
Let $\mathcal{L}_d$ denote the set of all such coefficient matrices, i.e., $\mathcal{L}_d \triangleq \{\Lambda(y^n) : y^n \in \hat{\mathcal{X}}^n\}$.

Similarly, to each possible conditional distribution matrix $\mathbf{m}$ on $\hat{\mathcal{X}}^{k+1}$ which satisfies the stationarity condition defined in Section II-B, assign a matrix $\Lambda(\mathbf{m})$ defined according to (25). Let $\mathcal{L}_c$ denote the set of all such coefficient matrices, i.e.,
$$\mathcal{L}_c \triangleq \Big\{\Lambda(\mathbf{m}) : m_{\beta,b^k} \geq 0,\ \sum_{\beta,b^k} m_{\beta,b^k} = 1,\ \sum_{\beta \in \hat{\mathcal{X}}} m_{\beta,b^k} = \sum_{\beta \in \hat{\mathcal{X}}} m_{b_k, [\beta, b_1, \ldots, b_{k-1}]},\ \forall\, b^k \in \hat{\mathcal{X}}^k\Big\}.$$


Note that while $\mathcal{L}_d$ is a discrete set (consisting of no more than $|\hat{\mathcal{X}}|^n$ elements), $\mathcal{L}_c$ is a continuous set. For a sequence $x^n$, let
$$\hat{x}^n = \hat{x}^n(x^n) = \arg\min_{y^n \in \hat{\mathcal{X}}^n} E(y^n),$$
and
$$\hat{\Lambda}(x^n) \triangleq \Lambda(\hat{x}^n).$$
Note that $\hat{\Lambda}(x^n)$ is the optimal coefficient matrix required for replacing (P1) with (P2) without any penalty in the performance.

Lemma 2: For each sequence $x^n \in \mathcal{X}^n$,
$$\min_{y^n} E(y^n) = \min_{\Lambda \in \mathcal{L}_d} \phi(x^n, \Lambda) = \min_{\Lambda \in \mathcal{L}_c} \phi(x^n, \Lambda), \tag{26}$$
and $\min_{\Lambda \in \mathcal{L}_d} \phi(x^n, \Lambda)$ is achieved at $\hat{\Lambda}(x^n)$.

Proof: By Theorem 1, since $\hat{x}^n = \arg\min_{y^n} E(y^n)$, we have
$$\phi(x^n, \hat{\Lambda}(x^n)) = \phi(x^n, \Lambda(\hat{x}^n)) = E(\hat{x}^n). \tag{27}$$
Hence,
$$\min_{\Lambda \in \mathcal{L}_c} \phi(x^n, \Lambda) \leq \min_{\Lambda \in \mathcal{L}_d} \phi(x^n, \Lambda) \leq \phi(x^n, \hat{\Lambda}(x^n)), \tag{28}$$
where the first inequality holds because $\mathcal{L}_d \subset \mathcal{L}_c$. On the other hand, for each $\Lambda \in \mathcal{L}_c$, let $\tilde{x}^n(\Lambda)$ denote the minimizer of $\Lambda \odot \mathbf{m}(y^n) + \alpha d_n(x^n, y^n)$. As shown in the proof of Theorem 2,
$$H(\mathbf{m}(\tilde{x}^n(\Lambda))) \leq \Lambda \odot \mathbf{m}(\tilde{x}^n(\Lambda)). \tag{29}$$
Therefore, adding $\alpha d_n(x^n, \tilde{x}^n(\Lambda))$ to both sides of (29) yields
$$E(\tilde{x}^n(\Lambda)) \leq \Lambda \odot \mathbf{m}(\tilde{x}^n(\Lambda)) + \alpha d_n(x^n, \tilde{x}^n(\Lambda)) = \phi(x^n, \Lambda).$$
On the other hand, by assumption, $\hat{x}^n$ is a minimizer of $E(y^n)$. Hence,
$$E(\hat{x}^n) \leq E(\tilde{x}^n(\Lambda)).$$
Therefore,
$$E(\hat{x}^n) \leq \min_{\Lambda \in \mathcal{L}_c} \phi(x^n, \Lambda). \tag{30}$$
Finally, combining (27), (28) and (30) yields the desired result.

Remark 5: Another, slightly different, proof of (26) is as follows. Note that
$$\min_{\Lambda \in \mathcal{L}_c} \min_{y^n} \left(\Lambda \odot \mathbf{m}(y^n) + \alpha d_n(x^n, y^n)\right) = \min_{y^n} \min_{\Lambda \in \mathcal{L}_c} \left(\Lambda \odot \mathbf{m}(y^n) + \alpha d_n(x^n, y^n)\right). \tag{31}$$
But $H(\mathbf{m}(y^n)) \leq \Lambda \odot \mathbf{m}(y^n)$ for any $\Lambda \in \mathcal{L}_c$, and the lower bound is tight at $\Lambda(y^n)$. Therefore,
$$\min_{\Lambda \in \mathcal{L}_c} \min_{y^n} \left(\Lambda \odot \mathbf{m}(y^n) + \alpha d_n(x^n, y^n)\right) = \min_{y^n} \left(H_k(y^n) + \alpha d_n(x^n, y^n)\right). \tag{32}$$
Since $\Lambda(y^n) \in \mathcal{L}_d$, we can replace $\mathcal{L}_c$ by $\mathcal{L}_d$ in (32) and still get the same result.

From Lemma 2, instead of finding the optimal coefficients from a discrete optimization problem, i.e., $\hat{\Lambda} = \arg\min_{\Lambda \in \mathcal{L}_d} \phi(x^n, \Lambda)$, one can solve a continuous optimization problem over a function of relatively low dimension, i.e., $\hat{\Lambda} = \arg\min_{\Lambda \in \mathcal{L}_c} \phi(x^n, \Lambda)$.

VII. OPTIMAL COEFFICIENTS

As mentioned in Section IV, there exists a set of coefficients for which (P1) and (P2) have the same value. However, computing the desired coefficients requires the knowledge of $\mathbf{m}^*_n$, which is not available without solving (P1). In order to alleviate this issue, in this section we introduce another optimization problem that gives an asymptotically tight approximation of $\mathbf{m}^*_n$, and therefore a reasonable approximation of the set of coefficients.

For a given sequence $x^n$ and a given order $k$, let $\mathcal{M}^{(k)} = \mathcal{M}^{(k)}(x^n)$ be the set of all jointly stationary probability distributions on $(\mathcal{X}^k, \hat{\mathcal{X}}^k)$ (in the sense of Lemma 1) such that their marginal distributions with respect to $\mathcal{X}$ coincide with the $k$th order empirical distribution induced by $x^n$, defined as
$$\hat{p}^{(k)}_{[x^n]}(a^k) \triangleq \frac{\left|\{1 \leq i \leq n : (x_{i-k+1}, \ldots, x_i) = a^k\}\right|}{n} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{x_{i-k+1}^{i} = a^k},$$
where $a^k \in \mathcal{X}^k$. More specifically, a distribution $p^{(k)}$ in $\mathcal{M}^{(k)}$ should satisfy the following two constraints:

1) Stationarity condition: as described in Section II-B, for any $a^{k-1} \in \mathcal{X}^{k-1}$ and $b^{k-1} \in \hat{\mathcal{X}}^{k-1}$,
$$\sum_{(a_k, b_k) \in \mathcal{X} \times \hat{\mathcal{X}}} p^{(k)}(a^k, b^k) = \sum_{(a_k, b_k) \in \mathcal{X} \times \hat{\mathcal{X}}} p^{(k)}(a_k a^{k-1}, b_k b^{k-1}).$$

2) Consistency: for each $a^k \in \mathcal{X}^k$,
$$\sum_{b^k \in \hat{\mathcal{X}}^k} p^{(k)}(a^k, b^k) = \hat{p}^{(k)}_{[x^n]}(a^k).$$

For given $x^n$, $k$ and $\ell > k$, consider the following optimization problem:
$$\min\ H(\hat{X}_{k+1} \mid \hat{X}^k) + \alpha\, \mathrm{E}[d(X_1, \hat{X}_1)] \quad \text{s.t.} \quad (X^\ell, \hat{X}^\ell) \sim p^{(\ell)},\ p^{(\ell)} \in \mathcal{M}^{(\ell)}. \tag{33}$$

Remark 6: Note that the rate-distortion function of a stationary ergodic process $\mathbf{X}$ has the following representation [37]:
$$R(D, \mathbf{X}) = \inf\{\bar{H}(\hat{\mathbf{X}}) : (\mathbf{X}, \hat{\mathbf{X}}) \text{ jointly stationary and ergodic, and } \mathrm{E}[d(X_0, \hat{X}_0)] \leq D\} = \inf_{k \geq 1} \inf\{H(\hat{X}_{k+1} \mid \hat{X}^k) : (\mathbf{X}, \hat{\mathbf{X}}) \text{ jointly stationary and ergodic, and } \mathrm{E}[d(X_0, \hat{X}_0)] \leq D\}, \tag{34}$$


where $\bar{H}(\hat{\mathbf{X}})$ denotes the entropy rate of the stationary ergodic process $\hat{\mathbf{X}}$, i.e.,
$$\bar{H}(\hat{\mathbf{X}}) \triangleq \lim_{n \to \infty} H(\hat{X}_{n+1} \mid \hat{X}^n).$$
This representation gives the motivating intuition behind the optimization described in (33). It shows that (33) is basically performing the search required by (34).

Using the properties of the set $\mathcal{M}^{(\ell)}$ and the definition of conditional empirical entropy, (33) can be written more explicitly as
$$\begin{aligned} \min\quad & H(\mathbf{m}) + \alpha \sum_{(a,b) \in \mathcal{X} \times \hat{\mathcal{X}}} d(a,b)\, q(a,b) \\ \text{s.t.}\quad & 0 \leq p^{(\ell)}(a^\ell, b^\ell) \leq 1, \quad \forall\, a^\ell \in \mathcal{X}^\ell,\ b^\ell \in \hat{\mathcal{X}}^\ell, \\ & \sum_{a^\ell, b^\ell} p^{(\ell)}(a^\ell, b^\ell) = 1, \\ & \sum_{(a_\ell, b_\ell) \in \mathcal{X} \times \hat{\mathcal{X}}} p^{(\ell)}(a^\ell, b^\ell) = \sum_{(a_\ell, b_\ell) \in \mathcal{X} \times \hat{\mathcal{X}}} p^{(\ell)}(a_\ell a^{\ell-1}, b_\ell b^{\ell-1}), \quad \forall\, a^{\ell-1} \in \mathcal{X}^{\ell-1},\ b^{\ell-1} \in \hat{\mathcal{X}}^{\ell-1}, \\ & \sum_{b^\ell \in \hat{\mathcal{X}}^\ell} p^{(\ell)}(a^\ell, b^\ell) = \hat{p}^{(\ell)}_{[x^n]}(a^\ell), \quad \forall\, a^\ell \in \mathcal{X}^\ell, \\ & q(a,b) = \sum_{a^{\ell-1} \in \mathcal{X}^{\ell-1},\ b^{\ell-1} \in \hat{\mathcal{X}}^{\ell-1}} p^{(\ell)}(a\, a^{\ell-1}, b\, b^{\ell-1}), \quad \forall\, (a,b) \in \mathcal{X} \times \hat{\mathcal{X}}, \\ & m_{\beta,b^k} = \sum_{a^\ell \in \mathcal{X}^\ell,\ c^{\ell-k-1} \in \hat{\mathcal{X}}^{\ell-k-1}} p^{(\ell)}(a^\ell,\, b^k \beta\, c^{\ell-k-1}), \quad \forall\, \beta \in \hat{\mathcal{X}},\ b^k \in \hat{\mathcal{X}}^k. \end{aligned} \tag{35}$$
The optimization in (35) is done over $\mathcal{M}^{(\ell)}$, i.e., the joint distributions $p^{(\ell)}$ of $(X^\ell, \hat{X}^\ell)$. Let $\hat{\mathcal{P}}^*_n$ denote the set of minimizers of (35), and $\hat{\mathcal{S}}^*_n$ be their $(k+1)$th order marginalized versions with respect to $\hat{\mathbf{X}}$. Let $\{\hat{\lambda}_{\beta,b^k}\}_{\beta,b^k}$ be the coefficients evaluated at some $\hat{\mathbf{m}}^*_n \in \hat{\mathcal{S}}^*_n$ using (13). Let $\mathbf{X}$ be a stationary ergodic source, and $R(\mathbf{X}, D)$ denote its rate-distortion function. Finally, let $\hat{X}^n$ be the reconstruction sequence obtained by solving (P2) (recall (12)) at the evaluated coefficients.

Theorem 3: If $k = k_n = o(\log n)$, $\ell = \ell_n = o(n^{1/4})$ and $k = o(\ell)$ such that $k_n, \ell_n \to \infty$ as $n \to \infty$, then for any stationary ergodic source,
$$H_k(\hat{X}^n) + \alpha d_n(X^n, \hat{X}^n) \xrightarrow{n \to \infty} \min_{D \geq 0} \left[R(\mathbf{X}, D) + \alpha D\right], \tag{36}$$
almost surely.

The proof of Theorem 3 is presented in Appendix A.

Remark 7: Theorem 3 implies the fixed-slope universality of the scheme which performs the lossless compression of the reconstruction by first describing its count matrix (costing a number of bits which is negligible for large $n$) and then doing the conditional entropy coding.

Remark 8: Note that all the constraints in (35) are linear, and the cost is a concave function. Hence, overall, we have a concave minimization problem (of dimension $|\mathcal{X}|^\ell |\hat{\mathcal{X}}|^\ell + |\hat{\mathcal{X}}|^{k+1} + |\mathcal{X}||\hat{\mathcal{X}}|$).

VIII. SIMULATION RESULTS

Fig. 1. Average performance of the iterative Viterbi-based lossy coder applied to an i.i.d. Bern(0.5) source. ($n = 10^4$, $k = 8$, $\alpha = (3, 2.9, \ldots, 0.1)$, and $L = 50$)

As the first example, consider an i.i.d. Bern($p$) source with $p = 0.5$. Fig. 1 shows the performance of the iterative algorithm described in Section VI, slightly modified as suggested in Remark 4. The simulation parameters are as follows: $n = 10^4$, $k = 8$, and $\alpha = (3, 2.9, \ldots, 0.1)$. Each point corresponds to the average performance over $L = 50$ independent source realizations. As mentioned in Section VI, the iterative algorithm continues until it converges. Fig. 2 shows the average, minimum and maximum number of required iterations before convergence versus $\alpha$. Again, the number of trials is $L = 50$. It can be observed that the number of iterations in this case is always below 60, which, given the size of the search space, which is $2^n$, shows fast convergence.

The next example involves a binary symmetric Markov source (BSMS) with transition probability $q = 0.2$. Fig. 3 compares the average performance of the Viterbi encoder against upper and lower bounds on $R(D)$ [38]. The reason for only comparing the performance of the algorithm against bounds on $R(D)$ in this case is that the rate-distortion function of a Markov source is not known, except in a low-distortion region. For low distortions, the Shannon lower bound is tight [39]. More explicitly, for $D \leq D_c \approx 0.0159$,
$$R(D) = h_b(q) - h_b(D),$$
where $h_b(\epsilon) \triangleq h(\epsilon, 1-\epsilon)$. For $D > D_c$, $R(D) > h_b(q) - h_b(D)$.

A comparison with the memoryless case (Fig. 1) seems to suggest that the problem is less with how quickly (in $n$) we are converging to the performance of the exhaustive-search scheme of (9) than with how quickly the convergence in (36) is taking place, which is source dependent and not under our control.
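As a quick numeric illustration of the low-distortion expression above (ours, not from the paper), the following snippet evaluates the Shannon-lower-bound rate $h_b(q) - h_b(D)$ for $q = 0.2$ at a few distortions below $D_c$.

```python
# Binary entropy and the low-distortion rate expression R(D) = h_b(q) - h_b(D).
from math import log2

def h_b(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

q = 0.2
for D in (0.005, 0.01, 0.015):
    print(D, h_b(q) - h_b(D))   # rate lower bound at distortion D
```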

Fig. 2. From top to bottom: average, minimum and maximum number of iterations before convergence. (i.i.d. Bern(0.5) source, $n = 10^4$, $k = 8$, $\alpha = (3, 2.9, \ldots, 0.1)$, and $L = 50$)

Fig. 4. Average number of iterations before convergence. (BSMS with $q = 0.2$, $n = 25 \times 10^3$, $k = 8$, $\alpha = 3 : -0.1 : 0.1$ and $L = 50$)

Fig. 3. Average performance of the iterative Viterbi-based lossy coder applied to a BSMS with $q = 0.2$. ($n = 25 \times 10^3$, $k = 8$, $\alpha = 3 : -0.1 : 0.1$ and $L = 50$)

Fig. 5. Energy decay through the iterations for $\alpha = 1.5$. (BSMS with $q = 0.2$, $n = 25 \times 10^3$ and $k = 8$)

Fig. 4 shows the average number of iterations before convergence versus $\alpha$. It can be observed that the average is always below 15. To give some examples of how the energy decreases, Fig. 5 and Fig. 6 show the energy decay through the iterations for $\alpha = 1.6$ and $\alpha = 1$, respectively. Finally, Fig. 7 compares the performances of the iterative Viterbi-based encoding algorithm proposed in this paper and the MCMC-based encoding algorithm proposed in [33]. Both algorithms are applied to a single source sequence of length $n = 10^4$ generated by an i.i.d. Bern(0.5) source. In the MCMC-based approach the temperature at iteration $t$ is equal to $T_t = 0.75^{\lceil t/n \rceil}$, and the number of iterations is $r = 10n = 10^5$. The other parameters used by both algorithms are $\alpha = 4 : -0.5 : 0.5$ and $k = 8$. Clearly the new approach has better performance, and is able to follow the $R(D)$ curve. In this example, the required time for generating the points in Matlab for the new iterative Viterbi-based algorithm is about 4 times less than the required time for generating the curve for the MCMC-based algorithm.

Fig. 6. Energy decay through the iterations for $\alpha = 1.1$. (BSMS with $q = 0.2$, $n = 25 \times 10^3$ and $k = 8$)


Fig. 7. Comparing the performances of the MCMC-based and Viterbi-based encoders applied to an i.i.d. Bern(0.5) source sequence of length $n = 10^4$ for $\alpha = 4 : -0.5 : 0.5$ and $k = 8$. The parameters of the MCMC-based algorithm are $r = 10n$ and $T_t = 0.75^{\lceil t/n \rceil}$.

Remark 9: Similarly to [33], in the figures here we are using $H_k(\hat{x}^n)$ as the rate, while in fact it is not a true length function. The reason is that, as explained in [33], by the Ziv inequality [40], if $k = o(\log n)$, then for any $\epsilon > 0$ there exists $N_\epsilon \in \mathbb{N}$ such that for any $n > N_\epsilon$ and any sequence $\mathbf{y} = (y_1, y_2, \ldots)$,
$$\frac{1}{n}\ell_{\mathrm{LZ}}(y^n) - H_k(y^n) \leq \epsilon. \tag{37}$$

IX. CONCLUSIONS

In this paper, a new approach to fixed-slope lossy compression of finite-alphabet sources is proposed. The core ingredient of the new algorithm is the Viterbi algorithm, which enables the encoder to find a minimum-cost reconstruction sequence. The cost of each possible reconstruction sequence $y^n$ is defined in terms of its distance from the original sequence $x^n$, and some weights that are assigned to the contexts of order $k+1$ appearing in $y^n$ multiplied by their frequencies. The trellis corresponding to the Viterbi algorithm includes $|\hat{\mathcal{X}}|^{k+1}$ different states, corresponding to the $|\hat{\mathcal{X}}|^{k+1}$ different possible contexts of length $k+1$. Hence, for encoding a sequence of length $n$, the computational complexity of the Viterbi algorithm is of order $O(n|\hat{\mathcal{X}}|^{k+1})$.

We prove that there exists a set of optimal coefficients for which the described algorithm achieves the rate-distortion performance for any stationary ergodic process. The problem is finding those weights. We present an optimization problem whose solution can be used to find an asymptotically tight approximation of the optimal coefficients, resulting in an overall scheme which is universal with respect to the class of stationary ergodic sources. However, solving this optimization problem is computationally demanding, and in fact infeasible in practice for even moderate blocklengths. In order to overcome this problem, we propose an iterative approach for approximating the optimal coefficients. This approach is partially justified by a guarantee of convergence to at least a local minimum.

In the described iterative approach, the algorithm starts at a large slope (corresponding to a small distortion) and gradually decreases the slope until it hits the desired value. At each slope, the algorithm runs the Viterbi algorithm iteratively until it converges. An interesting possible next step is to explore whether there exists a sequence of slopes converging to the desired value in a small number of steps (e.g., of $o(n)$) for which we can guarantee the convergence of the algorithm to the global minimum at the end of the process. Existence of such a sequence of slopes would imply a universal lossy compression algorithm with moderate computational complexity.

APPENDIX A: PROOF OF THEOREM 3

Proof: By rearranging the terms, the cost function in (P1) can alternatively be represented as
$$\begin{aligned} H_k(y^n) + \alpha d_n(x^n, y^n) &= H_k(\mathbf{m}(y^n)) + \alpha \frac{1}{n}\sum_{i=1}^n d(x_i, y_i) \\ &= H_k(\mathbf{m}(y^n)) + \alpha \frac{1}{n}\sum_{i=1}^n \sum_{(a,b) \in \mathcal{X} \times \hat{\mathcal{X}}} d(a,b)\, \mathbb{1}_{(x_i, y_i) = (a,b)} \\ &= H_k(\mathbf{m}(y^n)) + \alpha \sum_{(a,b) \in \mathcal{X} \times \hat{\mathcal{X}}} d(a,b)\, \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{(x_i, y_i) = (a,b)} \\ &= H_k(\mathbf{m}(y^n)) + \alpha \sum_{(a,b) \in \mathcal{X} \times \hat{\mathcal{X}}} d(a,b)\, \hat{p}^{(1)}_{[x^n, y^n]}(a,b) \\ &= H_{\hat{p}^{(k+1)}_{[y^n]}}(Y_{k+1} \mid Y^k) + \alpha\, \mathrm{E}_{\hat{p}^{(1)}_{[x^n, y^n]}}[d(X_1, Y_1)]. \end{aligned} \tag{A-1}$$
This new representation reveals the close connection between (P1) and (33). Although the costs we are trying to minimize in the two problems are equal, there is a fundamental difference between them: (P1) is a discrete optimization problem, while the optimization space in (33) is continuous.

Let $\mathcal{E}^*_n$ be the set of minimizers of (P1), and define the set $\mathcal{P}^*_n \triangleq \{\hat{p}^{(\ell)}_{[x^n, y^n]} : y^n \in \mathcal{E}^*_n\}$, i.e., the set of $\ell$th order joint distributions induced by the sequences in $\mathcal{E}^*_n$ and $x^n$. Also, let $\hat{\mathcal{P}}^*_n$ denote the set of order-$\ell$ joint distributions, i.e., $\{p^{*(\ell)}(a^\ell, b^\ell)\}_{(a^\ell, b^\ell) \in \mathcal{X}^\ell \times \hat{\mathcal{X}}^\ell}$, that are the solutions of (35). Finally, let $C^*_n$ and $\hat{C}^*_n$ denote the minimum values achieved by (P1) and (35), respectively. In order to make the proof more tractable, we break it down into several steps as follows.

1) Let $y^n \in \mathcal{E}^*_n$, and $\hat{p}^{(\ell)}_{[x^n, y^n]}$ be its corresponding $\ell$th order joint empirical distribution with $x^n$. It is easy to check that $\hat{p}^{(\ell)}_{[x^n, y^n]}$ is a feasible point in (35) and satisfies all the required constraints. The only condition that might


need some thought is the stationarity constraint, which also holds because
$$\sum_{(a_\ell, b_\ell) \in \mathcal{X} \times \hat{\mathcal{X}}} \hat{p}^{(\ell)}_{[x^n, y^n]}(a^\ell, b^\ell) = \frac{1}{n}\left|\{1 \leq i \leq n : x_{i-\ell+1}^{i-1} = a^{\ell-1},\ y_{i-\ell+1}^{i-1} = b^{\ell-1}\}\right| = \sum_{(a_\ell, b_\ell) \in \mathcal{X} \times \hat{\mathcal{X}}} \hat{p}^{(\ell)}_{[x^n, y^n]}(a_\ell a^{\ell-1}, b_\ell b^{\ell-1}). \tag{A-2}$$
Therefore, since $\hat{C}^*_n$ is the minimum of (35), we have
$$\hat{C}^*_n \leq H_k(\mathbf{m}(y^n)) + \alpha\, \mathrm{E}_{\hat{p}^{(1)}_{[x^n, y^n]}}[d(X_1, Y_1)] = H_k(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n) = C^*_n. \tag{A-3}$$

2) Let $p^{*(\ell)} \in \hat{\mathcal{P}}^*_n$. Based on this joint probability distribution and $x^n$, we construct a reconstruction sequence $\tilde{X}^n$ as follows: divide $x^n$ into $r = \lceil n/\ell \rceil$ consecutive blocks
$$x^\ell,\ x_{\ell+1}^{2\ell},\ \ldots,\ x_{(r-2)\ell+1}^{(r-1)\ell},\ x_{(r-1)\ell+1}^{n},$$
where, except for possibly the last block, the blocks have length $\ell$. The new sequence is constructed as
$$\tilde{X}^n = \tilde{X}^\ell,\ \tilde{X}_{\ell+1}^{2\ell},\ \ldots,\ \tilde{X}_{(r-2)\ell+1}^{(r-1)\ell},\ \tilde{X}_{(r-1)\ell+1}^{n},$$
such that, for $i = 1, \ldots, r-1$, $\tilde{X}_{(i-1)\ell+1}^{i\ell}$ is a sample drawn independently from the conditional distribution $p^{*(\ell)}(\hat{X}^\ell \mid X^\ell = x_{(i-1)\ell+1}^{i\ell})$, and $\tilde{X}_{(r-1)\ell+1}^{n} \sim p^{*(\ell)}(\hat{X}^{n-(r-1)\ell} \mid X^{n-(r-1)\ell} = x_{(r-1)\ell+1}^{n})$.

3) Let $p^{*(k+1)}$ be the $(k+1)$th order marginalized version of the solution of (35) on $\hat{\mathcal{X}}^{k+1}$. We prove that
$$\left\|p^{*(k+1)} - \hat{p}^{(k+1)}_{[\tilde{X}^n]}\right\|_1 \to 0, \tag{A-4}$$
almost surely. The randomization in (A-4) is only in the generation of $\tilde{X}^n$.

Remark 10: Since $p^{*(\ell)}$ satisfies the stationarity condition, its $(k+1)$th order marginalized distribution, $p^{*(k+1)}$, is well-defined and can be computed with respect to any of the $(k+1)$ consecutive positions. In other words, for $a^{k+1} \in \hat{\mathcal{X}}^{k+1}$,
$$p^{*(k+1)}(a^{k+1}) = \sum_{b^{\ell-k-1} \in \hat{\mathcal{X}}^{\ell-k-1}} p^{*(\ell)}\left(b^j\, a^{k+1}\, b_{j+1}^{\ell-k-1}\right),$$
for any $j \in \{0, \ldots, \ell-k-1\}$, and the result does not depend on the choice of $j$.

In order to show that (A-4) holds almost surely, we decompose $\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1})$ into the average of $\ell-k$ terms, each of which converges to $p^{*(k+1)}(a^{k+1})$. Then, using the union bound, we get the desired result, which is the convergence of $\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1})$ to $p^{*(k+1)}(a^{k+1})$. For $a^{k+1} \in \hat{\mathcal{X}}^{k+1}$,
$$\begin{aligned} \hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1}) - p^{*(k+1)}(a^{k+1}) &= \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}} - p^{*(k+1)}(a^{k+1}) \\ &= \frac{1}{n}\sum_{j=0}^{\ell-k-1}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}} + \delta_1 - p^{*(k+1)}(a^{k+1}) \\ &= \frac{r}{n}\sum_{j=0}^{\ell-k-1}\left[\frac{1}{r}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\right] + \delta_1 - p^{*(k+1)}(a^{k+1}) \\ &= \frac{1}{\ell-k}\sum_{j=0}^{\ell-k-1}\left[\frac{1}{r}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\right] + \delta_2 - p^{*(k+1)}(a^{k+1}), \end{aligned} \tag{A-5}$$
where
$$\delta_1 \triangleq \frac{1}{n}\sum_{j=\ell-k}^{\ell-1}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}} + \frac{1}{n}\sum_{i=(r-1)\ell+1}^{n} \mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}}$$
accounts for the edge effects between the blocks, and $\delta_2$ is defined such that $\delta_2 - \delta_1$ takes care of the effect of replacing $r/n$ by $1/(\ell-k)$. Therefore, from these definitions, $0 \leq \delta_1 \leq \frac{kr}{n} + \frac{\ell}{n} \leq \frac{k}{\ell} + \frac{k}{n} + \frac{1}{r-1}$, and $|\delta_2 - \delta_1| = o(k/\ell)$. Hence, for our choice of parameters, $\delta_1 \to 0$ and $\delta_2 \to 0$ as $n \to \infty$.

The new representation decomposes a sequence of correlated random variables, $\{\mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}}\}_{i=k+1}^{n}$, into $\ell-k$ sub-sequences, each of which is an independent process. For achieving this goal, some counts that lie between two blocks are ignored, i.e., if $\mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}}$ depends on more than one block of the form $\tilde{X}_{(i-1)\ell+1}^{i\ell}$, it is ignored. The effect of such terms is no more than $\delta_1$, which, since $k = o(\ell)$, goes to zero as $k, \ell \to \infty$. More specifically, in (A-5), for each $j \in \{0, \ldots, \ell-k-1\}$, $\{\mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\}_{i=1}^{r-1}$ forms a sequence of independent, not necessarily identically distributed, random variables. For $n$ large enough, $|\delta_2| < \epsilon/2$. Therefore, by the Hoeffding inequality [41] and the union bound,

$$\begin{aligned} \mathrm{P}\left(\left|\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1}) - p^{*(k+1)}(a^{k+1})\right| > \epsilon\right) &\leq \mathrm{P}\left(\left|\frac{1}{\ell-k}\sum_{j=0}^{\ell-k-1}\left[\frac{1}{r}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\right] - p^{*(k+1)}(a^{k+1})\right| > \frac{\epsilon}{2}\right) \\ &\leq \mathrm{P}\left(\sum_{j=0}^{\ell-k-1}\left|\frac{1}{\ell-k}\left[\frac{1}{r}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\right] - \frac{1}{\ell-k}\,p^{*(k+1)}(a^{k+1})\right| > \frac{\epsilon}{2}\right) \\ &\leq \sum_{j=0}^{\ell-k-1} \mathrm{P}\left(\left|\frac{1}{r}\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}} - p^{*(k+1)}(a^{k+1})\right| > \frac{\epsilon}{2}\right) \\ &\leq 2(\ell-k)\,\mathrm{e}^{-r\epsilon^2/2}. \end{aligned} \tag{A-6}$$
Again by the union bound,
$$\mathrm{P}\left(\left\|\hat{p}^{(k+1)}_{[\tilde{X}^n]} - p^{*(k+1)}\right\|_1 > \epsilon\right) \leq \sum_{a^{k+1} \in \hat{\mathcal{X}}^{k+1}} \mathrm{P}\left(\left|\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1}) - p^{*(k+1)}(a^{k+1})\right| > \frac{\epsilon}{|\hat{\mathcal{X}}|^{k+1}}\right) \leq |\hat{\mathcal{X}}|^{k+1}\, 2(\ell-k)\, \mathrm{e}^{-\frac{n\epsilon^2}{2\ell|\hat{\mathcal{X}}|^{2(k+1)}}}. \tag{A-7}$$
Our choices of $k = k_n = o(\log n)$, $\ell = \ell_n = o(n^{1/4})$, $k = o(\ell)$, and $k_n, \ell_n \to \infty$ as $n \to \infty$, guarantee that the right-hand side of (A-7) is summable in $n$, which together with the Borel-Cantelli lemma yields the desired result of (A-4).

4) Using similar steps, we can prove that
$$\left\|q^* - \hat{p}^{(1)}_{[x^n, \tilde{X}^n]}\right\|_1 \to 0,$$
almost surely. Again, we first prove that $|q^*(a,b) - \hat{p}^{(1)}_{[x^n, \tilde{X}^n]}(a,b)| \to 0$ for each $(a,b) \in \mathcal{X} \times \hat{\mathcal{X}}$. To achieve this goal, we again decompose $\{\mathbb{1}_{(x_i, \tilde{X}_i) = (a,b)}\}_{i=1}^n$ into $\ell-k$ subsequences, each of which is a sequence of independent random variables, and then apply the Hoeffding inequality plus the union bound. Finally, we apply the union bound in addition to the Borel-Cantelli lemma to get the desired result.

5) Combining the results of the last two steps, and the fact that $H_k(\mathbf{m})$ and $\mathrm{E}_q[d(X,Y)]$ are bounded continuous functions of $\mathbf{m}$ and $q$, respectively, we conclude that
$$H_k(\tilde{X}^n) + \alpha d_n(x^n, \tilde{X}^n) = H_{\hat{p}^{(k+1)}_{[\tilde{X}^n]}}(Y_{k+1} \mid Y^k) + \alpha\, \mathrm{E}_{\hat{p}^{(1)}_{[x^n, \tilde{X}^n]}}[d(X_1, Y_1)] = H_{p^{*(k+1)}}(Y_{k+1} \mid Y^k) + \alpha\, \mathrm{E}_{q^*}[d(X_1, Y_1)] + \epsilon_n = \hat{C}^*_n + \epsilon_n, \tag{A-8}$$
where $\epsilon_n \to 0$ with probability 1.

6) Since $C^*_n$ is the minimum of (P1), from (A-8),
$$C^*_n \leq H_k(\tilde{X}^n) + \alpha d_n(x^n, \tilde{X}^n) = \hat{C}^*_n + \epsilon_n. \tag{A-9}$$
On the other hand, as shown in (A-3), $\hat{C}^*_n \leq C^*_n$. Therefore, combining this and (A-9), it follows that
$$|C^*_n - \hat{C}^*_n| \to 0. \tag{A-10}$$

7) For a given matrix of coefficients $\Lambda = [\lambda_{\beta,b^k}]_{\beta,b^k}$ computed at some $\mathbf{m}$ according to (13), it is easy to check that $\phi(x^n, \Lambda)$ defined in (24) is continuous in $\Lambda$, and bounded by $\log|\hat{\mathcal{X}}| + \alpha \max_{(a,b) \in \mathcal{X} \times \hat{\mathcal{X}}} d(a,b)$. Therefore, since $\Lambda$ is in turn a continuous function of $\mathbf{m}$, and as proved in (A-4), $\|p^{*(k+1)} - \hat{p}^{(k+1)}_{[\tilde{X}^n]}\|_1 \to 0$, we conclude that
$$|\phi(x^n, \Lambda^*) - \phi(x^n, \hat{\Lambda})| \to 0,$$
with probability one, as $n \to \infty$, where $\Lambda^*$ and $\hat{\Lambda}$ are the coefficients computed at $p^{*(k+1)}$ and $\hat{p}^{(k+1)}_{[\tilde{X}^n]}$, respectively.

8) Let $\bar{X}^n$ be the output of (P2) when the coefficients are computed at $\mathbf{m}(\tilde{X}^n)$. Then, from Theorem 2,
$$H_k(\bar{X}^n) + \alpha d_n(x^n, \bar{X}^n) \leq H_k(\tilde{X}^n) + \alpha d_n(x^n, \tilde{X}^n) = \hat{C}^*_n + \epsilon_n.$$
Since $\epsilon_n \to 0$ with probability one, this shows that had we computed the coefficients at $\mathbf{m}(\tilde{X}^n)$, we would get a universal lossy compressor. But instead, we want to compute the coefficients at $\mathbf{m}^*$. However, from (A-10), the difference between the performances of these two algorithms goes to zero as $n$ goes to infinity. Therefore, we finally get our desired result, which is
$$\limsup_{n \to \infty}\left[H_k(\hat{X}^n) + \alpha d_n(X^n, \hat{X}^n)\right] \leq \min_{D \geq 0}\left[R(\mathbf{X}, D) + \alpha D\right],$$
almost surely.

R EFERENCES [1] T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 2nd edition, 2006. [2] C. Shannon. Coding theorems for a discrete source with fidelity criterion. In R. Machol, editor, Information and Decision Processes, pages 93– 126. McGraw-Hill, 1960. [3] R.G. Gallager. Information Theory and Reliable Communication. NY: John Wiley, 1968. [4] T. Berger. Rate-distortion theory: A mathematical basis for data compression. NJ: Prentice-Hall, 1971. [5] E.H. Yang, Z. Zhang, and T. Berger. Fixed-slope universal lossy data compression. Information Theory, IEEE Transactions on, 43(5):1465– 1476, Sep. 1997. [6] D. J. Sakrison. The rate of a class of random processes. IEEE Trans. Inform. Theory, 16:10–16, Jan. 1970. [7] J. Ziv. Coding of sources with unknown statistics Part II: Distortion relative to a fidelity criterion. Information Theory, IEEE Transactions on, 18:389–394, May 1972. [8] D. L. Neuhoff, R. M. Gray, and L.D. Davisson. Fixed rate universal block source coding with a fidelity criterion. Information Theory, IEEE Transactions on, 21(5):511–523, May 1972. [9] D. L. Neuhoff and P. L. Shields. Fixed-rate universal codes for Markov sources. Information Theory, IEEE Transactions on, 24(3):360–367, May 1978. [10] J. Ziv. Distortion-rate theory for individual sequences. Information Theory, IEEE Transactions on, 26(2):137–143, Mar. 1980.


[11] R. Garcia-Munoz and D. L. Neuhoff. Strong universal source coding subject to a rate-distortion constraint. IEEE Transactions on Information Theory, 28(2):285–295, Mar. 1982.
[12] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, Sep. 1978.
[13] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.
[14] K. Cheung and V. K. Wei. A locally adaptive source coding scheme. In Proc. Bilkent Conf. on New Trends in Communication, Control, and Signal Processing, pages 1473–1482, 1990.
[15] H. Morita and K. Kobayashi. An extension of LZW coding algorithm to source coding subject to a fidelity criterion. In Proc. 4th Joint Swedish-Soviet Int. Workshop on Information Theory, Gotland, Sweden, 1989.
[16] Y. Steinberg and M. Gutman. An algorithm for source coding subject to a fidelity criterion based on string matching. IEEE Transactions on Information Theory, 39(3):877–886, Mar. 1993.
[17] E. H. Yang and J. C. Kieffer. On the performance of data compression algorithms based upon string matching. IEEE Transactions on Information Theory, 44(1):47–65, Jan. 1998.
[18] T. Luczak and W. Szpankowski. A suboptimal lossy data compression based on approximate pattern matching. IEEE Transactions on Information Theory, 43(5):1439–1451, Sep. 1997.
[19] R. Zamir and K. Rose. Natural type selection in adaptive lossy compression. IEEE Transactions on Information Theory, 47(1):99–111, Jan. 2001.
[20] M. Atallah, Y. Genin, and W. Szpankowski. Pattern matching image compression: algorithmic and empirical results. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7):614–627, July 1999.
[21] A. Dembo and I. Kontoyiannis. The asymptotics of waiting times between stationary processes, allowing distortion. The Annals of Applied Probability, 9(2):413–429, May 1999.
[22] M. W. Marcellin and T. Fischer. Trellis coded quantization of memoryless and Gauss-Markov sources. IEEE Transactions on Communications, 38(1):82–93, Jan. 1990.
[23] T. Berger and J. D. Gibson. Lossy source coding. IEEE Transactions on Information Theory, 44(6):2690–2723, Sep. 1998.
[24] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer, New York, 1992.
[25] J. H. Kasner, M. W. Marcellin, and B. R. Hunt. Universal trellis coded quantization. IEEE Transactions on Image Processing, 8(12):1677–1687, Dec. 1999.
[26] M. J. Wainwright and E. Maneva. Lossy source encoding via message-passing and decimation over generalized codewords of LDGM codes. In Proc. IEEE Int. Symp. Inform. Theory, Adelaide, Australia, Sep. 2005.
[27] A. Gupta and S. Verdú. Nonlinear sparse-graph codes for lossy compression. IEEE Transactions on Information Theory, 55(5):1961–1975, May 2009.
[28] A. Gupta, S. Verdú, and T. Weissman. Rate-distortion in near-linear time. In Proc. IEEE Int. Symp. Inform. Theory, pages 847–851, Toronto, Canada, July 2008.
[29] E. Arikan. Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels. IEEE Transactions on Information Theory, 55(7):3051–3073, July 2009.
[30] S. B. Korada and R. L. Urbanke. Polar codes are optimal for lossy source coding. IEEE Transactions on Information Theory, 56(4):1751–1768, April 2010.
[31] M. Z. Mao, R. M. Gray, and T. Linder. Rate-constrained simulation and source coding i.i.d. sources. IEEE Transactions on Information Theory, 57(7):4516–4529, July 2011.
[32] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vector quantization. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(1):31–42, Jan. 1989.
[33] S. Jalali and T. Weissman. Rate-distortion via Markov chain Monte Carlo. In Proc. IEEE Int. Symp. Inform. Theory, pages 852–856, July 2008.
[34] E. H. Yang and Z. Zhang. Variable-rate trellis source encoding. IEEE Transactions on Information Theory, 45(2):586–608, Mar. 1999.
[35] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, Apr. 1967.
[36] G. D. Forney, Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, Mar. 1973.
[37] R. Gray, D. Neuhoff, and J. Omura. Process definitions of distortion-rate functions and source coding theorems. IEEE Transactions on Information Theory, 21(5):524–532, Sep. 1975.

[38] S. Jalali and T. Weissman. New bounds on the rate-distortion function of a binary Markov source. In Proc. IEEE Int. Symp. Inform. Theory, Nice, France, July 2007.
[39] R. Gray. Rate distortion functions for finite-state finite-alphabet Markov sources. IEEE Transactions on Information Theory, 17(2):127–134, Mar. 1971.
[40] E. Plotnik, M. J. Weinberger, and J. Ziv. Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm. IEEE Transactions on Information Theory, 38(1):66–72, Jan. 1992.
[41] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, Mar. 1963.

Shirin Jalali received her B.Sc. and M.Sc. degrees in electrical engineering from Sharif University of Technology in 2002 and 2004, respectively. She obtained her M.Sc. in statistics and Ph.D. in electrical engineering from Stanford University in 2009. Her research interests are in the areas of information theory and statistical signal processing. She is currently a postdoctoral fellow at the Center for the Mathematics of Information (CMI) at the California Institute of Technology.

Andrea Montanari received a Laurea degree in physics in 1997, and a Ph.D. in theoretical physics in 2001 (both from Scuola Normale Superiore in Pisa, Italy). He has been a post-doctoral fellow at the Laboratoire de Physique Théorique de l'École Normale Supérieure (LPTENS), Paris, France, and at the Mathematical Sciences Research Institute, Berkeley, USA. Since 2002 he has been Chargé de Recherche (with the Centre National de la Recherche Scientifique, CNRS) at LPTENS. In September 2006 he joined Stanford University as a faculty member, and since 2010 he has been Associate Professor in the Departments of Electrical Engineering and Statistics. He was co-awarded the ACM SIGMETRICS best paper award in 2008. He received the CNRS bronze medal for theoretical physics in 2006 and the National Science Foundation CAREER award in 2008.

Tsachy Weissman (S'99-M'02-SM'07) graduated summa cum laude with a B.Sc. in electrical engineering from the Technion in 1997, and earned his Ph.D. at the same institution in 2001. He then worked at Hewlett-Packard Laboratories with the information theory group before joining Stanford, where he has been on the faculty of the Electrical Engineering Department since 2003 and is the incumbent of the STMicroelectronics chair at the Stanford School of Engineering. He has spent leaves at the Technion and at ETH Zurich. Tsachy's research is focused on information theory, statistical signal processing, the interplay between them, and their applications. Among his recent honors are a joint IT/COM societies best paper award, a Horev fellowship for Leaders in Science and Technology, and a Henry Taub prize for excellence in research. He is on the editorial board of the IEEE Transactions on Information Theory, serving as Associate Editor for Shannon Theory.
