An efficient video decoder design for MPEG-2 MP@ML


Jui-Hua Li and Nam Ling+

Computer Engineering Department, Santa Clara University, Santa Clara, CA 95053, U.S.A.
[email protected], [email protected]

Abstract

In this paper, we present an efficient MPEG-2 video decoder architecture designed to meet the MP@ML real-time decoding requirement. The overall architecture, as well as the design of the major function-specific processing blocks, such as the variable-length decoder, the inverse 2-D discrete cosine transform unit, and the motion compensation unit, are discussed. A hierarchical and distributed controller approach is used. A bus-monitoring model for different bus arbitration schemes that control external DRAM accesses is developed, and the system is simulated. Practical issues and buffer sizes are addressed. With a 27 MHz clock, and with a single external bus and DRAM, our architecture decodes each macroblock in far fewer than 667 cycles, the upper bound for the MP@ML decoding requirement.

1. Introduction

As the demand for multimedia applications increases, and with the establishment of several international standards for video coding and transmission, there is a strong need for efficient, low-cost architectures that meet the required performance. The MPEG-2 [1] standard for video and audio coding, with its different profiles and levels, provides audio and visual quality for different applications such as digital video broadcast, DVD, SDTV, HDTV, VOD, and possibly stereoscopic video in the longer term. Several dedicated decoder chips for MPEG-2 applications have been developed (see, for example, [3],[6],[7]).

The MPEG video coding standard incorporates several compression techniques, such as variable length coding (VLC), discrete cosine transform (DCT), quantization, and motion compensation. This coding scheme significantly reduces both spatial and temporal redundancies. For block-based motion compensation, several types of macroblocks (16x16-pixel areas) have been defined for different types of pictures.

In MPEG decoder design, it is important to meet the required real-time performance and reduce the hardware cost. Moreover, the variety of macroblock types and variable length codes increases the difficulty of predicting memory accesses. Most MPEG video decoder chips access pixels/compressed data in an external memory (RAM) through buses. The unpredictability of memory accesses makes it difficult to design direct memory access (DMA) bus arbitration algorithms. It also increases the difficulty of optimizing buffer requirements within the decoder.

In this paper, an MPEG-2 decoder architecture with the minimum resources required to achieve the performance for main profile at main level (MP@ML) decoding [1] is presented. In addition, a model is developed to monitor real traffic conditions in the decoder datapath and to produce bus utilization data under different bus arbitration schemes. Few papers have discussed bus arbitration schemes for decoders [3], and none has produced a solid model and simulator to optimize bus and buffers. Our paper begins with a brief introduction to the overall architecture of our MPEG-2 decoder, followed by a detailed design of each functional module. Our bus-monitoring model is then presented with simulation results under different bus control schemes and different buffer sizes. Finally, a conclusion is drawn.

+ This work is supported in part by a grant from NJR Corporation.

1063-6862/97 $10.00 © 1997 IEEE

2. Decoder Architecture

Our architecture for an MPEG-2 decoder, which is quite generic, is shown in Figure 1. Each block performs a particular function. The input buffer (FIFO) receives the incoming bitstream, which is then transferred to the external DRAM. The compressed data are fetched out again to the VLD (variable length decoding) unit, where the bitstream is interpreted and the proper subsequent operations are triggered. Either or both of the baseline and the motion compensation operations are performed, depending on the type of macroblock to be processed (decided by the controller). The reconstructed macroblock is written back to the DRAM, where it is ready for display or reference. The I/O width of the architecture and the width of the external bus are both 64 bits, which allows the transfer of one row (8 pixels) of a block at a time.

We adopt a 27 MHz clock for the following reasons:
- It is suitable for low-power applications.
- It is a simple multiple of 13.5 MHz, which is the sampling rate of ITU-R 601 video.
- The clock period is about 37 ns, giving enough allowance for each stage in the data path.
- Higher clock rates would require an increase of on-chip memory.

Figure 1 The architecture of MPEG video decoder (labels recoverable from the figure: external DRAM, DMAC, 64-bit bus, microcontroller)

3. Function-Specific Module Architectures

In this section, the architectures of three function-specific modules, our variable-length decoder (VLD), inverse discrete cosine transform (IDCT) unit, and motion compensation (MC) unit, are presented. The IDCT unit and the MC unit are the two most time-consuming modules (IDCT being the most computationally intensive, MC being the most I/O intensive), and they determine the performance of the MPEG-2 decoder. Other modules are simpler and more straightforward, and thus will not be discussed here.


3.1 Variable-length Decoder (VLD)

In an MPEG-2 bitstream, the macroblock addressing, macroblock types, coded block pattern (CBP), motion vectors, and DCT coefficients are all variable length codes. The VLD parses and interprets the input bitstream. There are two kinds of implementations. One is basically a tree-searching algorithm [4], and the other is a parallel structure implemented through a look-up table. The former operates bit-serially and can hardly meet the required performance. In our design, we adopted the latter, the Lei-Sun VLD [2], and modified it to decode MPEG-2 MP@ML bitstreams. Since there are no explicit boundaries between codewords, the VLD has to decode a codeword, determine its length, and shift the input data stream by the number of bits corresponding to its length.

Figure 2 shows a block diagram of the VLD for MPEG video bitstreams. The VLC tables, implemented by a PLA, contain the decoded data and length information. The fixed lookup table contains fixed-length information in the header part of a bitstream. The state machine monitors the variable-length decoding process and also indicates the current processing position in a bitstream. Compressed data, coming from the DRAM, are kept in the VLD input buffer. The upper and lower registers, which are 64 bits wide each, contain the current bitstream to be processed. The barrel shifter operates like a sliding window on the contents of these two registers. The window size (the length of the output from the shifter) is 28 bits, the same length as the escape code, which has the maximum code length among the codewords. Hence, a fixed symbol decoding rate of one codeword per cycle is guaranteed. The barrel shifter input is 91 bits wide, the minimum length determined to prevent underflow, i.e., there are always enough bits for the next decoding cycle at minimum hardware cost.

At each cycle, the state machine determines which tables are to be used, according to the current position in a bitstream. The fixed lookup table is used to search for the start codes and to interpret the fixed-length information in the header, while the VLC tables are used for all data that are variable-length coded. During VLC decoding, the output of the barrel shifter is matched against all entries in the PLA. When a match is found, the corresponding source symbol and the length of the decoded data are output. The shifter is then shifted to the beginning of the next codeword according to the accumulated code length. When the carry-out signal goes high, it indicates that the upper register has been consumed. The content of the lower register is transferred to the upper register, and a new 64-bit word is loaded into the lower register. The decoded header information is sent to the controller, while the motion vectors are sent to the on-chip address generator for DRAM address generation. The decoded DCT coefficients are sent to the next stage in the baseline path for further processing.

Figure 2 The variable-length decoder architecture (labels recoverable from the figure: upper register and lower register, 64 bits each; barrel shifter with 91-bit input and 27-bit output; FSM state machine; 5-bit length signal; VLC tables; VLC length information; fixed-length information; microcontroller)
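The table-lookup decoding loop described above (match the window, output the symbol and its length, shift by that length) can be sketched in a few lines of Python. This is a minimal illustration only: `TOY_VLC` is a made-up prefix code, not an MPEG-2 table, the lookup window is the toy code's longest codeword (4 bits) rather than the 28 bits quoted in the text, and the register pair and barrel shifter are modeled as a string slice. The names `build_lut` and `decode` are hypothetical.

```python
# Toy prefix code (an assumption for illustration, NOT an MPEG-2 VLC table).
TOY_VLC = {"1": "A", "01": "B", "001": "C", "0001": "D"}


def build_lut(code, width):
    """Expand every codeword to all `width`-bit windows that start with it.

    This mimics a parallel match against all table entries: any window
    maps directly to (symbol, codeword length).
    """
    lut = {}
    for bits, symbol in code.items():
        free = width - len(bits)
        base = int(bits, 2) << free
        for tail in range(1 << free):
            lut[base | tail] = (symbol, len(bits))
    return lut


def decode(bitstring, code):
    """Decode one symbol per lookup, then shift by the decoded length."""
    width = max(len(bits) for bits in code)   # window = longest codeword
    lut = build_lut(code, width)
    padded = bitstring + "0" * width          # keep the last window full
    pos, symbols = 0, []
    while pos < len(bitstring):
        window = int(padded[pos:pos + width], 2)   # sliding-window read
        symbol, length = lut[window]               # KeyError if no match
        symbols.append(symbol)
        pos += length                              # accumulate code length
    return symbols
```

Because the window is as wide as the longest codeword, every lookup resolves exactly one symbol, which is the property the text relies on for a fixed rate of one codeword per cycle.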


3.2 IDCT Design

The discrete cosine transform (DCT) [5] and its inverse (IDCT) are important operations in MPEG. Many fast algorithms have been proposed and implemented for computing the DCT/IDCT [8][9][15]. Detailed comparisons and in-depth analysis can be found in the literature [10][11]. In general, we can classify these DCT/IDCT algorithms by their separability, which is an important property for VLSI implementation of the 2-D DCT/IDCT. These algorithms can be carried out by either a multiplier/adder-based or a ROM-based (distributed arithmetic) [12] architecture. Both are parallel/pipelined approaches to achieve real-time performance in MPEG decoding. In our design, we employ Chen's algorithm [15] due to its regularity, reduced arithmetic operations, and its ability to retain accuracy with limited wordlength. These attributes are suitable for VLSI implementation. The 8-point 1-D IDCT algorithm is described in Eqs. (1)(2). From the matrix operation, the 8-point IDCT results are easily obtained. Figure 3 shows an overall architecture for the 2-D IDCT operation using row-column decomposition from the 1-D IDCTs.

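$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
\qquad (1)
$$

$$
\begin{bmatrix} Y_7 \\ Y_6 \\ Y_5 \\ Y_4 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
\qquad (2)
$$

where $A = \cos(\pi/4)$, $B = \cos(\pi/8)$, $C = \sin(\pi/8)$, $D = \cos(\pi/16)$, $E = \cos(3\pi/16)$, $F = \sin(3\pi/16)$, $G = \sin(\pi/16)$.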
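As a numerical cross-check of the even/odd structure of an 8-point IDCT built from the constants A to G defined above, the sketch below compares the matrix form against the direct IDCT summation. The matrix layout and the 1/2 scale factor follow the standard orthonormal 8-point IDCT and are assumptions about the exact form of Eqs. (1)(2); the function names are hypothetical.

```python
import math

A = math.cos(math.pi / 4)
B = math.cos(math.pi / 8)
C = math.sin(math.pi / 8)
D = math.cos(math.pi / 16)
E = math.cos(3 * math.pi / 16)
F = math.sin(3 * math.pi / 16)
G = math.sin(math.pi / 16)

# Coefficient matrices applied to the even inputs (X0, X2, X4, X6)
# and the odd inputs (X1, X3, X5, X7).
EVEN = [[A, B, A, C], [A, C, -A, -B], [A, -C, -A, B], [A, -B, A, -C]]
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]


def idct8_even_odd(X):
    """8-point IDCT via two 4x4 products; Y[n] and Y[7-n] share both sums."""
    even_in = [X[0], X[2], X[4], X[6]]
    odd_in = [X[1], X[3], X[5], X[7]]
    y = [0.0] * 8
    for n in range(4):
        e = sum(c * x for c, x in zip(EVEN[n], even_in))
        o = sum(c * x for c, x in zip(ODD[n], odd_in))
        y[n] = 0.5 * (e + o)
        y[7 - n] = 0.5 * (e - o)
    return y


def idct8_direct(X):
    """Reference: y[n] = 1/2 * sum_k c(k) X[k] cos((2n+1) k pi / 16)."""
    y = []
    for n in range(8):
        acc = 0.0
        for k in range(8):
            ck = math.sqrt(0.5) if k == 0 else 1.0
            acc += ck * X[k] * math.cos((2 * n + 1) * k * math.pi / 16)
        y.append(0.5 * acc)
    return y
```

The decomposition halves the multiplications relative to a full 8x8 product, since each pair of outputs reuses the same even and odd inner products.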

In our design, the IDCT is implemented by a multiplier-adder architecture rather than a distributed arithmetic architecture, which would need to operate at a clock rate higher than the 27 MHz used in our system. Furthermore, at a higher clock frequency more register stages are required and power consumption increases proportionally. The cycle time at 27 MHz is enough for the data to go through a multiply-accumulate operation. We use a systolic multiplier-accumulator array to carry out the IDCT. Compared with a design using the SIMD (single instruction multiple data) approach, this approach avoids complex wiring and results in a highly regular and modular chip design. There are other examples of systolic array implementation (e.g. [13]). The wordlengths of the internal paths for implementing an 8x8 IDCT are determined to minimize the hardware cost under the IEEE standard accuracy test.

Figure 3 Overall architecture for 2-D IDCT (labels recoverable from the figure: round; round & clip)

Our overall 2-D IDCT architecture is shown in Figure 3. Figures 4 and 5 illustrate our design of an 8-point 1-D IDCT using 4 MACs to form a systolic array. Each processing element (PE) in the array is basically a multiplier and an accumulator, which performs the operation defined in Eq. (3) (Fig. 5). Data are pumped into this computing array in a special sequence on every clock cycle. In Figure 4, the X's (inputs), C's (cosine values from the cosine ROM), and 0's form three data streams flowing into the array, where the intermediate results of the inner products are transferred to the neighboring PEs and accumulated along the paths of PEs. A pair of results flows out every other cycle after the fourth cycle from the beginning. The timing is shown in Figure 7. The total processing time for one macroblock (6 blocks) is 440 cycles.

The maximum delay along one MAC is 27.52 ns. This synthesized result is derived from the COMPASS ASIC synthesis tool using 0.6 µm CMOS technology. It is less than the cycle time of the system clock (37 ns), and hence the architecture meets the timing requirement.

The determination of optimal wordlengths for the IDCT operation is important in order to minimize hardware cost while maintaining sufficient accuracy to satisfy the IEEE specification. We adopt the wordlength determination results from [14] in our design. The width of the interconnection between PEs is 19 bits, and the output from the cosine ROM is 13 bits wide. The output from the 1st IDCT unit is 15 bits, while the output from the 2nd IDCT unit is 9 bits. These are illustrated in Figure 4. Rounding is used for the output from the 1-D IDCT unit, while truncation is applied to the output from each PE.

$$
Y_{out} = X \times C + Y_{in} \qquad (3)
$$

Figure 4 1-D IDCT using 4 MACs as a systolic array (labels recoverable from the figure: X's, Y's, cosine ROM, add & shift, sub & shift, path widths 13 and 15; 2D = 2-cycle delay)

Figure 5 Definition of the PE
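The PE operation of Eq. (3) and the way a chain of PEs accumulates one inner product can be illustrated as follows. This is a functional sketch only, assuming exact arithmetic: it does not model the cycle-level skewing of the data streams, the 19-bit inter-PE paths, or the truncation at each PE. `pe`, `inner_product_chain`, and `clock_period_ns` are hypothetical names; the 27.52 ns and 27 MHz figures are taken from the text.

```python
CLOCK_HZ = 27_000_000   # system clock from the text
MAC_DELAY_NS = 27.52    # synthesized worst-case MAC delay from the text


def pe(x, c, y_in):
    """One processing element, Eq. (3): Y_out = X * C + Y_in."""
    return x * c + y_in


def inner_product_chain(xs, cs):
    """Pass a partial sum through one PE per (input, coefficient) pair."""
    y = 0                     # a zero enters the first PE
    for x, c in zip(xs, cs):
        y = pe(x, c, y)       # each PE adds its product to the running sum
    return y


def clock_period_ns(hz):
    """Clock period in nanoseconds."""
    return 1e9 / hz


# Timing margin of one multiply-accumulate inside one clock period.
SLACK_NS = clock_period_ns(CLOCK_HZ) - MAC_DELAY_NS
```

With four PEs in the chain, `inner_product_chain` computes one row of a 4x4 matrix-vector product, which is the unit of work in the even/odd IDCT decomposition.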

The transpose RAM (T-RAM) in our design is a dual-port RAM, where read and write operations can be active at the same time. For the separable 2-D IDCT implementation, the transpose RAM is used to keep the intermediate results from the 1st 1-D IDCT unit, which will then be processed by the 2nd 1-D IDCT unit. The 1st 1-D IDCT unit writes to the T-RAM row-wise (or column-wise), and the 2nd 1-D IDCT unit reads from the T-RAM column-wise (or row-wise). The earliest time at which the 2nd IDCT unit can fetch data from the transpose RAM is determined in order to reduce the total latency of the 2-D IDCT. The read/write sequence is shown in Figure 6. The earliest time for the 2nd 1-D IDCT unit to read data from the transpose RAM is the 50th writing cycle of the 1st 1-D IDCT unit. Through correct timing and sequencing, the read and write operations of the two units chase each other without destroying the data in the T-RAM or reading incorrect data from it.

Figure 6 Transpose technique for read/write in T-RAM (legend recoverable from the figure: read cell (current block), write cell (current block), write cell (next block), data of current block, no data in cells; arrows mark the writing direction of the 1st 1-D IDCT unit and the reading direction of the 2nd 1-D IDCT unit)

Figure 7 Output timing for our 2-D IDCT unit (annotation recoverable from the figure: the point at which the 2nd IDCT unit can start)
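The row-column decomposition with a transpose buffer can be sketched functionally as below. The sketch assumes exact floating-point arithmetic and a plain 8x8 list standing in for the T-RAM; it does not model the dual-port access, the 50th-cycle start condition, or the alternating write direction between blocks. The names `idct8` and `idct2d_row_column` are hypothetical.

```python
import math


def idct8(X):
    """Direct 8-point 1-D IDCT (orthonormal scaling)."""
    out = []
    for n in range(8):
        acc = 0.0
        for k in range(8):
            ck = math.sqrt(0.5) if k == 0 else 1.0
            acc += ck * X[k] * math.cos((2 * n + 1) * k * math.pi / 16)
        out.append(0.5 * acc)
    return out


def idct2d_row_column(block):
    """2-D IDCT as two 1-D passes with a transpose in between.

    block[u][v] holds the coefficient with vertical frequency u and
    horizontal frequency v; the result is indexed [row][column].
    """
    # Pass 1: the first 1-D unit transforms each row and writes it
    # row-wise into the transpose buffer.
    tram = [idct8(row) for row in block]
    # Pass 2: the second 1-D unit reads the buffer column-wise.
    cols = [idct8([tram[r][c] for r in range(8)]) for c in range(8)]
    # cols[c][r] is the pixel at row r, column c.
    return [[cols[c][r] for c in range(8)] for r in range(8)]
```

Reading along the other axis is what performs the transposition, so no separate transpose step costs cycles; in hardware the remaining question is only when the second unit may safely start reading.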

3.3 Motion Compensation Unit

In MPEG video, I pictures are intra-picture coded, while P pictures and B pictures use block-based motion compensation to reduce inter-picture temporal redundancy. A decoder constructs a predicted macroblock pixel by pixel from one or two reference macroblocks in one or two previously decoded pictures. Decompressed prediction errors are then added to the predicted macroblocks to produce reconstructed macroblocks. It is more efficient to transmit the spatial and content differences between the predicted block and the reference block than to transmit the original coded block by itself. The spatial difference is specified by the motion vector in the bitstream, and the content difference (prediction error) is DCT compressed.

Figure 8 shows the motion compensation process in our decoder. Each motion vector from the VLD is sent to a simple address generation unit in order to generate the DRAM address from which to fetch a reference macroblock. The reference macroblock from DRAM is read out to the motion compensation (MC) unit, where temporal interpolation and half-pixel manipulations are performed if necessary. The output from the MC unit is combined with decompressed prediction errors from the IDCT unit to obtain the motion-compensated macroblock, which is then sent back to the external memory for further reference or display. For the 4:2:0 format, the color components Cb and Cr are interleaved and compacted in the DRAM to reduce memory accesses. Thus, the post-process shown in Figure 8 is needed before writing the blocks of reconstructed pixels back to the DRAM.
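The address generation and reconstruction steps described above can be sketched as follows. Several details are assumptions for illustration and are not specified in the text: motion vectors are taken to be in half-pel units, the reference picture is assumed to be stored in raster order with a given `stride`, and pixels are clipped to the 8-bit range after the prediction error is added. All function names are hypothetical.

```python
def split_half_pel(mv):
    """Split a half-pel motion-vector component into (integer offset, half flag).

    The arithmetic shift floors toward minus infinity, so -3 (i.e. -1.5
    pixels) becomes offset -2 with the half-pel flag set.
    """
    return mv >> 1, mv & 1


def reference_address(base, stride, mb_x, mb_y, mv_x, mv_y):
    """Address of the top-left reference pixel, plus the two half-pel flags.

    Assumes a raster-order frame store (an illustrative layout only).
    """
    dx, half_x = split_half_pel(mv_x)
    dy, half_y = split_half_pel(mv_y)
    x = mb_x * 16 + dx          # macroblocks are 16x16 pixels
    y = mb_y * 16 + dy
    return base + y * stride + x, half_x, half_y


def reconstruct(predicted, errors):
    """Add decompressed prediction errors to predicted pixels, clip to 0..255."""
    return [max(0, min(255, p + e)) for p, e in zip(predicted, errors)]
```

The half-pel flags returned by `reference_address` tell the MC unit whether spatial interpolation is needed for the fetched pixels.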

Figure 8 Motion compensation (labels recoverable from the figure: address, IDCT unit, motion vector, motion compensation, macroblock type, half-pel, full-pel, etc. (from controller))

A motion compensation unit for HDTV (MP@HL) was designed in [16]. In our MC unit design, we aim to minimize the hardware cost while also meeting the requirements of the NTSC and PAL systems at MP@ML. Our motion compensation unit (MCU) is based on a pipelined architecture. Figure 9 shows the architecture of the MCU. Two paths, beginning from the MC input buffer, serve the forward prediction and the backward prediction separately. The F-register set and B-register set, which are 4 pixels wide, serve as data pools to pre-load pixels from the MC buffer and arrange the output sequence for the next process. If a motion vector has half-pel precision, spatial interpolation is performed in the adder/shifter unit immediately following the register sets. In the case of bi-directional prediction, the results from the forward prediction and backward prediction paths are added and shifted (temporal interpolation) to obtain the reference pixels.

There are four pipeline stages in an MCU. The first two stages are loading stages, while the next two stages are computing stages. With the pixel-level pipeline, parallelism is achieved to obtain high throughput. There is a latency of two cycles between rows in a block (8x8 pixels). This results from loading a new row of data from the MC input buffer into the F-/B-registers. Figure 10 shows the timing diagram for our MC unit. The total number of cycles needed to process one macroblock (MB) (4 luminance 8x8 blocks and 2 chrominance 8x8 blocks) using one MCU is 481. This is lower than 667 cycles, the upper bound on processing cycles for one MB in the NTSC or PAL system at MP@ML.

Figure 9 The motion compensation unit (labels recoverable from the figure: MC input buffer, F-registers, B-registers, control logic (from controller), two adder & shifter units, adder/shifter, results from IDCT, adder, reconstructed pixel)

Figure 10 The timing for one MC unit (annotations recoverable from the figure: cycles, computing cycles, 8 pixels (one row of data) generated per interval)
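The add-and-shift operations of the MC datapath can be illustrated with the sketch below. The rounding offsets (+1 before a shift by one, +2 before a shift by two) follow common MPEG-2 practice and are an assumption here, since the text says only that values are added and shifted. The function names are hypothetical, and rows passed in must carry one extra pixel when horizontal interpolation is requested.

```python
def half_pel_row(row, next_row, half_x, half_y, width=8):
    """Spatially interpolate one row of `width` predicted pixels.

    row / next_row: reference pixels of the current and the following line
    (width + 1 entries are needed when half_x is set).
    """
    out = []
    for i in range(width):
        if half_x and half_y:
            # Average of the four neighbours, rounded.
            total = row[i] + row[i + 1] + next_row[i] + next_row[i + 1]
            out.append((total + 2) >> 2)
        elif half_x:
            out.append((row[i] + row[i + 1] + 1) >> 1)
        elif half_y:
            out.append((row[i] + next_row[i] + 1) >> 1)
        else:
            out.append(row[i])          # full-pel: pass through
    return out


def bidirectional(forward, backward):
    """Temporal interpolation: rounded average of the two predictions."""
    return [(f + b + 1) >> 1 for f, b in zip(forward, backward)]
```

Every case reduces to additions followed by a shift, which is why a small adder/shifter after each register set is sufficient.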

3.4 Evaluation

In this section, the performances of the 2-D IDCT unit and the MC unit are analyzed to determine the minimum number of basic components, i.e. the number of MACs for the 2-D IDCT unit and the number of MC units. For MP@ML with 4:2:0 format, the required processing rate is at least 40,500 MB/sec. At a clock rate of 27 MHz, 667 cycles is the maximum tolerable processing time per MB. Our 2-D IDCT and MC unit performances are summarized in Table 1.
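The 40,500 MB/sec and 667-cycle figures can be reproduced with a short calculation. The picture sizes and frame rates used below (720x480 at 30 frames/s and 720x576 at 25 frames/s) are assumptions based on common ITU-R 601 formats; the text itself states only the resulting macroblock rate. The helper name is hypothetical.

```python
CLOCK_HZ = 27_000_000


def macroblocks_per_second(width, height, frames_per_second):
    """Number of 16x16 macroblocks that must be decoded each second."""
    return (width // 16) * (height // 16) * frames_per_second


NTSC_RATE = macroblocks_per_second(720, 480, 30)   # 45 * 30 * 30
PAL_RATE = macroblocks_per_second(720, 576, 25)    # 45 * 36 * 25

# Cycle budget per macroblock at 27 MHz (about 666.7, quoted as 667).
CYCLES_PER_MB = CLOCK_HZ / NTSC_RATE

# Headroom of the two critical units against the rounded budget.
IDCT_SLACK = round(CYCLES_PER_MB) - 440
MC_SLACK = round(CYCLES_PER_MB) - 481
```

Under these assumptions both systems give the same macroblock rate, so a single cycle budget covers NTSC and PAL.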

                                   2-D IDCT unit     MC unit       Upper bound (MP@ML)
No. of MACs / No. of MCUs          8                 1             -
Process time per MB                440 cycles        481 cycles    667 cycles
                                   (2 x 1-D IDCT)

Table 1 The performances of our 2-D IDCT unit and MC unit

It can be seen from Table 1 that these function-specific modules satisfy the MP@ML performance requirement, provided that data and results for the modules can be fetched and sent back smoothly without delay. However, in practice, factors such as buffer underflow/overflow, bus/DRAM availability, and process synchronization may degrade the performance. The baseline controller, described in a later section, takes care of the synchronization problem. In the next section, we present our bus-monitoring model, which is used to determine suitable bus control schemes and buffer sizes that prevent frequent bus conflicts and buffer underflow/overflow, either of which may increase the processing time for MBs. Due to the unpredictable nature of the data and traffic, a simulator, with a carefully designed controller, is developed to test our model.

4. Bus-Monitoring Model

In the decoder architecture, the data from five different paths are routed to/from the DRAM via the bus. As shown in Figure 1, path1 is from the FIFO to the DRAM. Path2 is from the DRAM to the VLD. Path3 is from the DRAM to the motion compensation (MC) unit. Path4 is from the output buffer/adder, which forms the reconstructed macroblocks, to the DRAM. Path5 is from the DRAM to the display unit. Five different bus requests can be issued for these five paths, depending on the conditions of the I/O buffers connected to the bus. In general, the bus should be acquired either when the content of a buffer is over or below some predefined level, or when buffer overflow/underflow is about to occur. In our design, the level is set to 50% fullness or 1 MB of the buffer in each path. The bus administration, a special module in our simulator, receives the incoming request and determines the transfer size. The request is granted if the bus is free; otherwise, it is placed on the waiting list. All the requests are scheduled by the bus arbitration algorithm. Different priorities can be assigned to different paths. Different bus control schemes are generated and simulated by our model. Precautions are taken to prevent starvation.
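The request, waiting-list, and grant mechanism described above can be sketched as a small priority arbiter. This is a simplified illustration, not the simulator used in the paper: the class and function names are hypothetical, the priority order is the one reported for the priority scheme in the simulation section, and the aging rule used as a starvation guard is an assumption, since the text does not say how starvation is prevented.

```python
# Lower number = higher priority (order reported for the priority scheme).
PRIORITY = {"path5": 0, "path2": 1, "path3": 2, "path1": 3, "path4": 4}


def wants_bus(fill, capacity, drains_to_dram, threshold=0.5):
    """Decide whether a buffer should request the bus.

    Buffers that drain to DRAM ask when they are at or above the threshold;
    buffers filled from DRAM ask when they are at or below it.
    """
    level = fill / capacity
    return level >= threshold if drains_to_dram else level <= threshold


class BusArbiter:
    def __init__(self, priority, aging_limit=64):
        self.priority = priority
        self.aging_limit = aging_limit   # assumed starvation guard, in cycles
        self.waiting = []                # entries: (path, cycles, issued_at)
        self.busy_until = 0

    def request(self, path, cycles, now):
        """Place a transfer request on the waiting list."""
        self.waiting.append((path, cycles, now))

    def tick(self, now):
        """Grant the bus if it is free; return the granted path or None."""
        if now < self.busy_until or not self.waiting:
            return None
        oldest = min(self.waiting, key=lambda r: r[2])
        if now - oldest[2] >= self.aging_limit:
            chosen = oldest              # aged request overrides priority
        else:
            chosen = min(self.waiting,
                         key=lambda r: (self.priority[r[0]], r[2]))
        self.waiting.remove(chosen)
        self.busy_until = now + chosen[1]
        return chosen[0]
```

Counting the cycles during which `busy_until` lies in the future gives a bus utilization figure, which is the kind of quantity such a model is meant to report for each arbitration scheme.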

5. The Simulator and Controller Scheme

A software simulator has been developed to simulate and monitor the decoding process in the architecture. Each function-specific processing block is simulated according to the characteristics of its architecture and the nature of the input data (e.g. DC or AC coefficients, I or B or P pictures, types of macroblocks, CBP, etc.) for a typical MPEG-2 MP@ML bitstream. Architectural features, such as latency, pipeline stages, buffer sizes, etc., for each processing module are specified by parameters. Basically, this simulator is a skeleton of the controller.

We use a hierarchical and distributed controller design approach. In our model, distributed finite state machines (FSMs) are assigned to control individual paths on a cycle-by-cycle basis, and a centralized controller synchronizes the entire architecture on a macroblock-by-macroblock basis. The top controller is responsible for synchronization and communication with the different process lines. The baseline is more complicated than the others, because the functional units in the baseline have different processing rates. For example, the processing time of the IDCT for 1 MB may be 10 times that of the VLD, and if blocks in an MB are processed in a discontinuous way, then the required performance may not be met. Our baseline FSM is designed to shorten the gaps in the pipeline, i.e. to reduce the latency in order to increase throughput. Functional units cooperate with proper timing to ensure the correctness of data and to achieve the required performance. However, the operation along this line is frozen if buffer overflow/underflow occurs. The same applies to every process line.

The advantages of this hierarchical design are the ease of developing a higher level controller, the reduced complexity of controller design, the ease of debugging, and high reliability. This controller is designed with hardware implementation in mind, and can be mapped into a microcode controller.

6. Simulation Results

Simulation was performed for I, P, and B pictures for NTSC ITU-R 601 video at MP@ML. Figure 11 shows the bus bandwidth for I, P, and B field pictures. We synchronized the decoding process at the picture level to clarify bus activities for different types of pictures. The performance and bus utilization are also evaluated for different pictures with different buffer sizes, as shown in Table 2. In the priority scheme, path5 is given the highest priority, followed by path2, path3, path1 and, lastly, path4. In the simulation, the MB output buffer is fixed at 384 bytes, the FIFO channel buffer is fixed at 64 bits, and the display buffer is about 1 Kbytes. Results show that the size of the MC buffer determines the performance more than that of the VLD buffer, since the former encounters more buffer underflows, which may suspend the operations in the process lines. Results also show that a VLD input buffer of 16 bytes and an MC input buffer of 96 bytes will be sufficient for the priority scheme. Our results satisfy real-time MP@ML decoding.

Figure 11 Bandwidth for different pictures (vertical axis: % of bus utilization, 0% to 100%; horizontal axis: scanlines; panels: I, P, B)

7. Conclusion

In this paper, an efficient architecture for our MPEG-2 decoder design that achieves the performance for MP@ML is presented. Due to the variety of MB types and the unpredictability of memory accesses, a bus-monitoring model and software simulator are developed to ensure performance in practice. Simulation shows that the design meets the real-time MP@ML decoding performance requirement. Appropriate buffer sizes are also determined.


BUS SCHEME: priority (1 MC unit = 24 bytes, 1 VLD unit = 8 bytes)

Buffer size (units)   P-PICTURE                   B-PICTURE                   I-PICTURE
MC      VLD           decode_cycles*  bus_util.   decode_cycles   bus_util.   decode_cycles   bus_util.
2       2             686             73.77%      794             117.30%     465             33.57%
2       4             685             73.45%      788             116.44%     461             32.75%
2       8             685             73.30%      786             116.04%     461             32.35%
4       2             589             60.99%      641             91.02%      465             33.57%
4       4             589             60.68%      631             90.15%      461             32.75%
4       8             588             60.53%      638             89.80%      461             32.35%
6       2             573             58.23%      600             85.41%      465             33.57%
6       4             571             57.92%      596             84.61%      461             32.75%
6       8             571             57.76%      595             84.11%      461             32.35%
8       2             571             55.54%      579             80.41%      465             33.57%
8       4             570             55.20%      578             79.64%      461             32.75%
8       8             571             55.04%      575             79.19%      461             32.35%
12      2             557             53.92%      594             77.51%      465             33.57%
12      4             557             53.59%      588             76.18%      461             32.75%
12      8             556             53.42%      586             76.37%      461             32.35%
16      2             559             53.01%      604             75.42%      465             33.57%
16      4             560             52.73%      600             74.60%      461             32.75%
16      8             560             52.56%      585             74.13%      461             32.35%
32      2             580             51.40%      604             72.87%      465             33.57%
32      4             578             51.10%      613             72.02%      461             32.75%
32      8             578             50.94%      613             71.61%      461             32.35%

Table 2 Decode cycles per MB and bus utilization for different pictures under different buffer sizes (* Decoding cycles is the number of cycles to process one macroblock (MB))

References

1. ITU-T H.262, ISO/IEC 13818-2, Information Technology - Generic Coding of Moving Pictures and Associated Audio Information: Video, 1994.
2. Shaw-Min Lei and Ming-Ting Sun, "An Entropy Coding System for Digital HDTV Applications," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 1, No. 1, pp. 147-155, March 1991.
3. T. Demura, et al., "A Single-Chip MPEG2 Video Decoder LSI," IEEE ISSCC Digest of Tech. Papers, pp. 72-73, Feb. 1994.
4. J. L. Sicre and A. Leger, "Silicon Complexity of VLC Decoder vs Q-coder," JPEG N258, ISO/JTC/SC2/WG8, CCITT SG VII, Feb. 1989.
5. N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform," IEEE Trans. Computers, Vol. C-23, pp. 90-94, Jan. 1974.
6. M. Toyokura et al., "A Video DSP with a Macroblock-Level-Pipeline and a SIMD Type Vector-Pipeline Architecture for MPEG2 CODEC," IEEE Journal of Solid-State Circuits, Vol. 29, No. 12, pp. 1474-1481, Dec. 1994.
7. T. Araki et al., "Video DSP architecture for MPEG2 CODEC," Proc. ICASSP-94, Vol. 2, pp. 417-420, Apr. 1994.
8. E. Feig and S. Winograd, "Fast Algorithms for the Discrete Cosine Transform," IEEE Trans. on Signal Processing, Vol. 40, No. 9, pp. 2174-2193, Sep. 1992.
9. B. G. Lee, "A New Algorithm to Compute the Discrete Cosine Transform," IEEE Trans. Acoust., Speech and Signal Process., Vol. ASSP-32, No. 6, pp. 1243-1245, Dec. 1984.
10. C. Hung and T. H.-Y. Meng, "A Comparison of Fast Inverse Discrete Cosine Transform Algorithms," ACM Multimedia Systems, Vol. 2, pp. 204-217, 1994.
11. Pirsch, N. Demassieux, and W. Gehrke, "VLSI Architectures for Video Compression-A Survey," Proceedings of the IEEE, Vol. 83, No. 2, pp. 220-246, Feb. 1995.
12. Stanley A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, pp. 4-18, July 1989.
13. Totzek and F. Matthiesen, "Two-dimensional Discrete Cosine Transform with Linear Systolic Arrays," Proc. of Intl. Conf. on Systolic Arrays, Ireland, pp. 388-397, 1989.
14. Kim and W. Sung, "Optimum Wordlength Determination of 8x8 IDCT Architectures Conforming to the IEEE Standard Specifications," Proc. of 29th Asilomar Conference on Signals, Systems, and Computers, pp. 821-825, Nov. 1995.
15. H. Chen, C. H. Smith, and S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Trans. Communications, Vol. COM-25, No. 9, pp. 1004-1009, Sep. 1977.
16. Masaki, Y. Morimoto, T. Onoye, and I. Shirakawa, "VLSI Implementation of Inverse Discrete Cosine Transformer and Motion Compensator for MPEG2 HDTV Video Decoding," IEEE Trans. on Cir. and Sys. for Video Tech., Vol. 5, No. 5, pp. 387-395, Oct. 1995.