TWO DIMENSIONAL DISCRETE COSINE TRANSFORM USING BIT LEVEL SYSTOLIC STRUCTURES

Project Report

Submitted by

Balagopal G 0221156

Jagannath S 0221162

Nishanth U 0221374

Under the guidance of

Dr. Sumam David, Professor, Dept. of ECE, NITK, Surathkal.

In Partial Fulfillment of the Requirements for the Award of the Degree of BACHELOR OF ELECTRONICS AND COMMUNICATION ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL SRINIVASNAGAR – 575 025 KARNATAKA, INDIA May 2006


1. INTRODUCTION

Discrete orthogonal transforms (DOTs), which are used frequently in many applications including image and speech processing, have evolved quite rapidly over the last three decades. Typically, these applications require enormous computing power. However, a close examination of the algorithms used in such real-world applications (e.g. the discrete cosine transform (DCT), the discrete Fourier transform (DFT), and singular value decomposition (SVD)) reveals that many of the fundamental operations involve matrix or vector multiplication. Unfortunately, matrix-matrix multiplication algorithms have O(N^3) complexity and matrix-vector multiplication algorithms have O(N^2) complexity, making them computationally intensive for large-scale problems. Consequently, techniques that reduce this complexity are required.

In real applications, the data to be processed is often available as a stream of input values. In such cases, the throughput rate (determined by the time between two consecutive input data values) is more important than the latency (the time from the input of a set of data to its computed result). Therefore, there is a need to incorporate some level of pipelining in the overall system.

The use of systolic arrays in the design and implementation of high-performance digital signal processing equipment is now well established. Most of the research to date has concentrated on word-level systems, where the typical processor occupies at least a single chip. The systolic array concept can also be exploited at the bit level in the design of individual VLSI chips. A bit-level methodology exhibits a number of attractive features:

• The basic processing element is small (typically a gated full adder), and an entire array of these may be integrated on a single chip.
• The computation time for a single cell is small (typically 3-4 gate delays), so the overall throughput rate achievable in a given technology is very high.
• The highly regular structure of the circuits renders them comparatively easy to design and test.
• Regular and nearest-neighbour interconnections between the PEs, a high level of pipelinability, small chip area, and low power dissipation.

Conventional ROM multipliers may also be used, but the bottleneck is the ROM size, which increases exponentially with word length and transform size. In this project, DCT-oriented matrix-matrix and matrix-vector multiplication based on the Baugh-Wooley algorithm was implemented using a structural style of VHDL coding. A full-custom layout of the same design was manually drawn at the transistor level using the Magic tool and simulated using ngspice.


2. BAUGH-WOOLEY ALGORITHM

2.1 OVERVIEW

The algorithm specifies that all possible AND terms are created first and then sent through an array of half-adders and full-adders, with the carry-outs chained to the next most significant bit at each level of addition. By exploiting the properties of the two's complement number system, the Baugh-Wooley algorithm implements signed multiplication in almost the same way as unsigned multiplication.

2.2 MATHEMATICAL FORMULATION

Let X = [x_0, x_1, ..., x_{N-1}]^t be the N-point input vector and C be the N x N kernel matrix of an orthogonal transform. The transformed vector Y is given by

$$Y = CX = [y_0, y_1, \ldots, y_{N-1}]^t \qquad [1]$$

such that

$$y_m = \sum_{k=0}^{N-1} C_{mk}\, x_k \qquad [2]$$

Let the elements of the kernel matrix C and the data vector X be represented in the 2's complement code as

$$C_{mk} = -C_{mk}^{\,n-1}\, 2^{n-1} + \sum_{i=0}^{n-2} C_{mk}^{\,i}\, 2^{i} \qquad [3]$$

$$x_k = -x_k^{\,n-1}\, 2^{n-1} + \sum_{j=0}^{n-2} x_k^{\,j}\, 2^{j} \qquad [4]$$

where x_k^j and C_mk^i are the jth bit of x_k and the ith bit of C_mk, respectively (each being either zero or one), x_k^{n-1} and C_mk^{n-1} are the sign bits, and n is the word length. Substituting [3] and [4] into [2], we have

$$y_m = \sum_{k=0}^{N-1} \Big[ -C_{mk}^{\,n-1}\, 2^{n-1} + \sum_{i=0}^{n-2} C_{mk}^{\,i}\, 2^{i} \Big]\Big[ -x_k^{\,n-1}\, 2^{n-1} + \sum_{j=0}^{n-2} x_k^{\,j}\, 2^{j} \Big] \qquad [5]$$

Using the Baugh-Wooley algorithm, [5] may be expressed as

$$y_m = \sum_{k=0}^{N-1} \Big[ \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} 2^{\,i+j}\, C_{mk}^{\,i}\, x_k^{\,j} + 2^{\,2n-2}\, C_{mk}^{\,n-1}\, x_k^{\,n-1} + \Big( \sum_{j=0}^{n-2} \big(-2^{\,j}\big)\, C_{mk}^{\,n-1}\, x_k^{\,j} + \sum_{i=0}^{n-2} \big(-2^{\,i}\big)\, x_k^{\,n-1}\, C_{mk}^{\,i} \Big) 2^{\,n-1} \Big] \qquad [6]$$

$$y_m = \sum_{k=0}^{N-1} \Big[ \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} 2^{\,i+j}\, C_{mk}^{\,i}\, x_k^{\,j} + 2^{\,2n-2}\, C_{mk}^{\,n-1}\, x_k^{\,n-1} + \sum_{j=0}^{n-2} \big(1 - C_{mk}^{\,n-1} x_k^{\,j}\big)\, 2^{\,n+j-1} + \sum_{i=0}^{n-2} \big(1 - x_k^{\,n-1} C_{mk}^{\,i}\big)\, 2^{\,n+i-1} - 2\sum_{i=n-1}^{2n-3} 2^{\,i} \Big] \qquad [7]$$

$$y_m = \sum_{k=0}^{N-1} \Big[ \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} 2^{\,i+j}\, C_{mk}^{\,i}\, x_k^{\,j} + 2^{\,2n-2}\, C_{mk}^{\,n-1}\, x_k^{\,n-1} + \Big( \sum_{j=0}^{n-2} 2^{\,j}\, \big(C_{mk}^{\,n-1} x_k^{\,j}\big)' + \sum_{i=0}^{n-2} 2^{\,i}\, \big(x_k^{\,n-1} C_{mk}^{\,i}\big)' \Big) 2^{\,n-1} + 2^{\,n} - 2^{\,2n-1} \Big] \qquad [8]$$

The N-point discrete orthogonal transform output given by [8] may be computed by the systolic architecture described in the following section.

2.3 SYSTOLIC IMPLEMENTATION

From [8] it can be seen that the multiplication of C_mk and x_k, expressed in 2's complement representation, can be written in a form which involves only positive bit products. The multiplier design is based on the multiplication scheme shown in Fig. 1 for word length n = 4. The partial-product terms are formed by ANDing each multiplicand bit with each multiplier bit. The partial products x_k^j C_mk^3 and x_k^3 C_mk^i, for i, j = 0, 1, and 2, contain information concerning the sign of the operands. According to the Baugh-Wooley algorithm, these product terms containing the sign information are complemented to obtain the partial products. The final product is computed by adding a "1" to the fifth and eighth columns along with all the partial-product terms.

The proposed 2's-complement serial-parallel multiplier comprises a logic unit and an adder unit, shown in Fig. 2 (for n = 4). The logic unit consists of four AND gates, one NAND gate, four XOR gates, and an OR gate. The duration of a clock cycle is max{TA, TG}, where TA is the full-adder delay and TG is the total gate delay in performing an AND operation and an XOR operation followed by an AND/OR operation. The output of the multiplier is obtained from the adder unit by the carry-save, add-shift technique. Q and Q' are two control signals. In the first three and last four clock cycles Q = 0; in the fourth clock cycle Q = 1. Q' = 0 in the sixth, seventh and eighth clock cycles and 1 in the rest of the clock cycles. The extra "1" (necessitated by the Baugh-Wooley algorithm) for the fifth column is provided to the right-most adder through an OR gate with the help of the delayed Q signal. The extra "1" for the eighth column is provided to the left-most adder with the help of the Q' signal through an AND gate. Each bit of the multiplicand x_k is provided to the first row of AND and NAND gates of the multiplier simultaneously, while the bits of C_mk are stored and fed to the individual gates (Fig. 2). During the first clock cycle all flip-flops are reset, and the first bit of x_k and the control signal Q are fed to the multiplier. The control signal Q' is fed to an AND gate following the left-most XOR gate. Four zeros are appended to the left of the MSB of each x_k for input/output synchronization.
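As a worked illustration of this scheme (added here as a check; it uses the same values as the second test case of Fig. 8), consider (-1) x 6 with n = 4, i.e. a = 1111 and b = 0110:

$$ \underbrace{7 \times 6}_{\text{unsigned core } = 42} + \underbrace{a_3 b_3\, 2^{6}}_{=\,0} + \underbrace{\sum_{j=0}^{2}(a_3 b_j)'\,2^{\,j+3}}_{=\,8} + \underbrace{\sum_{i=0}^{2}(a_i b_3)'\,2^{\,i+3}}_{=\,56} + \underbrace{2^{4}+2^{7}}_{\text{extra 1's} = 144} = 250 $$

and 250 = 11111010 in binary, which is -6 in 8-bit 2's complement (the carry out of the eighth column is discarded).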


Fig. 1. 4 X 4 bit 2’s complement multiplication using the Baugh–Wooley algorithm.

Fig. 2. 4 X 4 bit 2's complement serial-parallel multiplier using the Baugh-Wooley algorithm (BWM).


3. SYSTOLIC ARCHITECTURE

Contemporary parallel architectures may be grouped into three classes based on structure: vector processors, multiprocessor systems, and array processors. Vector processors and multiprocessor systems belong to the domain of general-purpose computers, while most array processors are designed for special-purpose applications. Array processors, as a computing paradigm, are capable of meeting the real-time processing requirements of a variety of application domains. Locally interconnected computing networks such as systolic and wavefront processor arrays, due to their massive parallelism and regular data flow, allow efficient implementation of a large number of algorithms of practical significance, especially in the areas of image processing, signal processing, and robotics. A systolic system is defined as "a network of processors which rhythmically compute and pass data through the system". Systolic arrays, as a class of pipelined array architectures, display regular and modular structures that are locally interconnected to allow a high degree of pipelining and synchronized multiprocessing. The primary reasons for the use of systolic arrays in special-purpose processing are simple and regular design, concurrency and communication, and balanced computation and I/O.

As an example, consider a linear array. It has the following properties:
• It is a fixed connection network.
• The underlying graph is fixed.
• There are only local connections.
• The I/O locations are fixed.

At each step of a globally synchronous clock, each processor:
1. Receives inputs from its neighbours (or I/O).
2. Inspects its local memory.
3. Performs a local computation.
4. Updates its local memory.
5. Generates outputs for its neighbours.

Example: sorting. Each cell repeatedly does the following:
• Accept the left input.
• Compare the input with the stored value.
• Store the smaller value.
• Output the bigger number to the right.
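A behavioural sketch of one such sorting cell is given below. It is included only to illustrate the idea; the entity name, port names and the 8-bit data width are assumptions, and the cell is not part of the project's VHDL sources.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- One cell of the linear systolic sorter: keep the smaller of the stored
-- value and the incoming value, pass the larger one to the right neighbour.
entity sort_cell is
    Port ( clk, rst : in  std_logic;
           d_in     : in  unsigned(7 downto 0);    -- value arriving from the left
           d_out    : out unsigned(7 downto 0) );  -- value passed to the right
end sort_cell;

architecture Behavioral of sort_cell is
signal stored : unsigned(7 downto 0) := (others => '1');  -- acts as +infinity
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        stored <= (others => '1');
        d_out  <= (others => '1');
      elsif d_in < stored then
        d_out  <= stored;   -- output the bigger number to the right
        stored <= d_in;     -- store the smaller value
      else
        d_out  <= d_in;
      end if;
    end if;
  end process;
end Behavioral;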


Fig. 3. Sorting the input sequence 3, 5, 1, 2, 3 on the linear array (successive snapshots as the data stream passes through the cells).

3.1 PROCESSING ELEMENT

Each PE comprises a serial-parallel Baugh-Wooley multiplier (BWM) as shown in Fig. 4, a flip-flop (FF) for saving the carry bit, and a full adder that adds the result of the partial product to the result generated by the previous PE.

Fig. 4. Processing Element

3.2 MATRIX-VECTOR MULTIPLICATION

Equation [8] can be mapped onto the proposed architecture as shown in Fig. 5 for the case N = 4. Using the same PE structure as for matrix multiplication, the matrix elements aij are fed from the north in a parallel/serial fashion, bit by bit, LSB first (LSBF), while the vector elements bi are fed in parallel and remain fixed in their corresponding PE cells during the entire computation. Each bit of the final product of a PE is fed to the full adder of the preceding PE, so that the corresponding output bits of each PE are added to compute the desired output bit in LSBF fashion. During the first eight cycles the first inner product [C1] is computed in LSBF fashion. Then, during the second (third) eight cycles the inner product [C2] ([C3]) is computed. Finally, the fourth inner product [C4] is computed at the end of the fourth eight cycles. The entire computation can thus be carried out in 2nN clock cycles with a structure of only N PEs.
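As a quick check of the cycle count quoted above, for the report's own parameters n = 4 and N = 4:

$$ T = 2nN = 2 \times 4 \times 4 = 32 \ \text{clock cycles, i.e. four frames of } 2n = 8 \ \text{cycles each.} $$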

Fig. 5. Matrix Vector multiplier

3.3 MATRIX-MATRIX MULTIPLICATION

Equation [6] can be mapped onto the proposed architecture. Fig. 6 shows the architecture obtained for N = 4. It consists of sixteen identical processing elements (PEs). Each PE comprises a serial-parallel Baugh-Wooley multiplier (BWM) as shown in Fig. 2, a flip-flop (FF) for saving the carry bit, and a full adder that adds the result of the partial product to the result generated by the previous PE. The matrix elements bij are fed from the north in a parallel/serial fashion, bit by bit, LSB first (LSBF), while the matrix elements aij are fed in parallel and remain fixed in their corresponding PE cells during the entire computation. Each bit of the final product of a PE is fed to the full adder of the preceding PE, so that the corresponding output bits of each PE are added to compute the desired output bit in LSBF fashion. During the first eight cycles the four inner products [Ci1] (i = 1, 2, 3, 4) are computed in LSBF fashion. Then, during the second (third) eight cycles the four inner products [Ci2] ([Ci3]) are computed. Finally, the four inner products [Ci4] are computed and are available at the output buffer at the end of the fourth eight cycles.


It is worth mentioning that the array produces four coefficients of the matrix C every eight clock cycles based on the multiply-accumulate technique; the entire computation can therefore be carried out in 2nN clock cycles with a structure requiring N^2 PEs.

Fig. 6. Matrix Matrix multiplier


4. IMPLEMENTATION IN VHDL

The proposed architectures for matrix-vector and matrix-matrix multiplication were designed and simulated in VHDL. A structural approach was used, in which simple gates, D flip-flops and adders were used to build the RAM, the Baugh-Wooley multiplier and the processing element. Finally, the local memories and PEs were interconnected to obtain the desired structure. Individual components were simulated for different input patterns and the waveforms were noted. After interconnection, the architecture was simulated for various inputs. The maximum clock frequency and the number of slices were also noted.
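The structural code that follows instantiates simple leaf cells (andx, nandx, orx, xorx, dffx, fax) whose sources are not reproduced in this report. A sketch of how such cells could be written is given below; the port orders are assumptions inferred from the port maps in the following listings.

entity andx is
    Port ( a, b : in bit; y : out bit );
end andx;
architecture rtl of andx is
begin
  y <= a and b;   -- nandx, orx and xorx follow the same two-input pattern
end rtl;

entity dffx is
    Port ( d, clk : in bit; q : out bit );
end dffx;
architecture rtl of dffx is
begin
  process(clk)
  begin
    if clk'event and clk = '1' then
      q <= d;     -- positive-edge-triggered D flip-flop
    end if;
  end process;
end rtl;

entity fax is
    -- full adder; the assumed port order (a, b, cin, sum, cout) matches the
    -- way fax is instantiated in the BWM and PE listings below
    Port ( a, b, cin : in bit; sum, cout : out bit );
end fax;
architecture rtl of fax is
begin
  sum  <= a xor b xor cin;
  cout <= (a and b) or (b and cin) or (a and cin);
end rtl;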

4.1 BAUGH-WOOLEY MULTIPLIER

A serial-parallel Baugh-Wooley multiplier was designed using basic gates, adders and D flip-flops. It performs 4-bit by 4-bit signed multiplication with numbers in 2's complement form. The necessary control signals were applied and the design was simulated.

VHDL code:

entity bmw is
    Port ( a       : in bit_vector(3 downto 0);
           b,s1,s2 : in bit;
           clk     : in bit;
           prod    : out bit);
end bmw;

architecture Behavioral of bmw is
signal m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11 : bit;
signal w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11 : bit;
begin
p0:  nandx port map (b,a(3),w1);
p1:  andx  port map (b,a(2),w2);
p2:  andx  port map (b,a(1),w3);
p3:  andx  port map (b,a(0),w4);
p4:  dffx  port map (s1,clk,w5);
p5:  xorx  port map (s1,w1,w6);
p6:  xorx  port map (s1,w2,w7);
p7:  xorx  port map (s1,w3,w8);
p8:  xorx  port map (s1,w4,w9);
p9:  andx  port map (s2,w6,w10);
p10: orx   port map (w9,w5,w11);
p11: dffx  port map (w10,clk,m1);
p12: fax   port map (m1,w7,m4,m2,m3);
p13: dffx  port map (m3,clk,m4);
p14: dffx  port map (m2,clk,m5);
p15: fax   port map (m5,w8,m8,m6,m7);
p16: dffx  port map (m6,clk,m9);
p17: dffx  port map (m7,clk,m8);
p18: fax   port map (m9,w11,m11,prod,m10);
p19: dffx  port map (m10,clk,m11);
end Behavioral;

Synthesis Report:
Maximum Frequency: 281.770 MHz
Maximum combinational path delay: 10.277 ns
Number of Slices: 6 out of 192
Number of Slice Flip Flops: 7 out of 384
Number of 4 input LUTs: 11 out of 384

RTL Schematic:

Fig. 7. Baugh-Wooley multiplier (RTL Schematic)

Test cases shown in Fig. 8: 2 x 3 = 6 (0010 x 0011 = 0000 0110) and -1 x 6 = -6 (1111 x 0110 = 1111 1010).

Fig. 8. Test bench waveform for the BWM.
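The waveform values above can be cross-checked against a purely combinational reference model of the 4 x 4 Baugh-Wooley product. The sketch below is added for illustration only and is not part of the project's structural design; the entity name bw_ref is an assumption.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bw_ref is
    Port ( a, b : in  std_logic_vector(3 downto 0);
           p    : out std_logic_vector(7 downto 0) );
end bw_ref;

architecture behav of bw_ref is
begin
  process(a, b)
    variable acc : unsigned(7 downto 0);
  begin
    acc := (others => '0');
    -- unsigned core: partial products of the three magnitude bits
    for i in 0 to 2 loop
      for j in 0 to 2 loop
        if (a(i) and b(j)) = '1' then
          acc := acc + shift_left(to_unsigned(1, 8), i + j);
        end if;
      end loop;
    end loop;
    -- sign-by-sign product goes into the seventh column (weight 2^6)
    if (a(3) and b(3)) = '1' then
      acc := acc + to_unsigned(64, 8);
    end if;
    -- Baugh-Wooley: the sign-row partial products are complemented
    for k in 0 to 2 loop
      if (a(3) and b(k)) = '0' then
        acc := acc + shift_left(to_unsigned(1, 8), k + 3);
      end if;
      if (a(k) and b(3)) = '0' then
        acc := acc + shift_left(to_unsigned(1, 8), k + 3);
      end if;
    end loop;
    -- the two extra 1's in the fifth and eighth columns (weights 2^4 and 2^7)
    acc := acc + to_unsigned(16, 8) + to_unsigned(128, 8);
    p <= std_logic_vector(acc);   -- e.g. a = 1111, b = 0110 gives p = 11111010 (-6)
  end process;
end behav;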


4.2 RAM AND CONTROL UNIT

Four 4 x 4 bit RAMs are used to store the kernel matrix. The RAM is designed using shift registers with load. The control unit necessary for generating the signals for the Baugh-Wooley multiplier is implemented using shift registers. A mod-9 counter is used to synchronise all operations.

VHDL code:

entity ram is
    Port ( b1,b2,b3,b4      : in bit_vector(3 downto 0);
           clk,loadram,load : in bit;
           data             : out bit_vector(3 downto 0));
end ram;

architecture Behavioral of ram is
signal q1,q2,q3,q4 : bit_vector(3 downto 0);
begin
  process(clk)
  begin
    if clk='1' and clk'event then
      if loadram='1' then
        q1 <= b1; q2 <= b2; q3 <= b3; q4 <= b4;
      else
        if load='1' then
          q1 <= q2; q2 <= q3; q3 <= q4;
        end if;
      end if;
    end if;
    data <= q1;
  end process;
end Behavioral;
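The control unit itself is not listed in the report. A possible one-hot ring-counter sketch of the kind described (a mod-9 sequencer built from a shift register) is shown below; the entity name and the exact cycles in which Q and Q' are asserted are assumptions based on Section 2.3.

entity ctrl_gen is
    Port ( clk, rst : in bit;
           q_sig    : out bit;   -- Q  : '1' only in the fourth cycle of a frame
           qb_sig   : out bit ); -- Q' : '0' in cycles six to eight, '1' otherwise
end ctrl_gen;

architecture Behavioral of ctrl_gen is
signal ring : bit_vector(8 downto 0) := "000000001";  -- one-hot, nine states (mod-9)
begin
  process(clk)
  begin
    if clk='1' and clk'event then
      if rst='1' then
        ring <= "000000001";
      else
        ring <= ring(7 downto 0) & ring(8);   -- rotate: one state per clock
      end if;
    end if;
  end process;
  q_sig  <= ring(3);                              -- asserted in cycle 4
  qb_sig <= not (ring(5) or ring(6) or ring(7));  -- de-asserted in cycles 6-8
end Behavioral;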

4.3 PROCESSING ELEMENT

Each PE consists of a Baugh-Wooley multiplier and a full adder, with its carry-out fed back to the carry-in through a delay element. This is the basic element of the systolic array. The number of PEs indicates the dimension of the matrix/vector to be multiplied.

VHDL code:

entity PE is
    Port ( a            : in bit_vector(3 downto 0);
           b            : in bit_vector(3 downto 0);
           clk,load,cin : in bit;
           op           : out bit);
end PE;

architecture Behavioral of PE is
signal temp1,temp2,temp3,temp4,temp5 : bit := '0';
begin
  temp5 <= (not load) and temp2;
  temp4 <= (not load) and cin;
q0: bmw  port map (a,b,clk,load,temp1);
q1: fax  port map (temp1,temp4,temp5,op,temp3);
q2: dffx port map (temp3,clk,temp2);
end Behavioral;

Synthesis Report:
Maximum Frequency: 176.554 MHz
Maximum combinational path delay: 11.402 ns
Number of Slices: 11 out of 192
Number of Slice Flip Flops: 20 out of 384
Number of 4 input LUTs: 17 out of 384

RTL Schematic:

Fig. 9. Processing element (RTL Schematic)

4.4 MATRIX-VECTOR MULTIPLICATION

The designed architecture performs matrix-vector multiplication in a pipelined fashion at a very high rate. The transform coefficients are stored in RAM and the output is stored in a buffer. The design was tested for various inputs.


VHDL code:

entity mult is
    port( a4,a3,a2,a1     : in bit_vector(3 downto 0);
          b14,b13,b12,b11 : in bit_vector(3 downto 0);
          b24,b23,b22,b21 : in bit_vector(3 downto 0);
          b34,b33,b32,b31 : in bit_vector(3 downto 0);
          b44,b43,b42,b41 : in bit_vector(3 downto 0);
          clk,loadram     : in bit;
          c1,c2,c3,c4     : out bit_vector(7 downto 0));
end mult;

architecture Behavioral of mult is
begin
  process(clk,loadram)
  begin
    if clk='1' and clk'event and loadram='0' then
      count(8 downto 0) <= count(9 downto 1) after 10 ns;
    end if;
    load     <= count(0) and (not loadram);
    count(9) <= count(0);
  end process;

  tempo  <= load;
  tem5   <= (not load) and tem4;
  tem1_t <= (not load) and tem1;
  tem2_t <= (not load) and tem2;
  tem3_t <= (not load) and tem3;

w1:  pe port map (a4,b4,clk,load,'0',tem1);      -- pe4 (farthest)
w2:  pe port map (a3,b3,clk,load,tem1_t,tem2);   -- pe3
w3:  pe port map (a2,b2,clk,load,tem2_t,tem3);   -- pe2
w4:  pe port map (a1,b1,clk,load,tem3_t,tem4);   -- pe1 (nearest to c)

w5:  ram port map (b11,b12,b13,b14,clk,loadram,load,b1);
w6:  ram port map (b21,b22,b23,b24,clk,loadram,load,b2);
w7:  ram port map (b31,b32,b33,b34,clk,loadram,load,b3);
w8:  ram port map (b41,b42,b43,b44,clk,loadram,load,b4);

w10: op_ram port map (write,clk,tem5,mem);

  c4 <= mem(31 downto 24);
  c3 <= mem(23 downto 16);
  c2 <= mem(15 downto 8);
  c1 <= mem(7 downto 0);
end Behavioral;


Synthesis Report:
Maximum Frequency: 108.249 MHz
Maximum combinational path delay: 17.108 ns
Number of Slices: 90 out of 192
Number of Slice Flip Flops: 151 out of 384
Number of 4 input LUTs: 126 out of 384

RTL Schematic:

Fig. 10. Matrix-vector multiplier (RTL Schematic)


The test case in Fig. 11 multiplies the matrix
A = [ 3 4 1 0 ; 2 4 5 1 ; 4 3 2 2 ; 5 6 1 4 ]
by the vector x = [ 3 2 6 1 ]t, giving A.x = [ 23 45 32 37 ]t.

Fig. 11. Test bench waveform for the matrix-vector multiplier.

4.5 MATRIX-MATRIX MULTIPLICATION

The designed architecture performs matrix-matrix multiplication in a pipelined fashion at a very high rate. The transform coefficients are stored in RAM and the output is stored in a buffer. The design was tested for various inputs.

VHDL code:

entity mult is
    port( a14,a13,a12,a11 : in bit_vector(3 downto 0);
          a24,a23,a22,a21 : in bit_vector(3 downto 0);
          a34,a33,a32,a31 : in bit_vector(3 downto 0);
          a44,a43,a42,a41 : in bit_vector(3 downto 0);
          b14,b13,b12,b11 : in bit_vector(3 downto 0);
          b24,b23,b22,b21 : in bit_vector(3 downto 0);
          b34,b33,b32,b31 : in bit_vector(3 downto 0);
          b44,b43,b42,b41 : in bit_vector(3 downto 0);
          clk,loadram     : in bit;
          c11,c12,c13,c14 : out bit_vector(7 downto 0);
          c21,c22,c23,c24 : out bit_vector(7 downto 0);
          c31,c32,c33,c34 : out bit_vector(7 downto 0);
          c41,c42,c43,c44 : out bit_vector(7 downto 0));
end mult;

architecture Behavioral of mult is


begin
  process(clk,loadram)
  begin
    if clk='1' and clk'event and loadram='0' then
      count(8 downto 0) <= count(9 downto 1) after 10 ns;
    end if;
    load     <= count(0) and (not loadram);
    count(9) <= count(0);
  end process;

w11: pe port map (a14,b4,clk,load,'0',tem11);
w12: pe port map (a13,b3,clk,load,tem11_t,tem12);
w13: pe port map (a12,b2,clk,load,tem12_t,tem13);
w14: pe port map (a11,b1,clk,load,tem13_t,tem14);

w21: pe port map (a24,b4,clk,load,'0',tem21);
w22: pe port map (a23,b3,clk,load,tem21_t,tem22);
w23: pe port map (a22,b2,clk,load,tem22_t,tem23);
w24: pe port map (a21,b1,clk,load,tem23_t,tem24);

w31: pe port map (a34,b4,clk,load,'0',tem31);
w32: pe port map (a33,b3,clk,load,tem31_t,tem32);
w33: pe port map (a32,b2,clk,load,tem32_t,tem33);
w34: pe port map (a31,b1,clk,load,tem33_t,tem34);

w41: pe port map (a44,b4,clk,load,'0',tem41);
w42: pe port map (a43,b3,clk,load,tem41_t,tem42);
w43: pe port map (a42,b2,clk,load,tem42_t,tem43);
w44: pe port map (a41,b1,clk,load,tem43_t,tem44);

w15: ram port map (b11,b12,b13,b14,clk,loadram,load,b1);
w16: ram port map (b21,b22,b23,b24,clk,loadram,load,b2);
w17: ram port map (b31,b32,b33,b34,clk,loadram,load,b3);
w18: ram port map (b41,b42,b43,b44,clk,loadram,load,b4);

w10: op_ram port map (write,clk,ou1,mem1);
w20: op_ram port map (write,clk,ou2,mem2);
w30: op_ram port map (write,clk,ou3,mem3);
w40: op_ram port map (write,clk,ou4,mem4);

  c14 <= mem1(31 downto 24);
  c13 <= mem1(23 downto 16);
  c12 <= mem1(15 downto 8);
  c11 <= mem1(7 downto 0);
  c24 <= mem2(31 downto 24);
  c23 <= mem2(23 downto 16);
  c22 <= mem2(15 downto 8);
  c21 <= mem2(7 downto 0);
  c34 <= mem3(31 downto 24);
  c33 <= mem3(23 downto 16);
  c32 <= mem3(15 downto 8);
  c31 <= mem3(7 downto 0);
  c44 <= mem4(31 downto 24);
  c43 <= mem4(23 downto 16);
  c42 <= mem4(15 downto 8);
  c41 <= mem4(7 downto 0);
end Behavioral;

Synthesis Report:
Maximum Frequency: 109.314 MHz
Maximum combinational path delay: 16.9 ns
Number of Slices: 206 out of 432 (47%)
Number of Slice Flip Flops: 332 out of 864 (38%)
Number of 4 input LUTs: 287 out of 864 (33%)

RTL Schematic:

Fig. 12a. Matrix matrix multiplier (RTL Schematic)


Fig. 12b. Matrix matrix multiplier (RTL Schematic)


The test case in Fig. 13 multiplies
A = [ 2 -1 0 4 ; 1 3 -2 0 ; 0 0 1 3 ; -8 2 0 1 ]
by
B = [ 1 1 2 1 ; -1 0 0 1 ; 3 4 2 4 ; -2 0 1 0 ],
giving
A.B = [ -5 2 8 1 ; -8 -7 -2 -4 ; -3 4 5 4 ; -12 -8 -15 -6 ].

Fig. 13. Test bench waveform for the matrix-matrix multiplier.


5. PERFORMANCE IN FPGA IMPLEMENTATION

FPGAs provide reconfigurable hardware with flexible interconnects and field programmability, and they are widely used for the rapid prototyping of DSP and computer systems. Furthermore, recent advances in IC processing technology and innovations in FPGA architectures have made them highly suitable platforms for powerful computing. The ROM size of a ROM-multiplier-based architecture increases rapidly with the order of the DCT, so such an approach may be useful for a lower-order DCT but is not suitable for higher orders. Since the number of slices per PE is known, the total number of slices for the entire architecture can be predicted: if the matrix dimension is increased from M to N, the number of slices grows by (slices per PE) x (N^2 - M^2) for matrix-matrix multiplication and by (slices per PE) x (N - M) for matrix-vector multiplication. The variation of the maximum clock frequency and the number of slices with the transform length was also noted for both the matrix-matrix and matrix-vector structures. The following graphs were obtained for word length n = 4:

Fig. 14. Variation of clock frequency and number of slices with transform length, for the matrix-matrix and matrix-vector structures.
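As an illustration of this prediction (using the figure of about 11 slices per PE reported in Section 4.3 and ignoring memory and control overhead), increasing the transform length from M = 4 to N = 8 would be expected to add roughly

$$ 11 \times (8^{2} - 4^{2}) = 528 \ \text{slices (matrix-matrix)} \qquad\text{and}\qquad 11 \times (8 - 4) = 44 \ \text{slices (matrix-vector)}. $$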


6. IMPLEMENTATION IN MAGIC

The FPGA implementation gives great flexibility with respect to design time, parameterisable transform length and time to market, but it scores very low on area utilisation when compared with a full-custom design. The project therefore attempted to implement the same design in full-custom style with the free software Magic (version 7.3) on a Linux platform. Magic is a powerful EDA tool and is freely available. The design is composed of many basic gates and logic circuits; each of these was manually laid out for the application and optimised for area. In the process, we built our own cell library. Lambda design rules were followed (not the industry standard).

6.1 CELL LIBRARY

• AND (2 input)
• NAND (2 input)
• MULTIPLEXER (2:1)
• OR (2 input)
• EXOR (2 input)
• FLIP-FLOP WITH RESET
• FULL ADDER
• SHIFT REGISTER BASIC CELL (with shift and load signals)

6.2 DESIGN FLOW

• The design specification of each block was obtained from the algorithm; it specifies the inputs and the outputs.
• A SPICE netlist was built and simulated using standard TSMC 350 nm technology files for different input values.
• The SPICE netlist was simulated and tested until the specifications were met.
• The layout of the design was built using Magic. Metal 1 was used for interconnections wherever possible; Metal 2 was used to connect the different blocks, and it was customised to minimise contacts and poly runs.
• The SPICE netlist was extracted from the layout, including parasitics, without any constraints. This validates the design even for the worst-case parasitic delays and associated problems.
• The extracted netlist was simulated for different input combinations and clock periods, and the necessary modifications were made to eliminate glitches.
• The outputs were as expected.


Fig. 15. Snapshots of the cells in library


Fig. 16. Complete layout of the multiplier

The total transistor count for the matrix-matrix multiplier was 14514. Tools used: Magic version 7.3, SCMOS technology file version 8.2.8 (MOSIS scalable CMOS technology).


Fig. 17. Layout of the PE


6.3 SPICE NETLIST

The SPICE netlist shows the various components present and the terminals to which they are connected. An illustrative example of the netlist for a full adder is given below.

* SPICE3 file created from adder.ext - technology: scmos
M1000 a_19_n14# a vdd vdd pfet w=4u l=2u
+ ad=20p pd=18u as=100p ps=90u
M1001 axorb a_19_n14# a_28_n14# vdd pfet w=4u l=2u
+ ad=24p pd=20u as=40p ps=36u
M1002 b a axorb vdd pfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1003 a_19_n14# a 0 0 nfet w=4u l=2u
+ ad=20p pd=18u as=100p ps=90u
M1004 axorb a a_28_n14# 0 nfet w=4u l=2u
+ ad=24p pd=20u as=40p ps=36u
M1005 b a_19_n14# axorb 0 nfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1006 a_28_n14# b vdd vdd pfet w=4u l=2u
+ ad=0p pd=0u as=0p ps=0u
M1007 a_83_n14# axorb vdd vdd pfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1008 sum a_83_n14# a_92_n14# vdd pfet w=4u l=2u
+ ad=24p pd=20u as=40p ps=36u
M1009 cin axorb sum vdd pfet w=4u l=2u
+ ad=40p pd=36u as=0p ps=0u
M1010 a_28_n14# b 0 0 nfet w=4u l=2u
+ ad=0p pd=0u as=0p ps=0u
M1011 a_83_n14# axorb 0 0 nfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1012 sum axorb a_92_n14# 0 nfet w=4u l=2u
+ ad=24p pd=20u as=40p ps=36u
M1013 cin a_83_n14# sum 0 nfet w=4u l=2u
+ ad=40p pd=36u as=0p ps=0u
M1014 a_92_n14# cin vdd vdd pfet w=4u l=2u
+ ad=0p pd=0u as=0p ps=0u
M1015 a_147_n14# axorb vdd vdd pfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1016 cout a_147_n14# cin vdd pfet w=4u l=2u
+ ad=24p pd=20u as=0p ps=0u
M1017 a axorb cout vdd pfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1018 a_92_n14# cin 0 0 nfet w=4u l=2u
+ ad=0p pd=0u as=0p ps=0u
M1019 a_147_n14# axorb 0 0 nfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
M1020 cout axorb cin 0 nfet w=4u l=2u
+ ad=24p pd=20u as=0p ps=0u
M1021 a a_147_n14# cout 0 nfet w=4u l=2u
+ ad=20p pd=18u as=0p ps=0u
C0 vdd a 6.8fF
C1 vdd cin 2.5fF
C2 vdd b 2.1fF
C3 vdd a_19_n14# 2.1fF
C4 vdd a_147_n14# 2.1fF
C5 vdd a_83_n14# 2.1fF
C6 vdd a_28_n14# 5.0fF
C7 vdd a_92_n14# 5.0fF
C8 vdd axorb 10.7fF
C9 cout 0 2.2fF
C10 a_147_n14# 0 10.8fF
C11 cin 0 25.3fF
C12 sum 0 2.2fF
C13 a_92_n14# 0 10.9fF
C14 a_83_n14# 0 10.8fF
C15 0 0 18.8fF
C16 b 0 7.5fF
C17 axorb 0 61.0fF
C18 a_28_n14# 0 10.9fF
C19 a_19_n14# 0 10.8fF
C20 a 0 49.2fF
C21 vdd 0 15.9fF

.include tsmc350nm.txt

V1 vdd 0 dc 5
V2 a 0 pulse(0 5 0 0 0 100us 200us)
V3 b 0 pulse(0 5 0 0 0 50us 100us)
v4 cin 0 pulse(0 5 0 0 0 20us 40us)

.tran 10ns 200us
.control
run
setplot tran1
plot a,b+5,cin+10,sum+15,cout+20
.endc
.end


Fig. 18. Simulation result for matrix-vector multiplication


7. CONCLUSION AND FUTURE WORK

7.1 WORK DONE

Structural VHDL code for the matrix-vector and matrix-matrix multipliers was written, simulated and tested for various input patterns. The variation of the number of slices and the maximum clock frequency with transform length was studied. Our contribution was to build the design using a full-custom approach with a VLSI layout tool. Since the design was carried out at the transistor level and placement and routing were done manually, there was considerable scope for area minimisation and for increasing the maximum clock frequency. This full-custom design can be used as a module within any graphics processor for fast computation of the DCT.

7.2 SCOPE FOR FUTURE WORK

The same structure can be used for more complex transforms such as the DFT, with additional modules for complex arithmetic. As the precision of arithmetic processors advances, the word length increases; the same basic hardware can be duplicated, with modifications to the control signals (according to the Baugh-Wooley algorithm) to accommodate the increased word length. The modification needs to be made only within the PE; the interconnection between the PEs and the local memory remains the same.


8. BIBLIOGRAPHY AND REFERENCES

[1] Nayak, S., and Meher, P.: 'High throughput VLSI implementation of discrete orthogonal transforms using bit-level vector-matrix multiplier', IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., 1999, 46, (5), pp. 655-658.

[2] Amira, A., Bouridane, A., Milligan, P., and Belatreche, A.: 'Design of efficient architectures for discrete orthogonal transforms using bit level systolic structures', IEE Proc.-Comput. Digit. Tech., Vol. 149, No. 1, January 2002, DOI: 10.1049/ip-cdt:20020159.

[3] Birkner, J., Jian, J., and Smith, K.: 'High Performance Multipliers in QuickLogic FPGAs', QuickLogic Corporation.

[4] 6.896 Theory of Parallel Hardware, 2/4/04, L1.1, URL: www.mit.ocw.edu

[5] Roth, C.H., Jr.: 'Digital Systems Design Using VHDL'.

[6] Weste, N.H.E., and Eshraghian, K.: 'Principles of CMOS VLSI Design', second edition.

[7] URL: www.xilinx.com

