ECE 559 MOS VLSI DESIGN
VOLTAGE OVERSCALING BY UNBALANCED PIPELINING
CASE STUDY: 8-BIT WALLACE TREE MULTIPLIER

Raghu Vamsi Chavali, Rajkumar Chinnakonda Kubendran
Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN
[email protected], [email protected]

Abstract: The objective of this term project is to understand supply voltage scaling by unbalanced pipelining without compromising on speed and performance. As a case study, a pipelined 8-bit Wallace Tree Multiplier with two different Vector Merging Adders, namely a Ripple Carry Adder and a Cascaded Carry Select Adder, has been designed, implemented and compared.

Keywords: VMA, RCA, CCSA, LPB, TSPC, SLP, LLP

I. INTRODUCTION
Scaling the supply voltage to achieve lower power dissipation is a well-known technique employed in many applications. However, voltage scaling without compromising on speed is a challenging task. CRISTA [1] is a novel technique that uses the concept of unbalanced pipelining in order to scale supply voltages after critical path isolation, as shown in Fig.1. We use this technique in our project work, taking the 8-bit Wallace Tree Multiplier [2] as a case study to illustrate voltage scaling by unbalanced pipelining. The components of the Multiplier Architecture are described in detail in Section II. In Section III, the comparison of the Ripple Carry Adder and the Cascaded Carry Select Adder is discussed. The schematic and post-layout simulation results of the two multiplier architectures implemented are presented in Section IV. We finally conclude with our observations and learning in the project. The bus diagrams for the Multiplier are presented in the Appendix.

Fig.1 Path Delay Distribution required for CRISTA

II. MULTIPLIER ARCHITECTURE
The top-level block diagram for the pipelined 8-bit Wallace Tree Multiplier is shown in Fig.2.

Fig.2 Block Diagram of the 8 bit Wallace Tree Multiplier

From Fig.2, we can observe that the two 8-bit inputs, A0-A7 and B0-B7, are given to the input register bank. The latched inputs are then given to the Partial Product Generator, which feeds the Wallace Tree Reduction block, which comprises 4 stages. The reduced partial products are given to an intermediate register bank whose outputs are then fed into the Vector Merging Adder (VMA). The final product, S0-S15, generated by the VMA is latched by the output register bank and given as output. This pipelined implementation consists of 2 or 3 stages depending on the choice of the VMA. In our work, we have considered two topologies for the VMA, namely, Ripple Carry Adder (RCA) and Cascaded Carry Select Adder (CCSA). If the VMA is chosen to be CCSA, we need an additional register bank within the Wallace Tree Reduction block to unbalance the pipeline. The clocks to the register banks are generated by the Latency Predictor Block (LPB) from the input global clock, using gating circuitry driven by the decoding logic output. Appropriate sizing was done to reduce delay and increase driving strength in order to minimize spurious glitches. A brief description of the building blocks is given below.

A. Full Adder and Half Adder
The Mirror Full Adder is used both in the Wallace Tree Reduction block and in the VMA. This topology has the inherent advantages of a lower gate count and a Carry output that is generated faster than the Sum, which is crucial for the operational speed of the Multiplier. The conventional Half Adder, with an XOR gate for the Sum and an AND gate for the Carry, is used. Also, the inversion property of Full Adders is exploited in the partial product generation stage and in stage 1 of the Wallace Tree Reduction to reduce the number of inverters, which significantly reduces power consumption, delay and area.

B. Registers
Positive-edge triggered True Single Phase Clock (TSPC) registers are used in the Register banks. These registers have a lower gate count, lower delay and lower power consumption than conventional Mux based registers. The Wallace Tree Multiplier was implemented using both types of register, and the TSPC registers were found to give substantial savings in power.

C. Latency Predictor Block
The decoding logic illustrated in [3] was used to isolate critical paths from the non-critical paths in the Wallace Tree Multiplier (Fig.4). The bits P4-P7 were used to predict the occurrence of Short Latency Paths (SLP) and Long Latency Paths (LLP); this choice gives a nominal probability of clock gating with reasonable implementation complexity.

The rising transition of the output of the decoding logic has to be converted to an active-low pulse lasting one clock period. This pulse is the Enable signal used to gate the input global clock, and the gated clocks are fed to the Register banks. The State Machine (Fig.6) used for the LPB was derived from the State diagram and State table shown in Fig.5.

In    0  1  1
Out   1  0  1

Fig.5 State Diagram and State Table for the LPB

In order to drive a large number of registers, buffering is employed to increase the fan-out of the gated clocks.
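The pulse behaviour described above can be illustrated with a small cycle-level sketch in Python. It is a behavioural model only, not the state machine of Fig.6: the function names, the assumed reset value of the previous input, and the AND-type gating are assumptions made for illustration. It reproduces the In/Out sequence of the state table (In = 0, 1, 1 gives Out = 1, 0, 1).

```python
def active_low_pulse(decoder_out):
    """Turn each rising edge of the decoder output into a one-cycle low pulse.

    decoder_out: list of 0/1 values, one per clock cycle.
    Returns the Enable value for each cycle (1 = idle, 0 = pulse).
    """
    enable = []
    prev = 0  # assumption: the decoder output is low before the first cycle
    for cur in decoder_out:
        rising = (cur == 1 and prev == 0)
        enable.append(0 if rising else 1)
        prev = cur
    return enable


def clock_reaches_registers(enable):
    """Assumed AND-type gating: the clock edge is passed only when Enable is 1."""
    return [e == 1 for e in enable]


if __name__ == "__main__":
    seq = [0, 1, 1]
    en = active_low_pulse(seq)
    print("In :", seq)
    print("Out:", en)
    print("Clock passed:", clock_reaches_registers(en))
```

In this model a single clock edge is suppressed after each rising transition, which is how a long-latency operation would be given a second cycle to complete.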

Fig.6 State Machine and Gating Circuitry for the LPB

D. Vector Merging Adder
The VMA forms the final stage of the Wallace Tree Multiplier and is the bottleneck in the pipelined implementation. The imbalance in the pipeline arises from the longer delay of the VMA compared to the Wallace Tree Reduction block. In our project work, we have considered two VMA topologies, namely,

i. Ripple Carry Adder
The RCA implementation is straightforward: the Carry propagates from LSB to MSB through a chain of Full Adders. The splitting of the SLP and LLP using the LPB is shown in Fig.7.

Fig.3 Latency Predictor Block

Fig.4 Pipeline Scheme for the Adder with gated clock

Fig.7 Critical Path for RCA

ii. Cascaded Carry Select Adder
The simple Carry Select Adder (CSA) implementation employs two parallel chains of Full Adders, one computed with the input Carry taken as 0 and the other with it taken as 1. The actual carry is then used to select the correct output through multiplexer logic. The delay through the CSA is lower since both candidate results are computed before the actual carry arrives. The cascaded CSA splits the chain of full adders into smaller stages so that the delay of the VMA is reduced further. The splitting of the SLP and LLP using the LPB is shown in Fig.8.
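The two adder topologies can be compared functionally with a bit-level sketch in Python. This is a logic model only, with no timing; the helper names are illustrative, and the 12-bit width and the block splits are taken from the values reported in Section III. Each carry-select block computes both candidate sums with ripple chains and lets the incoming carry pick one, which is the multiplexer step described above.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def to_bits(x, width):
    """Little-endian bit list (index 0 is the LSB)."""
    return [(x >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Carry ripples from LSB to MSB through a chain of full adders."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry


def cascaded_carry_select_add(a_bits, b_bits, split, cin=0):
    """Carry-select adder built from blocks whose sizes are given by `split`."""
    assert sum(split) == len(a_bits) == len(b_bits)
    out, carry, pos = [], cin, 0
    for size in split:
        a_blk, b_blk = a_bits[pos:pos + size], b_bits[pos:pos + size]
        sum0, c0 = ripple_carry_add(a_blk, b_blk, 0)  # chain assuming carry-in 0
        sum1, c1 = ripple_carry_add(a_blk, b_blk, 1)  # chain assuming carry-in 1
        # Multiplexer: the actual incoming carry selects the precomputed result.
        out.extend(sum1 if carry else sum0)
        carry = c1 if carry else c0
        pos += size
    return out, carry


if __name__ == "__main__":
    WIDTH = 12
    a, b = 0b101101110011, 0b011011011101
    for split in [(3, 3, 3, 3), (2, 4, 2, 4), (2, 2, 2, 2, 2, 2)]:
        bits, cout = cascaded_carry_select_add(to_bits(a, WIDTH), to_bits(b, WIDTH), split)
        print(split, from_bits(bits) + (cout << WIDTH) == a + b)
```

All splits produce the same sum as the ripple carry chain; they differ only in delay and gate count, which this model does not capture.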

Fig.8 Critical Path for CCSA

III. RCA VS CCSA
In this project work, we have compared the performance of two different VMAs in order to obtain a better understanding of the trade-offs involved in using unbalanced pipelining for voltage scaling. The two architectures have been designed, implemented and compared in terms of power dissipated, maximum frequency of operation and layout area.

The RCA is the simplest VMA implementation, with minimum layout area and lower power consumption but the highest delay. Using the RCA, we can achieve more voltage scaling because the pipeline is highly unbalanced, which results in higher power savings.

In comparison, the CCSA is more complicated to implement, with different variations possible based on how the adder is split into stages. For the 12 bit adder, we have implemented three different variations, {2,2,2,2,2,2}, {3,3,3,3} and {2,4,2,4}. Among these, the {3,3,3,3} split had the least delay but more gates. It was chosen for comparison with the RCA since it gives a good understanding of the trade-off between performance and power consumption. Since the delay of the CCSA was comparable to that of the Wallace Tree Reduction block, an additional Register bank was required between stages II and III of the Wallace Tree Reduction block in order to unbalance the pipeline. Hence, the CCSA takes up almost 30% more layout area due to the additional full adders, multiplexers and register bank. Its power consumption will also be higher, but it can operate at a higher frequency due to the additional stage in the pipeline.

IV. SIMULATION AND RESULTS
The schematic and layout (Fig.9, Fig.10) of the designs were implemented in Cadence using TSMC 0.3um technology. The Spice netlist generated was simulated using NanoTime for delay measurements, which are tabulated in Table I. The design frequencies were computed based on these measurements using the following equation,

$T_{clk} \geq T_{setup} + T_{clk-Q} + T_{comb,max}$

The Nanosim tool was used to test the multiplier with 10000 random input vectors generated using Matlab. The power consumption and the maximum voltage scaling and frequency scaling achievable were measured and are tabulated in Table II. The power consumption when conventional Mux based registers were used was around 32mW, whereas the design with TSPC based registers consumed only 4mW. The bus diagrams that illustrate the functionality of the decoding logic, the LPB and the Multiplier as a whole are presented in the Appendix.

TABLE I DELAY MEASUREMENTS

Block Name                              Delay
TSPC Register Propagation Delay         145ps
TSPC Register Setup Time                200ps
Wallace Tree Reduction                  3.29ns
Wallace Tree reduction Stage 1 and 2    1.75ns
Wallace Tree reduction Stage 3 and 4    1.55ns
LPB                                     900ps
RCA                                     5.3ns
CCSA (3,3,3,3)                          2.88ns
CCSA (4,2,4,2)                          3.06ns
CCSA (2,2,2,2,2,2)                      3.25ns
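As a worked illustration of the timing constraint Tclk >= Tsetup + Tclk-Q + Tcomb,max, the Python sketch below evaluates the single-cycle bound for a few blocks using the Table I delays. It assumes that the TSPC register propagation delay plays the role of Tclk-Q and treats each block in isolation. The results are therefore not expected to reproduce the design frequencies of Table II: in the unbalanced pipeline the long latency paths are allowed two clock cycles, which this sketch does not model.

```python
# Register timing from Table I, in nanoseconds.
T_SETUP_NS = 0.200   # TSPC register setup time
T_CLK_Q_NS = 0.145   # TSPC register propagation delay (assumed to be Tclk-Q)

# Combinational delays from Table I, in nanoseconds.
COMB_DELAY_NS = {
    "Wallace Tree Reduction": 3.29,
    "RCA": 5.3,
    "CCSA (3,3,3,3)": 2.88,
    "CCSA (4,2,4,2)": 3.06,
    "CCSA (2,2,2,2,2,2)": 3.25,
}


def min_clock_period_ns(t_comb_max_ns):
    """Smallest clock period that satisfies the setup constraint in one cycle."""
    return T_SETUP_NS + T_CLK_Q_NS + t_comb_max_ns


def max_frequency_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    for block, delay in COMB_DELAY_NS.items():
        period = min_clock_period_ns(delay)
        print(f"{block:22s} Tclk >= {period:.3f} ns  (<= {max_frequency_mhz(period):.1f} MHz)")
```

For the RCA, the bound is 5.645 ns, or roughly 177 MHz, if every addition had to complete in a single cycle.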

TABLE II DESIGN METRICS COMPARISON

Design Parameters                              Wallace Tree Multiplier     Wallace Tree Multiplier
                                               using RCA                   using CCSA
Design frequency at VDD = 2.5V                 188MHz                      333MHz
Power at designed frequency and VDD = 2.5V     4.836mW                     11.96mW
Min VDD                                        1.8V                        2.2V
Power at Min VDD and designed frequency        2.237mW                     8.76mW
Max frequency of operation at VDD = 2.5V       303MHz                      385MHz
% Power reduction at design frequency          53%                         27%
Layout Area                                    437*327 = 0.143sq.mm        473*430 = 0.203sq.mm
Post Layout design frequency at 2.5V           204MHz                      250MHz
Post Layout Power dissipation at 2.5V          7.337mW                     13.272mW
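The power reduction and layout area entries of Table II can be cross-checked with a few lines of Python. This is an arithmetic sketch only: the function names are illustrative, the layout dimensions are assumed to be in micrometres (which is what the quoted areas in sq.mm imply), and the last function is the textbook first-order estimate that dynamic power scales with the square of VDD at a fixed frequency, included for comparison rather than taken from the measurements.

```python
def power_reduction_percent(p_nominal_mw, p_scaled_mw):
    """Measured saving when moving from nominal VDD to minimum VDD."""
    return 100.0 * (p_nominal_mw - p_scaled_mw) / p_nominal_mw


def layout_area_mm2(width_um, height_um):
    """Area in mm^2 from dimensions in micrometres (assumed unit)."""
    return width_um * height_um / 1e6


def quadratic_estimate_percent(vdd_nominal, vdd_scaled):
    """First-order estimate: dynamic power proportional to VDD^2 at fixed frequency."""
    return 100.0 * (1.0 - (vdd_scaled / vdd_nominal) ** 2)


if __name__ == "__main__":
    print("RCA  saving  : %.1f %%" % power_reduction_percent(4.836, 2.237))
    print("CCSA saving  : %.1f %%" % power_reduction_percent(11.96, 8.76))
    print("RCA  estimate: %.1f %%" % quadratic_estimate_percent(2.5, 1.8))
    print("CCSA estimate: %.1f %%" % quadratic_estimate_percent(2.5, 2.2))
    print("RCA  area    : %.3f mm^2" % layout_area_mm2(437, 327))
    print("CCSA area    : %.3f mm^2" % layout_area_mm2(473, 430))
```

The measured savings work out to about 53.7% and 26.8%, in line with the 53% and 27% entries, while the quadratic estimate gives about 48% and 23% for the same voltage steps.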

Fig.9 Layout of Wallace Tree Multiplier with RCA

Fig.10 Layout of Wallace Tree Multiplier with CCSA

CONCLUSION
The concept of using unbalanced pipelining to achieve voltage scaling at the same designed frequency was understood and implemented. Different design aspects of the multiplier, VMA and LPB were studied and compared. Voltage scaling from 2.5V to 1.8V was achieved for the Wallace Tree multiplier using the RCA as VMA, with almost 53% power savings for a 5% throughput penalty. When the CCSA was used as VMA, we could achieve 27% power savings for the same throughput penalty. The architectures were finally implemented in layout, and the post-layout simulations match the specifications to a good extent. The design metrics were compared for the two architectures. We find that the RCA is more suitable for a Wallace Tree Multiplier in low-power applications with moderate performance requirements, whereas the CCSA is more suitable for applications that demand higher performance and can tolerate a higher power budget.

ACKNOWLEDGMENT
We would like to thank Dr. Kaushik Roy for providing us the opportunity to work on this project. We would also like to thank Mr. Kuntal Roy for providing us valuable guidance and support throughout the project.

REFERENCES
[1] Swaroop Ghosh et al., CRISTA, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 11, November 2007.
[2] J. M. Rabaey and Anantha Chandrakasan, Digital Integrated Circuits: A Design Perspective, 2nd ed., Prentice Hall of India, 2008.
[3] Debabrata Mohapatra et al., Low-Power Process-Variation Tolerant Arithmetic Units Using Input-Based Elastic Clocking, ISLPED'07, August 27-29, 2007, Portland, Oregon, USA.

APPENDIX
In the figures below, Group0 is the 8-bit input A0-A7, Group1 is the 8-bit input B0-B7 and Group2 is the 16-bit output S0-S15. The frequency of operation can be deduced from the markers, which show two clock period delays. When the enable signal goes high, we find that the clock to the register bank gets gated in time, before the next set of inputs is latched. We can observe that both one-cycle and two-cycle operations are computed correctly, thus verifying functionality.

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH RCA (SCHEMATIC)

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH RCA (EXTRACTED)

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH CCSA (SCHEMATIC)

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH CCSA (EXTRACTED)
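Products read off bus diagrams such as these can be cross-checked against a software reference model. The Python sketch below is one such model, written under stated simplifications: it works on whole rows with 3:2 compressors only (the implemented design also uses half adders), it models the VMA with ordinary integer addition, and it ignores pipelining and clock gating. All function names are illustrative. With 8 partial-product rows the row count goes 8, 6, 4, 3, 2, which matches the 4 reduction stages described in Section II.

```python
import random


def partial_products(a, b, width=8):
    """Row i is A ANDed with bit i of B, shifted left by i positions."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def carry_save_add(x, y, z):
    """3:2 compression: a full adder per bit position, no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1
    return s, c


def wallace_reduce(rows):
    """Compress rows in groups of three until at most two rows remain."""
    stages = 0
    while len(rows) > 2:
        grouped = len(rows) - len(rows) % 3
        nxt = []
        for i in range(0, grouped, 3):
            nxt.extend(carry_save_add(*rows[i:i + 3]))
        nxt.extend(rows[grouped:])  # leftover rows pass through unchanged
        rows = nxt
        stages += 1
    return rows, stages


def wallace_multiply(a, b, width=8):
    """Returns (product, number_of_reduction_stages)."""
    rows, stages = wallace_reduce(partial_products(a, b, width))
    product = sum(rows) & ((1 << (2 * width)) - 1)  # VMA modelled as plain addition
    return product, stages


def check_random_vectors(count=10000, seed=0):
    """Compare the model against a * b for random 8-bit input pairs."""
    rng = random.Random(seed)
    for _ in range(count):
        a, b = rng.randrange(256), rng.randrange(256)
        if wallace_multiply(a, b)[0] != a * b:
            return False
    return True


if __name__ == "__main__":
    print(wallace_multiply(255, 255))
    print(check_random_vectors())
```

The random check mirrors the 10000-vector test described in Section IV, but it exercises only the arithmetic, not the circuit timing.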
