ECE 559 MOS VLSI DESIGN
VOLTAGE OVERSCALING BY UNBALANCED PIPELINING
CASE STUDY: 8-BIT WALLACE TREE MULTIPLIER

Raghu Vamsi Chavali, Rajkumar Chinnakonda Kubendran
Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN

Abstract—The objective of this term project is to understand supply voltage scaling by unbalanced pipelining without compromising speed and performance. As a case study, a pipelined 8-bit Wallace Tree Multiplier has been designed, implemented and compared with two different Vector Merging Adders, namely the Ripple Carry Adder and the Cascaded Carry Select Adder.

Keywords—VMA, RCA, CCSA, LPB, TSPC, SLP, LLP
I. INTRODUCTION
Scaling the supply voltage to achieve lower power dissipation is a well-known technique employed in many applications, but scaling the voltage without compromising speed is a challenging task. CRISTA [1] is a novel technique that uses unbalanced pipelining to scale the supply voltage after critical path isolation, as shown in Fig.1. We use this technique in our project work, taking the 8-bit Wallace Tree Multiplier [2] as a case to illustrate voltage scaling by unbalanced pipelining. The components of the multiplier architecture are described in detail in section II. Section III compares the Ripple Carry Adder and the Cascaded Carry Select Adder. The schematic and post-layout simulation results of the two multiplier architectures are presented in section IV. We conclude with our observations and learning from the project. The bus diagrams for the multiplier are presented in the Appendix.

Fig.1 Path Delay Distribution required for CRISTA

II. MULTIPLIER ARCHITECTURE
The top-level block diagram of the pipelined 8-bit Wallace Tree Multiplier is shown in Fig.2.

Fig.2 Block Diagram of the 8-bit Wallace Tree Multiplier
From the figure above, we can observe that the two 8-bit inputs, A0-A7 and B0-B7, are given to the input register bank. The latched inputs are then given to the Partial Product Generator, which feeds the Wallace Tree Reduction block, which comprises 4 stages. The reduced partial products are given to an intermediate register bank whose outputs are fed into the Vector Merging Adder (VMA). The final product, S0-S15, generated by the VMA is latched by the output register bank and given as output. This pipelined implementation consists of 2 or 3 stages depending on the choice of the VMA.

In our work, we have considered two topologies for the VMA, namely the Ripple Carry Adder (RCA) and the Cascaded Carry Select Adder (CCSA). If the VMA is chosen to be the CCSA, an additional register bank is needed within the Wallace Tree Reduction block to unbalance the pipeline. The clocks to the register banks are generated by the Latency Predictor Block (LPB) from the input global clock, using gating circuitry driven by the decoding logic output. Appropriate sizing was done to reduce delay and increase driving strength in order to minimize spurious glitches. A brief description of the building blocks is given below.

A. Full Adder and Half Adder
The Mirror Full Adder is used both in the Wallace Tree Reduction block and in the VMA. This topology has the inherent advantages of a lower gate count and a Carry that is generated faster than the Sum, which is crucial for the operational speed of the multiplier. The conventional Half Adder, with an XOR gate for the Sum and an AND gate for the Carry, is used. In addition, the inversion property of Full Adders is exploited in the partial product generation stage and in stage 1 of the Wallace Tree Reduction to reduce the number of inverters, which significantly reduces power consumption, delay and area.

B. Registers
Positive-edge-triggered True Single Phase Clock (TSPC) registers are used in the register banks. These registers have a lower gate count, lower delay and lower power consumption than conventional Mux-based registers. The Wallace Tree Multiplier was implemented using both register types, and the TSPC registers were found to give substantial power savings.

C. Latency Predictor Block
The decoding logic illustrated in [3] was used to isolate the critical paths from the non-critical paths in the Wallace Tree Multiplier (Fig.4). Bits P4-P7 were used to predict the occurrence of Short Latency Paths (SLP) and Long Latency Paths (LLP); this choice gives a nominal probability of clock gating at a reasonable implementation complexity.
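The datapath described in this section (partial product generation, Wallace Tree reduction, vector merging) can be illustrated with a small behavioral model. The sketch below is only a bit-level illustration, not the transistor-level design: the function names are hypothetical, the per-column compression schedule is an assumed greedy one that may differ from the schedule in Fig.2, and a plain integer addition stands in for the VMA.

```python
def half_adder(a, b):
    """Sum from XOR, carry from AND."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """One-bit full adder: the carry is the majority of the three inputs."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def partial_products(a, b, width=8):
    """columns[k] holds every AND term a_i & b_j with weight 2**k."""
    columns = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return columns


def reduce_once(columns):
    """One reduction layer (assumed schedule): full adders on groups of
    three bits per column, a half adder on a leftover pair."""
    out = [[] for _ in range(len(columns) + 1)]
    for k, col in enumerate(columns):
        i = 0
        while len(col) - i >= 3:
            s, c = full_adder(col[i], col[i + 1], col[i + 2])
            out[k].append(s)
            out[k + 1].append(c)
            i += 3
        if len(col) - i == 2:
            s, c = half_adder(col[i], col[i + 1])
            out[k].append(s)
            out[k + 1].append(c)
        elif len(col) - i == 1:
            out[k].append(col[i])
    # A carry out of the top column is always 0 for an in-range product.
    return out[:len(columns)]


def wallace_multiply(a, b, width=8):
    """Return (product, number of reduction layers used)."""
    columns = partial_products(a, b, width)
    layers = 0
    while max(len(col) for col in columns) > 2:
        columns = reduce_once(columns)
        layers += 1
    # Vector merging: add the two remaining rows (integer add as a stand-in).
    row0 = sum(col[0] << k for k, col in enumerate(columns) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(columns) if len(col) > 1)
    return (row0 + row1) & ((1 << (2 * width)) - 1), layers


if __name__ == "__main__":
    product, layers = wallace_multiply(0xB7, 0x5C)
    print(product == 0xB7 * 0x5C, layers)
```

In this model the reduction loop stops as soon as every column holds at most two bits, which is the point at which the intermediate register bank hands the two rows to the VMA.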
The rising transition of the decoding logic output has to be converted to an active-low pulse lasting one clock period. This pulse is the Enable signal used to gate the input global clock, and the gated clocks are fed to the register banks. The state machine (Fig.6) used for the LPB was derived from the state diagram and state table shown in Fig.5.

In    0  1  1
Out   1  0  1

Fig.5 State Diagram and State Table for the LPB

In order to drive a large number of registers, buffering is employed to increase the fan-out of the gated clocks.
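The conversion of a rising transition into a one-period active-low pulse can be sketched as a cycle-by-cycle model. This is a minimal sketch under stated assumptions: the decoder output is sampled once per clock cycle, the pulse appears in the same cycle as the rising transition, and gating is modeled as a logical AND of the clock with Enable. The function names are hypothetical and the actual circuit timing in Fig.6 may differ.

```python
def lpb_enable(decoded):
    """Per-cycle Enable values for a sampled decoder output.

    Enable is active low: it drops to 0 for exactly one cycle when the
    decoder output rises from 0 to 1, and is 1 otherwise (assumed timing).
    """
    enable = []
    prev = 0
    for d in decoded:
        rising = (d == 1 and prev == 0)
        enable.append(0 if rising else 1)
        prev = d
    return enable


def register_clocked(enable):
    """Assumed gating: the register bank sees a clock edge only when
    Enable is 1 (gated clock = clock AND Enable)."""
    return [e == 1 for e in enable]


if __name__ == "__main__":
    decoded = [0, 1, 1, 0, 1, 0]
    enable = lpb_enable(decoded)
    print(enable)                    # [1, 0, 1, 1, 0, 1]
    print(register_clocked(enable))
```

The first three cycles of the example reproduce the In/Out sequence of the state table: Enable stays high while the input is low, goes low for one cycle on the rising transition, and returns high even though the input remains high.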
Fig.6 State Machine and Gating Circuitry for the LPB

Fig.3 Latency Predictor Block

Fig.4 Pipeline Scheme for the Adder with gated clock

D. Vector Merging Adder
The VMA forms the final stage of the Wallace Tree Multiplier and is the bottleneck in the pipelined implementation. The imbalance in the pipeline arises because the delay of the VMA is longer than that of the Wallace Tree Reduction block. In our project work, we have considered two VMA topologies, namely,

i. Ripple Carry Adder
The RCA implementation is straightforward: the Carry propagates from LSB to MSB through a chain of Full Adders. The splitting of the SLP and LLP using the LPB is shown in Fig.7.

Fig.7 Critical Path for RCA

ii. Cascaded Carry Select Adder
The simple Carry Select Adder (CSA) implementation employs two parallel chains of Full Adders, one computed with the input Carry taken as 0 and the other with the input Carry taken as 1. The actual carry is used to select the correct output using multiplexer logic. The delay through the CSA is lower because both candidate results are computed before the actual carry arrives. The cascaded CSA splits the chain of Full Adders into smaller stages so that the delay of the VMA is reduced further. The splitting of the SLP and LLP using the LPB is shown in Fig.8.
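The two VMA topologies can be compared with a short bit-level model. The sketch below is illustrative only: the function names are hypothetical, bits are stored LSB first, and the default block split of (3, 3, 3, 3) is just one possible partition of a 12-bit adder. It models function, not delay.

```python
def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)


def ripple_carry_add(a_bits, b_bits, cin=0):
    """The carry ripples from LSB to MSB through a chain of full adders."""
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


def carry_select_add(a_bits, b_bits, blocks=(3, 3, 3, 3), cin=0):
    """Cascaded carry select: each block is computed for carry-in 0 and
    carry-in 1, and the incoming carry picks one result (the multiplexer)."""
    assert sum(blocks) == len(a_bits) == len(b_bits)
    sums, carry, pos = [], cin, 0
    for size in blocks:
        a_blk, b_blk = a_bits[pos:pos + size], b_bits[pos:pos + size]
        s0, c0 = ripple_carry_add(a_blk, b_blk, 0)  # chain assuming carry-in 0
        s1, c1 = ripple_carry_add(a_blk, b_blk, 1)  # chain assuming carry-in 1
        block_sum, carry = (s1, c1) if carry else (s0, c0)
        sums.extend(block_sum)
        pos += size
    return sums, carry


def to_bits(value, width):
    """Integer to LSB-first bit list."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """LSB-first bit list to integer."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    a, b = to_bits(0xABC, 12), to_bits(0x5F7, 12)
    rca_sum, rca_cout = ripple_carry_add(a, b)
    csa_sum, csa_cout = carry_select_add(a, b)
    print(from_bits(rca_sum) == from_bits(csa_sum), rca_cout == csa_cout)
```

Both functions produce the same sum and carry-out; the difference in hardware is that each carry-select block only waits for a multiplexer once its carry-in arrives, at the cost of duplicated full adder chains.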
Fig.8 Critical Path for CCSA

III. RCA VS CCSA
In this project work, we have compared the performance of two different VMAs in order to obtain a better understanding of the trade-offs involved in using unbalanced pipelining for voltage scaling. The two architectures have been designed, implemented and compared in terms of power dissipated, maximum frequency of operation and layout area.

The RCA is the simplest VMA implementation, with minimum layout area and lower power consumption but the highest delay. Using the RCA, we can achieve more voltage scaling due to the highly unbalanced pipeline and thus obtain higher power savings.

In comparison, the CCSA is more complicated to implement, with different variations possible based on how the adder is split into stages. For the 12-bit adder, we have implemented three different variations, {2,2,2,2,2,2}, {3,3,3,3} and {2,4,2,4}. Among these, the {3,3,3,3} split had the least delay but more gates. This split was chosen for comparison with the RCA since it gives a good understanding of the trade-off between performance and power consumption. Since the delay of the CCSA was comparable to that of the Wallace Tree Reduction block, an additional register bank was required between stages II and III of the Wallace Tree Reduction block in order to unbalance the pipeline. Hence, the CCSA design takes up almost 30% more layout area due to the additional full adders, multiplexers and register bank. Its power consumption is also higher, but it can operate at a higher frequency due to the additional stage in the pipeline.

IV. SIMULATION AND RESULTS
The schematic and layout (Fig.9, Fig.10) of the designs were implemented in Cadence using TSMC 0.3um technology. The generated Spice netlist was simulated using NanoTime for delay measurements, which are tabulated in Table I. The design frequencies were computed from these measurements using the following equation,

$T_{clk} \geq T_{setup} + T_{clk-Q} + T_{comb,max}$

The Nanosim tool was used to test the multiplier with 10000 random input vectors generated using Matlab. The power consumption and the maximum achievable voltage scaling and frequency scaling were measured and are tabulated in Table II. The power consumption was around 32mW when conventional Mux-based registers were used, whereas the TSPC-based registers consumed only 4mW. The bus diagrams that illustrate the functionality of the decoding logic, the LPB and the multiplier as a whole are presented in the Appendix.

TABLE I DELAY MEASUREMENTS

Block Name                               Delay
TSPC Register Propagation Delay          145ps
TSPC Register Setup Time                 200ps
Wallace Tree Reduction                   3.29ns
Wallace Tree reduction Stage 1 and 2     1.75ns
Wallace Tree reduction Stage 3 and 4     1.55ns
LPB                                      900ps
RCA                                      5.3ns
CCSA (3,3,3,3)                           2.88ns
CCSA (4,2,4,2)                           3.06ns
CCSA (2,2,2,2,2,2)                       3.25ns
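As a worked illustration of the clock period constraint, the sketch below evaluates Tsetup + Tclk-Q + Tcomb,max for a few blocks using the Table I delays. The function names and the choice of which block to treat as the combinational stage are assumptions made for illustration. The result is the conventional bound for a stage that must finish in a single cycle; it need not coincide with the design frequencies in Table II, since with unbalanced pipelining the long latency paths are given two clock cycles.

```python
# Delays taken from Table I, expressed in nanoseconds.
T_CLK_Q_NS = 0.145   # TSPC register propagation delay
T_SETUP_NS = 0.200   # TSPC register setup time

COMB_DELAY_NS = {
    "Wallace Tree Reduction": 3.29,
    "RCA": 5.3,
    "CCSA (3,3,3,3)": 2.88,
}


def min_clock_period_ns(t_comb_max_ns):
    """Smallest period satisfying Tclk >= Tsetup + Tclk-Q + Tcomb,max."""
    return T_SETUP_NS + T_CLK_Q_NS + t_comb_max_ns


def max_frequency_mhz(t_comb_max_ns):
    """Frequency corresponding to the minimum single-cycle clock period."""
    return 1e3 / min_clock_period_ns(t_comb_max_ns)


if __name__ == "__main__":
    for block, delay in COMB_DELAY_NS.items():
        print(f"{block}: Tclk >= {min_clock_period_ns(delay):.3f} ns, "
              f"f <= {max_frequency_mhz(delay):.1f} MHz")
```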
TABLE II DESIGN METRICS COMPARISON

Design Parameters                                 Wallace Tree Multiplier using RCA    Wallace Tree Multiplier using CCSA
Design frequency at VDD = 2.5V                    188MHz                               333MHz
Power at designed frequency and VDD = 2.5V        4.836mW                              11.96mW
Min VDD                                           1.8V                                 2.2V
Power at Min VDD and designed frequency           2.237mW                              8.76mW
Max frequency of operation at VDD = 2.5V          303MHz                               385MHz
% Power reduction at design frequency             53%                                  27%
Layout Area                                       437*327 = 0.143sq.mm                 473*430 = 0.203sq.mm
Post Layout design frequency at 2.5V              204MHz                               250MHz
Post Layout Power dissipation at 2.5V             7.337mW                              13.272mW
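The power reduction percentages can be reproduced from the measured power values in Table II. The sketch below also prints, for comparison, a first-order estimate that assumes dynamic power scales with the square of VDD at a fixed frequency; that estimate is an assumption added here for illustration, it ignores short-circuit and leakage power, and it is not a result of this project. The function names are hypothetical.

```python
def percent_reduction(power_nominal_mw, power_scaled_mw):
    """Measured saving relative to the power at nominal VDD."""
    return 100.0 * (power_nominal_mw - power_scaled_mw) / power_nominal_mw


def quadratic_estimate(vdd_nominal, vdd_scaled):
    """First-order saving if power were proportional to VDD**2 (assumption)."""
    return 100.0 * (1.0 - (vdd_scaled / vdd_nominal) ** 2)


# (power at 2.5V, power at Min VDD, Min VDD) from Table II.
DESIGNS = {
    "RCA": (4.836, 2.237, 1.8),
    "CCSA": (11.96, 8.76, 2.2),
}

if __name__ == "__main__":
    for name, (p_nom, p_min, v_min) in DESIGNS.items():
        print(f"{name}: measured {percent_reduction(p_nom, p_min):.1f}%, "
              f"VDD^2 estimate {quadratic_estimate(2.5, v_min):.1f}%")
```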
Fig.9 Layout of Wallace Tree Multiplier with RCA
Fig.10 Layout of Wallace Tree Multiplier with CCSA

CONCLUSION
The concept of using unbalanced pipelining to achieve voltage scaling at the same designed frequency was understood and implemented. Different design aspects of the multiplier, VMA and LPB were studied and compared. Voltage scaling from 2.5V to 1.8V was achieved for the Wallace Tree Multiplier using the RCA as VMA, with almost 53% power savings for a 5% throughput penalty. When the CCSA was used as VMA, we could achieve 27% power savings for the same throughput penalty. The architectures were finally implemented in layout, and the post-layout simulations match the specifications to a good extent. The design metrics were compared for the two architectures. We find that the RCA is more suitable for a Wallace Tree Multiplier in low-power applications with moderate performance requirements, whereas the CCSA is more suitable for applications that demand higher performance and can tolerate a higher power budget.

ACKNOWLEDGMENT
We would like to thank Dr. Kaushik Roy for providing us the opportunity to work on this project. We would also like to thank Mr. Kuntal Roy for providing us valuable guidance and support throughout the project.

REFERENCES
[1] Swaroop Ghosh et al., CRISTA, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 11, November 2007.
[2] J. M. Rabaey, Anantha Chandrakasan, Digital Integrated Circuits: A Design Perspective, 2nd ed., Prentice Hall of India, 2008.
[3] Debabrata Mohapatra et al., Low-Power Process-Variation Tolerant Arithmetic Units Using Input-Based Elastic Clocking, ISLPED'07, August 27–29, 2007, Portland, Oregon, USA.
APPENDIX
In the figures below, Group0 is the 8-bit input A0-A7, Group1 is the 8-bit input B0-B7 and Group2 is the 16-bit output S0-S15. The frequency of operation can be deduced from the markers, which show two clock period delays. When the enable signal goes high, we find that the clock to the register bank gets gated in time, before the next set of inputs is latched. We can observe that both one-cycle and two-cycle operations are computed correctly, thus verifying functionality.

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH RCA (SCHEMATIC)

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH RCA (EXTRACTED)

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH CCSA (SCHEMATIC)

BUS DIAGRAM FOR WALLACE TREE MULTIPLIER WITH CCSA (EXTRACTED)