IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 2, FEBRUARY 2011


Design and Optimization of Power-Gated Circuits With Autonomous Data Retention

Jun Seomun and Youngsoo Shin, Senior Member, IEEE

Abstract—Power gating has been widely employed to reduce subthreshold leakage. Data retention elements (flip-flops and isolation circuits) are used to preserve circuit states during standby mode if the states are needed again after wake-up. These elements must be controlled by an external power management unit, which requires a network of control signals implemented with extra wires and buffers. A power-gated circuit with autonomous data retention (APG) is proposed to remove the overhead involved in control signals. Retention elements in APG derive their control by detecting the rising potential of virtual ground rails when power gating starts, i.e., they control themselves without explicit control signals. Design of retention elements for APG is addressed to facilitate safe capturing of circuit states. Experiments with 65-nm technology demonstrate that, compared to standard power gating, total wirelength and average wiring congestion are reduced by 8.6% and 4.1% on average, respectively, at a cost of a 6.8% area increase. In order to charge virtual ground rails quickly, a pMOS switch driven by a short pulse is employed to provide charge directly to virtual ground. This helps retention elements avoid short-circuit current while making the transition to standby mode. The optimization procedure for sizing the pMOS switch and deciding the pulse width is addressed, and assessed with 65-nm technology. Experiments show that, compared to standard power gating, APG reduces the delay to enter and exit the standby mode by 65.6% and 28.9%, respectively, with the corresponding energy dissipation during the period cut by 46.1% and 36.5%. Standby mode leakage power consumption is also reduced by 15.8% on average.

Index Terms—Application-specific integrated circuit (ASIC), data retention, leakage, low power, power gating.

I. INTRODUCTION

LEAKAGE power has been continuously growing with every process generation, and is now responsible for a high proportion of total power consumption, as much as 40% to 50% in many technologies [1]. Leakage current comes from many sources [2], but subthreshold leakage takes the largest proportion in many CMOS technologies these days. Power gating [3]–[5] is the most popular circuit technique to suppress subthreshold leakage. It consists of gating, or cutting off, a circuit from its power supply rails during standby mode. When a footer, located between a logic block and $V_{SS}$, is turned off, the voltage at virtual ground ($V_{SSV}$), where the footer has its drain, rises slowly until it reaches a steady-state potential, which is usually close to $V_{DD}$. Similarly, if a header is used and if it is turned off, the voltage at virtual $V_{DD}$ ($V_{DDV}$) slowly goes down to a steady-state potential, which is close to $V_{SS}$.

Due to the collapse of either $V_{DDV}$ or $V_{SSV}$ during standby, the circuit states that are represented by sequential elements and primary outputs have to be captured in advance and preserved. This is typical in such circuits as peripheral circuitry or a processor, since they have plenty of residual states, so that the amount of bus traffic required to reload them, in case they are lost, is excessive [6]. There are two approaches to retaining circuit states [6]. The first approach is to use a scan chain, whose original function is manufacturing test; this approach, however, is too slow and causes substantial switching power while the states are shifted out and shifted in; it is thus used only in applications where the sleep period is very long most of the time. The second approach relies on a dedicated circuit element for state retention. A sequential element, especially a flip-flop, that is capable of retaining a state is called a retention flip-flop; extra circuitry that is placed at each primary output to hold the output value is called an isolation circuit.

There are several implementations of retention flip-flops and isolation circuits [7]–[13]. However, they invariably require explicit control from an external controller, which is also responsible for controlling the footer or header, in addition to being a main source of standby leakage [14]. For circuits with many sequential elements and primary outputs, the wiring of these control signals can translate into a significant increase of total wirelength and buffers, which is similar to the overhead of a clock network [15]. The increased wiring can lead to wiring congestion of other signals, which further increases total wirelength. It has recently been reported that the total wirelength of power-gated sequential circuits can increase by 29% to 60% [15].

Manuscript received February 07, 2009; revised July 06, 2009. First published November 17, 2009; current version published January 21, 2011. This work was supported in part by Samsung Electronics and the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (NO. R01-2007-000-20891-0). The authors are with the Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea (e-mail: [email protected]; youngsoo@ee.kaist.ac.kr). Digital Object Identifier 10.1109/TVLSI.2009.2033356
Virtual power/ground rails clamp (VRC) [16] employs a diode in parallel with the footer; since the diode clamps $V_{SSV}$ to its turn-on voltage, state-retention elements are not needed; this scheme, however, comes at a cost of large leakage current in standby mode. Dynamic state-retention flip-flop [17] eliminates the sleep control signal by preserving the state in internal DRAM cells; the retention time, however, is very short.

A. Motivational Example

We can see the quantitative effect of this wiring of control signals for retention flip-flops and isolation circuits on total wirelength by considering s38417, which is one of the ISCAS benchmark circuits. It consists of 3333 combinational gates, 1564 flip-flops, and 106 primary outputs, after mapping to a commercial 65-nm gate library. To power-gate this circuit, we connected an appropriately sized footer [18], inserted isolation circuits [10], and replaced all the flip-flops with retention

1063-8210/$26.00 © 2009 IEEE

flip-flops [13]. We then fixed footer cells along the left- and right-hand sides of the placement region, which was followed by automatic placement and routing [19]. The total wirelength was 127 mm; we needed 67 buffers for wiring of control signals to retention flip-flops and isolation circuits. Next, we assumed that it is possible to remove control signals for all the retention flip-flops and isolation circuits, and ran automatic routing again. The total wirelength was reduced to 103 mm (19% saving); the 67 buffers were no longer needed. The same experiment was repeated for four other benchmark circuits, and the total wirelength was compared for each circuit with and without control signals, as illustrated in Fig. 1(a). Fig. 1(b) shows the wiring of control signals in s15850 as well as the buffers, which shows how significant the wiring is.

Fig. 1. (a) Comparison of total wirelength with and without control signals to retention flip-flops and isolation circuits. (b) The net of control signals with buffers inserted after detailed routing of s15850.

B. Our Approach and Paper Organization

In this paper, we address the question of how to avoid the wiring of these control signals including extra buffers, while we continue to preserve the states of flip-flops and primary outputs. Our main contributions are as follows.
• A new power-gated circuit scheme, which is capable of retaining circuit states without external control, which we call power-gated circuit with autonomous data retention (APG) (see Section II).
• Design of retention flip-flop and isolation circuit for APG (see Section III).
• Optimization of APG by employing a sleep charge pump driven by a pulse generator, which gives the benefit of reduced transition delay and transition energy, and solving the problem of sizing the switch and deciding the pulse width (see Section IV).

The remainder of this paper is organized as follows. In Section II, we give an overview of the concept of APG. In Section III, we address the design of flip-flop and isolation circuit with autonomous data retention, two main components of APG, followed by presentation of their implementation in 65-nm technology. Design considerations for sizing a sleep charge pump and pulse width are addressed in Section IV. The experimental results from several benchmark circuits are presented in Section V; we draw conclusions in Section VI.

II. POWER-GATED CIRCUITS WITH AUTONOMOUS DATA RETENTION (APG)

Figs. 2(a) and (b) illustrate the configuration of standard power gating (PG) and APG, respectively. In both schemes, the combinational gates are located between $V_{DD}$ and $V_{SSV}$; $V_{SSV}$ is controlled by a footer, which is turned on during active mode and turned off during standby mode. APG replaces retention flip-flops and isolation circuits by autonomous retention flip-flops (ARF) and autonomous retention isolation (ARI). The block marked ARF is similar to the conventional retention flip-flop, and thus can preserve data even when the footer is turned off; this is made possible by connecting a slave latch, which is responsible for storing current data (or state), directly to $V_{DD}$ and $V_{SS}$ while the remainder of the flip-flop is connected to $V_{DD}$ and $V_{SSV}$, same as the conventional retention flip-flop. However, the sleep control input is connected to local $V_{SSV}$ rather than to an external power management unit (PMU), thus eliminating any wires and buffers between flip-flops and PMU [see Fig. 1(b)]. Similarly, the block marked ARI is similar to the conventional isolation circuit, and can thus preserve the primary output even when the footer is turned off. This block is also connected to local $V_{SSV}$, which again eliminates any wires and buffers between itself and PMU. Design of ARF and ARI will be discussed in Section III.

When the footer is turned off, $V_{SSV}$ rises towards $V_{DD}$ but very slowly. For example, in s1423, which is one of the ISCAS benchmark circuits with 318 gates including flip-flops in a gate library of 65-nm CMOS, $V_{SSV}$ rises to 1.15 V (96% of the 1.2 V $V_{DD}$) in 100 μs. If we directly use $V_{SSV}$, which takes too long to rise, for inputs of ARF and ARI, the data cannot be properly captured and preserved; a large amount of short-circuit current may flow in ARF and ARI during transition. This is resolved by employing a large pMOS switch located between $V_{DD}$ and $V_{SSV}$, which we call sleep charge pump (SCP), as shown in Fig. 2; the SCP is driven by a pulse generator.

Once SLEEP goes high (its complement $\overline{\text{SLEEP}}$ goes low) to make a transition to standby mode, the pulse generator produces a short pulse (see Fig. 2) that briefly turns on the SCP, which in turn allows $V_{SSV}$ to rise towards $V_{DD}$ in a short amount of time. The size of SCP and width of pulse applied to it are important design considerations, which will be discussed in Section IV.
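The self-controlled behavior described above can be illustrated with a minimal behavioral sketch. It is not a circuit model: it assumes an idealized inverter on the virtual-ground rail that switches at half of the supply (an illustrative threshold, not a value from this work) and abstracts the clock away entirely. The names `local_sleep_b` and `BehavioralARF` are hypothetical.

```python
from dataclasses import dataclass

VDD = 1.2  # supply voltage quoted in the text (V)


def local_sleep_b(v_ssv: float, threshold: float = VDD / 2) -> bool:
    """Idealized inverter driven by the virtual-ground rail.

    Returns True (active mode) while V_SSV is low and False (standby)
    once V_SSV has risen above the assumed switching threshold.
    """
    return v_ssv < threshold


@dataclass
class BehavioralARF:
    """Toy retention flip-flop whose hold control is derived from V_SSV."""

    slave: int = 0  # state kept in the always-powered slave latch

    def step(self, d: int, v_ssv: float) -> int:
        if local_sleep_b(v_ssv):
            # Active mode: the stored state follows the data input
            # (clocking is abstracted away in this sketch).
            self.slave = d
        # Standby mode: the input is ignored and the state is held.
        return self.slave


if __name__ == "__main__":
    arf = BehavioralARF()
    print(arf.step(1, 0.0))   # active: captures 1
    print(arf.step(0, 1.15))  # standby: still holds 1
    print(arf.step(0, 0.0))   # active again: follows the input
```

The point of the sketch is only that no signal from outside the cell is needed: the local rail voltage alone decides whether the element is transparent or holding.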


Fig. 2. (a) Standard PG. (b) APG.

Fig. 3. ARF.

III. DESIGN OF CIRCUIT ELEMENTS FOR AUTONOMOUS DATA RETENTION

A. Autonomous Retention Flip-Flop (ARF)

The circuit within the dotted line of Fig. 3 is an example of a retention flip-flop [13]. The gates marked GL are power-gated, and are thus connected to $V_{DD}$ and $V_{SSV}$; those marked NL are not power-gated and are connected directly to $V_{DD}$ and $V_{SS}$. When the sleep control input $\overline{\text{SLEEP}}$ is high (active mode), the nMOS switch is turned off and the footer (not shown in the figure) is turned on, which allows the retention flip-flop to function as a normal flip-flop. When $\overline{\text{SLEEP}}$ is low (standby mode), the footer is turned off. The nMOS switch is turned on, which keeps the net it drives at logic low, which in turn allows the slave latch (not power-gated) to capture and preserve the current data of the flip-flop. Note that when the footer is turned off, all the nets of the power-gated gates will float due to the rise of $V_{SSV}$. However, since the change of $V_{SSV}$ is slow, once the nMOS switch is turned on, the slave latch is immediately decoupled from the master latch, thus keeping the slave latch from being affected by floating of internal nets.

It should be noted that $\overline{\text{SLEEP}}$ has to be connected to the PMU (not shown in the figure) through long wires and buffers in standard power gating. In ARF, on the other hand, we connect the sleep control input to $V_{SSV}$, which is a power rail and thus can be accessed directly within the cell layout, through an inverter as shown in Fig. 3. The steady-state voltage of $V_{SSV}$ in standby mode, which drives this inverter, is determined by the size of the footer. Sizing the footer down will put $V_{SSV}$ closer to $V_{DD}$ in standby mode; sizing down, however, has a negative impact on active mode circuit delay, since the smaller the footer is, the bigger the difference of $V_{SSV}$ and $V_{SS}$ (thus, the smaller the logic swing between $V_{DD}$ and $V_{SSV}$) will be in active mode. In practice, the footer is sized by the requirement during active mode, specified by the amount of delay increase [18] or by the amount of $V_{SSV}$ that can be tolerated. This effectively keeps $V_{SSV}$ slightly lower than $V_{DD}$ in standby mode. Since the subthreshold leakage of the pMOS device in a standard inverter (if we use one here), which is turned off during standby mode and thus is a source of leakage, increases exponentially with its source-gate voltage (the potential difference between $V_{DD}$ and $V_{SSV}$), a small voltage drop of $V_{SSV}$ from $V_{DD}$ can cause significant subthreshold leakage from the inverter. This is alleviated by employing a stacked inverter (instead of a standard one) implemented in high-$V_t$ transistors. The SPICE simulation with 65-nm technology shows that the subthreshold leakage of the stacked inverter is 3.27 pA when the potential difference of $V_{DD}$ and $V_{SSV}$ is 120 mV, which is 10% of $V_{DD}$; the subthreshold leakage of a standard inverter, on the other hand, is 76.6 pA.

Note that, in ARF, CLK has to be maintained at logic low before turning off the footer. This is because if CLK is kept high, the internal clock is also high, and the slave latch input is affected by the master latch through the clocked path between them (note that the nMOS switch is still turned off). Once the footer is turned off when that net is low (logic low implies $V_{SSV}$ since the net is connected to $V_{SSV}$), it will eventually get close to high due to $V_{SSV}$ in steady state, which is then adversely captured in the slave latch. On the other hand, if CLK is kept low, the slave latch input is decoupled from the master latch, and its state is maintained by turning on the nMOS switch. This is not a limitation of ARF, however, since the clock is usually gated during standby mode.

B. Autonomous Retention Isolation (ARI)

Fig. 4(a) shows an example of an isolation circuit [10]. When the sleep control input is high (active mode), the circuit is transparent (the output is equal to the input A). When A is high and the control input goes low (standby mode), the latch is decoupled from A by the tri-state input stage and the output is maintained in the latch. When A is low (note that A comes from the power-gated circuit, and thus logic low implies $V_{SSV}$) and before power gating starts (the control input is still high), the latched node, which is high, can adversely go low once the footer is turned off, since $V_{SSV}$ (thus A) slowly rises, which makes the latched node start to discharge. However, since $V_{SSV}$ rises very slowly in standard power gating, once the control input goes low, the latch is immediately decoupled from A, which preserves the integrity of data stored in the latch. Fig. 4(c) shows a simulation waveform when we assume 100 μs for $V_{SSV}$ to reach its steady-state potential.

Fig. 4(b) shows ARI. Similar to ARF, the sleep control input is supplied through a stacked inverter driven by $V_{SSV}$ instead of by the PMU. When the control input is high (active mode), the circuit is transparent. When A is high and the footer turns off ($V_{SSV}$ starts to rise towards $V_{DD}$), it is readily seen that the output can be safely captured in the latch. When A is low and the footer turns off, however, care needs to be taken to guarantee the integrity of the latched data. In Fig. 4(a), the input stage reaches high-impedance well before A rises enough to make the latch capture the wrong value (i.e., the latched node at low), since the control input comes from the PMU while A rises very slowly. However, in Fig. 4(b), the control input of the input stage and its input (A) both originate from $V_{SSV}$. In order to ensure that the input stage becomes high-impedance well before rising A starts to impact the latch, the nMOS device in the inverter is not stacked (at the cost of increased subthreshold leakage) so that the control signal arrives early enough at the input stage, and high-$V_t$ is employed at the input stage (at the cost of increased delay) so that A does not propagate to the latch too early. Fig. 4(d) shows a simulation waveform of the latched node holding steady at its logic high when we assume 900 ps for $V_{SSV}$ to reach its steady-state potential, which is made possible by using the sleep charge pump.

Fig. 4. (a) Conventional isolation circuit and (b) ARI; waveform of data captured in the latch when A is logic low (c) in conventional isolation and (d) in ARI [note that time scales of (c) and (d) are different].

C. Implementation of ARF and ARI

We implemented ARF and ARI in commercial 1.2 V, 65-nm bulk CMOS technology. Fig. 5 shows layouts, with stacked inverters [see Figs. 3 and 4(b)] identified. Table I compares the area, leakage, and delay of the conventional retention flip-flop (see Fig. 3) and ARF, and similarly those qualities of a conventional isolation circuit [see Fig. 4(a)] and ARI. The area of ARF and ARI increases by 8% and 25%, respectively, due to the stacked inverters. When we assume $V_{SSV}$ at 0 V, the active leakages of the conventional retention flip-flop and ARF are the same, since the extra stacked inverter has little effect on leakage due to its use of high-$V_t$; the active leakage of ARI is less than that of a conventional isolation circuit, since we use high-$V_t$ in its input stage while the conventional circuit employs low-$V_t$ (see Fig. 4). In standby mode, when we assume 1.15 V for $V_{SSV}$, the gates marked as NL dominate the total leakage; ARF has more leakage due to the extra stacked inverter. In the isolation circuit, however, all gates are marked as NL (not power-gated); ARI has less leakage due to having high-$V_t$ devices. The sequencing overheads (sum of setup time and clock-to-Q delay) of the conventional retention flip-flop and ARF are the same; ARI has larger delay, again due to the high-$V_t$ devices.

Fig. 5. Layout of (a) ARF and (b) ARI.

IV. DESIGN CONSIDERATIONS FOR SLEEP CHARGE PUMP AND PULSE GENERATOR

In APG, we use a sleep charge pump (SCP), which is briefly turned on by a pulse generator (see Fig. 2) to fast charge $V_{SSV}$.
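The sizing question that this section works through can be previewed with a small numerical sketch of the underlying charge balance: the charge delivered by the SCP is taken to grow in proportion to both its total width and the pulse width, and it must cover the charge needed to lift the virtual-ground capacitance to the supply voltage. This is an illustration under stated assumptions rather than the authors' tool flow. The function names (`fit_k`, `low_net_capacitance`, `pulse_width`) are hypothetical, and every numeric value except the 1.2 V supply and the 7.2 μm cell width is made up for the example.

```python
import math
from typing import Iterable, Sequence, Tuple


def fit_k(pulse_widths: Sequence[float], charges: Sequence[float],
          width: float) -> float:
    """Estimate the proportionality constant k in Q = k * W * t_p.

    Fits a line through the origin to (pulse width, charge) samples taken
    from one device of known width; the slope equals k * W.
    """
    num = sum(t * q for t, q in zip(pulse_widths, charges))
    den = sum(t * t for t in pulse_widths)
    return (num / den) / width


def low_net_capacitance(nets: Iterable[Tuple[float, float]]) -> float:
    """Expected load capacitance tied to virtual ground when idle.

    Each net is (probability of being logic high, load capacitance);
    only nets that are logic low contribute, hence the (1 - p) weight.
    """
    return sum((1.0 - p_high) * cap for p_high, cap in nets)


def pulse_width(c_rail: float, c_low: float, vdd: float,
                k: float, total_width: float) -> float:
    """Pulse width for which the SCP charge equals the charge needed."""
    charge_needed = (c_rail + c_low) * vdd
    return charge_needed / (k * total_width)


if __name__ == "__main__":
    # Hypothetical sweep of one 7.2 um SCP cell: charge vs. pulse width.
    t_sweep = [1e-10, 2e-10, 3e-10, 4e-10]      # seconds
    q_sweep = [1e-13, 2e-13, 3e-13, 4e-13]      # coulombs
    k = fit_k(t_sweep, q_sweep, width=7.2e-6)

    # Hypothetical nets: (signal probability of logic high, capacitance).
    c_low = low_net_capacitance([(0.5, 1e-12), (0.25, 2e-12)])

    # Ten SCP cells and an assumed 2 pF of rail capacitance.
    t_p = pulse_width(c_rail=2e-12, c_low=c_low, vdd=1.2,
                      k=k, total_width=10 * 7.2e-6)
    print(f"k = {k:.4g}, C_low = {c_low:.3g} F, t_p = {t_p:.3g} s")
```

Fixing the total width first and solving for the pulse width mirrors the order of decisions taken later in this section, where the SCP count follows from the placement and the pulse width is derived from it.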

TABLE I COMPARISON OF CONVENTIONAL- AND ARF-FLOP AND ISOLATION CIRCUIT

The size of SCP and the pulse width are important design considerations, since they determine the amount of time $V_{SSV}$ takes to reach its steady state (thus the delay for making the transition to standby mode) and the amount of energy dissipated during the period (thus transition energy).

As an example, we took mc1 [20], a memory controller, and transformed it into APG. We then varied the size of SCP and the pulse width, so that SCP could charge $V_{SSV}$ up to a certain voltage; even after SCP is turned off, $V_{SSV}$ still rises towards $V_{DD}$ but very slowly. In Fig. 6, the x-axis corresponds to the potential of $V_{SSV}$, up to which SCP charges; the y-axis on the right-hand side indicates the transition delay, which is the interval from turning off the footer to the point at which $V_{SSV}$ settles to within 5% of its steady-state potential; the y-axis on the left-hand side indicates the total energy dissipated during the transition delay.

Fig. 6. Transition energy and transition delay for varying charging capability of SCP.

The transition delay decreases almost monotonically with the potential of $V_{SSV}$, to which SCP charges up. This can be understood from the $V_{SSV}$ waveforms shown in Fig. 7. In Fig. 7(a), $V_{SSV}$ is charged up to 0.11 V by SCP, which is then turned off; $V_{SSV}$ still rises but very slowly since leakage current is responsible for charging $V_{SSV}$ thereafter; $V_{SSV}$ settles down after 57.6 μs. In Fig. 7(b), the transition takes less time, 48.1 μs, since $V_{SSV}$ is charged up to a higher voltage of 0.82 V by SCP. Note that after SCP is turned off, $V_{SSV}$ drops momentarily and then continues to rise. This is because of charge sharing between the capacitance in $V_{SSV}$ rails and the load capacitance of internal nets, which will be explained in Section IV-B. Fig. 7(c) shows an even faster transition when $V_{SSV}$ is charged up to 1.18 V.

The transition energy remains almost constant up to the point at which $V_{SSV}$ is charged to about 0.4 V (by SCP) as shown in Fig. 6, and then drops rapidly. The main source of transition energy is the short-circuit current of stacked inverters of ARFs and ARIs (see Figs. 3 and 4); the nMOS device starts to turn on at roughly 0.41 V and the pMOS device starts to turn off at roughly 0.78 V. Therefore, between 0.41 and 0.78 V, the amount of energy dissipated via short-circuit current is nearly proportional to the difference of $V_{SSV}$ set by SCP and 0.78 V, since that difference of potential is taken care of by leakage current (see Fig. 7), which rises slowly and thus induces short-circuit current. Therefore, to keep both transition delay and transition energy low, we need to design SCP and pulse generator such that $V_{SSV}$ is set by SCP close to $V_{DD}$. This can be achieved by ensuring that the total amount of charge supplied by SCP is no less than the amount needed to charge $V_{SSV}$ to $V_{DD}$.

A. Computing Total Charge Supplied by SCP

Fig. 8 shows the current waveform of SCP when it is driven by signal SP from the pulse generator (see Fig. 2). As SP goes low, SCP turns on and draws a large saturation current $I_{sat}$; $V_{SSV}$, which is at the drain side of SCP, then goes up quickly and drives SCP out of the saturation region and into the triode region, as shown in Fig. 8; the drain current of SCP then decreases.

The total amount of charge supplied by SCP (to $V_{SSV}$), denoted by $Q_{SCP}$, is equal to the area under the current waveform, and can be approximated by assuming a triangular shape of waveform

$$Q_{SCP} \approx \frac{1}{2} I_{sat}\, t_p \tag{1}$$

where $t_p$ is a pulse width. Since the saturation current is given by

$$I_{sat} = \frac{1}{2} \mu C_{ox} \frac{W}{L} \left(V_{DD} - |V_{th}|\right)^2 \tag{2}$$

where $\mu$ is the carrier mobility, $C_{ox}$ is the oxide capacitance, $W$ and $L$ are the channel width and length of SCP, respectively, and $V_{th}$ is the threshold voltage; substituting (2) for $I_{sat}$ in (1) gives us (for fixed channel length $L$)

$$Q_{SCP} = k\, W\, t_p \tag{3}$$

where $k$ is a proportionality constant. The value of $k$ can be obtained by simulating a particular size of pMOS device (thus $W$ fixed) and monitoring the amount of charge that can be supplied by the device while we vary the pulse width. Fig. 9 shows an example of a simulation result; the slope gives the value of $kW$, from which $k$ is obtained and substituted in (3).

By equating (3) to the total amount of charge that we need to charge $V_{SSV}$ up to $V_{DD}$, which will be explained in Section IV-B, we obtain the product of $W$ and $t_p$, which is followed by selecting the value for each in Section IV-C.

B. Computing Total Charge for Charging $V_{SSV}$

Let the capacitance involved in $V_{SSV}$ be denoted by $C_{SSV}$ and the amount of charge stored in $C_{SSV}$ by $Q_{SSV}$; we want $Q_{SSV}$ to be

$$Q_{SSV} = C_{SSV} V_{DD} \tag{4}$$

since $V_{SSV}$ should be charged up to $V_{DD}$ via SCP. There are two components in $C_{SSV}$: the capacitance of $V_{SSV}$ rails and the sum of GND terminal capacitance of cells that are placed; the sum of load capacitance that is electrically connected to $V_{SSV}$ through turned-on nMOS devices. The first component can be readily

extracted once power rails are laid out and placement of cells is performed.

The second component of $C_{SSV}$, which we denote as $C_{L,\mathrm{low}}$, needs consideration. In Fig. 10(a), the load capacitance $C_L$ of the two-input NAND gate is electrically connected to $V_{SSV}$, since two nMOS devices are turned on. Therefore, as SCP charges $V_{SSV}$ rails (the first component of the capacitance involved in $V_{SSV}$), $C_L$ is also charged. It should be noted, however, that $C_L$ is only charged up to the potential of $V_{SSV}$ less than twice the threshold voltage of the nMOS device. On the other hand, when $C_L$ is charged by the turned-on pull-up network as shown in Fig. 10(b), it is not charged by SCP. Therefore, $C_{L,\mathrm{low}}$ is the sum of load capacitance of the nets that are logic low when the circuit makes a transition to standby mode. This can be approximated by

$$C_{L,\mathrm{low}} \approx \sum_i (1 - p_i)\, C_i \tag{5}$$

where $p_i$ is the signal probability of net $i$ being logic high [thus, $1 - p_i$ is the signal probability of net $i$ being logic low, which corresponds to Fig. 10(a)] when the circuit is idle, and $C_i$ denotes the load capacitance of net $i$. To derive $p_i$, we assume that the signal probabilities of primary inputs when the circuit is idle are available, which is usually possible given the knowledge of typical usage (or we may assume 0.5 for simplicity); the idle state probabilities of flip-flops can be obtained by solving a nonlinear system equation using the Picard–Peano method [21], [22]; the probabilities of primary inputs and flip-flop states are then propagated [23] through the combinational portion of the gate-level netlist to obtain the signal probabilities of all the internal nets.

Note that taking a sum of the first component of $C_{SSV}$ (the capacitance involved in $V_{SSV}$ rails) and the second component $C_{L,\mathrm{low}}$ implies that we make an approximation (for the sake of simplifying the procedure) that both components are charged up to the same potential of $V_{DD}$, even though $C_{L,\mathrm{low}}$ is charged up to a potential less than $V_{DD}$. This is why, after SCP stops charging $V_{SSV}$, charge sharing occurs between the two components of $C_{SSV}$, which causes a momentary potential drop of $V_{SSV}$ as shown in Fig. 7.

Fig. 7. Waveforms of $V_{SSV}$ when $V_{SSV}$ is charged up to (a) 0.11, (b) 0.82, and (c) 1.18 V by SCP. The x-axis is shown in log scale.

Fig. 8. Current waveform of SCP when it is turned on by pulse signal SP.

Fig. 9. Amount of charge supplied by pMOS switch SCP for various pulse width.

Fig. 10. Load capacitance $C_L$ (a) is charged by SCP when the pull-down network of NAND gate is turned on and (b) is not charged by SCP when the pull-up network is turned on.

C. Selecting SCP Size and Pulse Width

Once we obtain $Q_{SCP}$ from (3) and $Q_{SSV}$ from (4), the two are equated, which yields the product of SCP size $W$ and pulse width $t_p$

$$W\, t_p = \frac{C_{SSV} V_{DD}}{k}. \tag{6}$$

Any combination of $W$ and $t_p$ that satisfies (6) will serve the purpose. However, if we select too small a value for $t_p$, we will have too many SCPs (due to the too large value for $W$) to be dispersed on the placement region, which may conflict with automatic placement of the cells; besides, the pulse should be wide enough to avoid distortion of pulse shape over long

distance. On the other hand, the number of SCPs should be large enough so that, when they are placed, they can charge only local capacitance.

In our approach, we determine $W$ first. As shown in Fig. 11, SCPs are placed in a regular fashion, with a distance of about 25 μm between adjacent SCPs; the channel width of each SCP cell is 7.2 μm. Therefore, once the placement image is determined, the total number of SCP cells is determined, which gives us $W$. The pulse width $t_p$ is then determined from (6). The pulse generator was designed to drive up to four SCPs with a slew constraint of 350 ps; the placement of pulse generators is also shown in Fig. 11.

Fig. 11. Placement of SCPs, pulse generators, and footers for example circuit ram1.

V. EXPERIMENTAL RESULTS

We carried out experiments on a set of sequential circuits taken from the ISCAS and ITC benchmarks. We also included circuits extracted from several open cores [20], including a cryptography core, communication controller, and memory controller. Columns 2–4 of Table II are the number of combinational gates, flip-flops, and primary outputs. Each circuit was synthesized [24] with commercial 1.2 V, 65-nm bulk CMOS technology; ARF and ARI were characterized, put into the gate library, and used during the synthesis process. For APG implementation, pulse generators and SCPs were designed and included following the procedure of Section IV-C, which are reported in columns 5–6. The footer was sized [18] such that $V_{SSV}$ is kept within 1% of $V_{DD}$ in active mode; the number of footer cells is reported in the seventh column. The pulse width $t_p$ and SCP size $W$ are reported in the eighth and ninth columns, respectively. The last column corresponds to the potential of $V_{SSV}$ set by SCP (see Figs. 6 and 7). $V_{SSV}$ is charged up to 1.16 V on average (97% of 1.2 V), which verifies that the procedure of Section IV to determine pulse width and SCP size is quite accurate.

TABLE II BENCHMARK CIRCUITS AND THEIR APG IMPLEMENTATION

A. Effectiveness of APG on Wirelength and Area

To assess the effectiveness of APG on total wirelength, each netlist was placed and routed [19]. We forced about 85% of the placement region to be occupied by the cells in each circuit, which is a tight placement; metal layers up to M3 were allowed for routing; the placement region was divided into a grid with individual square size of 1.6 μm × 1.6 μm for computing congestion.

In Table III, total wirelengths of power-gated circuit (PG) and APG are compared in columns 2–4; APG has shorter total wirelength by 8.6% on average. There are three factors that have a combined effect on the reduced wirelength of APG: APG does not have wires for the control signal of retention flip-flops and isolation circuits; this lack of wires helps the automatic router reduce other signal wires; increased area of APG (columns 8–10 of Table III), on the other hand, may increase signal wires.

Columns 5–7 compare the average congestion of PG and APG, and show that APG has less overall congestion by 4.1% on average. Fig. 12 shows the congestion map of the wb1 benchmark when it is implemented in (a) PG and in (b) APG. The maximum congestion in the PG implementation was 144%; it was reduced to 119% in APG. The average congestion was reduced by 8% in the APG implementation.

Due to the use of ARF and ARI, both of which take more area than the retention flip-flop and isolation circuits used in PG (see Table I), the area of APG increases by 6.8% on average, as shown in columns 8–10 of Table III; SCPs and pulse generators are

another reason for increased area, even though their contribution to total area was found to be very small for most circuits.

TABLE III COMPARISON OF TOTAL WIRELENGTH, AVERAGE CONGESTION, AND TOTAL AREA OF PG AND APG

Fig. 12. Congestion map of wb1 benchmark: (a) PG and (b) APG.

B. Effectiveness of APG on Transition Delay, Transition Energy, and Standby Mode Leakage Power

Since we employ SCP and pulse generator to fast charge the $V_{SSV}$ capacitance, the sleep delay, the interval from SLEEP being asserted to the point when $V_{SSV}$ settles to within 95%–105% of its steady-state potential, is very small compared to that of standard power gating. As shown in columns 2–4 of Table IV, the sleep delay of APG is reduced by 65.6% on average compared to that of power gating (PG). The wakeup delay, the interval from SLEEP being deasserted to the point when $V_{SSV}$ settles to within 5% of its steady-state potential, is also cut by 36.5% on average as shown in columns 8–10. This is because SLEEP drives only footers in APG; it drives retention flip-flops and isolation circuits in addition to footers in PG (see Table II). Large fanouts in PG increase the delay of SLEEP until it arrives at its fanouts due to extra buffers, and increase the signal transition time.

TABLE IV COMPARISON OF SLEEP AND WAKEUP DELAY, SLEEP AND WAKEUP ENERGY, AND STANDBY MODE LEAKAGE POWER OF PG AND APG

The sleep energy, the energy dissipated during sleep delay, of PG and APG is compared in columns 5–7 of Table IV; APG dissipates less sleep energy than PG does by 46.1% on average. This can be understood from Fig. 13, which compares the power consumption of example circuit mc2 in PG and APG implementations during sleep delay. After the footer is turned off at 1 ns, APG initially consumes more power due to fast charging of $V_{SSV}$ via SCP; however, its power consumption drops rapidly and, at 21 ns, APG starts to consume less power than PG does. Furthermore, APG completes the transition in 52.1 μs while PG takes 123.8 μs. The wakeup energy is also decreased as shown in columns 11–13 due to the reduction in wakeup delay.

Leakage power during standby mode of PG and APG is compared in columns 14–16 of Table IV; APG consumes less

leakage power by 15.8% on average. There are two reasons for this: ARI consumes less leakage than the isolation circuit used in PG (see Table I); APG is free from buffers, which are not power-gated, in the net of control signals (see Fig. 1). On the other hand, ARF consumes more leakage than the conventional retention flip-flop (see Table I). Therefore, for circuits with a large number of flip-flops and a small number of primary outputs such as irda3, the leakage of APG is larger rather than smaller than that of PG. Note, however, that the leakage saving of APG comes at the cost of increased delay of ARI by about 14 ps as shown in Table I, which may or may not affect circuit performance depending on the amount of slack at primary outputs.

Fig. 13. Power consumption of example circuit mc2 during transition period in PG and APG implementation. The x-axis is shown in log scale.

VI. CONCLUSION

It has recently been reported that the total wirelength of power-gated sequential circuits can increase by up to 60% [15], which may significantly impact the routability of designs. A major portion of this increase can be attributed to extra signals to control data retention elements specific to power-gated circuits. We have proposed a new circuit scheme called APG, in which retention elements derive their own control by detecting the rising potential of $V_{SSV}$ rails, thereby removing extra control signals. Experiments on benchmark circuits showed that APG can reduce total wirelength by 8.6% on average while cutting average congestion by 4.1% (at the cost of 6.8% more area). In order to fast charge $V_{SSV}$, a sleep charge pump (SCP) driven by a short pulse has been employed; the optimization procedure for SCP size and pulse width has been addressed. This greatly helps reduce the delay to move to standby mode and to wake up as well as the energy dissipated during those transitions. Standby mode leakage power is also reduced.

REFERENCES

[1] J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, E. Chan, G. Mittal, D. Plass, Y. Chan, S. Chu, H. Le, L. Clark, J. Ripley, S. Taylor, J. Dilullo, and M. Lanzerotti, “Design of the Power6 microprocessor,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2007, pp. 96–97.
[2] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits,” Proc. IEEE, vol. 91, no. 2, pp. 305–327, Feb. 2003.

[3] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “A 1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS,” IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 847–854, Aug. 1995.
[4] K. Usami, N. Kawabe, M. Koizumi, K. Seta, and T. Furusawa, “Automated selective multi-threshold design for ultra-low standby applications,” in Proc. Int. Symp. Low Power Electron. Des., Aug. 2002, pp. 202–206.
[5] S. G. Narendra and A. Chandrakasan, Eds., Leakage in Nanometer CMOS Technologies. New York: Springer, 2005.
[6] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology Manual for System-on-Chip Design. New York: Springer, 2007.
[7] S. Shigematsu, S. Mutoh, Y. Matsuya, Y. Tanabe, and J. Yamada, “A 1-V high-speed MTCMOS circuit scheme for power-down application circuits,” IEEE J. Solid-State Circuits, vol. 32, no. 6, pp. 861–869, Jun. 1997.
[8] J. Kao and A. Chandrakasan, “MTCMOS sequential circuits,” in Proc. Eur. Solid-State Circuits Conf., Sep. 2001, pp. 317–320.
[9] V. Zyuban and S. V. Kosonocky, “Low power integrated scan-retention mechanism,” in Proc. Int. Symp. Low Power Electron. Des., Aug. 2002, pp. 98–102.
[10] H.-S. Won, K.-S. Kim, K.-O. Jeong, K.-T. Park, K.-M. Choi, and J.-T. Kong, “An MTCMOS design methodology and its application to mobile computing,” in Proc. Int. Symp. Low Power Electron. Des., Aug. 2003, pp. 110–115.
[11] S. Gururajarao, H. Mair, D. Scott, and U. Ko, “Ultra low area overhead retention flip-flop for power-down applications,” U.S. Appl. Pub. 20060267654, Nov. 2006.
[12] T. Lueftner, J. Berthold, C. Pacha, G. Georgakos, G. Sauzon, O. Hoemke, J. Beshenar, P. Mahrla, K. Just, P. Hober, S. Henzler, D. S. Landsiedel, A. Yakovleff, A. Klein, R. J. Knight, P. Acharya, A. Bonnardot, S. Buch, and M. Sauer, “A 90-nm CMOS low-power GSM/EDGE multimedia-enhanced baseband processor with 380-MHz ARM926 core and mixed-signal extensions,” IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 134–144, Jan. 2007.
[13] H. Mair, A. Wang, G. Gammie, D. Scott, P. Royannez, S. Gururajarao, M. Chau, R. Lagerquist, L. Ho, M. Basude, N. Culp, A. Sadate, D. Wilson, F. Dahan, J. Song, B. Carlson, and U. Ko, “A 65-nm mobile multimedia applications processor with an adaptive power management scheme to compensate for variations,” in Proc. Symp. VLSI Circuits, Jun. 2007, pp. 224–225.
[14] Y. Shin, S. Heo, H.-O. Kim, and J. Choi, “Supply switching with ground collapse: Simultaneous control of subthreshold and gate leakage current in nanometer-scale CMOS circuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 7, pp. 758–766, Jul. 2007.
[15] H.-O. Kim and Y. Shin, “Semicustom design methodology of power gated circuits for low leakage applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 6, pp. 512–516, Jun. 2007.
[16] K. Kumagai, H. Iwaki, H. Yoshida, H. Suzuki, T. Yamada, and S. Kurosawa, “A novel powering-down scheme for low Vt CMOS circuits,” in Proc. Symp. VLSI Circuits, Jun. 1998, pp. 44–45.
[17] S. Henzler, G. Georgakos, M. Eireiner, T. Nirschl, C. Pacha, J. Berthold, and D. Schmitt-Landsiedel, “Dynamic state-retention flip-flop for fine-grained power gating with small design and power overhead,” IEEE J. Solid-State Circuits, vol. 41, no. 7, pp. 1654–1661, Jul. 2006.
[18] S. Mutoh, S. Shigematsu, Y. Gotoh, and S. Konaka, “Design method of MTCMOS power switch for low-voltage high-speed LSIs,” in Proc. Asia South Pac. Des. Autom. Conf., Jan. 1999, pp. 113–116.
[19] Synopsys, Mountain View, CA, “Astro user guide,” 2006.
[20] Opencores, Dobrova, Slovenia, “Projects,” 2009. [Online]. Available: http://www.opencores.org/
[21] C. Tsui, J. Monteiro, M. Pedram, S. Devadas, A. M. Despain, and B. Lin, “Power estimation methods for sequential logic circuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 3, pp. 404–416, Sep. 1995.
[22] L. Benini and G. D. Micheli, “State assignment for low power dissipation,” IEEE J. Solid-State Circuits, vol. 30, no. 2, pp. 258–268, Mar. 1995.
[23] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Riccó, “Estimate of signal probability in combinational logic networks,” in Proc. Eur. Test Conf., Apr. 1989, pp. 132–138.
[24] Synopsys, Mountain View, CA, “Design compiler user guide,” 2007.


Jun Seomun received the B.S. and M.S. degrees in electrical engineering from KAIST, Daejeon, Korea, in 2005 and 2007, respectively, where he is currently working toward the Ph.D. degree. His current research interests include leakage power optimization for VLSI circuits.

Youngsoo Shin (M’00–SM’05) received the B.S., M.S., and Ph.D. degrees in electronics engineering from Seoul National University, Seoul, Korea. From 2000 to 2001, he was with the University of Tokyo, Tokyo, Japan, as a Research Associate, and from 2001 to 2004, he was with IBM T. J. Watson Research Center, Yorktown Heights, NY, as a Research Staff Member. He joined the Department of Electrical Engineering, KAIST, Daejeon, Korea, in 2004, where he is currently an Associate Professor. His research interests include the areas of computer-aided design with emphasis on low-power and low-leakage, high-performance sequential circuits, structured ASIC, statistical design, and high-level synthesis. Dr. Shin was a recipient of the Best Paper Award at 2005 ISQED and was nominated for the Best Paper Award at the same conference in 2007. He has been a member of the technical program committee and organizing committee of several technical conferences, including DAC, ICCAD, ISLPED, ASP-DAC, CASES, and ISCAS.
