Exploring the Opportunity of Optimizing Sequencing Elements in ASIC Designs

Seungwhun Paik, Jaeha Kung, Youngsoo Shin

Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea

Abstract- An edge-triggered flip-flop is the de facto standard sequencing element in ASIC designs. As sequencing elements account for an increasing portion of timing and power, it is necessary to explore other types of elements. We identify the pulsed-latch and the dual edge-triggered flip-flop as two promising candidates. The challenges that arise when they are employed in conventional ASIC design are identified, and potential solutions are discussed.

978-1-61284-857-0/11/$26.00 ©2011 IEEE

I. INTRODUCTION

Most ASIC designs rely on an edge-triggered flip-flop as the underlying sequencing element. This is mainly because of the predictable timing that it offers. The amount of time available to a combinational block that lies between two flip-flops is fixed. This confines timing uncertainties to the combinational block alone, which is important in the synthesis and optimization of ASIC designs.

The sequencing overhead, the sum of clock-to-Q delay and setup time, of a typical flip-flop is 6 FO4 delays [1]. As clock frequency increases for higher performance, the proportion of sequencing overhead in the clock period increases. This is illustrated in Fig. 1. The proportion of sequencing overhead is 13% in a processor with a 5-stage pipeline, but becomes 21% when the number of pipeline stages reaches 13 [2].

Fig. 1. Increasing proportion of sequencing overhead. [Bar chart with legend entries Margin, Logic, and Flip-flops for ARM9 (5-stage), ARM11 (8-stage), and ARM Cortex A8 (13-stage); vertical-axis ticks from 0 to 60.]

Flip-flops are also a major source of power consumption. It is common that the clock network consumes 40-50% of total dynamic power in ASIC designs [3]; an appreciable amount of clock power, e.g. 47% [4], is in turn consumed by flip-flops. The large power contribution of flip-flops can be understood from two factors: the number of clocked transistors in a flip-flop is relatively large, and the number of flip-flops used in a design has increased due to the design trend of employing more pipeline stages for higher throughput [5].

Therefore, it is now important to explore other types of sequencing elements that offer less sequencing overhead and less power consumption. A worthy candidate is a pulsed-latch. It is a latch driven by a narrow clock pulse and therefore inherits the small sequencing overhead of a latch, which ranges from 2 to 4 FO4 delays. Using pulsed-latches also yields an appreciable power saving in clock distribution [6]. Pulsed-latches have fewer transistors triggered by the clock signal than flip-flops do; the overhead of the pulse generator can be amortized by sharing it among multiple latches. In ASIC designs, a pulsed-latch can be approximated as a fast flip-flop due to the similarity of the timing model [7]; this enables a simple migration of a flip-flop circuit to a pulsed-latch version by replacing all (or some) flip-flops [8], which offers the opportunity of savings in the clock period and the power consumption.

Dual edge-triggered flip-flops (DETFFs) can be considered to further reduce the clocking power. DETFFs are triggered at both the rising and falling edges of the clock, so the operating frequency of a circuit can be halved without affecting its throughput. In addition, multiple clock domains with different clock frequencies can be easily implemented using a single global clock source. The part of a design that operates at the normal frequency can use single edge-triggered flip-flops (SETFFs), while the part that operates at twice the frequency can use DETFFs.

In this paper, we address the challenges that arise when pulsed-latches and DETFFs are used in conventional ASIC designs and discuss potential solutions. We review several sequencing elements in Section II. Design considerations and necessary approaches to adopt pulsed-latches are addressed in Section III. DETFF circuits are then explored in Section IV. A brief summary is given in Section V.

II. SEQUENCING ELEMENTS: REVIEW

Since a flip-flop is designed by cascading two latches, its sequencing overhead is about twice that of a latch. Latches are therefore widely used in high-performance custom designs. Latches also offer flexibility by allowing combinational blocks to have a delay of more than a clock period, by using a technique
called time borrowing or cycle stealing. As shown in Fig. 2(b), the amount of time available to a combinational block between two latches varies considerably; it can be almost as large as one and a half clock periods, or about as small as half a clock period. Time borrowing also helps circuit robustness. Clock skew and jitter can be tolerated to some extent since data is captured within the period in which the clock is high (or low). This, on the other hand, makes timing analysis more difficult, which keeps latches from being adopted in ASIC designs. In addition, data has to be held for a longer period of time, increasing the likely number of hold-time violations.

Fig. 2. Comparison of timing models: (a) edge-triggered flip-flop, (b) level-sensitive latch, and (c) pulsed-latch. [Timing diagrams annotated with when D has to arrive or can arrive, how long D has to be held, and when Q is released.]

A pulsed-latch is a latch driven by a brief clock pulse. As shown in Fig. 2(c), the amount of time available to a combinational block is still variable, but the amount of variation is significantly less than it is in latch circuits. The scope for hold-time violations is also reduced. This makes a pulsed-latch an ideal sequencing element for high-performance [9] or low-power ASIC designs [8]. A pulsed-latch can be implemented using a latch and an external pulse generator (or pulser). An example is shown in Fig. 3(a). The pulser takes a normal clock CK with 50% duty cycle as an input and generates a clock pulse PCK. The schematic of the pulser and SPICE waveforms of CK and PCK are shown in Fig. 3(b). This implementation, when a pulser is shared by four latches, occupies 40% less area and consumes 31% less power than four flip-flops in 32-nm technology.

Fig. 3. (a) Implementation of pulsed-latch and (b) pulser and its SPICE waveform of CK and PCK.

Sequencing elements mentioned so far are all activated at a single edge of the clock signal. A DETFF, on the other hand, is triggered at both rising and falling edges, thereby launching and capturing data at a rate twice that of other sequencing elements for the same clock period. Using DETFFs, therefore, allows us to use half the clock frequency while maintaining the same throughput. This implies that the dynamic power of the clock network is cut in half. The details of the power benefit of DETFF circuits over SETFF ones are shown in Section IV-B. There are various implementations of DETFFs; one of them is shown in Fig. 4. This particular implementation [10] turns out to occupy 20% more area than an SETFF in 32-nm technology, but it consumes 36% less power under the same throughput.

Fig. 4. An example of DETFF [10].

III. PULSED-LATCH CIRCUITS

A. Design Considerations

A pulsed-latch captures input data while PCK is high. This implies that it has to hold data for a longer period than conventional flip-flops. More hold-time violations thus occur with increasing width of PCK. The violations can be removed by inserting delay buffers [11]-[13] or by increasing the delay of short paths through re-synthesis [14].

The load capacitance of a pulser, $C_p$, consists of the wire capacitance $C_w$ and the clock input capacitance of a latch $C_l$:

$$C_p = C_w + n\,C_l \qquad (1)$$

where $n$ is the number of latches that are driven by the same pulser. $C_p$ should be kept small to ensure a PCK shape that warrants correct operation. If it is too large, distortion in PCK may affect the timing behavior of pulsed-latch circuits. For example, clock-to-Q delay may increase too much from its nominal value; even worse, latches may fail to capture the input data, which leads to a malfunction.

B. Design

A conventional ASIC design synthesized with flip-flops can be migrated to a pulsed-latch version by replacing all (or some) flip-flops with latches; delay buffers have to be inserted to fix a likely increase in hold-time violations [7]. A key step in the migration process is to insert pulsers. Latches should be grouped so that each group can be connected to a single
pulser; this may be performed after initial placement so that the latch locations can be exploited. The grouping of latches determines the capacitance of the wire that routes the clock pulse, which in turn affects the number of latch groups. This is illustrated using an example in Fig. 5(a); once c and d form a latch group, no more latches can be added to the group due to the maximum load that a pulser can drive; this results in using three pulsers to drive all latches, whereas there is a better solution with less clocking power that uses one fewer pulser and a shorter total wirelength of clock routing, as shown in Fig. 5(b). We need to find a grouping of latches such that the number of latch groups (or pulsers) is minimized while the load capacitance of each group is less than a given maximum; the former is to minimize the power consumption, as pulsers contribute a large portion of total power [6], and the latter is to ensure the correct pulse shape [15]. This problem can be solved using a simple heuristic after converting the problem into a graph formulation [6].

Fig. 5. (a) Grouping c and d results in three latch groups while (b) there is a better solution with two groups.

Once latch groups are identified, the whole design should be placed again, either in incremental fashion or as a completely new placement step. A simple approach is to fix the location of latches and pulsers to ensure that the load of each pulser remains the same. Then placement is performed incrementally to remove any cell overlap due to the insertion of pulsers; this is shown in Fig. 6(a). A more systematic method is to use a new placement algorithm, which is shown in Fig. 6(b); the connection between pulser and latch can be explicitly constrained by introducing an extra barrier force into the conventional analytic placer [15].

Fig. 6. Design flow of pulsed-latch circuits: (a) using conventional design tool and (b) pulser-aware placement [15]. [Both flows start from a gate-level netlist with flip-flops.]

C. Optimization

A simple migration of a flip-flop circuit to a pulsed-latch one benefits performance. It is reported that the clock period is reduced by 5% due to less sequencing overhead and is further reduced by 2.5% due to time borrowing [2]. Dynamic power consumption is also reduced, e.g., by about 20% when some flip-flops are replaced by pulsed-latches [8].

A more aggressive form of timing optimization is also possible in pulsed-latch circuits. The difference in pulse width between launching and capturing latches can be exploited, which can be considered another form of time borrowing. The problem of assigning a pulse width to each latch to minimize the clock period is called pulse width allocation (PWA) [16], [17]. Although PWA provides a new handle on optimizing the performance, the extent of time borrowing is necessarily limited due to the increasing risk of hold-time violations and the limited number of different pulsers that can be included in a practical design. However, PWA can be usefully combined with other sequential optimization techniques such as clock skew scheduling [16] or retiming [17].

To further reduce the clocking power, clock gating can be implemented via pulsers, instead of clock gating cells, by feeding a gating function to the enable pin of a pulser [6]. This approach is named pulser gating. It poses a new problem, in which we identify a group of latches that can be driven by the same pulser (thus, they are placed nearby) and that are gated at the same time and as often as possible. A preliminary result of solving the pulser gating problem [6] demonstrates an additional power saving of up to 30% compared with a non-pulser-gated pulsed-latch circuit.

IV. DETFF CIRCUITS

A. Design

There are several implementations of DETFFs [10], and many of them can be designed to have timing parameters (i.e., clock-to-Q delay, setup time, and hold time) comparable with those of SETFFs [18]. Therefore, a DETFF circuit can be obtained from a design synthesized with SETFFs by simply replacing all SETFFs with DETFFs and lowering the clock frequency by half. Since DETFFs generally occupy more area than SETFFs, explicit-pulsed DETFFs, which are based on a shared external pulser, can also be considered. Since they have a circuit structure similar to that of pulsed-latches, an approach similar to that in Section III can be used to obtain an explicit-pulsed DETFF circuit.

One limitation of DETFF circuits is the difficulty of timing analysis. A combinational block between two DETFFs must be checked under two different timing conditions: data launched at the rising edge and captured at the falling edge, and vice versa. The amount of time allowed to the combinational block is different when the duty ratio of the clock is not 0.5 and the timing parameters
of a DETFF at the rising edge do not match those at the falling edge. Considering clock gating further complicates this issue [19]. The timing analysis problem must be thoroughly addressed in order to use DETFFs in ASIC designs, which merits further investigation.

B. Benefit in Power

Fig. 7 compares the power consumption of DETFF circuits with that of SETFF ones for three ISCAS benchmark circuits (s1423, s5378, and s9234) synthesized using 32-nm technology; pulsed-latch circuits, in which each pulser drives four latches, are also compared. We included the power consumption of leaf-stage clock buffers, where each buffer drives four sequencing elements. Note that the power consumption of clock buffers is halved in DETFF circuits since their clock frequency is halved. Each buffer drives four pulsers in pulsed-latch circuits. This is why the clock-buffer power in pulsed-latch circuits is only about a quarter of that in flip-flop circuits.

Fig. 7. Power consumption of flip-flop circuits (FF), pulsed-latch circuits (PL), and DETFF circuits (DETFF). [Stacked bars of normalized power, 0.0 to 1.0, for s1423, s5378, and s9234, broken down into combinational gates, clock buffers, sequencing elements, and pulsers.]

Overall power consumption is reduced by 29% and 36% after migration to pulsed-latch and DETFF circuits, respectively. This saving mainly comes from replacing power-hungry flip-flops and from using fewer clock buffers. Notice the difference in power consumption between flip-flops and other sequencing elements. A standard D-type flip-flop consumes about 1.67 µW, a DETFF [10] consumes 1.07 µW, and a pulsed-latch consumes 1.15 µW (a latch consumes 0.3 µW and a pulser consumes 3.4 µW). The power consumption of combinational gates is increased in pulsed-latch circuits due to the extra delay buffers that fix hold-time violations.

V. SUMMARY

Conventional ASIC designs can benefit from replacing flip-flops with other sequencing elements. Pulsed-latches and DETFFs are promising alternatives due to their superior characteristics. We showed that adopting these sequencing elements can improve the performance and power consumption of ASIC designs without major changes in the standard ASIC design flow.

VI. ACKNOWLEDGMENT

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2008-331-D00406).
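
As a rough cross-check of the per-element power figures quoted in Section IV-B, the following Python sketch amortizes one pulser over four latches and compares the result with a flip-flop and a DETFF. The constant and function names are illustrative and do not come from the original design flow; the only inputs are the power values stated in the text, and the sharing factor of four is the one used in the experiments described there.

```python
# Per-element power figures quoted in Section IV-B (in microwatts).
P_FLIP_FLOP = 1.67      # standard D-type flip-flop
P_DETFF = 1.07          # DETFF of [10]
P_LATCH = 0.3           # latch alone
P_PULSER = 3.4          # pulser alone
LATCHES_PER_PULSER = 4  # sharing factor stated in the text


def pulsed_latch_power(p_latch, p_pulser, n_latches):
    """Effective power per pulsed-latch when one pulser is shared by n latches."""
    return p_latch + p_pulser / n_latches


def saving_vs(reference, candidate):
    """Fractional power saving of `candidate` relative to `reference`."""
    return 1.0 - candidate / reference


p_pl = pulsed_latch_power(P_LATCH, P_PULSER, LATCHES_PER_PULSER)
saving_pl = saving_vs(P_FLIP_FLOP, p_pl)
saving_detff = saving_vs(P_FLIP_FLOP, P_DETFF)

print(f"pulsed-latch: {p_pl:.2f} uW per element, {saving_pl:.0%} below a flip-flop")
print(f"DETFF:        {P_DETFF:.2f} uW per element, {saving_detff:.0%} below a flip-flop")
```

With these inputs the sketch gives 1.15 µW per pulsed-latch, about 31% below a flip-flop, and about 36% below for the DETFF, which agrees with the element-level savings quoted in Section II. It also shows how much the benefit depends on pulser sharing: under the same figures, a pulser shared by only two latches would give 0.3 + 3.4/2 = 2.0 µW per element, which is more than a flip-flop.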

REFERENCES

[1] D. Chinnery and K. Keutzer, Closing the Gap Between ASIC & Custom. Springer, 2002.
[2] T. Baumann, D. Schmitt-Landsiedel, and C. Pacha, "Architectural assessment of design techniques to improve speed and robustness in embedded microprocessors," in Proc. DAC, July 2009, pp. 947-950.
[3] Y. Luo, J. Yu, J. Yang, and L. Bhuyan, "Low power network processor design using clock gating," in Proc. DAC, June 2005, pp. 712-715.
[4] R. S. Shelar, "An efficient clustering algorithm for low power clock tree synthesis," in Proc. Int. Symp. on Physical Design, Mar. 2007, pp. 181-188.
[5] S. M. Kang, "Elements of low power design for integrated systems," in Proc. Int. Symp. on Low Power Electronics and Design, July 2003, pp. 205-210.
[6] S. Kim, I. Han, S. Paik, and Y. Shin, "Pulser gating: A clock gating of pulsed-latch circuits," in Proc. ASPDAC, Jan. 2011, pp. 190-195.
[7] Y. Shin and S. Paik, "Pulsed-latch circuits: a new dimension in ASIC design," IEEE Design & Test of Computers, 2011, accepted for publication.
[8] S. Shibatani and A. Li, "Pulse-latch approach reduces dynamic power," EE Times, July 2006.
[9] H. Lee, S. Paik, and Y. Shin, "Pulse width allocation with clock skew scheduling for optimizing pulsed latch-based sequential circuits," in Proc. ICCAD, Nov. 2008, pp. 224-229.
[10] T. L. W. Chung and M. Sachdev, "A comparative analysis of low power low-voltage dual-edge-triggered flip-flops," IEEE Trans. on VLSI Systems, vol. 10, no. 6, pp. 913-918, Dec. 2002.
[11] N. Shenoy, R. Brayton, and A. Sangiovanni-Vincentelli, "Minimum padding to satisfy short path constraints," in Proc. ICCAD, Nov. 1993, pp. 156-161.
[12] C. Lin and H. Zhou, "Clock skew scheduling with delay padding for prescribed skew domains," in Proc. ASPDAC, Jan. 2007, pp. 541-546.
[13] Y. Sun, J. Gong, and C. Chen, "Method and apparatus for fixing hold time violations in a circuit design," U.S. Patent 7 278 126 B2, Oct. 2007.
[14] P. Kotecha, F. Musante, V. Pureswaran, L. Trevillyan, and P. Villarrubia, "Method of minimizing early-mode violations causing minimum impact to a chip design," U.S. Patent 2010/0 042 955 A1, Feb. 2010.
[15] Y. Chuang, S. Kim, Y. Shin, and Y. Chang, "Pulsed-latch-aware placement for timing-integrity optimization," in Proc. DAC, June 2010, pp. 280-285.
[16] H. Lee, S. Paik, and Y. Shin, "Pulse width allocation and clock skew scheduling: optimizing sequential circuits based on pulsed latches," TCAD, vol. 29, no. 3, pp. 355-366, Mar. 2010.
[17] S. Lee, S. Paik, and Y. Shin, "Retiming and time borrowing: optimizing high-performance pulsed-latch-based circuits," in Proc. ICCAD, Nov. 2009, pp. 375-380.
[18] R. Llopis and M. Sachdev, "Low power, testable dual edge triggered flip-flops," in Proc. Int. Symp. on Low Power Electronics and Design, Aug. 1996, pp. 341-345.
[19] C. Oh, S. Kim, and Y. Shin, "Timing analysis of dual-edge-triggered flip-flop based circuits with clock gating," in Proc. Int'l Conf. on IC Design & Technology, May 2009, pp. 59-62.
