Exploring the Opportunity of Optimizing Sequencing Elements in ASIC Designs
Seungwhun Paik, Jaeha Kung, Youngsoo Shin
Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea
Abstract- An edge-triggered flip-flop is the de facto standard sequencing element in ASIC designs. As sequencing elements account for an increasing portion of timing and power, it is necessary to explore other types of elements. We identify the pulsed-latch and the dual edge-triggered flip-flop as two promising candidates. The challenges that arise when they are employed in conventional ASIC designs are identified, and potential solutions are discussed.
978-1-61284-857-0/11/$26.00 ©2011 IEEE

I. INTRODUCTION

Most ASIC designs rely on an edge-triggered flip-flop as the underlying sequencing element. This is mainly because of the predictable timing that it offers. The amount of time available to a combinational block that lies between two flip-flops is fixed. This constrains timing uncertainties within a combinational block alone, which is important in the synthesis and optimization of ASIC designs.

The sequencing overhead, the sum of clock-to-Q delay and setup time, of a typical flip-flop is 6 FO4 delays [1]. As clock frequency increases for higher performance, the proportion of sequencing overhead in the clock period increases. This is illustrated in Fig. 1. The proportion of sequencing overhead is 13% in a processor with a 5-stage pipeline, but becomes 21% when the number of pipeline stages reaches 13 [2].

Fig. 1. Increasing proportion of sequencing overhead. (Bars for ARM9 (5-stage), ARM11 (8-stage), and ARM Cortex A8 (13-stage); legend: margin, logic, flip-flops.)

The flip-flops are also a major source of power consumption. It is common that the clock network consumes 40-50% of total dynamic power in ASIC designs [3]; an appreciable amount of clock power, e.g. 47% [4], is in turn consumed by flip-flops. The large power contribution of flip-flops can be understood from two factors: the number of clocked transistors in a flip-flop is relatively large, and the number of flip-flops used in a design has increased due to the design trend of employing more pipeline stages for higher throughput [5].

Therefore, it is now important to explore other types of sequencing elements that offer less sequencing overhead and less power consumption. A worthy candidate is a pulsed-latch. It is a latch driven by a narrow clock pulse and therefore inherits the small sequencing overhead of a latch, which ranges from 2 to 4 FO4 delays. Using pulsed-latches also yields an appreciable power saving in clock distribution [6]. Pulsed-latches have fewer transistors that are triggered by the clock signal than flip-flops do; the overhead of the pulse generator can be amortized by sharing it among multiple latches. In ASIC designs, pulsed-latches can be approximated as fast flip-flops due to the similarity of their timing models [7]; this enables a simple migration of a flip-flop circuit to a pulsed-latch version by replacing all (or some) flip-flops [8], which offers the opportunity of saving in the clock period and the power consumption.

Dual edge-triggered flip-flops (DETFFs) can be considered to further reduce the clocking power. DETFFs are triggered at both the rising and falling edges of the clock, so the operating frequency of a circuit can be halved without affecting its throughput. In addition, multiple clock domains with different clock frequencies can be easily implemented using a single global clock source. A part of a design that operates at the normal frequency can use single edge-triggered flip-flops (SETFFs), while the part that operates at twice the frequency can use DETFFs.

In this paper, we address the challenges that arise when pulsed-latches and DETFFs are used in conventional ASIC designs and discuss potential solutions. We review several sequencing elements in Section II. Design considerations and necessary approaches to adopt pulsed-latches are addressed in Section III. DETFF circuits are then explored in Section IV. A brief summary is given in Section V.

II. SEQUENCING ELEMENTS: REVIEW

Since a flip-flop is designed by cascading two latches, its sequencing overhead is about twice that of a latch. Latches are therefore widely used in high-performance custom designs. Latches also offer flexibility by allowing combinational blocks to have a delay of more than a clock period, by using a technique called time borrowing or cycle stealing. As shown in Fig. 2(b), the amount of time available to a combinational block between two latches varies a lot; it can be almost as large as one and a half clock periods, or about as small as half a clock period. Time borrowing also helps circuit robustness. Clock skew and jitter can be tolerated to some extent since data is captured within the period in which the clock is high (or low). This, on the other hand, makes timing analysis more difficult, which keeps latches from being adopted in ASIC designs. In addition, data has to be held for a longer period of time, increasing the likely number of hold-time violations.

Fig. 2. Comparison of timing models: (a) edge-triggered flip-flop, (b) level-sensitive latch, and (c) pulsed-latch.

A pulsed-latch is a latch driven by a brief clock pulse. As shown in Fig. 2(c), the amount of time available to a combinational block is still variable, but the amount of the variation is significantly less than it is in latch circuits. The scope for hold-time violations is also reduced. This makes a pulsed-latch an ideal sequencing element for high-performance [9] or low-power ASIC designs [8]. A pulsed-latch can be implemented using a latch and an external pulse generator (or pulser). An example is shown in Fig. 3(a). The pulser takes a normal clock CK with 50% duty cycle as an input and generates a clock pulse PCK. The schematic of the pulser and SPICE waveforms of CK and PCK are shown in Fig. 3(b). This implementation, when a pulser is shared by four latches, occupies 40% less area and consumes 31% less power than four flip-flops in 32-nm technology.

Fig. 3. (a) Implementation of pulsed-latch and (b) pulser and its SPICE waveform of CK and PCK.

Sequencing elements mentioned so far are all activated at a single edge of the clock signal. A DETFF, on the other hand, is triggered at both rising and falling edges, thereby launching and capturing data at a rate twice that of other sequencing elements for the same clock period. Using DETFFs, therefore, allows us to use half the clock frequency while maintaining the same throughput. This implies that the dynamic power of the clock network is cut in half. The details of the power benefit of DETFF circuits over SETFF ones will be shown in Section IV-B. There are various implementations of DETFFs; one of them is shown in Fig. 4. This particular implementation [10] turns out to occupy 20% more area than an SETFF in 32-nm technology, but it consumes 36% less power under the same throughput.

Fig. 4. An example of DETFF [10].

III. PULSED-LATCH CIRCUITS

A. Design Considerations

A pulsed-latch captures input data while PCK is high. This implies that it has to hold data for a longer period than conventional flip-flops. More hold-time violations thus occur with increasing width of PCK. The violations can be removed by inserting delay buffers [11]-[13] or increasing the delay of short paths through re-synthesis [14].

The load capacitance of a pulser, $C_p$, consists of the wire capacitance $C_w$ and the clock input capacitance of latches $C_l$:

$$C_p = C_w + n\,C_l \tag{1}$$

where $n$ is the number of latches that are driven by the same pulser. $C_p$ should be kept small to ensure a shape of PCK that warrants correct operation. If it is too large, a distortion in PCK may affect the timing behavior of pulsed-latch circuits. For example, clock-to-Q delay may increase too much from its nominal value; even worse, latches may fail to capture the input data, which leads to a malfunction.

B. Design

A conventional ASIC design synthesized with flip-flops can be migrated to a pulsed-latch version by replacing all (or some) flip-flops with latches; delay buffers have to be inserted to fix a likely increase of hold-time violations [7]. A key step in the migration process is to insert pulsers. Latches should be grouped so that each group can be connected to a single pulser.
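Equation (1) lends itself to a quick feasibility check on how many latches one pulser can drive. The following Python sketch is illustrative only: the function names, the capacitance values, and the treatment of the wire capacitance as a fixed estimate are assumptions made here, not part of the paper's flow.

```python
import math


def pulser_load(c_wire, c_latch, n_latches):
    """Pulser load per Eq. (1): C_p = C_w + n * C_l."""
    return c_wire + n_latches * c_latch


def max_latches_per_pulser(c_max, c_wire, c_latch):
    """Largest n such that C_w + n * C_l <= c_max (0 if the wire alone exceeds c_max)."""
    if c_wire > c_max:
        return 0
    return int((c_max - c_wire) // c_latch)


def min_pulsers(total_latches, n_max):
    """Lower bound on the pulser count when each pulser drives at most n_max latches.

    This bound ignores placement; it only divides the latch count by the group size.
    """
    if n_max <= 0:
        raise ValueError("n_max must be positive")
    return math.ceil(total_latches / n_max)


if __name__ == "__main__":
    # Hypothetical values in fF, chosen only to illustrate the arithmetic.
    c_max, c_wire, c_latch = 20.0, 4.0, 4.0
    n_max = max_latches_per_pulser(c_max, c_wire, c_latch)
    print("load with", n_max, "latches:", pulser_load(c_wire, c_latch, n_max))
    print("pulsers needed for 10 latches (lower bound):", min_pulsers(10, n_max))
```

In a real design the wire capacitance itself depends on which latches share a pulser, so a fixed `c_wire` is at best a first estimate.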
The grouping may be performed after initial placement so that the latch locations can be exploited. Grouping of latches determines the capacitance of the wire that routes the clock pulse, which in turn affects the number of latch groups. This is illustrated using an example in Fig. 5(a); once c and d form a latch group, no more latches can be added to the group due to the maximum load that a pulser can drive; this results in using three pulsers to drive all latches, whereas there is a better solution with less clocking power that uses one less pulser and shorter total wirelength of clock routing, as shown in Fig. 5(b). We need to find a grouping of latches such that the number of latch groups (or pulsers) is minimized while the load capacitance of each group is less than a given maximum; the former is to minimize the power consumption, as pulsers contribute a large portion of total power [6], and the latter is to ensure the correct pulse shape [15]. This problem can be solved using a simple heuristic after converting the problem into a graph formulation [6].

Fig. 5. (a) Grouping c and d results in three latch groups while (b) there is a better solution with two groups.

Once latch groups are identified, the whole design should be placed again, either in incremental fashion or as a completely new placement step. A simple approach is to fix the location of latches and pulsers to assure that the load of each pulser remains the same. Then placement is performed incrementally to remove any cell overlap due to the insertion of pulsers; this is shown in Fig. 6(a). A more systematic method is to use a new placement algorithm, which is shown in Fig. 6(b); the connection between pulser and latch can be explicitly constrained by introducing an extra barrier force into the conventional analytic placer [15].

Fig. 6. Design flow of pulsed-latch circuits: (a) using conventional design tool and (b) pulser-aware placement [15].

C. Optimization

A simple migration of a flip-flop circuit to a pulsed-latch one benefits performance. It is reported that the clock period is reduced by 5% due to less sequencing overhead and is further reduced by 2.5% due to time borrowing [2]. Dynamic power consumption is also reduced, e.g., by about 20% when some flip-flops are replaced by pulsed-latches [8].

A more aggressive form of timing optimization is also possible in pulsed-latch circuits. The difference of pulse width between launching and capturing latches can be exploited, which can be considered as another form of time borrowing. The problem of assigning a pulse width to each latch for minimizing the clock period is called pulse width allocation (PWA) [16], [17]. Although PWA provides a new handle on optimizing the performance, the extent of time borrowing is necessarily limited due to the increasing risk of hold-time violations and the limited number of different pulsers that can be included in a practical design. However, PWA can be usefully combined with other sequential optimization techniques such as clock skew scheduling [16] or retiming [17].

To further reduce the clocking power, clock gating can be implemented via pulsers, instead of clock gating cells, by feeding a gating function to the enable pin of a pulser [6]. This approach is named pulser gating. It poses a new problem, in which we identify a group of latches that can be driven by the same pulser (thus, they are placed nearby) as well as gated at the same time and as often as possible. A preliminary result of solving the pulser gating problem [6] demonstrates additional power saving of up to 30% compared with a non-pulser-gated pulsed-latch circuit.

IV. DETFF CIRCUITS

A. Design

There are several implementations of DETFFs [10], and many of them can be designed to have timing parameters (i.e., clock-to-Q delay, setup time, and hold time) comparable with those of SETFFs [18]. Therefore, a DETFF circuit can be obtained from a design synthesized with SETFFs by simply replacing all SETFFs with DETFFs and lowering the clock frequency by half. Since DETFFs generally occupy more area than SETFFs, explicit-pulsed DETFFs, which are based on a shared external pulser, can also be considered. Since they have a circuit structure similar to that of pulsed-latches, an approach similar to that in Section III can be used to obtain an explicit-pulsed DETFF circuit.

One limitation of DETFF circuits is the difficulty of timing analysis. A combinational block between two DETFFs must be checked under two different timing conditions: data launched at the rising edge and captured at the falling edge, and vice versa. The amount of time allowed to the combinational block differs between the two conditions when the duty ratio of the clock is not 0.5.
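The two timing conditions can be made concrete with a small calculation. The Python sketch below is a simplified model assumed here for illustration, not the analysis method of [19]: the clock is described only by its period and duty ratio, and each launch edge is given one lumped sequencing overhead (clock-to-Q plus setup). All names and numbers are hypothetical.

```python
def detff_windows(period, duty, overhead_rise_launch=0.0, overhead_fall_launch=0.0):
    """Time budgets for logic between two DETFFs under both launch conditions.

    Returns (rise_to_fall, fall_to_rise):
      rise_to_fall: data launched at a rising edge, captured at the next falling edge
      fall_to_rise: data launched at a falling edge, captured at the next rising edge
    Each overhead argument lumps clock-to-Q and setup time for that condition.
    """
    if not 0.0 < duty < 1.0:
        raise ValueError("duty must lie strictly between 0 and 1")
    high = duty * period   # clock-high phase: rising edge to falling edge
    low = period - high    # clock-low phase: falling edge to rising edge
    return high - overhead_rise_launch, low - overhead_fall_launch


def meets_timing(max_delay, period, duty,
                 overhead_rise_launch=0.0, overhead_fall_launch=0.0):
    """A block exercised under both conditions must fit the smaller budget."""
    rise_to_fall, fall_to_rise = detff_windows(
        period, duty, overhead_rise_launch, overhead_fall_launch)
    return max_delay <= min(rise_to_fall, fall_to_rise)


if __name__ == "__main__":
    # Hypothetical 2.0 ns clock; overheads left at zero for clarity.
    print(detff_windows(2.0, 0.5))       # symmetric budgets
    print(detff_windows(2.0, 0.25))      # skewed duty ratio shrinks one budget
    print(meets_timing(0.9, 2.0, 0.5))
    print(meets_timing(0.9, 2.0, 0.25))
```

The per-condition overhead arguments are kept separate so that unequal edge parameters can be modelled with the same function.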
The time allowed also differs between the two conditions when the timing parameters of a DETFF at the rising edge do not match those at the falling edge. Considering clock gating further complicates this issue [19]. The timing analysis problem must be thoroughly addressed in order to use DETFFs in ASIC designs, which merits further investigation.

B. Benefit in Power

Fig. 7 compares the power consumption of DETFF circuits with that of SETFF ones for three ISCAS benchmark circuits (s1423, s5378, and s9234) synthesized using 32-nm technology; pulsed-latch circuits, in which each pulser drives four latches, are also compared. We included the power consumption of leaf-stage clock buffers, where each buffer drives four sequencing elements. Note that the power consumption of clock buffers is halved in DETFF circuits since their clock frequency is halved. Each buffer drives four pulsers in pulsed-latch circuits. This is why their power consumption is only about a quarter of that of flip-flop circuits. Overall power consumption is reduced by 29% and 36% after migration to pulsed-latch and DETFF circuits, respectively. This saving mainly comes from replacing power-hungry flip-flops and from the smaller number of clock buffers. Notice the difference in power consumption between flip-flops and other sequencing elements. A standard D-type flip-flop consumes about 1.67 µW, a DETFF [10] consumes 1.07 µW, and a pulsed-latch consumes 1.15 µW (a latch consumes 0.3 µW and a pulser consumes 3.4 µW). The power consumption of combinational gates is increased in pulsed-latch circuits due to extra delay buffers to fix hold-time violations.

Fig. 7. Power consumption of flip-flop circuits (FF), pulsed-latch circuits (PL), and DETFF circuits (DETFF). (Normalized power for s1423, s5378, and s9234; legend: comb. gates, clock buffers, sequencing elements, pulsers.)

V. SUMMARY

Conventional ASIC designs can benefit from replacing flip-flops with other sequencing elements. Pulsed-latches and DETFFs are promising alternatives due to their superior characteristics. We showed that adopting these sequencing elements can improve the performance and power consumption of ASIC designs without major changes in the standard ASIC design flow.

VI. ACKNOWLEDGMENT

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2008-331-D00406).
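As a cross-check on the per-element power figures quoted in Section IV-B, the short Python sketch below amortizes one pulser over the four latches it drives and compares the result with the flip-flop figure. The function names are hypothetical, and the assumption that the quoted percentages were obtained by this simple per-element arithmetic is ours; the sketch only shows that the numbers are mutually consistent.

```python
def pulsed_latch_power(latch_uw, pulser_uw, latches_per_pulser):
    """Per-element power of a pulsed-latch when one pulser is shared by several latches."""
    return latch_uw + pulser_uw / latches_per_pulser


def relative_saving(baseline_uw, alternative_uw):
    """Fractional power saving of the alternative with respect to the baseline."""
    return 1.0 - alternative_uw / baseline_uw


if __name__ == "__main__":
    ff_uw, detff_uw = 1.67, 1.07   # figures quoted in Section IV-B
    pl_uw = pulsed_latch_power(latch_uw=0.3, pulser_uw=3.4, latches_per_pulser=4)
    print(f"pulsed-latch per element: {pl_uw:.2f} uW")
    print(f"pulsed-latch saving vs. flip-flop: {relative_saving(ff_uw, pl_uw):.0%}")
    print(f"DETFF saving vs. flip-flop: {relative_saving(ff_uw, detff_uw):.0%}")
```

Under these assumptions the amortized pulsed-latch figure reproduces the quoted 1.15 µW, and the two savings round to the 31% and 36% mentioned in Section II.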
REFERENCES

[1] D. Chinnery and K. Keutzer, Closing the Gap Between ASIC & Custom. Springer, 2002.
[2] T. Baumann, D. Schmitt-Landsiedel, and C. Pacha, "Architectural assessment of design techniques to improve speed and robustness in embedded microprocessors," in Proc. DAC, July 2009, pp. 947-950.
[3] Y. Luo, J. Yu, J. Yang, and L. Bhuyan, "Low power network processor design using clock gating," in Proc. DAC, June 2005, pp. 712-715.
[4] R. S. Shelar, "An efficient clustering algorithm for low power clock tree synthesis," in Proc. Int. Symp. on Physical Design, Mar. 2007, pp. 181-188.
[5] S. M. Kang, "Elements of low power design for integrated systems," in Proc. Int. Symp. on Low Power Electronics and Design, July 2003, pp. 205-210.
[6] S. Kim, I. Han, S. Paik, and Y. Shin, "Pulser gating: A clock gating of pulsed-latch circuits," in Proc. ASPDAC, Jan. 2011, pp. 190-195.
[7] Y. Shin and S. Paik, "Pulsed-latch circuits: a new dimension in ASIC design," IEEE Design & Test of Computers, 2011, accepted for publication.
[8] S. Shibatani and A. Li, "Pulse-latch approach reduces dynamic power," July 2006, EE Times.
[9] H. Lee, S. Paik, and Y. Shin, "Pulse width allocation with clock skew scheduling for optimizing pulsed latch-based sequential circuits," in Proc. ICCAD, Nov. 2008, pp. 224-229.
[10] T. L. W. Chung and M. Sachdev, "A comparative analysis of low-power low-voltage dual-edge-triggered flip-flops," IEEE Trans. on VLSI Systems, vol. 10, no. 6, pp. 913-918, Dec. 2002.
[11] N. Shenoy, R. Brayton, and A. Sangiovanni-Vincentelli, "Minimum padding to satisfy short path constraints," in Proc. ICCAD, Nov. 1993, pp. 156-161.
[12] C. Lin and H. Zhou, "Clock skew scheduling with delay padding for prescribed skew domains," in Proc. ASPDAC, Jan. 2007, pp. 541-546.
[13] Y. Sun, J. Gong, and C. Chen, "Method and apparatus for fixing hold time violations in a circuit design," U.S. Patent 7 278 126 B2, Oct. 2007.
[14] P. Kotecha, F. Musante, V. Pureswaran, L. Trevillyan, and P. Villarrubia, "Method of minimizing early-mode violations causing minimum impact to a chip design," U.S. Patent 2010/0 042 955 A1, Feb. 2010.
[15] Y. Chuang, S. Kim, Y. Shin, and Y. Chang, "Pulsed-latch-aware placement for timing-integrity optimization," in Proc. DAC, June 2010, pp. 280-285.
[16] H. Lee, S. Paik, and Y. Shin, "Pulse width allocation and clock skew scheduling: optimizing sequential circuits based on pulsed latches," TCAD, vol. 29, no. 3, pp. 355-366, Mar. 2010.
[17] S. Lee, S. Paik, and Y. Shin, "Retiming and time borrowing: optimizing high-performance pulsed-latch-based circuits," in Proc. ICCAD, Nov. 2009, pp. 375-380.
[18] R. Llopis and M. Sachdev, "Low power, testable dual edge triggered flip-flops," in Proc. Int. Symp. on Low Power Electronics and Design, Aug. 1996, pp. 341-345.
[19] C. Oh, S. Kim, and Y. Shin, "Timing analysis of dual-edge-triggered flip-flop based circuits with clock gating," in Proc. Int'l Conf. on IC Design & Technology, May 2009, pp. 59-62.