ISSCC 2008 / SESSION 4 / MICROPROCESSORS / 4.7
Circuit Design for Voltage Scaling and SER Immunity on a Quad-Core Itanium® Processor
Dan Krueger1, Erin Francom1, Jack Langsdorf2
1Intel, Fort Collins, CO
2Intel, Hudson, MA
The 700mm² 65nm Itanium® processor [1] doubles the number of cores over its predecessor [2], from 2 to 4. It also adds a system interface that is roughly as large as two cores, including six QuickPath interconnects and four FBDIMM channels. This 3× increase in logic circuits per socket presents two major circuit challenges that are addressed in this paper. First, the chip is increasingly dependent on voltage-frequency scaling to fit in an acceptable power envelope. The resulting reduction in typical operating voltage makes it challenging to design the approximately 10 million non-static logic instances (excluding the L2 and L3 caches). Key circuit changes and analysis methods that improve low-voltage operation are shown. Second, enterprise reliability requires that per-socket SER not increase over the previous processor generation. With triple the logic states per socket, we maintain acceptable SER through extensive use of SER-hardened circuits.
For the chip to achieve its power and performance targets, the cores must enable the power-frequency management system to vary the core voltage widely, and system interface power must be contained by running at as low a voltage as possible. Process variation, a pulse-based design methodology, and the sheer size of the chip make this quite difficult. To achieve a wide operating region, an analysis of all non-static circuits was developed, requiring functionality across 7 process corners, 0.7 to 1.35V, and -10°C to 125°C with process variation. To have less than 1% yield loss due to the 10 million non-static circuits on the die, a total of 6.4σ of transistor length, width, and Vt variation is applied to each FET as a function of the effect it has on a given measurement.
The processor instantiates many circuits that rely on pulse clocks for correct operation [3]. Pulsed writes are self-timed, gaining no margin as the clock frequency is reduced.
Pulsed writes become difficult as the power supply voltage is reduced toward the variation-affected Vt. Consequently, only the very peak of the pulse is effective for writing. Since the processor core is a port from 90nm to 65nm, substantial schedule savings are realized by reusing the existing pulse latch structure (Fig. 4.7.1) and placement. Overall latch size was not allowed to increase, which prevented qualifying the PMOS feedback transistor.
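The yield requirement stated above (less than 1% loss across roughly 10 million non-static circuits) implies a very small failure budget per circuit. The following is a minimal sketch of that arithmetic, assuming independent, identical circuits and a one-sided Gaussian tail. Both assumptions and the function names are illustrative and are not taken from the paper; the sketch does not reproduce the paper's 6.4σ figure, which is distributed across length, width, and Vt for each FET.

```python
import math
from statistics import NormalDist


def per_instance_failure_budget(yield_loss: float, instances: int) -> float:
    """Largest per-instance failure probability that keeps total loss at yield_loss.

    Assumes independent, identical instances (an illustrative assumption).
    Solves 1 - (1 - p)**instances = yield_loss for p in a numerically stable way.
    """
    return -math.expm1(math.log1p(-yield_loss) / instances)


def equivalent_sigma(p_fail: float) -> float:
    """One-sided Gaussian sigma whose tail probability equals p_fail."""
    return -NormalDist().inv_cdf(p_fail)


# Figures quoted in the text: 1% yield loss, about 10 million non-static circuits.
budget = per_instance_failure_budget(yield_loss=0.01, instances=10_000_000)
sigma = equivalent_sigma(budget)

print(f"per-instance failure budget: {budget:.3e}")
print(f"equivalent one-sided sigma:  {sigma:.2f}")
```

Under these simplified assumptions the budget is on the order of one failure per billion circuits, which corresponds to roughly 6σ of margin. That gives a sense of why the analysis must apply several sigma of variation to every circuit.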
To further ensure a full-rail pulse, the output drive of the new local pulse generator is 3× that of the pulse gater, relative to the allowed output loading. Despite the robustness of the new pulse generator, local pulse generator usage dropped from 40,000 to 300 per core because of the increased size of the circuit. Most usages were replaced by AND gates that are tuned to pass pulses from a gater with minimal distortion. These AND gates provide a local enable with timing characteristics similar to the local pulse generator.
Since a majority of the core is ported from 90nm, we could not make a drastic reduction in per-core SER. This forced the design team to be very aggressive in addressing SER in the new parts of the design. More than 99% of the latches in the system interface are SER-hardened, and 33% of the core latches have also been converted. Furthermore, the system interface protects 34 unique small register files, including all CAMs, with SER-hardened storage. DICE-type structures [4] are chosen for both types of storage. Figures 4.7.4 and 4.7.5 compare these hardened structures to unprotected latches and register file cells.
The SER-hardened latch of Fig. 4.7.4 uses a write structure that achieves timing characteristics comparable to the unprotected pulse latch of Fig. 4.7.1. The write mechanism presented here pulls the feedback out of the way, resulting in excellent clock-to-output delays, which are 25 to 30% faster than writing two same-sense nodes. Figure 4.7.5 shows the register file cell. Added wire length and the double-write construction make the cell harder to write, but write timing is generally not critical. The loads on the read bitlines and on the address lines in the decoder increase by less than 10%, making timing only slightly worse.
The SER benefits of a DICE structure depend on a layout that physically separates the storage node diffusions. We use a layout that keeps the latch nodes that are sensitive to multi-bit strikes a minimum of 1.1µm apart.
This results in a 100× reduction in SER over the unprotected latch in Fig. 4.7.1. From an error-rate perspective, the 865,000 SER-hardened latches are equivalent to 8,650 unprotected latches. The more compact register file cell achieves an 80× improvement on 600Kb of storage. This spacing-dependent strategy is expected to work for 1 to 2 more process generations. The costs of this SER protection are an area increase of 34 to 44% and 25% higher power consumption. Including the entire overhead, the area penalty for a 32b×32-entry, 1-read, 1-write register file is 25%. In comparison, ECC costs the same 25% area for small amounts of storage and presents difficult pipelining and timing problems that are avoided with this SER-hardened design.
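The equivalence quoted above can be checked with a short calculation. This is a minimal sketch, assuming the soft-error rate scales linearly with the number of storage cells and that every cell of a given type sees the same improvement factor. The helper name is hypothetical, and the register file figure is left in the same Kb unit the text uses.

```python
def unprotected_equivalent(count: float, improvement: float) -> float:
    """Number of unprotected cells with the same total error rate.

    Assumes the error rate is proportional to cell count and that the
    improvement factor applies uniformly (illustrative assumptions).
    """
    return count / improvement


# Figures quoted in the text.
latch_equiv = unprotected_equivalent(865_000, 100)  # hardened latches, 100x better
rf_equiv_kb = unprotected_equivalent(600, 80)       # 600Kb of register file, 80x better

print(f"latches: {latch_equiv:.0f} unprotected-equivalent")
print(f"register file: {rf_equiv_kb} Kb unprotected-equivalent")
```

The latch result reproduces the 8,650 figure given in the text. The register file value is derived the same way and is not stated in the paper.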
Entry-latches also use a pulse to capture data at a clock edge and produce monotonic data for dynamic logic. Implicitly pulsed entry-latches are eliminated in favor of an explicitly pulsed topology, shown in Fig. 4.7.2. Here, the NMOS feedback transistor is qualified to improve low-voltage precharge.
This chip overcomes significant circuit design challenges arising from process variability, operating voltage, power envelope, high circuit counts, pulsed writes, and SER limitations. Achieving the design goals required extensive simulation and targeted design changes, as well as broad use of SER-tolerant structures.
Two circuits are used to create pulse clocks. Clock gaters typically drive a large number of latches along a short wire. A transfer gate is added to the internal delay chain that determines the clock pulse width, as shown in Fig. 4.7.3. The slope of the transfer gate output is matched to the latches, enabling the pulse width to track latch write characteristics across PVT. All gaters have programmable pulse widths. In new designs, a 20% wider pulse is software programmable (Fig. 4.7.3), and ported designs have a metal option for an 8% wider pulse.
Acknowledgements: Shawn Davidson, Laura Dietz, Kevin Duda, Jon Lachman, Casey Little, Charles Morganti, John Wanek
Local pulse generators drive 1 or 2 latches when a larger gater is not practical. We converted from the simple construct shown in Fig. 4.7.1 to a more complex circuit because the transistors are small and subject to more random variation than the larger devices in the gaters. The new structure requires the output pulse to reach the high VIH of a feedback inverter before turning off the pulse.
References:
[1] B. Stackhouse et al., “A 65nm 2-Billion Transistor Quad-Core Intel® Itanium® Processor”, ISSCC Dig. Tech. Papers, pp. 92-93, Feb. 2008.
[2] S. Naffziger et al., “The Implementation of a 2-core, Multi-threaded Itanium® Family Microprocessor”, IEEE J. of Solid State Circuits, vol. 41, no. 1, pp. 197-209, 2006.
[3] S. Naffziger et al., “The Implementation of the Itanium 2 Microprocessor”, IEEE J. of Solid State Circuits, vol. 37, no. 11, pp. 1448-1460, 2002.
[4] P. Hazucha et al., “Measurements and Analysis of SER-Tolerant Latch in a 90-nm Dual-Vt Process”, IEEE J. of Solid State Circuits, vol. 39, no. 9, pp. 1536-1543, 2004.
• 2008 IEEE International Solid-State Circuits Conference
978-1-4244-2010-0/08/$25.00 ©2008 IEEE
ISSCC 2008 / February 4, 2008 / 4:45 PM
Figure 4.7.1: Pulse latch with local pulse generator. (Recovered labels: latch, scan, 90nm local pulse generator, 65nm pulse generator, CKS, CK, PCK, High VIH.)
Figure 4.7.2: Entry-latch changes. (Recovered labels: 90nm, 65nm, can be NAND/NOR.)
Figure 4.7.3: Programmable pulse gater with transfer gate. (Recovered labels: Pulse Gater, 90nm Delay Line, 65nm Programmable Delay Line, slope matched to latch storage node.)
Figure 4.7.4: SER-hardened pulse latch. (Recovered labels: Latch, Primary feedback, Scan, 1.1µm.)
Parameter                      % of unprotected
Area                           134%
pck load                       136%
Flowthru (in to q)             98%
pck→out (pck rise to q)        96%
Setup (in before pck fall)     106%
SER FIT                        100x better
Standby power                  127%
Active power                   125%
Figure 4.7.5: SER-hardened register file cell.
Parameter                      % of unprotected
Word line dimension            100
Bit line dimension             144
Write time                     135
Read bit cap                   110
Read word cap                  103
Write bit cap                  167
Write word cap                 164
SER FIT                        80x better
Standby power                  124
Figure 4.7.6: Die photo and statistics.
Circuit type                   Instances per Tukwila
Register file cells            4,400,000
Pulse latches                  1,500,000
Dynamic circuits               1,000,000
Entry-latches                  570,000
Voltage converters             340,000
Gaters                         110,000