POWER CHALLENGES IN THE INTERNET WORLD

Deo Singh, Vivek Tiwari
Low Power Design Technology, Intel Corp.
November 1999

"Third party marks are the property of their owners"

Power challenges per segment

Segments: Servers, Desktops, Mobile, Handhelds

Power related system cost drivers: Thermal cost ($), Delivery cost ($), Form factor (in^3), Battery size (lbs.), Battery cost ($)

Price drivers: Perf (SPEC, TPC-C), Perf (SPEC, MHz), Perf (MIPS, MHz), Perf/in^3, Perf/lb., Perf/battery hrs, Perf/$

[Figure: Natural Power Growth across segments. High end servers (today's focus); Performance PCs, 100W; Mobile PCs, 10W; Handhelds, 1W. Annotation: "Have talked about these in the past".]

What did we tell you before?

The cost of power in Desktop PCs
– Thermal & power distribution cost
– Every CPU Watt over 40 W = $1/Watt

[Figure: Total Integration Cost ($), 0 to 40, vs. Power (Watts), 0 to 40; labels: Graphics, Chipset, CPU, DRAM; annotation: $1/W]

Impact of power reduction
– If frequency is reduced to maintain 40 W:
– Loss of 1 Perf Bin => $$$ on ASP

[Figure: ASP vs. Performance Bin, 0 to 500, Seg 0 to Seg 4; annotations: "$140/Bin" and "Reduce Perf. by 1 bin to stay in Power Envelope"]
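The trade-off on this slide can be put into numbers with a small sketch. It assumes the $1/W figure applies linearly above 40 W and reads the "$140/Bin" annotation as the ASP lost per performance bin; the 50 W part is a made-up example, not a figure from the slides.

```python
THERMAL_LIMIT_W = 40        # from the slide: extra cost starts above 40 W
COST_PER_EXTRA_WATT = 1.0   # from the slide: $1 per CPU Watt over the limit
ASP_PER_BIN = 140.0         # assumed reading of the "$140/Bin" annotation


def extra_integration_cost(cpu_watts: float) -> float:
    """Added thermal and power-distribution cost in dollars."""
    return max(0.0, cpu_watts - THERMAL_LIMIT_W) * COST_PER_EXTRA_WATT


# Hypothetical 50 W part: absorb the extra system cost,
# or give up one bin to fall back inside the 40 W envelope.
cost_of_cooling = extra_integration_cost(50)
cost_of_dropping_bin = ASP_PER_BIN
print(cost_of_cooling, cost_of_dropping_bin)
```

Under these assumptions the system-cost penalty is small next to the ASP lost by dropping a bin, which is the point the two charts make together.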

Environmental burden of CPUs!

Total power consumption of CPUs in world's PCs:
– 1992: 160 MWatts (87M CPUs)
– 2001: 9,000 MWatts (500M CPUs)
– That's 4 Hoover Dams!

[Figure: MegaWatts, 0 to 10,000, by year, 1992 to 2001, rising from 160 MW to 9,000 MW. Photo courtesy: United States Department of the Interior, Bureau of Reclamation - Lower Colorado Region]

[Source: Dataquest (for installed base) + estimates for avg. installed CPU power] Projected with Pentium II(TM) power

Andy's vision: 1 Billion Connected PCs!
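As a quick check on these totals, the sketch below derives the average power per installed CPU implied by the two data points. The dictionary and function names are illustrative; only the MW and CPU counts come from the slide.

```python
# (total CPU power in watts, installed CPUs) taken from the slide
installed_base = {
    1992: (160e6, 87e6),
    2001: (9000e6, 500e6),
}


def avg_watts_per_cpu(year: int) -> float:
    """Average power per installed CPU implied by the slide's totals."""
    total_watts, cpus = installed_base[year]
    return total_watts / cpus


for year in sorted(installed_base):
    print(year, round(avg_watts_per_cpu(year), 2), "W per CPU")
```

The installed base grows by a bit under 6X while the implied per-CPU average grows by roughly 10X, so most of the increase in total power comes from power per CPU rather than unit count.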

The New World Order: Internet & Communication are merging

Telecom/Voice: Lucent, Siemens, Nortel, Alcatel, Ericsson
Data Networking: Cisco, Ascend, Nortel (Bay), Fore, 3COM
Converging on: IP Network, Server Systems
1999: $260B; 2003: $350B

The Inflection Point Creates An Opportunity for The Computing Industry
Source: DataQuest, Piper Jaffray

Server market segment trends

• Server market segment will show strong growth
• Strongly driven by:
  – Internet
  – Telecom
  – Corporate

[Figure: Application specific server TAM, 1999 vs. 2002 (Intel estimates): Web server, Email app., IP telephony, E-commerce, Comm. server]

New World Order Case-study: Large ISP server topology

[Figure: large ISP server topology]
– External: Clients, connecting over ISDN, ADSL, Cable... via Telco
– IP Services Front End: web (HTTP), mail (SMTP, POP), news (NNTP), Streaming Media, auth (RADIUS), Firewall
– Applications Mid-tier: E-Commerce, ERP
– Database Back-end: Data Backup (NAS)
– Network: Switch, Router
– Network Operations Center

What are the costs to an ISP?

• Internet data centers are like heavy-duty factories
  – E.g. Data Center: 25,000 sq. ft., ~8,000 servers, ~2,000,000 Watts!!
• Want lowest net cost per server per sq. foot of data center space
• Cost driven by:
  – Racking height
  – Cooling air flow
  – Power delivery
  – Maintenance ease (access, wt.)

"Behind all of this, power is the lead cost driver in the facility - about 25% of the total cost of a data center" (Data Center Facilities Mgr.)

"They are concerned about power because it increases the weight of the node due to massive heat sinks. Weight is very critical for hot swaps." (Customer quote)
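The example data center figures translate into per-server and per-area densities as sketched below. The variable names are illustrative; the three inputs are the approximate values quoted on the slide.

```python
# Approximate figures from the slide's example data center
floor_area_sqft = 25_000
servers = 8_000
total_watts = 2_000_000

watts_per_server = total_watts / servers        # power budget per server
watts_per_sqft = total_watts / floor_area_sqft  # power density of the floor
servers_per_sqft = servers / floor_area_sqft    # packing density

print(watts_per_server, watts_per_sqft, servers_per_sqft)
```

These are the quantities the slide's "net cost per server per sq. foot" goal trades against each other: packing more servers per square foot raises Watts per square foot unless Watts per server come down.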

How does this drive Internet server requirements?

IP Services Front End
  Application: Firewall, Web, Mail, News
  ISP needs: Reliability; Perf. (SpecWeb); Form factor (<1U); Power; Cost
  Design challenge: Max performance for given form factor

Applications Mid-tier
  Application: Ecommerce, ERP
  ISP needs: Reliability; Perf. (Spec, TPC-C); Cost; Power; Form factor
  Design challenge: Max performance at perf/size/cost balance

Database Back-end
  Application: Database, Dir., NAS
  ISP needs: Reliability; Perf. (TPC-C); Mem addr space; Power; Cost
  Design challenge: Most performance at leading edge tech.

ISP needs also include: Hot swap

Key design challenges for CPU designers

Internet server examples

Cobalt* High Density ISP Server (from http://www.cobaltnet.com)
• Front end server
• 1U ff (up to 40 in one rack)
• 250 MHz 64-bit CPU
• 256 MB mem, 16.8 GB disk
• Perf. metric: Web trans./sec
• 35 Watts

Intel AC450NX System (http://intel.com/design/servers/AC450NX/)
• Back-end, Enterprise
• 7U ff
• 1-4 Pentium III Xeon(TM) 500 MHz
• Up to 8 GB ECC mem
• Perf. metric: TPC-C
• 3 power supplies - 420 Watts ea.

* Other brands and names are property of their respective owners

Doing business in a power-limited world

• Two vectors for higher revenue – both require power reduction
  – Higher ASP: lower power = Bin+1 performance
  – Higher volume: more segment options (High End, Mid Range)

[Figure: ASP ($ to $$$) vs. performance bin under a power limit]

Summary - power reduction motivation and key message

• Traditionally we measure ourselves by perf or price-perf, but now it's also perf/foot^3
  – Power is THE key driver for foot^3
  – Watts/foot^3 matter
• Hot-spots also limit allowed perf
  – Power reduction helps perf
  – Watts/mil^2 matter
  – CPU thermal maps

Power reduction is more critical than ever. We need your help!

Future Design Challenges

• The traditional path to perf.
  – 2x devices, 2x MHz per generation
  – 2x Power
• To continue industry's performance ramp:
  – Need to adopt "Power-Aware" design in all product segments - not just in mobile/handhelds
  – Need to design for power AND performance at all levels, uArch to Ckts

[Figure: Transistor Count, xTors (M), 0 to 10, and Frequency, Freq (MHz), 0 to 300, from '91 to '96, for i386, i386C-33, i486C-33, 486-66, DX4 100, PP-66, PP-100, PP-133, PP166, PPro-150, PPro200, PPC 601-80, PPC603e-100, PPC 604-120, MIPS R4400, MIPS R5000, MIPS R10000, HP PA7200, HP PA8000, SuperSparc2-90, UltraSparc-167, A21064A, A21164-300. Source: Microprocessor Report]

Low Power Research IS High Performance Research

Key Research Opportunities: Fundamental circuit techniques

• Enable continued Vcc reduction
  – Enable development of ultra low voltage circuits
  – New logic families, multi threshold (Vt), dynamic Vt etc.
• Efficient leakage control
  – Leakage is catching up with switching power
• Ckt. families for future process technology limitations

[Figure: Total Power (Watts), log scale 10 to 10,000, vs. min. feature size (0.25µ, 0.18µ, 0.13µ, 0.09µ, 0.06µ); curves: No Vcc scaling, Leakage power, Active power. First order analysis, using constant field scaling. Source: Shekhar Borkar, Intel Corp.]

Key Research Opportunities: High-level (arch/uarch) design

• Breakthrough machine organizations are needed
  – Diminishing performance improvements from uarch
  – E.g. N-way superscalar does not give speedup of N, but power goes up by factor of N
  – Increasing levels of speculation: prefetches, out-of-order issue, branch prediction, data speculation
• Best way to use a billion xtors for power and perf?
• Enhanced modeling capabilities to quantify the power-perf. space

[Figure: Power and Perf vs. uarch complexity]

Key Research Opportunities: Design for power delivery

• Power delivery may be the limiter ahead of thermals
  – More devices, higher frequency, higher currents
  – Dynamic power mgmt and clock gating make things worse
• Modeling of the tech. parameters (L, di/dt, C) and costs
• Design at all levels (systems, pkg., uarch, logic, ckts)

[Figure: % Power (up to 100%) vs. Time (Seconds), with Power Down Mode disabled and with Power Down Mode enabled]

The Power Managing OS meets A Thermally Aware Processor

fbinns@ichips.intel.com

Micro32 - Cool Chips Tutorial

Agenda

• What is the motivation for this tutorial
• The Power Managing OS
  – OS Power Management
  – ACPI
• A Thermally Aware Processor
  – OS response time
  – System Impacts of High Power Processors
  – PowerPC 750* Die Temperature Characteristics
• Conclusions

*Third party brands and names are the property of their respective owners

System Impact of High Power Processors

• It is clear that power dissipation adds to total system cost
  – Adds cost to power supply
  – Reduces the lifetime of the battery
  – Adds cost to the cooling solutions
• There is a processor power dissipation level beyond which cooling solutions add unreasonable costs
  – Fortunately the threshold changes with time & technology
  – Current threshold seems to be ~60W

[Figure: cooling solution cost (up to $60) vs. processor power (0 to 100W) for Al Extrusion, Cu Heat Sink, Embedded Ht Pipe, Fan Duct + Heat Sink, Ht Pipe + Remote cool]

Power dissipation of existing processors

• The gap between Max Power and Typical Application Power (TAP) is increasing
• Application traces reveal a broad distribution of TAP
• Designing a system to be able to handle Max power of a leading edge processor can be expensive
• It is possible to constrain the power dissipation of a processor without significant impact to application execution performance

[Figure: Max Power and Typical Power, scale 0 to 45, by year, 1984 to 2000; inset: # of Applications vs. Power Dissipation, with Max power marked]

The Power Managing OS


Operating System Power Management (OSPM)

• Based on user preferences
  – Run in Performance mode or Quiet mode or Maximize Battery mode
• Supported by Microsoft's desktop operating systems via APM - Advanced Power Management
  – OS/BIOS co-operation
  – When OS goes to idle condition it performs an access to a register that causes an SMI#
  – SMI handler puts system into low power state
  – APM required OS to trust the system BIOS

Current OSPM - ACPI

• Advanced Configuration and Power Interface (ACPI)
  – OS visible (SCI-based) as opposed to OS invisible (SMI-based)
  – OS/drivers/BIOS are in sync regarding power states
• Individual device management w/o H/W traps and timers
  – OS & drivers are better judges of system/device state
• ACPI defines multiple sleep states
  – Global states (G)
  – CPU states (C)
  – System states (S)
  – Device states (D)
  – Bus states (B)
• Thermal Management

ACPI System Architecture

[Figure: ACPI system architecture, top to bottom]
– Applications
– OS Dependent Application APIs
– Kernel, with OSPM System Code, Device Driver, and HAL (OS specific technologies, interfaces, and code)
– ACPI Table Interface, ACPI Register Interface, ACPI BIOS Interface; existing industry standard register interfaces to: CMOS, PIC, PITs, ...
– ACPI Tables, ACPI Registers, ACPI BIOS (OS independent technologies, interfaces, code, and hardware)
– Platform Hardware, BIOS

Legend:
– ACPI Spec covers this area.
– OS specific technology, not part of ACPI.
– Hardware/Platform specific technology, not part of ACPI.

ACPI Processor Power States

[Figure: processor power state diagram within G0 Working]
– C0: Full Speed; enters Throttling when THT_EN=1 and DTY=value; returns to Full Speed when THT_EN=0
– C0 to C1: HLT; back to C0 on Interrupt
– C0 to C2: P_LVL2; back to C0 on Interrupt
– C0 to C3: P_LVL3, ARB_DIS=1; back to C0 on Interrupt or BM Access
– Latency: C1 < C2 < C3
– Power: C1 > C2 > C3

ACPI System States

G0 Working
  CPU: C0: Executing @ Full Speed; C1:C3: Executing in PM state (i.e. Thermal Throttle/HLT)
  Memory: Retained; Power: ON; Refresh: Normal
  Devices: Powered Up & Down based on demand, D0-D3

S1 Sleeping
  CPU: Not Executing; Context Retained; CPU CLK: OFF; System CLK: ON; Power: ON
  Memory: Retained; Power: ON; Refresh: Normal
  Devices: Power down depending on wakeup & power requirements
  Wake Up: Lowest Latency; Restart @ CS:IP +1
  Context Tracking: H/W responsible for saving context of CPU, System I/O, & Memory

S2 Sleeping
  CPU: Not Executing; CPU/Sys Cache Context Lost; CPU CLK: OFF; System CLK: OFF; Power: ON
  Memory: Retained; Power: ON; Refresh: Standby / Auto
  Devices: Power down depending on wakeup & power requirements
  Wake Up: Latency > S1; Restart @ Boot Vector
  Context Tracking: H/W responsible for saving context of System I/O & Memory; OS responsible for saving CPU context

S3 Sleeping
  CPU: Not Executing; CPU/Cache Context Lost; CPU CLK: OFF; System CLK: OFF; Power: OFF
  Memory: Retained; Power: ON; Refresh: Standby / Auto
  Devices: Power down depending on wakeup & power requirements
  Wake Up: Latency > S2; Restart @ Boot Vector
  Context Tracking: H/W responsible for saving Memory context; BIOS restores Memory Controller context; OS responsible for saving CPU & System I/O context

S4 / S4BIOS Sleeping
  CPU: Not Executing; CPU/Cache Context Lost; Everything: OFF
  Memory: Context Lost; Power: OFF; Refresh: N/A
  Devices: Power down depending on wakeup & power requirements
  Wake Up: Latency > S3; Restart @ Boot Vector
  Context Tracking: OS (S4) / BIOS (S4BIOS) is responsible for saving and restoring all system context, including memory

G2/S5 Soft OFF
  CPU: OFF
  Memory: OFF
  Devices: Devices are OFF; Power Button Press will wake up the system
  Wake Up: Latency > S4; Restart @ Boot Vector
  Context Tracking: OS uses S5 to turn the machine off

NOTES:
- OS chooses the lowest supported sleep state in which all enabled wakeup devices still function under the latency requirements from apps.
- ASL binds each Sx state to a SLP_TYP value, which, based on the platform design of power planes & clocking logic, determines what portions of the h/w power down.
- For each Device, ASL lists which power resources are needed to maintain a 'wakeup' capable state.
- 'System I/O' refers to Motherboard Devices: PIT, PIC, DMAC, NMI State.... OS saves & restores this stuff for S3.

ACPI Timer

• Service provided to OS by chipset
  – Generates an interrupt every 2.34 seconds
  – 24-bit continuously running counter allows for fine granularity time measurement between events
• Hardware timer ensures correct OS timing algorithms in the face of variable processor and device execution speed
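A minimal sketch of how two samples of the 24-bit counter turn into an elapsed time. It assumes the 3.579545 MHz clock that the ACPI specification defines for the power management timer (the slide does not state the frequency); the function names are illustrative. With that clock, the counter's top bit toggles about every 2.34 seconds, which matches the interrupt period quoted above.

```python
PM_TIMER_HZ = 3_579_545            # assumed: ACPI PM timer clock, not stated on the slide
COUNTER_BITS = 24                  # from the slide: 24-bit free-running counter
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def elapsed_ticks(before: int, after: int) -> int:
    """Ticks between two samples, tolerating one counter wraparound."""
    return (after - before) & COUNTER_MASK


def elapsed_seconds(before: int, after: int) -> float:
    """Elapsed time between two counter samples."""
    return elapsed_ticks(before, after) / PM_TIMER_HZ


# Period at which the most significant counter bit changes state.
msb_toggle_period_s = (1 << (COUNTER_BITS - 1)) / PM_TIMER_HZ
print(round(msb_toggle_period_s, 2))
```

Masking the difference is what lets the OS measure across a rollover, provided it samples at least once per full counter period.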

Processor Power Management

• OS manages the power dissipation of the processor
  – OS chooses Cx state based on idle time above a threshold
    Above threshold, OS uses lower power Cx state
  – ACPI Timer used for idle timer detection
    Timer sampled prior to and after exiting idle loop
  – Processor clock throttling
    OS could scale processor duty cycle to match system usage
    E.g. 25% idle time, processor performance throttled to 75%
    Clock throttling has a longer latency today than C1 implementation
• OS uses power dissipation of processor to manage zone temperatures
  – ACPI allows thermal zones
    Different thermal characteristics are allowed per zone
    OS will use processor clock throttling to control temperature of the zone that includes the processor
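The duty-cycle idea can be sketched as follows. This is only an illustration of the "25% idle, throttle to 75%" example: the number of throttle steps is a hypothetical platform parameter, and rounding up to the next step is a design choice made here so the workload still fits, not something the slide specifies.

```python
import math

THROTTLE_STEPS = 8   # hypothetical: duty-cycle settings the platform exposes


def target_duty_cycle(idle_fraction: float) -> float:
    """Smallest available duty cycle that still covers the measured busy time."""
    busy = 1.0 - idle_fraction
    # Small epsilon keeps exact step boundaries from rounding up a step.
    step = math.ceil(busy * THROTTLE_STEPS - 1e-9)
    step = min(max(step, 1), THROTTLE_STEPS)
    return step / THROTTLE_STEPS


print(target_duty_cycle(0.25))   # the slide's example: 25% idle
```

The idle fraction itself would come from sampling the ACPI timer before and after the idle loop, as the slide describes.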

ACPI SW Concepts

• Dedicated ACPI Interrupt
  – SCI (System Control Interrupt): RTC, ACPI Timer, Thermal sensor, Lid on Laptop, PCI device hotplug etc
  – ACPI power/configuration events reflected to OS via SCI
  – Wake events cause system to transition to S0 state from a sleeping state, e.g. Wake on RTC
  – Runtime events - hot plug of PCI device, thermal sensor
• OS/BIOS Interaction
  – OS can generate SMI# by setting a bit in chipset
  – BIOS can generate SCI by setting a bit in chipset

ACPI SW Concepts

• ACPI can execute "BIOS like" code
  – Code is created by BIOS/platform developers
  – ACPI Source Language (ASL) for writing methods
  – Compiled to ACPI Machine Language (AML) and merged w/ BIOS as part of ACPI tables
  – AML executed by OS

ACPI Thermal Management Methods

• Active cooling
  – Turn on/speed up system fan when system is hot
  – Turn off/slow down fan when system is cool
• Passive cooling
  – Reduces processor power dissipation when system is hot
  – Restricts power by modulating processor's STPCLK# pin
  – STPCLK# duty cycle roughly proportional to reduction in CPU thermal dissipation
• Separate trigger points for Active versus Passive

ACPI Thermal Model

• For passive cooling the OS actively monitors the temperature in order to cool the platform
• The OS calculates the CPU performance change (∆P) required to bring the temperature down
  – Predefined equation with OEM supplied constants
  – OEM defined sampling period, _TSP

[Figure: Temperature and CPU Performance (50% to 100%) vs. Time; successive temperature samples Tn-1 and Tn taken one _TSP (sampling period) apart, with the resulting performance change ∆P]
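The slide does not spell out the predefined equation. The sketch below uses the passive cooling formula from the ACPI specification, ∆P[%] = _TC1 × (Tn − Tn-1) + _TC2 × (Tn − Tt), where Tt is the passive trip point. The constant values, the temperatures, and the clamping range are hypothetical examples, not OEM data.

```python
def passive_cooling_delta_p(tc1: float, tc2: float,
                            t_now: float, t_prev: float, t_trip: float) -> float:
    """Performance change in percent; positive means slow the CPU down."""
    return tc1 * (t_now - t_prev) + tc2 * (t_now - t_trip)


def next_performance(perf_pct: float, delta_p: float,
                     floor_pct: float = 0.0) -> float:
    """Apply the change and clamp to the allowed performance range."""
    return min(100.0, max(floor_pct, perf_pct - delta_p))


# Hypothetical sample: temperature rose 2 degrees in one _TSP period
# and now sits 2 degrees above the passive trip point.
dp = passive_cooling_delta_p(tc1=1, tc2=5, t_now=82.0, t_prev=80.0, t_trip=80.0)
print(dp, next_performance(100.0, dp))
```

The first term reacts to how fast the temperature is moving, the second to how far it is from the trip point, so performance recovers once the temperature falls back below the trip point.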

Response Time

• Operating system response times
  – SCI interrupt handler is NOT always the highest priority interrupt
  – SCI service routine potentially paged to disk
• Results in a high latency between a thermal trigger point and OS induced response
  – Responses may vary from a few microseconds to several 10's of milliseconds

A Power Aware Processor


PowerPC 750*

• Is a low power and thermally aware processor
  – Contains a Thermal Assist Unit
    Provides die temperature monitoring & measurement
    ±4°C temperature resolution
    Allows interrupt generation when temperature levels exceed preset thresholds
  – Implements reduced power operating modes
  – Implements Instruction Cache Throttling
    Programmable delay inserted between cache fetch operations
    Reduces maximum power dissipation
• With an ACPI compliant interface the processor die temperature could be controlled directly by an ACPI aware operating system

*Third party brands and names are the property of their respective owners

Simulated Die Thermal Rise Time

• When a processor package is cooled to its Max Power the rise in die temperature over the last 1°C takes a "long time"
• From a manufacturing perspective it is better to define max die temperature to be the lowest possible value
  – Direct impact on processor yield and/or frequency
• When a processor is cooled to something less than its Max Power a 1°C temperature change can happen much quicker
  – Temperature increases from 105°C to 109°C in ~60ms

[Figure: Power Up Transient (Post Throttle), step power increase from 70% (91W) to 100% (130W) of Max Power; die temperature, 104 to 110°C, and ∆T (Tj_max - Tamb), 69 to 75°C, vs. Time, 0 to 0.12 s]

Conclusion

• Current OS ACPI implementations could not reliably control processor die temperature to an accuracy of < 1°C
  – A processor die thermal rise-time of 1°C in 15mSec is less than SCI maximum response time
• OS can control system temperatures to an accuracy of < 1°C
  – Thermal mass of system ensures that OS response times are adequate for effective closed loop control of system temperature
• Need other options for accurate temperature control
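The 15 ms figure follows from the simulated transient quoted earlier (105°C to 109°C in ~60 ms). The sketch below redoes that arithmetic and estimates the overshoot for a given OS response latency. It assumes the rise is linear over the interval, and the 30 ms latency is a hypothetical value inside the "several 10's of milliseconds" range.

```python
# From the simulated transient: 105 C to 109 C in about 60 ms
rise_c = 109 - 105
rise_ms = 60

ms_per_degree = rise_ms / rise_c   # time for the die to rise 1 C


def overshoot_c(response_latency_ms: float) -> float:
    """Die temperature rise before the OS reacts, assuming a linear rise."""
    return response_latency_ms / ms_per_degree


print(ms_per_degree, overshoot_c(30))   # 30 ms is a hypothetical SCI latency
```

Any response latency above 15 ms therefore already exceeds the 1°C control target under this linear approximation.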

Closed Loop Control of Processor Die Temperature

• Hardware could be added to a system to enable closed loop control of processor die temperature
  – Temperature sensor on die as in the PowerPC 750*
  – Closed loop control circuitry on the system baseboard
  – Adds cost to the basic system and dilutes the original "objective"
• Closed loop control logic would be most effective if the complete circuit was added to the processor die
  – Any logic enabling closed loop temperature control should also be SW accessible/controllable
• Closed loop control is somewhat antagonistic to the goals of ACPI
  – The OS is no longer the best judge of when the processor should execute at < 100%

*Third party brands and names are the property of their respective owners
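One simple form such closed loop logic could take is a threshold controller with hysteresis, sketched below. This is an illustration only, not the circuit the tutorial has in mind; the trip point, hysteresis band, and throttled duty cycle are hypothetical values.

```python
def next_duty(temp_c: float, duty: float,
              trip_c: float = 105.0,       # hypothetical trip temperature
              hysteresis_c: float = 1.0,   # hypothetical band below the trip point
              throttled: float = 0.7) -> float:
    """Return the duty cycle for the next control interval."""
    if temp_c >= trip_c:
        return throttled             # too hot: cut power immediately
    if temp_c <= trip_c - hysteresis_c:
        return 1.0                   # comfortably cool: run at full speed
    return duty                      # inside the band: hold the current setting


duty = 1.0
for reading in (103.0, 105.2, 104.5, 103.9):
    duty = next_duty(reading, duty)
    print(reading, duty)
```

Because the decision needs only the latest sensor reading, it can run every control interval in hardware, without waiting on an SCI to be serviced by the OS.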

Conflict of Processor Bandwidth Allocation Versus Temperature

• The constraint of processor performance to manage thermal dissipation or temperature would incrementally impact SW performance
  – Impact is incremental to the existing variables of interrupt rate, cache misses, I/O workload, available bus bandwidth etc
  – OS support for committed processor bandwidth allocation (real time) is currently unavailable
• Closed loop processor thermal control must be considered when determining total available processor bandwidth in real time allocation policy

Summary

• Cost can be removed from a system by
  – reducing processor power dissipation
  – cooling a processor to less than its maximum power dissipation
• Processor yield and frequency can be helped by accurately controlling die temperature
  – To something less than ± 1°C
• Operating System Power Management as implemented today cannot respond quickly enough to achieve ± 1°C control on higher power processors
  – On chip closed loop control can achieve better than ± 1°C

Micro32 Cool Chip Tutorial

Architectural Level Power/Performance Optimization and Dynamic Power Estimation - An Example from Simple Scalar Simulator Power Model

George Cai, Chee How Lim
Intel Corp

ACM/IEEE Micro32, Nov. 15, 1999, Haifa, Israel

Acknowledgements

Thanks to Vivek De and Shekhar Borkar for their very valuable statistical analysis, data collection, and extensive help.

Thanks to Tosaku Nakanishi, Phil Wennblom, Krishnan Ravichandran, Shawn Searles, Tom Fletcher, Doug Carmean, Steve Gunther, and many of our colleagues at Intel.

Thanks to Professors Trevor Mudge, Brad Calder, Dean Tullsen, Wen-Mei Hwu, G. Gao, Dirk Grunwald and their research groups for their valuable feedback and encouragement.

Agenda

• Challenges to microprocessor architecture
  – Power/performance optimization is a new dimension of microprocessor architecture
• Dynamic power estimation technique for power/performance optimization
  – An example from Simple Scalar Simulator Power Model

Power/Performance Optimization - new dimension of microprocessor architecture

• Power and thermal limitation impacts on very high frequency implementation
• Difficulties of traditional voltage scaling for power reduction
  – Transistor scaling difficulties
  – Complexity of high speed circuit scaling increases rapidly
  – Power/perf. tradeoff between leakage and multiple Vt
  – Architectural solution of soft error for reliability at low Vcc
• Rapid increase in the number of transistors for non-computational logic blocks on chip

Power efficient architecture must be emphasized

Power Consumption And Thermal Requirement Restricting High Frequency Implementation

• Gate_delays/clock_cycle reduction of 25% per generation
• Clock frequency doubles every generation
• Deeper and deeper pipeline for higher frequencies
• The more power efficient the architecture, the higher the implementation frequency within a given thermal budget
• The better the dynamic power behavior of the microarchitecture, the higher the performance and the lower the cost the CPU can achieve

High frequency architecture must be power efficient

Processor frequency trend

[Figure: processor frequency (MHz, log scale 10 to 10,000) and gate delays/clock (log scale 1 to 100) vs. year, 1987 to 2005, for Intel (386, 486, Pentium(R), Pentium Pro(R)), IBM PowerPC (601, 603, 604, 604+, MPC750), and DEC (21064A, 21066, 21164, 21164A, 21264, 21264S). Processor freq scales by 2X per generation.]

1. Frequency doubles each generation
2. Number of gates/clock reduces by 25%

Challenges of Voltage Scaling For Power Reduction

• Voltage scaling has been the most effective way to reduce power consumption
  – Reduce gate_delay 25% per generation
  – Double the transistor density per generation
  – Reduce energy per transition by 30%-65% per generation
• Traditional voltage scaling faces challenges
  – Difficult to scale transistor oxide aggressively
  – Transistor scaling becomes more difficult than it was

Equivalent Architectural Scaling Method Should Be Enabled
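The two 25% reductions quoted in this tutorial (gate delay per generation, and gate delays per clock cycle per generation) combine into the frequency trend roughly as sketched below. Treating cycle time as the simple product of the two factors is a first-order assumption made here, not a model given on the slides.

```python
gate_delay_scale = 0.75        # from the slides: gate delay falls 25% per generation
gates_per_cycle_scale = 0.75   # from the slides: gate delays per clock fall 25%

# First-order assumption: cycle time = gate delay x gate delays per cycle
cycle_time_scale = gate_delay_scale * gates_per_cycle_scale
freq_gain = 1.0 / cycle_time_scale

print(round(freq_gain, 2))
```

Under this assumption the two effects together give a bit under 1.8X per generation, in the neighborhood of the "frequency doubles each generation" trend, with process and pipelining each contributing about 1.33X.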

Transistor scaling trends - Tox

[Figure: Electrical Tox (nm) and Oxide Field (MV/cm) vs. Technology Dimension (µm), from 1.5 down to 0.05, and by year, 1997 to 2012; SiO2 region followed by "Advanced Gate Dielectric Required". Transistor cross-section labels: thin Tox, shallow highly doped source/drain extension, halo/pocket, deep source/drain, retrograde well, shallow trench isolation, p+, n-well.]

Complexity Of High Speed Circuit Scaling Increasing Rapidly

• Self timed circuits: gate delay and IC delay sensitivities to Vcc are different
• Feedback circuits: N-MOS in keeper is often non-minimum L; a MOS with a different L has a different Vcc sensitivity
• I/O circuits: I/O timing is relative to external clock; voltage translator delay sensitivity to Vcc is non-linear
• Interconnect scaling difficulties
  – High parasitic (R&C) and micro-architectural complexity
  – Complex interconnect distribution reducing transistor density

Microarchitecture must consider its scalability

®

Pentium Pro (R) Pentium(R) II Pentium (MMX) Pentium (R) Pentium (R) II

Source: Intel

No of nets (Log Scale)

Micro32 Cool Chip Tutorial

Interconnect distribution

10

100

1,000

10,000

100,000

Length (u)

Interconnect distribution does not change significantly

Architectural Level Power/Performance Optimization and Dynamic Power Estimation

ACM/IEEE Micro32 Nov. 15, 1999 Haifa, Israel

Page 10

94

Interconnect performance

[Figures: Relative Resistance, Total Capacitance (Relative), and Relative RC delay for Poly and M1 to M5 at 1.0µ, 0.8µ, 0.6µ, 0.35µ, and 0.25µ]

% increase each tech generation

Layer   R     C     RC
Poly    45%   -2%   42%
M1      53%   5%    61%
M2      46%   12%   62%
M3      39%   8%    51%
M4      18%   24%   46%

• R increases faster at lower levels
• C increases faster at higher levels
• RC increases ~40-60%
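The RC column can be cross-checked against the R and C columns, since a delay proportional to R × C should grow by (1 + R)(1 + C) − 1. The sketch below runs that check on the table's values; the variable names are illustrative, and small differences are expected because the published percentages are rounded.

```python
# (R increase, C increase, RC increase) per generation, from the table
table = {
    "Poly": (0.45, -0.02, 0.42),
    "M1":   (0.53,  0.05, 0.61),
    "M2":   (0.46,  0.12, 0.62),
    "M3":   (0.39,  0.08, 0.51),
    "M4":   (0.18,  0.24, 0.46),
}

# RC growth implied by compounding the R and C growth for each layer
derived = {layer: (1 + r) * (1 + c) - 1 for layer, (r, c, _) in table.items()}

for layer, (_, _, listed) in table.items():
    print(f"{layer}: derived {derived[layer]:.1%}, listed {listed:.0%}")
```

The derived values land within about two percentage points of the listed RC column for every layer.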

Leakage Current Important To Performance And Power Consumption

• Leakage Ioff increasing 5X when Vt scaling 15% in future microprocessors
• Driving thermal runaway!
• Cooling systems have high cost
• Increasing total power consumption and negative impact on high end server product performance
• Critical to battery life of mobile computing

Dual Vt microarchitecture for power/performance optimization

• Low Vt for performance critical microarchitecture and data paths
• High Vt for power critical microarchitecture
• Evaluate each class of dual Vt circuits for systematically applying them to appropriate microarchitecture to achieve the best power/performance optimization
• Need architecture/circuit/CAD tool/process cooperation

Power/performance optimization for Dual Vt CPUs

Power/performance Tradeoff Between Leakage And Dual Vt

[Figure: Ioff (nA/µ), log scale 1 to 10,000, vs. Temp (C), 30 to 110, for 0.25µ, 0.18µ, 0.13µ, and 0.10µ technologies. Lower Vt, higher drain leakage.]

Starting with 0.25µ technology, assume:

Vt                             450 mV
Ioff at 30 C                   1 nA/µ
Subthreshold slope at 30 C     80 mV/decade
Subthreshold slope at 100 C    100 mV/decade
Vt scaling per generation      15%
Ioff increase at 30 C          5X

Reference: Mark Bohr, et al., IEDM, 1996
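Taking the stated assumptions at face value (1 nA/µ at 30 C for the 0.25µ node, 5X more per generation), the projected room-temperature leakage per node can be tabulated as below. This simply compounds the slide's 5X factor; it does not model the temperature dependence shown in the figure, and the node list is the one from the figure legend.

```python
BASE_IOFF_NA_PER_UM = 1.0    # from the slide: Ioff at 30 C for 0.25u
GROWTH_PER_GENERATION = 5.0  # from the slide: Ioff increase at 30 C

nodes = ["0.25u", "0.18u", "0.13u", "0.10u"]

# Compound the per-generation factor, starting from the 0.25u node
ioff_30c = {
    node: BASE_IOFF_NA_PER_UM * GROWTH_PER_GENERATION ** generation
    for generation, node in enumerate(nodes)
}

for node, ioff in ioff_30c.items():
    print(node, ioff, "nA/u at 30 C")
```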

Dual Vt design - a leakage control technique

Technology provides two Vt
1. High Vt with nominal Ioff (lower performance)
2. Low Vt with ~10X higher Ioff (higher performance)

• High Vt: employing high Vt everywhere yields lower performance, and lower leakage (1X)
• Low Vt: employing low Vt everywhere yields higher performance, but higher leakage (10X)
• Dual Vt: selective usage of low and high Vt yields higher performance, yet low leakage, between 1X and <<10X

[Figure: three histograms of number of paths vs. delay (logic path between latch boundaries), for High Vt, Low Vt, and Dual Vt]
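The "between 1X and <<10X" range can be illustrated with a simple mix model. It assumes total leakage scales with the fraction of device width built from low Vt transistors, which is a simplification introduced here; the 20% figure is a hypothetical example.

```python
HIGH_VT_LEAKAGE = 1.0    # from the slide: nominal Ioff (1X)
LOW_VT_LEAKAGE = 10.0    # from the slide: ~10X higher Ioff


def relative_leakage(low_vt_fraction: float) -> float:
    """Chip leakage relative to an all-high-Vt design (assumed linear mix)."""
    return ((1.0 - low_vt_fraction) * HIGH_VT_LEAKAGE
            + low_vt_fraction * LOW_VT_LEAKAGE)


# Hypothetical design: low Vt only on the 20% of devices in critical paths
print(relative_leakage(0.2))
```

Keeping low Vt confined to the critical paths is what holds the total well under the 10X of an all-low-Vt design.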

Power And Architectural Power Efficiency

• Rapidly increasing number of transistors used for non-computational logic on chip
• Active capacitance (power/[Vdd^2 × Frequency]) grows by 35%
• Die size grows 25%
• 2X frequency
• Vdd scaling down 30%
• Misprediction penalty higher and higher because we predict almost everything!
• If trends continue (per generation) ...

2000 watts for supply voltage scaled microprocessors
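The per-generation trends combine through the usual dynamic power relation, power = active capacitance × Vdd^2 × frequency, as sketched below. The slide does not show how its 2000 watt projection is derived or what starting power it uses, so this only computes per-generation growth factors under two assumed readings of the 35% figure.

```python
def power_scale(cap_scale: float, vdd_scale: float, freq_scale: float) -> float:
    """Per-generation power growth from P = C * Vdd^2 * f."""
    return cap_scale * vdd_scale ** 2 * freq_scale


# Reading 1 (assumption): the 35% is total active capacitance growth.
total_cap_reading = power_scale(1.35, 0.70, 2.0)

# Reading 2 (assumption): the 35% is per-area growth, compounded with 25% die growth.
density_reading = power_scale(1.35 * 1.25, 0.70, 2.0)

print(round(total_cap_reading, 3), round(density_reading, 3))
```

Either way power keeps rising each generation even with a 30% supply voltage reduction, which is the trend the slide extrapolates.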

Active capacitance density

Active Cap = Power / (Vcc^2 × freq)
Cap Density = C / Area

[Figure: Active Cap Density (nF/mm^2), log scale 0.01 to 1.00, vs. technology (1.5µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ) for 386, 486, Pentium(R), Pentium (MMX)(TM), Pentium Pro(R)]

Active capacitance grows 30-35% each technology generation

Micro32 Cool Chip Tutorial

How To Achieve Power/Performance Optimization

• Architectural definition stage
• RTL implementation stage
• Circuit implementation stage
• Physical design stage
• Processing and manufacture stage

Fundamental difficulty for new microarchitecture:
– Important architecture and design decisions must be made at early design phase, such as architecture definition stage
– Accurate power estimation is obtained at late design phase

Leads to multiple phase optimization and many iterations among all phases


Power Estimation Errors vs. Microprocessor Design Phases

[Figure: power estimation error (axis 0 to 25) across design phases: architecture definition, RTL, circuit, layout]


Power Consumption Correlation Studies And Power Estimation Error Analysis

• Correlation between power estimations from low level design and high level architecture power simulator
• Critical-path-based power estimation correlation analysis between architectural simulation and low level design
  – Circuit type based analysis
  – Typical activity factor based analysis
• Thermal-image-based correlation analysis
  – Hottest spot locations, coolest spot locations
  – Temperature differences, temperature distribution
• Micro-benchmark-based power estimation correlation analysis
  – Low level design (circuit, schematics, post layout)
  – Silicon correlation (max, average, min. I/O di/dt, thermal analysis)


Dynamic Power Modeling And Estimation

• Architectural and design partition
• Dynamic power behavior measurability and controllability
• Power density, activity, and effectiveness
• Architecture, circuit, layout, process impacts
• Statistical estimation and error analysis
• Relative value and absolute value


Modeling Parameters For Dynamic Power Estimation (1)

• di/dt threshold (DT): the threshold of the supply current difference during a unit clocking time;
• power threshold (PT): the threshold of the microprocessor dynamic power consumption during its execution;
• dynamic power monitor (DPM): a group of runtime counters and procedure calls that monitor the microprocessor runtime dynamic power behaviors including di/dt and max power violations and violation distribution;
• effective activity factor (EAF): a scaling factor that appropriately scales architecture activity factors and dynamic power monitor variables to applied logic and its possible layout area for power impact measurement.
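To make the DT, PT, and DPM definitions concrete, here is a minimal sketch of a monitor that counts threshold violations over a per-cycle power trace. The class, its fields, and the conversion of power to current as I = P / Vdd are assumptions made for illustration; the slides do not show how the tutorial's simulator implements its counters.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DynamicPowerMonitor:
    """Hypothetical runtime counters for di/dt and max-power violations."""
    didt_threshold: float       # DT: allowed supply current change per cycle (A)
    power_threshold: float      # PT: allowed power in any one cycle (W)
    vdd: float                  # supply voltage (V); assumes I = P / Vdd
    didt_violations: int = 0
    power_violations: int = 0
    violation_cycles: List[int] = field(default_factory=list)
    _prev_current: Optional[float] = None
    _cycle: int = 0

    def record(self, power_w: float) -> None:
        """Feed one cycle's power estimate and update the counters."""
        current = power_w / self.vdd
        violated = False
        if power_w > self.power_threshold:
            self.power_violations += 1
            violated = True
        if (self._prev_current is not None
                and abs(current - self._prev_current) > self.didt_threshold):
            self.didt_violations += 1
            violated = True
        if violated:
            # Keep the cycle index so the violation distribution can be inspected.
            self.violation_cycles.append(self._cycle)
        self._prev_current = current
        self._cycle += 1
```

The list of violating cycles stands in for the "violation distribution" mentioned in the DPM definition; a histogram over time windows would be another reasonable choice.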


Modeling Parameters For Dynamic Power Estimation (2)

• effective area (EA): the scaled circuit area of several categorized circuits.
• active power density (APD): the power consumption per unit circuit area within a functional block during the functional block implementation. It is one of the most important power parameters for dynamic power estimation;
• inactive power density (IPD): the power consumption per unit circuit area within a functional block while the functional block is inactive, such as sub-threshold leakage current. It is one of the most important power parameters for future microprocessor average power estimation;
• average power (AP): the average power consumption of the microprocessor during a given execution time or a given performance benchmark.


Assumption of Dynamic Power Modeling And Estimation Example
• Activity-sensitive power based on functional blocks
• Functional block activity derived from SimpleScalar
• Functional blocks comprised of selected circuit types:
  – Static, dynamic, SRAM, clock, programmable logic array (PLA), synthesizable and custom design
• Each circuit type dissipates power through:
  – Active power: Pdynamic is dominant
  – Inactive power: Pleakage is dominant
• Power can be statistically estimated from “reference” circuit designs
• Power = Σ_i { EAF × Σ_m (EA × APD) + (1 − EAF) × Σ_m (EA × IPD) }
  – where i = #cycles; m = circuit types
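The power formula on this slide can be sketched as a short function. The function name, the dictionary layout, and the example numbers are hypothetical; only the structure of the sum follows the slide (i over cycles, m over circuit types).

```python
def block_power(eaf_per_cycle, circuits):
    """Evaluate sum_i { EAF * sum_m(EA*APD) + (1-EAF) * sum_m(EA*IPD) }.

    eaf_per_cycle: effective activity factor (0..1) for each cycle i.
    circuits: {circuit_type: (EA, APD, IPD)} for each circuit type m.
    """
    active = sum(ea * apd for ea, apd, _ in circuits.values())
    inactive = sum(ea * ipd for ea, _, ipd in circuits.values())
    return sum(eaf * active + (1.0 - eaf) * inactive for eaf in eaf_per_cycle)


if __name__ == "__main__":
    # Hypothetical areas and densities in arbitrary, mutually consistent units.
    circuits = {"static": (2.0, 0.5, 0.05), "sram": (4.0, 0.25, 0.025)}
    print(block_power([1.0, 0.5, 0.0], circuits))
```

Dividing the result by the number of cycles gives a per-cycle average, which is one way to arrive at the average power (AP) parameter.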


SimpleScalar Power Model Overview (Characteristics)

• Activity-sensitive power simulation
• Block-level power (active) = activity * (circuit_type * area * active_power_density)
• Block-level power (inactive) = (1 - activity) * (circuit_type * area * inactive_power_density)
• Basic SimpleScalar architecture is partitioned into 32 physical blocks for power estimation. This partition and area estimation are done based on microprocessor design experience.
• Circuit power density is estimated from SPICE simulations (the circuit structures - SRAM, dynamic, static, PLA, clock - can be obtained from textbooks).


Added Power Simulation Variables

• ppres_blockname = present cycle power contribution of blockname
• pprev_blockname = previous cycle power contribution of blockname
• pdidt_blockname = the change from present to previous power (dynamic power)
• pres_blockname = present cycle activity contribution of blockname
• prev_blockname = activity contribution of blockname up to previous cycle
• count_blockname = activity contribution of blockname up to present cycle
• blockname.ckt_pda = active power density of circuit ckt for block blockname, where ckt = {dyn,sta,mem,pla,clk}
• blockname.ckt_pdi = inactive power density of circuit ckt for block blockname, where ckt = {dyn,sta,mem,pla,clk}
• blockname.ckt_a = circuit area
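The bookkeeping these variables imply can be sketched per block as follows. This is not the simulator's source code: the class, the dictionary layout, and the sign convention for pdidt (present minus previous) are assumptions, with attribute names chosen to echo the variable names listed above.

```python
CIRCUITS = ("dyn", "sta", "mem", "pla", "clk")


class BlockPower:
    """Per-block state loosely mirroring ppres, pprev, pdidt, and count."""

    def __init__(self, area, pda, pdi):
        # area, pda, pdi: dicts keyed by circuit type (ckt_a, ckt_pda, ckt_pdi).
        self.area, self.pda, self.pdi = area, pda, pdi
        self.ppres = 0.0   # present cycle power contribution
        self.pprev = 0.0   # previous cycle power contribution
        self.pdidt = 0.0   # assumed here to be present minus previous power
        self.count = 0.0   # accumulated activity up to the present cycle

    def step(self, activity):
        """Advance one cycle given the block's activity (0..1) in that cycle."""
        active = sum(self.area[c] * self.pda[c] for c in CIRCUITS)
        inactive = sum(self.area[c] * self.pdi[c] for c in CIRCUITS)
        self.pprev = self.ppres
        self.ppres = activity * active + (1.0 - activity) * inactive
        self.pdidt = self.ppres - self.pprev
        self.count += activity
        return self.ppres
```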


Names of 32 Functional Blocks

• npclog = Next-pc generation logic
• btblog = BTB logic
• btbcac = BTB cache
• rsbcac = RSB
• itlbcac = Instruction TLB
• dtlbcac = Data TLB
• pmhlog = Page miss handler
• il1log = L1 inst cache logic
• il1tag = L1 inst cache tag
• il1cac = L1 inst cache array
• dl1log = L1 data cache logic
• dl1tag = L1 data cache tag
• dl1cac = L1 data cache array
• dispatchq = Dispatch queue
• decodepla = Inst decoder
• decodemisp = Misprediction handling logic
• decodestall = Decoder stall logic
• ratarr = Rename table
• ruuarr = Re-order buffer
• lsqarr = Load/Store queue
• ruurdyq = Re-order buffer ready queue
• lsqrdyq = Load/Store queue ready queue
• ruuarb = Re-order buffer arbitration logic
• lsqarb = Load/Store queue arbitration logic
• ruuwb = Re-order buffer writeback scheduler
• lsqwb = Load/Store queue writeback scheduler
• fuint = Integer execution unit
• fufp = Floating point execution unit
• ul2log = Unified L2 logic
• ul2tag = Unified L2 tag
• ul2cac = Unified L2 cache
• biu = Bus/IO buffer

Example of SimpleScalar Simulator Partition (1)

[Figure: pipeline block diagram with fetch, dispatch, scheduler, exec, mem, writeback, and commit stages, connected to IL1, ITLB, DL1, DTLB, IL2, DL2, and BIU & Memory]


Example of SimpleScalar Simulator Partition (2)

[Figure: chip floorplan showing the placement of the functional blocks FEBTBBRLOG, FEBTBBRCAC, SCRRTALARR, DEMISBRLOG, FERSBBRCAC, FENPCGNLOG, DEDISINQUE, DEDECINPLA, SCRUUALARR, SCRUURYQUE, EXIEUCPLOG, EXFEUCPLOG, SCRUUWBLOG, SCRUUABLOG, METLBINCAC, MEIL1INLOG, MEIL1INTAG, DESTLINLOG, MEIL1INCAC, SCRRFFPREG, SCRRFGPREG, MEPMHUDLOG, SCLSQALARR, SCLSQABLOG, METLBDACAC, SCLSQWBLOG, SLSQRYQUE, MEDL1DALOG, MEDL1DATAG, MEDL1DACAC, MEUL2IDLOG, MEUL2IDTAG, MEUL2IDCAC, and BUBUSIOBUF, with OFF-CHIP MEMORY outside the die]


Example of SimpleScalar Simulator Partition (3)

UNIT | BLOCK | ARCHITECTURE FEATURE | ACTIVITY
FETCH | FENPCGNLOG | Next pc generation logic | npc
FETCH | FEBTBBRCAC | BTB cache, 512 entry, 4 way | brlookup + brupdate
FETCH | FEBTBBRLOG | BTB logic, 3 extra misprediction penalty cycles | brlookup + brupdate
FETCH | FERSBBRCAC | RSB stack, 8 entry | rsbpush + rsbpop
FETCH | FETLBINCAC | ITLB, LRU, 16 set, 4KB page, 4 way | itlbacc + itlbrep + itlbwbk + itlbinv
FETCH | FEIL1INCAC | L1 insn cache, 512 set, 32B block, LRU, 1 cycle hit latency | il1acc + il1rep + il1wbk + il1inv
FETCH | FEIL1INTAG | L1 insn cache tag | il1acc + il1rep + il1wbk + il1inv
FETCH | FEIL1INLOG | LRU, decode logic | il1acc + il1rep + il1wbk + il1inv
DECODE | DEDISINQUE | Dispatch queue, 4 insn/cycle dispatch | dispatchqwr + dispatchqrd + dispatchqrel + dispatchqrec
DECODE | DEDECINPLA | Decode logic, 4 insn/cycle decode | decoder
DECODE | DESTLINLOG | Decoder stall logic | decodestall + decodestallchk
DECODE | DEMISBRLOG | Mispredict detect | decodemisp + decodemispchk
SCHEDULER | SCRATALARR | Register alias table, 24 entry | ratidep + ratodep + ratstall + ratstallchk + ruuret + ruurec + lsqret + lsqrec
SCHEDULER | SCRRFGPREG | Int register file, 32 entry | (integrated into RUU)
SCHEDULER | SCRRFFPREG | Fp register file, 32 entry | (integrated into RUU)
SCHEDULER | SCRUUALARR | RUU array, 16 entry | ruuarr + ruurdyqsch + ruurec + ruuret


Example of SimpleScalar Simulator Partition (4)

UNIT | BLOCK | ARCHITECTURE FEATURE | ACTIVITY
SCHEDULER | SCRUUWBLOG | Writeback logic | ruuwb + ruuwbq + ruuret
SCHEDULER | SCRUURYQUE | Reg Dep. Chk Cam | ruurdyqcam
SCHEDULER | SCRUUABLOG | Writeback Arbitration | ruuarb
SCHEDULER | SCLSQALARR | LSQ array, 8 entry | lsqarr + lsqrdyqsch + lsqrec + lsqret
SCHEDULER | SCLSQWBLOG | Writeback logic | lsqwb + lsqwbq + lsqret
SCHEDULER | SCLSQRYQUE | Mem Dep. Chk Cam | lsqrdyqcam
SCHEDULER | SCLSQABLOG | Writeback Arbitration | lsqarb
EXECUTE | EXIEUCPLOG | Int execution logic, 4 ALU, 1 MULT/DIV | fuint
EXECUTE | EXFEUCPLOG | Fp execution logic, 4 ALU, 1 MULT/DIV | fufp
MEMORY | METLBDACAC | DTLB, 32 set, 4KB page, 4 way, LRU | dtlbacc + dtlbrep + dtlbwbk + dtlbinv
MEMORY | MEDL1DACAC | L1 data cache, 128 set, 32B block, 4 way, LRU, 1 cycle hit latency | dl1acc + dl1rep + dl1wbk + dl1inv
MEMORY | MEDL1DATAG | L1 data cache tag | dl1acc + dl1rep + dl1wbk + dl1inv
MEMORY | MEDL1DALOG | DL1 decode, LRU | dl1acc + dl1rep + dl1wbk + dl1inv
MEMORY | MEUL2IDCAC | Unified L2 cache, 1024 set, 64B block, 4 way, LRU, 6 cycle hit latency | ul2acc + ul2rep + ul2wbk + ul2inv
MEMORY | MEUL2IDTAG | Unified L2 cache tag | ul2acc + ul2rep + ul2wbk + ul2inv
MEMORY | MEUL2IDLOG | UL2 Decode, LRU | ul2acc + ul2rep + ul2wbk + ul2inv
MEMORY | MEPMHUDLOG | Page Miss Handler, 30 cycle latency | itlbmis + dtlbmis
BUS | BUBUSIOBUF | Bus interface logic, 8B bus width, 18:2 cycle latency | ul2mis + ul2rep + ul2wbk + ul2inv


Running the Simulator

> pow-outorder3 -tlb:dtlb dtlb:4:4096:64:l ../spec95-big/compress95.ss < inputs/compress/test.in

• The statement above will run the pow-outorder3 simulator for the compress benchmark with a specific data TLB configuration (refer to Todd Austin's manual).
• If you want to recompile ... Makefile:
  * Ensure that pow-outorder3.c and main1.c are listed
  * Point BINUTILS_INC and BINUTILS_LIB to the appropriate areas
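The TLB argument in the command above is a colon-separated configuration string. The sketch below splits such a string into named fields, assuming the field order `<name>:<nsets>:<bsize>:<assoc>:<repl>` that SimpleScalar's cache and TLB options are generally documented to use; check the manual for the exact meaning of each field. The `CacheConfig` type and `parse_cache_config` helper are hypothetical and are not part of the simulator.

```python
from typing import NamedTuple


class CacheConfig(NamedTuple):
    name: str
    nsets: int   # number of sets
    bsize: int   # block size in bytes (page size for a TLB)
    assoc: int   # associativity
    repl: str    # replacement policy code, e.g. 'l' (assumed to mean LRU)


def parse_cache_config(spec):
    """Split '<name>:<nsets>:<bsize>:<assoc>:<repl>' into typed fields."""
    name, nsets, bsize, assoc, repl = spec.split(":")
    return CacheConfig(name, int(nsets), int(bsize), int(assoc), repl)


if __name__ == "__main__":
    cfg = parse_cache_config("dtlb:4:4096:64:l")
    print(cfg, "entries:", cfg.nsets * cfg.assoc)
```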


Dynamic Power Behaviors of Register Rename Table (16 RUU Entries)


Full Chip Average Power Estimation (Relative Value)

[Figure: bar chart of full-chip average relative power (axis 0 to 600) for the SPEC95 benchmarks LI, GO, GCC, PERL, IJPEG, and M88KSIM]


Conclusion

• Power efficiency has severe impact on microprocessor performance, function, and cost
• New architecture and implementation challenges from power/performance optimization
• New technology providing new opportunities for power efficient architecture and power/performance optimization
• Dynamic power behavior observation and power estimation are important for power/performance optimization


Backup Foils
• Environment concerns
• Measurement entities
• Statistical estimation
• Circuit design style categorization
• Design variables
• Reference design boundary


Environment Concerns

• Active Power Case (mostly Pdynamic)
  – Fast process skew (low Vt, low Leff, low Tox)
  – Low temperature
  – High voltage
• Inactive Power Case (mostly Pleakage)
  – Fast process skew (low Vt, low Leff, low Tox)
  – High temperature
  – High voltage


Measured Entities
• Root-Mean-Square (RMS) current consumption (x Vsupply to get power)
• Area occupied

[Figure: instantaneous current (mA) versus time over Tperiod with its RMS equivalent; gate layout between VDD and VSS with IN, OUT, W, L, and AREA marked]
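As a small illustration of the first measured entity, the sketch below computes an RMS current from equally spaced samples over one period and multiplies it by the supply voltage, as the slide describes. The function names and sample values are hypothetical; in practice the samples would come from a circuit simulation.

```python
import math


def rms_current(samples_a):
    """Root-mean-square of equally spaced current samples over one period."""
    return math.sqrt(sum(i * i for i in samples_a) / len(samples_a))


def power_from_rms(samples_a, vsupply_v):
    """Power as RMS current times supply voltage, as stated on the slide."""
    return rms_current(samples_a) * vsupply_v


if __name__ == "__main__":
    # Hypothetical current waveform in amperes and a 1.5 V supply.
    print(power_from_rms([0.0, 0.004, 0.010, 0.003, 0.0], 1.5))
```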


Statistical Estimation
• Determine distribution weights of each sim. point
  – for simplicity, assume equal weights
• Determine mean and standard deviation

[Figure: occurrence distributions of gate power and gate area, each marked with its mean µ and ±σ]


Statistical Estimation

• Determine “confidence” criteria (e.g. 1 sigma)
• Power Density = Power_M / Area_N (in W/µm²)
  – M, N = {-σ, µ, +σ}
  – Example:
    PD(worst-case) = Power_+σ / Area_-σ
    PD(best-case) = Power_-σ / Area_+σ
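The 1-sigma power density bounds can be sketched with Python's standard `statistics` module. Equal weights per simulation point follow the earlier slide; the use of the population standard deviation (`pstdev`), the function names, and the sample data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def sigma_points(values):
    """Return (mu - sigma, mu, mu + sigma) with equal weight per sim point."""
    mu, sigma = mean(values), pstdev(values)
    return mu - sigma, mu, mu + sigma


def power_density_bounds(powers_w, areas_um2):
    """Best-case, nominal, and worst-case power density at 1 sigma."""
    p_lo, p_mu, p_hi = sigma_points(powers_w)
    a_lo, a_mu, a_hi = sigma_points(areas_um2)
    return {
        "best": p_lo / a_hi,      # Power(-sigma) / Area(+sigma)
        "nominal": p_mu / a_mu,   # Power(mu) / Area(mu)
        "worst": p_hi / a_lo,     # Power(+sigma) / Area(-sigma)
    }


if __name__ == "__main__":
    # Hypothetical gate power (W) and gate area (um^2) simulation points.
    print(power_density_bounds([1.0e-6, 3.0e-6], [2.0, 6.0]))
```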


Circuit Types
• Static Circuit
  – Simulate NAND/NOR gates with different fanin/fanout
    • 2/3/4 fanin NAND with fanout of 2/3/4
    • 2/3/4 fanin NOR with fanout of 2/3/4
    • INV with fanout of 2/3/4

[Figure: static NAND and NOR gate schematics with inputs A, B and output OUT]


Circuit Types

• Dynamic Circuit
  – Simulate NAND/NOR gates with different fanin/fanout
    • 2/3/4 fanin NAND with fanout of 2/3/4
    • 2/4/8/16/32 fanin NOR with fanout of 2/3/4
    • Domino_A and Domino_B

[Figure: DOMINO_A and DOMINO_B gate schematics, each with CLK, inputs A and B driving an NMOS tree, and output OUT]


Circuit Types
• SRAM Circuit
  – Simulate READ/WRITE cycles (e.g. M x 32 bits)
    • RD: Wordline (Domino), Bitline (Pch), Sense Amplifier (SA)
    • WR: Wordline (Domino), Bitline (Pch), Write Buffer (Inverter)

[Figure: SRAM array schematic with word decoder, wordlines, cells, Bit/Bit# bitlines, PCH bitline precharge and write buffers (Precharge#), and sense amplifiers (RD, OUT, OUT#)]


Circuit Types
• Clock Buffer
  – Simulate Global Distribution (H tree/Grid) and Local Clock Generation (Buffers/Choppers)

[Figure: PLL / global clock (Ref Clk, phase detect, charge pump, loop filter, VCO, frequency divider, Clk spine) and local clock generation from the Clk spine (Clk gate, CLK1#, CLK2, CLK3#)]


Circuit Types

• Programmable Logic Array
  – Simulate AND-OR Plane (e.g. M x N matrix)
  – Implement with Dynamic NOR-NOR circuit

[Figure: dynamic PLA schematic with AND-plane and OR-plane, showing CLK, signals A, B, C, D, and OUT]


Design Variables
• Input vectors
  – Random, Bimodal, Gaussian, Favor-high, Favor-low
• Circuit Fanin (2/3/4/5 etc. inputs)
• Circuit Fanout (2/3/4/5 etc. output loads)
• Circuit Sizes (2/4/6/8 etc. µm widths)

[Figure: network of gates illustrating input vectors, fanin, and fanout]

Source: chee how dissertation


Reference Design Boundary

• Reference design requirement
  – Number of gates per pipestage (e.g. 8 gates/stage)
  – Pipestage period = 1 / Fmax
  – Fmax = 1 / (Tskew + Tclkq,stage1 + Tlogic + Tsu,stage2)
  – Gate Delay ≈ Tlogic / 8

[Figure: timing diagram of Stage1, Logic, and Stage2 against the clock, showing skew and Period = 1 / Frequency]
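The reference design timing relations can be worked through numerically. The sketch below uses the Fmax and gate delay formulas from this slide; the function names and the timing values in the example are hypothetical.

```python
def fmax_hz(t_skew, t_clkq, t_logic, t_su):
    """Fmax = 1 / (Tskew + Tclkq,stage1 + Tlogic + Tsu,stage2); times in seconds."""
    return 1.0 / (t_skew + t_clkq + t_logic + t_su)


def gate_delay(t_logic, gates_per_stage=8):
    """Approximate per-gate delay as Tlogic divided by gates per pipestage."""
    return t_logic / gates_per_stage


if __name__ == "__main__":
    # Hypothetical budget: 0.1 ns skew, 0.2 ns clk-to-q, 1.6 ns logic, 0.1 ns setup.
    f = fmax_hz(0.1e-9, 0.2e-9, 1.6e-9, 0.1e-9)
    print(f / 1e6, "MHz;", gate_delay(1.6e-9) * 1e12, "ps per gate")
```

With these example numbers the pipestage period is 2.0 ns, so Fmax is about 500 MHz and each of the 8 gates gets roughly 200 ps.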

... in PDF Empires in world history power and the politics of difference Full text HTML PDF Full Although Empires in World History ... nation-centered perspectives.