Sri Rama

Design of SoC for Network Based RFID Applications

by

Krishna Teja Malladi (Y4177196)
B.Tech.-M.Tech. Dual Degree (2004-2009)

A dissertation submitted in partial satisfaction of the requirements for the degree of Master of Technology in Electrical Engineering (VLSI and Microelectronics)

Guided by Dr. Shafi Qureshi

Department of Electrical Engineering, Indian Institute of Technology, Kanpur, India.

 

DEDICATION

To my Family.


TABLE OF CONTENTS

Signature Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
List of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 5
  1.1 Systems on a Chip . . . . . . . . . . . . . . . . . . . . . . . 5
  1.2 Motivation and Approach . . . . . . . . . . . . . . . . . . . . 7
  1.3 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . 10

Chapter 2 System Architecture . . . . . . . . . . . . . . . . . . . . 12
  2.1 Interface with RFID Reader . . . . . . . . . . . . . . . . . . 14
    2.1.1 SLM 0151M RS232 . . . . . . . . . . . . . . . . . . . . . 14
    2.1.2 GSM Module . . . . . . . . . . . . . . . . . . . . . . . . 17
  2.2 Programmability with RISC Core . . . . . . . . . . . . . . . . 18
    2.2.1 8 bit RISC Processor . . . . . . . . . . . . . . . . . . . 19
    2.2.2 ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
    2.2.3 Control Engine . . . . . . . . . . . . . . . . . . . . . . 21
    2.2.4 RISC operation . . . . . . . . . . . . . . . . . . . . . . 21
  2.3 DES Security Core . . . . . . . . . . . . . . . . . . . . . . . 23
    2.3.1 DES Algorithm . . . . . . . . . . . . . . . . . . . . . . 24
    2.3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . 26
  2.4 Network Processing Unit . . . . . . . . . . . . . . . . . . . . 28
    2.4.1 Packet Engine . . . . . . . . . . . . . . . . . . . . . . 31
    2.4.2 Array Lock hardware . . . . . . . . . . . . . . . . . . . 48
  2.5 Queuing Engine . . . . . . . . . . . . . . . . . . . . . . . . 50
  2.6 Control Path / Engine Controller . . . . . . . . . . . . . . . 52
  2.7 Cache Controller and Prefetch Unit . . . . . . . . . . . . . . 52
  2.8 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
  2.9 Overall Architecture and SoC Function . . . . . . . . . . . . . 56

Chapter 3 Architectural Optimizations . . . . . . . . . . . . . . . . 59
  3.1 Macro Optimization Framework . . . . . . . . . . . . . . . . . 59
    3.1.1 Parallelization . . . . . . . . . . . . . . . . . . . . . 60
    3.1.2 Data-path width . . . . . . . . . . . . . . . . . . . . . 62
    3.1.3 Split-Cache . . . . . . . . . . . . . . . . . . . . . . . 64
    3.1.4 Power Aware Scheduling . . . . . . . . . . . . . . . . . . 66
    3.1.5 Accelerating Units . . . . . . . . . . . . . . . . . . . . 67
  3.2 Micro Architectural Framework . . . . . . . . . . . . . . . . . 69
    3.2.1 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . 69
    3.2.2 Clock Gating . . . . . . . . . . . . . . . . . . . . . . . 71

Chapter 4 Circuit Level Techniques . . . . . . . . . . . . . . . . . 73
  4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
  4.2 Multi-Vdd and Multi-Vth . . . . . . . . . . . . . . . . . . . . 74
    4.2.1 Multi-Vdd . . . . . . . . . . . . . . . . . . . . . . . . 74
    4.2.2 Multi-Vth . . . . . . . . . . . . . . . . . . . . . . . . 76
  4.3 Vdd-Vth Assignment Algorithm . . . . . . . . . . . . . . . . . 77
    4.3.1 Choice of Design Metric . . . . . . . . . . . . . . . . . 78
    4.3.2 Optimization Framework . . . . . . . . . . . . . . . . . . 80
    4.3.3 Process Variations and µ . . . . . . . . . . . . . . . . . 84
    4.3.4 Heuristic Voltage Clustering . . . . . . . . . . . . . . . 88
  4.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . 90

Chapter 5 Simulation and Synthesis . . . . . . . . . . . . . . . . . 95
  5.1 SoC Design Flow and Tools Used . . . . . . . . . . . . . . . . 95
  5.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . 97
  5.3 Custom Design and Synthesis . . . . . . . . . . . . . . . . . . 101
    5.3.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 101
    5.3.2 Back-Annotation . . . . . . . . . . . . . . . . . . . . . 103
    5.3.3 SoC Place and Route . . . . . . . . . . . . . . . . . . . 103
    5.3.4 Xilinx Place and Route . . . . . . . . . . . . . . . . . . 104

Chapter 6 Conclusions and Future Work . . . . . . . . . . . . . . . . 110
  6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 110
  6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 111
  6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . 113

Appendix A Xilinx Synthesis and PAR . . . . . . . . . . . . . . . . . 114
Appendix B Synopsys Design . . . . . . . . . . . . . . . . . . . . . 118

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

LIST OF FIGURES

Figure 1.1: SoC Design . . . . . . . . . . . . . . . . . . . . . . . . 6
Figure 1.2: Power Consumption of various components in a sensor node . 7
Figure 1.3: Layer wise Optimization Approach . . . . . . . . . . . . . 9
Figure 2.1: Block Modular diagram of overall system architecture . . . 13
Figure 2.2: RFID Interface Hardware . . . . . . . . . . . . . . . . . 16
Figure 2.3: GSM Module Architecture . . . . . . . . . . . . . . . . . 18
Figure 2.4: RISC Architecture with Controller, Processor and Memory Bank . 20
Figure 2.5: Unrolled DES Architecture where each stage is depicted in Fig. 2.6 . 25
Figure 2.6: Hardware for each round of DES . . . . . . . . . . . . . . 27
Figure 2.7: OSI Layer Hierarchy in NPU . . . . . . . . . . . . . . . . 29
Figure 2.8: Systolic Array Architecture of NPU with UDP, IP, Eth, Phy forming a PE . 30
Figure 2.9: UDP Packet Format . . . . . . . . . . . . . . . . . . . . 31
Figure 2.10: UDP Layer Architecture . . . . . . . . . . . . . . . . . 33
Figure 2.11: IP Packet format . . . . . . . . . . . . . . . . . . . . 34
Figure 2.12: IP layer architecture . . . . . . . . . . . . . . . . . . 35
Figure 2.13: Ethernet layer hardware architecture . . . . . . . . . . 36
Figure 2.14: Ethernet Layer Hardware . . . . . . . . . . . . . . . . . 38
Figure 2.15: Physical Layer stack and functionality . . . . . . . . . 41
Figure 2.16: Physical Layer PHY of the Systolic Array . . . . . . . . 42
Figure 2.17: 4B-5B encoder schematic . . . . . . . . . . . . . . . . . 44
Figure 2.18: Serializer and Scrambler . . . . . . . . . . . . . . . . 45
Figure 2.19: MLT3 Encoding . . . . . . . . . . . . . . . . . . . . . . 47
Figure 2.20: DC Encoding . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 2.21: Array Lock Hardware . . . . . . . . . . . . . . . . . . . 49
Figure 2.22: Queueing Engine Architecture . . . . . . . . . . . . . . 50
Figure 2.23: Cache-controller and Prefetch unit hardware . . . . . . . 55
Figure 3.1: Macro Architectural Tradeoffs . . . . . . . . . . . . . . 60
Figure 3.2: Improving the Transceiver Line efficiency by parallelization . 61
Figure 3.3: Data-path optimization of IP Module . . . . . . . . . . . 62
Figure 3.4: Power Savings in UDP and IP modules by datapath optimization algorithm . 65
Figure 3.5: CRC shift register Hardware . . . . . . . . . . . . . . . 68
Figure 3.6: IP checksum hardware . . . . . . . . . . . . . . . . . . . 69
Figure 3.7: Modular Pipelining Approach in our design for High throughput . 70
Figure 3.8: Latch Based Clock Gating . . . . . . . . . . . . . . . . . 71
Figure 4.1: Multi-Vdd Circuit design . . . . . . . . . . . . . . . . . 75
Figure 4.2: Multi-Vth circuit design . . . . . . . . . . . . . . . . . 76
Figure 4.3: Delay and Temperature variations for various metric based optimizations in sub-100 nm design . 79
Figure 4.4: PTµ metric surface plot with Vdd and Vth . . . . . . . . . 82
Figure 4.5: Variation of PTµ metric with µ . . . . . . . . . . . . . . 87
Figure 4.6: Power savings in NPU by using the algorithm . . . . . . . 92
Figure 4.7: Power savings in RISC Core by using the algorithm and clock gating . 93
Figure 4.8: Reduction of spread of timing slack histogram by Multi-Vdd and Multi-Vth . 94
Figure 5.1: Complete SoC Design Flow, Xilinx Mapping and Tools Used . 96
Figure 5.2: Software running on remote node which receives RFID data and forwards to email-server . 98
Figure 5.3: Ethernet TriMAC Core in Xilinx Coregen . . . . . . . . . . 100
Figure 5.4: Ethernet data integrity check in our design . . . . . . . 105
Figure 5.5: Test bench results showing TXpos and TXneg bits of NPU . . 106
Figure 5.6: Test bench showing DES Encryption results with internal pipeline stages . 106
Figure 5.7: SLM 015M RFID interface simulation (commands received over tx and sent over rx) . 106
Figure 5.8: AVM GSM Module interface simulation (SMS command sent over tx) . 107
Figure 5.9: Overall Simulation showing cascaded RFID Interface, DES Security and NPU engine; the data stream is finally placed on the Tx bus, and synchronous handshake signals (enables) pass data from one layer to another . 107
Figure 5.10: Back-annotation result of NPU showing Tx bits . . . . . . 108
Figure 5.11: Slight physical delay in assignment at clock edge after synthesis . 108
Figure 5.12: SoC after Placement and Routing . . . . . . . . . . . . . 109
Figure 6.1: Comparison of FPGA DES throughput against various other implementations . 111
Figure A.1: RTL schematic of NPU . . . . . . . . . . . . . . . . . . . 115
Figure A.2: RTL schematic of DES . . . . . . . . . . . . . . . . . . . 116
Figure A.3: Device Summary for NPU on Xilinx Virtex-5 FPGA . . . . . . 117
Figure B.1: DES after synthesis . . . . . . . . . . . . . . . . . . . 119
Figure B.2: Synthesized NPU, Packet engines and their internal structures in the following figures . 120
Figure B.3: Array of Packet Engines . . . . . . . . . . . . . . . . . 121
Figure B.4: Internal systolic structure of each PE . . . . . . . . . . 122
Figure B.5: RFID interface . . . . . . . . . . . . . . . . . . . . . . 123
Figure B.6: GSM interface . . . . . . . . . . . . . . . . . . . . . . 124

LIST OF TABLES

Table 2.1: 4B-5B Look up Table . . . . . . . . . . . . . . . . . . . . 43
Table 2.2: Control Signals of SoC . . . . . . . . . . . . . . . . . . 53
Table 2.3: Systolic Array Controls . . . . . . . . . . . . . . . . . . 54
Table 4.1: Modules and their Power-Efficiencies . . . . . . . . . . . 84
Table 4.2: Power Efficiency and Optimum Supply, Threshold voltage values . 88
Table 4.3: Clustered Voltage values . . . . . . . . . . . . . . . . . 90
Table 5.1: Dynamic Power savings in various modules by use of Multi-Vdd, Multi-Vth . 102
Table 5.2: Leakage Power savings in various modules by use of Multi-Vdd, Multi-Vth . 102
Table 5.3: Layout Summary . . . . . . . . . . . . . . . . . . . . . . 104


ACKNOWLEDGEMENTS I would like to express my deepest gratitude to Prof. S. Qureshi for his invaluable guidance throughout the past two years and for making my thesis work possible. He is the one person who not only inculcated in me a deep interest in digital circuits but has also been a constant, positive critic of my work, giving feedback, new ideas and help during difficult times. I also want to acknowledge his patience, kindness and constant motivation to always keep pushing for more. I want to thank all my teachers who have taught me a wide range of courses at IIT Kanpur over the last five years of the dual degree, especially Dr. S. Qureshi, Dr. Aloke Dutta, Dr. S.S.K. Iyer, Dr. S. P. Das and Dr. Mainak Chaudari, who have been reservoirs of knowledge in VLSI architecture, circuits, devices and technology. I'm what I'm today because they have been what they are in their professions! I would also like to acknowledge Mr. J. Whiteford for making the complex cluster of EDA tools easily accessible and for always being ready to help. I want to thank all my VLSI labmates, especially Prasanna, Sanjeev and Arun Kumar, not only for helping me with technical issues but also for the good-humored conversations in the lab every day. Arun, my long-time friend from CSE, deserves a special mention of gratitude. He has been the one close person with whom I had long discussions on things ranging from technology to sports to movies to trivia. He always seemed to have the right answer to all my questions, including EE issues! It goes without saying that I'm really thankful to my family, my wingmates and my friends, who have always been behind me in good and tough times and have moulded my positive outlook towards life and the drive to always aim for higher destinations.


ABSTRACT The last decade has seen increased usage of RFID systems and peer-to-peer sensor nodes. These need a hardwired PC, FPGA or network processor to communicate with remote servers for enterprise applications, which consumes at least three orders of magnitude more power than the system itself. Ongoing research addresses low-power RFID and sensor node design, but the additional cost and power of the communication modules remain hidden. Our current work addresses the issue of providing scalable, power-, area- and cost-efficient network connectivity to the above class of applications through the design of a custom SoC. The design principle rests on the observation that a network processor has hardware-intensive features like IP address look-up, parsing and hash functions which can be simplified for the given application. A systolic-array architecture of hardware units, each operating on one level of the network OSI hierarchy, is proposed here for the network transceiver. It is further parallelized into packet engines, each operating at a line speed of 125 MHz on the 100BASE-TX Ethernet network. In addition, interface units to the RFID reader, a GSM unit for SMS connectivity, a DES security module for secure transmission and a co-processor for system acceleration are designed. With the SoC acting as a UDP client, a remote server has been programmed to poll for RFID data and send it as an email to a mail server. After the functionality and throughput requirements are met, we use a novel multi-layer hierarchy framework which enables power and area optimizations across the boundaries of algorithm, architecture and circuits by employing techniques like parallelism, pipelining, clock gating, scheduling and queuing algorithms. A general purpose RISC processor and an SRAM cache controller have been designed to provide flexibility for additional functionality.
Further, at the circuit level, we propose a novel optimization framework to use Multi-Vdd and Multi-Vth techniques simultaneously in the presence of process variations, which


showed power savings of 57% in the transceiver. The design is then floor-planned for area and timing, and the layout of the proposed SoC is completed. Through a series of tests, the SoC has been verified against IEEE 802.3 Ethernet requirements in Xilinx tools, where it has also been synthesized and mapped onto a Xilinx Virtex-5 FPGA. After the entire design flow, the SoC has an area of 0.1015 mm² and a power of 2.143 mW at a clock speed of 125 MHz and a supply voltage of 1 V, for about 0.1 million transistors in a 90 nm CMOS process. We use Mentor Graphics ModelSim for architecture design, Synopsys Design Vision for synthesis and Cadence SoC Encounter for layout, in conjunction with TSMC 90 nm cells.

List of Symbols

AMD

Advanced Micro-Devices

ANSI

American National Standards Institute

ASIC

Application Specific Integrated Circuit

BSIM

Berkeley Short-channel IGFET Model

CAT

Category

CISC

Complex Instruction Set Computer

CRC

Cyclic Redundancy Check

DES

Data Encryption Standard

DRAM

Dynamic Random Access Memory

FCC

Federal Communications Commission

FIFO

First In First Out

FPGA

Field Programmable Gate Array

FSM

Finite State Machine

GSM

Global System for Mobile Communications


HDL

Hardware Description Language

IC

Integrated Circuit

IP

Internet Protocol

LAN

Local Area Network

LUT

Look Up Table

MAC

Media Access Control

NAK

Negative Acknowledgement

NPU

Network Processing Unit

NRZI

Non-Return to Zero Inverted

OSI

Open Systems Interconnection

RFID

Radio Frequency Identification

RISC

Reduced Instruction Set Computer

RJ-45

Registered Jack 45

RS232

Recommended Standard 232

RTL

Register Transfer Level

SIMD

Single Instruction, Multiple Data

SoC

System On Chip

SRAM

Static Random Access Memory

TCP

Transmission Control Protocol

TSMC

Taiwan Semiconductor Manufacturing Company

UART

Universal Asynchronous Receiver/Transmitter

UDP

User Datagram Protocol

USB

Universal Serial Bus

UTP

Unshielded Twisted Pair

VLIW

Very Long Instruction Word

XOR

Exclusive OR

Chapter 1 Introduction

1.1 Systems on a Chip

The last decade has seen an exponential growth in the complexity of digital circuit design. In the past, the amount of functionality that could be integrated on chip was limited by area and latency; today, with aggressive technology scaling to 45 nm nodes, we have smaller and faster transistors that allow roughly a million-fold increase in on-chip complexity. However, power dissipation remains the primary limiting factor in current digital circuit designs. Moreover, as technology scales into the deep sub-micron regime (beyond the 0.25 µm node), designers are encountering new technological problems which can no longer be ignored [1]. Leakage power has been increasing exponentially up to the 90 nm node, and the impact of process parameter variations on power and performance has also been growing with each technology generation. To keep Moore's law alive in the coming decade, system design must give due weightage to these issues. Systems-on-a-Chip (SoCs) give a powerful way of scaling the complexity of digital circuits by organizing the design into a tree-type hierarchy. Fig. 1.1(a) illustrates how complex digital circuits are being integrated onto a single die, with a

hierarchy consisting of various computational building logic blocks, memories and on-chip communication [2]. Unlike ASICs, which are hardwired for functionality, SoCs can offer a limited amount of flexibility through the inclusion of on-chip processors. However, the major challenge in SoC design is posed by the conflicting demands for performance, energy efficiency and flexibility. Fig. 1.1(b) compares the energy efficiency (the number of operations that can be performed for a given amount of energy) of various implementation styles against their flexibility. A staggering three orders of magnitude of variation in efficiency can be observed between the general purpose processor and the hardwired design styles [3]. Clearly, a hardwired implementation with limited flexibility is preferable when energy efficiency is important.

Figure 1.1: SoC Design. (a) Nexperia SoC with General Purpose Processor, Signal Processor, Memory, Co-Processor and input-output terminals. (b) Trading off flexibility versus energy efficiency.

1.2 Motivation and Approach

Figure 1.2: Power Consumption of various components in a sensor node.

Currently, RFID systems are being deployed in a wide variety of applications in logistics, transport, tracking, inventory and retail. Even with the advent of simpler RFID readers to bring down power and cost, the major bottleneck has been the necessity of a computer, network processor or FPGA hardwired to the reader to communicate with remote server machines, which limits how far RFID systems can scale. To eliminate this problem, a low power transceiver is necessary which can provide networking functions. Peer-to-peer sensor nodes form another class of application which needs low power communication modules to actively monitor parameters and communicate with a remote host. Though wireless realizations seem promising, the potential problems are high power, the cost of bandwidth and, importantly, the non-zero possibility that no peer node is within wireless range, resulting in loss of communication. For the above class of applications, an attractive alternative is to provide low cost wired communication modules without the overhead of network processors or FPGAs, which otherwise defeat the purpose of sensor nodes (with orders of magnitude higher power and cost). Fig. 1.2 shows a breakdown of the current consumption of a state-of-the-art sensor node when active [4]. The power consumption of the transmitter and receiver is at least 40 times higher than that of the other modules in the sensor design, showing the necessity of a low power network transceiver. The aim of this dissertation is to develop an ASIC/SoC to enable such network/mobile connectivity for RFID readers. A low-cost, low-power, low-area architecture is desired which, in the absence of a computer, can replicate a network processor's functionality to transmit data from the RFID unit to an email server. Such nodes will be scalable in power, area and cost, allowing ubiquitous deployment while shifting processing onto a central server. The architecture needs an RFID interface, a processor to execute user-defined functions on the input data, a security mechanism to encrypt user data and, most importantly, a network interface to the commonly used 100 Mbps Ethernet network. A few other options are possible to realize the complete architecture.
Figure 1.3: Layer wise Optimization Approach.

RISC cores, though they offer flexibility, are not viable because the latency becomes very

high with the non-custom hardware. Commercially available network processors like the Intel IXP 2850 and AMD Au1550, or even similar non-custom designs on FPGAs, are not scalable in power or cost (around Rs 5000/- with about 19 W of power) [5]. The traffic rates, requirements and latencies of an RFID reader differ from those of a LAN user on a 1 Gbps network. Also, functions like table look-up and hash computation, which are necessary in routers and switch fabrics and consume a large chunk of the power, are no longer needed. These requirements call for a custom-made low power SoC with maximized performance which can bring power and area down to the micro-regime. The challenge is to maintain robust functionality in doing so.

To scale down the power/performance ratio, it is important to have optimizations at various design hierarchies. There are three separate abstraction layers where power-performance-flexibility tradeoffs must be considered: architecture level, micro-architecture level and circuit level [1]. Constraints for each layer must be propagated to the other layers to ensure an optimal tradeoff between power, performance and area, as in Fig. 1.3. Chandrakasan and Brodersen showed that a design optimization spanning architecture, algorithm, micro-architecture, circuit and technology resulted in three orders of magnitude of power savings for their portable media terminal benchmark, at the cost of increased latency and area [6]. Their power optimizations included low threshold devices, architecture-driven voltage scaling, parallelism, gated clocks, power gating and accounting for switching activity in architecture design. Even though these techniques are now widely employed in optimizing designs for power and performance, they are still optimized disjointly. Architecture optimizations are rarely carried out in the context of technology or circuit constraints, as the teams that design architecture, circuits and technology are usually different. This leads to architectures that are not well suited to the underlying algorithm, circuit style or technology. So in this dissertation, while functionality remains the main goal, we follow this layered approach to bring power down to the ultra-low regime.

1.3 Dissertation Outline

The dissertation is organized as follows after the current introductory chapter:

• Chapter 2 introduces the SoC's system architecture and goes into the details of all the independent modules and their hardware realizations. Functionality is the main goal of the preliminary design described there, which outputs the network/GSM transmission bit stream.

• Chapter 3 deals with the macro- and micro-level architectural improvements implemented in the design to improve latency, throughput, area, reliability and energy efficiency. The motivation for each improvement, its implementation and the results are discussed.

• Chapter 4 describes the novel optimization framework we developed at the circuit level using Multi-Vdd and Multi-Vth assignments. Apart from the motivation for such an approach, a mathematical analysis which takes power, delay and process variations into consideration is also presented.

• Chapter 5 shows the simulation and synthesis results of the hardware and software, the power savings, and the back-annotation results of the architecture design.

• Chapter 6 concludes the work while pointing out the scope for future work and further improvements.

Chapter 2 System Architecture

The current chapter describes in detail the system architecture of the proposed SoC. Briefly, the system consists of an RFID interface unit, a GSM module, a DES security engine, a programmable RISC core, a network transceiver, a control unit, a queuing engine and a cache controller, as depicted in Fig. 2.1. The reduction of router functions in the network transceiver enables a pipelined architecture with more registers and wider data-paths for faster processing. The system is designed to be hardwired to maximize performance along the data flow, while an on-chip RISC core adds flexibility for user functions on the RFID data. Each section below describes the architecture used to realize the modules and the scheduling of data. Although it is difficult to differentiate between control and data paths in SoCs, the control path here is defined by the in/out signals of the control engine, while the data flow between the modules, i.e. RS232, GSM, DES, RISC and NPU, defines the datapath.

Figure 2.1: Block Modular diagram of overall system architecture.


2.1 Interface with RFID Reader

The architecture should include an interface with the RFID reader that polls continuously and accepts data from the reader, while also acting as a server to configure it. A robust communication protocol is therefore necessary to synchronize the incoming RFID data before passing it to the inner layers for processing and transmission. The protocol should be able to handle multiple RFID clients for future extension, which necessitates an addressing scheme too. USB appears to be the best protocol in the current scenario, but a complete USB protocol architecture would be expensive in both power and area. So a new protocol along the lines of USB is proposed, suitable for ultra-low power applications. The necessary protocol has been designed along with a few design changes that can be integrated into a custom-made RFID reader. It also includes a token-based system to handle multiple RFID readers.

The SLM 0151M is a low-cost, low-power RFID reader from StrongLink Inc. which communicates via the RS232 protocol and is suitable for the low power target of the current system realization. It transmits 4 bytes of data serially using the tx, rx and tagsta pins on board. These 4 bytes constitute the data stream which finally needs to be put on the 100BASE-TX bus by the network transceiver. The two RFID interface units are multiplexed by rfid_switch depending on the make of the RFID reader.

2.1.1 SLM 0151M RS232

The SLM reader interacts with the host via a set of commands exchanged as packets over the RS232 interface. The communication protocol is byte-oriented, and packets in both directions (host to reader and reader to host) have the following layout [7]:

Header | Len | Command | Status | Data | Checksum

Header: communication header, 1 byte. From host to module: 0xBA; from module to host: 0xBD.
Len: byte length counting from Command to Checksum inclusively, 1 byte.
Command: command code, 1 byte.
Status: exists only in reader-to-host transactions; gives the command status, 1 byte.
Data: variable length depending on the command type.
Checksum: XOR of all bytes from Header to Data inclusively, 1 byte.

The key command used in the current design is "Select Mifare Card" and its response from the RFID reader. The packet from the host to the reader has an empty data field and 0x01 in the command field. The returned frame from the reader also carries 0x01 in the command field, and one of the following bytes in the status field, depending on whether valid data could be placed in the data field:

Status:
0x00: operation success
0x01: no tag
0x0A: collision occurred
0xF0: checksum error

The data field consists of 4 bytes of data, programmed in the RFID reader middleware to carry the user's details and all other information needed by the current transaction at the web/email server (e.g., a gas station). One more byte of data indicates the make of the RFID card. These packet frames are sent serially over the interface using the tx and rx pins. Each byte is sent in little-endian bit order with 1 start bit and 1 stop bit, at a baud rate of 9600 bps. There are no flow-control or parity bits. Our hardware interface unit, depicted in Fig. 2.2, has been custom written for the above described

16

clk

tagsta

once_data_sent send_enable

40

! 40

i (counter)

tx

D Q

tx_data[0] 1

0 1

0

clk

tx_data[0:39]

tx_data[39]

tagsta

39

clk once_data_receive rx_enable

100 ! 100

j (counter) 100 bit shift register

1

Q D

rx

0 clk

rx_data[41:48],rx_data[51:58], rx_data[61:68],rx_data[71:78] RFID_Data[0:31] D Q

data_enable clk

Figure 2.2: RFID Interface Hardware.

transactions. The pin tagsta is pulled low by the reader when it senses a card in its EM proximity. When the host sees this active Low signal, it generates a reset signal which triggers sending of hardwired host command from 40 bit tx data buffer over the tx pin at the postive edges of send enable gated Clk. The receive interface consists of a 100 bit shift register to store the incoming bits (10 bytes with start and stop bits). The rx enable signal remains High as long the incoming data

17 on rx pin is being written into the shift-register. The design is synchronous, with Clk running at baud rate of 9600 bps and once data sent and once data receive signals are used to send and receive the command packets only once per active Low of tagsta which is sufficient to get the information of the RFID card. After buffering the data and removing the control bits, the data enable signal is pulled High and the 4 byte data is launched onto the RF ID data bus in big-endian format. tagsta, rx,tx are the IO pins of the module.
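As a software cross-check of the frame format, the host command and its XOR checksum can be built as below (a sketch; it assumes Len counts the Command and Checksum bytes inclusively, as defined above):

```python
def sl_frame(command: int, data: bytes = b"") -> bytes:
    """Build a host-to-SLM-reader frame: Header, Len, Command, Data, Checksum."""
    header = 0xBA                  # host-to-module header byte
    length = 1 + len(data) + 1     # Command .. Checksum, inclusive
    body = bytes([header, length, command]) + data
    checksum = 0
    for b in body:                 # XOR from Header to Data, inclusive
        checksum ^= b
    return body + bytes([checksum])

# "Select Mifare Card": empty data field, 0x01 in the command field.
frame = sl_frame(0x01)             # bytes BA 02 01 B9
```

Each of these bytes is then framed with one start and one stop bit and shifted out at 9600 bps by the transmit interface.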

2.1.2

GSM Module

The GSM module is used in the current design to send the received RFID data stream as an SMS to a user's mobile [8]. The transmission protocol is RS232, and our GSM interface module follows this sequence to send a message over the SMS device:

C: AT+CMGF=1
C: 0x0D (carriage return)
C: 0x0A (line feed)
C: AT+CMGS= 91(10 digit mobile number)
C: 0x0D (carriage return)
C: 0x0A (line feed)
C: 4 byte RFID Message
C: 0xA

Note that the protocol requires only the rx pin (of the rx, tx pair) to function for the transaction to complete. The transmission is little-endian, with 1 start and 1 stop bit after every byte. The required hardware, realized as in Fig. 2.3, is similar to the RFID transmit interface but uses the 380-bit GSM_tx_data buffer. Depending on the control signal ntwk_gsm, RFID_Data is read either by DES as Crypto_in[0:31] for secure transmission via Ethernet or by the GSM module to be sent as an SMS to the user by the above protocol.

Figure 2.3: GSM Module Architecture.
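The command sequence above can be serialized into the byte stream held in GSM_tx_data (a sketch; the mobile number and 4-byte message are hypothetical placeholders, and the trailing 0xA is taken verbatim from the listing above — commercial GSM modems commonly also expect the message to be terminated with Ctrl-Z, 0x1A):

```python
def gsm_sms_stream(number: str, message: bytes) -> bytes:
    """Concatenate the AT commands, CR/LF pairs and message bytes in order."""
    crlf = bytes([0x0D, 0x0A])               # carriage return, line feed
    stream = b"AT+CMGF=1" + crlf             # select SMS text mode
    stream += b"AT+CMGS=" + number.encode() + crlf
    stream += message + bytes([0x0A])        # 4-byte RFID message, terminator
    return stream

# Hypothetical number and payload, for illustration only.
stream = gsm_sms_stream("919876543210", b"\x12\x34\x56\x78")
```

Each byte of this stream is then sent little-endian with one start and one stop bit, as in the RFID transmit interface.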

2.2

Programmability with RISC Core

In the current SoC design, we have also included an embedded 8-bit processor modelled on the lines of a RISC instruction set [9]. This provides enterprise functionality in the future and avoids a fully hardwired design, while still targeting performance, power and cost. The RISC core can execute instructions stored in external memory and write the results back to a specified memory location for use by other units. The machine consists of 3 functional units: a processor, a controller and memory. Program instructions and data are stored in memory and are fetched synchronously, decoded and executed to 1) operate on data within the arithmetic and logic unit (ALU), 2) change the contents of storage registers, 3) change the contents of the program counter (PC) and address register (ADD_R), and 4) execute instructions and data write-back. PC holds the address of the instruction to be executed, whereas ADD_R holds the memory address that the next write or read will affect. A C-code assembler has been included in our design. The motivation behind a RISC instruction set is that complicated software programs can be broken down into load-store style operations executable by the RISC machine. So enterprise applications which manipulate the incoming 4-byte RFID data stream can be added in the future by programming the memory. The RISC architecture is shown in Fig. 2.4.

2.2.1

8 bit RISC Processor

The processor includes 8-bit registers, the datapath, control signals and an ALU that acts according to the opcode held in the instruction register. Multiplexer mux_1 determines the data scheduled onto bus_1, and mux_2 the data scheduled onto bus_2. The inputs to mux_1 come from the processor registers R0, R1, R2, R3 and PC; bus_1 goes to memory, to the ALU, or to bus_2 via mux_2. The inputs to mux_2 come from the ALU, mux_1 and memory. An instruction is thus fetched from memory, placed on bus_2 and passed to the instruction register. Data fetched from memory goes to a general-purpose register or to the operand register reg_y prior to an ALU operation; the result is later placed on bus_2 and transferred back to memory. reg_z holds a flag indicating that the ALU's result is 0.

Figure 2.4: RISC Architecture with Controller, Processor and Memory Bank.

2.2.2

ALU

The ALU is the module which performs arithmetic and logic operations on the operands. The load-store set of RISC requires the following opcodes to be executable by the ALU on datapaths data_1 and data_2:
1. ADD: data_1 + data_2
2. SUB: data_1 − data_2
3. AND: bitwise AND of data_1 and data_2
4. NOT: bitwise complement of data_1

2.2.3

Control Engine

The RISC core has its own control engine to improve performance. The control engine generates the timing signals and maintains the FSM states required for RISC operation. The following control signals are used in the synchronous design:
1. load_addreg: load the address register.
2. load_pc: load bus_2 into PC.
3. load_ir: load bus_2 into IR.
4. inc_pc: increment PC by 1.
5. bus1_sel: select among PC, R0, R1, R2, R3 for bus_1.
6. ri_load: load data into register Ri, where i ∈ {0, 1, 2, 3}.
7. y_load/z_load: load Reg_Y/Reg_Z.
8. write: load bus_1 data back into the SRAM.

2.2.4

RISC operation

The assembled code is stored as 8-bit words. There are 2 types of instructions in this RISC format: long and short. Short instructions have a 4-bit opcode, a 2-bit source register address and 2 bits for the destination. Long instructions require 2 bytes: the 1st word has the same format as a short instruction, and the 2nd word holds the address of the memory word required by the operation. The 1-word instructions are already listed in section 2.2.2. The 2-word instructions used here are:
1. RD: fetch the memory word addressed by the 2nd byte into the destination register.
2. WR: write the contents of the source register to the memory location addressed by the 2nd byte.
3. BR: branch by loading the value at the 2nd byte's address into the program counter.
4. BRZ: same as above, but only if the zero flag of the ALU is high.
At global reset, PC holds a value of 0 and execution starts from address location 0. The instruction at the address in PC is loaded into the instruction register IR, the opcode is recognized and the corresponding operation is scheduled. The RISC machine cycles through three FSM phases, i.e. fetch, decode and execute. 1-byte instructions take a single Clk cycle in which both fetch and decode are completed; 2-byte instructions take an extra Clk cycle to act on the content addressed by the 2nd byte. The FSM states the machine goes through to execute the stored program are:
1. FSM_idle: FSM state after reset.
2. FSM_fetch_a: address register is loaded with the content of PC.
3. FSM_fetch_b: instruction register is loaded with the word addressed by the address register, and PC is incremented by 1.
4. FSM_decode: instruction register is decoded and control signals are asserted, to be read at the next Clk edge.
5. FSM_execute_a: ALU operation performed for a 1-byte instruction; zero flag asserted if necessary.
6. FSM_read_a: load the address register with the second byte of RD; increment PC.
7. FSM_read_b: load the destination register with the value at the memory location from above.
8. FSM_write_a: load the address register with the second byte of WR; increment PC.
9. FSM_write_b: write the source register content to the memory location addressed above.
10. FSM_br_a: load the address register with the second byte of BR; increment PC.
11. FSM_br_b: load PC with the value at the memory location from above.
A top-level module with a reset signal instantiates the RISC processor, the engine control and a test memory holding the stored program.
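The short-instruction format (4-bit opcode, 2-bit source, 2-bit destination) can be sketched as a small assembler helper. The binary opcode numbering below is hypothetical, since the thesis does not list the opcode encodings:

```python
# Hypothetical opcode assignments for illustration only.
OPCODES = {"ADD": 0, "SUB": 1, "AND": 2, "NOT": 3,   # 1-byte instructions
           "RD": 4, "WR": 5, "BR": 6, "BRZ": 7}      # 2-byte instructions

def encode_short(op: str, src: int, dst: int) -> int:
    """Pack opcode into bits [7:4], source register into [3:2], destination into [1:0]."""
    assert 0 <= src <= 3 and 0 <= dst <= 3
    return (OPCODES[op] << 4) | (src << 2) | dst

def encode_long(op: str, src: int, dst: int, addr: int) -> bytes:
    """Long instructions append a second byte holding the memory address."""
    return bytes([encode_short(op, src, dst), addr & 0xFF])

# ADD R1 -> R2, then write R2 to memory address 0x20.
prog = bytes([encode_short("ADD", 1, 2)]) + encode_long("WR", 2, 0, 0x20)
```

A program assembled this way would be loaded into the test memory and stepped through the fetch/decode/execute FSM above.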

2.3

DES Security Core

The RFID data from the RFID reader frequently contains sensitive information, so it must be secured by encryption to prevent other nodes on the network, operating in Ethernet promiscuous mode, from snooping on it. At the same time, the state-of-the-art AES scheme is not necessary: it is computationally intensive and consumes hardware resources, i.e. area and power. The security unit also sits in the critical path of the system, so delay is an issue too. Under these requirements, the Data Encryption Standard (DES) has proved efficient and is used in the current IC design [10].


2.3.1

DES Algorithm

DES is a block cipher which takes 64 data bits and a 64-bit key, and outputs 64 bits of encrypted data after performing the operations described in the algorithm below. The 4 bytes of data from the RFID reader are therefore padded with zeroes to 64 bits. The bijection DES_k is constructed such that it is practically impossible to retrieve the plaintext from the ciphertext without knowledge of the key k; the bijection can be inverted on the receiver side using the same algorithm, an operation denoted DES_k^-1. Once key scheduling and plaintext preparation are complete, the actual encryption or decryption is performed by the main DES algorithm. The 64-bit block of input data is first split into two halves, L and R: L is the left-most 32 bits and R the right-most 32 bits. The following process is repeated 16 times, making up the 16 rounds of standard DES; we call the 16 sets of halves L[0]-L[15] and R[0]-R[15] [10].
1. R[I-1], where I is the round number starting at 1, is fed into the E-bit selection table, which acts like a permutation except that some bits are used more than once. This expands R[I-1] from 32 to 48 bits in preparation for the next step.
2. The 48-bit R[I-1] is XORed with K[I] and stored in a temporary buffer, so that R[I-1] itself is not modified.
3. The result of the previous step is split into 8 segments of 6 bits each; the left-most 6 bits are B[1] and the right-most 6 bits are B[8]. These blocks form the indices into the substitution boxes (S-boxes) used in the next step. The S-boxes are a set of 8 two-dimensional arrays, each with 4 rows and 16 columns; their entries are 4 bits long and the boxes are numbered S[1]-S[8].

Figure 2.5: Unrolled DES Architecture where each stage is depicted in Fig. 2.6.

4. Starting with B[1], the first and last bits of the 6-bit block are used as the row index into S[1] (range 0 to 3), and the middle four bits as the column index (range 0 to 15). The number at this position in the S-box is retrieved and stored. This is repeated with B[2] and S[2], B[3] and S[3], and so on up to B[8] and S[8]. The 8 4-bit numbers, strung together in the order of retrieval, give a 32-bit result.

5. The result of the previous stage is passed through the P permutation, which permutes the data array.
6. This number is XORed with L[I-1] and moved into R[I]; R[I-1] is moved into L[I].
7. At this point we have a new L[I] and R[I]. We increment I and repeat the core function until I = 17, i.e. until 16 rounds have been executed and keys K[1]-K[16] have all been used.
When L[16] and R[16] have been obtained, the two halves are swapped so that R[16] becomes the left-most 32 bits and L[16] the right-most 32 bits of the 64-bit pre-output. The final step applies the inverse initial permutation IP^-1 to the pre-output; the result is the completely encrypted ciphertext.
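The row/column indexing of step 4 can be sketched as follows; the S-box contents here are a hypothetical stand-in (entry = row*16 + column) rather than the real DES tables, to keep the example short:

```python
def sbox_lookup(six_bits: int, sbox) -> int:
    """Row = outer two bits (bits 5 and 0), column = middle four bits (bits 4..1)."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 0b1)
    col = (six_bits >> 1) & 0b1111
    return sbox[row][col]

# Hypothetical S-box whose entry is just row*16 + column, for illustration.
demo_sbox = [[r * 16 + c for c in range(16)] for r in range(4)]

# 0b101101: outer bits 1,1 -> row 3; middle bits 0110 -> column 6.
value = sbox_lookup(0b101101, demo_sbox)   # 3*16 + 6 = 54
```

In the hardware each such lookup is a 2D-array LUT addressed by the same 2-bit row and 4-bit column, as described in section 2.3.2.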

2.3.2

Architecture

A layered DES architecture with all 16 rounds unrolled is expensive and is not preferred on its own. Unrolling does, however, make it possible to insert pipeline registers and increase the throughput by 16 times, which matters because DES sits in the critical path. In the current design, security has been slightly compromised by truncating the number of rounds to 5; this makes deep pipelining of the stages practical, and the latency also goes down by a factor of 16/5. The architecture has been implemented using look-up tables (LUTs), since these are the fundamental blocks of an FPGA and this method improves throughput on Virtex FPGA boards. The idea is to insert pipeline registers at the end of every stage within a round, so that each stage computes its logic within one Clk cycle. Fig. 2.5 illustrates the hardware designed for the 5-round DES algorithm, while Fig. 2.6 elaborates on each stage of DES. The IPrep module takes inp_data and computes a permutation to produce the 64-bit

Figure 2.6: Hardware for each round of DES.

iprep. Following the algorithm, its right half R1 is used as input to the E-bit selection box eb, which computes the 48-bit eb1. Taking a simple key of 48 ones, the XOR function reduces to the negation of eb1. The complemented eb1 is broken into 8 6-bit vectors which are scheduled in parallel at the same clock edge into 8 S-boxes (S11 to S18 in Fig. 2.6), written as LUT-format assignments that compute 8 4-bit vectors depending on the address (a 2D-array LUT with a 2-bit row and 4-bit column index). The resulting 32-bit vector is passed through the permutation table P1, then XORed with L1 and moved to R2. At the same Clk, R1 is moved into L2, and the process repeats 4 more times. The resulting L6 and R6 are combined and passed into the InvPrep module, which outputs the 64-bit ciphertext. The overall latency of the design is 18 clock cycles, and the throughput of our design on FPGA is 25.6 Gbps, the highest known to us in the literature. Note also that in the DES algorithm the left half of the next round is the right half of the preceding one, so only half of the information needs to be stored. In this way we optimized for area, storing only 32 bits instead of 64. This makes the design less efficient in terms of speed, which has been compensated by pipelining. data_read_DES is asserted high when the operation is complete and the ciphertext is available as Crypto_data to the next stage, the Network Processing Unit.
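The round structure above (the next left half is the previous right half) is a Feistel network, which is invertible for any round function F. A behavioural sketch, with a hypothetical F standing in for the E-box/S-box/permutation pipeline and illustrative key values:

```python
def feistel_encrypt(left, right, round_keys, F):
    """L(i) = R(i-1); R(i) = L(i-1) XOR F(R(i-1), K(i)); final halves swapped."""
    for k in round_keys:
        left, right = right, left ^ F(right, k)
    return right, left                      # final swap, as in DES

def feistel_decrypt(left, right, round_keys, F):
    # The same network with the key schedule reversed inverts the cipher.
    return feistel_encrypt(left, right, list(reversed(round_keys)), F)

# Hypothetical 32-bit round function, for illustration only.
F = lambda half, key: ((half * 2654435761) ^ key) & 0xFFFFFFFF

keys = [0x1234, 0x5678, 0x9ABC, 0xDEF0, 0x0F0F]   # 5 rounds, as in this design
ct = feistel_encrypt(0xDEADBEEF, 0x01234567, keys, F)
pt = feistel_decrypt(*ct, keys, F)                # recovers the plaintext halves
```

This invertibility is what lets the receiver run the same algorithm with the reversed key schedule, regardless of how many rounds are retained.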

2.4

Network Processing Unit

The Network Processing Unit (NPU) is the custom network transceiver for 100 Mbps Ethernet, responsible for sending the cipher stream over the network to a remote server. A typical network processor like the Intel IXP2XXX [11] needs the flexibility to be used in desktops, routers, firewalls etc., requiring functions such as address look-up, header parsing, hash functions, table look-up, traffic policy shaping, IPv6 compatibility, half-duplex operation, support for bandwidths from 10 Mbps to tens of Gbps, queue managers and so on. In the current design, many of these functions are not needed for the goal at hand, which motivates a custom IC: these hardware-intensive operations can be pruned to save power, cost and area. The NPU has to take the input data stream through the stack of TCP/UDP, IP, Ethernet/MAC and link layers as in Fig. 2.7.

Figure 2.7: OSI Layer Hierarchy in NPU.

The final bit stream, put on the network via RJ-45 connectors (LAN cable), has to be compatible with the IEEE 802.3 Ethernet standards [12]. To reduce connection overhead, which is necessary for the scalability of our nodes, we have chosen the UDP stack running over IP at the cost of TCP reliability; a secure network with minimal packet losses (like IITK's) makes UDP sufficient. The choice of architecture is important for the NPU as it is the most computationally intensive block of all, with 100 Mbps throughput. Of the possible RISC, CISC, SIMD and systolic organizations, systolic arrays of processors have been found especially efficient for applications ranging from DSP to numerical problems such as matrix multiplication, solution of linear equations, image processing, computer graphics, cryptography and tomography [13]. A systolic array is a mesh arrangement of processing units called cells, as depicted in Fig. 2.8, and is a specialized form of parallel computing. Each cell computes its data independently and passes it to a neighbor, either synchronously or asynchronously. The significant advantages of systolic arrays are the short connections between cells and the resulting high throughput without loss of speed, which is the main goal of our design, and hence this architecture is implemented. A design novelty we have used here is to make the cells non-uniform, as shown in Fig. 2.8, each operating as a different layer of the network stack, while still retaining the advantages of a systolic processor.

Figure 2.8: Systolic Array Architecture of NPU with UDP, IP, Eth, Phy forming a PE.

The NPU is 4-way parallelized, with each unit, called a Packet Engine (PE), responsible for generating the bit stream of an individual packet. Chapter 3 describes the parallelization and pipelining process used to improve throughput and power. The NPU interacts with the Engine Controller (EC), the Queuing Engine and the Co-Processor, as described in the following sections, to achieve the desired function.

2.4.1

Packet Engine

The Packet Engine is a custom module that deals with one packet at a time and schedules it for transmission. The module is developed on the lines of the Intel IXP 2xxx network processors, which have 8 Microengines (MEs) that do the packet processing [11]. A Microengine is itself a RISC machine executing instructions at a clock frequency of about 600 MHz. To improve performance, we have instead implemented hardwired, FSM-based ASIC parallelization, achieving better performance at the cost of configurability.

Figure 2.9: UDP Packet Format.

Each PE consists of the systolic-architecture cell components, namely UDP, IP, Ethernet and Physical, which frame packets for their respective layers. The 4 UDP modules also share important data with each other, such as header information, and hence the mesh communication architecture with its low communication overhead improves performance.

UDP Layer

User Datagram Protocol (UDP) is a best-effort, connectionless protocol used to send byte streams over the network. Apart from the data fields, UDP needs the source and destination ports (ports 1024-65535) over which data is being transmitted. The UDP packet format is shown in Fig. 2.9 and the necessary hardware is realized as in Fig. 2.10. The use of the checksum is optional and it is not used currently (set to 16'b0).
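The UDP frame above can be sketched with Python's struct module; the source port is a hypothetical choice, while destination port 1500 and the zero checksum follow the design:

```python
import struct

def udp_packet(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Header: source port, destination port, length (header + data), checksum."""
    length = 8 + len(payload)          # UDP header is 8 bytes
    header = struct.pack(">HHHH", src_port, dst_port, length, 0)  # checksum unused
    return header + payload

pkt = udp_packet(1024, 1500, bytes(8))  # 8 bytes of zero-padded Crypto_data
```

The resulting 16 bytes are the 128-bit array the UDP hardware assembles over 2 Clk cycles.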

For architectural optimization reasons (described in 3.1), the UDP module has a datapath with a 64-bit input bus datain, an 80-bit output bus dataout, and control signals Clk and global_reset. A UDP packet under the above protocol is a 128-bit array which needs to be passed onto the IP stack, so the computation takes 2 Clk cycles. A new packet at the UDP layer generates a global_reset (from the Engine Control), on whose edge the 64-bit header field (destination port: 1500) is evaluated, ready at the next Clk edge to be written to dataout. At the next Clk edge, the next 16 bits from datain are copied onto dataout, and from this Clk cycle the data starts to be read by IP, since a valid 80-bit output is available. At the next Clk edge, the remaining 48 bits are written, and the Ethernet layer has appropriate logic to recognize that the remaining 16 bits are not valid. The packet is transmitted in big-endian format, top-left byte first.

IP Layer

IP is an important layer in the stack: it handles the UDP data, adds the appropriate IPv4 headers, and is often the granularity at which software handles network data. The IP packet format is shown in Fig. 2.11. The current Version field of IP is 4, hence IPv4. Header processing in software starts by looking at the version and then branches to process the rest of the packet. The next field, HLen, specifies the length of the header in 32-bit words.
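The IPv4 header can likewise be assembled and checksummed in software, as a sketch; the field values (Length 36, TTL 64, Protocol 17) follow the choices used in this design, while the addresses are hypothetical placeholders:

```python
import struct

def ip_checksum(header: bytes) -> int:
    """Ones'-complement sum of 16-bit words, then complemented."""
    s = 0
    for i in range(0, len(header), 2):
        s += (header[i] << 8) | header[i + 1]
        s = (s & 0xFFFF) + (s >> 16)       # fold the carry back in
    return ~s & 0xFFFF

# Version 4, HLen 5 (0x45), TOS 0, Length 36, ID 0, flags/frag 0,
# TTL 64, Protocol 17 (UDP), checksum placeholder 0.
src, dst = bytes([192, 168, 0, 2]), bytes([192, 168, 0, 1])   # hypothetical
hdr = struct.pack(">BBHHHBBH", 0x45, 0, 36, 0, 0, 64, 17, 0) + src + dst
hdr = hdr[:10] + struct.pack(">H", ip_checksum(hdr)) + hdr[12:]
```

A receiver verifies the header by checksumming all 20 bytes, checksum included; the result is zero for an intact header.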

Figure 2.10: UDP Layer Architecture.

When there are no options (the most common case), the header is 5 words (20 bytes) long. The 8-bit TOS (type of service) field allows packets to be treated differently based on application needs (set to 0 here). The next 16 bits of the header contain the Length of the datagram, including the header. Unlike the HLen field, the Length field counts bytes rather than words: 36 in the current case of 8 bytes of Crypto_data.

Figure 2.11: IP Packet format.

The second word of the header contains information about fragmentation. In the third word, the next byte is the TTL (time to live), decremented by 1 at every router hop to discard packets caught in infinite loops; the value 64 is the current default. The Protocol field is a key identifying the higher-level protocol to which this IP packet should be passed (17 for UDP, used here). The Checksum field, included for error detection at the IP layer, is calculated by treating the entire IP header as a sequence of 16-bit words, adding them with ones'-complement arithmetic and taking the ones' complement of the result. The last two required fields in the header are the 32-bit SourceAddr and DestinationAddr of the packet. Finally, a number of options may follow at the end of the header; these are not used in the current design. The hardware needed for IP is implemented as in Fig. 2.12. Again, for the same reason as in chapter 3.1, the IP module has a datapath with an 80-bit input bus dataout, an 80-bit output bus outpacket, control signals Clk and global_reset, and cache_data[0:199], of which [0:31] and [32:63] represent the source and destination IP addresses respectively (the remote node's IP address is hardwired). An IP packet under the above protocol is a 288-bit array which needs to be passed to the Ethernet layer, so the computation takes 4 Clk cycles. Every new

Figure 2.12: IP layer architecture.

packet to the PE generates the global_reset (from the Engine Control), on whose edge the first 80 bits of the 160-bit header field are evaluated, to be put on outpacket at the next Clk edge. At the next Clk edge, the remaining 80 bits of the header are computed. Also at this Clk edge, the prefetch unit schedules the source and destination IP addresses for the header fields from the SRAM cache memory, to avoid possible latency. At the next 2 Clk edges, the 80-bit data from dataout is copied onto outpacket. The data starts to be read by the Ethernet module from the global_reset edge itself, since a valid 80-bit output is available. Like UDP, the IP packet is also transmitted in big-endian format, top-left byte first.

Ethernet/MAC Layer

Figure 2.13: Ethernet frame format.

The Ethernet or MAC layer is the most intensive and critical block in the network stack, and remains an area of research to push speeds towards 100 Gbps. The most commonly used channel is 100 BASE-TX running at 100 Mbps, and higher-speed networks retain backward compatibility with 100 Mbps protocols. Modern channels such as 10 BASE-T, 100 BASE-TX, 1000 BASE-T and fiber-optic Ethernet provide separate channels for sending and receiving frames (full-duplex systems). Moreover, in a totally switched network, nodes communicate only with the switch and never directly with each other [14], so Ethernet stations can forgo the collision detection process and transmit at will, since they are the only devices that can access their medium. Hence, in our current 100 BASE-TX design we do not implement a collision detection mechanism. The IP data is encapsulated into the Ethernet frame, whose format is shown in Fig. 2.13, to comply with the IEEE 802.3 standards.

Note that the bit patterns in the preamble and start-of-frame delimiter are written as bit strings, with the first transmitted bit on the left. An important consideration in Ethernet is that all bytes are transmitted in big-endian order; however, each byte itself, except the FCS/CRC field, is transmitted little-endian, LSB first.
1. Preamble - seven bytes, all of the form "10101010", allowing the receiver's clock to be synchronized with the sender's.
2. Start Frame Delimiter - a single byte ("10101011") indicating the start of a frame.
3. Destination Address - the MAC address of the intended recipient of the frame. 802.3 addresses are globally unique, hardwired 48-bit addresses.
4. Source Address - the address of the source, in the same form as above.
5. Length - the length of the data in the data field of the frame.
6. Data - the IP data being sent by the frame. 802.3 frames must be at least 64 bytes long, so data shorter than 46 bytes must be padded out to 46 bytes. 802.3 also sets an upper limit of 1500 bytes on the data field, governed by propagation delay.
7. Cyclic Redundancy Check (CRC) / Frame Check Sequence (FCS) - a 4-byte stream computed over the headers and MAC data, used to detect and recover from errors at the destination. The algorithm and hardware are discussed in 3.1.5.

Figure 2.14: Ethernet Layer Hardware.

8. Inter Frame Gap (IFG) - 12 bytes of idle characters. For 100 Mb/s Fast Ethernet the IFG duration is 960 ns, which is automatically satisfied here by the low clock rate on the RFID interface.
The hardware needed for Ethernet is implemented as in Fig. 2.14. It has a datapath with an 80-bit input bus outpacket from IP, a 4-bit output bus nibble passed to the physical layer, control signals Clk and global_reset, flags idle_flag and done_flag, and cache_data[0:199], of which [64:111] and [112:159] represent the source and destination MAC addresses respectively (the remote node's MAC address is hardwired). An Ethernet packet under the above protocol is a 728-bit array stored in the internal buffer ethernetdata (with minimal padding) which needs to be passed to the physical layer, so the computation takes about 182 Clk cycles. Every new packet at the PE generates the global_reset (from the Engine Control), on whose edge the header bytes are evaluated, ready for ethernetdata at the next Clk edge. Also at this Clk edge, the prefetch unit schedules the source and destination MAC addresses for the header fields from the SRAM cache memory, to avoid possible latency. At the next 4 Clk edges, the 288-bit IP datagram is accepted. The CRC is computed over the MAC addresses, length and data fields. As in 3.1.5, the CRC is realized with a 32-bit shift register into which one message bit is pushed every Clk cycle; for 632 bits of message, the computation takes about 632 Clk cycles. The preceding fields are therefore scheduled onto the output bus so that the CRC starts to be transmitted just after its computation completes (just-in-time computation): the first 4-bit nibble of the header is written to nibble at Clk cycle 432 after the reset signal, and the entire ethernetdata is passed nibble-wise to the physical layer over the next 182 Clk cycles.
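The bit-serial CRC can be modelled in software as a shift-register update per message bit (a behavioural sketch of standard Ethernet CRC-32 using the reflected polynomial 0xEDB88320; the RTL described here operates on the same principle, one bit per Clk):

```python
def crc32_eth(data: bytes) -> int:
    """Bit-serial Ethernet CRC-32: all-ones init, reflected poly, final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte                    # bytes enter LSB-first (little-endian)
        for _ in range(8):             # one shift per message bit
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

fcs = crc32_eth(b"123456789")          # standard CRC-32 check input
```

The 632-cycle latency in the hardware corresponds directly to the inner one-shift-per-bit loop above.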
The bytes are transferred in big-endian order, but within each octet the transmission is little-endian; done_flag is asserted at the end. The idle_flag remains high except during these 182 Clk cycles. This enables the Physical

layer to send the keep-alive IDLE bits. After a frame has been sent, transmitters are required to transmit 12 octets of idle characters before transmitting the next frame; this takes 9600 ns at 10 Mbps, 960 ns at 100 Mbps and 96 ns at 1000 Mbps.

Physical Layer

The physical layer is the lowest in the hierarchy of the stack. After encapsulation at the UDP, IP and MAC layers, the physical layer transmits and receives the data on the physical network bus. We use 100 BASE-TX, the predominant form of Fast Ethernet, running at 100 Mbps on twisted-pair cable. It runs over two pairs of a category 5 Unshielded Twisted Pair (UTP) LAN cable, which consists of 4 pairs of wires of which only 2 pairs are used in any network implementation: wires 1 and 2 carry the transmit function and wires 3 and 6 the receive function. So, in our network transceiver design, we generate Tx+ and Tx- to be mapped to wires 1 and 2. The physical layer consists of the following 3 sublayers, shown in Fig. 2.15(b), which provide the set of functions in Fig. 2.15(a) to generate the final bit stream to be put on the network bus.
Physical Coding Sublayer (PCS) - This sublayer provides a uniform interface to the Ethernet layer for all physical media. Carrier-sense and collision-detect indications are generated by this sublayer. It also manages the auto-negotiation process by which the NIC (network interface) communicates with the network to determine the network speed (10 or 100 Mbps) and mode of operation (half-duplex or full-duplex). In the current design, we implement only the 4B/5B encoder under this sublayer, because the system is assumed to be full-duplex without collisions.
Physical Medium Attachment (PMA) - This sublayer provides a medium-independent means for the PCS to support various serial bit-oriented physical media. It serializes code groups for transmission and deserializes bits received from the medium into code groups.


(a) Physical Layer functions

(b) Physical Layer hierarchy of PCS, PMA and PMD

Figure 2.15: Physical Layer stack and functionality

Physical Medium Dependent (PMD) - This maps the physical medium to the PMA. It defines the physical layer signaling used for various media.

Figure 2.16: Physical Layer PHY of the Systolic Array.

The transmit subsection of the Physical layer accepts nibble-wide data on the nibble[0:3] lines from the MAC when idle_flag goes low. It reverses the bits to meet endian requirements and passes the data to the 4B/5B encoder, which compiles it into 5-bit-wide parallel symbols. These symbols are scrambled and serialized into a 125 Mbps bit stream, converted by the MLT encoder into an MLT-3 waveform, and transmitted onto the UTP wire using differential coding (DC) as the Tx+ and Tx- bit streams [15]. The overall physical architecture is depicted in Fig. 2.16, which has pipeline registers R1, R2, R3 to meet txclk timing requirements. The control inputs, data inputs and outputs are as shown, and the individual blocks are described in the following sections:

Table 2.1: 4B-5B Look-up Table

Symbol | 5B symbol | 4B nibble
0      | 11110     | 0000
1      | 01001     | 0001
2      | 10100     | 0010
3      | 10101     | 0011
4      | 01010     | 0100
5      | 01011     | 0101
6      | 01110     | 0110
7      | 01111     | 0111
8      | 10010     | 1000
9      | 10011     | 1001
A      | 10110     | 1010
B      | 10111     | 1011
C      | 11010     | 1100
D      | 11011     | 1101
E      | 11100     | 1110
F      | 11101     | 1111
I      | 11111     | Inter-Packet Idle Symbol (no 4B)
J      | 11000     | 1st Start-of-Packet Symbol (0101)
K      | 10001     | 2nd Start-of-Packet Symbol (0101)
T      | 01101     | 1st End-of-Packet Symbol
R      | 00111     | 2nd End-of-Packet Symbol and Flow Control

i) 4B/5B encoding

4B/5B encoding is the first process in the Physical layer, where every set of 4 bits (a nibble) from the Ethernet layer is replaced by a 5-bit code according to the look-up table of Table 2.1. This encoding is preferred over Manchester encoding to reduce extended runs of high or low signals, which could cause loss of clock recovery at the receiver, but it comes at the expense of a higher signaling rate of 125 MHz to maintain 100 Mbps. The 4B/5B encoder complies with the IEEE 802.3u 100BASE-TX standard. Note that the preamble bits at the beginning of a packet (0101) are encoded as the J and K symbols, other data is encoded according to the 4B/5B lookup table, and the TR code is appended after the end of the packet. This switching is done by the control signals SOP1, SOP2, EOP1, EOP2, which are inputs to the encoder. This module is also important as it sends the keep-alive IDLE stream of 11111 when idle_flag is high. A 4-bit nibble is taken from the MAC layer and the corresponding 5-bit symbol post_encode is sent to a 2-1 MUX, with idle_flag as the select signal, to output either the 5-bit symbol or 11111 as in Fig. 2.17.


Figure 2.17: 4B-5B encoder schematic.
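The encoder's behavior can be sketched as a small behavioral model (not RTL); the lookup values follow Table 2.1, while the function name and the way SOP/EOP are passed in are illustrative:

```python
# Behavioral model of the 4B/5B encoder of Fig. 2.17 (illustrative, not RTL).
CODE_4B5B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}
SPECIAL = {"I": "11111", "J": "11000", "K": "10001", "T": "01101", "R": "00111"}

def encode_nibble(nibble, idle_flag=False, special=None):
    """2-1 MUX behavior: IDLE wins when idle_flag is high; the SOP/EOP
    control signals select J/K/T/R; otherwise the data LUT is used."""
    if idle_flag:
        return SPECIAL["I"]          # keep-alive IDLE stream
    if special is not None:
        return SPECIAL[special]      # SOP1/SOP2 -> J/K, EOP1/EOP2 -> T/R
    return CODE_4B5B[nibble]

print(encode_nibble(0x5))                  # 01011
print(encode_nibble(0x5, idle_flag=True))  # 11111
```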

ii) Serialization and Scrambling

4B data is converted to 5B data at a clock speed of 25 MHz. This parallel data now has to be transmitted serially on the Tx line; as such, this and the following modules are clocked at txclk of 125 MHz to meet the 100 Mbps requirement. The 5B data is first serialized using the serial-out port of the register array. It is further scrambled (randomized) in 100BASE-TX in order to reduce electromagnetic emissions due to the high-frequency components that result from NRZI-style coding of consecutive 1s in IDLE codes. Scrambling spreads the energy across multiple channels, reducing the average power in any particular channel, which is necessary to meet FCC requirements. In addition, the bandwidth is utilized better, the resistance to fading in a channel increases, and decoding in the presence of Gaussian noise at the receiver is easier (lower error rate and lower power).

Figure 2.18: Serializer and Scrambler.

We implement the 11-bit stream-cipher scrambler adopted by the ANSI X3T9.5 committee for UTP operation. The cipher equation used is:

X[n] = (X[n - 11] + X[n - 9]) mod 2    (2.1)

In the design, this has been implemented using a Linear Feedback Shift Register (LFSR) as shown in Fig. 2.18. The shift register is initialized to any non-zero 11-bit seed key when scrambler_en is asserted. Then, at every tick of counter (at each edge of txclk), the 5-bit symbol post_encode (with idle) is serialized into 5 bits of serial_bit. Note that the 5-bit symbol is clocked by Clk and remains the same for all 5 txclk cycles. The LFSR gives an output key-stream bit s[0], which is XORed with serial_bit to give tnrzdout.

iii) MLT3 Coding

A 100 Mbps data stream that uses 4B/5B encoding results in a 125 Mbps signal. However in Fast Ethernet, compliant with IEEE 802.3 and FCC standards, this 125 Mbps signal is itself encoded as a multi-level transmission (MLT) signal using three signal levels: -1, 0, +1. This makes the waveform smoother and closer to a sine wave and reduces the transmission bandwidth required on the physical cable to only 31.25 MHz, which is within the specification of the CAT5e cable used in UTP [12]. It also reduces problems due to EM interference in other channels. The MLT-3 encoder receives the scrambled Non-Return-to-Zero (NRZ) data stream from the scrambler and encodes the stream into the three levels for presentation to the driver. When an input bit of 0 arrives at the encoder, the last output level is maintained (positive, negative or zero). When an input bit of 1 arrives, the output steps to the next level. The order of steps is -1 to 0 to +1 to 0, continuing periodically. tnrzdout from the scrambler is the input data stream, clocked into the MLT-3 block at every edge of txclk. A Mealy FSM implemented as in Fig. 2.19 is used in the MLT_top module of our design to achieve this functionality. A sub-module MLT takes the current state, which indicates the current signal value (00 for -1, 01 for 0, 10 for +1). flag is used to determine whether the next state has to be +1 or -1 from 0, depending on the previous state. At every txclk edge, next_state and next_flag are written into current_state and flag respectively, depending on tnrzdout, and the 2-bit encoded current_state is used as the output of the stage.
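A behavioral sketch of the eq. (2.1) scrambler (the LFSR orientation and seed value here are one valid choice, not a transcription of the RTL):

```python
# Behavioral model of the eq. (2.1) side-stream scrambler: the 11-bit LFSR
# runs independently of the data, and its key stream is XORed with serial_bit.
def scramble(bits, seed=0b00000000001):
    state = seed                                   # any non-zero 11-bit seed
    out = []
    for b in bits:
        key = ((state >> 10) ^ (state >> 8)) & 1   # X[n] = X[n-11] xor X[n-9]
        out.append(b ^ key)                        # tnrzdout = serial_bit xor key
        state = ((state << 1) | key) & 0x7FF       # shift the key bit into the LFSR
    return out
```

Because the key stream depends only on the seed and not on the data, applying the same scrambler twice with the same seed recovers the original stream, which is how a synchronized receiver descrambles:

```python
data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
assert scramble(scramble(data)) == data
```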

Figure 2.19: MLT3 Encoding.

iv) Differential Coding

The last step in the Physical layer involves differential coding, in which each signal level from MLT3 coding is converted into two different waveforms whose difference equals the original wave. So, a LUT-based design is implemented, which outputs the final Tx+ and Tx- bits of the design depending on the 2-bit MLT output current_state, as shown in Fig. 2.20. Tx+ is 0 for signal levels '00' and '01' and 1 for '10'. Similarly, Tx- is 1 for '00' and 0 for '10' and '01'.


Figure 2.20: DC Encoding.
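The MLT-3 state machine and the differential LUT of Fig. 2.20 can be sketched together as a behavioral model (the state encoding follows the text; the function name is illustrative):

```python
# Behavioral model of the MLT-3 encoder plus the differential (DC) stage.
# States: 00 = -1, 01 = 0, 10 = +1; 'flag' remembers the step direction from 0.
def mlt3_dc(bits):
    state, flag = 0b01, 0           # start at level 0, next step toward +1
    waveform = []
    for b in bits:
        if b:                        # a 1 steps to the next level: -1,0,+1,0,...
            if state == 0b01:        # at 0: go to +1 or -1 depending on flag
                state = 0b10 if flag == 0 else 0b00
            else:                    # at +1 or -1: return to 0, flip direction
                flag = 0 if state == 0b00 else 1
                state = 0b01
        # a 0 keeps the previous level
        tx_plus = 1 if state == 0b10 else 0    # DC LUT of Fig. 2.20
        tx_minus = 1 if state == 0b00 else 0
        waveform.append((tx_plus, tx_minus))
    return waveform

print(mlt3_dc([1, 1, 1, 1]))  # [(1, 0), (0, 0), (0, 1), (0, 0)]
```

Four consecutive 1s walk the output through +1, 0, -1, 0, confirming the periodic step order described above.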

v) Electromagnetics

The magnetics module, which is external to our design, converts Tx+ and Tx- to 2.0 V as required by the PMD specification. Magnetics modules available from several vendors can be interfaced onto the physical network, e.g. the LF8200A from Delta or the PE68515 from Pulse Engineering.

2.4.2 Array Lock hardware

There are a total of 4 PEs in the architecture, each individually computing packet data. However, all of them contend for the shared bus of Tx+ and Tx-. A lock variable is used to decide which PE may capture the line to send its packet. However, the use of a single lock variable is non-atomic and presents the following problem: consider a case when the lock is free. PE1 sees it free and branches to lock the variable. But before PE1 can write, another PE, PE2, may also see the lock variable free at the same Clk edge and branch to use the line, resulting in errors. To eliminate this problem, we implement an array-lock system, as used in multi-processors, to multiplex between the lines.

Figure 2.21: Array Lock Hardware.

In this scheme, instead of a single lock variable, there is an array of locks, each of which is controlled by the preceding PE [16]. In the design, this lock is of width p = 4. Acquiring the lock involves atomically setting the shared index variable lock to 1, setting the corresponding PE's location in the array to 1, and setting the others to 0. Releasing a lock involves resetting the corresponding array location and setting the next array location to 1. At every Clk edge, the lock unit checks whether any PE has finished framing a packet by polling on PE_busy[0:3] and sets the location of the next PE in lock_number free. The next PE that sees its bit in lock_number[0:3] free takes the lock and control of the shared bus Tx+ and Tx- to transmit its packet frames. The lock hardware is shown in Fig. 2.21.
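The acquire/release rotation can be sketched as a simple software model (the class and method names are illustrative; the hardware polls PE_busy rather than calling methods):

```python
# Behavioral sketch of the array lock: each PE waits on its own slot, and
# releasing the lock hands it round-robin to the next slot.
P = 4  # number of PEs

class ArrayLock:
    def __init__(self):
        self.slots = [1, 0, 0, 0]        # slot i free for PE i when slots[i] == 1

    def acquire(self, pe):
        return self.slots[pe] == 1       # a PE spins until its own slot is set

    def release(self, pe):
        self.slots[pe] = 0               # reset own location...
        self.slots[(pe + 1) % P] = 1     # ...and pass the lock to the next PE

lock = ArrayLock()
assert lock.acquire(0) and not lock.acquire(1)
lock.release(0)
assert lock.acquire(1)                   # lock passed to PE2's slot
```

Because each PE spins on a distinct array location, no two PEs can observe the same "free" variable on the same clock edge, which is exactly the race the single-variable lock suffers from.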

2.5 Queuing Engine

Figure 2.22: Queueing Engine Architecture.

In high-end architectures, efficient processing and data flow between modules is very important for high throughput. Equally important are the queue management at various levels in the architecture, buffering, and scheduling between the parallel engines. So, it is necessary to design a queuing engine specific to an architecture to obtain better throughput and speedup. The hardware is realized as shown in Fig. 2.22. The traffic from the RFID reader can be assumed to be considerably slow, as the baud rate is only 9.6 Kbps and the associated latency of the SLM UART is around 100 rxClk (9.6 kHz) cycles, or around 1 ms. On the other hand, the network processor latency has been brought down to about 180 Clk cycles (at 25 MHz), and hence the entire processor's critical-path delay is <= 0.1 ms. So, it is a fair assumption that no more than one packet has to be dropped in the worst case (considering the RISC enterprise program too). The register array reg buffers the incoming RFID data from the RFID reader. Also, the packet data in the traffic from the reader is UDP-like and is independent of other packets. For such an independent traffic stream with no reliability protocol and no Quality of Service (QoS) requirements, FIFO queuing with a drop-tail mechanism has been shown to have the best performance [17]. In a FIFO queue, the buffer size is an important parameter: too little space results in packet losses, while too much results in unnecessary latency. We have provided buffer space in reg for 8 bytes (capable of holding two 4-byte packets from the RFID reader). In the highly unlikely case that reg is full, the next arriving packet is dropped, and the next packet in the queue is scheduled to the next module, i.e. the DES engine. The queuing engine also allocates packets to the 4 PEs. In the ideal case where all PEs have equal latency and performance, a FIFO policy is adopted. The Crypto_data from DES is pushed into the 32-byte NPU_buffer, which holds 4 8-byte stream ciphers from DES for the 4 PEs. A suitable nibble (4-bit word) is written into the data_enable register array, which is read by the NPU, and the corresponding PE is activated by generation of a local PE signal global_reset, which asserts PE_busy[i], where i is the PE number (only when data_read_DES is asserted).
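The drop-tail behavior of the 8-byte reg buffer can be sketched as follows (the sizes mirror the text; the class and method names are illustrative):

```python
# Drop-tail FIFO sketch of the 8-byte reg buffer: an arrival is dropped only
# when the queue is already full; departures go to the DES engine.
from collections import deque

class DropTailFIFO:
    def __init__(self, capacity_bytes=8, packet_bytes=4):
        self.q = deque()
        self.max_packets = capacity_bytes // packet_bytes  # 2 packets

    def arrive(self, packet):
        if len(self.q) >= self.max_packets:
            return False                 # drop-tail: the newest packet is dropped
        self.q.append(packet)
        return True

    def schedule(self):
        return self.q.popleft() if self.q else None  # FIFO order to DES

fifo = DropTailFIFO()
assert fifo.arrive("pkt0") and fifo.arrive("pkt1")
assert not fifo.arrive("pkt2")           # buffer full: dropped
assert fifo.schedule() == "pkt0"         # oldest packet leaves first
```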

2.6 Control Path / Engine Controller

Every custom processor needs a control path (CP) to generate timing signals like counters, enables, mux selects, resets and data-valid flags, and, very importantly, to maintain FSM states. In an ASIC, it is this FSM that drives the IC, just as instruction words do for a general-purpose processor. It is customary to have a centralized engine controller which handles the above tasks by scheduling the timing signals to all the modules. However, for very long instruction word (VLIW) architectures, [18] proposes that a distributed control path results in improved performance. So, in the current architecture, each independent module like the RFID interface, RISC core and DES has been designed with its own control path, described in detail with each of those modules. A centralized Engine Controller module, however, sits highest in the control hierarchy, holding all the modules together by FSM and also signaling the most intensive block, the NPU. The Engine Controller includes logic for the control signals needed in the SoC, as shown in Table 2.2. Also, the internal modules which form the 4x4 NPU systolic array get the important control signals shown in Table 2.3.

2.7 Cache Controller and Prefetch Unit

Processors internally include an array of registers, called a register file, to reduce the latency of computations. In our design too, we have used a register-based data path to facilitate computations. The Network Processing Unit is computationally intensive and needs a large number of registers, but the spatial and temporal locality between packets is minimal, as they carry independent data. However, the header data used in the packets at the various hierarchies is common between packets. If this data were held in a register array, as in a register file, the leakage and area would increase tremendously (24 transistors per flop). So, here we have chosen to move the commonly used header data off the ASIC onto SRAM (6T) at the cost of a small latency. Hence, an SRAM controller is necessary to provide access to all the PEs without contention, to write data from boot memory, and also to reduce latency. The architecture is shown in Fig. 2.23. write_en goes high at the global reset (from Engine Control), along with the right value on Cache_in. Header data with the IP and MAC addresses is also launched onto Cache_data, and Cache_ready is asserted at global reset; these serve as input ports to the NPU. It is a 200-bit parallel load, and hence address generation is not necessary. Also, as the entire cache data is made available in parallel to the NPU, cache tags and indices are not used.

Table 2.2: Control Signals of SoC

Control Signal     | Function
Clk                | 25 MHz; runs the UDP, IP and MAC layers.
txClk              | 125 MHz; runs the Physical block.
RFID_data_enable   | Indicates that RFID data is ready and can be read by GSM or DES.
ntwk_switch        | Multiplexes RFID_data to either GSM or DES.
read_DES_data      | Indicates that DES data is ready and can be processed by the NPU.
reset              | Asynchronous reset from the control engine to the NPU to initialize registers and start computation.
NPU_busy           | Input from the NPU that all PEs are busy and cannot accept more data.
PE_busy[0:3]       | 4-bit nibble from the NPU indicating which PEs are free.
data_enable[0:3]   | Inputs to the PEs indicating that the RFID data is valid and framing has to be done.
global_reset[0:3]  | Individual reset signals for PE initialization.

Table 2.3: Systolic Array Controls

Module   | Signal                 | Function
UDP      | clk                    | 25 MHz.
UDP      | global_reset           | Reset at every new packet to initialize headers.
UDP      | counter                | Up-counter for the local FSM to determine data_out.
IP       | clk                    | 25 MHz.
IP       | global_reset           | Reset at every new packet to initialize headers.
IP       | counter                | Up-counter for the local FSM to determine out_packet.
Ethernet | clk                    | 25 MHz.
Ethernet | global_reset           | Reset at every new packet to initialize headers.
Ethernet | j                      | Up-counter for the local FSM to determine the nibble.
Ethernet | done_flag              | Signals the completion of Ethernet packet framing (trailing nibble of CRC).
Ethernet | idle_flag              | Signals that the line is idle and needs IDLE 11111 to be put (includes time to send CRC, EOF frames).
Physical | txclk                  | 125 MHz.
Physical | global_reset           | Reset at every new packet to initialize headers.
Physical | idle_flag              | Signals that the line is idle and needs IDLE 11111 to be put (includes time to send CRC, EOF frames).
Physical | scrambler_en           | Enable signal to the scrambler to reset the key stream to a non-zero value.
Physical | counter                | Up-counter to serialize the 5-bit post-encode data at each tick.
Physical | SOP1, SOP2, EOP1, EOP2 | Signals to indicate the start and end of Ethernet frames.
Physical | flag                   | Signal to the MLT encode module to decide whether to take +1 or -1 from 0, depending on the previous state.

Figure 2.23: Cache-controller and Prefetch unit hardware.

Prefetching is a technique used in microprocessors to reduce the latency due to stall time between a processor's request and the arrival of cache data. The idea is to request an instruction from main memory before it is actually needed; once the instruction comes back from memory, it is placed in a cache, and when it is actually needed it can be accessed much more quickly from the local lines. In the current implementation, the NPU would have to launch a request to the SRAM controller at the beginning of packet framing for the 32-bit SourceAddr and DestinationAddr and also the 48-bit MAC addresses. But by using prefetching as in Fig. 2.23, we have eliminated this latency by making the SRAM cache data available to the NPU as Cache_data[0:199] at the Clk edge immediately after global_reset. This is achieved by using the control signal counter from Engine Control, which sends the cache enable Fetch_Cache to the SRAM controller upon assertion of global_reset.

2.8 Software

The network transceiver designed here acts as the hardware equivalent of a UDP client, with additional security, programmability and interfaces with devices like the RFID reader and GSM module. To make these devices scalable, there should be a central UDP server which continuously polls for the packets being transferred. The received packets have to be decoded, and the 4-byte information from the SLM RFID reader has to be emailed to a mail server. To achieve this, we have implemented software to run a host as both a UDP server on port 1500 and an email client on port 25, using network classes in C. The UDP server opens an IP socket on port 1500, waits indefinitely for RFID_data, and branches to the email routine upon arrival of data. The common protocol used for email is the Simple Mail Transfer Protocol (SMTP) on port 25. Typically, email servers like that of IITK are made secure by the introduction of an additional layer which implements authentication in base-64 coding. The transactions between the remote host, the ASIC and the email server are described on the next page.
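The server side can be sketched as follows (in Python here rather than the C of the actual implementation; the host name, credentials and recipient are placeholders):

```python
# Minimal UDP server on port 1500 that forwards each 4-byte RFID tag by
# SMTP, mirroring the C software described above (names are placeholders).
import socket
import smtplib
from email.message import EmailMessage

def serve(smtp_host="smtp.example.com", rcpt="user@example.com"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 1500))                    # UDP server on port 1500
    while True:
        data, addr = sock.recvfrom(64)       # blocks until RFID_data arrives
        msg = EmailMessage()
        msg["From"], msg["To"], msg["Subject"] = rcpt, rcpt, "RFID tag"
        msg.set_content(data[:4].hex())      # 4-byte tag, e.g. "3af562ef"
        with smtplib.SMTP(smtp_host, 25) as s:   # SMTP client on port 25
            s.login("user", "password")      # AUTH LOGIN, base-64 under the hood
            s.send_message(msg)
```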

S: 220 smtp.cc.iitk.ac.in
C: HELO smtp.cc.iitk.ac.in
S: 250 smtp.cc.iitk.ac.in
C: AUTH LOGIN
S: 334 VXNlcm5hbWU6 (Username in base64 encoding)
C: a3Rlag==
S: 334 UGFzc3dvcmQ6 (Password)
C: dGVqYQ==
S: 235 Authentication Successful
C: mail from: [email protected]
S: 250 Ok
C: rcpt to: [email protected]
S: 250 Ok
C: data
S: 354 End data with <CR><LF>.<CR><LF>
C: 3a f5 62 ef
C: .
S: Ok: queued as A21B09B688

2.9 Overall Architecture and SoC Function

The independent modules described in the preceding sections are cascaded as shown in Fig. 2.1. The system operates using a handshake protocol between neighbouring blocks, as described below. As soon as an RFID card comes into EM proximity of the RFID reader, the tagsta pin is pulled low, upon which the interface unit requests the RFID tag data over the tx pin. The 4-byte RFID data to be communicated is sent serially by the reader over the rx pin. After buffering the data, data_enable is pulled high and the data is launched onto RFID_data. If ntwk_gsm_switch (pre-programmed) is low, the 32-bit RFID_data is launched onto data_in of DES and RFID_data_ready is pulled high. Note that this signal samples the data_enable signal from the RFID unit (which operates at a lower clock frequency) and remains high for only one NPU clock cycle per packet from the RFID reader. If ntwk_switch is high, the GSM interface unit's enable signal is pulled high and the data is copied into its buffer GSM_tx_data, which is then sent as an SMS to the user's mobile. The DES unit encrypts the data, pulls data_read_DES high for one Clk cycle, and launches the 64-bit Crypto_data. The NPU buffers the incoming packets into the 256-bit NPU_buffer depending on the NPU_busy signal, which is the bitwise AND of the PE_busy signals. The input buffer of the PE which is free and next in line is filled with Crypto_data, and a global_reset signal local to that PE is made high for one Clk cycle. This reset signal triggers the functioning of all the OSI-layer hardware units, and the 64-bit Crypto_data is gradually put into the IEEE 802.3 Ethernet frame format. This frame is passed serially to the Physical layer, which takes the Ethernet data 4 bits (one nibble) at a time and converts it to the Tx+ and Tx- bit streams that can be mapped to an RJ45 LAN cable. The UDP data sent is picked up by the UDP server running remotely on the pre-decided port (1500), which sends the 4-byte data as an email to the mail server, thus completing the functional flow. Note that the Engine Controller generates control signals, the Queuing Engine allots packets to PEs, the cache controller is used as an interface to the external SRAM memory, and the co-processor is used to speed up mathematical functions. The RISC core is external to the above architecture and constantly polls on a memory location and an enable command to execute user-defined instructions in assembly format. So, in future, by way of a switch, if RFID_data is written into that memory location and the enable command launched, it outputs the final processed data to be used by DES. Thus this chapter describes the architecture and functionality of the designed custom SoC, with details of each module.
The next chapters describe how the architecture has been further optimized at the macro-architecture, micro-architecture and circuit levels, for both performance and power, and the results obtained by doing so.

Chapter 3 Architectural Optimizations

Chapter 2 proposed the functional architecture of our low-power network transceiver. The design principle we adopt here is to meet all the delay and throughput requirements at the architecture level itself, and to optimize for power at the circuit level. This way, throughput becomes more robust to circuit, device and process variations when it is the governing factor. This chapter deals with the architectural novelties (at both macro and micro levels) that we implemented to increase performance and also to optimize for power.

3.1 Macro Optimization Framework

The choice of architecture changes the Energy-Delay (E-D) curve of a design, as shown in Fig. 3.1(b). The goal of any design is to reach such optimum E-D operating points for good performance. Along these lines, by using the architectural optimizations discussed in the following sections, the design is moved closer to the global optimum on the E-D curve.



(a) Effect of parallelism on throughput (b) E-D curve w.r.t Architecture choice and energy-efficiency

Figure 3.1: Macro Architectural Tradeoffs

3.1.1 Parallelization

Parallelism is most efficient when the target delay is lower than the minimum achievable delay of the underlying blocks. Parallelism of level P implicitly relaxes the delay of the underlying logic about P times (minus the flip-flop delay) [1]. The graph in Fig. 3.1(a) shows the energy-performance space for designs exhibiting parallelism from P = 2 to P = 5, together with the nominal or reference design, all normalized to a reference design point. The results are obtained from a joint supply-threshold-size optimization, with an external load of C/Cref = 32. The data shows that a parallel architecture provides an increase in performance at a very small marginal cost in energy. For instance, the parallel architecture P = 5 provides about a 3 times performance improvement over the nominal design, compared at unit energy (Fig. 3.1(a)). Parallelism is also an option for energy (not instantaneous power) reduction, which is a well-known result [19]. While parallelism is a very efficient technique for improving performance, the area of the parallel design is, to a first order, in linear proportion to the level of parallelism P. However, initial simulations of our design showed an area of 0.1 mm2 for the NPU, which is within the acceptable range, making parallelism feasible.

Figure 3.2: Improving the Transceiver Line efficiency by parallelization.

Fig. 3.2 shows that an individual PE has a latency of about 650 Clk cycles after the DES unit asserts data_read_DES and makes Crypto_data available. As observed in the previous section, the block with the most delay is the Ethernet module, due to the computation of the CRC over the data. The 4-bit nibble is passed to the Physical layer, which transmits the tx+ and tx- bit streams over the network. It can also be noted that the latency for this process is about 935 txclk cycles, or 187 Clk cycles. So, after RF_data is ready, there is a delay of 457 Clk cycles during which only the IDLE stream of 11111 is being sent over the transmission line, resulting in only 28% line efficiency. The power dissipation of a PE at a frequency of 25 MHz is only 440 uW and could be increased for better line efficiency. The total energy of computation across all packets actually remains constant, which makes the design even more attractive. So, to improve the line utilization to near 100%, we parallelize the design by providing 4 PEs, which operate in parallel on individual RFID reader packets and continuously contend for the Tx line. It is a fair assumption that within 187 Clk cycles (7.2 us), RF_data will not schedule a new packet (i.e. when all 4 PEs are busy), given the low baud rate on the input side, thus bringing down the packet drop rate from the RFID reader. So, P = 4 has been found optimum for our design, and Fig. 3.2 illustrates the line utilization of the PEs before and after parallelization.
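A back-of-the-envelope reading of the latencies above (a sketch; it treats the 650-cycle PE latency and 187-cycle line occupancy as the only parameters, ignoring scheduling overlap details):

```python
# Line utilization before and after parallelization: one PE occupies the
# Tx line ~187 of every ~650 Clk cycles per packet.
PE_LATENCY_CLK = 650   # DES-data-ready to end of frame for one PE
TX_BUSY_CLK = 187      # cycles the frame actually occupies the Tx line

util_1pe = TX_BUSY_CLK / PE_LATENCY_CLK
util_4pe = min(1.0, 4 * util_1pe)      # 4 staggered PEs keep the line busy
print(f"{util_1pe:.0%} -> {util_4pe:.0%}")
```

This gives roughly 29% for a single PE, in line with the ~28% quoted above, and saturates at 100% for P = 4.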

3.1.2 Data-path width


Figure 3.3: Data-path optimization of IP Module.

Data-path width is an important parameter which directly influences performance in digital circuit designs. General-purpose processors have standard path widths of 8, 16, 32 or 64 bits, in accordance with the DRAM address space [20, 21]. In ASIC designs, however, especially when the data path involves numerical and data-stream computations, the data-path width can be optimized by the designer according to design-specific needs. A wide data path implies a larger number of registers to hold the data across clock cycles, which results in higher throughput at the expense of larger leakage power. Even though the dynamic power increases too, the energy per computation remains constant. Too low a data-path width results in low efficiency, and hence data-path optimization is an important problem in ASIC designs. A variable data-path width, depending on the data requirement between modules, could also be useful. But it should be noted that this results in data slack at nodes in the flow graph and needs additional buffer registers to hold data arriving from the previous module at a faster rate than the rate at which it can be removed. So, considering the above issues, we propose the following optimization framework to choose the data-path width.

Consider the data path between the UDP and IP modules in the NPU as an example, as depicted in Fig. 3.3. Let x1 and x2 be the input and output bank widths respectively. The power (both leakage and dynamic) in the module due to data-path registers is therefore proportional to x1 + x2 (24 transistors per D flip-flop). Without loss of generality, assume x1 != x2. From Chapter 2, the IP header is about 160 bits, which starts to be scheduled to Ethernet at the positive edge of global_reset. As UDP also starts its operation at the same edge, IP has to buffer the UDP data for (160/x2) - 1 Clk cycles, and the buffer register penalty needed is proportional to x1(160/x2 - 1). So the total leakage penalty of the architecture can be stated as

P = x1(160/x2 - 1) + x1 + x2    (3.1)

We need to choose both x1 and x2 also considering the fact that any difference in the rate of flow has to be accounted for in the buffer space, i.e. the IP output should have no less than x2 bits of data available every Clk cycle after global_reset:

x1(160/x2 - 1) >= x2    (3.2)
x1(160/x2 - 1) + x1 >= 2x2    (3.3)
....    (3.4)

Putting dP/dx1 = dP/dx2 = 0, we obtain

160/x2 = 0    (3.5)
1 - 160 x1 / x2^2 = 0    (3.6)

Eq. 3.5 cannot be solved, because minimum power would require x2 to be infinite; in reality, the optimization is constrained by eqs. 3.2 to 3.4. Using them in eq. 3.6, we obtain x1 = 80 and x2 ≈ 113. Considering that x2 has to be a factor of 160, we take x2 = 80 as it gives the closest optimum. So, by choosing a data-path width of 80 at both input and output, we can expect a reduction in leakage while maintaining performance. A comparison of power dissipation between the unoptimized data-path width of 288 and the optimized width of 80 is shown in Fig. 3.4, with 39% dynamic and 28% leakage power savings. The additional latency is negligible.
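Eq. (3.6) can be rearranged as x2^2 = 160*x1; plugging in x1 = 80 reproduces the unconstrained optimum quoted above:

```python
import math

# Stationary point of eq. (3.6): 1 - 160*x1/x2^2 = 0  =>  x2 = sqrt(160*x1).
x1 = 80
x2 = math.sqrt(160 * x1)   # sqrt(12800)
print(round(x2))           # 113
```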

3.1.3 Split-Cache

Processors include on-board SRAM caches which hold frequently used data values. However, the most commonly used instruction words are also cached in a part of the memory which is especially attractive in a simple instruction set. In our design, one area of concern is the non-reliability transmission protocol of

65

(a) Dynamic Power

(b) Leakage Power

Figure 3.4: Power Savings in UDP and IP modules by datapath optimization algorithm the SLM015 RFID reader. There is no mechanism to send NAK to the reader or request for retransmission in the event of buffer overflow. However, if we increase the buffer space in the FIFO queue in our receiver, it would result in higher access latency for all transactions even when the flow is under the overflow limit. So, we use the idea of cache-sharing in our design to remove such overflow completely. Chapter 2 already shows how we have decided upon an optimum size of 8 bytes for the FIFO queue. We now configure a 16 byte address space of SRAM

66 to be used as extended FIFO queue space to buffer excess packets. The Queuing Engine (QE) asserts buf f er f ull once packets overflow and the Engine control launches write address, asserts write en and QE launches data onto Cache in as in Fig. 2.22. The performance improvement due to this scheme can be gauged as follows: In queuing theory, the input packet flow in best modeled as a Poisson process N (t) where N (t) = number of arrivals in the interval (0,t) which obeys the probability distribution:

P(N(t) = n) = ((λt)^n / n!) e^{−λt}    (3.7)

where λ is the expected number of occurrences during the given interval t. The packet loss probability L is therefore given by L = 1 − P(N(t) = 0) = 1 − e^{−λ}. By increasing the buffer space, we have reduced the average drop rate from λ1 = 1 to λ2 = 0.1, which translates to an improvement of

P = (e^{−λ2} − e^{−λ1}) / (1 − e^{−λ1}) = 84%    (3.8)
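The improvement above can be checked numerically. A quick sketch (the drop rates λ1 = 1 and λ2 = 0.1 are taken from the text; the unit observation interval is an assumption):

```python
import math

def loss_prob(lam):
    """Packet loss probability L = 1 - P(N(t) = 0) for Poisson arrivals
    over a unit interval, following eq. 3.7 with t = 1."""
    return 1.0 - math.exp(-lam)

lam1, lam2 = 1.0, 0.1   # average drop rates before/after the extra buffer space
improvement = (loss_prob(lam1) - loss_prob(lam2)) / loss_prob(lam1)
# equivalently (e^{-lam2} - e^{-lam1}) / (1 - e^{-lam1}), as in eq. 3.8
print(f"loss improvement = {improvement:.1%}")
```

The printed value is just under 85%, consistent with the 84% quoted in eq. 3.8 after rounding.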

3.1.4

Power Aware Scheduling

Muthyam et al. [22] were the first to emphasize the need to revisit task scheduling in multi-processors. They propose that if a processor has equal performance on all cores but unequal power, the operating system (OS) must schedule tasks with this knowledge. However, an implementation of this idea has not been presented to date. In our design, we have parallelized the NPU 4-way into PEs. Although at the architecture level they are assumed to be equal in terms of power and performance, the same is not quite true post-synthesis or on the final board. So, when tasks are being scheduled to the individual units, a

priority encoding scheme can be followed to allot higher preference to a PE with lower energy dissipation if more than one PE is available for task scheduling at a given time. After designing the architecture, we observed small differences in the power of the PEs as measured in Synopsys PrimePower:

PE:           PE1   PE2   PE3   PE4
Power (µW):   440   434   431   429

We implemented a priority encoding scheme in the Queuing Engine on this basis. For example, if PE_busy = 0011 then we assign to PE2 instead of PE1, and so on. We were able to get a maximum energy saving of 260 pJ per packet computation, which could potentially translate to higher savings on board. The idea could readily be incorporated into future multiprocessor designs.
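The priority encoding can be sketched behaviorally as follows (not the RTL; the bit-string convention, leftmost bit = PE1, is inferred from the PE_busy = 0011 example above):

```python
# Measured per-PE powers from Synopsys PrimePower (µW), PE1..PE4.
PE_POWER = (440, 434, 431, 429)

def pick_pe(pe_busy, power=PE_POWER):
    """Return the 1-based index of the lowest-power free PE, or None.

    pe_busy is a bit-string whose leftmost bit is PE1; '1' means busy.
    """
    free = [i for i, b in enumerate(pe_busy) if b == '0']
    if not free:
        return None
    return min(free, key=lambda i: power[i]) + 1

print(pick_pe("0011"))   # PE3, PE4 busy -> PE2 (434 uW) beats PE1 (440 uW)
```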

3.1.5

Accelerating Units

The NPU involves some computationally intensive tasks such as CRC-32, scrambling, hardware locking and checksum computation. The scrambling and locking architectures have already been described in Chapter 2. CRC and checksum, on the other hand, need custom accelerating units for computational efficiency, as described in the following sections [23].

Ethernet CRC-32

Data corruption is the principal problem associated with data transmission, so it is necessary to have a mechanism to detect and correct errors on the receiver side [24]. CRC-32 (Cyclic Redundancy Check) is one such scheme: it appends 32 bits to the Ethernet frame such that when the entire message (taken as a polynomial) is divided by the CRC polynomial, it leaves a remainder of 0. A

special polynomial C(x) of degree 32 is used in Ethernet to achieve this functionality, where

C(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1    (3.9)

The hardware to compute these 32 bits is realized as shown in Fig. 3.5: a 32-bit shift register with XOR gates only at those index locations which have a term in C(x). The 632-bit message stream is passed serially to the CRC unit and, after 632 cycles, a parallel read-out of the CRC-32 gives the output on the shift_reg[0:31] bus.

Figure 3.5: CRC shift register Hardware.
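A behavioral model of the shift register in Fig. 3.5 is sketched below. It performs the plain polynomial division with taps at the terms of C(x); the bit-reversal and complementing steps of the full IEEE 802.3 frame check sequence are deliberately omitted here.

```python
# Taps of C(x) from eq. 3.9, with the implicit x^32 term dropped: 0x04C11DB7.
CRC32_POLY = 0x04C11DB7

def crc32_serial(bits):
    """Bit-serial CRC: one shift per cycle, XOR taps where C(x) has a term.

    Computes M(x) * x^32 mod C(x) for message bits fed MSB-first, i.e. the
    remainder the hardware reads out in parallel on shift_reg[0:31].
    """
    reg = 0
    for b in bits:
        feedback = ((reg >> 31) & 1) ^ b    # serial input XOR shift_reg[31]
        reg = (reg << 1) & 0xFFFFFFFF       # shift the 32-bit register
        if feedback:
            reg ^= CRC32_POLY               # subtract C(x)
    return reg

msg = [1, 0, 1, 1, 0, 0, 1, 0]              # any serial message stream
crc = crc32_serial(msg)
crc_bits = [(crc >> i) & 1 for i in range(31, -1, -1)]
assert crc32_serial(msg + crc_bits) == 0    # message + CRC divides C(x) exactly
```

The final assertion is the receiver-side property the text describes: appending the computed 32 bits makes the whole stream a multiple of C(x), leaving remainder 0.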

IP Checksum

The IP layer includes a field named checksum which holds the one's complement of the 8-bit addition of its header. This ensures data integrity at this layer. The 8-bit adder hardware is shown in Fig. 3.6. Synopsys synthesizes the adder using a Carry-Look-Ahead (CLA) computation style to reduce latency.


Figure 3.6: IP checksum hardware.
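The 8-bit checksum of Fig. 3.6 can be modeled as below. It is a behavioral sketch; the end-around carry fold is an assumption about how the adder handles overflow, as the text does not state it.

```python
def checksum8(header_bytes):
    """One's-complement sum of the header bytes, then complemented,
    mirroring the adder + complement stages of Fig. 3.6."""
    s = 0
    for byte in header_bytes:
        s += byte
        s = (s & 0xFF) + (s >> 8)    # fold the carry back in (end-around carry)
    return (~s) & 0xFF

hdr = [0x45, 0x00, 0x3C, 0x1C]       # illustrative header bytes
cks = checksum8(hdr)

# Receiver-side check: folding header plus checksum must give all ones (0xFF).
total = sum(hdr) + cks
while total >> 8:
    total = (total & 0xFF) + (total >> 8)
assert total == 0xFF
```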

3.2

Micro Architectural Framework

Parallelism, choice of architecture, datapath and task scheduling are macro-architectural decisions which greatly affect system design at a global level. The micro-architectural framework, on the other hand, works at a finer granularity: choice of circuit-level topologies, pipelining, and placement of registers and combinatorial logic. The following sections describe this framework of techniques, equally important in bringing down the power/performance ratio.

3.2.1

Pipelining

Pipelining, by inserting registers in the critical path, is an indispensable tool in throughput-constrained designs. It is especially useful when the critical-path delay distribution is spread and the overall frequency suffers for want of a few paths [25]. The network interface of our design is strictly governed by the 100 Mbps throughput at the physical layer, which translates to Clk = 25 MHz and txClk = 125 MHz as seen in the previous sections. Also, as seen in Fig. 2.8, we


Figure 3.7: Modular Pipelining Approach in our design for High throughput.

have adopted a systolic array processing scheme with cells processing data packets independently for each of the OSI layers. Every module has been designed such that there is no excess combinatorial logic in its most critical path, as in Fig. 3.7. This has been made possible not only by providing input and output banks of registers, but also by having a synchronous flow where additional pipeline registers are placed within the modules so that only small computations (load/store and conditional load/store) are scheduled between registers. The result is a very good worst-case margin of +3.75 ns (for the block running at txClk, period 8 ns) across the entire architecture. The depth of pipelining we achieved is not generally possible in commercial network processors, as the latency increases tremendously. But in our design, due to the elimination of the table look-ups and hash computations required in routers and firewalls, the critical-path latency has been brought down to less than 0.1 ms and hence the architecture could be pipelined. Cascading of blocks has been achieved by optimally dividing the available delay between circuit blocks.


Figure 3.8: Latch Based Clock Gating.

3.2.2

Clock Gating

In the synchronous design style used for most HDL and synthesis-based designs, the system clock is connected to the clock pin of every flip-flop in the design. This results in three major components of power consumption: 1) power consumed by combinatorial logic whose values are changing on each clock edge; 2) power consumed by flip-flops (this has a non-zero value even if the inputs to the flip-flops, and therefore their internal state, are not changing); and 3) power consumed by the clock buffer tree in the design [26]. The system clock is as high as 125 MHz in the NPU design, which translates to high dynamic power dissipation. The first component, due to combinatorial logic, was by far the smallest contributor to the total power consumption, so only default gate-sizing options were used for combinatorial power reduction (in Synopsys Design Vision). RTL clock gating had the potential of reducing both the power consumed by flip-flops and the power consumed by the clock distribution network. It works by identifying groups of flip-flops which share a common enable term (a term which determines that a new value will be clocked into the flip-flops). RTL clock gating uses this enable term to control a clock gating circuit which turns off the clock ports of all flip-flops sharing it, so these flip-flops consume zero dynamic power as long as the enable term is false. By implementing latch-

based enable-style clock gating as shown in Fig. 3.8, and having such common enable terms (like global reset, scrambler_en, data_enable etc.) in our transceiver design, we were able to get large power savings in all the modules, more so in the NPU (41%) and RISC (90%). The Synopsys implementation is as follows:

    set_clock_gating_style -sequential_cell latch \
        -minimum_bitwidth 2 -num_stages 2 \
        -positive_edge_logic {integrated:CKLNQHVTD1} \
        -negative_edge_logic {integrated:CKLHQHVTD1} \
        -control_point before -control_signal scan_enable
    insert_clock_gating -global

Chapter 4

Circuit Level Techniques

4.1

Overview

In the previous chapter, we discussed a wide range of macro- and micro-architectural optimization techniques and improvements applicable to our custom design of the network transceiver. It has already been ensured that the required throughput of 100 Mbps is met with sufficient margin, at the expense of power. So, following the layered optimization of our design, low power is the key consideration at the circuit level. Low power is important in ICs which need scalability and longer battery life, especially when employed as independent systems. This chapter presents methods for efficient energy-performance optimization at the circuit level. Through a combination of low-power techniques, we have successfully brought down both dynamic and leakage power while giving weightage to delay too in our custom SoC. Optimizations are performed based on the BSIM alpha-power-law gate delay model and the switching and leakage components of energy. We also use a novel convex optimization algorithm which iteratively uses the ratio of dynamic power to leakage power in a power-constrained module for simultaneous Multi-Vdd and Multi-Vth


assignments. The same techniques can be generalized to other large digital circuit designs after architectural design. Using suitable operating conditions (Vdd and Vth), each module has been optimized for power without sacrificing performance. It is demonstrated that energy savings of about 57% can be achieved without much delay penalty by assigning supply and threshold voltages across the various modules, compared to the reference design with minimum delay.

4.2

Multi-Vdd and Multi-Vth

The ideas of Multi-Vdd and Multi-Vth evolved independently in efforts to achieve better tradeoffs between energy and performance. Their use is motivated by the observation that the path delay distribution of the entire design is spread almost uniformly and that a circuit's overall performance is often limited by a few critical paths [27]. So, low-Vth devices can be used to speed up such paths while the rest are kept under leakage suppression with high Vth. This optimization space can be widened further by combining multi-Vdd assignment: by running a non-critical cluster of gates (like the control path) at a low Vdd and a critical datapath cluster at a high Vdd, we can expect improvements in both power and performance at the same time. [28] concludes that by using no more than three discrete values of Vdd and Vth, we can move almost to the global optimum on the energy-performance curve. The current chapter deals with this very aspect of achieving minima for large designs in the presence of limited voltage values and process variations.

4.2.1

Multi-Vdd

Dynamic power dissipation in CMOS circuits is proportional to the square of the supply voltage Vdd, and leakage power is proportional to Vdd. Reducing the supply voltage is


Figure 4.1: Multi Vdd Circuit design.

straightforward but far from optimal. Since a reduction in Vdd degrades circuit performance, in order to maintain performance in dual-Vdd designs, cells along critical paths are assigned the higher power supply (VddH) while cells along non-critical paths are assigned the lower power supply (VddL). Thus the timing slack available on non-critical paths is efficiently converted into energy savings by the use of a second supply voltage, as demonstrated in Fig. 4.1. While conceptually simple, the implementation is quite challenging. Different power supplies need to be managed; level shifters (LS) are required between different supply domains, increasing die area and necessitating tradeoffs between power reduction and area increase. Level conversion from VddL to VddH is needed at domain boundaries (an 8T level-shifter cell) to eliminate undesirable static current and bit errors due to insufficient drive. Augsburger et al. [28] show the inefficiency of having more than two supplies, and hence in our design we limit ourselves to dual Vdd of 1 V and 0.7 V using the TSMC 90 nm standard cells (with operating conditions named NCCOM and NCCOM0D7 at 25°C).


4.2.2

Multi-Vth

The use of devices with multiple thresholds offers another way of trading off speed for power. Due to the exponential dependence of leakage current on Vth,

Figure 4.2: Multi Vth circuit design.

even a 100 mV difference causes about an order of magnitude difference in leakage current while affecting the active current by only 30%. So, by using low-Vth cells in timing-critical paths and high-Vth cells elsewhere as in Fig. 4.2, we can expect delay margins as well as leakage power reduction. One benefit of the approach is that Multi-Vth requires neither level converters nor clustering logic. While Multi-Vth primarily addresses the leakage current, it also yields a small reduction in active power due to a reduced gate-channel capacitance in the off-state and a small reduction in signal swing on the internal nodes of a gate (Vdd − VthH).

Such a reduction is partially offset by increased source and drain junction sidewall capacitances; the overall active power reduction turns out to be only about 4% [29]. In our design, we have used three different Vth values from the TSMC 90 nm cell library (High, Nominal, Low) for each Vdd of 1 V or 0.7 V.

4.3

Vdd - Vth Assignment Algorithm

The current section details our framework using the Vdd and Vth design techniques. While it is becoming established practice to design different blocks with different transistors, it is very challenging to do this at the level of individual transistors. So, the question to be addressed is the level of granularity at which we can assign multiple values. In our design, depending on functional closeness (and switching activity), we have clustered the design into independent modules. If a group of logic gates computes intensive functions, then there is a larger chance that the module contains more critical paths which have to meet the system clock requirements of the design (125 MHz/25 MHz) and can hence be clustered together for Vdd assignment.

Though the idea of combined optimization by simultaneous Vdd and Vth assignment is relatively new, there has been some literature in this direction. Srivatsava et al. [30] propose a genetic algorithm for the most optimal assignments, but for a complex design it is no longer possible to optimize these parameters at the transistor level, for reasons of complexity and the limited set of voltage values. [31] proposes an algorithm which optimizes power via Lagrange optimization for a fixed delay, but it does not seem suitable here because exact critical-path modeling is difficult, the level-shifter penalty is neglected, and process variations could affect such a tight design after fabrication. The above techniques are useful

when delay is the major constraint in a given circuit module. When power is the parameter of interest, with a need for delay margins, a metric which assigns suitable weights to both power and performance needs to be optimized w.r.t. the design parameters Vdd, Vth and W.

4.3.1

Choice of Design Metric

Several methodologies have been proposed in the literature to simultaneously meet the targets of low power and high performance in modern VLSI designs. Design metrics such as power per operation and energy per operation have been shown to be inadequate [32, 33] for evaluating tradeoffs of power and performance. The energy-delay product (EDP) is widely used as an appropriate metric to optimize and compare designs where both performance and energy are of importance. In [34], Penzes and Martin showed that the Pt^n family of metrics characterizes any feasible trade-off. Hofstee [35] concludes that the optimal metric is not unique for all designs but depends on the desired level of performance. Markovic et al. [36] analyze the ratio of the sensitivity of energy to the sensitivity of delay in order to achieve energy-performance optimization.

However, as these works primarily target design libraries at ≥ 100 nm, they do not comprehend the interdependence of thermal and power dissipation issues which become critical in sub-100 nm designs (90 nm here). The subthreshold leakage is exponentially dependent on temperature, and the dependence gets stronger with scaling. Also, an increase in total chip power causes higher junction temperatures (Tj), which further increase the subthreshold leakage power, creating a feedback loop of electrothermal coupling [37]. Hence, for nm-scale technologies, it is critical to consider the impact of thermal effects on design optimization and on the choice of design metrics.


Three general design metrics (EDP, PDP, and PEP) have been tested in [38] in search of such a robust metric. [38] also proposes a methodology that allows designers

Figure 4.3: Delay and Temperature variations for various metric based optimizations in sub-100 nm design.

to choose a design metric that directly satisfies their design-specific needs. EDP (∝ delay²) prioritizes delay over power, and its optimum has a higher supply voltage and a lower threshold voltage for higher performance. Since PEP prioritizes power over delay, the threshold voltage should increase to reduce the leakage power dissipation. Fig. 4.3 compares the delay, temperature, and power results of these metrics on a given design. PEP leads to the highest delay and lowest power; it also has the highest ratio of Pdyn to Pstatic, which gives the highest power efficiency of a design. [38] suggests optimizing a general metric PT^µ, where µ is the ratio of the exponent of delay to that of power. The µ of PEP (P²T) is 0.5 and that of EDP (PT²) is 2. When performance is the primary concern, µ ≥ 1; when power is the primary concern, µ ≤ 1.


4.3.2

Optimization Framework

We need an algorithm which assigns higher weightage to power, is robust to process variations, and assigns better margins by automatically identifying delay-constrained modules. A novel unified framework on these lines is developed here to calculate Vdd and Vth according to the designer's needs in complex designs. First, we break down the global optimization problem into multiple local optima, one for each module, since

P((Vth1, ..., Vthn), (Vdd1, ..., Vddn), (W1, ..., Wn)) ≤ P((Vth1, ..., Vth1), (Vdd1, ..., Vdd1))    (4.1)

where Vthj, Vddj, Wj are the optimum threshold voltage, supply voltage and width of the jth transistor in the design. So, the breakdown of the architecture into smaller modules is crucial if an operating point closer to the global optimum is to be achieved. In our design, we can assume that this has been done at the architectural level and that optimization w.r.t. W is done by the Synopsys tools. The new problem can thus be stated as: minimize P((Vthi, Vthi, ..., Vthi), (Vddi, Vddi, ..., Vddi)) ∀i, with i running across all modules. The power Pi of each module can be broken down into two main components, leakage and dynamic power. Leakage power of each transistor has three main constituents:

1. Sub-threshold leakage
2. Gate leakage
3. Short-circuit power

Of these, sub-threshold leakage dominates the rest and can be modeled as

Pleak = Vdd · Ileak    (4.2)

The sub-threshold leakage current of a MOS transistor according to the BSIM model [39] can be approximated as

Ileak = A · e^{(Vgs − Vth)/(n·VT)} · (1 − e^{−Vds/VT})    (4.3)

where

A = µ0 · Cox · (Weff/Leff) · VT² · e^{1.8}    (4.4)

VT is the thermal voltage, n (the sub-threshold slope factor) is around 1.33, Cox is the gate oxide capacitance per unit area and Vth is the threshold voltage. It can be seen that the leakage current is exponentially dependent on the threshold voltage. On the other hand, the dynamic power at each gate is

Pdyn = Cload · Vdd² · f    (4.5)

where Cload is the output node capacitance and f the frequency of operation. If there are a total of Ni1 gates in the ith module of the design, then the total power dissipation of that module can be written as

Pi = Ni1 · (Cload · Vdd−i² · f + K1 · Vdd−i · e^{−Vth−i/(n·VT)})    (4.6)

where K1 is a constant parameter. On the other hand, let us assume that there are Ni2 gates in the most critical path of the architecture. While many different delay models exist, [40] suggests the BSIM alpha-power-law model of [39] as a baseline for deriving the gate delay formula. At each gate, the delay due to charging can be modeled, for a choice of Vdd and Vth, as

td = 2·Cload·Vdd / (β·(Vdd − Vth)^α)    (4.7)

where α is around 1.3 for short-channel and 2 for long-channel devices. So, assuming that the load capacitances at the gates are equal to first order, the critical path delay Di of the ith module evaluates to

Di = Ni2·K2·Vdd−i / (Vdd−i − Vth−i)^α    (4.8)
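Eqs. 4.5 to 4.8 translate directly into code. In the sketch below the constants cload, f, k1 and k2 are placeholder values, not fitted to the TSMC 90 nm library:

```python
import math

N_VT = 1.33 * 0.026      # n * thermal voltage at room temperature (V)
ALPHA = 1.3              # alpha-power-law exponent, short-channel devices

def module_power(n_gates, vdd, vth, cload=2e-15, f=125e6, k1=1e-6):
    """Total power of a module per eq. 4.6: dynamic + sub-threshold leakage."""
    p_dyn = n_gates * cload * vdd ** 2 * f
    p_leak = n_gates * k1 * vdd * math.exp(-vth / N_VT)
    return p_dyn + p_leak

def critical_delay(n_crit, vdd, vth, k2=1e-10):
    """Critical-path delay of a module per eq. 4.8 (alpha-power law)."""
    return n_crit * k2 * vdd / (vdd - vth) ** ALPHA

# Raising Vth cuts leakage exponentially (eq. 4.3) but slows the path (eq. 4.7):
assert module_power(1000, 1.0, 0.55) < module_power(1000, 1.0, 0.35)
assert critical_delay(20, 1.0, 0.55) > critical_delay(20, 1.0, 0.35)
```

The two assertions capture the tension that makes the per-module (Vdd, Vth) choice an optimization problem in the first place.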


Figure 4.4: PT^µ metric surface plot with Vdd and Vth.

Clearly, the leakage power decreases exponentially with Vth, but the delay increases in doing so. So, the choice of Vth−i and Vdd−i for each module is a convex optimization problem. As seen in the previous section, PT^µ, with freedom in the choice of µ as in Fig. 4.4, is the best metric for our design requirements. Using eq. 4.6 and eq. 4.8, it reduces to

PT^µ = Ni1 · Ni2 · K2 · (Cload·Vdd−i²·f + K1·Vdd−i·e^{−Vth−i/(n·VT)}) · (Vdd−i / (Vdd−i − Vth−i)^α)^µ    (4.9)

To assign the most optimal value of Vth, we differentiate PT^µ partially w.r.t. Vth−i. Setting ∂(PT^µ)/∂Vth−i = 0 and solving, we obtain

−(K1·Vdd−i/(n·VT)) · e^{−Vth−i/(n·VT)} · (Vdd−i/(Vdd−i − Vth−i)^α)^µ + (Pi/Ni1) · (Vdd−i/(Vdd−i − Vth−i)^α)^µ · (αµ/(Vdd−i − Vth−i)) = 0    (4.10)

which reduces to

Vdd−i − Vth−i = n·VT·αµ·(1 + Ri)    (4.11)

where

Ri = (Cload·Vdd−i²·f) / (K1·Vdd−i·e^{−Vth−i/(n·VT)})    (4.12)

is the ratio of dynamic power to leakage power in an individual transistor for a choice of Vdd−i and Vth−i, and is an important parameter in circuit-power optimization. Similarly, partially differentiating PT^µ w.r.t. Vdd−i, we set

∂(PT^µ)/∂Vdd−i = 0    (4.13)

which gives

2·Cload·Vdd−i²·f + K1·Vdd−i·e^{−Vth−i/(n·VT)} = (Pi/Ni1) · (µα·Vdd−i/(Vdd−i − Vth−i) − µ)    (4.14)

which can be reduced to

2Ri + 1 = (Ri + 1) · (µα·Vdd−i/(Vdd−i − Vth−i) − µ)    (4.15)

or

Vdd−i / Vth−i = 1 / (1 − λ)    (4.16)

where

λ = µα·(Ri + 1) / (µ·(Ri + 1) + (2Ri + 1))    (4.17)

Using eqs. 4.11 and 4.16, we can solve recursively for the most optimal pair (Vdd−i, Vth−i) for each module i; note that Ri is itself a function of Vdd−i and Vth−i. The parameters needed for this calculation are Ri and µ. [38] states that a value of µ = 0.5 is preferred when the emphasis of the optimization is on power rather than delay. As a first iteration, we can choose Ri to be the ratio of dynamic to leakage power in the unoptimized case for the ith module. Ri is sensitive to the switching activity p0−1 at the gates, and hence a better estimate is

Ri−eff = (Cload·Vdd−i²·f·p0−1) / (K1·Vdd−i·e^{−Vth−i/(n·VT)})    (4.18)

In the current design, Ri−eff has been obtained by generating a .dump file from the NPU test bench and using it in Synopsys PrimePower to generate Ri−eff for each of the modules UDP, IP, Eth and Phy (i = 1 to 4). The values are tabulated below.

Table 4.1: Modules and their Power-Efficiencies

Module    Pdyn (µW)   Pleak (µW)   Power Efficiency (R)
UDP        21.05        2.52         8.353
IP         51.65       12.25         4.21
Eth       145.78       22.68         6.42
Phy        47.01        1.19        39.27
DES        37.5         4.8          7.81
RISC      102.02       17.14         5.95
RS232      31.05        3.12         9.95

After solving for the design values of each module, we calculate the penalty due to the level shifters needed between modules with different Vdd values. The port sizes of the modules are also considered, as each line on a bus needs level conversion and our design, as seen at the architectural level, has larger ports (48, 60 etc.) compared to standard 8-, 16- or 32-bit vectors. Algorithm 1 summarizes the methodology suggested in this section.

4.3.3

Process Variations and µ

While assigning the right weightage to power and performance is an important criterion in choosing the metric (or µ here), robustness to process variations also poses a major challenge in the design optimization of high-performance VLSI circuits, especially for sub-100 nm technologies. Variations arising from changes in temperature, voltages (Vdd and Vth), channel length (L), oxide thickness (tox) etc. can cause a spread in the distribution of voltage values and need to be addressed here. The following mathematical analysis systematically develops such a metric and then uses it to further optimize for performance and power. Following the analysis of the previous sections, we need a value of µ ≤ 1.


Algorithm 1 Calculate Vddopt and Vthopt across the design modules
Require: Minimize Pdyn−i + Pleak−i ∀i across K modules
Ensure: Sufficient delay margin for delay-critical modules
 1: for i = 0 to K do
 2:   Compute Ri = Pdyn/Pleak for the ith module
 3:   Solve for Vdd(i) and Vth(i) from eq. 4.11 and eq. 4.16 using Ri
 4:   Recompute Ri−new using Vdd(i) and Vth(i)
 5:   while Ri ≠ Ri−new do
 6:     Repeat steps 3 and 4 using Ri−new
 7:   end while
 8:   if Vdd(i) ≠ Vdd(i−1) then   // level shifter needed between modules
 9:     Pdyn−i + Pleak−i ⇐ Pdyn−i + Pleak−i + 8·ni·(DT + LT)   // 8 transistors per level shifter; DT, LT: dynamic and leakage power per transistor; ni: port size between the (i−1)th and ith modules
10:   end if
11: end for

Let us assume that PT^µ is the optimization metric we want to choose, and let Vddopt and Vthopt be the optimum Vdd and Vth for a given module obtained from the optimization framework. Due to process variations, let us assume a Gaussian spread for both Vdd and Vth around the desired Vddopt and Vthopt. An important consideration in the choice of µ is then to minimize the variation

D = (PT^µ)|(Vdd, Vth) − (PT^µ)|(Vddopt, Vthopt)

w.r.t. µ. The maximum possible variation in PT^µ is obtained when Vdd takes the value Vddopt + σ1 and Vth takes Vthopt − σ2, with σ1 and σ2 being the standard deviations of Vdd and Vth respectively; it is assumed that the maximum variation in either value does not exceed its standard deviation. Evaluating (PT^µ)|(Vddopt+σ1, Vthopt−σ2) and (PT^µ)|(Vddopt, Vthopt) yields A1·y^µ and A2·x^µ respectively, where

A1 = N1·N2·K2·[p0−1·Cload·(Vdd + σ1)²·f + K1·e^{−(Vth − σ2)/(n·Vt)}·(Vdd + σ1)]    (4.19)

y = (Vdd + σ1) / (Vdd − Vth + σ1 + σ2)^α    (4.20)

A2 = N1·N2·K2·[p0−1·Cload·Vdd²·f + K1·e^{−Vth/(n·Vt)}·Vdd]    (4.21)

x = Vdd / (Vdd − Vth)^α    (4.22)

So, putting ∂D/∂µ = 0 yields

∂/∂µ (A1·y^µ − A2·x^µ) = 0    (4.23)

µ = log((A2/A1) · (log x / log y)) / log(y/x)    (4.24)

Here y/x can be simplified from eq. 4.20 and eq. 4.22 to

y/x = (1 + σ1/Vdd) / (1 + (σ1 + σ2)/(Vdd − Vth))^α

and A2/A1 can be written as

A2/A1 = (1 + R) / [R·(1 + σ1/Vdd)² + e^{σ2/(n·Vt)}·(1 + σ1/Vdd)]

where R is the dynamic-to-leakage power ratio as in eq. 4.18.

As we have seen in the previous section, the value of R is typically around 6 in the current design of the modules. Taking a 10% variation in both Vdd and Vth around a base case of Vdd = 1 V and Vth = 0.5 V, and substituting into eq. 4.24, we obtain the most robust value µ ≈ 0.35. Graphically, this optimum µ = 0.35 in light of the variation of the metric can be seen in Fig. 4.5. So, theoretically

Figure 4.5: Variation of the PT^µ metric with µ.

PT^0.35 is the most robust metric w.r.t. process variations in the current design. PEP (PT^0.5) can be considered reasonably robust to the variations while also giving suitable weightage to performance and to power (both dynamic and leakage). However, in the current case the new metric PT^0.35 is more appropriate, both for its robustness and for the importance of power in this power-constrained design. The same analysis can be repeated for any given design, and the most appropriate metric chosen in light of the power efficiency R, the variations and other parameters.
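Eq. 4.24 can be evaluated numerically. The constants below (α = 2, n·Vt ≈ 0.035 V, R = 6, 10% spreads on Vdd = 1 V and Vth = 0.5 V) are assumptions for illustration, since the exact constants behind the µ ≈ 0.35 result are not stated; with these, the optimum lands in the same neighborhood:

```python
import math

def optimal_mu(vdd=1.0, vth=0.5, r=6.0, alpha=2.0, n_vt=1.33 * 0.026):
    """Solve eq. 4.24 for the variation-robust exponent mu, using the
    simplified y/x and A2/A1 ratios derived in the text."""
    s1, s2 = 0.1 * vdd, 0.1 * vth                       # 10% sigma on Vdd, Vth
    x = vdd / (vdd - vth) ** alpha                      # eq. 4.22
    y = (vdd + s1) / (vdd - vth + s1 + s2) ** alpha     # eq. 4.20
    a2_over_a1 = (1 + r) / (r * (1 + s1 / vdd) ** 2
                            + math.exp(s2 / n_vt) * (1 + s1 / vdd))
    return math.log(a2_over_a1 * math.log(x) / math.log(y)) / math.log(y / x)

print(round(optimal_mu(), 2))
```

The result is sensitive to α and n, so the sketch should be read as reproducing the shape of the analysis rather than the exact 0.35.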


Substituting µ = 0.35 into eq. 4.11 and eq. 4.16 and solving for Vdd and Vth for each of the modules, we get the values tabulated below. Note that the switching factor p0−1 is included in Pdyn.

Table 4.2: Power Efficiency and Optimum Supply, Threshold Voltage Values

Module    Power Efficiency (R)   Vdd−opt (mV)   Vth−opt (mV)
UDP         8.353                   740             590
IP          4.21                    400             320
Eth         6.42                    570             443
Phy        39.27                   3000            2400
DES         7.81                    724             512
RISC        5.95                    710             492
RS232       9.95                    791             611

4.3.4

Heuristic Voltage Clustering

The above formulation assumes that it is feasible to assign arbitrary values of Vdd and Vth to each module. In reality, technology limits the number of voltages in a design. Let us assume that we have n1 distinct supply voltages and n2 threshold voltages (here n1 = 2 and n2 = 3, i.e. Vdd−high, Vdd−low and Vth−high, Vth−nominal, Vth−low). It is therefore necessary to cluster the obtained values onto the available ones. We propose Algorithm 2 to achieve this goal.

Essentially, the algorithm assigns boundary values to both Vdd and Vth if they both exceed or fall below the maximum/minimum available values. Otherwise, if Vdd is within 10% of an available Vdd, it is assigned that value and Vth its nearest value. Else, it becomes impossible to move Vdd to its nearest value without disturbing eq. 4.11 and eq. 4.16, which form the core of the optimization analysis. But, as dynamic power is more important than leakage (note that Vdd affects dynamic power significantly and


Algorithm 2 Cluster Vdd(i) and Vth(i) into available values across all K modules
 1: for i = 0 to K do
 2:   if Vdd(i) ≥ Vdd−high && Vth(i) ≥ Vth−high then
 3:     Vdd(i) = Vdd−high && Vth(i) = Vth−high
 4:   else if Vdd(i) ≤ Vdd−low && Vth(i) ≤ Vth−low then
 5:     Vdd(i) = Vdd−low && Vth(i) = Vth−low
 6:   else if |Vdd−nearest − Vdd| / Vdd ≤ 10% then
 7:     Vdd ⇐ Vdd−nearest
 8:     Vth ⇐ Vth−nearest
 9:   else
10:     Vdd ⇐ Vdd−nearest
11:     Use eq. 4.16 with Vdd = Vdd−nearest to compute a new value Vth−new
12:     Vth ⇐ available value nearest to Vth−new
13:   end if
14: end for

also leakage power to an extent), we take eq. 4.11, which gives the relation to be satisfied for the optimal Vdd, as the governing relation and compute Vth accordingly (leakage power is then not fully optimized). The following table shows the final Vdd and Vth values assigned using the clustering technique:

Table 4.3: Clustered Voltage Values

Module    Vdd (mV)   Vdd−cluster (mV)   Vth (mV)   Vth−cluster
UDP         740        700 (Low)          590        Nominal
IP          400        700 (Low)          300        High
Eth         570        700 (Low)          443        Nominal
Phy        3000       1000 (High)        2400        High
DES         724        700 (Low)          512        High
RISC        710        700 (Low)          492        High
RS232       791        700 (Low)          611        High
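Algorithm 2 can be sketched as below. The available rails and thresholds are hypothetical placeholders: the design's actual rails are 0.7 V and 1.0 V, but the numeric values of the library's High/Nominal/Low Vth are not listed in the text.

```python
def cluster(vdd, vth, rails=(0.70, 1.00), vths=(0.35, 0.45, 0.55), tol=0.10):
    """Snap an optimum (vdd, vth) pair onto available voltage values,
    following the case analysis of Algorithm 2."""
    nearest = lambda val, avail: min(avail, key=lambda a: abs(a - val))
    if vdd >= max(rails) and vth >= max(vths):       # both above the range
        return max(rails), max(vths)
    if vdd <= min(rails) and vth <= min(vths):       # both below the range
        return min(rails), min(vths)
    vdd_n = nearest(vdd, rails)
    if abs(vdd_n - vdd) / vdd <= tol:                # Vdd close enough: snap both
        return vdd_n, nearest(vth, vths)
    # Otherwise keep eq. 4.16 consistent: recompute Vth for the snapped Vdd.
    lam = 1.0 - vth / vdd                            # from eq. 4.16
    return vdd_n, nearest(vdd_n * (1.0 - lam), vths)

print(cluster(0.740, 0.590))   # UDP row of Table 4.2 snaps to the low rail
```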

4.4

Results and Analysis

We have used the TSMC 90 nm libraries in Synopsys to compute the power savings due to the algorithm. In the NPU engine alone, we achieved savings of about 56% in dynamic and 57% in leakage power compared to the base case of Vdd = 1 V and Vth = 1 V, as in Fig. 4.6. Power savings for the other modules have also been calculated. The algorithm correctly recognizes the power efficiencies of the modules and the scope for changes in Vdd and Vth in each of them to bring power down considerably. An even more significant result is that the Physical Layer module, which is the most delay-constrained (running at txClk = 125 MHz), has been assigned a high Vdd, which translates to a better delay margin of about 5%. Also, the critical path distribution in Fig. 4.8 shows that after the optimization the spread of the critical-path histogram is lowered from 4 ns to 2.75 ns (more paths are closer to critical; the ideal scenario being all paths in the circuit critical). One

more important observation is that the Ethernet block has been assigned a low Vdd and nominal Vth, contrary to heuristics (a high Vdd and high Vth to improve performance and power in the largest block). This is due to the level-shifter penalty from the large port sizes on either side had Ethernet been assigned a different Vdd; the performance degradation has been countered by using a lower Vth. The RISC core also shows significant power savings of 95% dynamic and 92% leakage through the use of all the low-power techniques, as shown in Fig. 4.7. Thus, we have used a robust optimization framework at the circuit level to improve battery life and performance. When used together with the architectural improvements, it becomes possible to achieve an energy-efficient design of the network transceiver that can be used in a variety of applications.


(a) Dynamic power reduction, first by multi-Vdd, then multi-Vth

(b) Leakage power reduction, first by multi-Vdd, then multi-Vth

Figure 4.6: Power savings in NPU by using the algorithm


(a) Dynamic power saving (in µW)

(b) Leakage power saving (in µW)

Figure 4.7: Power savings in RISC Core by using the algorithm and clock gating


(a) Timing slack histogram (number of paths vs. slack) of NPU before Multi-Vdd and Multi-Vth

(b) Timing slack histogram (number of paths vs. slack) of NPU after Multi-Vdd and Multi-Vth

Figure 4.8: Reduction of spread of timing slack histogram by Multi-Vdd and Multi-Vth

Chapter 5

Simulation and Synthesis

The current chapter deals with the behavioral verification of the designed architecture using simulations. The throughput requirements are also verified in ModelSim 6.3f. Testbenches are written for each of the sub-modules and for the overall architecture. The results are verified using a Xilinx FPGA core and also by simulating the remote node software, sending the hardware-generated data over IP sockets and checking the reception of the email. The functionally correct design is then synthesized using Synopsys Design Vision to measure dynamic and leakage power and switching activities, to check for timing violations (setup and hold), and to modify non-synthesizable statements which could potentially cause race conditions.

5.1 SoC Design Flow and Tools Used

Fig. 5.1 illustrates the complete SoC design flow as well as the FPGA mapping of the designed architecture and the tools used in the process. Briefly, the following are the tools and technologies used in the current design:

[Flow-diagram contents: the ASIC/SoC path runs from Modelsim 6.3f (design and architectural improvements) through Synopsys Design Vision synthesis with TSMC .db libraries (carrying Vdd/Vth information) and .sdc timing constraints, to PrimePower for power analysis (using a switching-activity dump and back-annotated .v/.sdf files) and Cadence SoC Encounter for place and route (using .lef physical libraries and .captable RC data) to produce the final .def layout. The FPGA path runs through Xilinx inbuilt synthesis (.xnf), translate (.ngd), map and place-and-route (.ncd) to bit generation (.bit) for the Virtex-5 FPGA.]

Figure 5.1: Complete SoC Design Flow, Xilinx Mapping and Tools Used.

Tools:

1. Mentor Graphics Modelsim 6.3f for HDL architecture design and its improvements, and for back-annotation after synthesis.
2. Synopsys Design Vision for synthesis of design files to generate the netlist.
3. Synopsys PrimePower for dynamic and leakage power analysis.

4. Cadence SoC Encounter for layout of the netlist obtained from Synopsys.
5. Xilinx ISE 9.2i to map the architecture onto a Xilinx Virtex-5 FPGA.

Technologies: TSMC 90 nm library files:

1. tcbn90ghptc.db for Vdd = 1 V, Vth = nominal
2. tcbn90ghptc0d7.db for Vdd = 0.7 V, Vth = nominal
3. tcbn90ghphvttc.db for Vdd = 1 V, Vth = high
4. tcbn90ghphvttc0d7.db for Vdd = 0.7 V, Vth = high
5. tcbn90ghplvttc.db for Vdd = 1 V, Vth = low
6. tcbn90ghplvttc0d7.db for Vdd = 0.7 V, Vth = low
7. tcbn90ghp 6lmT1.lef as the physical library for layout

After the design phase in Modelsim, the design is synthesized in Synopsys, which generates the design netlist using the TSMC 90 nm library files. This netlist is taken to Cadence SoC Encounter to complete the layout and finally produce the .def file (the output file after layout). Back-annotation of the post-synthesis netlist is done in Modelsim using the library files of the TSMC standard cells. This completes the ASIC/SoC design flow. In parallel, the architecture is synthesized, translated, mapped, placed and routed using the Xilinx ISE 9.2i tool to finally generate the bit file to program the FPGA.

5.2 Simulations

Using MATLAB Simulink for system design is gaining ground in complex DSP system design. However, with the absence of special modules for networking applications, the modules need to be custom written to follow the right data path and control path flows. Hence, the design described in Chapter 2, along with the architectural improvements from Chapter 3, has been written in Verilog. Each module in the tree of the hierarchy is then tested by writing individual test benches and generating possible test cases. In the systolic array architecture discussed for the NPU, each PE is tested by giving test input crypto data and control signals. The waveform in Fig. 5.5 depicts the output waveforms of the Physical layer of the SoC, i.e. Tx+ and Tx−, which are the transmission bit streams to be put on the network line via the RJ-45 connector. The functional correctness of these is verified at various hierarchies.

The design was mapped onto a Xilinx Virtex-5 FPGA, and the process of sending an email at the application level has been covered in [5] using the Microblaze processor, which uses an SMSC Ethernet connector. However, as the Physical layer signals are internal to the processor, that process only verifies the application layer, and the final design could still fail due to errors in the lower levels of the hierarchy. To test the UDP and IP layers, a UDP client is written in software using IP sockets. The waveform data generated in Fig. 5.5 is sent to a UDP remote server listening on port 1500, which then forwards it to the IITK SMTP server as in Fig. 5.2. The email was successfully received, showing the correctness at these layers.

Figure 5.2: Software running on remote node which receives RFID data and forwards to email-server.

To functionally verify the Ethernet layer, the Xilinx LogiCORE Tri-Mode Ethernet MAC v3.3 core shown in Fig. 5.3 is instantiated (with our requirements) in the Coregen tool. The Ethernet MAC core takes the Ethernet frame data generated by the current design as input, verifies it for functional correctness and transmits it using the Physical layer. A test bench design which uses two such cores, one as transmitter and the other as receiver, is used.
So, the same frame data transmitted is received back and decoded, with all bad frames being dropped in the process. We were able to obtain all the transmitted Ethernet frames using the data generated by our design, as in Fig. 5.4(a), thus verifying the Ethernet layer. In Fig. 5.4(b), the data has been intentionally corrupted, resulting in the frame being dropped.

(a) Ethernet Tri MAC Core for Ethernet verification

(b) Generation of Core

Figure 5.3: Ethernet TriMAC Core in Xilinx Coregen

To test the Physical layer, and with it the entire hierarchy, a MATLAB system-level model is designed to compare the bit streams Tx+ and Tx−, which are the final outputs, and they are matched. Note that an idle stream is sent over the line when there is no packet data. The simulation results of the DES engine, including the outputs of the 5 internal pipeline stages, are shown in Fig. 5.6. Similarly, the RS232 interface and the GSM module have also been tested using the simulations shown in Fig. 5.7 and Fig. 5.8 respectively. In RS232, it can be seen that the design generates the right transmission bit stream for the command on tx, after which it listens for RFID_data[0:31] on the rx bit stream and passes the data to DES/GSM by asserting data_ready. Similarly, the GSM module sends the right commands, discussed in Chapter 2, on the tx bit stream. Note that the rx pin is not active. Fig. 5.9 shows all the blocks cascaded together in the Embedded IC and how a real-time 4-byte RFID data word is passed on to the DES and then to the NPU block, finally to be converted into the Tx+ and Tx− streams.
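The software side of the UDP test described in this section can be sketched in Python: a client pushes the hardware-generated data to the remote node, which receives it and (in the real test) relays it to the SMTP server. The function names here are illustrative; port 1500 is the one named in the text.

```python
import socket

UDP_PORT = 1500  # port the remote node listens on, per the test described above

def send_rfid_payload(payload: bytes, host: str = "127.0.0.1",
                      port: int = UDP_PORT) -> int:
    """UDP client: push hardware-generated RFID data to the remote node."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(payload, (host, port))

def remote_node_once(sock: socket.socket) -> bytes:
    """Remote node: receive one datagram. The real node then forwards the
    data as an email (e.g. via smtplib to the SMTP server); omitted here."""
    data, _addr = sock.recvfrom(2048)
    return data
```

The actual relay to the IITK SMTP server is only indicated in a comment; this sketch shows the IP-socket plumbing being exercised by the test.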

5.3 Custom Design and Synthesis

5.3.1 Synthesis

The modules verified for functionality are synthesized using custom cells of TSMC 90 nm technology in the Synopsys tool. The modules are checked for synthesizability and suitable modifications are made to the Verilog code. The synthesized modules are shown in the Appendix. The key issues in our design are timing constraints (given a txclk of 125 MHz) and low power (given the ultra-low power target). The pipelining approach ensured that the worst timing slack in the entire design is +3.75 ns (in the Phy block, against the 8 ns txclk target), which ensures that the 100 Mbps throughput is met with sufficient margin. The base case of synthesis has been done at Vdd = 1 V and Vth = nominal. The Multi-Vdd and Multi-Vth algorithm described in Chapter 4 has been shown

Table 5.1: Dynamic power savings in various modules by use of Multi-Vdd, Multi-Vth

Module      Dyn. power (µW), base case   Dyn. power (µW), after optimization   % savings
NPU         4170.2                       1835.2                                55.9
DES         292.3                        140.17                                52.0
RISC Core   554.9                        26.8                                  95.1
RS232       35.30                        16.85                                 52.2
GSM         57.64                        27.36                                 52.5
Overall     5110.34                      2046.4                                59.9

Table 5.2: Leakage power savings in various modules by use of Multi-Vdd, Multi-Vth

Module      Leak. power (µW), base case  Leak. power (µW), after optimization  % savings
NPU         183.54                       79.63                                 56.6
DES         4.02                         2.00                                  50.2
RISC Core   93.14                        7.44                                  92.0
RS232       7.13                         0.559                                 92.1
GSM         9.7                          0.77                                  92.0
Overall     297.53                       90.39                                 69.6

already in Fig. 4.6. The figure shows the power saving results obtained with Vdd first, then with both Vdd and Vth, and finally with clock gating also applied, for the NPU. As seen, this results in a total of 56% dynamic and 57% leakage power savings and a 5% delay improvement in the most critical block, thus satisfying our power-performance goals. After applying the same approach to all modules, the results along with the power saving improvements are given in Table 5.1 and Table 5.2. The total power consumed is 2.13 mW and the projected area of the design is about 0.1 mm². So the power/area density is about 2.1 W/cm², which is well within the ITRS requirements.
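Two of the reported figures can be cross-checked with quick arithmetic; the values below are taken from the text above.

```python
# Sanity checks on the reported synthesis results.

# 1) Timing margin: +3.75 ns worst slack against the 8 ns (125 MHz) txclk
#    target implies a ~4.25 ns critical path.
clk_target_ns = 8.0
worst_slack_ns = 3.75
critical_path_ns = clk_target_ns - worst_slack_ns
f_max_mhz = 1e3 / critical_path_ns            # ~235 MHz, well above 125 MHz

# 2) Power/area density: 2.13 mW over 0.1 mm^2.
total_power_w = 2.13e-3
area_cm2 = 0.1 * 1e-2                          # 1 mm^2 = 0.01 cm^2
density_w_per_cm2 = total_power_w / area_cm2   # ~2.13 W/cm^2

print(f"f_max ~ {f_max_mhz:.0f} MHz, density ~ {density_w_per_cm2:.2f} W/cm^2")
```

Both checks agree with the quoted margins: the critical path could in principle run near twice the target clock, and the density is close to the stated 2.1 W/cm².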


5.3.2 Back-Annotation

Back-annotation is important in large designs because, with the increase in physical paths, the complexities due to delays and race conditions that could not be detected at the simulation level also increase. After encountering such undefined states and races, the original design has been modified with suitable changes. The netlist of the NPU is then saved and resimulated in Modelsim 6.3 using the library module files of all the TSMC 90 nm custom cells used (such as AN3D8, AOI222DI1, DFF, MUX and BUFFD16), together with the .sdf (standard delay format) file, and the results have been matched against the ideal Tx waveforms, as shown in Fig. 5.10. The small physical delay in assignments at clock edges is shown in Fig. 5.11.

5.3.3 SoC Place and Route

The final step in the system design flow is placement and routing of the designed architecture to engineer the physical arrangement on the chip. We use the Cadence SoC Encounter tool for floor planning, power planning, placement and routing of the netlist obtained from Synopsys Design Vision. The .sdf file (containing physical delay information) and the .sdc file (delay constraints) are also generated from Synopsys to be used in SoC Encounter. TSMC 90 nm library files (.lef) and RC information (.captable) are also used as inputs. As our design has a large number of standard cells, there is potential for a higher number of geometry (DRC) violations. To eliminate them, the core utilization ratio has been decreased to 0.3 (at the expense of a slight increase in area). Also, a single-supply, single-threshold design is used for the layout. Power planning involves setting up the voltage rings and stripes. The netlist is then placed, and the clock tree is synthesized. The final step is a nano-route pass on the placed cells, run here with the congestion-driven routing option to further reduce

DRC violations. The routed design is verified for DRC violations, net connectivity problems and timing violations, all of which are zero in our design. The final routed design is shown in Fig. 5.12. The design statistics are as follows:

Table 5.3: Layout Summary

Statistic                   Value
Number of standard cells    13,183
Number of transistors       ~79,098 (~6T per cell)
Area                        0.1015 mm²
Internal power              1.646 mW
Switching power             0.278 mW
Leakage power               0.2188 mW
Total power                 2.143 mW

5.3.4 Xilinx Place and Route

The design is also mapped onto a Xilinx Virtex-5 FPGA prototype to check for possible timing violations due to the additional place and route (PAR) delay. The device utilization summary and the top-level RTL schematics after Synthesis, Translate, Map and PAR for the various modules are included in the Appendix. There were no additional timing violations in any of the modules, and hence the design meets the 100 Mbps throughput.


(a) Design generated frames successfully transferred and received by Physical layer

(b) Intentionally corrupted Ethernet frame

Figure 5.4: Ethernet data integrity check in our design


Figure 5.5: Test bench results showing TXpos and TXneg bits of NPU.

Figure 5.6: Test bench showing DES encryption results, including the internal pipeline stages.

Figure 5.7: SLM 015M RFID interface simulation (commands sent over tx; data received over rx).


Figure 5.8: AVM GSM Module interface simulation (SMS command sent over tx).

Figure 5.9: Overall simulation showing cascaded RFID Interface, DES Security and NPU engine. The data stream is finally placed on the Tx bus. Synchronous handshake signals (enables) are used to pass data from one layer to another.


Figure 5.10: Back Annotation result of NPU showing Tx bits.

Figure 5.11: Slight physical delay in assignment at clock edge after synthesis.


Figure 5.12: SoC after Placement and Routing.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

In the current work, we started with the goal of designing an embedded network transceiver architecture which could interface with an RFID reader and a GSM module, provide security and some flexibility, and provide network connectivity to send an email to an email server. All the goals have been met by breaking the problem into smaller individual designs. Apart from implementing a custom cell design using a cluster of low power techniques, a layout of the architecture has also been completed, which could be taped out. Also, the hardware for SMS functionality has been included in the GSM module. The network transceiver provides scalability by using the UDP protocol to send the RFID data to a remote server, which runs software to send the received data as an email to an SMTP server. We used a novel framework which applies a wide variety of macro- and micro-architectural and circuit techniques to improve on the other requirements, i.e. power, area, throughput, reliability and latency. Although the design is less flexible and cannot be deployed in routers and firewalls, the architecture scores in performance and power due to the design novelties implemented for the specific architecture and


also due to the circuit novelties of Multi-Vdd and Multi-Vth, which are not generally used in network processors/FPGAs. The fact that the power dissipation reduces drastically from 19 W (if a network processor were employed) to 2.1 mW in our custom design supports the assumption that a hardwired SoC gives three orders of magnitude of power saving if flexibility is traded [3]. The low power of our design also makes it very scalable for large-scale deployment.
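The claimed saving can be checked directly: the ratio from 19 W down to 2.1 mW is roughly 9 × 10³, i.e. at least three orders of magnitude.

```python
# Power ratio between a network-processor solution and the custom SoC
# (figures from the text: 19 W vs 2.1 mW).
import math

ratio = 19.0 / 2.1e-3
orders = math.log10(ratio)
print(f"ratio ~ {ratio:.0f}x (~{orders:.1f} orders of magnitude)")
```

The computed ratio of about 9000× slightly exceeds the stated three orders of magnitude, so the claim is conservative.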

Figure 6.1: Comparison of FPGA DES throughput against various other implementations.

The DES architecture, implemented in an LUT style especially suitable for Xilinx devices, achieves a throughput of 25.6 Gbps, which is the highest reported in the literature, as shown in Fig. 6.1. (Note that pipelining made this possible; it would otherwise be impractical due to the higher latency, which has been brought down here by reducing the number of DES rounds, trading off some security.)
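As a consistency check on the throughput figure: a fully unrolled, pipelined DES retires one 64-bit block per clock once the pipeline is full. The 400 MHz pipeline clock below is an inference chosen to match the reported 25.6 Gbps, not a number stated in the text.

```python
# Throughput of a fully unrolled, pipelined DES: one 64-bit block per clock.
# The 400 MHz clock is an inferred illustration matching the 25.6 Gbps figure.
BLOCK_BITS = 64
clk_hz = 400e6

throughput_gbps = BLOCK_BITS * clk_hz / 1e9
print(throughput_gbps)  # 25.6
```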

6.2 Contributions

• Designed a low power custom architecture to provide an on-chip transceiver for RFID readers.

• Developed a custom interface to the SLM 015M RFID reader at a baud rate of 9600 bps.

• Included an improved, reliable communication protocol and its hardware, which can be used in future custom-made RFID readers.

• Wrote a custom interface to the AVM GSM module to relay RFID data as an SMS.

• Achieved the highest reported throughput of 25.6 Gbps for the DES engine by truncating the number of rounds, unrolling and pipelining, and an architecture which uses LUT structures.

• Implemented a systolic pipelined architecture for the NPU to meet the 100 Mbps requirement with hardwired parallel PE units, which improve performance over parallel micro-coded engines.

• Implemented the idea of a custom queuing engine to speed up critical paths.

• First known implementation of the idea of power-aware processor scheduling in the allotment of packets to PEs.

• Migrated the idea of array lock hardware used in multiprocessors to parallel units in an ASIC.

• Demonstrated the split cache idea in reducing packet drop rate while maintaining a low FIFO delay.

• A novel method of employing Multi-Vdd and Multi-Vth along with a novel optimization framework to assign values at a modular granularity, and a heuristic clustering into the limited technology voltages.

• A congestion-driven place and route of the system architecture to obtain a prototype layout.
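The power-aware allotment of packets to PEs listed above can be sketched as a simple policy: among the idle PEs, choose the one with the lowest power cost. The cost values here are illustrative, not taken from the thesis.

```python
# Illustrative power-aware packet scheduling: pick the cheapest idle PE.
def pick_pe(pe_busy, power_cost):
    """Return the index of the lowest-power idle PE, or None if all busy."""
    idle = [i for i, busy in enumerate(pe_busy) if not busy]
    return min(idle, key=lambda i: power_cost[i]) if idle else None

pe_busy = [True, False, False, True]   # PE_busy flags, as exposed by the NPU
cost = [1.0, 0.7, 0.9, 1.0]            # relative per-PE power cost (assumed)
print(pick_pe(pe_busy, cost))          # 1 (idle and cheapest)
```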

6.3 Future Work

We have implemented the architecture specifically for a 100 Mbps line speed. However, as a large positive timing slack is available, the design can be extended to a 1 Gbps line speed. The other necessary analog components, such as the magnetics module and clock synthesizer, can also be added to make the unit complete and free of non-idealities due to noise, skew and jitter. To further bring down the latency due to CRC in the current critical path, a parallel CRC architecture can be realized. To have a full system running, we implemented an RS-232 interface and its required buffers for the SLM015M RFID reader, processing for security and applications, and the complete network interface. To further extend to other RFID readers, the middleware USB architecture developed in this work can be used in future custom-made RFID readers. At the circuit level, more low power techniques such as dynamic voltage and frequency scaling and power gating can be added to further improve power savings. The layout can be further processed to develop a test IC that can be evaluated for deployment.
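For reference, the CRC in question is the standard Ethernet CRC-32; a bit-serial software model of it is sketched below. A parallel CRC architecture computes the same function several bits per clock, which is what shortens the critical path.

```python
# Bit-serial Ethernet CRC-32 (reflected polynomial 0xEDB88320), the function
# a parallel CRC architecture would evaluate multiple bits at a time.
def crc32_eth(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; XOR in the polynomial when the LSB was set.
            crc = (crc >> 1) ^ (0xEDB88320 * (crc & 1))
    return crc ^ 0xFFFFFFFF

print(hex(crc32_eth(b"123456789")))  # standard check value: 0xcbf43926
```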

Appendix A

Xilinx Synthesis and PAR

Figure A.1: RTL schematic of NPU.

Figure A.2: RTL schematic of DES.

Xilinx Design Summary: module NPU, target device xc5vlx30-3ff324, ISE 9.2i, programming file generated; updated Fri Apr 17 22:32:42 2009.

Device Utilization Summary:

Slice Logic Utilization               Used    Available   Utilization
Number of Slice Registers             2,918   19,200      15%
Number of Slice LUTs                  4,468   19,200      23%
  Number used as logic                4,455   19,200      23%
  Number used as Memory                   8    5,120       1%
Number of occupied Slices             1,618    4,800      33%
Number of LUT-Flip Flop pairs used    5,542
  Number with an unused Flip Flop     2,624    5,542      47%
  Number with an unused LUT           1,074    5,542      19%
  Number of fully used LUT-FF pairs   1,844    5,542      33%

Figure A.3: Device Summary for NPU on Xilinx Virtex-5 FPGA.

Appendix B

Synopsys Design

Figure B.1: DES after synthesis.

Figure B.2: Synthesized NPU, Packet engines and their internal structures in the following figures.


Figure B.5: RFID interface.


Figure B.6: GSM interface.

Bibliography

[1] D. Markovic, "A power/area optimal approach to VLSI signal processing," Ph.D. dissertation, UC Berkeley, 2006.
[2] Philips Semiconductors, "The Nexperia system silicon implementation platform," IEEE Circuits and Devices Magazine, vol. 1, pp. 20–32, Mar. 1999.
[3] J. Rabaey, "Low power silicon architectures for wireless applications," in Proc. ASP-DAC, Jan. 2000, pp. 169–172.
[4] Y. H. Chee, "Ultra low power transmitters for wireless sensor networks," Ph.D. dissertation, UC Berkeley, 2006.
[5] N. Agarwal, "Network connectivity for embedded systems based on FPGA," Master's thesis, IIT Kanpur, 2008.
[6] A. Chandrakasan and R. Brodersen, "Minimizing power consumption in digital CMOS circuits," IEEE Journal of Solid-State Circuits, Apr. 1995, pp. 473–484.
[7] StrongLink Inc., SLM0151M RFID Reader Manual.
[8] Olimex Inc., AVR GSM module manual, http://www.olimex.com/dev/pdf/AVR/AVRGSM/SIM300DATC.pdf.
[9] M. D. Ciletti, Advanced Digital Design with the Verilog HDL. Prentice Hall, 2003.
[10] National Institute of Standards and Technology, DES Standard Manual.
[11] M. Venkatachalam (Intel), Integrated Data and Control Plane Processing Using Intel IXP23XX Network Processors.
[12] IEEE Standards, IEEE 802.3 Ethernet Specifications.
[13] N. Ling and M. A. Bayoumi, Specification and Verification of Systolic Arrays. World Scientific Press, 1999.
[14] P. Crowley and M. A. Franklin (eds.), Network Processor Design: Issues and Practices. Elsevier Press, 2003.
[15] Intel, 82555 10/100 Mbps LAN Physical Layer Interface datasheet.
[16] D. Culler and J. P. Singh, Parallel Computer Architecture. Elsevier Press, 2000.
[17] M. Baker, "Evaluation of packet scheduling algorithms in mobile ad hoc networks," in Mobile Computing and Communications, 2002.
[18] M. Schlansker, "A distributed control path architecture for VLIW processors," in IEEE PACT, 2005.
[19] A. Chandrakasan, "Low power digital CMOS design," Ph.D. dissertation, UC Berkeley, 1994.
[20] Y. Cao and H. Yasuura, "A system-level energy minimization approach using datapath width optimization," in IEEE ISLPED, Aug. 2001, pp. 231–235.
[21] U. MM and C. Y., "An accelerated datapath width optimization scheme for area reduction of embedded systems," in IEEE Intl. Symposium on System Synthesis, Apr. 2002, pp. 32–37.
[22] A. S. Papa and M. Muthyam, "Power management of variation aware chip multi-processors," in GLSVLSI, May 2008, pp. 135–139.
[23] U. MM and C. Y., "FPGA-based co-processor for singular value array reconciliation tomography," in IEEE Intl. Symposium on FCCM, Apr. 2008, pp. 163–172.
[24] Xilinx Inc., Using CRC Hard Blocks in Virtex-5 FPGAs.
[25] T. Sherwood and B. Calder, "A pipelined memory architecture for high throughput network processors," in IEEE ISCA, June 2003.
[26] Y. Luo and L. Bhuyan, "Low power network processor design using clock gating," in DAC, June 2005, pp. 113–117.
[27] Hamada, "Using surplus timing for power reduction," in IEEE CICC, 2001.
[28] S. Augsburger, "Using dual-supply, dual-threshold and transistor sizing to reduce power in digital circuits," Master's thesis, UC Berkeley, 2002.
[29] N. Kato, "Random modulation: multi-threshold voltage design," in IEICE, 2000.
[30] Tsai, "Total power optimization through simultaneous multiple-Vdd, multiple-Vth assignment and device sizing with stack forcing," in IEEE ISLPED, 2004.
[31] Lee, "Algorithm for achieving minimum energy consumption in CMOS circuits using multiple supply and threshold voltages at the module level," in IEEE ICCAD, 2003.
[32] M. H. et al., "Low power digital design," in IEEE ISCA, June 2003.
[33] R. Gonzalez et al., "Supply and threshold voltage scaling for low power CMOS," IEEE J. Solid-State Circuits, 1997.
[34] P. I. Pénzes and A. J. Martin, "Energy-delay efficiency of VLSI computations," in Proc. GLSVLSI, 2002.
[35] H. P. Hofstee et al., "Power-constrained microprocessor design," in IEEE Proc. ICCD, 2002.
[36] D. Markovic, "Methods for true energy-performance optimization," IEEE J. Solid-State Circuits, 2004.
[37] K. Banerjee, "A self-consistent junction temperature estimation methodology for nanometer scale ICs with implications for performance and thermal management," in IEDM Tech. Dig., 2003.
[38] ——, "A thermally-aware methodology for design-specific optimization of supply and threshold voltages in nanometer scale ICs," in IEEE ICCD, 2005.
[39] B. J. Sheu, D. L. Scharfetter, P.-K. Ko, and M.-C. Jeng, "BSIM: Berkeley short-channel IGFET model for MOS transistors," IEEE J. Solid-State Circuits, 1987.
[40] V. K., "Simultaneous selection and assignment for leakage optimization," in IEEE Transactions on VLSI, 2005.

Design of SoC for Network Based RFID Applications

Figure 3.4: Power Savings in UDP and IP modules by datapath optimiza- tion algorithm . ...... Product Version: ISE 9.2i. ○ Updated: Fri Apr 17 22:32:42 2009.

4MB Sizes 4 Downloads 205 Views

Recommend Documents

Design of SoC for Network Based RFID Applications
Apr 17, 2009 - Figure 5.11: Slight physical delay in assignament at clock edge after .... non-zero possibility of absence of peer nodes in the wireless range ... circuit, and technology resulted in three orders of magnitude power savings for.

RFID Based Face Attendance System RFID Based Face ... - IJRIT
This paper describes the development of a employee attendance system .... One can increase both the false alarm rate and positive hit rate by decreasing the.

RFID Based Face Attendance System RFID Based Face ... - IJRIT
ability to uniquely identify each person based on their RFID tag type of ID card make .... Fortunately, Intel developed an open source library devoted to easing the.

RFID Technology, Systems, and Applications
Information Technology , University of ... College of Information Technology ... advantage of the potential of more automation, efficient business processes,.

RFID Technology, Systems, and Applications
Information Technology , University of ... College of Information Technology ... advantage of the potential of more automation, efficient business processes,.

34.RFID Technology for IoT-Based Personal.pdf
There was a problem previewing this document. Retrying... Download. Connect more apps... Try one of the apps below to open or edit this item. 34.