© 2010 Brent Alan Mochizuki


REDEFINING THE HARDWARE-SOFTWARE BOUNDARY IN NETWORKED SYSTEMS

BY BRENT ALAN MOCHIZUKI

THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2010

Urbana, Illinois Adviser: Assistant Professor Matthew C. Caesar

ABSTRACT

While the traditional division between hardware and software development provides a useful layer of abstraction that allows developers to create complex software applications with limited knowledge of the underlying hardware, it has led to an inflexible division of labor between hardware and software. This abstraction has been necessary to allow developers to create innovative and complex software applications efficiently, while the underlying hardware evolved separately. Recently, however, hardware design has become much simpler with the advent of high-performance programmable and reconfigurable logic. With these devices, developers can now develop custom hardware efficiently and cost-effectively. This thesis argues that developers now need to reexamine the boundary between hardware and software when developing performance-critical applications in networked systems. It shows that, for a variety of applications, using custom hardware can be a better choice than a pure software design running on commodity hardware. To demonstrate this argument, a hardware-amenable Internet routing protocol is developed that outperforms the Border Gateway Protocol (BGP) when implemented purely in hardware. To demonstrate the generality of this technique, a hardware-based network simulator is developed that can outperform ns-2 for a specific class of simulations. These implementations show that, with only a moderate amount of design complexity, hardware-aware and hardware-amenable designs can significantly improve performance.


To my family and friends, for their undying support


ACKNOWLEDGMENTS

Thanks to Professor Matthew Caesar for being my adviser and challenging me to think critically and creatively; to Professor Randy H. Katz for inspiring and encouraging me to pursue graduate study; to Firat Kiyak and Eric Keller for their collaboration effort in the HAIR paper; and to Stanley Bak for his advice on the network simulator project.


TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION

CHAPTER 2 HARDWARE-BASED DESIGN CHALLENGES
2.1 Worsens Protocol Ossification
2.2 Hardware Development is Inefficient

CHAPTER 3 PROTOCOL PROCESSING DESIGN
3.1 BGP Scaling Challenges

CHAPTER 4 OFFLOADING BGP TO HARDWARE
4.1 Packet Parsing and Session Logic
4.2 Trie Management
4.3 Routing Table Management
4.4 Decision Logic

CHAPTER 5 A HARDWARE-AMENABLE BGP
5.1 No Total Ordering of Routes
5.2 Complex Lookup
5.3 Required Use of BGP
5.4 Long and Variable-Length Attribute Strings

CHAPTER 6 HAIR IMPLEMENTATION
6.1 HAIR Processor Design
6.2 Changes to the Data Plane
6.3 Prototype Platform Limitations

CHAPTER 7 DEPLOYMENT CONSIDERATIONS OF HAIR
7.1 Supporting Standard Routing Policies
7.2 Inter-Operating with Standard BGP

CHAPTER 8 EVALUATION OF HAIR
8.1 Methodology
8.2 Throughput and Processing Delay
8.3 Sensitivity to Workload Changes
8.4 Properties of Workload
8.5 Evaluation Results Summary

CHAPTER 9 A HARDWARE-BASED NETWORK SIMULATOR
9.1 ns-2 Design
9.2 HW-NS Design
9.3 Network Simulator Analysis

CHAPTER 10 RELATED WORK

CHAPTER 11 CONCLUSIONS

REFERENCES

CHAPTER 1 INTRODUCTION

The hardware-software boundary exists as an extremely useful layer of abstraction that lets developers divide the task of designing electrical circuits from the task of writing the programmable applications that run on them. By employing abstraction layers such as the hardware-software boundary, developers are able to create complex and sophisticated applications without having to design or understand the underlying hardware. These layers of abstraction have been instrumental in developing the sophisticated applications and operating systems that run on commodity hardware, while also allowing each abstracted layer to advance in parallel.

Although abstraction has led to many advances in technology and enables developers to create products with only a limited skill set and knowledge base, it can also be detrimental to performance-critical applications. Traditionally, application design has been limited to software, as developing custom hardware was considered too expensive and inefficient, and was undertaken only as a necessity when software-based implementations hit a performance wall. Thus, most developers tend to consider software-based implementations only. Specifically, in networked systems, solutions are typically implemented and processed in software, running in an operating system on a server with a commodity microprocessor. This allows for a simple implementation, and can be acceptable in many cases; however, for performance-critical applications, a generic software solution may not be the best choice.

History has shown that when software designs hit a performance wall, the wall can be overcome by designing custom hardware, which lets the designer take advantage of the customizability and potential parallelism of the hardware. While custom hardware design is notorious for being tedious, expensive or outright impractical, many recent advances in custom hardware design mitigate these concerns. Advances in computer-aided design for custom hardware allow designers to create complex custom circuits without having to design at the transistor level. Hardware description languages like Verilog, SystemVerilog and VHDL, tools for synthesizing transactional models such as those from Bluespec [1], and tools for compiling high-level languages such as those from Synfora [2] make it easier to prototype and debug hardware. Special-purpose hardware with customized instruction sets for specific applications, such as graphics processing units (GPUs), is gaining increased acceptance and demand. Additionally, reconfigurable and programmable hardware, such as field-programmable gate arrays (FPGAs), allows developers to implement their own custom hardware designs without having to go through the expensive and time-consuming process of fabricating custom silicon, and enables hardware to be updated and patched without physical replacement. Finally, FPGAs are becoming more advanced and affordable than ever, making them feasible target platforms for performance-critical applications.

These recent advances, however, can only ease implementation as much as is allowed by the design itself, which has a significant impact on the achievable performance and the ease with which it can be implemented in hardware. This thesis argues that designs for networked systems solutions should be hardware-aware; they should be designed with an acute awareness of the underlying hardware. When software performance is weak, designs should be hardware-amenable; they should be designed to be implemented in custom hardware with low complexity and high performance.
Not only does this simplify a transition to hardware offloading, but with the advent of multi-core processors, the increasing popularity of graphics processor programming [3], and the increasing deployment of resource-constrained and embedded devices, developers will need to be increasingly aware of the constraints of the underlying hardware, even for software implementations.

This thesis examines this strategy by studying protocol processing and network simulation. Specifically, it explores Internet routing using the Border Gateway Protocol (BGP) [4]. Because BGP was originally designed to be implemented in software, and at a much smaller scale than it is used today, BGP processing is hitting a performance wall. To keep up with scaling challenges while avoiding the complexities of offloading the software-centric protocol to hardware, guidelines on how often routing updates can be propagated are needed, lengthening convergence times. It is shown that, with a few changes, the protocol could have been designed to be processed efficiently in hardware as well as software. With this, a fast convergence time need not be sacrificed in order to scale. This work shows the benefits of designing protocols while considering hardware-based implementations.

Additionally, network simulation is studied by analyzing the ns-2 network simulator. As its original design was meant to run in a single software thread, it performs slowly for certain classes of simulation. While the simulator still functions and is useful, productivity can be drastically increased by implementing a more efficient simulator. This work shows that implementing a hardware-based simulator is feasible and is more scalable than ns-2 for certain simulations.

This notion of designing custom hardware solutions for networked systems is not new. It is often viewed, however, as a last resort. Custom hardware design is often used only if software designs are too slow or consume too much of the processing resources of the systems on which they run. Thus, the term hardware offloading is often used for this transition from a software to a hardware design. This thesis argues that using custom hardware design in networked systems should be considered not only as a last resort, but as a means to accelerate performance. Since designing with awareness of the underlying hardware or using custom hardware designs can offer increased performance, and both are becoming less complex, custom hardware should be considered a viable platform for performance-critical applications.

To show how redefining this hardware-software boundary can improve performance in a networked system, this thesis looks at processing Internet protocols. Specifically, it explores the benefits of using custom hardware in Internet routing. Additionally, it explores the benefits of using custom hardware to build a scalable network simulator. Finally, it relates these works to previous works involving hardware offloading.
Roadmap: Chapter 2 lists the benefits of custom hardware designs and addresses common arguments against adopting hardware-only implementations. These benefits are then explored by studying two applications: Internet routing and network simulation. Chapter 3 gives background on the current design for Internet protocol processing, describes the scaling challenges facing BGP, and outlines the argument for implementing protocol processing in hardware. Chapter 4 describes the design architecture for BGP in hardware, and Chapter 5 describes this design’s key performance bottlenecks and a proposed variant of BGP, called Hardware-Amenable Internet Routing (HAIR), that is more amenable to hardware offloading. Next, a hardware prototype implementation of a HAIR processor is presented in Chapter 6, along with a discussion of deployment considerations for these HAIR-speaking routers in Chapter 7 and an analysis of performance in Chapter 8. In Chapter 9, the limited scaling abilities of ns-2 and new hardware designs for network simulation are explored. Finally, Chapter 10 summarizes related work and Chapter 11 concludes this thesis.


CHAPTER 2 HARDWARE-BASED DESIGN CHALLENGES

Custom hardware-based designs offer great performance benefits over software-based implementations. These designs can be tailored to the given application and can exploit a high degree of parallelism, as opposed to software implementations that run on general-purpose processors, which are sequential by nature. Given these performance benefits, considering hardware-based implementation when designing networked systems solutions is clearly advantageous. However, there is still resistance to adopting hardware as an implementation platform. The traditional arguments against hardware offloading (some of which are outlined in [5]) are that it harms the ability to update protocols or deploy new ones, and that it complicates implementation work. These issues are discussed below.

2.1 Worsens Protocol Ossification

The difficulty of changing deployed protocols is at the root of many of the Internet’s problems. It makes it hard to deploy new designs with advanced features and functionality, and it makes it difficult to fix problems and bugs in existing protocols. Implementing protocols in hardware makes this problem worse. While low-level network protocols such as Ethernet and 802.11 are relatively fixed and undergo few changes, and hence are typically implemented in hardware, higher-layer protocols undergo more innovation and suffer from a wider array of bugs and vulnerabilities. Changing already-deployed hardware is an expensive proposition. While application-specific integrated circuits (ASICs) can operate at high speeds, taping out a design and burning it into silicon costs millions of dollars. Furthermore, once the design is deployed, updating hardware traditionally required physical changes to equipment, further increasing cost.

However, modern semiconductor devices called field-programmable gate arrays (FPGAs) can be modified by a customer (potentially multiple times) after manufacture, and even after deployment. An FPGA consists of an array of configurable logic blocks connected together via a programmable interconnect, allowing a user to program the device with customized hardware designs. Thus FPGAs are flexible enough to support any logical function that can be implemented in traditional silicon and can be remotely updated and patched in the field, just like software-based systems. Although FPGAs have typically been a factor of 10 slower than ASIC-based designs, they are being increasingly used, even for high-volume applications traditionally dominated by ASIC designs. This is happening due to the very high upfront costs of ASIC development and shrinking R&D resources, and hence a reduced capacity for the high-grade quality assurance needed to weed out errors prior to deployment: fixing a bug in an ASIC requires refabricating the entire design, at the cost of millions of dollars, and remanufacturing all produced components. Moreover, even though FPGAs are slower than ASICs, they can be orders of magnitude faster than general-purpose processors due to their ability to leverage parallelism.

2.2 Hardware Development is Inefficient

General-purpose CPUs steadily increased in clock speed over many years, with the conventional wisdom being that new CPUs double their processing rate every two years. However, modern CPUs no longer undergo regular increases in clock speed and are instead adding more cores. To take full advantage of such a processor, applications must be designed to use its many cores. Because of this, software that fully utilizes the processor’s capabilities is getting harder and harder to design.

At the same time, hardware is getting easier to design for. Because FPGAs are parallel to begin with, the clock-frequency wall that processors are hitting does not affect FPGAs in the same way. In fact, not only is designing for hardware not getting harder (since the methodology is the same), it is getting easier. Programming hardware has traditionally required knowledge of operation at the gate level, making building large designs a time-consuming and error-prone process. However, the advent of modern hardware description languages, such as Verilog or VHDL, which are processed by a series of tools into a form that can be loaded onto an FPGA, and more recently the capability to “compile” a high-level language, such as C, to an FPGA implementation, alleviates the need to implement designs via low-level mechanisms. Moreover, implementing a new design no longer requires starting from scratch; just as software programming allows use of libraries of code, hardware description languages make code reuse simple through open interfaces. Open-source implementations of hardware design libraries are increasingly made publicly available for commonly implemented logic [6, 7]. Finally, the cost of programmable hardware and hardware simulators has dropped low enough for system builders, from students to commercial programmers, to prototype and experiment with their designs in realistic environments. Thus, as processors move from increasing clock speed to increasing parallelism with the advent of multi-core technologies, software implementations are becoming more complicated and the complexity difference between hardware and software implementation is shrinking.


CHAPTER 3 PROTOCOL PROCESSING DESIGN

Many Internet protocols are designed to be implemented in the form of software daemons running on commodity microprocessors. When these software designs fail to perform adequately, however, they can be offloaded to hardware to accelerate performance. For example, protocol messages may be pipelined within the same circuit, and functional blocks may be replicated to create multiple pipelines. When it comes to this offloading process, however, if the protocol being processed was not designed well for hardware processing, then inefficient, complex and costly designs must often be used to maintain backward compatibility with the software-based protocol. Often the performance benefits, which are not as substantial as they could be, are outweighed by these inconveniences, as well as by the increased effort of implementing and maintaining a hardware protocol processor. Thus, when designing future Internet protocols, we must be mindful that future scaling challenges may push the processing of such protocols to hardware, and design the protocols to be amenable to hardware-only implementations.

As a case study, Internet routing using the Border Gateway Protocol (BGP) is explored. Because BGP was originally designed to be implemented in software, and at a much smaller scale than it is used today, BGP processing is hitting a performance wall. To keep up with scaling challenges while avoiding the complexities of offloading the software-centric protocol to hardware, guidelines on how often routing updates can be propagated are needed, lengthening convergence times. It is shown that, with a few changes, the protocol could have been designed to be processed efficiently in hardware as well as software. With this, a fast convergence time need not be sacrificed in order to scale. This work shows the benefits of designing protocols while considering hardware-based implementations. In particular, two contributions are made relating to BGP:

1. An empirical analysis of the challenges associated with offloading BGP into hardware is performed by first proposing an architecture and logical design for implementing a hardware-based version of BGP (i.e., a circuit running directly on a semiconductor device). Then, based on this design, the features of BGP that increase the complexity of the hardware-based implementation are enumerated.

2. Protocol changes are proposed to make BGP more amenable to hardware-based operation by designing a hardware-amenable variant of BGP. This variant, Hardware-Amenable Internet Routing (HAIR), simplifies offloading into hardware while retaining BGP’s semantics.

In the following analyses of BGP, it is found that modern technologies and a few small protocol changes make a hardware-targeted protocol design a better option. In particular, due to the customizability and increased parallelism achievable in hardware, the processing rate and latency are improved by multiple orders of magnitude, and the hardware-amenable protocol improves performance by an additional order of magnitude.

3.1 BGP Scaling Challenges

The Internet is a very large and complicated distributed system. Selecting a route involves a computation across millions of routers spread over vast distances, multiple routing protocols, and highly customizable routing policies. To perform this computation based on up-to-date information, the state of network paths is propagated across Internet Service Providers (ISPs) through the use of BGP. While BGP has performed this job well for many years, offering highly configurable operation and control over propagation and selection of routes, it is facing tremendous scaling challenges in modern networks. BGP-speaking routers must process millions of updates daily over hundreds of thousands of prefixes. Furthermore, update arrivals are quite bursty in nature, providing a highly variable workload for routers, and to avoid triggering outages and routing loops, routers must be provisioned for handling the peak load. Worse still, the number of Internet routes and their churn is steadily increasing, with predictions that current router architectures may be unable to keep up [8].


To cope with these loads, some BGP implementations leverage timers to rate-limit update traffic, and only periodically exchange deltas (differences) of routing state. However, timers slow routing convergence by extending the time it takes for routers to learn the current state of a route. Slower convergence leads to black holes, routing loops, and other anomalies, increasing the potential for packet loss. Alternatively, network operators incorporate flap damping into route selection, forcing less-stable routes to be artificially “held down” and withdrawn from use. While flap damping reduces routing instability, it also harms availability by removing working paths from use. In fact, conventional wisdom is to disable flap damping, as it can be inadvertently set off during path exploration and can sometimes leave a router with no route for some period of time, leading to black holes [9].

Unfortunately, the availability and convergence issues introduced by damping and timer-based approaches are becoming even more serious with the increasing levels of Internet traffic and the deployment of applications with real-time requirements, such as gaming, virtual worlds and Voice over IP (VoIP). In fact, it has been observed that BGP convergence issues negatively affect VoIP usability as much as network congestion does [10]. Rather than looking at solutions to these scaling challenges that limit the capabilities or performance of the routing protocol itself, this thesis looks to improve how it is implemented, with an eye towards uncovering ways to design protocols that simplify offloading to hardware. Results indicate that future network architectures will benefit from being designed with the underlying hardware in mind.
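The availability cost of flap damping discussed in this section can be illustrated with a small sketch. It assumes the common penalty-based scheme, in which each flap adds a fixed penalty that decays exponentially and the route is held down while the penalty is high. The class name FlapDamper and all parameter values are illustrative assumptions and are not taken from this thesis.

```python
class FlapDamper:
    """Penalty-based flap damping for a single route (illustrative values)."""

    def __init__(self, penalty_per_flap=1000.0, suppress=2000.0,
                 reuse=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress = suppress      # hold the route down above this penalty
        self.reuse = reuse            # release the route below this penalty
        self.half_life = half_life    # seconds for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the accumulated penalty since the last event.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one flap (withdrawal or change) at time `now` in seconds."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty > self.suppress:
            self.suppressed = True

    def usable(self, now):
        """Return True if the route may be used at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return not self.suppressed


damper = FlapDamper()
for t in (0, 60, 120):        # three flaps within two minutes
    damper.flap(t)
print(damper.usable(121))     # route is held down even if the path now works
```

With these assumed parameters, three flaps in quick succession keep the route out of use for tens of minutes, which is the kind of lost availability the text describes.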


CHAPTER 4 OFFLOADING BGP TO HARDWARE

Figure 4.1: FPGA-based architecture for BGP.

BGP is a distributed routing protocol that, to date, has been implemented exclusively in software. This chapter gives an overview of the architecture for a hardware-based BGP implementation. The presented design is intended to be used with a TCP offload solution [5], and implemented on the NetFPGA platform [7], using only SRAM instead of TCAM. The design, shown in Figure 4.1, consists of four main modules, which are described in the following sections.

4.1 Packet Parsing and Session Logic

This component maintains BGP sessions to the neighboring routers and parses BGP updates into a concise representation that is used internally. In the case of a received update message, the BGP packet parser extracts path attributes and prefixes and forwards them to the Trie Manager and Routing Table Manager.
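As a software illustration of the attribute-extraction step, the following sketch walks the path-attributes portion of a BGP UPDATE message using the standard encoding (a flags octet, a type-code octet, and a one- or two-octet length selected by the Extended Length flag). The function name parse_path_attributes is hypothetical; this is not the thesis's hardware parser, only a sketch of the work that parser must perform.

```python
import struct

EXTENDED_LENGTH = 0x10  # flag bit: the attribute length field is two octets


def parse_path_attributes(data: bytes):
    """Return a list of (flags, type_code, value) tuples from a path-attribute block."""
    attributes = []
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated attribute header")
        flags, type_code = data[offset], data[offset + 1]
        offset += 2
        # The length field's own size depends on a flag in the same attribute.
        if flags & EXTENDED_LENGTH:
            if offset + 2 > len(data):
                raise ValueError("truncated extended length")
            (length,) = struct.unpack_from("!H", data, offset)
            offset += 2
        else:
            if offset + 1 > len(data):
                raise ValueError("truncated length")
            length = data[offset]
            offset += 1
        if offset + length > len(data):
            raise ValueError("truncated attribute value")
        attributes.append((flags, type_code, data[offset:offset + length]))
        # The next attribute's position is known only after this length is read.
        offset += length
    return attributes


# ORIGIN (type 1, value 0) followed by NEXT_HOP (type 3, value 10.0.0.1).
block = bytes([0x40, 1, 1, 0, 0x40, 3, 4, 10, 0, 0, 1])
for flags, type_code, value in parse_path_attributes(block):
    print(type_code, value.hex())
```

Each attribute's starting offset depends on the length of the one before it, so the walk is inherently sequential.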


4.2 Trie Management

This component maintains a trie structure which contains references to locations in the routing table (RIB). Since low-latency storage devices are not large enough to reserve space to store routes to every possible IP prefix, a translation between the IP prefix and the physical memory address of the routes corresponding to that prefix must exist. In this design, the Trie Manager uses an IP prefix to traverse the trie structure, resulting in a location in the RIB that contains the set of received routes for that prefix.
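The prefix-to-address translation can be sketched in software as a binary trie keyed on prefix bits. The names PrefixTrie and lookup_or_insert and the sequential slot allocator are assumptions made for illustration; the hardware design's node layout and memory management are not specified here.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "rib_address")

    def __init__(self):
        self.children = [None, None]   # child for bit 0 and child for bit 1
        self.rib_address = None        # RIB location, if a prefix ends here


class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()
        self.next_free = 0             # hypothetical allocator: next unused RIB slot

    @staticmethod
    def _bits(prefix):
        net = ipaddress.ip_network(prefix)
        value = int(net.network_address)
        width = net.max_prefixlen
        return [(value >> (width - 1 - i)) & 1 for i in range(net.prefixlen)]

    def lookup_or_insert(self, prefix):
        """Return (rib_address, nodes_visited) for an exact prefix."""
        node = self.root
        visited = 0
        for bit in self._bits(prefix):
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
            visited += 1               # each step is one dependent memory read
        if node.rib_address is None:
            node.rib_address = self.next_free
            self.next_free += 1
        return node.rib_address, visited


trie = PrefixTrie()
print(trie.lookup_or_insert("10.0.0.0/8"))
print(trie.lookup_or_insert("192.168.0.0/16"))
```

The second element of the result counts the nodes visited, which corresponds to the number of sequential memory accesses one lookup would need in this simple one-bit-per-level layout.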

4.3 Routing Table Management

This component maintains the routing table by writing new routes to the set of previously advertised routes (one for each neighbor with a route), and scanning all of the routes to the prefix. Upon receiving a physical memory address from the Trie Manager, the Routing Table Manager writes an advertised route to the RIB (or, in the case of a withdrawal, deletes the route from the RIB) and scans the RIB for other previously advertised routes to the same destination. As each route is written to or read from the RIB, it is sent to the BGP decision logic component.

4.4 Decision Logic

This component implements logic to choose the best route from all advertised routes to the same destination prefix. For each advertised or withdrawn route, the Routing Table Manager sends the decision logic multiple routes to the same destination in sequence. The decision logic then compares these routes and chooses the best route to the destination. If the newly selected best route differs from the previously selected one, the best route is sent to the forwarding table (FIB) in the data plane (not shown in Figure 4.1), and to neighboring routers.
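The interaction between the Routing Table Manager and the decision logic can be sketched as follows. The sketch assumes a deliberately simplified preference rule (higher local preference, then shorter AS path, then lower neighbor ID); the full BGP decision process has more steps. The names Route, RibEntry and preference_key are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Route:
    neighbor: int
    local_pref: int
    as_path: Tuple[int, ...]


def preference_key(route):
    # Simplified rule (an assumption): higher local_pref wins, then the
    # shorter AS path, then the lower neighbor ID as a final tie-break.
    return (-route.local_pref, len(route.as_path), route.neighbor)


class RibEntry:
    """All routes to one destination, at most one per neighbor."""

    def __init__(self):
        self.routes: Dict[int, Route] = {}
        self.best: Optional[Route] = None

    def apply(self, neighbor, route):
        """Write a route (or withdraw it when route is None), then rescan.

        Returns (best_route, changed); `changed` signals that the FIB and
        the neighboring routers would need to be updated.
        """
        if route is None:
            self.routes.pop(neighbor, None)
        else:
            self.routes[neighbor] = route
        # Full scan of every stored route for this destination.
        new_best = min(self.routes.values(), key=preference_key, default=None)
        changed = new_best != self.best
        self.best = new_best
        return new_best, changed


entry = RibEntry()
print(entry.apply(1, Route(1, 100, (65001, 65002))))
print(entry.apply(2, Route(2, 100, (65003,))))
print(entry.apply(2, None))   # withdrawal falls back to neighbor 1's route
```

Every update triggers a scan over all routes stored for the destination, so the work per update grows with the number of neighbors advertising that destination.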


CHAPTER 5 A HARDWARE-AMENABLE BGP

While the hardware-based BGP design presented above offers improved processing speed over a software-based implementation (discussed in Chapter 8), achieving this required a moderate level of complexity, and the design still suffers from multiple bottlenecks. Namely, it requires several clock cycles to parse update messages, traverse the trie for physical memory addresses in the RIB, and look up current routes maintained in the RIB. These limitations are unavoidable because the protocol was created without considering hardware implementations. To demonstrate the value of designing future protocols while considering hardware-based implementations, this chapter presents the HAIR protocol, which conforms to the principles of BGP, yet is designed for hardware. As described in Chapter 6, this protocol decreases the effort and complexity in designing a hardware-based router, and, as described in Chapter 8, provides an order of magnitude performance improvement over the hardware-based BGP implementation. This chapter enumerates four key design properties of BGP that complicate offloading to hardware, and presents HAIR, which addresses these complexities.

5.1 No Total Ordering of Routes

When a new route is advertised in BGP, the BGP router needs to determine whether the new route is better than its current best route. The Multi-Exit Discriminator (MED) attribute, used to signal to an immediately adjacent neighboring Autonomous System (AS) which ingress link should be used to send traffic to the local AS, prevents each router from having a total ordering over all possible candidate routes [11]. As a result, determining the best route requires a complete scan of all existing routes whenever an update is received. To address this, HAIR provides a configurable flag to enforce a total ordering across routes so that a newly received route only needs to be compared to the current best route. A network operator may enable this flag to achieve faster processing, yet still enable typical uses of MED.
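The effect of MED on route ordering can be made concrete with a small sketch. It uses a two-step comparison (MED, then IGP cost) in which MED is compared only between routes from the same neighboring AS, as in standard BGP. How HAIR's flag enforces a total order is not detailed in this section; comparing MED across all routes is an assumption made here for illustration, and the names prefers and update_best are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    neighbor_as: int
    med: int
    igp_cost: int


def prefers(a, b, total_order):
    """Return True if route a is preferred over route b."""
    # Standard behaviour: MED is comparable only within one neighboring AS.
    # Assumed flag behaviour: MED is compared across all routes.
    if total_order or a.neighbor_as == b.neighbor_as:
        if a.med != b.med:
            return a.med < b.med
    if a.igp_cost != b.igp_cost:
        return a.igp_cost < b.igp_cost
    return a.name < b.name


def update_best(current_best, new_route):
    """With a total order, one comparison per newly advertised route suffices."""
    if current_best is None or prefers(new_route, current_best, total_order=True):
        return new_route
    return current_best


r1 = Route("r1", neighbor_as=100, med=20, igp_cost=1)
r2 = Route("r2", neighbor_as=100, med=10, igp_cost=3)
r3 = Route("r3", neighbor_as=200, med=15, igp_cost=2)

# Without a total order the pairwise preferences form a cycle.
print(prefers(r2, r1, False), prefers(r3, r2, False), prefers(r1, r3, False))

# With a total order the incremental result does not depend on arrival order.
best = None
for route in (r1, r3, r2):
    best = update_best(best, route)
print(best.name)
```

The sketch covers newly advertised routes only. Withdrawing the current best route, or replacing it with a worse route from the same neighbor, would still require consulting the other stored routes.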

5.2 Complex Lookup

Routers must maintain a data structure to look up the set of advertised routes associated with an IP prefix. This is often done with a trie data structure, which allows lookup of keys of length n in O(n) time. However, implementing a trie in hardware has some disadvantages. First, implementing data structures with pointers in hardware is complex and requires advanced memory management. Second, a single IP prefix lookup takes a substantial number of cycles, since a traversal for an IP prefix visits multiple nodes of the trie and each step requires a separate memory lookup that must be performed in sequence. While software routers also suffer from lookup delay, it becomes much more apparent in hardware, where other parts of the design are running much faster.

To address this, instead of propagating IPv4 routes, HAIR operates in a virtual address space where each destination network is enumerated with a fixed identifier. Here, it is assumed that each host on the Internet has an address in the form of (virtual supernet ID, virtual subnet ID). HAIR-speaking routers propagate routes to supernets, and HAIR border routers use an IGP to reach hosts internal to their attached subnets. This addressing scheme has the advantage that the hardware implementation uses the virtual supernet ID to directly index into memory for relevant routing information in constant time, without the need to traverse a trie. Since the address space would not be fragmented like the IP prefix address space and the number of unique virtual supernet IDs would be small, the routing table can be directly addressed by these IDs and still be small enough to fit in a moderately sized memory.

While changing the Internet’s addressing structure would require substantial work to deploy, several next-generation routing techniques propose routing on fixed network identifiers rather than prefixes (including AIP [12] and HLP [13]), and the virtual address space can be directly translated to AIP or HLP’s network identifiers. If changing Internet addressing is not desirable, virtual supernet IDs may be translated to IPv4 prefixes. This is done through the use of an auxiliary protocol that propagates this mapping, and by requiring routers to store this mapping in a local table (in a manner similar to HLP’s AS-to-prefix mapping protocol [13]).
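A minimal sketch of direct indexing by virtual supernet ID is given below. The ID width of 16 bits and the limit of four route slots per supernet are assumptions chosen for illustration, as is the class name DirectIndexedRib; none of these values come from this thesis.

```python
SUPERNET_ID_BITS = 16   # assumed width of a virtual supernet ID
MAX_NEIGHBORS = 4       # assumed number of route slots per supernet


class DirectIndexedRib:
    """Routing table addressed directly by virtual supernet ID."""

    def __init__(self):
        self.slots = [None] * ((1 << SUPERNET_ID_BITS) * MAX_NEIGHBORS)

    @staticmethod
    def base_address(supernet_id):
        # One multiplication (a shift in hardware) replaces the trie walk.
        if not 0 <= supernet_id < (1 << SUPERNET_ID_BITS):
            raise ValueError("supernet ID out of range")
        return supernet_id * MAX_NEIGHBORS

    def write(self, supernet_id, neighbor, route):
        if not 0 <= neighbor < MAX_NEIGHBORS:
            raise ValueError("neighbor index out of range")
        self.slots[self.base_address(supernet_id) + neighbor] = route

    def routes(self, supernet_id):
        base = self.base_address(supernet_id)
        return [r for r in self.slots[base:base + MAX_NEIGHBORS] if r is not None]


rib = DirectIndexedRib()
rib.write(7, 0, "via neighbor A")
rib.write(7, 2, "via neighbor C")
print(rib.base_address(7), rib.routes(7))
```

The address of a supernet's routes is computed from its ID alone, so the lookup cost is independent of how many destinations are stored. The memory footprint is fixed by the ID width, which is why a small, unfragmented ID space matters.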

5.3 Required Use of BGP

The BGP RFC [14] requires protocol messages to be exchanged using TCP, to provide resilience to loss and packet reorderings. However, TCP provides numerous features that complicate its design and, hence, increase the complexity of hardware implementation. For example, it performs congestion control, requires de-encapsulation and segmentation logic, and must remain backward-compatible with existing implementations. Routers, however, are connected with point-to-point connections that are often given high priority, and, therefore, are not affected by congestion, and are highly reliable. To simplify offloading, TCP is eliminated and a lightweight procedure that directly acknowledges HAIR messages is used instead.

5.4 Long and Variable-Length Attribute Strings BGP update messages have dependencies between fields that can introduce complexity in update processing. Specifically, the path attributes section includes a list of attributes, and each attribute is a triple of the form <attribute type, attribute length, attribute value>. The size of the attribute value field is variable and depends on the attribute length field. Variable-length, dependent fields limit performance, since they must be parsed in sequential order and the hardware implementation cannot take advantage of its parallel processing capabilities. These variable-length path attributes, however, allow for high levels of expressiveness and flexibility by providing extensible attributes. Some examples are community attributes, in which operators may write arbitrary strings and control operation based on them, and AS paths, which list the ASes along the path to the destination. While these extensible fields allow expressiveness, they present two additional problems: their contents are highly redundant (the same field is often sent multiple times in different update messages to the same peer) and the fields themselves are overly verbose (the information within the field can be expressed in a more compact form). This results in wasted bandwidth, which limits the rate at which update messages can be processed. To address these inefficiencies, BGP is modified to replace variable-length fields with fixed-size labels. Assigning meaning to these fixed-length labels, however, is a challenge, as they represent information that is intuitively variable-length. To deal with this challenge, label assignment is decoupled from routing updates. The resulting protocol consists of three steps. First, a unique fixed-size label is generated for each unique set of values of a set of variable-length fields. Second, the label and the corresponding values of the set of variable-length fields are advertised to the receiver. Finally, the label is used in subsequent update messages instead of the variable-length fields. The key observation behind this scheme is that the variable-length fields have to be processed only once, while the corresponding fixed-size label is reused many times in different update messages.
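The three steps can be sketched from the sender's point of view as follows. This is a minimal illustration, assuming a hypothetical `LabelAdvertiser` class, made-up message tags, and a simple counter as the label generator; the thesis does not specify these details.

```python
# Illustrative sketch of the three-step label scheme (sender side).
# Class name, message tags, and the counter-based allocator are assumptions.

class LabelAdvertiser:
    def __init__(self):
        self.label_of = {}  # attribute tuple -> fixed-size label
        self.sent = []      # messages "on the wire", kept for inspection

    def send_update(self, destination, attributes):
        key = tuple(attributes)
        if key not in self.label_of:
            # Step 1: generate a label for a value set not seen before.
            label = len(self.label_of)
            self.label_of[key] = label
            # Step 2: advertise the label together with its values, once.
            self.sent.append(("LABEL_ADV", label, key))
        # Step 3: the update itself carries only the fixed-size label.
        self.sent.append(("UPDATE", destination, self.label_of[key]))
```

Sending two updates with the same attributes produces one label advertisement followed by two short updates, which is where the bandwidth and parsing savings come from.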

5.4.1 Avoiding Label Space Fragmentation To avoid fragmentation of the label space, labels have only local meaning between routers, and label swapping [15] is used to translate labels across routers. In more detail, routers receive label advertisement messages (which propagate label-to-attribute mappings) or update messages (which propagate route changes with attributes represented as labels). When a label advertisement message is received, the router first creates a one-way mapping in its internal tables from the label (i.e., the inbound label) to the corresponding variable-length field values. In addition, the router selects an unused label and sends label advertisements to its neighbors. These advertisements include the selected label (i.e., the outbound label) and the corresponding variable-length field values. Finally, the router creates a one-way mapping from the inbound label to the outbound label. Upon receiving a HAIR update message, the router replaces all inbound labels with the corresponding outbound labels for each neighbor and, if the best route information has changed, sends out the rewritten HAIR update messages.
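The label-swapping bookkeeping can be sketched as below. This is a simplified illustration, not the HAIR design itself: the `LabelSwapper` name is hypothetical, one outbound label is shared by all neighbors, the sender of a message is excluded from the fan-out by assumption, and the best-route check that gates update forwarding is omitted.

```python
# Illustrative sketch of label swapping. Names and simplifications are assumptions:
# a single outbound label per inbound label, and no best-route check on updates.

class LabelSwapper:
    def __init__(self, neighbors):
        self.neighbors = list(neighbors)
        self.values = {}    # (interface, inbound label) -> variable-length values
        self.swap = {}      # (interface, inbound label) -> outbound label
        self.next_free = 0  # naive allocator for unused outbound labels

    def on_label_advertisement(self, in_iface, inbound, values):
        self.values[(in_iface, inbound)] = values
        outbound = self.next_free
        self.next_free += 1
        self.swap[(in_iface, inbound)] = outbound
        # Re-advertise the values under the locally chosen outbound label.
        return [(n, outbound, values) for n in self.neighbors if n != in_iface]

    def on_update(self, in_iface, destination, inbound):
        outbound = self.swap[(in_iface, inbound)]
        # Forward the update with the inbound label replaced by the outbound one.
        return [(n, destination, outbound) for n in self.neighbors if n != in_iface]
```

Because each router picks its own outbound labels, the label numbering on one link is independent of the numbering on any other link.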


5.4.2 Label Assignments In order to maximize the reusability of the labels, for a given set of attributes two labels are used: the AS Path Label, to represent the AS path attribute, and the Attribute Set Label, to represent the remaining attributes. The main intuition behind this separation is that attributes except the AS path define a local policy between peers and the same policies are used over and over again in multiple update messages, whereas the AS path is only used in loop detection and the best route decision process. Hence, the new protocol defines two new message types for advertising AS Path Labels and Attribute Set Labels. AS Path Label messages are used to advertise AS Path Labels and the corresponding AS Path to a HAIR-speaking router. Similarly, Attribute Set Label messages are used to advertise a label for the remaining set of attributes.


CHAPTER 6 HAIR IMPLEMENTATION

This chapter describes the design and implementation of the HAIR architecture, showing how the changes made to BGP to produce HAIR directly affect the simplicity and efficiency of each module in the router design.

6.1 HAIR Processor Design

Figure 6.1: FPGA-based architecture for hardware-amenable BGP (HAIR).

The architecture for a HAIR router was designed for implementation on the NetFPGA. The Network Interface Controller (NIC) and IP router reference designs and Linux drivers provided with the NetFPGA make it useful for prototyping networking-based hardware designs. The HAIR router implementation uses two low-latency SRAMs, and the HAIR processor runs at the NetFPGA’s core-clock frequency of 125 MHz. The design of the HAIR processor, shown in Figure 6.1, consists of five main modules, each of which is described below:


6.1.1 Packet Parser and Session Logic The HAIR processor still contains a packet parser and session logic, but the packet parsing is much simpler than for the BGP processor. The packet parser is deterministic and has constant throughput for update packets, due to the fixed-length fields in HAIR update packets. Once an update packet is parsed, the virtual address is sent to the Routing Table Manager and the AS Path Label and Attribute Set Label are sent to their respective Label Table Managers.

6.1.2 Label Management Two separate label tables are maintained: one for AS Path Labels and one for Attribute Set Labels. Before an update message containing references to an AS Path Label or an Attribute Set Label can be received at a HAIR router, the corresponding label advertisements must be received. When a label advertisement is received, the inbound label, the interface, and the corresponding AS path or remaining attribute set values are sent to the label management logic. Here, the variable-length fields are parsed and stored and outbound labels are assigned. For AS paths, the label management logic checks for loops in the AS path and records the result in the AS Path Label Table Entry. Whenever a label corresponding to an AS path with a loop is seen, the update packet is immediately discarded. For Attribute Sets, the label management logic separates the local pref attribute from the others and receives, from the Label Processor, the ranks of the router’s preference for the local pref value and for the remaining attribute set values relative to all other received Attribute Set Labels. These ranks are recorded in the Attribute Set Label Table Entry. Additionally, if the rank of any other label changes, its entry must also be updated in the Attribute Set Label Table. Once labels have been advertised, update messages containing those labels can be sent. When an update message is received, the AS Path Label and Attribute Set Label are extracted by the parser. Next, these inbound labels are converted to outbound labels via a label-swapping step, in which a lookup is performed by the label table managers. Assuming there is no AS path loop, the Label Management logic then sends these outbound labels, the AS path length, the local pref rank and the remaining attribute set rank to the Routing Table Manager.

6.1.3 Label Processing The computations performed with label advertisements include checking AS paths for loops, computing AS path lengths, and ranking local pref and the remaining attribute set values. Since label advertisements occur rarely, the processing of advertisements and related computation may occur in a lower-performance environment. In this implementation, this computation and the storage of the variable-length fields are performed on the NetFPGA’s host machine. When label processing is necessary, the hardware design sets a flag in a register that is continuously polled by the host machine. Upon seeing this flag, a label processor C++ program performs the necessary computations and sends the results back to the NetFPGA to update the Label Tables. This provides a simple and flexible programming interface for routing policy, and, since these calculations are performed rarely, there is minimal performance degradation from the slower processing rate of a software program running on a commodity host. It is noteworthy that these routing policy decisions are not made for every advertisement as in BGP, but only for each new unique attribute set value. This greatly reduces the amount of computation needed to compute the best route for a given destination. This function could also be performed in hardware to ensure maximal processing speed.

6.1.4 Routing Table Manager The Routing Table Manager applies each newly advertised or withdrawn route to the routing table. Since the virtual address space used in HAIR is much smaller than the space of all possible IP prefixes, the address can be used to index directly into the routing table, without the conversion between address and physical memory address that the trie structure performs in our BGP implementation. This frees up more memory for label tables and greatly reduces design complexity and processing delay. This implementation’s routing table stores advertised routes from all interfaces and a summary of all advertised routes to a destination. For a router with N neighbors, each Routing Table Entry therefore holds one Summary Entry and one Route Entry per interface. The summary includes the AS path length, local pref ranking, and remaining attribute set ranking of the current best route to the destination. It also contains the interface ID of the current best route, and a list of interface IDs that have routes to the destination (the Valid Interfaces field).

When a route is advertised to the HAIR router, the Summary Entry is read first. If the advertised route is better than the current best route, only the Summary Entry needs to be changed, and a scan of all routes to the destination does not need to be performed. Likewise, if the advertised route is worse than the current best route and it arrives on a different interface, then, again, no scan of all routes to the destination is necessary, as the current best route does not change. Finally, if a route is withdrawn and it was not the current best route, no scan of all the routes to the destination is necessary. Thus, a scan of all routing table entries for a given destination is only necessary if an advertisement arrives on the same interface as the current best route and is worse than the current best route, or if the current best route is withdrawn. This saves processing time and memory accesses, particularly for routers with many HAIR neighbors and multiple routes to choose from to the same destination. Note that this type of implementation is only possible with a protocol that supports a total ordering of all routes.

If a scan of all routes to a given destination is necessary, the Routing Table Manager reads the route entry corresponding to each interface ID indicated in the Valid Interfaces field of the Routing Table Summary Entry. Due to the small width of the SRAMs on the NetFPGA, the Routing Table Route Entries only contain details about the AS path. Thus, a lookup on the Attribute Set Label Table must be performed to discover the corresponding Outbound Attribute Set Label, Local Pref Rank and Remaining Attribute Set Rank. Once this information is returned from the Label Table Manager, it is sent to the Route-Comparison Logic to determine the best route.
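The conditions under which a full scan is required reduce to a small decision function. The sketch below is an illustration only: the function name, the string event tags, and the boolean `new_is_better` argument (standing in for the outcome of the route comparison) are assumptions.

```python
# Illustrative sketch of the "is a full scan needed?" decision.
# Function name, event tags, and arguments are hypothetical.

def needs_full_scan(event, iface, best_iface, new_is_better=False):
    """event is "advertise" or "withdraw"; iface is where the message arrived."""
    if event == "withdraw":
        # Only losing the current best route forces a rescan.
        return iface == best_iface
    if new_is_better:
        # A better route simply overwrites the Summary Entry.
        return False
    # A worse route matters only if it replaces the current best route.
    return iface == best_iface
```

In the common cases (a better route arrives, or a worse route arrives on another interface) the function returns False, so only the Summary Entry is touched.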

6.1.5 Route-Comparison Logic As the Summary Entry and routes are read out of the routing table, the Route-Comparison Logic compares them to each other and, if applicable, to the advertised route. In contrast to the many-step decision logic necessary in BGP, only three fields must be compared to rank two routes. A route with a lower local pref ranking is always chosen as the better route. If two routes have the same local pref ranking, the route with the shorter AS path is chosen. If two routes have both the same local pref ranking and AS path length, the route with the lower remaining attribute set ranking is chosen. After performing these comparisons, the Route-Comparison Logic indicates the best route to the Routing Table Manager, which may have to update the Routing Table Summary Entry for that destination. Additionally, if a new best route is chosen, an update packet is sent out to all other HAIR neighbors.
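The three-field comparison amounts to a lexicographic ordering, which can be sketched with a tuple key. The `Route` type and its field names are hypothetical; only the order of the three comparisons is taken from the text.

```python
# Illustrative sketch of the three-field route comparison.
# The Route type and field names are assumptions for illustration.
from typing import NamedTuple


class Route(NamedTuple):
    local_pref_rank: int  # lower is better
    as_path_length: int   # shorter is better
    attr_set_rank: int    # lower is better
    interface: int        # where the route was learned


def sort_key(route):
    # Tuples compare field by field, giving the tie-breaking order directly.
    return (route.local_pref_rank, route.as_path_length, route.attr_set_rank)


def best_route(routes):
    return min(routes, key=sort_key)
```

Because the key is a fixed-width tuple of small integers, the same ordering could be evaluated by a single wide comparator, which is one way to read the claim that the comparison is simple in hardware.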

6.2 Changes to the Data Plane When considering the design of a HAIR router, it is important to consider what changes, if any, must be made to the data plane. Two interactions between the HAIR control plane and the data plane of a router are proposed. One option is to continue using IP prefixes in the data plane. This would require a conversion between HAIR virtual addresses in the control plane and the IP prefixes in the data plane when updating the forwarding table. This could be done with a simple lookup table, indexed by the virtual addresses with the IP prefixes as entries. Another option would be to change the data plane to use HAIR virtual addresses instead of IP prefixes. This would require HAIR routers to encapsulate data packets with an additional header containing the HAIR virtual address. Although the longer packet size would decrease maximum throughput, the use of virtual addresses would eliminate the need for longest-prefix matching, and with it the requirement for an expensive TCAM or a computationally expensive trie-lookup structure.

6.3 Prototype Platform Limitations Implementing the HAIR router design on the NetFPGA led to some challenges and forced an evaluation of the feasibility of using a NetFPGA in a real deployment. The low-latency SRAMs on the NetFPGA (two 2.25 MB, 36-bit-wide chips) are not large enough to store the entire AS Path Label Table and routing table. Thus, the virtual address space was constrained to 15 bits. In practice, larger memories that would be able to store routes for all possible destinations in SRAM would be preferable. In terms of chip area, the HAIR processor design (not including Ethernet MAC and transport layer support) used 9788 of 23616 FPGA slices. Note that these figures are for the NetFPGA, which has a Xilinx Virtex-II Pro-50 FPGA. This design, however, could be targeted to other, smaller devices.
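The figures above can be turned into a rough capacity estimate. The short calculation below is only a back-of-the-envelope sketch: it assumes that "MB" means 2^20 bytes and that an entire chip is devoted to per-destination state, which overstates what is available since the label tables share the same memories.

```python
# Back-of-the-envelope arithmetic from the quoted figures.
# Assumptions: MB = 2**20 bytes; a whole chip is used for per-destination state.
MIB = 1 << 20
chip_bits = int(2.25 * MIB) * 8   # one 2.25 MB SRAM chip, in bits
words_per_chip = chip_bits // 36  # number of 36-bit words per chip
destinations = 1 << 15            # 15-bit virtual address space
words_per_destination = words_per_chip // destinations  # upper bound per chip
slice_utilization = 9788 / 23616  # fraction of FPGA slices used
```

Under these assumptions each chip holds 524,288 words, which leaves at most 16 words per destination per chip, and the design occupies roughly 41% of the FPGA's slices.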


CHAPTER 7 DEPLOYMENT CONSIDERATIONS OF HAIR

The previous chapter presented a design for a single HAIR router. There are further considerations, however, regarding the deployment of a network of HAIR routers that are presented in this chapter. Specifically, this chapter discusses advanced label assignment and inter-operation between BGP routers and HAIR routers.

7.1 Supporting Standard Routing Policies Most BGP policies can be described as performing an ordered ranking over the set of advertised routes. To increase processing speed of the design further, this ranking information can be embedded within assigned labels. For example, the label itself may correspond to its placement within the ranking of routes, and route selection then simply becomes a matter of performing a numeric comparison to determine the lowest-labeled route. While enumerating routes in this fashion presents challenges, this process is simplified by having a hardware-based RCP [16] compute an enumeration over the set of historically visible routes and apply a ranking. HAIR label advertisements will then be sent directly to the RCP, and the RCP will install label maps into the routers. Then, only newly visible routes that were not assigned rankings need to undergo the full decision process. From parsing BGP updates, it was found that these historical rankings are quite stable, with less than 1 percent of routes in a week not appearing within the previous advertisements.

7.2 Inter-Operating with Standard BGP Unlike the design given in Chapter 4, a HAIR router cannot directly peer with existing BGP routers. This complicates incremental deployment, especially since the HAIR design may only ever be deployed on certain routers or in regions of the network where cost concerns or processing requirements are the highest. However, translation between routing protocols has been widely studied in the context of traditional protocols, through techniques known as redistribution. Here, routes from one protocol are re-advertised into another protocol. HAIR routes can be simply redistributed into standard BGP and vice versa since they use the same addressing structures and protocol formats. The main challenge is in converting protocol messages, which requires translation from labels to BGP update contents, and vice versa. Finally, this design is amenable to other deployment strategies, like tunneling (e.g., forwarding updates through GRE tunnels over domains that do not support HAIR), and dual-stack (e.g., routers maintain processing engines for both HAIR and traditional BGP, and demultiplex a message to the appropriate engine based on the version number in the update header).


CHAPTER 8 EVALUATION OF HAIR

To evaluate processing performance, BGP updates are replayed against the hardware designs, measuring pass-through time [17], the amount of time required to process a routing update, and throughput, the number of routing updates processed per given unit of time (note that processing time is not the inverse of throughput for the designs that are pipelined, where multiple updates are processed at different stages at the same time). In addition to measuring pass-through time for the entire design, microbenchmarks are performed, where the design is instrumented with counters to determine (for each update) the amount of time it spent in each module of the design. Additionally, the sensitivity of the design to changes in the nature of the workload and the size of the workload over time is discussed.
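The remark that processing time is not the inverse of throughput in a pipelined design can be made concrete with a little arithmetic. The sketch below assumes the 125 MHz core clock reported for the NetFPGA; the stage count and cycles per stage are hypothetical values chosen only for illustration, and the function names are made up.

```python
# Illustrative arithmetic: latency versus throughput in a pipeline.
# The 125 MHz clock is from the text; stage counts below are hypothetical.
CLOCK_HZ = 125_000_000
CYCLE_NS = 1e9 / CLOCK_HZ  # 8 ns per cycle


def pass_through_ns(stages, cycles_per_stage):
    """Time for one update to traverse every stage of the pipeline."""
    return stages * cycles_per_stage * CYCLE_NS


def throughput_per_s(cycles_per_stage):
    """Updates completed per second once the pipeline is full."""
    return CLOCK_HZ / cycles_per_stage
```

For a hypothetical 4-stage pipeline with 3 cycles per stage, the pass-through time is 96 ns, yet a new update completes every 24 ns, so throughput is four times the inverse of the pass-through time.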

8.1 Methodology To collect these results for the hardware-based BGP processor, the design is run in the ModelSim FPGA simulation environment. For the HAIR processor implementation, the HAIR processor design is loaded on the NetFPGA and communicated with over the PCI interface. To evaluate performance under realistic workloads, Route Views traces [18] are replayed against the designs. This is done by randomly selecting four vantage points to act as neighbors to the router. Traces collected during October 2008 are replayed, removing all time between updates such that all the updates arrive at the router simultaneously. To eliminate cold-start effects, routing tables are preloaded before replaying updates. In addition to evaluating the hardware-based BGP and HAIR FPGA-based designs, black-box results for the Quagga open-source software router and a C++ implementation of a HAIR processor are also collected for comparison.

8.2 Throughput and Processing Delay The two most important performance indicators are the throughput (update processing rate) and the per-update processing delay (pass-through time) of the design. It is important for protocol designs to have high throughput and low processing delay, as this allows them to handle sudden bursts of updates, to accelerate the convergence process, and to reduce the cost of hardware (allowing cheaper and lower clock frequency components). Throughput is measured as the number of updates that are processed per second. Four designs are compared: FPGA-HAIR (the FPGA implementation of a HAIR processor), FPGA-BGP (the design of the standard BGP protocol, running on an FPGA), SW-HAIR (the C++ implementation of a HAIR processor), and SW-BGP (the Quagga [19] open-source software router, running on a single core of a 3 GHz Intel Core2 Duo processor). Comparing designs directly against Quagga is unfair because Quagga contains timers that hold back updates, lowering its measured throughput and slowing convergence. To address this, Quagga is also optimized to pass updates through immediately by disabling these timers (SW-BGP-opt), thereby improving its throughput. Figure 8.1 shows throughput (update processing rate) and processing delay.

[Figure 8.1 plots. Axes: Fraction versus Throughput [updates/second]; Fraction versus Processing delay [nanoseconds]. Series: FPGA-HAIR, SW-HAIR, FPGA-BGP, SW-BGP-opt, SW-BGP.]

Figure 8.1: Performance results: update processing performance.

Overall, offloading BGP to hardware provides more than an order of magnitude improvement in throughput over the Quagga software router, and the HAIR implementation improves upon this by another order of magnitude. Similarly, offloading BGP improves per-packet processing delay, and a HAIR implementation reduces delay further. In addition, the FPGA-BGP design reduces delay variability by over an order of magnitude, and HAIR nearly eliminates variability by attaining a tight upper bound on processing delay across all updates. Additionally, although HAIR is meant to be amenable to hardware implementations, it can also be processed with higher throughput and lower delay in software than BGP can. To localize the bottlenecks, the design is instrumented with counters (Figure 8.2) to measure the amount of time updates spend in each component. For FPGA-BGP, the memory management component that manages the trie data structure is the greatest source of delay, with the parser, which deals with variable-length fields, close behind. HAIR attains its performance gains by mitigating bottlenecks in the parser and decision logic and completely eliminating the trie lookup.

[Figure 8.2 plot. Axes: Fraction versus Processing time [nanoseconds]. Series: HAIR-parser, HAIR-label mapping & decision logic, FPGA-BGP-dec. logic, FPGA-BGP-parser, FPGA-BGP-trie lookup.]

Figure 8.2: Performance results: microbenchmarks.

8.3 Sensitivity to Workload Changes To evaluate whether these results hold across a variety of workloads, different update traces are replayed against the designs. First, the year in which the trace was collected is varied, by replaying a trace of the same length from April 2001, 2004, and 2008. There is a slight increase in processing delay in traces from later years in FPGA-BGP due to an increased trie size (there is a negligible increase in SW-BGP, as this effect is masked by the magnitude of processing delay). HAIR undergoes no increase in processing delay, as the trie is replaced by a constant-time lookup. Next, the number of neighbors (peers) attached to the router is varied (Figure 8.3). All designs undergo some decrease in performance with more neighbors, due to the larger number of routes being processed. However, in the hardware-based versions this effect is small, incurring, for example, only 0-2 additional cycles per neighbor in HAIR per withdrawal or advertisement.

[Figure 8.3 plots. Axes: Fraction versus Throughput [updates/second]; Fraction versus End to end delay [nanoseconds]. Series: n=2, n=4, n=6 for HAIR and for FPGA-BGP.]

Figure 8.3: Sensitivity to workload: effect of varying number of neighbors on processing performance.

8.4 Properties of Workload Understanding the fundamental level of parallelism achievable in a protocol is important, as hardware-based technologies such as multicore make it possible to perform multiple computations at the same time. To evaluate this, update traces are analyzed and the number of updates that could be processed simultaneously is computed. Two updates cannot be processed simultaneously if they read/write the same prefix (Figure 8.4(a)). Interestingly, when “spikes” of updates are received at the router, parallelizability increases by a large amount. This is shown by the long tail in Figure 8.4(a), which extends far beyond the right side of the plot shown. In this case, the 50th percentile is less than 10, but the average over the entire trace is 217. This is important because processing speed is most crucial during times of elevated load, when slow processing would worsen convergence. Such load arises because link failures cause large numbers of prefixes to be simultaneously withdrawn or advertised, but do not typically trigger multiple updates to the same prefix. While the design presented in this thesis performs parallel processing across updates only to the extent of its pipeline, the high degree of parallelism present in BGP data can be leveraged by replicating the design within a single FPGA, and balancing updates across the replicas. This type of replication can also be useful to access limited shared resources that are not being fully utilized. In the FPGA implementation of a HAIR processor, the SRAM memory usage is around 20-30% for the two SRAM chips (Figure 8.4(b)). Thus, the presented design could benefit from having multiple processing elements sharing the SRAM. To see the effects of this, the effects of replication on throughput and delay are simulated. Figure 8.5 shows that increasing the number of replicas to two or three can effectively increase the throughput without a large increase in delay.

[Figure 8.4 plots. (a) Fraction versus Parallelism [simultaneous updates]; series: 2003, 2005, 2007. (b) Fraction versus SRAM utilization [%]; series: label_table, routing_table.]

Figure 8.4: Properties of workload: (a) level of parallelism in update traces and (b) SRAM utilization.

[Figure 8.5 plots. Axes: Fraction versus Throughput [updates/second]; Fraction versus Delay [nanoseconds]. Series: n=1 through n=5 replicas.]

Figure 8.5: Effect of multiple replicas on processing performance.

Finally, the design presented allows for an RCP-like system [16] to compute label assignments for routers, to further improve performance. To evaluate the feasibility of this approach, workloads are studied to capture the number of label changes that occur over time, which corresponds to the number of times the RCP would need to refresh label mapping tables within routers. Over a 10-day trace, out of 3.2 million updates, only 85,895 unique label mappings need to be published.
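The parallelism measurement described earlier in this section (two updates conflict when they touch the same prefix) can be approximated with a simple greedy pass over a trace. The thesis does not state the exact procedure used, so the sketch below is one plausible reading: it groups consecutive updates into batches and starts a new batch whenever a prefix repeats. The function name is hypothetical.

```python
# Illustrative sketch: greedy grouping of a trace into conflict-free batches.
# This is an assumed procedure, not necessarily the one used in the thesis.

def parallel_batches(prefixes):
    """Split an ordered list of updated prefixes into runs with no repeats."""
    batches, current, seen = [], [], set()
    for prefix in prefixes:
        if prefix in seen:
            # Conflict with an update already in the batch: close the batch.
            batches.append(current)
            current, seen = [], set()
        current.append(prefix)
        seen.add(prefix)
    if current:
        batches.append(current)
    return batches
```

The length of each batch is then the number of updates that could be processed simultaneously at that point in the trace; a burst of withdrawals for distinct prefixes yields one long batch.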

8.5 Evaluation Results Summary To summarize the results, hardware implementations of these routing protocols achieve higher throughput and incur lower delay than their software counterparts. Furthermore, a hardware-aware protocol, such as HAIR, can be processed with greater performance than a hardware-unaware protocol such as BGP. This holds in direct comparisons for both software and hardware implementations. Additionally, HAIR is far simpler to implement in hardware and scales better with more neighbors than BGP. Finally, routing updates can be processed using parallel processors, as adjacent updates do not typically affect each other.


CHAPTER 9 A HARDWARE-BASED NETWORK SIMULATOR

Given these results, it is clear that hardware-aware and hardware-amenable designs can lead to tremendous performance benefits for the processing of Internet protocols, such as BGP. These benefits, however, are not only limited to Internet routing or network protocols in general. In fact, this same technique can be applied to many disciplines related to networked systems, including network simulation. Network simulators are imperative for rapid development of network protocols. By testing the functionality and performance of new or updated protocols in simulation, network researchers are able to gauge the utility of their designs without going through the long and expensive process of deployment. For specific types of simulations, however, the leading network simulator, ns-2, can take hours or days to simulate just a few minutes of network traffic [20]. In these cases, the type of simulation that ns-2 uses, discrete-event simulation, may be suboptimal. In this chapter, the possibility of a hardware-based network simulator is explored. It is shown that, for simulations with high amounts of network traffic, a hardware-based simulator can run in real-time, eclipsing the performance of ns-2. This chapter is structured as follows: Section 9.1 gives some background to the design of ns-2 and sheds some light on its performance shortcomings, Section 9.2 describes a hardware-based network simulator (called HW-NS) and Section 9.3 provides an analysis of ns-2 and the hardware-based network simulator.

9.1 ns-2 Design As the favored network simulator, ns-2 provides a powerful environment for simulating arbitrary networks. It is a discrete-event simulator that runs in a single computational thread, computing each event in the simulated network in the order it would occur chronologically. Simulations are specified using oTcl scripts describing the simulated network’s topology and traffic characteristics. The topology is described by specifying nodes and links in the network, and network traffic is described by identifying traffic sources and sinks, which are attached to the network nodes. Protocol agents are then attached to the traffic sources and are used to send the raw data generated by traffic sources across the simulated network to the sink. The simulation is performed by processing network events in a single thread. Network events can represent the departure, arrival, enqueue or dequeue of a packet. These events are inserted into a priority queue, which keeps track of the event with the smallest (soonest) timestamp. This event is then processed, which typically places the packet in a packet queue or on a network link in the simulated network. If this happens, a new event is generated with a later timestamp corresponding to the next time that the packet changes locations in the network. Because each event may generate later events in this way, the simulation program is executed as a single thread to ensure that events are processed in the correct order. This sequential design is inefficient for simulations with a large number of events, as CPUs are now capable of processing multiple threads at once and are not becoming faster at processing single threads. Parallel event-driven simulation designs do exist, but they are complicated by the need either to predict future events or to roll back events that were processed too soon in order to stay accurate [21]. Additionally, they are limited by the parallelism of the processing platform, and, with a finite number of processing cores, the run time of the simulator still scales with the number of events in the system in a given time period.
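The event loop described above can be sketched with Python's standard `heapq` module. This is a generic discrete-event loop written for illustration, not ns-2 code: the event kinds, the single fixed link delay, and the function name are assumptions.

```python
# Illustrative sketch of a single-threaded discrete-event loop.
# Event kinds, the fixed link delay, and the function name are assumptions.
import heapq
import itertools


def simulate(initial_events, link_delay, until):
    """Process (time, kind, packet) events in timestamp order."""
    counter = itertools.count()  # tie-breaker so equal timestamps stay ordered
    queue = [(t, next(counter), kind, pkt) for t, kind, pkt in initial_events]
    heapq.heapify(queue)
    log = []
    while queue:
        time, _, kind, pkt = heapq.heappop(queue)  # soonest event first
        if time > until:
            break
        log.append((time, kind, pkt))
        if kind == "depart":
            # A departure schedules the matching arrival one link delay later.
            heapq.heappush(queue, (time + link_delay, next(counter), "arrive", pkt))
    return log
```

Every event passes through the one priority queue, so the run time grows with the number of events rather than with the length of simulated time, which is the scaling behavior the text identifies as the bottleneck.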

9.2 HW-NS Design To leverage parallelism effectively and to avoid scaling with the number of events in the system in a given time period, a design other than event-based simulation is needed. More specifically, a simulator whose run time scales with the simulated time, so that it can run in real time or faster, would be ideal. To design such a simulator, events would need to be processed in parallel and the simulator processing time would need to be proportional to the simulated time. Such a design would require an arbitrary number of parallel processing elements, a feature that is not available to software implementations. Custom hardware design, however, allows for arbitrary designs. While custom silicon design is too expensive and inefficient for network simulation, FPGAs are a fitting platform for such an application. By combining the customizability and relatively high speeds of FPGAs, fast custom-built network simulators are quite achievable. To demonstrate this, a network-simulation architecture that runs on an FPGA is presented. Targeted at high data-volume simulations, it measures the presence of packets in the simulated network and the loss rates in the network. Since logging the packet data for such a simulation would require writing a large amount of data to disk, the packet contents are ignored and only the properties of packets that control their presence in the network are considered. Specifically, the length and destination of each packet are the only information logged. This allows for a simple implementation, saving chip area and complexity, which allows the simulator to run in real time for bandwidths up to one byte per clock cycle. Similar to the ns-2 design, this network simulator uses the concept of traffic sources and sinks, protocol agents, nodes and links. Separate hardware modules are created for each, and are described in the following subsections. Additionally, the generation of these modules from an ns-2 style oTcl script and the software interface to the hardware simulator are described.

9.2.1 Sources and Sinks

Traffic sources are implemented as simple finite state machine modules that control the generation of packets. For this simple implementation, only constant bit-rate (CBR) sources are used. These sources are configurable with arbitrary packet lengths and arbitrary time intervals between packets. Additionally, packet counters are used to track the amount of data sent from each source, and a register interface is used to interactively read these counters during and after simulation. Traffic sinks simply count received packets and discard them.
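As a rough illustration of the CBR source described above, the following Python sketch models a source as a cycle-stepped state machine that emits one byte per clock cycle. It is a behavioral model under stated assumptions, not the thesis hardware: the class name `CbrSource`, its fields, and the choice to express the interval in clock cycles are all hypothetical.

```python
class CbrSource:
    """Cycle-stepped constant bit-rate source (hypothetical behavioral model)."""

    def __init__(self, packet_len_bytes, interval_cycles):
        # Assumption: the interval is measured from the start of one packet
        # to the start of the next, so it must cover the packet itself.
        if interval_cycles < packet_len_bytes:
            raise ValueError("interval must be at least the packet length")
        self.packet_len = packet_len_bytes
        self.interval = interval_cycles
        self.cycle_in_period = 0  # position inside the current period
        self.bytes_sent = 0       # counters a register interface could expose
        self.packets_sent = 0

    def tick(self):
        """Advance one clock cycle; return True if a byte is emitted."""
        sending = self.cycle_in_period < self.packet_len
        if sending:
            self.bytes_sent += 1
            if self.cycle_in_period == self.packet_len - 1:
                self.packets_sent += 1  # last byte of the packet
        self.cycle_in_period = (self.cycle_in_period + 1) % self.interval
        return sending


# Usage: 3-byte packets every 5 cycles, observed for 12 cycles.
source = CbrSource(packet_len_bytes=3, interval_cycles=5)
emitted = [source.tick() for _ in range(12)]
```

A traffic sink in the same style would only need a counter that increments on the last byte of each received packet.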


9.2.2 Agents

Protocol agents are implemented as finite state machines that send the raw data generated by a traffic source according to the specification of the network protocol. Because this implementation is only a proof of concept, a full protocol is not necessary, and the agents used here simply forward the data received from the traffic source toward the destination.

9.2.3 Nodes

Network nodes consist of input ports, output ports, pairs of traffic sources and protocol agents, traffic sinks, packet queues and a crossbar. The crossbar allows the traffic coming from input ports and protocol agents to be directed toward the appropriate traffic sink or output port. To save processing time and chip area, the crossbar is not implemented as a general switching fabric. Instead, the destination output port or traffic sink is pre-determined based on the source of the traffic entering the crossbar. Because two incoming packets from different sources can be routed to the same output port, a packet queue is implemented that stores these packets as they are sent one by one through the output port. The structure of a node is shown in Figure 9.1.

Figure 9.1: Design of HW-NS node.
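The static crossbar and output queue can be sketched in software as a fixed lookup table followed by a bounded FIFO. This is a minimal model under assumptions: the names `Node`, `accept` and `transmit`, the queue capacity measured in packets, and the drop-on-full policy are illustrative and are not taken from the thesis hardware.

```python
from collections import deque


class Node:
    """Node with a pre-determined (static) crossbar and one queue per output."""

    def __init__(self, static_routes, queue_capacity):
        # static_routes maps each traffic entry point (input port or agent)
        # to a fixed output port or sink, decided before the hardware exists.
        self.static_routes = dict(static_routes)
        self.queues = {out: deque() for out in set(self.static_routes.values())}
        self.capacity = queue_capacity  # assumption: capacity in packets
        self.dropped = 0                # loss counter for this node

    def accept(self, entry_point, packet_len):
        """Route a packet by its entry point; drop it if the queue is full."""
        queue = self.queues[self.static_routes[entry_point]]
        if len(queue) >= self.capacity:
            self.dropped += 1
            return False
        queue.append(packet_len)
        return True

    def transmit(self, output):
        """Send the oldest queued packet on an output, if there is one."""
        queue = self.queues[output]
        return queue.popleft() if queue else None


# Usage: an input port and a local agent both feed the same output port.
node = Node({"in0": "out0", "agent0": "out0"}, queue_capacity=1)
first = node.accept("in0", 500)
second = node.accept("agent0", 500)
```

Because the table is fixed, no per-packet routing decision is needed, which is the saving the text attributes to avoiding a general switching fabric.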

9.2.4 Links

Network links, while simple data transmission units, require a fair amount of chip area when designed in hardware. In order to keep up with real-time simulation, these modules must be able to store the same amount of data that an actual network link would have in transit at once. Thus, for high-delay or high-bandwidth links, a large amount of data needs to be stored. For this purpose, Block RAMs (BRAMs), which are small dual-ported memories with single-cycle delay that can store kilobytes of data, are used.
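The storage requirement described above is the bandwidth-delay product of the link. The sketch below works through that arithmetic with integer math; the function names and the 2048-byte BRAM capacity are assumptions for illustration (the text only says BRAMs hold kilobytes of data).

```python
def link_buffer_bytes(bandwidth_bps, delay_us):
    """Bytes in flight on a link: bandwidth times delay, rounded up."""
    bits_times_1e6 = bandwidth_bps * delay_us
    return -(-bits_times_1e6 // 8_000_000)  # ceiling division


def brams_needed(buffer_bytes, bram_bytes=2048):
    """Number of BRAMs for a buffer; bram_bytes is an assumed capacity."""
    return -(-buffer_bytes // bram_bytes)


# Usage: a 1 Gbps link with 10 ms of propagation delay.
in_flight = link_buffer_bytes(1_000_000_000, 10_000)
blocks = brams_needed(in_flight)
```

Under these assumptions a single 1 Gbps, 10 ms link already needs about 1.25 MB of storage, which shows why high-delay or high-bandwidth links dominate chip area.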

9.2.5 Hardware Module Generation

The creation and connection of these hardware modules are done in software. Certain network simulations can be implemented in HW-NS from the same oTcl script that ns-2 uses. HW-NS contains an oTcl program that parses the network simulation specification to create the appropriate hardware modules. Instead of implementing the entire specified network, the oTcl program saves chip area by implementing only the parts of the network on which data traffic will run. Thus, routing for the traffic is static and is preassigned before the hardware modules are generated. Once the partial topology is generated, hardware modules representing nodes, links, traffic sources and sinks, and protocol agents are generated, and their connections are specified. This hardware specification is then synthesized and programmed into the NetFPGA.
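The pruning step can be illustrated with a short sketch that keeps only the nodes and links lying on some flow's path. The thesis performs this step in oTcl; the Python below is a stand-in, and both the function names and the use of breadth-first shortest paths as the static routing choice are assumptions.

```python
from collections import deque


def shortest_path(adjacency, src, dst):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    parents = {src: None}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                frontier.append(neighbor)
    return None


def prune_topology(adjacency, flows):
    """Return the nodes and directed links that carry at least one flow."""
    used_nodes, used_links = set(), set()
    for src, dst in flows:
        path = shortest_path(adjacency, src, dst)
        if path is None:
            raise ValueError(f"no route from {src} to {dst}")
        used_nodes.update(path)
        used_links.update(zip(path, path[1:]))
    return used_nodes, used_links


# Usage: node "c" carries no traffic, so it would not be generated.
topology = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
nodes, links = prune_topology(topology, [("a", "d")])
```

Only the modules for the returned nodes and links would then be emitted, and each node's crossbar table follows directly from the stored paths.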

9.2.6 Software Interface

As mentioned before, the hardware simulator does not log the data for every packet simulated. Instead, it maintains a set of counters that track the transmission and loss rates of packets at traffic sources, packet queues, network links and traffic sinks. These counters are exposed as registers that can be interactively probed during simulation. The software interface for HW-NS periodically polls and resets these registers, and records the packet counts for further analysis.
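A poll-and-reset loop of this kind can be sketched as follows. The register file is modeled as a plain dictionary, and the register names, `poll_and_reset` and `loss_rate` are hypothetical; the real interface would read NetFPGA registers instead.

```python
def poll_and_reset(registers, log):
    """Record a snapshot of every counter, then clear the counters."""
    snapshot = dict(registers)
    for name in registers:
        registers[name] = 0
    log.append(snapshot)
    return snapshot


def loss_rate(sent, received):
    """Fraction of sent packets that never reached the sink."""
    return 0.0 if sent == 0 else (sent - received) / sent


# Usage: one polling period with hypothetical counter names.
registers = {"src0_sent": 10, "sink0_received": 8}
log = []
snapshot = poll_and_reset(registers, log)
rate = loss_rate(snapshot["src0_sent"], snapshot["sink0_received"])
```

Resetting after each poll keeps the counters small and turns every snapshot into a per-interval count that can be analyzed offline.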

9.3 Network Simulator Analysis

To analyze the performance of the hardware network simulator, experiments with varying bandwidths are run. The BRITE [22] topology generator is used to randomly generate a network of 50 nodes. Then ten traffic sources (20% of the total number of nodes in the network) are placed randomly on the leaf nodes (those with the smallest degree, as specified by BRITE). From this, an oTcl script is produced and is run in both ns-2 and HW-NS. HW-NS, as implemented on the NetFPGA, runs at 125 MHz and processes one byte per clock cycle, so it can run in real-time for bandwidths up to 1 Gbps, regardless of the amount of traffic being sent through the simulated network. For ns-2, however, the simulation time depends on the amount of traffic. This is shown in Figure 9.2, which contrasts it with the constant run-time of HW-NS.
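The 1 Gbps figure and the difference in scaling follow from simple arithmetic, sketched below. The clock rate, byte-per-cycle rate and ten sources come from the text; the function names are illustrative, and the packet count is only a rough proxy for the work an event-based simulator performs, since each packet triggers several events.

```python
CLOCK_HZ = 125_000_000  # NetFPGA clock rate from the text
BITS_PER_CYCLE = 8      # one byte per clock cycle


def max_realtime_bps():
    """Highest link bandwidth the hardware can simulate in real-time."""
    return CLOCK_HZ * BITS_PER_CYCLE


def hw_cycles(sim_seconds):
    """Hardware work depends only on the simulated time."""
    return sim_seconds * CLOCK_HZ


def packets_generated(sim_seconds, interval_us, sources=10):
    """Packets created by all CBR sources; grows as the interval shrinks."""
    return sources * (sim_seconds * 1_000_000 // interval_us)


# Usage: 10 simulated seconds with a 100 microsecond inter-packet interval.
limit = max_realtime_bps()
cycles = hw_cycles(10)
packets = packets_generated(10, 100)
```

Halving the inter-packet interval doubles `packets` but leaves `cycles` unchanged, which is the contrast the experiment is designed to show.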

[Figure 9.2 plot not reproduced. Axes: Simulation Time (s) versus Inter-packet Interval (s). Series: ns-2 and HW-NS, each labeled 1s, 10s and 100s.]

Figure 9.2: Comparison of HW-NS and ns-2 performance. As the inter-packet interval is varied for 500-byte packets, the run-time of HW-NS scales linearly with the simulated time only, and not with the amount of traffic simulated.

Since HW-NS is able to achieve real-time simulation speeds, it outperforms ns-2, whose run-time scales linearly with the amount of traffic per unit time. Although generating the hardware on the NetFPGA platform can take anywhere between 30 and 80 minutes, depending on the size of the simulation, this is a fixed cost that does not grow with the amount of traffic simulated. Additionally, a HW-NS design can easily allow parameters of the traffic sources, links and packet queues to be set by programmable registers. This would allow the interfacing software to run different experiments on the same topology without having to re-generate the hardware modules.

While its capabilities are limited, the measurements taken by HW-NS are designed to be identical to those of ns-2. Depending on the type of simulation, different measurement devices can be designed for use with HW-NS to take only the measurements that interest the user. Thus, the design might differ between simulations. Furthermore, there are other designs that might perform better than HW-NS for different types of simulation. They are discussed briefly below.

9.3.1 Larger Network Simulation

One weakness of HW-NS is that a 50-node network running the experiment above uses up the resources of the NetFPGA. While newer, larger FPGAs exist, they will still run into issues when simulating large networks. If this becomes an issue, an alternate design of HW-NS would use simpler processing elements. Specifically, the packet queues could become much simpler if they are allowed to run slower than real-time and insert packets into the queue one at a time. Also, the packets held in the queues take a lot of space on the FPGA, since they are stored in registers. Again, speed can be sacrificed by using a RAM instead, which would allow for a larger simulated network. Finally, instead of using Block RAM to store packets that are in queues or in transit on links, using higher-latency SRAM or SDRAM would save chip space, but would slow down the simulation considerably.

9.3.2 Variable Run-Time Speed

HW-NS runs at the speed of one byte per clock cycle. For the NetFPGA, this is equivalent to simulating 1 Gbps links. For slower simulated links, HW-NS can actually run faster than real-time. For faster links, however, HW-NS will run slower than real-time. To mitigate this effect, HW-NS could run in real-time for bandwidths greater than 1 Gbps by requiring every packet length to be a multiple of the speed-up, and requiring every event in the network to occur at a time that is a multiple of the speed-up as well. For example, 2 Gbps links could be simulated if each one-byte signal becomes a two-byte signal, and every packet is sent or received at nodes on even clock cycles.
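One possible reading of the speed-up constraints is sketched below: with a speed-up of k, each hardware cycle carries k bytes, so packet lengths must divide evenly by k and event times (counted here in byte times of the faster link) are rounded up to a multiple of k. The function names and the choice of time unit are assumptions, not part of the thesis design.

```python
def hw_cycles_for_packet(packet_len_bytes, speedup):
    """Hardware cycles needed to carry one packet at the given speed-up."""
    if packet_len_bytes % speedup != 0:
        raise ValueError("packet length must be a multiple of the speed-up")
    return packet_len_bytes // speedup


def quantize_event_time(byte_times, speedup):
    """Round an event time up to the next multiple of the speed-up."""
    return -(-byte_times // speedup) * speedup


# Usage: simulating 2 Gbps links on hardware that moves one signal per cycle.
cycles = hw_cycles_for_packet(500, speedup=2)
aligned = quantize_event_time(7, speedup=2)
```

The cost of this approach is reduced timing resolution: events can only be placed on a grid whose spacing grows with the speed-up.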


9.3.3 Event-Driven Processing

Event-driven processing could still be improved upon if it were implemented in hardware. Such a design would include a priority queue of events, a scheduler to assign events to different processing modules, and the processing modules themselves, which would process events and generate new ones. While this design cannot be made fully parallel, it is possible to create replicas of processing modules that allow multiple events to be processed at once. Also, smart logic that could determine whether two events could interfere with each other could be used to allow some events to be processed out of order.
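The three components named above (priority queue, scheduler, processing modules) correspond to the classic discrete-event loop, sketched here sequentially in Python. The event format `(time, hop)`, the `forward` handler and the two-unit link delay are hypothetical; a hardware version would dispatch independent events to replicated processing modules instead of handling them one at a time.

```python
import heapq


def run_events(initial_events, handler, end_time):
    """Process events in time order; handlers may schedule new events."""
    queue = list(initial_events)
    heapq.heapify(queue)  # priority queue ordered by event time
    processed = []
    while queue and queue[0][0] <= end_time:
        event = heapq.heappop(queue)         # scheduler picks earliest event
        processed.append(event)
        for new_event in handler(event):     # processing module
            heapq.heappush(queue, new_event)
    return processed


def forward(event):
    """Hypothetical handler: a packet at a hop arrives at the next hop later."""
    time, hop = event
    return [(time + 2, hop + 1)] if hop < 2 else []


# Usage: two packets injected at hop 0 and carried across two links.
trace = run_events([(0, 0), (1, 0)], forward, end_time=10)
```

In this trace the two packets never touch the same hop at the same time, so their events are independent, which is the kind of condition the interference-detection logic would look for.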

9.3.4 HW-NS Analysis Summary

From this analysis, it is clear that hardware-based network simulators can be useful. For certain simulations, HW-NS can run in real-time, whereas ns-2 has a run-time that scales with the amount of traffic simulated. And while different types of simulations might lead to different simulator designs, a hardware-aware or hardware-amenable design can take advantage of the hardware’s parallelism. Furthermore, an intelligent simulator could choose the most appropriate design for itself at run-time to maximize performance for all types of simulation, thus redefining where to draw the hardware and software boundary for each simulation.


CHAPTER 10 RELATED WORK

This question, of where the boundary between software and hardware should be, has been a long-standing and widely-investigated question in the field of computer science. The field of hardware-software codesign focuses on generating designs for systems that are composed of both a microprocessor and a hardware-based logic circuit [23]. Co-compilation techniques are used to automatically transform a high-level language into software modules running atop the processor with the rest compiled into logic circuits [24]. While vast advances have been made in this area, additional gains are often attained by leveraging domain-specific information and techniques. The work presented in this thesis is complementary to codesign, as it encourages hardware-aware and hardware-amenable designs, which are meant to perform well on the underlying hardware. Within the realm of networked systems solutions, hardware offloading may reduce computational costs and speed throughput. First, hardware offloading for TCP is argued to be useful to reduce data copy costs in systems where the host bus is the main bottleneck [5]. Several vendors are beginning to provide network equipment to support TCP offloading, including Broadcom, Chelsio, and Neterion. Moreover, some solutions are being designed to be hardware-amenable, such as XTP [25]. Second, hardware technologies are commonly used for monitoring workloads. Hardware-based counters are used for monitoring aggregate statistics of data traffic [26] and characterizing anomalies. Third, a variety of protocols at lower layers of the protocol stack are implemented directly in hardware or firmware, such as MAC and physical-layer protocols. This is done to improve processing speed, to reduce reaction time to outages, and to reduce component cost. Fourth, in the realm of simulation, FPGAs have been used for custom simulations of very specific networks, such as road traffic simulation [27] and microprocessor interconnect network simulation [28]. 
Datacenter network simulation using FPGAs
has been explored; however, the approach taken runs software threads on soft-core processors programmed on the FPGA [29]. While this gives more accurate results than simple simulation like in ns-2, the simulations take significantly longer or use more resources. Additionally, parallel discrete-event processing has been explored [30], but the techniques are still limited to the parallelism of the underlying hardware. Preliminary work has been done in [31] with general network simulation on an FPGA, but the design does not function identically to ns-2 in the experiment presented in this thesis and does not run at real-time speeds. Finally, packet processing such as regular expression matching [32], deep packet inspection [33], pattern matching [34], worm detection [35] and firewalls [36] have been implemented to keep up with line rates. There has also been work on offloading web server traffic [37] and spam email processing [38] to FPGAs. Video processors and digital signal processing chips (DSPs) have been used for years. And while the idea of hardware offloading is nothing new, the attitude toward hardware design is that it should only be used when software fails to perform, and not viewed as an acceleration tool. In addition to hardware offloading, performance may also be improved by other means. Computation time may be reduced by using more efficient algorithms and caching results of previous computations. System-wide bottlenecks may be reduced by increasing bandwidth between devices, incorporating more powerful hardware, or configuring the system to reduce unnecessary processing. For simulation, being selective about what the simulator supports can accelerate simulation time for simulations that do not use all the features of the simulator. For routing specifically, networks may reduce timers and exchange messages at higher rates to improve convergence time and keep state more up to date [39]. 
Router load can be decreased by giving certain updates higher priority processing [40]. Techniques such as metarouting [41] reduce the likelihood of implementation errors by mapping high-level descriptions to code. These works are synergistic with hardware offloading, and may be used in concert with the techniques proposed in this work. Moreover, new developments, such as the increasing pervasiveness of multicore technologies [42], graphics processing technologies [3], and resource constrained network elements demonstrate the need for greater awareness of hardware issues when designing networked systems solutions.

CHAPTER 11 CONCLUSIONS

This thesis challenges the conventional wisdom that networked systems solutions, such as higher-level protocols and network simulators, should be designed for a software-only implementation. Circuits were designed that implement routing protocol processors and a network simulator, demonstrating significant performance improvements. While the BGP processor design led to significant performance improvements, hardware implementation is not considered when designing network protocols like BGP. This limits achievable benefits and complicates implementation. Given the ever-increasing loads on routers, future routing protocols should be developed with hardware in mind. As a first step in this direction, a replacement for BGP was designed and implemented that simplifies design and offers further performance improvements. Additionally, a hardware-implemented network simulator was presented that can perform well when ns-2 performs poorly, and serves as a proof-of-concept that a hardware-based simulator could be extremely beneficial to network researchers. However, this work is only one early step towards developing more hardware-amenable network solutions. It may also be interesting to evaluate a wider array of networking protocols (e.g., storage/filesystem protocols, spam/email and other application services), and to investigate commonalities as a step towards developing a set of shared primitives to simplify hardware offloading. Additionally, this approach could benefit a variety of other networked systems solutions, as was demonstrated with HW-NS and discussed in the related work. Finally, while these hardware-aware and hardware-amenable solutions are not always necessary to obtain correct functionality, the benefits they provide can be substantial enough to justify the added complexity in implementation and design.


REFERENCES

[1] Bluespec, Inc. [Online]. Available: http://www.bluespec.com
[2] Synfora, Inc. [Online]. Available: http://www.synfora.com
[3] S. Tomov, M. McGuigan, R. Bennett, G. Smith, and J. Spiletic, “Benchmarking and implementation of probability-based simulations on programmable graphics cards,” Computers & Graphics, vol. 29, no. 1, pp. 71–80, 2005.
[4] F. Kiyak, B. Mochizuki, E. Keller, and M. Caesar, “Better by a HAIR: Hardware-amenable Internet routing,” in Proc. International Conference on Network Protocols, 2009, pp. 83–92.
[5] J. C. Mogul, “TCP offload is a dumb idea whose time has come,” in HOTOS’03: Proceedings of the 9th conference on Hot Topics in Operating Systems, 2003, pp. 5–5.
[6] OpenCores. [Online]. Available: http://opencores.org
[7] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, and J. Luo, “NetFPGA–an open platform for gigabit-rate network switching and routing,” in MSE ’07: Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, 2007, pp. 160–161.
[8] T. Li. Router scalability and Moore’s law. Presented at Internet Architecture Board Meeting. [Online]. Available: http://www.iab.org/about/workshops/routingandaddressing/Router Scalability.pdf
[9] Z. M. Mao, R. Govindan, G. Varghese, and R. H. Katz, “Route flap damping exacerbates internet routing convergence,” SIGCOMM Comput. Commun. Rev., vol. 32, no. 4, pp. 221–233, 2002.
[10] N. Kushman, S. Kandula, and D. Katabi, “Can you hear me now?!: It must be BGP,” SIGCOMM Comput. Commun. Rev., vol. 37, no. 2, pp. 75–84, 2007.
[11] N. Feamster and J. Rexford, “Network-wide prediction of BGP routes,” IEEE/ACM Trans. Netw., vol. 15, no. 2, pp. 253–266, 2007.
[12] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and S. Shenker, “Accountable Internet protocol (AIP),” SIGCOMM Comput. Commun. Rev., vol. 38, no. 4, pp. 339–350, 2008.
[13] L. Subramanian, M. Caesar, C. T. Ee, M. Handley, M. Mao, S. Shenker, and I. Stoica, “HLP: a next generation inter-domain routing protocol,” in SIGCOMM ’05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, 2005, pp. 13–24.
[14] A Border Gateway Protocol 4 (BGP-4), IETF RFC 1771, 1995.
[15] Multiprotocol Label Switching Architecture, IETF RFC 3031, 2001.
[16] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and J. van der Merwe, “Design and implementation of a routing control platform,” in NSDI’05: Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation, 2005, pp. 15–28.
[17] A. Feldmann, H. Kong, O. Maennel, and A. Tudor, “Measuring BGP pass-through times,” in Proc. Passive and Active Measurement, 2004, pp. 267–277.
[18] University of Oregon route views. [Online]. Available: http://archive.routeviews.org
[19] Quagga software routing suite. [Online]. Available: http://www.quagga.net
[20] D. X. Wei and P. Cao, “NS-2 TCP-Linux: an NS-2 TCP implementation with congestion control algorithms from Linux,” in WNS2 ’06: Proceeding from the 2006 workshop on ns-2: the IP network simulator. New York, NY, USA: ACM, 2006, p. 9.
[21] R. M. Fujimoto, “Parallel discrete event simulation,” Commun. ACM, vol. 33, no. 10, pp. 30–53, 1990.
[22] A. Medina, A. Lakhina, I. Matta, and J. Byers, “BRITE: An approach to universal topology generation,” in MASCOTS ’01: Proceedings of the Ninth International Symposium in Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2001, p. 346.
[23] W. Wolf, “Hardware-software co-design of embedded systems [and prolog],” Proceedings of the IEEE, vol. 82, no. 7, pp. 967–989, 1994.
[24] M. Baleani, F. Gennari, Y. Jiang, Y. Patel, R. K. Brayton, and A. Sangiovanni-Vincentelli, “HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform,” in CODES ’02: Proceedings of the tenth international symposium on Hardware/software codesign, 2002, pp. 151–156.
[25] G. Chesson, “XTP/PE overview,” in Local Computer Networks, 1988., Proceedings of the 13th Conference on, Oct. 1988, pp. 292–296.
[26] Q. Zhao, J. Xu, and Z. Liu, “Design of a novel statistics counter architecture with optimal space and time efficiency,” SIGMETRICS Perform. Eval. Rev., vol. 34, no. 1, pp. 323–334, 2006.
[27] J. L. Tripp, H. S. Mortveit, A. A. Hansson, and M. Gokhale, “Metropolitan road traffic simulation on FPGAs,” in FCCM ’05: Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2005, pp. 117–126.
[28] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, K. Mai, and B. Falsafi, “Protoflex: Towards scalable, full-system multiprocessor simulations using FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 2, no. 2, pp. 1–32, 2009.
[29] Z. Tan, K. Asanović, and D. Patterson, “An FPGA-based simulator for datacenter networks,” presented at the Exascale Evaluation and Research Techniques Workshop (EXERT 2010), at the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010), Pittsburgh, PA, 2010.
[30] R. Bagrodia, R. Meyer, M. Takai, Y. an Chen, X. Zeng, J. Martin, and H. Y. Song, “Parsec: A parallel simulation environment for complex systems,” Computer, vol. 31, no. 10, pp. 77–85, 1998.
[31] S. Bak, “Large-scale network simulation scalability and an FPGA-based network simulator,” Tech. Rep., 2009. [Online]. Available: http://www.cs.uiuc.edu/homes/sbak2/fpga netsim/fpga netsim initial report.pdf
[32] R. Sidhu and V. K. Prasanna, “Fast regular expression matching using FPGAs,” in FCCM ’01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2001, pp. 227–238.
[33] S. Dharmapurikar, P. Krishnamurthy, T. S. Sproull, and J. W. Lockwood, “Deep packet inspection using parallel bloom filters,” IEEE Micro, vol. 24, no. 1, pp. 52–61, 2004.
[34] F. Yu, R. H. Katz, and T. V. Lakshman, “Gigabit rate packet pattern-matching using TCAM,” in ICNP ’04: Proceedings of the 12th IEEE International Conference on Network Protocols, 2004, pp. 174–183.
[35] H. Song, T. Sproull, M. Attig, and J. Lockwood, “Snort offloader: A reconfigurable hardware nids filter,” in International Conference on Field Programmable Logic and Applications, 2005, pp. 493–498.
[36] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos, “Implementation of a content-scanning module for an Internet firewall,” in FCCM ’03: Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2003, p. 31.
[37] T. Sproull, G. Brebner, and C. Neely, “Mutable codesign for embedded protocol processing,” in FCCM ’05: Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2005, pp. 299–300.
[38] E. Gawish, M. El-Kharashi, M. El-Yazeed, and A. Salama, “Design and FPGA-implementation of a flexible text search-based spam-stopping firewall,” in Radio Science Conference, 2006. NRSC 2006. Proceedings of the Twenty Third National, vol. 0, 2006, pp. 1–7.
[39] Towards millisecond IGP convergence, IETF Draft, 2000.
[40] W. Sun, Z. Mao, and K. Shin, “Differentiated BGP update processing for improved routing convergence,” in ICNP ’06: Proceedings of the 2006 IEEE International Conference on Network Protocols, 2006, pp. 280–289.
[41] T. G. Griffin and J. L. Sobrinho, “Metarouting,” in SIGCOMM ’05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, 2005, pp. 1–12.
[42] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, “The landscape of parallel computing research: A view from Berkeley,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2006-183, 2006.
