A Trace Driven Study of Packet Level Parallelism

Huan Liu
Department of Electrical Engineering
Stanford University, CA 94305
[email protected]

Abstract—Network processors promise greater flexibility and programmability for routers and switches. They typically process incoming traffic on a packet-by-packet basis. Except for the ordering constraint on packets within the same flow, most packets are independent of each other. Thus, several levels of parallelism exist: Packet Level Parallelism (PLP), Intra-Packet Parallelism (IPP), and Instruction Level Parallelism (ILP). Most commercial network processor implementations exploit only PLP. In this paper, we quantify how much PLP really exists. Our results show that blindly adding more processing engines to exploit PLP quickly deteriorates the utilization ratio, which suggests that IPP and ILP should also be exploited. Furthermore, we show that adding an input reordering buffer is a very effective technique to increase the utilization ratio.

Index Terms—Packet Level Parallelism, Network Processor, Processing Engines.

I. INTRODUCTION

First generation Internet routers use a microprocessor-centric architecture, very similar to a PC or workstation used for general purpose computing. In addition to management functions such as running the routing protocol, the microprocessor is also responsible for the processing required on a per-packet basis, including routing table lookup and queue management. The explosive growth of the Internet has led to a rapid increase in transmission capacity. In the last decade, network speed increased by a factor of 240 while CPU clock speed increased only by a factor of 12. A CPU-centric architecture can no longer handle the increasing processing requirement. To cope with the rapid growth of network speed, second generation routers use dedicated ASIC hardware to implement the functions required on a per-packet basis, allowing the microprocessor to concentrate only on management functions. By moving specialized functions into hardware, the processing speed is greatly improved. This worked very well for best-effort traffic: since most networking protocols have been fairly stable, fixing these functions in hardware posed little problem.

As bandwidth becomes more and more of a commodity, Internet Service Providers (ISPs) are looking for ways to differentiate themselves. Providing Quality of Service for real-time traffic such as video and voice becomes a higher priority. As a result, instead of performing only destination-based routing, a router needs to guarantee latency and jitter, control congestion, and manage bandwidth. Complex algorithms [1] have been designed to address these problems, but none has been universally adopted. Casting these functions into hardware would be very inflexible: if a protocol is changed, or a new algorithm is designed, the ASIC has to be redesigned from scratch. In addition to its inflexibility, the ASIC approach has a couple of other disadvantages. First, the development cycle is very long, typically 18 months or more. It

is very hard to react to market changes, and even harder to make sure the product is still competitive when it is finally released. Second, it is very costly: because a custom ASIC typically has a very low volume, the high development cost cannot be amortized effectively.

In response to the shortcomings of the ASIC approach, the network processor was proposed. It promises a shorter development cycle, lower cost, and greater flexibility by allowing functions to be implemented in software rather than in a custom ASIC. Compared with an ASIC, network processors trade speed for flexibility: functions implemented in software are generally slower than those implemented in hardware. In order to meet the increasing processing requirement and offer performance comparable to an ASIC approach, architectural innovation becomes the key to the success of the network processor.

The easiest way to scale to higher speed is to exploit parallelism, i.e., build extra hardware to do things in parallel. In microprocessor design, a variety of techniques have been used to exploit Instruction Level Parallelism (ILP), such as superscalar and VLIW designs. Trace simulation has shown that there is very limited ILP in integer, non-scientific applications, even with a reordering mechanism, which is why expensive hardware-based schemes such as superscalar execution are needed to extract what parallelism there is. Other techniques such as the trace processor [2] or the multiscalar processor [3] try to exploit thread level parallelism; the amount of parallelism that can be exploited is also very limited. In contrast, a network processor faces a completely different set of applications. In addition to ILP, there are a couple of other kinds of parallelism that could be exploited. Specifically, they are:

1. Packet Level Parallelism (PLP): Most networking applications process incoming traffic on a per-packet basis. Each incoming packet goes through similar processing, yet each is independent of the others. Thus, it is possible to employ several processing units in parallel, each operating on a separate packet. The only constraint is that packets belonging to the same flow should depart in the same order as they arrive at the network processor. Although upper layer network protocols such as TCP/IP will function correctly regardless of packet ordering, it is highly desirable to keep packets in order to avoid the resulting lower throughput.

2. Intra-Packet Parallelism (IPP): Within the processing required for each packet, a number of tasks are typically independent of each other. This is similar to Thread Level Parallelism in microprocessor design, but it is much more pronounced in networking applications. For example, in a layer 2 switching application, the source MAC address learning function is independent of the destination MAC address lookup function, so they can be performed in parallel (a sketch of this example follows below). Coprocessors have been used commercially to

execute certain specialized functions; this is in effect exploiting IPP. Specialized instructions that execute certain functions more efficiently [4] are another technique for exploiting IPP.
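To make the IPP example above concrete, here is a small illustrative sketch (in Python, with invented function and table names; it is not from the paper) that dispatches the two independent layer-2 tasks to separate workers, the way two task engines inside a PE might cooperate on one packet:

```python
# Illustrative sketch of IPP: the two independent layer-2 tasks from the
# example above run concurrently, like two task engines sharing one packet.
from concurrent.futures import ThreadPoolExecutor

mac_table = {}  # toy forwarding table shared by both tasks

def learn_source_mac(src_mac, in_port):
    # Source MAC learning: remember which port this address was seen on.
    mac_table[src_mac] = in_port

def lookup_destination_mac(dst_mac):
    # Destination MAC lookup: find the output port, or flood if unknown.
    return mac_table.get(dst_mac, "flood")

def switch_packet(src_mac, dst_mac, in_port):
    # The two tasks are independent, so they can be issued in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        learn = pool.submit(learn_source_mac, src_mac, in_port)
        out = pool.submit(lookup_destination_mac, dst_mac)
        learn.result()
        return out.result()

print(switch_packet("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02", in_port=1))
```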

Almost all commercial network processors exploit PLP. They integrate multiple simplified RISC cores in a single processor to provide a scalable software programming environment [5]. To allow higher integration, the RISC core is made as simple as possible; typically it has no floating-point capability and no mechanism, such as superscalar execution, to exploit ILP. The RISC core is then replicated many times to create a parallel processing complex. For example, Intel's IXP 1200 supports 6 micro-engines each supporting 4 threads, Vitesse's I2000 supports 4 micro-engines each supporting 4 threads, and IBM's Rainier supports 16 processing engines each supporting 2 contexts. The reason such a technique is used is simple: PLP is the easiest form of parallelism to exploit. However, blindly adding more Processing Engines (PEs) will eventually lead to a lower utilization ratio and diminishing returns, as the ordering constraint starts to limit the number of packets that can be processed at the same time. So the key question is: how much PLP exists? In this paper, we try to answer this question using real-life network traces, and also explore possible directions for future network processor design.

II. DESIGN TRADEOFF

A. Number of PEs

Given the same silicon area, there are two basic approaches to designing an NP. First, one could make each PE as simple as possible to reduce its footprint on the die, and then replicate a large number of them on chip. This is an attractive design because the design complexity is much lower: not only is the PE easier to design, the interconnect among the PEs is also fairly simple. Because the communication requirement between PEs is expected to be very low, it could be as simple as a shared bus. As a result of this simplicity, each PE is also expected to take longer to process a packet. This approach exploits PLP to its fullest extent. The hope is that there is enough PLP so that the PEs can be used effectively. However, this assumption inevitably becomes invalid as a large number of PEs is used: the ordering constraint among packets within a flow will prevent a packet from being processed even if a free PE is available.

An alternative design could use a smaller number of PEs. Instead of using a simplified RISC core, a more sophisticated PE could be designed, using the extra chip real estate to exploit IPP and ILP. The goal is to shorten the processing time for each packet so that the design can keep up with the line rate even when the number of PEs is small. There are several ways to achieve a faster PE:

1. Use a deeper pipeline. A simplified RISC core typically uses 5 or fewer pipeline stages. As a result, each pipeline stage takes longer and, in turn, the throughput is lower compared to a deeply pipelined design.

2. Utilize FPGA reprogrammable hardware to implement packet processing functions that would otherwise need to be implemented in software on a RISC processor, as proposed in [6]. An FPGA allows functions to be changed quickly while still providing reasonable speed.

3. Use application pipelining techniques to complete the processing for each packet. For example, the queuing function can be completed in a separate pipeline stage from the classification function. They share minimal state information, so pipelining them is very efficient. The architecture proposed in [7] allows a PE to be used either for parallel processing or as a pipeline stage.

4. Design specialized instruction sets [8][4] for networking applications. A specially designed instruction can accomplish in a single cycle what several RISC instructions would do. For example, a shift-and-mask operation takes 2 cycles on a RISC core, but could take only 1 cycle if a specialized instruction is available.

5. Employ traditional microprocessor techniques to exploit ILP. For example, a superscalar PE could be designed to execute several instructions in parallel. A reordering mechanism could also be used to issue instructions out of order.

6. Build special-purpose hardware to offload certain functions from the software running on the network processor. By converting these functions to hardware, commonly used operations can be sped up. For example, instead of using a trie-based software search algorithm [9][10], a Ternary Content Addressable Memory (TCAM) [11] could be used to assist routing lookup. The speedup can be severalfold, since a TCAM requires only one memory access, whereas trie-based search algorithms typically require at least 4 memory accesses (a toy comparison follows this list).

7. Use a co-processor to execute functions in parallel. For example, CRC calculation is very expensive if performed in software; a specially designed CRC co-processor can offload that function from the PE.
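As a rough illustration of item 6, the toy sketch below (our own code with invented routing entries, not part of the paper) counts memory accesses for a binary-trie longest-prefix match and contrasts it with a TCAM-style lookup that matches all prefixes in a single access:

```python
# Toy comparison of memory accesses: trie-based LPM vs. a TCAM-style lookup.
# Routing entries are invented for illustration only.

class TrieNode:
    def __init__(self):
        self.children = {}   # bit ('0'/'1') -> TrieNode
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    node = root
    for bit in prefix_bits:
        node = node.children.setdefault(bit, TrieNode())
    node.next_hop = next_hop

def trie_lookup(root, addr_bits):
    # Walk the trie bit by bit; every node visited costs one memory access.
    node, accesses, best = root, 0, None
    for bit in addr_bits:
        if bit not in node.children:
            break
        node = node.children[bit]
        accesses += 1
        if node.next_hop is not None:
            best = node.next_hop
    return best, accesses

def tcam_lookup(tcam, addr_bits):
    # A TCAM compares against all stored prefixes in parallel: one access.
    for prefix, next_hop in tcam:          # longest prefixes stored first
        if addr_bits.startswith(prefix):
            return next_hop, 1
    return None, 1

root = TrieNode()
routes = [("1011", "if0"), ("10", "if1")]
for p, nh in routes:
    insert(root, p, nh)
tcam = sorted(routes, key=lambda r: len(r[0]), reverse=True)

print(trie_lookup(root, "101100"))   # ('if0', 4) -> several memory accesses
print(tcam_lookup(tcam, "101100"))   # ('if0', 1) -> single access
```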

Commercial network processor designs vary in their tradeoff on the number of PEs used. For example, MMC's network processor uses only 2 PEs, but it has an integrated TCAM to assist policy lookup or routing lookup. In contrast, IBM's network processor uses 16 PEs, but it relies on a tree-based search algorithm for routing lookup. Both network processors are rated for the same throughput; the reason is that the MMC NP can complete a routing lookup in a much shorter time, leaving free time to process new packets. In this paper, we use a simple abstraction to model this design tradeoff: if a large number of processing engines is used, the processing time for each individual packet will be proportionally longer; conversely, if a small number of processing engines is used, the packet processing time will be proportionally shorter. In other words, the per-packet processing time scales with the number of PEs, so the aggregate throughput stays constant. For example, if we assume a design with 4 PEs takes 10 cycles to process each packet, then a corresponding design with 8 PEs will take 20 cycles per packet.
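This abstraction can be restated in a few lines; the sketch below (our own helper, with the base point chosen only to reproduce the 4-PE, 10-cycle example above) scales the per-packet processing time with the number of PEs:

```python
# Scaling assumption: per-packet processing time grows in proportion to the
# number of PEs, so the aggregate throughput (PEs / cycles) stays constant.
def per_packet_cycles(num_pes, base_pes=4, base_cycles=10):
    return base_cycles * num_pes / base_pes

for p in (4, 8, 16):
    print(p, "PEs ->", per_packet_cycles(p), "cycles per packet")
# 4 PEs -> 10.0, 8 PEs -> 20.0, 16 PEs -> 40.0
```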

B. Input Reordering

Instruction reordering is a very popular technique used in microprocessor design to exploit fine-grain instruction level parallelism: if an instruction is blocked from execution because of a data dependency, subsequent instructions can still be issued. The same idea can be used in a network processor. If a packet cannot be processed because of a flow dependency, a subsequent packet can be processed out of order. A buffer is maintained to hold incoming packets; a packet remains in the buffer either because another packet belonging to the same flow is being processed or because no free PE is available. The size of the input buffer determines the maximum number of packets that can be processed out of order.

C. Output Reordering

Output reordering is also a popular technique to guarantee packet ordering. A packet is processed as soon as it arrives and a free PE is available. To guarantee correct ordering, an output reordering buffer is used. The output reordering buffer detects packets belonging to the same flow and then, based on their sequence numbers, either holds a packet in the buffer or sends it out if all previous packets have completed processing. Output reordering can potentially increase the PE utilization ratio; in fact, 100% utilization could be achieved by scheduling every packet for processing as soon as it arrives. Despite its potential, a couple of drawbacks limit its usage. First, a very large buffer may be required depending on the packet processing time: a packet that takes long to process can hold a large number of subsequent packets of its flow in the output reordering buffer, causing it to overflow. Second, the output reordering architecture is not suitable for flow-based applications. Some networking applications not only need to guarantee in-order delivery, they also have processing dependencies such as serialization of access to per-flow data [12], i.e., the next packet in the flow depends on the processing result of the previous packet. For example, in a per-flow queuing application, the first packet is needed to create an entry in the flow table, and packets in the same flow cannot be processed until the first packet finishes. Because of these limitations of output reordering, we do not consider it further in this paper.

D. Reference architecture

In this paper, we consider a design that consists of input reordering buffers and a processing engine complex that exploits both PLP and IPP. The reference architecture is shown in Figure 1. The input buffer is responsible for holding incoming packets that have not yet been scheduled to a PE. It has out-of-order scheduling capability, i.e., a packet arriving later at the network processor can be scheduled before a previous packet if no dependency exists. Each PE processes one packet at a time. To exploit IPP, a variety of techniques could be used, such as a trace processor [2] or a multiscalar design [3]. These techniques either dynamically or statically partition a program into multiple threads and execute them in parallel. Since a networking application is expected to have a large amount of IPP, it is possible to simplify these designs by relying on the programmer to extract coarse-grain parallelism. Several task engines are replicated inside each PE, and each task engine runs one thread. The software designer designates what function runs on which task engine and designs the software so that the task engines cooperatively process a single packet. Each task engine could further employ dynamic instruction scheduling logic to exploit instruction level parallelism.

Figure 1: Reference network processor architecture (an input buffer feeding multiple PEs, each containing several task engines)

III. SIMULATION SETUP

We used several traces made available by NLANR [13]. We randomly chose several traces from different Internet Exchange Points (IXPs). The characteristics of each trace are shown in Table I. Trace auck is from the AUCK-II data set, collected on a link from the University of Auckland to the US. The other four traces were collected at US IXPs over OC-3 links.

TABLE I: TRACE CHARACTERISTICS

trace      # of packets    # of unique flows (based on dest IP)
aix9889        5604          2796
apn9909       18730           725
sdc958      4256728         31642
sdc964     31518464        130164
auck          63444           869

We use the timestamp information to simulate the packet arrival process. Assuming there are N packets in the trace and the total trace duration is T seconds, we say a network processor with P processing engines is operating at full load if the average processing time for each packet, tf, satisfies the following constraint:

tf = P × T / (N − 1)

Intuitively, full load refers to the maximum average processing time at which the network processor can still process incoming packets as they arrive, i.e., at line rate. For example, a 4-PE network processor designed to process 5 million packets per second is operating at full load if it takes 0.8 us to process each packet. Typically, some headroom needs to be built into the application to account for inter-packet dependency, which reduces the network processor utilization ratio. We model the headroom by the load l, which expresses the packet processing time t as a fraction of tf. In an access router, the line rate is low, but the required tasks are quite complex: packet filtering, policing, and rate measurement as specified in the Diffserv architecture have to be performed at the edge of the network, so the processing time is relatively long. In a core router, on the other hand, the processing may be as simple as looking up an MPLS label to determine the forwarding decision, so the processing time is correspondingly small. Using the load to model the processing headroom allows our simulation results to be applicable in both cases.

Because of inter-packet dependency, some packets will arrive when all PEs are busy. They are first buffered in the input reordering buffer. If it becomes full, back pressure is applied to the input to stop packet transfer; packet transfer resumes as soon as either a free input buffer slot or a PE becomes available. Back pressure only delays the immediately incoming packet, but it does not affect future packets' arrival times. Effectively, this assumes there are infinite buffers in the input stage, which is a reasonable assumption because an Internet host will not delay sending future packets because of current congestion in the network.
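The simulation described in this section can be sketched as follows. This is our own minimal Python rendition under the stated assumptions (a trace reduced to (arrival time, flow id) pairs, equal processing time per packet, per-flow serialization, and an input reordering buffer that limits how far past a blocked packet the scheduler may look); the scheduling details are a simplification, not the author's exact implementation:

```python
# Minimal sketch of the trace-driven simulation (our own simplification).
import heapq

def simulate(trace, P, load, bufsize):
    """Return the total execution time Te for one trace.

    trace   -- list of (arrival_time, flow_id), sorted by arrival time (len >= 2)
    P       -- number of processing engines
    load    -- per-packet processing time as a fraction of the full-load time tf
    bufsize -- input reordering buffer size (0 = packets must start in order)
    """
    N = len(trace)
    T = trace[-1][0] - trace[0][0]
    t_proc = load * (P * T / (N - 1))      # per-packet processing time

    pending = list(trace)                  # packets not yet started
    busy = []                              # heap of (finish_time, flow_id)
    flows_in_flight = set()
    now = trace[0][0]
    last_finish = now

    while pending or busy:
        # Start any eligible packet among the first (bufsize + 1) waiting ones:
        # it must have arrived, a PE must be free, and no packet of the same
        # flow may still be in flight (scanning front-to-back keeps flow order).
        started = True
        while started:
            started = False
            for i, (arr, flow) in enumerate(pending[: bufsize + 1]):
                if len(busy) < P and arr <= now and flow not in flows_in_flight:
                    finish = now + t_proc
                    heapq.heappush(busy, (finish, flow))
                    flows_in_flight.add(flow)
                    last_finish = max(last_finish, finish)
                    del pending[i]
                    started = True
                    break
        # Advance time to the next event that can unblock a packet.
        next_arrival = pending[0][0] if pending else float("inf")
        if busy and (busy[0][0] <= next_arrival or next_arrival <= now):
            done, flow = heapq.heappop(busy)
            flows_in_flight.discard(flow)
            now = max(now, done)
        elif pending:
            now = max(now, next_arrival)
        else:
            break
    return last_finish - trace[0][0]
```

Setting bufsize to 0 corresponds to the no-reordering configuration used in Figures 2 and 3; larger values model the input reordering buffer of Section II.B.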

IV. SIMULATION RESULTS

A. Processor utilization ratio

We use T0 to denote the total execution time required if there is no packet ordering constraint, i.e., a packet is scheduled for processing as soon as a PE is available, regardless of whether this causes packets to be delivered out of order. In this case the PEs are used as efficiently as possible, so we say the utilization ratio is 100%. We use Te to denote the total execution time if the ordering constraint has to be maintained. We can then define the utilization ratio ρ as T0 / Te.

The utilization ratio versus the number of PEs at various loads is shown in Figure 2 for trace auck. The graph is similar for the other traces, so only one is shown for brevity. As expected, the utilization ratio goes down as more and more PEs are added to the NP. What is surprising is how fast it goes down: the utilization ratio is much lower even when only 8 or 12 PEs are used. Reducing the load, i.e., building in more headroom, increases the overall utilization ratio, but it does not slow the decrease in utilization ratio as more PEs are added.

Figure 2: Utilization ratio vs. number of PEs for trace auck at loads from 0.1 to 1.0, with no input reordering buffer

The utilization ratio versus the number of PEs for all traces at 70% load is shown in Figure 3. All traces exhibit the same decrease in utilization ratio. Trace aix9889 has enough parallelism to maintain a relatively high ratio, but all other traces drop below a 60% utilization ratio with only 12 PEs. The reason is that trace aix9889 contains a large number of flows.

Figure 3: Utilization ratio vs. number of PEs for all traces at 70% load, with no input reordering buffer
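Continuing the sketch from Section III (again our own code), T0 can be obtained by giving every packet a unique flow id, which removes the ordering constraint, so the utilization ratio can be estimated as:

```python
# Utilization ratio rho = T0 / Te, using the simulate() sketch from Sec. III.
def utilization_ratio(trace, P, load, bufsize=0):
    no_constraint = [(ts, i) for i, (ts, _flow) in enumerate(trace)]
    T0 = simulate(no_constraint, P, load, bufsize)   # every packet its own flow
    Te = simulate(trace, P, load, bufsize)           # real flow ids
    return T0 / Te

# e.g., rho = utilization_ratio(trace, P=16, load=0.7)  # 16 PEs at 70% load
```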

B. Effect of input reordering buffer

An input reordering buffer allows packets to be scheduled out of order when a free PE is available. Naturally, this should increase the utilization ratio. The utilization ratio versus the number of PEs for different input reordering buffer sizes for trace sdc958 is shown in Figure 4. As the input reordering buffer grows, the utilization ratio quickly improves, and then saturates as more buffer is added. A very small buffer appears to be sufficient; adding further buffer does not greatly improve the result, and the curves flatten quickly. Although not shown, the larger the number of PEs used, the more pronounced the flattening. A buffer size of 12 to 20 seems to be good enough.

Figure 4: Utilization ratio vs. number of PEs for different input reordering buffer sizes (0 to 32), trace sdc958

The trend of saturation is confirmed in Figure 5, where the utilization ratio is plotted versus the size of the input reordering buffer for all traces with a 32-PE network processor operating at 70% load. An input reordering buffer helps a network processor operating at low load more than one operating at high load. When the load is high, the flow dependency is stronger, so it is harder to improve; at low load, the input reordering buffer can quickly resolve blocking conditions and thus improve the utilization ratio.

Figure 5: Utilization ratio vs. input reordering buffer size for all traces, with 32 PEs operating at 70% load

C. Relax flow definition

So far, we have assumed that a flow is defined as the packets with the same destination IP address. A more restrictive definition is typically used, though. To see its impact, we evaluate the utilization ratio when a flow is defined as the packets with the same source IP address, destination IP address, source TCP/UDP port, and destination TCP/UDP port. The utilization ratio versus the number of PEs is plotted in Figure 6. As expected, the ratio is higher when the more restrictive flow definition is used, although the difference is small.

Figure 6: Utilization ratio vs. number of PEs for two traces under the two flow definitions, at 70% load with no input reordering buffer
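The two flow definitions compared above can be expressed as key functions over a parsed packet header; the field names below are ours, chosen only for illustration:

```python
# Flow definitions used in the comparison (header field names are invented).
def flow_key_dest_only(pkt):
    # Baseline definition: all packets to the same destination form one flow.
    return pkt["dst_ip"]

def flow_key_4tuple(pkt):
    # More restrictive definition: more distinct flows, hence fewer ordering
    # conflicts and a slightly higher utilization ratio.
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"])
```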


V. CONCLUSION

Commercial network processors try to exploit packet level parallelism by duplicating a large number of simple processing engines, in the hope that many packets can be processed at the same time. But because of the ordering constraint for packets within the same flow, blindly adding more PEs leads to very inefficient utilization. In this paper, we quantified the amount of packet level parallelism available using real-life Internet traffic traces. The results suggest that the number of PEs used should be minimized; instead of exploiting more packet level parallelism, we should exploit more intra-packet parallelism, and several design tradeoffs were suggested for doing so. We also showed that adding an input reordering buffer is a very effective way to improve overall utilization, and that a small buffer is sufficient.

ACKNOWLEDGMENT

The author would like to thank the MOAT PMA group at the National Laboratory for Applied Network Research (NLANR) for providing free access to the trace data under National Science Foundation NLANR/MOAT Cooperative Agreement (No. ANI-9807479).

REFERENCES

[1] R. Guerin and V. Peris, "Quality-of-Service in Packet Networks: Basic Mechanisms and Directions."
[2] E. Rotenberg et al., "Trace Processors," Proc. Micro-30, pp. 68-74, Dec. 1997.
[3] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, "Multiscalar Processors," Proc. Int'l Symp. Computer Architecture, pp. 414-425, June 1995.
[4] X. Nie et al., "A New Network Processor Architecture for High-speed Communications," Proc. IEEE Workshop on Signal Processing Systems, 1999, pp. 548-557.
[5] T. Wolf and J. Turner, "Design Issues for High-performance Active Routers," IEEE Journal on Selected Areas in Communications, vol. 19, pp. 404-409, Mar. 2001.
[6] D. Taylor, J. Turner, and J. Lockwood, "Dynamic Hardware Plugins (DHP): Exploiting Reconfigurable Hardware for High-Performance Programmable Routers," Proc. Open Architectures and Network Programming, 2001, pp. 25-34.
[7] H. Shimonishi and T. Murase, "A Network Processor Architecture for Flexible QoS Control in Very High-speed Line Interfaces," Proc. High Performance Switching and Routing, 2001, pp. 402-406.
[8] P. Paulin, F. Karim, and P. Bromley, "Network Processors: A Perspective on Market Requirements, Processor Architectures and Embedded S/W Tools," Proc. Design, Automation and Test in Europe, 2001, pp. 420-427.
[9] S. Nilsson and G. Karlsson, "IP-address Lookup Using LC-tries," IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1083-1092, 1999.
[10] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable High-Speed IP Routing Lookups," Proc. ACM SIGCOMM 1997, pp. 25-36, Cannes, France.
[11] A. McAuley and P. Francis, "Fast Routing Table Lookup Using CAMs," Proc. Infocom '93, vol. 3, pp. 1382-1391, Mar. 1993.
[12] W. Bux et al., "Technologies and Building Blocks for Fast Packet Forwarding," IEEE Communications Magazine, vol. 39, pp. 70-77, Jan. 2001.
[13] Passive Measurement and Analysis project, National Laboratory for Applied Network Research. http://moat.nlanr.net/pma
