Revisiting TCP Congestion Control in a Virtual Cluster Environment

Luwei Cheng and Francis C.M. Lau
Department of Computer Science, The University of Hong Kong
Email: {lwcheng, fcmlau}@cs.hku.hk

Abstract—Virtual machines (VMs) are widely adopted today to provide elastic computing services in datacenters, and they still heavily rely on TCP for congestion control. VM scheduling delays due to CPU sharing can cause frequent spurious retransmit timeouts (RTOs). Using current detection methods, we find that such spurious RTOs cannot be effectively identified, because of the retransmission ambiguity caused by the delayed ACK (DelACK) mechanism. Disabling DelACK would add significant CPU overhead to the VMs and thus degrade the network's performance. In this paper, we first report our practical experience with TCP's reaction to VM scheduling delays. We then provide an analysis of the problem, which has two components corresponding to VM preemption on the sender side and on the receiver side respectively. Finally, we propose PVTCP, a ParaVirtualized approach to counteract the distortion of congestion information caused by the hypervisor scheduler. PVTCP is completely embedded in the guest OS and requires no modification to the hypervisor. Taking incast congestion as an example, we evaluate our solution in a 21-node testbed. The results show that PVTCP adapts well to virtualized environments and deals satisfactorily with the throughput collapse problem.

Index Terms—Virtual Machine, TCP, Congestion Control

[Fig. 1. TCP incast goodput (Mbps) versus the number of concurrent senders (2-20), with 6 different RTOmin values (200ms, 150ms, 100ms, 50ms, 10ms, 1ms): (a) TCP incast in a physical cluster; (b) TCP incast in a virtual cluster. The two groups of experiments are conducted in the same cluster network (more details are in §II-B). In the virtual cluster, we run 3 VMs per CPU core. In Fig. 1(b), to show the performance differences more clearly, we set the range of the y-axis to 500Mbps.]

I. INTRODUCTION

Cloud computing allows users to hire a cluster of VMs in an on-demand fashion. To improve the cost-effectiveness of their cloud platforms, cloud providers strive to increase the level of server consolidation, i.e., the ability to host multiple VMs per physical server. CPU sharing is one of the key methods they use. As exemplified by the Amazon EC2 cloud, one 3GHz CPU may be shared by up to 3 small instances, each having the illusion of running at 1GHz speed [1]. In private clouds, running 40-60 VMs per server is not rare, which means 2-4 VMs per CPU core in a typical 16-core SMP machine; some deployments may even run as many as 120 VMs per host [36]. Although modern multi-core systems may lean towards assigning dedicated cores to heavily loaded VMs, whenever it is possible and beneficial (e.g., during off-peak periods), consolidating multiple VMs onto fewer cores is still needed to improve the utilization of individual CPU cores.

Cloud datacenters overwhelmingly use TCP for communication: a recent study reveals that 99.91% of the traffic in today's datacenters is TCP traffic [8]. Since its inception, TCP has been successfully deployed in: (i) various WAN environments, where the network delay can range from tens to hundreds of milliseconds, and (ii) physical clusters, featuring sub-millisecond network delay. Unlike a WAN or a physical cluster, in which the network delay is relatively stable and

predictable (spikes are infrequent or rare), in virtual clusters the network delay has been observed to vary significantly and to be highly unpredictable [11], [46], [20]. Our experiments show that even when there is no workload in the network, the VMs alone can cause frequent latency spikes (§II-C). Whether TCP is able to function well in such a virtualized environment is largely an open question, which calls for a re-examination of TCP's effectiveness. The fluctuating network delays between VMs prevent TCP from accurately learning the physical network condition from round-trip times (RTTs) and RTOs. The spiked RTTs caused by VM scheduling, when mixed with real network congestion, can seriously alter the expected behaviors of TCP.

In this regard, we revisit a serious form of network congestion called "incast", which is commonly seen in large-scale distributed data processing, such as distributed storage [32], MapReduce [16] and web search [8]. For physical clusters, prior works [37], [12], [52] have examined several techniques (increasing the switch buffer size, limited transmit, reducing the duplicate ACK threshold, disabling slow-start, randomizing the timeout value) and several loss-recovery variants (Reno, NewReno, SACK), but they conclude that none of them can fully eliminate the problem; in particular, if "tail loss" happens, TCP can only count on the retransmit timer's firing. Therefore, significantly reducing RTOmin has been regarded as a safe and effective approach [45]. Even with the use of ECN, a small RTOmin is still desired when the number of concurrent senders is large [8]. However, the effectiveness of this approach in a virtual cluster has not been investigated.

In Fig. 1, we conduct two groups of incast experiments with a 21-node physical cluster and a 21-node virtual cluster,


under the same network topology. In the physical cluster, as expected, a smaller RTOmin delivers higher goodput, regardless of the number of senders. The results from the virtual cluster are very different: a small RTOmin yields very low goodput at the beginning, but as the number of senders increases, its goodput gradually catches up with and then surpasses that of a bigger RTOmin. TCP users therefore have to choose an RTOmin that fits the whole range of possible scenarios, which is exactly the problem noted by Allman and Paxson in [9]: adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is no obvious optimal balance between the two. The virtualized environment is clearly a new instantiation of this phenomenon.

The contributions of this paper are threefold. First, we identify and quantify how VM scheduling can bring frequent spurious RTOs to TCP, leading to very low goodput. Regarding the causes, we reveal two rather different scenarios, in which VM scheduling delays appear in either the transmit path or the receive path. Second, as for detecting spurious RTOs, we find that the state-of-the-art method (F-RTO [40]) does not perform well in virtualized environments. The main cause is its uncoordinated interaction with DelACK, a widely used ACK coalescing technique to reduce the host's CPU and network pressure. Third, we extend ParaVirtualized TCP (PVTCP), which was proposed in our previous work for the sender-side problem [14], to tackle also the receiver-side problem. PVTCP is triggered only when a VM scheduling delay has been detected; otherwise it falls back to standard TCP. Our experiments show that with a 1ms RTOmin, PVTCP does not suffer the kind of performance loss shown in Fig. 1(b), and meanwhile it avoids the potential throughput collapse.

The rest of the paper is organized as follows. The impact of virtualization on TCP is illustrated in §II. The possible approaches are discussed in §III, with a particular focus on the state-of-the-art method to detect spurious RTOs. The causes of RTOs in VMs are analyzed in §IV, where we differentiate the sender-side problem from the receiver-side problem. We propose the extended PVTCP in §V and describe its implementation in Linux in §VI. The evaluation results are presented in §VII. Some practical concerns about our solution and future work are discussed in §VIII. §IX gives a brief account of the related work. We conclude the paper in §X.

II. BACKGROUND AND MOTIVATION

A. Virtualization Basics

Virtualization allows multiple VMs to share a single machine by multiplexing the underlying physical resources such as CPU, memory and I/O devices. Fig. 2 conceptually shows the components of Xen [10], an open-source hypervisor that is widely deployed in public clouds. Xen provides basic mechanisms for its upper-layer domains, such as the hypervisor scheduler for proportional CPU sharing, event channels for asynchronous notification, and shared memory for data transfer. For safety reasons, guest domains are not allowed to access I/O devices directly; they rely on the driver domain, which contains the real drivers, to act on their behalf. Some other hypervisors like VMware's ESX(i) include device drivers as components

[Fig. 2. The components of Xen and its network I/O virtualization: guest domains run applications over their own TCP stacks and virtual NICs; the driver domain holds the administration utilities, the real NIC driver and a bridge or vSwitch; the Xen hypervisor underneath provides the hypervisor scheduler, shared memory and event channels.]

(called VMkernel) [4], so sometimes the driver domain is also considered a secondary part of the Xen hypervisor. Xen supports two typical I/O virtualization models: (1) paravirtualization (PV), in which a slightly modified guest OS uses a split-driver model for communication; (2) full virtualization (FV), in which an unmodified guest OS runs on special processors (e.g., Intel VT and AMD SVM) while the hypervisor uses QEMU to fully emulate I/O devices for the VMs. Since the driver domain is responsible for I/O forwarding for all guest domains, it often runs on dedicated CPU cores with guaranteed efficiency.

Virtualization can cause performance problems in applications that do not exist when the applications run directly on physical machines [28], [50], [13]. First, since the hypervisor needs to provide virtual I/O devices for the VMs, a certain amount of software overhead is incurred, typically seen in the extra data movement between the guests and the driver domain. Furthermore, and more prominently, the hypervisor's scheduling can significantly degrade a VM's I/O performance (of either PV or FV type), because the scheduling delays are added to the VM's interrupt processing. In the following sections, we evaluate how these two types of overhead affect network delays.

B. Experimental Settings

We conduct the experiments in our own cluster, containing 21 Dell PowerEdge M1000e blade servers. Each server is equipped with two quad-core 2.53GHz Intel Xeon 5540 CPUs, 16GB RAM and two 250GB SATA disks. All servers are connected through a Brocade GbE switch. We avoid using simulators like ns-2/ns-3 [2] because they focus only on higher abstraction levels; in our experiments we also need to consider factors at other levels, including VM scheduling in the hypervisor, the interrupt processing sequence in the guest OS, and CPU utilization. In our environment, we can accurately control the VM settings and the guest OS kernels, so that any experimental phenomenon can be reproduced in a deterministic way.
To create the VM consolidation scenario, several VMs are configured to run on the same core to compete for CPU cycles, as shown in Fig. 3. Our benchmark is identical to that in [45]: it executes a certain number of synchronized reads of 1MB blocks of data from multiple VMs. We use Xen 4.2.0 as the hypervisor and Linux 3.8.8 as the guest OS. We choose TCP NewReno in our experiments instead of the default TCP Cubic, because NewReno's linear growth function in the congestion-avoidance phase makes it easier for us to analyze the change of the congestion window.

[Fig. 3. The settings of the virtual cluster in our experiments: on each physical machine, the communicating VMs under test share CPU cores with CPU-intensive background VMs; the machines are connected through a GbE switch.]

C. RTTs in a Virtual Cluster

We first evaluate the software emulation overhead in Fig. 4. To get a fine-grained view, the Linux ping command is used to report RTTs every 100ms. Fig. 4(a) shows that the average RTT of the physical network (two hops, connected to the same switch) is 0.147ms. When using VMs in Fig. 4(b), the average RTT increases to 0.374ms (2.54×). The extra 0.227ms is the software overhead introduced by the extra data movements between the driver domain and the guests.

[Fig. 4. I/O virtualization overhead in RTTs, using the ping command: (a) [PM → PM]; (b) [1VM → 1VM].]

We then evaluate VM scheduling delays, as shown in Fig. 5. It can be seen that the RTT varies greatly without any apparent predictability. For example, when there are 3 VMs per core on the receiver side (Fig. 5(c)), the maximum RTT is about 60ms; when this amount of consolidation happens on both sides (Fig. 5(d)), the maximum RTT rises to nearly 120ms. As the scheduling latency of each VM, Lat_vm^sched, is actually equal to its queuing delay in the hypervisor's scheduling queue, its maximum value is:

    max(Lat_vm^sched) = (N − 1) × TS_hypervisor    (1)

where N is the number of co-located VMs on that core, and TS_hypervisor is the time slice used in the hypervisor scheduler. Since Xen's scheduler [3] uses a 30ms time slice, when N is 3 the maximum scheduling delay of each VM is thus 60ms. The software emulation overhead is sub-millisecond, while a VM scheduling delay can easily be in excess of tens of milliseconds, thus dominating the RTT. It is worth noting that with current hardware support such as SR-IOV [17], the software overhead can easily be avoided by bypassing the driver domain for data movement, the so-called "passthrough I/O". In contrast, VM scheduling delays cannot be eliminated, because CPU sharing among VMs is common in many scenarios, independent of the type of hypervisor used. In the following section, we explore the impact of VM scheduling on TCP's performance.

[Fig. 5. Hypervisor scheduling delays in RTTs, using the ping command: (a) [1VM → 2VMs]; (b) [2VMs → 2VMs]; (c) [1VM → 3VMs]; (d) [3VMs → 3VMs]. Note that the units on the y-axis are different from those in Fig. 4.]

D. VM Scheduling Affects TCP's Performance

Regarding VM scheduling, there are two main factors that contribute to the decreased network performance:

• Idle link, which happens when one of the communicating VMs has been preempted by the hypervisor scheduler, causing TCP flows to stall.
• Improper triggering of congestion control in the VMs' transport layer, due to VM scheduling delays.

Although the two problems are not mutually exclusive, they happen in different layers. Prior works focus predominantly on modifying the hypervisor layer to resume the concerned VM as soon as possible to serve I/O, e.g., [35], [28], [22], [50], [49]. These methods can alleviate the problem to some extent, but they bring in another problem: the number of VM context switches substantially increases, because the hypervisor needs to keep swapping the VMs in response to the incoming I/O. Such abrupt VM swapping also reduces the CPU cache effect, hurting the performance of CPU-bound VMs. That is probably the reason why Xen adopts 30ms as the default time slice [3] and VMware uses 50ms [6]. As long as CPU cores are not dedicated to single VMs, the scheduling delays caused by VM preemption cannot easily be avoided.

VM preemption can happen on both the sender side and the receiver side; there are four possible scenarios: (1) only the sender VM has been preempted; (2) only the receiver VM has been preempted; (3) both are running; (4) both have been preempted. In the latter two cases, TCP flows either progress normally or are completely stalled. It turns out that when there is only one running VM, the transport-layer protocol is seriously affected. Fig. 6 schematically explains how VM scheduling delays can cause spurious RTOs. The horizontal dimension is the flow direction and the vertical dimension is the time axis.

[Footnote 1: In this paper, we use [mVM(s) → nVM(s)] to denote that there are m co-located VMs per core on the sender side (including the sending VM) and n co-located VMs per core on the receiver side (including the receiving VM).]
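A quick sanity check of Eq. (1), as a minimal sketch (the 30ms slice is Xen's default value cited above; the VM counts are those of Fig. 5):

```python
def max_sched_delay_ms(n_vms: int, time_slice_ms: float = 30.0) -> float:
    """Eq. (1): worst-case scheduling (queuing) delay of one VM when
    n_vms co-located VMs share a core -- every other VM may run a full
    time slice before this VM gets the CPU again."""
    return (n_vms - 1) * time_slice_ms

# 3 VMs per core with Xen's 30ms slice: up to 60ms, matching the ~60ms
# RTT spikes in Fig. 5(c); when both sides are consolidated the two
# delays can add up, matching the ~120ms spikes in Fig. 5(d).
print(max_sched_delay_ms(3))        # 60.0
print(2 * max_sched_delay_ms(3))    # 120.0
```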

Suppose that VM1 is the TCP sender. Taking Fig. 6(a) as an example, after VM1 issues some data packets, if the receiver VM can return an ACK within VM1's scheduling time quantum, then: (1) the retransmit timer will be cleared and reset with a new timeout value; (2) the sending window size will be increased if necessary. Otherwise, if VM1 has been preempted, the ACK will be buffered in the driver domain, which is not known to VM1 until it is scheduled again; this possibly leads to an RTO.² The situation in Fig. 6(b) is similar: if the receiver VM has been preempted, the sender VM cannot receive the ACK until the receiver VM is scheduled again.

[Fig. 6. An illustration of RTOs when VM consolidation happens on the sender side and on the receiver side respectively: (a) the sender VM suffers a scheduling delay, and the returned ACK is buffered in the driver domain until the sender VM runs again, by which time the retransmit timer has already fired; (b) the receiver VM suffers a scheduling delay, and the data packets wait in the driver domain's buffer before they can be delivered and acknowledged.]

[Footnote 2: Note that in Fig. 6, the buffer is invisible to the VMs and can only be configured by the driver domain. Its default length is 32 in Xen (XENVIF_QUEUE_LENGTH), which is too small to accommodate the incoming packets when the VM has been preempted. To avoid the negative effect of packet drops on our measured goodput results, in our experiments we adopt a very large value for this buffer (e.g., 8192).]

Following Fig. 6, we examine how TCP's RTO estimation can be affected by VM scheduling delays. Standard TCP uses a low-pass filter [23] to estimate RTO values:

    SRTT_i = (7/8) × SRTT_{i−1} + (1/8) × MRTT_i    (2)

    RTTVAR_i = (3/4) × RTTVAR_{i−1} + (1/4) × |SRTT_i − MRTT_i|    (3)

    RTO_{i+1} = SRTT_i + 4 × RTTVAR_i    (4)

where MRTT_i is the measured RTT at time i, calculated using a returned ACK packet, SRTT_i is the smoothed RTT, and RTTVAR_i represents the RTT variance (mean deviation). When the RTT suddenly increases and exceeds the RTO value that has been set, a spurious timeout will occur.

[Fig. 7. In the scenario of [3VMs→1VM], we collect the measured RTTs and the calculated RTO values from the sender VM's kernel, with various RTOmin values: (a) RTOmin = 200ms; (b) RTOmin = 100ms; (c) RTOmin = 10ms; (d) RTOmin = 1ms. In the scenario of [1VM→3VMs], the results are similar.]

In Fig. 7, we use our modified tcp_probe kernel module to report the measured RTTs and the calculated RTO values. We can see that RTT spikes always appear without any warning from the historical measurements. RTTVAR can only reflect the variance of previously measured RTTs, so when a spiked RTT appears, without the protection of a sufficiently large RTOmin, an RTO happens before the current RTTVAR can adapt to the change. After each RTO, TCP assumes that there is serious network congestion, reduces the sending speed, and doubles the subsequent timeout values (i.e., exponential backoff [23]). But actually the congestion sensed by the VM's TCP layer is unreal, which we refer to as pseudo-congestion.³ Sporadic RTOs may not be a serious problem, but when they happen very frequently (every 10×ms, as induced by the hypervisor scheduler), TCP's performance clearly degrades.

[Footnote 3: In the rest of this paper, we use pseudo-congestion to refer to the RTOs caused by VM scheduling delays.]

III. POSSIBLE APPROACHES

A. Other TCP Variants to Handle RTT Spikes

RTT spikes can also appear in other network types and cause spurious timeouts, notably in wireless environments. In these networks, a high bit error rate is another main cause of packet loss, aside from network congestion. Since there are many different access technologies for wireless networks, each TCP solution tends to have its own unique problem to tackle. For example, TCP-Peach [7] considers long propagation delays in satellite networks, ATCP [29] addresses frequent route changes and partitions in ad-hoc networks, and Freeze-TCP [19] focuses on the hand-off problem in cellular networks. It is unlikely that there can be a universal TCP solution that fits all types of networks [43], and accurately identifying the causes of packet loss has become the focal point of TCP design. The virtualized environment, with its unique cause of unpredictable network delays of very large magnitude (100×) and frequency (every 10×ms), presents another case that needs special adaptation.

B. Timestamping Packets in the Hypervisor

TCP can use the timestamp option to measure RTTs: the sender timestamps its data packets and the receiver sends this value back to the sender in the returned ACK; the sender then calculates the difference between the current system clock value and the timestamp value in the ACK packet:

    MRTT = system_clock − ACK.timestamp    (5)
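To make the estimator of Eqs. (2)-(4) in §II-D concrete, the following sketch feeds it a stable sub-millisecond RTT stream followed by one scheduling-induced 60ms spike (the RTT values are illustrative, not measurements from our testbed):

```python
def rto_estimator(rtts_ms, rto_min_ms):
    """Eqs. (2)-(4): exponentially weighted RTT mean and deviation.
    Yields (measured_rtt, rto_armed_before_this_sample) so that each
    sample can be compared with the RTO that was in force."""
    srtt = rttvar = None
    rto = rto_min_ms
    for mrtt in rtts_ms:
        yield mrtt, rto
        if srtt is None:                     # first sample: initialize
            srtt, rttvar = mrtt, mrtt / 2.0
        else:
            srtt = (7.0 / 8.0) * srtt + (1.0 / 8.0) * mrtt       # Eq. (2)
            rttvar = (3.0 / 4.0) * rttvar \
                + (1.0 / 4.0) * abs(srtt - mrtt)                 # Eq. (3)
        rto = max(rto_min_ms, srtt + 4.0 * rttvar)               # Eq. (4)

# Stable ~0.4ms virtual-cluster RTTs, then one 60ms scheduling spike.
samples = [0.4] * 20 + [60.0]
for rto_min in (1.0, 200.0):
    spurious = [m for m, rto in rto_estimator(samples, rto_min) if m > rto]
    print(rto_min, spurious)
```

With RTOmin = 1ms the armed RTO has converged to about 1ms, so the 60ms spike fires a spurious timeout; with RTOmin = 200ms it does not, which mirrors the tradeoff visible in Fig. 7.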

TABLE I
THE STANDARD PROCEDURES OF THE FORWARD RTO-RECOVERY (F-RTO) ALGORITHM TO DETECT SPURIOUS RTOS [40].

STEP    ACTION
(1)     When an RTO happens, retransmit the first unACK'd segment; set ssthresh to half of the number of outstanding segments.
(2)     When the first ACK after the RTO arrives:
(2a)      If it is a duplicate ACK OR it acknowledges the whole outstanding window, follow conventional RTO recovery.
(2b)      (i) If it advances snd.una, transmit up to two new (previously unsent) segments and enter step (3). (ii) If the TCP sender cannot send (no new data or window-limited), it is recommended to enter RTO recovery.
(3)     When the second ACK after the RTO arrives:
(3a)      If it is a duplicate ACK, follow conventional RTO recovery.
(3b)      If it advances snd.una (i.e., it acknowledges data that was not retransmitted after the RTO), declare the RTO spurious.

As shown in Fig. 6, since the VM cannot know of the arrival of a packet until it is scheduled again, the time at which the VM receives the packet is not the time when the packet arrives at the driver domain. To solve the problem, one might be tempted to correct the timestamp of the ACK on reception in the device driver and expose the value to the guest OS. This is, however, problematic. First, in the ACK's header, the "timestamp" value is actually the echoed send time of the corresponding data packet, rather than the arrival time of the ACK packet itself. Second, since the hypervisor's system clock value differs from the VM's, it would be difficult (if not impossible) for the VM to interpret a clock value passed from the hypervisor.

C. Detect Spurious RTOs

In case a sudden delay occurs in the network, there is no known way to prevent the retransmit timer from expiring. But if spurious RTOs can be reliably detected, it is easy for TCP to restore the sending speed. In general, there are two well-known detection methods: F-RTO [40] and Eifel [30]. F-RTO was standardized in RFC 5682 and has been implemented in Linux, so we choose it for evaluation in this paper (Eifel will be discussed in §VIII). In Table I, we detail the standard procedures of F-RTO: it checks whether the first two incoming ACKs after the RTO can both advance the sending window, in steps (2) and (3); if yes, the RTO is declared spurious and the sender continues transmitting new data; otherwise, it reverts to the conventional go-back-N behavior, specifically in steps (2a), (2b-ii) and (3a), which can be very expensive.

We evaluate F-RTO's performance in the scenarios of both [3VMs→1VM] and [1VM→3VMs]. Unfortunately, F-RTO's success rate is very low: only ∼30% and ∼5% respectively. Our further study shows that most failures fall in step (2a): acknowledging the whole window.
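The step (2a) failure mode can be reproduced with a toy version of Table I's decision logic (a simplified sketch: real F-RTO inspects actual sequence numbers rather than boolean flags):

```python
def frto_classify(ack1, ack2, can_send_new=True):
    """Minimal sketch of the F-RTO decision from Table I (RFC 5682).
    ack1/ack2 describe the first two ACKs after the RTO; each is a
    dict with 'dup', 'whole_window' and 'advances' flags."""
    # Step (2a): duplicate ACK, or the whole outstanding window is
    # acknowledged at once -> conventional RTO recovery.
    if ack1['dup'] or ack1['whole_window']:
        return 'conventional (2a)'
    # Step (2b-ii): nothing new to send -> conventional recovery.
    if ack1['advances'] and not can_send_new:
        return 'conventional (2b-ii)'
    # Step (2b-i): send two new segments, then judge the second ACK.
    if ack2['dup']:
        return 'conventional (3a)'
    if ack2['advances']:
        return 'spurious (3b)'
    return 'conventional'

# A DelACK'd receiver may cover everything it received during the
# stall with ONE cumulative ACK, so F-RTO bails out at step (2a)
# even though the timeout was spurious -- the failure measured above.
delackd = {'dup': False, 'whole_window': True, 'advances': True}
normal  = {'dup': False, 'whole_window': False, 'advances': True}
print(frto_classify(delackd, normal))   # conventional (2a)
print(frto_classify(normal, normal))    # spurious (3b)
```

This is exactly the retransmission ambiguity discussed next: one coalesced ACK for many segments makes the first post-RTO ACK look like a whole-window acknowledgment.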
Going back to the algorithm, F-RTO actually expects the receiver to return the ACKs in a deterministic way, so as to advance the sending window in steps (2a) and (3b). However, due to the use of DelACK, TCP does not always acknowledge the sender immediately. By definition, DelACK allows the TCP receiver to return one ACK for at least two segments, with the delay not exceeding a predefined timeout value (RFC 1122). Linux TCP implements DelACK in a more conservative way: if no out-of-order packet arrives, an ACK is returned only when more than one full frame has been received and, at the same time, the receive window can be advanced (__tcp_ack_snd_check()). For the DelACK timer, Linux TCP adopts 200ms as the maximum timeout value. Recent studies [45], [51] suggest that DelACK can introduce latency problems for some applications in physical datacenters, and propose disabling DelACK or significantly reducing its timeout value, e.g., to 1ms; some other work [8] supports the use of DelACK to reduce the server's load. In a virtual cluster, we also care about how DelACK may affect F-RTO's performance. Besides the default DelACK policy ("delack 200ms"), in Fig. 8 we also examine the two other policies mentioned above ("delack 1ms" and "w/o delack").

[Fig. 8. F-RTO performs differently under different DelACK policies ("delack 200ms", "delack 1ms", "w/o delack"), in the scenarios [3VMs → 1VM] and [1VM → 3VMs]. We categorize the results according to the rules listed in Table I: (2a) ACK the whole window; (2b-ii) window-limited; (3a) duplicate ACK; (3b) successfully detected cases. The red bar represents F-RTO's success rate in detecting spurious timeouts. In each test, the sender VM transmits 4GB of data to the receiver VM.]

The results show the following. (1) Reducing the DelACK timeout value from 200ms to 1ms does not help F-RTO, and most failures are still in step (2a). F-RTO's argument for this rule is that "the RTO is caused due to lost retransmission, and the rest of the window was successfully delivered to the receiver before the RTO occurred" [40]. However, in our experiments this rule is very vulnerable to the ambiguity caused by DelACK: one ACK is returned for multiple segments. (2) Disabling DelACK greatly improves F-RTO's success rate (to ∼90%), regardless of which side suffers the VM scheduling delays. Step (3a) is seldom reached, because there is no packet loss under our experimental settings. As for the cases in step (2b-ii), we attribute them to the VMs' limited CPU allocations, which prevent TCP from processing the packets in time and lead to frequent window validation (more details are in the following paragraphs and §VII-B). We therefore argue that this recommended rule is too strict for TCP in consolidated VMs.

As mentioned above, disabling DelACK is hardly a panacea, because it would increase the host's load. In Fig. 9(b)(c)(e)(f), without DelACK, both the sender VM and the receiver VM incur very high CPU overhead; with the 200ms RTOmin in Fig. 9(a)(d), disabling DelACK yields ∼13% less goodput. In

[Fig. 9. Different DelACK policies ("delack 200ms", "delack 1ms", "w/o delack") lead to different network goodput and CPU utilization results, for RTOmin = 200ms and RTOmin = 1ms, regardless of whether VM consolidation happens on the sender side ([3VMs→1VM]) or on the receiver side ([1VM→3VMs]). The tests last for a sufficiently long time and the results are quite stable, so we do not plot the error bars. The VMs' CPU utilization is recorded using the xentop command in the driver domain every two seconds. With 3 VMs per physical core, each VM only gets 33% of the core, so the reported CPU utilization results are capped by this value. In each test, the sender VM transmits 4GB of data to the receiver VM.]

TABLE II
THE BREAKDOWN OF ACKS WHEN TRANSMITTING 4GB OF DATA UNDER DIFFERENT DELAYED ACK POLICIES, WITH RTOmin = 1MS.

[3VMs→1VM]     delack-200ms    delack-1ms    w/o delack
Total ACKs        229,650         244,757     2,832,260
QuickACK           49,565          55,434     2,832,260
DelACK                 38           7,785             0
CleanRcvBuf       178,763         174,645             0

virtualized clouds, when CPU cycles are shared by multiple VMs, the limited CPU allocation can easily become a bottleneck, which may in turn harm network performance. For example, with three VMs sharing one core, each VM gets only a 33% CPU allocation. Another observation is that when using the 1ms RTOmin in Fig. 9(c)(f), the sender VM's CPU load is much higher, and the receiver VM's CPU load much lower, than with the 200ms RTOmin in Fig. 9(b)(e). This is because the sender VM is frequently interrupted by timeouts and retransmissions, whereas the receiver VM cannot be fully saturated due to the sender's low transmission speed.

To better understand the relationship between the CPU overhead and the different DelACK policies, we instrument Linux TCP to report how ACKs are returned by the TCP receiver. Table II shows that without DelACK, the total number of ACKs generated by the receiver VM increases significantly: by 11×~13×. Each ACK potentially triggers a network interrupt (transmit and receive) on both sides, consuming CPU cycles. In our instrumentation, Linux TCP returns ACK packets in three situations: (1) in QuickACK mode, in which Linux TCP refrains from delaying the ACKs for the first few segments [41], so that the TCP sender can quickly increase the congestion window at the beginning of the connection; (2) when the DelACK timer expires; (3) when the receive buffer has been cleaned up (in tcp_recvmsg()). It is observed

[1VM→3VMs]     delack-200ms    delack-1ms    w/o delack
Total ACKs        252,278         262,274     2,832,179
QuickACK           39,890          44,256     2,832,179
DelACK                 22           7,960             0
CleanRcvBuf       209,507         201,897             0

that even with "delack-1ms", most ACKs are sent via CleanRcvBuf before the DelACK timer expires. This explains why significantly reducing the DelACK timeout value yields only a marginal performance improvement.

IV. PROBLEM ANALYSIS

Although VM scheduling can happen to both the sender VM and the receiver VM, their respective situations are somewhat different. Fig. 10 depicts the occurrence of RTOs in the case of [3VMs→3VMs]. When the sender VM is preempted, although the ACK arrives at the sender's side before the retransmit timer expires, an RTO still happens after the VM wakes up. On the other hand, when the receiver VM suffers a scheduling delay, the corresponding ACK can only be generated after the receiver VM wakes up, causing RTOs at the sender. Examining the two problems separately, Table III shows that an RTO happens only once at a time in the scenario of [3VMs→1VM], while in the scenario of [1VM→3VMs], successive RTOs are commonly seen even with TCP's exponential backoff strategy.

The problem on the sender side is more an OS problem than a networking problem, because the RTO is caused not by the ACK's arrival time but by the time at which it is received. As shown in Fig. 11(a), after the sender VM wakes up, both the timer interrupt and the network interrupt are pending. Ideally, TCP should drain the packets in the receive queue before

[Fig. 10 omitted: two aligned traces, time (ms) vs. sequence number from the sender VM and time (ms) vs. ACK number from the receiver VM. Legend: snd.una is the first sent but unacknowledged byte; snd.nxt is the next byte that will be sent. Annotated events: 1. The sender VM has been stopped. 2. An ACK arrives before the sender VM wakes up. 3. RTO happens just after the sender VM wakes up. 4. A dupACK is returned after receiving an old segment. 5. The receiver VM has been stopped. 6. RTO happens twice, before the receiver VM wakes up. 7. ACK(s) can be returned only after the receiver VM wakes up. 8. Receive two duplicate ACKs.]

Fig. 10. A microscopic view of TCP's behavior when VM scheduling delays happen to both sides successively [3VMs→3VMs], using tcpdump.

[Fig. 11 omitted: (a) a timeline across the driver domain and the sender VM showing the preemption, the buffered ACK, the timer IRQ firing the spurious RTO, and the network IRQ fetching the ACK just afterwards; (b) the CDF of the time interval (µs) between Step 1 (timer IRQ) and Step 2 (network IRQ).]

Fig. 11. When the sender VM has experienced hypervisor scheduling delays, after it wakes up, there are OS reasons for the RTO. (a) The ACK is received just after the TCP timer fires. (b) The time interval between Step 1 and Step 2.

making retransmission decisions. However, the OS by design serves the timer softirq before all other softirqs, because the timer is used to create timed events for many other kernel services; in contrast, much of the network processing is deferred to the bottom half. In Linux, the timer interrupt has a higher priority than the network interrupt. Therefore, when both are pending at the same time, the retransmit timer always fires before the ACK is fetched. Fig. 11(b) plots the CDF of the interval between the ACK's arrival time and the RTO's firing time: about 80% of the intervals fall within [10µs, 20µs], and the remaining 20% do not exceed 120µs, which is finer than the kernel's timer (1ms granularity in Linux) can distinguish.

TABLE III
RTOs IN DIFFERENT SCENARIOS, TRANSMITTING 4GB DATA.

Scenario      Time taken    1× RTOs  2× RTOs  3× RTOs  4× RTOs
[3VMs→1VM]    88 seconds    1061     0        0        0
[1VM→3VMs]    164 seconds   677      673      196      30

When the receiver VM suffers a scheduling delay, we argue that the resulting RTOs are justified, because the generation and return of the ACK are indeed too late. In this regard, we treat it as a non-OS problem, considering the delay on the receiver side simply part of the overall transmission delay, which happens to have a large variance. This can be likened to the situation of an unstable wireless path.

V. THE PVTCP DESIGN

We propose ParaVirtualized TCP (PVTCP) to address the problems above. As VM scheduling delays cannot always be avoided when one CPU core is shared by multiple VMs, our direction is to redesign the transport protocol so that it automatically tolerates such delays without bothering the hypervisor. Fig. 12 shows the architecture of PVTCP, which we explain in detail in the following subsections. Our solution is practical in several ways: (1) PVTCP is completely embedded in the guest OS, requiring no modification to the hypervisor; (2) PVTCP does not introduce any extra state or transition phase to standard TCP semantics; (3) PVTCP takes effect only when a scheduling delay from the hypervisor has occurred; when no such delay occurs, it automatically falls back to standard TCP.

A. Detect VM Scheduling Delays

"The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data." – Allman and Paxson [9]

Our analysis shows that pseudo-congestion happens only when the VM has experienced a scheduling delay. Therefore, if we can detect the moment the delay happens and make the guest OS aware of it, we have a chance to deal with the problem. Without help from the hypervisor, it is impossible for a VM to know in advance the moment at

[Fig. 12 omitted: architecture diagram. On each physical machine, Xen delivers periodic timer interrupts into the VMs; inside both the sender VM and the receiver VM, PVTCP sits between standard TCP and the timer path, while VM scheduling delays (tens of milliseconds) perturb the system-clock updates. The sender side comprises Detect Spikes, RTT Estimation, and RTO Management; the receiver side comprises Detect Spikes, Delayed ACK Management, and Explicit Delay Notification. The two machines are connected through a hardware switch.]

Fig. 12. PVTCP is a slim layer residing between the standard TCP and the system timer interrupt.

Algorithm 1: The method to detect VM scheduling delays.

    Global variable: jiffies
    Per-vCPU variables: local_jiffies, prev_local_jiffies, is_vcpu_wakeup
    for each virtual timer interrupt of one vCPU do
        /* Any vCPU can touch the global clock */
        Update the global jiffies if needed
        /* The local clock is touched only by the local vCPU */
        local_jiffies ← jiffies
        if local_jiffies > prev_local_jiffies + 1 then
            is_vcpu_wakeup ← true
        else
            is_vcpu_wakeup ← false
        prev_local_jiffies ← local_jiffies

[Fig. 13 omitted: timeline across the guest OS and the hypervisor. While the VM is running, virtual timer interrupts arrive every 1ms and each one increments jiffies by one. While the VM is not running, it is unable to set clock events to the hypervisor; the backlogged event is delivered after the VM wakes up, producing a single large jump (e.g., jiffies += 60).]

Fig. 13. VM scheduling delays bring spiked jiffies to the guest OS.

which it will be preempted. However, it is possible to detect the scheduling delay after the VM wakes up. Virtualization introduces a division of CPU time beyond the control of the VMs, and thus a timekeeping problem [5]. The Xen hypervisor uses a one-shot timer to provide the clock source to each VM. Normally, when a VM is running, it periodically registers clock events and receives virtual timer interrupts from Xen to continuously update its system clock (jiffies in Linux). However, if the VM has been descheduled, it cannot register clock events with the hypervisor, and the pre-registered event cannot be received until the VM is granted CPU cycles again. As shown in Fig. 13, when 3 VMs share one CPU core, the maximum scheduling delay can be 60ms, resulting in a sudden jump of 60 increments to jiffies (assuming the kernel's tick rate is 1000 Hz, as commonly adopted in modern OSes, especially 64-bit ones). Therefore, if jiffies is found not to be increasing continuously, it is a strong indication that there was a backlogged clock event and the VM has just experienced a scheduling delay. This problem, in which hypervisor scheduling prevents the guest OS from updating its system clock on time, is generic and not specific to the Xen hypervisor or Linux guests.

Monitoring spiked jiffies is simple and straightforward, and was adopted in our prior work [14]. However, this approach has one limitation: it cannot be applied to SMP-VMs that have multiple virtual CPUs (vCPUs). Since the hypervisor schedules vCPUs mostly independently (strict co-scheduling is unnecessary for SMP-VMs [6]), and every vCPU can update the system clock, the scheduling delays experienced by one vCPU cannot be completely reflected by

the changes of the global jiffies. To address this problem, we introduce a per-vCPU clock in the guest OS: after one vCPU wakes up, aside from updating the global clock, it also (and only) updates its local clock. In this way, the scheduling delays of the network-receiving vCPU can be detected by monitoring its local clock. The details are described in Algorithm 1.

B. Handle the Delays on the Sender Side

When the sender VM incurs a scheduling delay, our solution is not to detect the wrong RTO but to avoid it altogether, with help from the OS kernel. In this way, TCP does not have to go through the complicated paths of detecting and then recovering from the spurious RTO, which is considerably more expensive. The sender-side solution has already been introduced in [14].

1) Avoid Spurious RTOs: Suppose that TCP's retransmit timer is activated at time T_begin and is set to expire at time T_end; its active period can then be expressed as [T_begin, T_end]. In the case of no scheduling delays from the hypervisor:

    T_end = T_begin + RTO    (6)

Otherwise, the timer's expiry time is postponed:

    T_end = T_begin + max(Lat_sched_vm, RTO)    (7)

When the VM has detected a scheduling delay from the hypervisor, in order to avoid the "RTO-ACK-SpuriousRTO" problem shown in Fig. 11(a), PVTCP slightly extends the expiry time of the retransmit timer by a mere 1ms:

    T_end ← T_begin + max(Lat_sched_vm, RTO) + 1ms    (8)

This should be enough, because the ACK's delay caused by the late execution of the network IRQ is less than 120µs, as shown in Fig. 11(b). In this way, if the network interrupt does contain an ACK, the ACK gets an opportunity to reach the sender VM before the retransmit timer fires. After that, the retransmit timer will be reset with a new timeout value. On the other hand, if the ACK does not belong to the TCP flow that the expired timer is tracking, or no ACK has arrived at the driver domain due to real network congestion, TCP learns of the packet loss at jiffies+1 via an RTO event, and then retransmits the lost segments. Compared with the delays from the hypervisor scheduler (tens of milliseconds), the overhead of this temporary 1ms postponement is negligible.


Algorithm 2: The actions of PVTCP, when VM preemption happens to the sender VM.

    Per-vCPU variable: is_vcpu_wakeup
    RTOManagement::
    for each retransmit timeout in the guest OS do
        if is_vcpu_wakeup is true then
            Reset the timer to expire at jiffies+1; return
        else
            Follow standard TCP to process the RTO
    RTTMeasurement::
    for each received ACK packet do
        if is_vcpu_wakeup is true then
            measured_rtt_i ← smoothed_rtt_(i-1)
        else
            Follow standard TCP to measure the RTT

[Fig. 14 omitted: TCP header diagram with the EDN flag occupying one reserved bit next to the CWR/ECE/URG/ACK/PSH/RST/SYN/FIN flags.]

Fig. 14. TCP header with the EDN flag extension.

2) Filtering Out Contaminated RTTs: Having avoided the spurious RTO, the received ACK will be used by the sender VM to measure the RTT. However, its arrival has been delayed by the hypervisor, so it cannot truly reflect the congestion condition of the physical network:

    MRTT = max(Lat_sched_vm, trueRTT)    (9)

Since Lat_sched_vm is usually tens of milliseconds, which can be orders of magnitude greater than the physical network delay, it causes an acute increase in both SRTT and RTTVAR (Equations 2 and 3). As a result, the calculated RTO value seriously deviates from a reasonable value (Equation 4). In our solution, since we are able to detect the moments when spiked RTTs happen, the unreliable measurements can be identified and filtered away. PVTCP measures the RTT conservatively when the VM experiences a scheduling delay, by reusing the previously calculated smoothed RTT:

    MRTT_i ← SRTT_(i-1)    (10)

Once the contaminated RTTs are removed, TCP's low-pass filter functions gracefully to suggest RTO values. The details are described in Algorithm 2.

C. Handle the Delays on the Receiver Side

When the receiver VM has been preempted, the sender VM has no way to know about this event, because there is no explicit notification from the receiver's machine. Even if there were such a mechanism, it would be difficult to predict the receiver VM's wakeup time due to the hypervisor's scheduling dynamics; besides, the notification would have to traverse the physical network, adding an uncontrollable delay. Therefore, RTOs are inevitable on the sender side, and the sender VM has to detect spurious cases and then recover from them.

1) To Help the Sender Detect Spurious RTOs: Though disabling DelACK on the receiver side can help the sender's F-RTO detect spurious cases, the significantly increased CPU overhead can degrade the network's performance, as shown in Fig. 9. However, since such spurious RTOs on the sender side only happen during the receiver VM's preemption

periods, DelACK only needs to be temporarily disabled after the receiver VM wakes up. Recall that to declare an RTO spurious, F-RTO only checks the first two incoming ACKs (Table I). PVTCP therefore returns ACKs in a just-in-time fashion: after the receiver VM wakes up, the ACKs for the first three data segments are returned immediately, without any coalescing. After that, DelACK is enabled again, so the total number of generated ACKs does not substantially increase and the extra CPU overhead remains small.

2) Explicit Delay Notification (EDN): In §IV, we observed that when the receiver VM is preempted, successive RTOs are very common on the sender side (Table III); on each RTO, the earliest unACKed segment (snd.una) is retransmitted (Fig. 10, step 6). After the receiver VM wakes up, a duplicate ACK is returned for each retransmission even if no packet has been lost; once the sender receives three duplicate ACKs, it assumes that a segment has been lost and enters the fast-retransmit phase. This creates a problem in genuinely congested situations: the sender VM finds it difficult to distinguish whether the duplicate ACKs are caused by the retransmissions or by real packet losses. One possible approach is to silently drop the duplicate retransmitted segments (snd.una) when the receiver VM has just woken up, rather than acknowledging the sender in a way that can be confusing, as RFC 793 suggests: "if an incoming segment is not acceptable, an acknowledgment should be sent in reply". However, some algorithms may infer lost retransmissions by counting duplicate ACKs [25]. To help the sender identify unwanted duplicate ACKs, PVTCP adopts an additional bit in the TCP header's flags, called EDN (Explicit Delay Notification), as illustrated in Fig. 14.
This idea is inspired by ECN (Explicit Congestion Notification): when the packet queue in the hardware switch builds up beyond a certain threshold, packets are marked with ECN to indicate congestion, and the receiver returns this indication to the sender; similarly, if the receiver VM has been delayed by the hypervisor scheduler, after it wakes up it marks the backlogged packets with EDN to indicate the delay, and this indication is likewise returned to the sender VM. Once the sender VM receives a duplicate ACK carrying EDN, it knows that the ACK indicates the arrival of a previous retransmission rather than a real packet loss; the sender therefore does not enter the fast-retransmit phase to halve cwnd and ssthresh. The details are described in Algorithm 3. It should be noted that for the first three ACKs that are returned for the original


Algorithm 3: The actions of PVTCP, when VM preemption happens to the receiver VM.

    Per-vCPU variable: is_vcpu_wakeup
    Receiver-side action::
    if is_vcpu_wakeup is true then
        For the first three received data segments, return EDN ACK packets immediately
        if the end_seq is before rcv.nxt then
            Return a duplicate ACK with EDN flag
    else
        Follow standard TCP's DelACK routines
    Sender-side action::
    if the received duplicate ACK contains EDN flag then
        Do not enter fast-retransmit; discard the packet

[Fig. 15 omitted: goodput (Mbps) vs. the number of concurrent sending VMs (2-20), comparing TCP-200ms, TCP-100ms, TCP-10ms, TCP-1ms, and PVTCP-1ms.]

Fig. 15. [3VMs→3VMs]: with 1ms RTOmin, PVTCP is able to deal with both pseudo-congestion and real network congestion.

transmissions, we also add the EDN flag to the ACKs. This is just for integrity checking after the sender enters F-RTO: in a virtual cluster, the major source of sudden delays that cause spurious RTOs should be VM scheduling delays, so the ACKs received by F-RTO are also expected to carry the EDN flag. Regarding RTT measurement, since standard TCP does not use the ACKs for retransmitted segments to calculate RTTs (Karn's algorithm), we do not need any special handling.

VI. IMPLEMENTATION

We have implemented PVTCP in Linux 3.8.8 (serving as the guest OS). The implementation aims at a minimal footprint, reusing existing Linux code as much as possible; it comprises fewer than 400 lines of code. Inside a Xen guest OS, each vCPU handles virtual timer interrupts from the Xen hypervisor via the registered handler xen_timer_interrupt(). In this handler, the kernel calls do_timer() to update the global system clock jiffies. To keep track of the local clock, we update our per-vCPU variable local_jiffies in this same function. After updating the system clock, the kernel raises TIMER_SOFTIRQ to check whether any timers on the local vCPU have expired; TCP's retransmit timer is one of them. If a VM scheduling delay has been detected, the PVTCP sender does not retransmit any segment in tcp_retransmit_timer(); it just resets the retransmit timer to expire at jiffies+1 and returns. Soon after that, if an ACK is received (one that might have been delayed by the hypervisor's scheduler), the PVTCP sender will not use it to measure the RTT in tcp_rtt_estimator(). When the PVTCP receiver suffers a VM preemption, after it wakes up, it counts the number of received data segments in tcp_rcv_established(). For the first three segments, the corresponding EDN ACKs are returned immediately, without checking DelACK's conditions in __tcp_ack_snd_check(). If an obsolete data segment is received in tcp_validate_incoming(), the returned duplicate ACK is also marked with the EDN flag in tcp_send_ack(). When the PVTCP sender receives such an EDN duplicate ACK in tcp_ack(), fast-retransmit is not triggered.

VII. EVALUATION

In our experiments, we set up as many as 20 sender VMs to simultaneously transmit data to one receiver VM. Each VM is configured with 512MB of memory and one vCPU. The other experimental settings are described in §II-B. We vary RTOmin between 200ms and 1ms to compare the performance of PVTCP with standard TCP:
• TCP-200ms/100ms/10ms/1ms: a large RTOmin value is better at tolerating pseudo-congestion than a small one, but it is less capable of dealing with throughput collapse when the number of VM senders is large.
• PVTCP-1ms: it is expected to deal with both problems, regardless of the number of VM senders.
DelACK is enabled in our experiments to reduce CPU overhead, but its timeout value is set to 1ms to avoid sporadic long delays. As for the response to a spurious RTO, Linux TCP implements two policies: (1) the conservative response reduces both the congestion window (cwnd) and the slow-start threshold (ssthresh) by half; (2) the aggressive response restores cwnd and ssthresh to their original values. To avoid unnecessary performance loss, we adopt the second policy.

A. Goodput

When VM consolidation happens on both the sender side and the receiver side, Fig. 15 evaluates PVTCP's effectiveness in dealing with the goodput problem. With standard TCP, it is difficult to choose one RTOmin that fits all scenarios: a value too small suffers from pseudo-congestion, and a value too large misses real congestion. In contrast, PVTCP-1ms outperforms TCP-200ms/100ms/10ms/1ms in almost all cases. This property is important in practice, as users are freed from worrying about an improper RTOmin.

1) Sender-side solution: With only the sender VMs being consolidated, we evaluate the approaches proposed in §V-B. Fig. 16 shows that without the RTT filter, PVTCP-1ms performs even worse than TCP-1ms as the number of senders increases: there is 7.2% less goodput with 20 VM senders.
Selecting the case of one VM sender (with only pseudo-congestion), Fig. 17 illustrates how the RTT filter works. When VM scheduling delays are included in the measured RTTs, the calculated RTO values (based on TCP's low-pass filter [23]) are seriously inflated in Fig. 17(a). This inflation would make


[Fig. 16 omitted: goodput (Mbps) vs. the number of concurrent sending VMs (2-20), comparing TCP-200ms/100ms/10ms/1ms with PVTCP-1ms with and without the RTT filter.]

Fig. 16. In the scenario of [3VMs→1VM], PVTCP avoids RTOs when the sender VM wakes up; further, the RTT filter helps eliminate the unreliable RTTs that have been contaminated by VM scheduling delays.

[Fig. 17 omitted: measured_rtt and calculated_rto (ms) over a 1000ms interval. (a) RTOmin=1ms, w/o RTT filter. (b) RTOmin=1ms, w/ RTT filter.]

Fig. 17. The RTT filter helps PVTCP avoid the sudden inflation of retransmit timeout values, by eliminating the unreliable RTT measurements.

[Fig. 18 omitted: goodput (Mbps) vs. the number of concurrent sending VMs (2-20), comparing TCP-200ms/100ms/10ms/1ms with PVTCP-1ms with and without EDN.]

Fig. 18. In the scenario of [1VM→3VMs], by temporarily disabling DelACK on the receiver side, the un-coalesced ACKs help the sender detect spurious RTOs and then restore the sending speed; further, EDN helps avoid false triggering of fast-retransmit on the sender side.

TABLE IV
THE EFFECTIVENESS OF EDN, TRANSMITTING 4GB DATA. EDN HELPS THE SENDER AVOID ENTERING FAST-RETRANSMIT FALSELY.

[1VM→3VMs]                 w/o EDN      w/ EDN
Throughput                 512.5 Mbps   534.2 Mbps
RTO events                 1257         1173
Retransmitted segments     1464         1173
Times of fast retransmit   25           0

the TCP sender wait longer to arrange retransmissions when real packet loss happens. In contrast, with our RTT filter in Fig. 17(b), the measured RTTs comprise only packet transmission delays and packet processing delays, and PVTCP delivers the highest performance over the whole range.

2) Receiver-side solution: With only the receiver VM being consolidated, we evaluate the approaches proposed in §V-C. Fig. 18 shows that by temporarily disabling DelACK, most of the performance loss can be avoided; but without EDN, there is a slight yet visible performance decrease over the whole range. Table IV shows the scenario of one sender: without EDN, goodput decreases by about 4%, with 25% more retransmissions. In Linux, every duplicate ACK causes the TCP sender to enter tcp_fastretrans_alert() to infer packet loss; when the same segment has been retransmitted more than three times (due to successive RTOs, as shown in Fig. 10 and Table III), the returned duplicate ACKs cause the sender to enter the fast-retransmit phase, leading to false retransmissions; these false retransmissions in turn trigger more duplicate ACKs from the receiver. With EDN, these duplicate ACKs are marked, so the sender does not treat them as a congestion signal. It should be stressed that the EDN flag has a deployment cost: both the sender and the receiver must be modified. Even without EDN, however, PVTCP loses little performance, because the timely returned ACKs have already helped the sender-side F-RTO detect most spurious RTOs.

B. Congestion Window

To better illustrate the behavior of PVTCP, we select the tests with one sender to show cwnd and ssthresh. When the sender VM is consolidated in Fig. 19(a), cwnd is unable to

grow over 30 with TCP-1ms, because of the sharp reduction after each RTO; worse still, the reduced ssthresh causes the sender to enter congestion avoidance prematurely, preventing cwnd from growing more quickly. With PVTCP-1ms: (1) the sender never enters congestion avoidance even after experiencing VM scheduling delays, because all RTOs have been properly avoided; (2) since the sender stays in the slow-start phase, ssthresh keeps its initial value (in Linux, TCP_INFINITE_SSTHRESH is 0x7fffffff), which exceeds the plot range of Fig. 19(a); (3) cwnd stays in a plateau, limited by congestion window validation (RFC 2861), because the sender VM cannot fully utilize its congestion window under its limited allocation of CPU cycles. This phenomenon is quite different from the familiar sawtooth pattern, in which the output queue of the hardware switch grows until packets are dropped. In virtualized datacenters, the limited CPU allocation of a VM can easily become another bottleneck for network processing.

In Fig. 19(b), when RTOs are caused by VM scheduling delays on the receiver side, PVTCP helps the sender detect most of the spurious RTOs, so ssthresh and cwnd can be properly restored instead of being frequently cut down as with TCP. After F-RTO has declared a timeout to be spurious, the sender continues transmitting new data following the congestion avoidance algorithm. The plateau value of ssthresh reflects the receiver VM's capability to process network packets.

Slow-Start Restart (SSR). VM scheduling delays can cause a flow to stall for a certain period. RFC 2861 suggests using SSR after a long sending pause in an open connection, which reinitializes cwnd:

    if ((now - last_send_time) > RTO)
        tcp_cwnd_restart();

In the scenario of [3VMs→1VM], we find that SSR reduces goodput by 4%∼6% with PVTCP-1ms. As shown


[Fig. 19 omitted: cwnd and ssthresh (in segments) over 1000ms, for TCP and PVTCP. (a) [3VMs→1VM], RTOmin=1ms. (b) [1VM→3VMs], RTOmin=1ms.]

Fig. 19. When there is VM consolidation on the sender side, the PVTCP sender can avoid RTOs; when VM consolidation happens on the receiver side, the PVTCP receiver can help the sender detect spurious RTOs.

[Fig. 20 omitted: pvtcp-cwnd (in segments) over 1000ms with SSR enabled.]

Fig. 20. TCP congestion window variation with SSR enabled, [3VMs→1VM].

in Fig. 20, the congestion window is usually restarted after the sender VM wakes up, because the VM's blocking period (tens of milliseconds) can be much larger than the predetermined RTO value. For this reason, we did not enable SSR in the above experiments. In physical datacenters, it has likewise been suggested to disable SSR, to avoid unnecessary delay when applications transfer a large amount of data after an idle period over a persistent connection [51]. In the scenario of [1VM→3VMs], the negative impact of SSR is much less apparent. This is because the sender often experiences several RTOs before the receiver VM wakes up (Fig. 10 and Table III); after each timeout, the next RTO value is doubled (exponential backoff), and last_send_time is advanced after each RTO-triggered retransmission.

C. CPU Overhead

Fig. 21 shows the CPU overhead with one VM sender and one VM receiver. In the tests of [3VMs→1VM] with PVTCP-1ms, the CPU overhead of both the sender VM and the receiver VM is almost identical to that of TCP-200ms, because there are no RTOs. In the tests of [1VM→3VMs], the sender VM of PVTCP-1ms consumes slightly fewer CPU cycles than that of TCP-1ms but more than that of TCP-200ms; we attribute the increased overhead to the unavoidable RTOs. For the receiver VM, in both Fig. 21(a) and (b), TCP-1ms cannot transmit data at high speed due to its small sending window, so its CPU utilization is low. In contrast, both TCP-200ms and PVTCP-1ms utilize CPU cycles more efficiently by feeding the receiver VM with as much data as they can transmit.

[Fig. 21 omitted: CPU utilization (%) of the sender VM and the receiver VM under TCP-200ms, TCP-1ms, and PVTCP-1ms. (a) [3VMs→1VM]. (b) [1VM→3VMs].]

Fig. 21. The CPU utilizations of PVTCP and TCP, when RTOmin=1ms.

TABLE V
TOTAL ACKs SENT BY THE RECEIVER VM, TRANSMITTING 4GB DATA.

             [3VMs→1VM]   [1VM→3VMs]
TCP-200ms    192,587      194,384
TCP-1ms      244,757      262,274
PVTCP-1ms    192,863      208,688

Table V shows the total ACKs generated in different scenarios. Compared with TCP-200ms, TCP-1ms triggers 27.1% and 35.2% more ACKs in the tests of [3VMs→1VM] and [1VM→3VMs] respectively, because (1) there are many unnecessary retransmissions, and each retransmission induces a duplicate ACK; (2) the sender enters slow-start after each RTO, so the receiver has to switch to QuickACK mode frequently. PVTCP-1ms generates almost the same number of ACKs as TCP-200ms in the scenario of [3VMs→1VM], and only 7.4% more in the scenario of [1VM→3VMs], due to temporarily disabling DelACK. Compared with the overhead of completely disabling DelACK (11∼13× more ACKs, Table II), PVTCP is very lightweight.

VIII. DISCUSSION AND FUTURE WORK

Eifel [30] uses the TCP timestamp option to distinguish between the ACKs for original transmissions and the ACKs for retransmissions. The TCP timestamp option has been widely deployed and is enabled in Linux by default. However, it has been reported that Eifel performs much worse than F-RTO and conventional go-back-N recovery when bursty loss happens [40]. Besides, it was not chosen by Linux because it introduces a rather complex set of equations to calculate the RTO, which makes it difficult to evaluate the possible side effects in different network scenarios [41]. More importantly, it also suffers from the retransmission ambiguity arising from DelACK, because the


receiver does not return a timestamp for every segment, but only the timestamp of the earliest unACKed segment. PVTCP is independent of the detection algorithm used by the sender. By returning ACKs in a deterministic way after the receiver VM wakes up, it can assist both Eifel and F-RTO in eliminating the retransmission ambiguity.

The main focus of this paper is the impact of VM scheduling delays on TCP's goodput under network congestion. Intuitively, the improved goodput should also lead to reduced flow completion times. Typical datacenter traffic contains predominantly small flows, and there are various contributors to a flow's overall delay, such as retransmission delays from the transport layer, VM scheduling delays from the hypervisor, and queueing delays from various buffers (the hardware switch buffer, Fig. 6's buffer, etc.). It will be meaningful future work to investigate to what extent our solution benefits small flows under more realistic application workloads.

IX. RELATED WORK

A. Datacenter Network Congestion Control

Application-layer approaches. The incast congestion problem can be prevented in applications, typically by restricting the number of synchronous requests [27], [38], [39], [33]. However, since a datacenter needs to support a large number of applications that cannot be easily modified, a transport-layer solution is preferable in many scenarios.

TCP-layer approaches. A microsecond-granularity retransmit timer (200µs) is proposed in [45], but there is a safety concern with this approach: we find that this value interacts poorly with the Linux kernel's NAPI implementation in the bnx2 network device driver, easily leading to a kernel panic. Therefore we adopt 1ms in our work.
ICTCP [48] adjusts the receive window to limit the bandwidth of each TCP sender before congestion has a chance to happen; however, in distributed and anonymously shared networks (e.g., public clouds), since each host has no idea how the others will behave, network congestion is essentially inevitable. The authors of [42] propose to customize TCP in its initiation, continuation, and termination phases, an approach that is complicated and lacks general applicability. GIP [53] observes that RTOs mainly occur at the boundaries of stripe units and therefore redundantly transmits the last packet of each stripe unit, but the redundant transmissions may be unnecessary when network congestion is not serious.

Network-assisted approaches. Increasing the switch buffer size is one possible solution, but it is expensive, and any particular switch configuration imposes a limit on the maximum number of servers that can send simultaneously before throughput collapse occurs [37]. DCTCP [8] leverages the ECN capability of hardware switches to perform fine-grained congestion window adjustment, but its experiments show that when the number of senders is large, DCTCP also needs to work with a small retransmit timeout value. D2TCP [44] is built upon DCTCP but considers flow deadlines when modulating the congestion window. D3 [47] and PDQ [21] require rate limiting at switches to implement

deadline-aware transmission control. QCN is also a promising technique, but when multiple flows share the bottleneck link, its unfairness may cause TCP throughput collapse [54].

In virtualized environments we face a unique challenge in dealing with network congestion: because of VM scheduling delays, TCP in VMs cannot obtain reliable congestion information about the physical network from RTTs and RTOs. We therefore believe that solutions designed for physical clusters should not be applied directly to virtual clusters without properly addressing VM scheduling delays.

B. TCP in Virtual Machines

In [34], the authors propose to offload the whole socket layer into the hypervisor. However, this would greatly complicate a key feature of virtualization, VM live migration [15], because it introduces the problem of residual dependencies, well known from process migration [31]. PVTCP does not have this problem, as the hypervisor is oblivious to it. To shorten the aforementioned idle-link period (§II-D), vSnoop [24] lets the driver domain acknowledge the sender on behalf of the receiver VM. Symmetrically, vFlood [18] encourages the sender VM to overfeed packets into the driver domain, which transmits them on behalf of the sender VM. Both methods are very similar to split-TCP [26]: the TCP connection between VMs is divided into two connections, with the driver domain serving as the proxy point. Our work complements theirs by addressing the problem at another layer: VM scheduling delays can also distort the transport layer's behavior.

X. CONCLUSION

Although virtualization technology has been in development for many years, many of its performance problems are still not fully understood or satisfactorily tackled.
Unlike previous efforts that modified the hypervisor to hide the virtualization reality from the guest OS, we look at the problem from a different but simpler angle: we let the guest OS face the reality that each vCPU can be preempted at any time, and then ask how it can live with that reality. Focusing on the transport layer, we illustrate how VM scheduling may distort network congestion control. Based on a detailed analysis of pseudo-congestion, we propose an approach to paravirtualizing TCP. PVTCP effectively filters out the distortions that VM scheduling introduces into RTT measurement and RTO triggering. Experimental results with our prototype implementation in Linux guests show that PVTCP is better able to deal with spurious RTOs, which in turn helps avoid goodput collapse. We believe PVTCP has the potential to support more efficient communication in virtualized clouds.

REFERENCES

[1] Amazon EC2 instance types: http://aws.amazon.com/ec2/instance-types/.
[2] ns-3 network simulator: https://www.nsnam.org/.
[3] Xen credit scheduler: http://wiki.xen.org/wiki/Credit_Scheduler.
[4] The architecture of VMware ESXi. VMware White Paper, 2008.
[5] Timekeeping in VMware virtual machines. VMware Information Guide, 2011.
[6] The CPU scheduler in VMware vSphere 5.1. VMware Technical White Paper, 2013.


[7] I. F. Akyildiz, G. Morabito, and S. Palazzo. TCP-Peach: a new congestion control scheme for satellite IP networks. IEEE/ACM Transactions on Networking, 9(3):307–321, 2001. [8] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM, 2010. [9] M. Allman and V. Paxson. On estimating end-to-end network path properties. In SIGCOMM, 1999. [10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, 2003. [11] S. K. Barker and P. Shenoy. Empirical evaluation of latency-sensitive application performance in the cloud. In MMSys, 2010. [12] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP incast throughput collapse in datacenter networks. In WREN, 2009. [13] L. Cheng and C.-L. Wang. vBalance: using interrupt load balance to improve I/O performance for SMP virtual machines. In ACM SoCC, 2012. [14] L. Cheng, C.-L. Wang, and F. C. M. Lau. PVTCP: Towards practical and effective congestion control in virtualized datacenters. In ICNP, 2013. [15] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In NSDI, 2005. [16] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008. [17] Y. Dong, X. Yang, J. Li, G. Liao, K. Tian, and H. Guan. High performance network virtualization with SR-IOV. JPDC, 72(11):1471– 1480, 2012. [18] S. Gamage, A. Kangarlou, R. R. Kompella, and D. Xu. Opportunistic flooding to improve TCP transmit performance in virtualized clouds. In ACM SoCC, 2011. [19] T. Goff, J. Moronski, D. S. Phatak, and V. Gupta. Freeze-TCP: A true end-to-end TCP enhancement mechanism for mobile environments. In INFOCOM, 2000. [20] Z. Hill, J. Li, M. Mao, A. Ruiz-Alvarez, and M. Humphrey. Early observations on the performance of Windows Azure. 
In HPDC, 2010. [21] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing flows quickly with preemptive scheduling. In SIGCOMM, 2012. [22] Y. Hu, X. Long, J. Zhang, J. He, and L. Xia. I/O scheduling model of virtual machine based on multi-core dynamic partitioning. In HPDC, 2010. [23] V. Jacobson. Congestion avoidance and control. In SIGCOMM, 1988. [24] A. Kangarlou, S. Gamage, R. R. Kompella, and D. Xu. vSnoop: Improving TCP throughput in virtualized environments via acknowledgement offload. In SC, 2010. [25] B. Kim and J. Lee. Retransmission loss recovery by duplicate acknowledgment counting. IEEE Communications Letters, 8(1):69–71, 2004. [26] S. Kopparty, S. Krishnamurthy, M. Faloutsos, and S. Tripathi. Split TCP for mobile ad hoc networks. In GLOBECOM, 2002. [27] E. Krevat, V. Vasudevan, A. Phanishayee, D. Andersen, G. Ganger, G. Gibson, and S. Seshan. On application-level approaches to avoiding TCP throughput collapse in cluster-based storage systems. In PDSW, 2007. [28] M. Lee, A. S. Krishnakumar, P. Krishnan, N. Singh, and S. Yajnik. Supporting soft real-time tasks in the Xen hypervisor. In VEE, 2010. [29] J. Liu and S. Singh. ATCP: TCP for mobile ad hoc networks. IEEE Journal on Selected Areas in Communications, 19(7):1300–1315, 2001. [30] R. Ludwig and R. H. Katz. The Eifel algorithm: making TCP robust against spurious retransmissions. SIGCOMM CCR, 30(1):30–36, Jan. 2000. [31] D. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration. ACM Comput. Surv., 32(3):241–299, 2000. [32] D. Nagle, D. Serenyi, and A. Matthews. The Panasas ActiveScale storage cluster: Delivering scalable high bandwidth storage. In SC, 2004. [33] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, et al. Scaling Memcache at Facebook. In NSDI, 2013. [34] A. Nordal, A. Kvalnes, and D. Johansen. Paravirtualizing TCP. In VTDC workshop, 2012. [35] D. Ongaro, A. L. Cox, and S. Rixner.
Scheduling I/O in virtual machine monitors. In VEE, 2008. [36] B. Pfaff, J. Pettit, K. Amidon, M. Casado, T. Koponen, and S. Shenker. Extending networking into the virtualization layer. In HotNets, 2009. [37] A. Phanishayee, E. Krevat, V. Vasudevan, D. Andersen, G. Ganger, G. Gibson, and S. Seshan. Measurement and analysis of TCP throughput collapse in cluster-based storage systems. In FAST, 2008.

[38] M. Podlesny and C. Williamson. An application-level solution for the TCP-incast problem in data center networks. In IWQoS, 2011. [39] M. Podlesny and C. Williamson. Solving the TCP-incast problem with application-level scheduling. In MASCOTS, 2012. [40] P. Sarolahti, M. Kojo, and K. E. E. Raatikainen. F-RTO: an enhanced recovery algorithm for TCP retransmission timeouts. SIGCOMM CCR, 33(2):51–63, 2003. [41] P. Sarolahti and A. Kuznetsov. Congestion control in Linux TCP. In USENIX ATC, 2002. [42] A. S.-W. Tam, K. Xi, Y. Xu, and H. J. Chao. Preventing TCP incast throughput collapse at the initiation, continuation, and termination. In IWQoS, 2012. [43] Y. Tian, K. Xu, and N. Ansari. TCP in wireless environments: problems and solutions. IEEE Communications Magazine, 43(3):S27–S32, 2005. [44] B. Vamanan, J. Hasan, and T. N. Vijaykumar. Deadline-aware datacenter TCP (D²TCP). In SIGCOMM, 2012. [45] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and effective fine-grained TCP retransmissions for datacenter communication. In SIGCOMM, 2009. [46] G. Wang and T. Ng. The impact of virtualization on network performance of Amazon EC2 data center. In INFOCOM, 2010. [47] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron. Better never than late: meeting deadlines in datacenter networks. In SIGCOMM, 2011. [48] H. Wu, Z. Feng, C. Guo, and Y. Zhang. ICTCP: incast congestion control for TCP in data center networks. In CoNEXT, 2010. [49] C. Xu, S. Gamage, H. Lu, R. Kompella, and D. Xu. vTurbo: Accelerating virtual machine I/O processing using designated turbo-sliced core. In USENIX ATC, 2013. [50] C. Xu, S. Gamage, P. N. Rao, A. Kangarlou, R. R. Kompella, and D. Xu. vSlicer: latency-aware virtual machine scheduling via differentiated-frequency CPU slicing. In HPDC, 2012. [51] M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim.
Profiling network performance for multi-tier data center applications. In NSDI, 2011. [52] J. Zhang, F. Ren, and C. Lin. Modeling and understanding TCP incast in data center networks. In INFOCOM, 2011. [53] J. Zhang, F. Ren, L. Tang, and C. Lin. Taming TCP incast throughput collapse in data center networks. In ICNP, 2013. [54] Y. Zhang and N. Ansari. On mitigating TCP incast in data center networks. In INFOCOM Mini-Conference, 2011.

Luwei Cheng received his B.Eng. degree from Harbin Institute of Technology in 2009 (ranked 1st out of 186), and his M.Phil. degree from The University of Hong Kong in 2012. He is currently a Ph.D. candidate in the Department of Computer Science, The University of Hong Kong. His research interests are mainly in virtualization technologies for cloud datacenters, including hypervisor and guest operating system design. He received the Best Student Paper Award at the IEEE/ACM UCC conference in 2011, the Hong Kong PhD Fellowship in 2012, and the Microsoft Research Asia Fellowship in 2013.

Francis C.M. Lau (SM) received his PhD in computer science from the University of Waterloo in 1986. He has been a faculty member of the Department of Computer Science, The University of Hong Kong since 1987, where he served as department chair from 2000 to 2005 and is now an Associate Dean of the Faculty of Engineering. He was an honorary chair professor in the Institute of Theoretical Computer Science of Tsinghua University from 2007 to 2010. His research interests include computer systems and networking, algorithms, HCI, and the application of IT to the arts. He is the editor-in-chief of the Journal of Interconnection Networks.
