Lowering Inter-Datacenter Bandwidth Costs via Bulk Data Scheduling

Thyaga Nandagopal
National Science Foundation, Arlington, VA, USA

Krishna P. N. Puttaswamy
Bell Labs, Alcatel-Lucent, Murray Hill, NJ, USA
Abstract—Cloud service providers (CSPs) of today operate multiple data centers, over which they provide resilient infrastructure, data storage and compute services. The links between data centers have very high capacity, and are typically purchased by the CSPs under established billing practices, such as 95th-percentile or average-usage billing. These links serve both client traffic and CSP-specific bulk data traffic, such as backup jobs. Past studies have shown a diurnal pattern of traffic over such links. However, CSPs pay for the peak bandwidth, which implies that they are underutilizing the capacity they have paid for. We propose a scheduling framework that considers the various classes of jobs encountered over such links, and propose GRESE, an algorithm that attempts to minimize the overall bandwidth costs to the CSP by leveraging the flexible deadlines of these bulk data jobs. We demonstrate that the problem is not a simple extension of any well-known scheduling problem, and show how the GRESE algorithm is effective in curtailing CSP bandwidth costs.

I. INTRODUCTION

A Cloud Service Provider (CSP) operates its cloud infrastructure over multiple data centers (DCs). Connectivity between these data centers is provided by very high capacity links. While some large providers such as Google or Amazon might own some of the links between their data centers outright, in the bulk of cases the links are leased by the CSPs from network operators. The cost of the bandwidth consumed typically follows the 95th-percentile model [1], which in essence charges the CSP for the 95th-percentile value of the bandwidth consumed on the link, or one of several other billing models [2]. Traffic on such links follows a well-known diurnal pattern [1]. Given that CSPs are paying for peak usage¹, a lot of bandwidth capacity is left unused on the inter-DC links.

In order to see how this might be avoided, let us consider the nature of the traffic over such links. The links between the data centers carry both client traffic and DC-to-DC traffic (referred to as D2D traffic in [3]). Based on a study of five Yahoo! data centers, the authors in [3] find that D2D traffic is nearly constant over the duration of the day and occupies as much as 40% of the inter-DC link bandwidth. The client traffic, which the CSP has no control over, exhibits a strong diurnal pattern.

¹Unless explicitly specified, the term peak bandwidth can refer to either the 95th percentile or the absolute maximum bandwidth.

Fig. 1. Overview.

Such client traffic has to be sent over the link as soon as it arrives. However, the D2D traffic is primarily composed of delay-tolerant jobs, such as backup traffic, database operations, video processing, analytics, etc. These jobs, while delay tolerant, are typically very large and must still be completed within a reasonable amount of time, say within a span of a few hours to days. We call this traffic Bulk Data (BD), and the notable characteristic of such jobs is their long deadlines.

One way of exploiting the long deadlines of BD jobs is to move them into the off-peak durations on the inter-DC link, as shown in Figure 1. This practice is known as water-filling, and several papers have attempted to solve this problem [1], [4]. They assume that the BD jobs are elastic and pre-emptable, i.e., any single job can be stopped and started at will, and can be commanded to occupy any amount of bandwidth specified by the CSP. It is easy to argue that there is no practical example of an application that is truly elastic and pre-emptable, especially at the level of data centers, where the bandwidths span several gigabits per second and the time scales are of the order of several minutes. For example, it is well known [5] that TCP throughput is upper-bounded by MSS/(RTT·√p), where p is the packet loss probability, MSS is the maximum segment size and RTT is the round-trip delay. This implies that the maximum TCP throughput for a connection between New York and Los Angeles, with an MSS of 1460 bytes, an RTT of 40 ms, and a loss rate of 0.01% (p = 0.0001), is about 29.2 Mbps.
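As a quick sanity check of this bound, the snippet below (ours, not part of the paper) reproduces the 29.2 Mbps figure:

from math import sqrt

def tcp_throughput_bound_mbps(mss_bytes, rtt_s, loss_prob):
    # Mathis et al. bound: throughput <= MSS / (RTT * sqrt(p)).
    bits_per_second = (mss_bytes * 8) / (rtt_s * sqrt(loss_prob))
    return bits_per_second / 1e6

print(tcp_throughput_bound_mbps(1460, 0.040, 0.0001))  # -> 29.2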


Apart from backup traffic, which could be considered somewhat elastic and pre-emptable within the maximum throughput, most of these BD workloads are not completely elastic. Many of these jobs (e.g., video processing, analytics) are constrained by the server resources at either end, and can only be served at a specific rate. If more bandwidth is given to such inelastic jobs, they will not be able to consume it, while if less bandwidth is given to them, the server resources on either end will starve and operate at less than peak efficiency.

Consider a scenario where a movie rental company (such as Netflix) runs a transcoding application in a multi-tenant data center (such as Amazon) [6]. This application streams the raw movie from a media company, transcodes it on the fly into different formats of different quality, and writes out the final movie. If the movie is streamed at a near-constant rate that an instance of this application can process, then all the resources on the receiver side are ideally utilized. Streaming data at a higher rate would require a more complicated application architecture that can store the data temporarily (thus costing the tenant more), and streaming at a lower rate would likewise waste resources, again costing the tenant more. Especially in a multi-tenant data center, providing a continuous, near-constant rate can lead to high utilization of all the resources and reduce the cost for a large number of tenants performing such transcoding, analytics or processing on a data stream from a different data center.

In a data center setting, the bandwidth that an instance can handle is governed by various factors, such as the VMs colocated on the same host or connected to the same switch, or the disk or memory bandwidth available to the instance. In such settings, even jobs that are typically considered elastic (such as backup or data movement) have a limit beyond which they cannot consume bandwidth. For example, an FTP session cannot arbitrarily switch between periods of very high and zero bandwidth, as this will clearly lead to timeouts and subsequent underutilization of the bandwidth and the VM resources. Therefore, the change in bandwidth across successive intervals needs to be bounded for specific applications, and such a characterization has to be respected by a BD scheduling policy. Unfortunately, none of the prior job-scheduling approaches can schedule jobs of these new classes.

In this paper, we propose GRESE (Greedy Residual Scheduler), an algorithm that attempts to minimize the overall bandwidth costs to the CSP across these new classes of jobs. Our contributions are as follows.

1) We identify specific characteristics of jobs that describe how elastic the bandwidth assigned to them can be. We show these can be used to represent bulk data transfers between data centers.

2) We show why existing algorithms are insufficient to handle such jobs based simply on predicted bandwidth. We demonstrate how our GRESE algorithm can handle this diverse set of jobs while accounting for bandwidth prediction errors, using qualitative and quantitative analysis.

3) We show the performance of GRESE over real-life traces gathered from a CDN, as well as representative synthetic data sets. We show that GRESE reduces the bandwidth consumption by a significant margin.

We first take a look at related work in this space.

II. RELATED WORK

Job scheduling of bulk data is a very well-known problem, and has been extensively studied in the context of parallel processing, as well as of water-filling traffic to use up wasted bandwidth. Given the vast amount of work in this space, we present a summary of only the most relevant work here. In this discussion, the term job describes a data transfer between two hosts in different data centers.

The problem considered here is that of scheduling continuous job streams, for which the classical work of Bender et al. [7] defines the notion of max-stretch. Max-stretch is defined as the slow-down factor of a job relative to the time it takes to complete on an unloaded system. Among works that provide bounds on the max-stretch for a given set of jobs, the most closely related is by Goel et al. [8], who consider how much additional capacity is needed to guarantee a certain value of max-stretch. While they do not consider deadlines, the problem can be mapped to deadline scheduling by converting the max-stretch into a fixed deadline for all jobs. In order to use max-stretch as a metric, the capacity available to complete the job in an unloaded system must be a non-increasing function for the duration of the job. In the problem that we consider, this is not the case, as the residual capacity left over for BD jobs can vary in any manner.

The other class of papers related to the scheduling problem discussed here are the water-filling algorithms [1], [4], where there is a residual capacity function available for scheduling BD jobs. In [4], the authors have the objective of determining the minimal excess bandwidth needed to schedule all the (elastic) jobs before their deadlines. They assume that the exact residual capacity is known, and that there are no errors in this prediction. In [1], the authors compute the maximum amount of data that can be pushed using the residual capacity. The BD jobs considered are elastic and have infinite deadlines, and the authors account for prediction inaccuracies by allowing a certain amount of slack to be deducted from the residual capacity. However, they do not have a mechanism for accurately predicting the optimal amount of slack bandwidth needed.

Given a fixed capacity function, a simple strategy to meet the deadlines is Earliest-Deadline-First (EDF), which is optimal assuming that all jobs are pre-emptable [9]. If the capacity is insufficient to meet the deadlines, then additional capacity can be added to the extent needed to meet the deadlines of existing jobs. There are three basic assumptions underlying these bandwidth-filling schemes, which are violated in data center environments handling BD jobs, thereby rendering such algorithms less useful.

1. Any job can consume all the bandwidth available: The common assumption among the bulk data scheduling papers is that bandwidth can be consumed elastically regardless of the number of jobs. As described in Section I, this is not true for a wide class of jobs.


2. All jobs can be preempted if necessary: Pre-empting a D2D data transfer job implies that server resources on the other end also have to be re-scheduled to avoid resource wastage, requiring a greater level of end-to-end cooperation with the job scheduler.

3. End hosts are free: Data centers operate over virtual machines, and each virtual machine instance has a cost. The VM instances should receive the right amount of bandwidth to avoid under-utilization.

Since the above three assumptions do not hold in the practical situations encountered in BD transfers, EDF is no longer optimal, the job-scheduling algorithms in [7] and [8] have poor approximation ratios of O(n), where n is the number of jobs to be scheduled, and the solutions in [1], [4] perform poorly. In this paper, we consider additional characterizations of BD jobs beyond the plain elastic nature assumed hitherto in related work, and we also provide robustness to measurement errors in the estimation of residual capacity.

Our algorithms rely on the notion of packing to accommodate these jobs. Bin packing is another extensively studied area of research in algorithms [10]; however, our problem is the equivalent of finding the minimum height of augmentation for variable-size bins, given variable-size balls that cannot be packed in certain bins (i.e., beyond the deadline). We are not aware of such a problem being studied in the bin-packing/knapsack literature. In [11], the problem of scheduling job transfers in a network to minimize completion times is studied, where the goal is to increase the maximum concurrent flow in the network, given fixed capacity constraints.

In [1], [12], based on the observation that the demand of specific geographic areas follows strong diurnal patterns, the authors focus on how to utilize the excess bandwidth for non-real-time bulk data transfers. Though in the same setting, we target a different problem: minimizing the overall peak (maximum link load).

In [13], [14], the authors studied the traffic characteristics within single data centers. In [13], the authors conducted a study of temporal and spatial variations in traffic volumes by analyzing SNMP data from data centers. They find that traffic at the data center edge can be characterized by ON-OFF patterns, where the ON and OFF periods and the packet inter-arrival times within an ON period follow log-normal distributions. The authors of [14] also explored traffic characteristics inside data centers. They show that the traffic changes quite quickly over time within a data center, with some spikes transient but others lasting for a while. Such transitions are also expected on the inter-DC link for customer traffic, which leads to rapid changes in the capacity available for BD jobs.

III. JOB MODEL

We present a general formulation of the types of jobs we expect to see in inter-data center transfers. A job j has a minimum rate B_j^min and a maximum rate B_j^max. In addition, each job also has a flex parameter, Δ_j, which specifies the maximum absolute variation of the rate of the job between successive epochs.

In other words, if r_j(t) is the rate of the job in epoch t, then the following hold:

  |r_j(t + 1) − r_j(t)| ≤ Δ_j
  B_j^min ≤ r_j(t) ≤ B_j^max

The above formulation captures a wide-ranging set of jobs. CBR traffic is modeled by B_j^min = B_j^max and Δ_j = 0. A truly elastic job can be specified by B_j^min = 0, B_j^max = ∞ and Δ_j = ∞. Jobs using TCP are limited by the flow/congestion windows and the RTT, and would prefer that the variation in rates be gradual. For such jobs, we could set the min and max bandwidths to reflect the flow and congestion window requirements, and then set their flex parameter to not more than, say, 10% of their maximum bandwidth. Some jobs may not want to be pre-empted at all, and for those jobs, we can select a minimum bandwidth B_j^min > 0. Thus, we capture nearly all types of practical flow requirements in this model, which is richer than the purely elastic traffic models considered in the literature until now.

A. Capacity Minimization Problem

Given this formulation, we have to determine the minimum bandwidth required to ensure completion of all the jobs before their deadlines. Our goal is to keep the overall capacity purchased by the CSP as low as possible. In practice, the purchased capacity is measured by one of several metrics, e.g., 95th-percentile or max-capacity based billing. The difference between the 95th-percentile and max-capacity billing models is visible only when there are significant traffic variations between the peak and the 95th percentile. On an inter-DC link, where BD transfers tend to smooth out such variations, one can expect very little difference between these two billing models.

Any flow that is not scheduled by the provider is a Real Time (RT) job; since their deadlines are unknown, we have to serve them right away. Let the total bandwidth consumed by these RT jobs in any epoch t be Z(t). From this value, we have the residual capacity estimate x(t) = C(t) − Z(t), where C(t) is the link capacity purchased by the provider for this epoch. Our goal is to keep this purchased capacity, max_t C(t), as low as possible. Note that we can model other billing schemes by replacing the above objective (max capacity) with one that reflects the billing scheme.

Define the set of jobs to be J. Let each job j have a size of s_j bytes, an arrival time of a_j and a deadline of d_j. Define F_j = a_j + d_j and T = max_j F_j over all jobs j. Let each job j be assigned a rate of r_j(t) during time epoch t. If we assume an oracle that knows future arrivals in advance, then we can compute an optimal schedule that minimizes capacity costs. However, we can show that even modeling such an oracle requires the use of a mixed-integer linear program (ILP), and is NP-hard to solve². Therefore, we resort to using a heuristic algorithm, described in the next section.

²An ILP is necessary since there are jobs with B_j^min > 0.
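To make the model concrete, the sketch below (ours; the field and function names are not from the paper) encodes a job and checks a proposed per-epoch rate sequence against the constraints above:

from dataclasses import dataclass

@dataclass
class Job:
    size: float      # s_j, total bytes to transfer
    arrival: int     # a_j, arrival epoch
    deadline: int    # d_j, epochs allowed after arrival; F_j = arrival + deadline
    b_min: float     # B_j^min, minimum rate
    b_max: float     # B_j^max, maximum rate
    flex: float      # Delta_j, max rate change between successive epochs

def rates_feasible(job, rates, epoch_len=1.0):
    # A rate sequence (one entry per epoch, starting at arrival) is feasible
    # if it stays within [B_min, B_max], changes by at most flex per epoch,
    # finishes the job, and does not run past the deadline F_j.
    if len(rates) > job.deadline:
        return False
    for t, r in enumerate(rates):
        if not job.b_min <= r <= job.b_max:
            return False
        if t > 0 and abs(r - rates[t - 1]) > job.flex:
            return False
    return sum(rates) * epoch_len >= job.size

# CBR job: b_min == b_max and flex == 0.
# Fully elastic job: b_min = 0, b_max = flex = float("inf").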


IV. GRESE: GREEDY RESIDUAL SCHEDULER

We use the fact that the CSP knows the capacity that it is paying for, either on a max-capacity or on a 95th-percentile basis, in each completed epoch. The CSP also knows the BD jobs scheduled in past measurement epochs and the bandwidth they consumed. Therefore, it is able to determine the actual bandwidth consumed by the RT flows, which are mostly client-driven, and thus determine the residual capacity.

We have to handle four different design constraints in developing an algorithm for the scheduler: (a) jobs have min and max rate guarantees with individual flex values, (b) only some (or no) jobs may be preempted at any time, (c) there is potential error in relying on predicted bandwidth estimates, and (d) BD job arrivals in the future are unpredictable. In order to address these constraints, we use the following guidelines:

• Among the set of outstanding jobs, schedule non-preemptable jobs first. A non-preemptable (NP) job j is one which cannot be given zero bandwidth in the current epoch, i.e., r_j(t − 1) − Δ_j > 0. The motivation for this is clear: since these jobs are the most constrained, it is better to schedule them at the earliest opportunity for maximum flexibility.

• Schedule all pending jobs anew in every time epoch. We (logically) preempt all running jobs at the end of every epoch, and re-schedule them as if they had just arrived. We give special attention to the non-preemptable jobs by always scheduling them once they have been started, regardless of the available bandwidth.

• Pack all jobs as early as possible. Since a BD job can arrive at any given time in the future, we can only assume the worst, and try to use as much bandwidth as we can as early as possible, within the bounds of the currently available residual capacity. This points to a greedy scheduling strategy.

• Measure prediction errors to allow some slack. If the predicted residual values are not accurate and are in fact lower, then the peak bandwidth consumed by RT and BD jobs will increase. This trend, when repeated over several epochs (not necessarily successive ones), can cause an unbounded increase in the peak bandwidth that the CSP pays for. Therefore, we need some slack bandwidth to account for such prediction errors. Clearly, we need just the right amount of slack, since a large slack will lead to bandwidth wastage. We use prediction error measurements to determine the optimal value of slack.

The GRESE algorithm, shown in Function 1, uses these guidelines to schedule jobs, taking into account prediction errors from measurements. In each epoch, the algorithm takes newly arrived jobs and adds them to a queue of waiting jobs. Non-preemptable jobs that are already running are immediately scheduled. Then it verifies that all other waiting and running jobs can be scheduled in the residual capacity that is available according to the current prediction (lines 10-12).

Function 1 GRESE(initLinkCapacity)
1: currentLinkCapacity = initLinkCapacity
2: linkIncrement = 1% of initLinkCapacity
3: waitingJobList = runningJobList = errorList = {} {empty}
4: for each time epoch t do
5:   for each newly arrived job j do
6:     compute a deadline for j
7:     waitingJobList(t) += j
8:     rate in current epoch, r_j(t) = 0
9:   end for
10:  while ScheduleJobs(0) == false do
11:    currentLinkCapacity += linkIncrement
12:  end while
13:  ScheduleJobs(1)
14:  for each job j in runningJobList(t) do
15:    currentLinkCapacity -= allocatedCapacity
16:    j.size -= allocatedCapacity
17:    if j.size == 0 then
18:      runningJobList -= j
19:    end if
20:  end for
21:  SlackEstimator()
22: end for

Then it schedules the jobs, reduces the sizes of the jobs that just ran for an epoch, and updates the prediction error and the slack for the next epoch.

Function 2 ScheduleJobs(flag)
1: schedList = jobs with r_j(t − 1) > Δ_j, sorted in increasing order of their flex, Δ_j; ties are broken by increasing order of deadline, and further ties by decreasing order of remaining job size
2: otherJobList = {remaining jobs in the system, sorted in increasing order of their deadline; ties are broken in increasing order of flex}
3: schedList ← append otherJobList to end of schedList
4: x = current residual capacity
5: for each job j in schedList do
6:   r_j(t) = FindMinBWneeded(j)
7:   if (r_j(t) == −1) OR (r_j(t) > x) then
8:     return false
9:   else
10:    allocate rate r_j(t) to job j in this epoch
11:    x = x − r_j(t)
12:    if flag == 1 then
13:      runningJobList(t) += j
14:    end if
15:  end if
16: end for
17: if flag == 1 then
18:  AllocateExcessCapacity(x)
19: end if
20: return true

The ScheduleJobs function (Function 2) checks whether the residual capacity is sufficient for all the current jobs. It iterates through all the jobs, starting with the NP jobs, first in increasing order of their flex, then in increasing order of their deadline, and then in decreasing order of remaining size. For each job scheduled, we check whether the job remains feasible given the current estimates of capacity (Function 3).
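For concreteness, this three-level ordering can be written as two sort keys; the sketch below assumes each job also carries r_prev, its allocated rate in the previous epoch (a field we add for illustration):

def schedule_order(jobs):
    # Priority order from Function 2: non-preemptable jobs first (those
    # whose previous rate cannot be ramped down to zero in one epoch).
    non_preemptable = sorted(
        (j for j in jobs if j.r_prev > j.flex),
        key=lambda j: (j.flex, j.deadline, -j.size))
    preemptable = sorted(
        (j for j in jobs if j.r_prev <= j.flex),
        key=lambda j: (j.deadline, j.flex))
    return non_preemptable + preemptable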


If the capacity is insufficient, we increment the capacity C(t). Note that C(t) is a non-decreasing function: once we purchase this capacity, we might as well use it until the billing period is complete. ScheduleJobs returns true when there is sufficient link capacity to finish all the jobs before their deadlines, and false otherwise. Any excess residual capacity is then allocated in Function 4, favoring jobs that finish early and jobs with large flex first.

Function 3 FindMinBWneeded(Job j)
1: r_j(t − 1) = bandwidth assigned to j in previous epoch
2: F_j = time by which job j should be complete
3: s′_j = remaining size of job j
4: t = current time
5: B_j(t) = max(B_j^min, r_j(t − 1) − Δ_j)
6: δ_j = B_j^min − B_j(t)
7: while δ_j ≤ Δ_j do
8:   // Schedulability check for job j
9:   if Σ_{i=t..F_j} min(B_j^max, r_j(t − 1) + δ_j + Δ_j·(F_j − i)) < s′_j then
10:    // Job j cannot complete by its deadline
11:    δ_j = δ_j + ε
12:  end if
13: end while
14: if δ_j > Δ_j then
15:  return −1
16: else
17:  return (r_j(t − 1) + δ_j)
18: end if
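The schedulability test in Function 3 is dense, and its loop exit on success is left implicit. The sketch below gives one possible reading in Python (job fields as in the Section III sketch, with job.size taken as the remaining size s′_j; the ε-step search and the explicit returns are ours):

def find_min_bw_needed(job, t, eps=0.01):
    # Search for the smallest rate, reachable from r_prev within the flex
    # bound, from which the job can still finish by F_j in the best case
    # (ramping up by flex every epoch, capped at b_max). Returns -1 if no
    # such rate exists.
    f_j = job.arrival + job.deadline
    rate = max(job.b_min, job.r_prev - job.flex)   # lowest reachable rate
    while rate - job.r_prev <= job.flex:
        best_case = sum(min(job.b_max, rate + job.flex * k)
                        for k in range(f_j - t + 1))
        if best_case >= job.size:
            return rate
        rate += eps
    return -1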

Function 4 AllocateExcessCapacity(Capacity x)
1: sortedJobList = {jobs in runningJobList sorted in increasing order of deadline; ties are broken by decreasing order of flex}
2: τ = length of epoch
3: for each job j in sortedJobList do
4:   if x > 0 then
5:     r_j(t) = bandwidth assigned to j in current epoch
6:     s′_j = remaining size of job j
7:     if τ·r_j(t) >= s′_j then
8:       e_j = min(r_j(t) + Δ_j, s′_j/τ, x)
9:       r_j(t) += e_j
10:      x = x − e_j
11:    end if
12:  end if
13: end for

The SlackEstimator function (Function 5) computes the prediction error, adds it to the list of prediction errors from the last 30 days, and picks a slack equal to the 95th percentile of the error. It then updates the prediction for the next epoch using a running average of the errors, biased heavily towards the error in the current epoch. We show in the evaluation that this slack and error estimation significantly improves our GRESE algorithm's performance.

Function 5 SlackEstimator()
1: error = actualCapacityOfRTJobs − predictedCapacityOfRTJobs
2: errorList += error
3: use error to update currentLinkCapacity for next epoch
4: slack = 95th percentile of errorList
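A compact rendering of this estimator follows; the weight alpha and the exact form of the prediction update are our assumptions, since the paper only states that the running average is biased heavily towards the most recent error:

def slack_estimator(error_list, actual_rt, predicted_rt,
                    smoothed_error, alpha=0.8):
    # Record this epoch's prediction error, update the biased running
    # average that feeds the next epoch's RT-traffic prediction, and set
    # slack to the 95th percentile of the recorded errors.
    error = actual_rt - predicted_rt
    error_list.append(error)            # caller keeps the last 30 days
    smoothed_error = alpha * error + (1 - alpha) * smoothed_error
    ranked = sorted(error_list)
    slack = ranked[int(0.95 * (len(ranked) - 1))]
    return slack, smoothed_error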

V. UNDERSTANDING AND USING THE CDN DATA SET

To understand the performance of GRESE, we used data sets from a commercially deployed live CDN in a European ISP. This CDN is specifically designed to serve video content, and hence serves huge amounts of data to users. We collected client-initiated flow statistics from one site of this CDN for 14 days. Over the 14-day period, we monitored about 7.7 million flows, and recorded the data transferred by each flow, its arrival time, and the average rate of data transfer it experienced. Overall, these flows transferred over 208 TB of data during this time.

While this CDN data is not exactly the same as traffic between data centers, we believe it is representative for our evaluation purposes. For instance, if the CDN node we monitored did not have the content requested by a user cached, it would fetch that data from a different CDN node before serving the user. As a result, if we were to treat this CDN node as a data center, the flows between different CDN nodes would be the inter-data center traffic. In addition, as we show next, the characteristics of the flows in this trace are very similar to the characteristics of inter-data center traffic, as shown in a recent paper [3].

Figure 2 shows the overall utilization due to the traffic going out of the CDN node towards the users. Clearly, there is a diurnal pattern in the link utilization: the utilization during the nights is very low compared to the utilization in the mornings. Our trace starts at the beginning of a Thursday, and as the graph shows, link utilization during the weekends is especially low, while the traffic has a much higher peak on Monday morning compared to the other weekdays. In this trace, the highest peak utilization is about 3.4 Gbps.

A. Real-time and Bulk Data jobs

For our evaluation we need to identify jobs that can be treated as real-time (RT) jobs and bulk-data (BD) jobs (the latter can be delayed). We first plotted the sizes of the flows (the total amount of data transferred by each flow) in our data set. As the top plot in Figure 3 shows, a majority of the flows are small: about 85% of the flows transferred less than 10 MB of data, and only about 10% transferred 50 MB or more. Based on this, we decided to divide the flows into real-time and bulk-data flows based on the amount of data they transfer (and hence how long they last). We considered a flow that lasted longer than 30 minutes with a data transfer rate of more than 137 kbps, which translates to approximately 30 MB of data, to be a bulk-data flow; the rest are real-time flows. This divided the trace into 9.75% BD flows and the rest RT flows, and also led to roughly equal amounts of data being transferred by RT and BD jobs over the 14-day period. As the bottom plot in Figure 3 shows, more than 80% of the BD flows transfer more than 50 MB. This is strikingly similar to the traffic seen between data centers: flows are either short or long.
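Stated as code, the classification rule used in our analysis is a one-liner (function and argument names are ours):

def classify_flow(duration_s, rate_kbps):
    # Flows longer than 30 minutes at more than 137 kbps, i.e. roughly
    # 30 MB transferred, count as bulk data; the rest are real time.
    if duration_s > 30 * 60 and rate_kbps > 137:
        return "BD"
    return "RT"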


Fig. 2. The traffic pattern of the flows in our 14-day CDN trace.


Fig. 3. The distribution of the amount of data transferred by the flows (flowSize) in our trace.

This is the scenario where GRESE can be very helpful: moving the larger flows to off-peak hours and treating the smaller flows as real-time flows.

Next, we looked at the length of time the flows were alive (flowSize/flowRate). Figure 4 shows the distribution of this flow length. The top curve in the figure shows the flow length for all the jobs in the data set. Clearly, the smaller jobs, which constitute 90% of the jobs, last much shorter than the BD flows (shown in the bottom curve).

B. Synthetic trace

We observed some peculiar characteristics in our trace when we plotted the lengths of the jobs. It turned out that most of the BD jobs have almost the same rate (140 kbps), while a few had a very low rate (10 kbps), which skewed their alive time significantly. Normally, we expect data center traffic to request a higher flow rate. As a result, we generated a synthetic trace that is exactly the same as the real trace except for the flow rate: each rate is chosen uniformly at random from (50, 500) kbps. This leads to a new plot, in the center of Figure 4, that shows the lengths of the jobs in the synthetic trace. This trace has a more uniform distribution of request rates and hence is well suited to evaluating our algorithms. We use both the real and the synthetic traces in our evaluation.

VI. EVALUATION

Given the vast variety of jobs that fit within our model, for evaluation purposes we consider two extreme types of jobs that are most common. The first type, referred to as a constant continuous (CC) job, has a constant bandwidth requirement, i.e., once started, it cannot be stopped; essentially, Δ_j = 0. The second type is the elastic job described earlier, with B_j^min = 0 and B_j^max = Δ_j = ∞.
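In terms of the job-model sketch from Section III, the two extremes instantiate as follows (all numeric values are placeholders, not trace statistics):

from math import inf

cc_job = Job(size=100.0, arrival=0, deadline=72,
             b_min=1.0, b_max=1.0, flex=0.0)       # constant rate, never stopped
elastic_job = Job(size=100.0, arrival=0, deadline=72,
                  b_min=0.0, b_max=inf, flex=inf)  # fully malleable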


Fig. 4. The distribution of the time required to finish the jobs (joblength) in our trace. Joblength for a flow is joblen = flowSize/flowRate.

Next we set out to compare the performance of GRESE with that of other algorithms. The first and default choice is the basic First-Come-First-Served (FCFS) algorithm. Other potential choices are the max-stretch algorithm in [7], the BDTS algorithm [4] and Earliest-Deadline-First. However, these algorithms assume only elastic jobs and are therefore unsuited to a mix of CC and elastic bulk-data jobs. Therefore, we considered a variant of GRESE for comparison: one with perfect knowledge of future arrivals, which therefore knows how much residual capacity is available at all times in the future. We call this algorithm the Oracle. Since the main problem for a scheduler operating under a peak-pricing model is imperfect knowledge of available capacity, we expect the Oracle to perform better than GRESE.

In evaluating the performance of GRESE and Oracle on our trace data, the key questions we consider are: (a) How does the peak bandwidth vary for the GRESE and Oracle schemes as the deadline increases? (Figures 5 and 6) (b) What happens when the percentage of CC jobs in the input set varies? (Figure 7) (c) How does the link capacity increase in GRESE compared to Oracle? (Figure 8) (d) How quickly do jobs finish compared to their deadlines? (Figure 9) (e) How important is it to choose the right value for the slack? (Figure 10). In this evaluation, unless explicitly specified otherwise, the ratio of CC to elastic jobs is 1:1.

A. Different Deadlines

Figure 5 compares the peak capacity used by Oracle and GRESE on our synthetic trace, as a function of the deadline assigned to the jobs. In this experiment all jobs have the same deadline, measured from their arrival time. Note that if we employ an FCFS strategy, the bandwidth used is 3.4 Gbps for the synthetic trace regardless of the deadline. As we can see, the performance of GRESE very closely follows that of Oracle. For a deadline of 6 hours, the peak is reduced to 2.71 Gbps by GRESE and to 2.66 Gbps by Oracle. This is already a 20% saving on the peak bandwidth of 3.4 Gbps. Larger deadlines show diminishing returns, as expected. For a deadline of 7 days (not shown in the figures), the observed peak is the maximum bandwidth consumed by the RT jobs (1.8 Gbps), i.e., all the BD jobs are scheduled during the off-peak hours and the residual capacity never has to increase at all in order to accommodate the BD jobs. Figure 6 shows the results for runs on the real CDN trace, and again, GRESE performs almost as well as Oracle. In all the experiments above, the percentages of CC and elastic jobs were set to 50% each.

In all of our tests, we measured both the peak capacity and the 95th-percentile capacity of the algorithms. We found the performance margins between FCFS and GRESE to be similar for both metrics. For simplicity, we only show the peak capacity here.


Fig. 5. The impact of larger flow deadlines on the peak link bandwidth, for different schedulers. The trace used here is the synthetic trace.


Fig. 6. The impact of larger flow deadlines on the peak link bandwidth, for different schedulers under the real CDN trace.

B. Mix of CC and Elastic jobs

In order to understand the impact of CC jobs on the peak bandwidth, we next varied the percentage of CC jobs in the synthetic trace and measured the peak capacity consumed by GRESE. Figure 7 shows the results as the percentage of CC jobs varies from 5% to 95%; the rest of the jobs in these tests were elastic. Clearly, more CC jobs lead to a higher peak, as they cannot be preempted once started. In fact, when the fraction of CC jobs is 95%, the peak is higher than the peak in the real trace (3.4 Gbps). This is because GRESE might postpone scheduling some CC jobs to a later time, and a sudden burst of CC job arrivals later, combined with the fact that the deadlines of the earlier jobs still have to be met, forces our algorithm to increase the peak capacity. However, GRESE clearly does well as long as CC jobs make up less than 75% of the BD jobs in the trace, a highly likely scenario in the real world.

C. Quality of the Schedule

Another aspect we aim to evaluate is the quality of the schedule obtained by GRESE. The motivation is as follows. A job can be scheduled any time before its deadline. However, it is very beneficial to have jobs complete as early as possible, in order to keep the links available for any future unexpected traffic spikes. Therefore, we evaluate two distinct aspects of GRESE: (a) how soon does GRESE raise its capacity requirements, and (b) how early do jobs finish before their deadlines? One can see that the answers to these two questions are intertwined.

We use the synthetic trace to plot the maximum link capacity that GRESE requires at every epoch, and compare it against that of the Oracle scheme. We show this in Figure 8, where we can see that GRESE increases its capacity early on to match that of the Oracle. The difference between the two schemes is that GRESE requests additional capacity at around 150 epochs, due to a sudden dip and spike in the rate of RT jobs (seen around 700 seconds in the trace in Figure 2). This measurement error exceeds the slack capacity reserved for such spikes, causing GRESE to request additional capacity, unlike the Oracle, which does not encounter measurement errors. Beyond this one-time event, GRESE follows the behavior of Oracle in the other epochs.

When we consider how soon BD jobs complete, we have to distinguish between CC and elastic jobs: CC jobs request a constant rate, which makes them complete much later, while elastic jobs can soak up excess capacity, thereby potentially finishing faster.


Fig. 7. The variation in peak link load for different % of CC jobs under synthetic trace.

For each BD job, we compute how much time is left between its completion and its actual deadline, and plot the CDF over all jobs in Figure 9. We show the values for CC and elastic jobs separately, and for two different deadlines, as seen in the figure. We see that more than half of the elastic jobs finish very early: nearly 80% of the elastic jobs finish well ahead of time, and only a very small fraction gets stretched out longer. For CC jobs, we see a similar but less abrupt pattern. Most of these jobs are scheduled early on, with less than 20% of the jobs stretching out to near the end of their deadlines. Note that all of the jobs complete with more than 20% of the time remaining before their deadlines.

D. Accuracy of Slack Capacity Estimate

In the scheduling problem we consider, we have to reserve slack capacity to deal with unexpected traffic spikes. Predicting the amount of slack needed is vital to keeping the bandwidth consumption low. As described in Section IV, we use the prediction error measurements from past epochs, and select the 95th percentile of the error as the slack capacity. We now want to see how sensitive the performance of the algorithm is to this choice. Instead of the 95th percentile, we vary the threshold from the 90th to the 96th percentile and show the resulting peak bandwidth consumption in Figure 10. The difference in results is quite drastic: any smaller value of the threshold leads to a huge increase in the peak bandwidth consumed for data transfer. This points to the near-optimality of the slack capacity estimation strategy that we use. This method is applicable even to the standard water-filling algorithms that schedule only elastic jobs. Without the slack capacity estimation proposed in this paper, the algorithms proposed in [1], [4], [7] will run up the peak capacity to very high values.

E. Deadline assignment schemes

We also explored the best way to assign deadlines to bulk-data jobs. This is a useful problem to consider, since deadline assignment has traditionally not been studied and is assumed to happen somehow. In Table I we compare the performance of four different assignment schemes.


Fig. 8. Comparing the increment in the link capacity of different schedulers under synthetic trace.


Fig. 9. CDF of the time left between the time the job is finished and the job’s deadline for different flows under synthetic trace and GRESE scheduler.

The four schemes are: (a) randomly assigning deadlines to arriving jobs from a fixed set of {6, 12, 24} hours; (b) assigning deadlines by job size (jobs smaller than 150 MB have 6 hours to complete, jobs smaller than 250 MB have 12 hours, and larger jobs have 24 hours); (c) assigning deadlines based on the arrival time (jobs arriving between 6am and 2pm get 24 hours, 2pm to 8pm gets 12 hours, and 8pm to 6am gets 6 hours), so that each job gets sufficient time to finish in the off-peak hours; and (d) assigning by the type of the job (all CC jobs get 6 hours and elastic jobs get 24 hours).

The random deadline-assignment scheme outperforms all of the other schemes under both GRESE and Oracle, but the schemes based on time of day and flow size perform close to random. Fixing deadlines based on flow type is clearly not a good scheme, as it limits the flexibility available to the scheduler. Thus, we find that there are three viable deadline-assignment schemes available to the CSP for scheduling bulk-data jobs over inter-DC links.
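For concreteness, the four rules can be written as a single function (argument names are ours):

import random

def assign_deadline_hours(size_mb, arrival_hour, kind, scheme):
    # The four deadline-assignment rules compared in Table I; kind is
    # "CC" or "elastic".
    if scheme == "random":
        return random.choice([6, 12, 24])
    if scheme == "size":
        return 6 if size_mb < 150 else 12 if size_mb < 250 else 24
    if scheme == "arrival":
        return 24 if 6 <= arrival_hour < 14 else \
               12 if 14 <= arrival_hour < 20 else 6
    if scheme == "type":
        return 6 if kind == "CC" else 24
    raise ValueError("unknown scheme: " + scheme)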


TABLE I
THE PERFORMANCE OF GRESE AND ORACLE UNDER DIFFERENT DEADLINE ASSIGNMENT SCHEMES (PEAK TRAFFIC, IN GBPS).

Scheme                GRESE   Oracle
Random                2.50    2.52
By size of the flow   2.55    2.55
By time of arrival    2.57    2.57
By type of the flow   3.47    2.86

F. Cost savings to the CSPs

The peak link utilization in the live CDN we used for our measurements is 3.4 Gbps, and our mechanisms save nearly 30% of it. This saving of about 1 Gbps translates to $30K to $90K per month for this CDN provider, according to recent prices [1]. A cloud provider usually has several such high-capacity links and hence can easily realize millions of dollars in savings per year with our mechanisms.

VII. CONCLUSION

The use of cloud service providers has increased significantly in the last few years. Cloud service providers often have multiple data centers, interconnected with each other, that they use to replicate data for backup, serve nearby users and process other types of bulk data. With more data stored in the cloud, and due to the frequent use of geographic replication, more and more data is transferred over inter-data center links, driving up their 95th-percentile costs.


Fig. 10. The impact of choosing a lower slack on the peak load, under the synthetic trace and the GRESE scheduler, for a 6-hour deadline.

In this paper we proposed a scheduling algorithm to reduce the peak (and 95th-percentile) bandwidth consumed on these inter-data center links by scheduling bulk data flows (flows that can be delayed) during off-peak hours, and using the peak-hour bandwidth mainly for serving real-time or critical data. The proposed algorithm reduces billable bandwidth usage significantly: we show that delaying bulk data flows by even 8 hours can lead to a significant 30% reduction in the cost of the links. Given that multiple high-capacity links run out of each data center, the cumulative cost reductions are very high.

REFERENCES

[1] N. Laoutaris, G. Smaragdakis, P. Rodriguez, and R. Sundaram, "Delay tolerant bulk data transfers on the internet," in Proc. of ACM SIGMETRICS, 2009.
[2] Wikipedia, "Burstable billing," http://en.wikipedia.org/wiki/Burstable_billing.
[3] Y. Chen, S. Jain, V. Adhikari, Z. Zhang, and K. Xu, "A first look at inter-data center traffic characteristics via Yahoo! datasets," in Proc. of IEEE INFOCOM, 2011.
[4] B. B. Chen and P. V.-B. Primet, "Scheduling deadline-constrained bulk data transfers to minimize network congestion," in Proc. of IEEE CCGRID, 2007.
[5] M. Mathis, J. Semke, J. Mahdavi, and T. Ott, "The macroscopic behavior of the TCP congestion avoidance algorithm," ACM Computer Communication Review, vol. 27, July 1997.
[6] GigaOm, "Netflix moves into the cloud with Amazon Web Services," http://gigaom.com/video/netflix-moves-into-the-cloud-with-amazonweb-services/.
[7] M. A. Bender, S. Chakrabarti, and S. Muthukrishnan, "Flow and stretch metrics for scheduling continuous job streams," in Proc. of ACM SODA, 1998.
[8] A. Goel, M. R. Henzinger, S. Plotkin, and E. Tardos, "Scheduling data transfers in a network and the set scheduling problem," J. Algorithms, vol. 48, pp. 314–332, September 2003.
[9] S. Andrei, A. Cheng, M. Rinard, and L. Osborne, "Optimal scheduling of urgent preemptive tasks," in Proc. of IEEE RTCSA, 2010.
[10] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, "Approximation algorithms for bin packing: a survey," in Approximation Algorithms for NP-Hard Problems. Boston, MA, USA: PWS Publishing Co., 1997, pp. 46–93.
[11] F. Bouabache, T. Herault, S. Peyronnet, and F. Cappello, "Planning large data transfers in institutional grids," in Proc. of IEEE/ACM CCGrid, 2010.
[12] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, "Inter-datacenter bulk transfers with NetStitcher," in Proc. of ACM SIGCOMM, 2011.
[13] T. Benson, A. Anand, A. Akella, and M. Zhang, "Understanding data center traffic characteristics," in Proc. of WREN, 2009.
[14] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, "The nature of data center traffic: measurements & analysis," in Proc. of IMC, 2009.
