Mitigating Power Contention: A Scheduling Based Approach

Hiroshi Sasaki, Columbia University ([email protected])
Alper Buyuktosunoglu, Augusto Vega, Pradip Bose, IBM T. J. Watson Research Center ({alperb,ajvega,pbose}@us.ibm.com)

Abstract—Shared resource contention has been a major performance issue for CMPs. In this paper, we tackle the power contention problem in power-constrained CMPs by treating power as a first-class shared resource. Power contention occurs when multiple processes compete for power, and it leads to degraded system performance. To solve this problem, we develop a shared resource contention-aware scheduling algorithm that mitigates contention for power and for the shared memory subsystem at the same time. The proposed scheduler improves system performance by balancing shared resource usage among scheduling groups. Evaluation results across a variety of multiprogrammed workloads show performance improvements over a state-of-the-art scheduling policy that only considers memory subsystem contention.

Index Terms—Power contention, process scheduling, power capping, energy-efficient systems, multi-core processors

1 INTRODUCTION

Shared resource contention is known as one of the major performance issues in chip multiprocessors (CMPs). Since performance degrades when multiple processes compete for the shared hardware resources of CMPs, various attempts have been made in both hardware [5], [11] and software [10], [12], [13] to mitigate its performance impact. In this paper, we argue that power should also be treated as a first-class shared resource for CMPs with the emergence of power capping mechanisms [1]. Power capping is realized by the power management system via power-performance knobs such as DVFS (dynamic voltage and frequency scaling). Therefore, executing a program (or programs) with relatively high power consumption on power-capped CMPs can cause "power contention", which leads the power management system to throttle the processor and results in degraded system performance. We enhance the power management system in the context of scheduling: processes with mutually exclusive shared resource (including power) needs are co-scheduled such that the shared resource utilization is balanced; in this way we mitigate contention and hence improve system performance. A recent study also takes power into account to provide QoS for latency-sensitive services in datacenters [6]. The major difference in how we deal with power is that we leverage scheduling to address both CPU and DRAM power contention, whereas the related work addresses only CPU power using DVFS.

Fig. 1 shows the performance when 6 copies of sjeng (from SPEC CPU2006 [4]) and 6 copies of the CPU power-hungry microbenchmark CPU-hog are executed on a 6-core Intel Sandy Bridge processor. The unbalanced scheduler executes the 6 sjeng copies after the 6 CPU-hog copies, whereas the balanced scheduler pairs 3 sjeng with 3 CPU-hog and executes the two groups one after the other. The figure highlights two important points: (1) power contention noticeably degrades performance, and (2) power contention-aware scheduling has potential. The left figure shows the performance without a power cap. Since both sjeng and CPU-hog are CPU-intensive, there is no contention at the memory subsystem (or "cache-memory contention") regardless of the scheduling policy.

Fig. 1. Performance without a power cap (left) and with a CPU power cap (right), for the unbalanced and balanced schedules of sjeng and CPU-hog (bars: sjeng, CPU-hog, Hmean). Performance is normalized to its solo run.

Therefore, the performance of the two policies is the same. When a certain CPU power cap is applied, the scheduling policy matters. The performance of CPU-hog degrades by 19.9% due to power contention under the unbalanced policy, whereas that of sjeng remains constant, because CPU-hog consumes more power than sjeng. The balanced policy succeeds in mitigating the power contention: both sjeng and CPU-hog show a slowdown of only 1.0%, which results in a 10.6% total performance improvement. This simple experiment shows the need for power contention-aware scheduling, which we further explore in this paper.

In the rest of the paper, we develop and evaluate a scheduling algorithm that mitigates the performance impact of resource contention by balancing the utilization of the shared resources. The key idea of the proposed approach is to treat both CPU and DRAM power consumption as first-class shared resources, just like the memory subsystem.

2 IMPACT OF POWER CONTENTION

This section explores the performance of various multiprogrammed workloads in order to understand the performance impact of power contention.

2.1 Experimental Setup

Platform: We perform experiments on an Intel Xeon E5-2620 Sandy Bridge 2.0 GHz 6-core CMP with 16 GB of main memory.

Fig. 2. Performance of CPU-intensive benchmarks ((a) sjeng, (b) calculix, (c) gromacs, (d) wrf, (e) tonto, (f) leslie3d) with different number of instances and CPU power caps (No, CL0-CL3). x-axis: # of instances; y-axis: performance normalized to the solo run with no power cap.

Fig. 3. Performance of memory-intensive benchmarks ((a) perlbench, (b) bwaves, (c) gcc, (d) GemsFDTD, (e) libquantum, (f) lbm) with different number of instances and CPU power caps (No, CL0-CL3). x-axis: # of instances; y-axis: performance normalized to the solo run with no power cap.

Fig. 4. Performance of memory-intensive benchmarks ((a) perlbench, (b) bwaves, (c) gcc, (d) GemsFDTD, (e) libquantum, (f) lbm) with different number of instances and DRAM power caps (No, DL0-DL3). x-axis: # of instances; y-axis: performance normalized to the solo run with no power cap.

Simultaneous multithreading (SMT; also known as Hyper-Threading) and dynamic overclocking (Turbo Boost) are disabled. The Linux governor is set to 'performance' to avoid non-deterministic performance variance.

Workloads: We use the SPEC CPU2006 benchmark suite with reference inputs for our study. The number of co-scheduled copies of the same benchmark is varied from 1 to 6 and the average performance results are reported, unless otherwise specified. In order to explore the performance impact due to power contention at the CPU or DRAM, we apply different levels of power caps to either domain.

2.2 Results

Figures 2, 3 and 4 show the performance characteristics due to power contention (and also, inevitably, cache-memory contention) for some SPEC benchmarks (not all of them are shown due to limited space). Results for CPU-intensive benchmarks while varying the CPU power cap are presented in Fig. 2, and those for memory-intensive benchmarks while varying the CPU power cap and the DRAM power cap are shown in Fig. 3 and Fig. 4, respectively. The x-axis of the figures shows the number of instances and the y-axis shows the performance relative to the maximum available performance (i.e., the solo run). Each line in a figure connects the performance results obtained with the same power cap. The legend shows the power cap levels, where 'No' means no power cap and 'CL0' (CPU power cap Level-0) through 'CL3' represent different CPU power caps. The power cap is tightened by an additional 12.5% each time the level number is incremented. The legend of Fig. 4 (D stands for DRAM) denotes the DRAM power cap, defined in a similar manner to the CPU power caps.

The benchmarks in Fig. 2 show similar trends to each other: the performance stays constant as the number of copies increases when there is no power cap, because there is (almost) no cache-memory contention. Performance degrades with tighter power caps, as expected. However, we can see different degrees of performance degradation because the power consumption of each program is different. For CPU power contention (Fig. 3), we see up to almost 70% performance degradation under tight constraints for programs with higher cache-memory contention. When we focus on Fig. 4, we can see different trends from Fig. 2 and Fig. 3. Some benchmarks (perlbench and gcc) show almost no or only modest performance degradation with tighter DRAM power caps, whereas others (bwaves, GemsFDTD, libquantum and lbm) show significant performance degradation, as we also saw in the CPU power contention scenario. The reason is that even though a benchmark is memory-intensive from a performance perspective, its DRAM power contention is not necessarily severe.

We conclude that CPU and DRAM power contention can be major performance bottlenecks when reasonable power caps are applied to the system. In addition, we see that the performance impact due to power contention is hard to predict, especially when cache-memory contention and power contention occur simultaneously, which can happen in a realistic scenario where the system needs to schedule a mix of different workloads. This motivates us to develop a scheduling algorithm that mitigates both cache-memory contention and power contention.
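The paper applies the CPU and DRAM caps through the platform's power capping support (presumably the RAPL power-limiting machinery [1]) but does not show how the caps are programmed. As a point of reference only, the sketch below illustrates how such caps can be set on a recent Linux kernel through the powercap (intel-rapl) sysfs interface; the domain names, the baseline wattages and the interpretation of the 12.5% steps are assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the paper's tooling): programming CPU (package)
# and DRAM power caps through the Linux powercap/intel-rapl sysfs interface.
# Baseline wattages and the meaning of CL0-CL3/DL0-DL3 are assumptions.
import glob
import os

POWERCAP_ROOT = "/sys/class/powercap"

def find_rapl_domain(name):
    """Return the sysfs directory of the RAPL domain whose 'name' file matches
    the given string (e.g. 'package-0' for the CPU, 'dram' for memory)."""
    for path in glob.glob(os.path.join(POWERCAP_ROOT, "intel-rapl:*")):
        try:
            with open(os.path.join(path, "name")) as f:
                if f.read().strip() == name:
                    return path
        except OSError:
            continue
    return None

def set_power_cap(domain_path, watts):
    """Write constraint 0 (the long-term limit) of a RAPL domain, in watts."""
    with open(os.path.join(domain_path, "constraint_0_power_limit_uw"), "w") as f:
        f.write(str(int(watts * 1_000_000)))

def capped_watts(base_watts, level):
    """CL0/DL0 .. CL3/DL3: each level is assumed to remove a further 12.5% of
    the uncapped baseline, following the description in Section 2.1."""
    return base_watts * (1.0 - 0.125 * (level + 1))

if __name__ == "__main__":
    BASE_CPU_W, BASE_DRAM_W = 95.0, 30.0   # assumed uncapped baselines
    pkg, dram = find_rapl_domain("package-0"), find_rapl_domain("dram")
    if pkg:
        set_power_cap(pkg, capped_watts(BASE_CPU_W, 3))    # CL3
    if dram:
        set_power_cap(dram, capped_watts(BASE_DRAM_W, 3))  # DL3
```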

3 CONTENTION-AWARE CO-SCHEDULING

3.1 Basic Idea

The basic idea of the proposed algorithm is to combine programs that stress mutually exclusive shared resources in order to avoid severe contention. The scheduler organizes the tasks in CPU-local run queues, divides time into epochs and executes each scheduling group, composed of tasks with different characteristics, in a round-robin manner to ensure fairness among scheduling groups. Since the degree of resource contention depends on the execution phases of each program, the combination of programs is dynamically controlled.

3.2 Estimating Shared Resource Usage

In order to balance the shared resource utilization among scheduling groups, the scheduler needs to be capable of estimating the shared resource usage of each program. We collect statistics from hardware performance monitoring units (PMUs) to estimate the shared resource usage, represent them as an activity vector [9], and use them to make the co-scheduling decisions.

Usage of the memory subsystem: Many previous works have shown that the amount of memory pressure a program generates correlates highly with last-level cache (LLC) miss statistics [2], [8], [11]. We use the number of LLC misses per second as a proxy for the amount of memory pressure each program generates.

Usage of power: For power usage we estimate the actual CPU and DRAM power consumption of each program using a statistical model, trained with the power consumption values measured via RAPL counters. The training data is obtained by running the SPEC CPU2006 benchmarks without a power cap. We collected the PMU data every 400 ms, and each benchmark is executed 6 times with the number of co-scheduled instances varied from 1 to 6. We model the CPU and DRAM power consumption of the platform with multiple input variables. We use five PMU events in our study: instructions committed, cycles, branch prediction misses, L1 cache accesses, and LLC misses (the latter is also used to estimate memory subsystem usage). The power consumption P is expressed as

$P = \sum_{i=1}^{N} (f \times W_i \times E_i) + C$,

where the sum runs over the model components (the five PMU events) and f is the average clock frequency within an epoch. Note that f is only applied for CPU power estimation and not for DRAM. $W_i$ is the coefficient of each component and $E_i$ is its event count (e.g., number of LLC misses). The term $f \times W_i \times E_i$ represents the dynamic power consumption contributed by component i, and C is a constant that mostly captures static power consumption.
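To make the model form concrete, the sketch below fits the coefficients $W_i$ and the constant $C$ by ordinary least squares over per-epoch samples. The sample layout and variable names are assumptions made for illustration; the actual models are fit with multivariate linear regression as described in Section 4.1.

```python
# Sketch of fitting P = sum_i(f * W_i * E_i) + C by least squares.
# The per-epoch sample layout is an assumption made for illustration.
import numpy as np

EVENTS = ["instructions", "cycles", "branch_misses", "l1_accesses", "llc_misses"]

def fit_power_model(samples, cpu_model=True):
    """samples: list of dicts with per-epoch PMU counts for EVENTS, the average
    clock frequency 'freq' and the measured RAPL power 'power' (W).
    Returns (weights W_i, constant C). For the DRAM model, f is not applied."""
    rows, targets = [], []
    for s in samples:
        f = s["freq"] if cpu_model else 1.0
        rows.append([f * s[e] for e in EVENTS])
        targets.append(s["power"])
    X = np.column_stack([np.asarray(rows), np.ones(len(rows))])  # last column -> C
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(targets), rcond=None)
    return coeffs[:-1], coeffs[-1]

def estimate_power(weights, constant, sample, cpu_model=True):
    """Evaluate the fitted model for one epoch's PMU sample."""
    f = sample["freq"] if cpu_model else 1.0
    return float(sum(w * f * sample[e] for w, e in zip(weights, EVENTS)) + constant)
```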

3.3 Proposed Co-Scheduling Algorithm

An activity vector (which is per process and one-dimensional) represents to what degree a running process utilizes the shared resources we are interested in. The size of the vector equals the number of resources we consider, which is three: (1) the memory subsystem, (2) CPU power and (3) DRAM power. The elements of the vector are normalized to the maximum utilization the respective resource can sustain. The maximum usage of the memory subsystem is experimentally obtained using a manually written microbenchmark. For the power consumption, the elements are normalized to the CPU and DRAM power caps.

The goal of our scheduling algorithm is to balance the usage of each shared resource among scheduling groups, which can be achieved by having programs with diverse characteristics in each group (recall the balanced co-scheduling example in Fig. 1). We use an algorithm similar to vector balancing [10]: at the end of each epoch, the scheduler calculates the variance varsum of each scheduling group, which is the sum over the variances of all vector elements; it then searches for a co-schedule that increases the minimum varsum over all scheduling groups by calculating varsum for all possible co-schedules. Varsum is defined as

$\mathrm{varsum} = \sum_{i=1}^{N} \left[ \frac{1}{M} \sum_{j=1}^{M} E_{ij}^2 - \left( \frac{1}{M} \sum_{j=1}^{M} E_{ij} \right)^2 \right]$,

where $E_{ij}$ is the i-th element (memory pressure, CPU power consumption or DRAM power consumption) of the j-th program, N is the number of shared resources (three in our study) and M is the number of programs in a scheduling group (six on our platform).
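As an illustration of the selection step, the sketch below computes varsum for a candidate group and enumerates all splits of the runnable programs into two groups of M = 6 (the situation with the 12-program workloads used later), keeping the split whose minimum varsum is largest. The activity vectors are assumed to be 3-element lists already normalized as described above; the data structures are illustrative, not the prototype's.

```python
# Sketch of the co-schedule selection: choose the partition of programs into
# scheduling groups that maximizes the minimum varsum. Two groups of M = 6
# are assumed (12 runnable programs on a 6-core CMP).
from itertools import combinations

N_RESOURCES = 3  # (1) memory subsystem, (2) CPU power, (3) DRAM power

def varsum(group_vectors):
    """Sum over resources of the per-group variance of the activity-vector
    elements: sum_i [ 1/M * sum_j E_ij^2 - (1/M * sum_j E_ij)^2 ]."""
    M = len(group_vectors)
    total = 0.0
    for i in range(N_RESOURCES):
        vals = [v[i] for v in group_vectors]
        mean = sum(vals) / M
        mean_sq = sum(x * x for x in vals) / M
        total += mean_sq - mean * mean
    return total

def select_coschedule(vectors, group_size=6):
    """vectors: dict pid -> normalized activity vector (length N_RESOURCES).
    Returns the two groups (tuples of pids) with the best minimum varsum."""
    pids = sorted(vectors)
    best, best_score = None, float("-inf")
    for group_a in combinations(pids, group_size):
        group_b = tuple(p for p in pids if p not in group_a)
        score = min(varsum([vectors[p] for p in group_a]),
                    varsum([vectors[p] for p in group_b]))
        if score > best_score:
            best, best_score = (group_a, group_b), score
    return best
```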

4 EVALUATION

4.1 Scheduler Implementation

We have implemented a prototype of the proposed scheduler as user-level software. The evaluation system runs Linux kernel 2.6.32, where the LIKWID toolset version 3.0 [15] is modified to allow periodic access to performance- and power-related PMUs. The Linux sched_setaffinity(2) API is used to control the CPU affinity of processes in order to realize the proposed algorithm. The scheduling epoch per scheduling group is set to 400 ms. For the CPU and DRAM power models, we applied multivariate linear regression to periodic statistics obtained from executions of a single copy of each evaluated SPEC benchmark. We have confirmed that the model is accurate enough (adjusted R-squared > 0.9) for our purpose.
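The sketch below illustrates the epoch mechanism at the user level: the processes of the active scheduling group are pinned to the six cores with sched_setaffinity while the other groups are suspended. The use of SIGSTOP/SIGCONT for inactive groups and the one-task-per-core pinning are assumptions, since the paper only states that sched_setaffinity(2) and 400 ms epochs are used.

```python
# User-level sketch of round-robin epoch scheduling between scheduling groups.
# Suspending inactive groups with SIGSTOP/SIGCONT is an assumption; the
# prototype described in the paper only specifies sched_setaffinity(2) and
# a 400 ms epoch per scheduling group.
import os
import signal
import time

EPOCH_SEC = 0.4         # 400 ms epoch per scheduling group
CORES = list(range(6))  # 6-core Sandy Bridge CMP

def run_epochs(groups, num_epochs):
    """groups: list of scheduling groups, each a list of up to 6 pids."""
    for epoch in range(num_epochs):
        active = groups[epoch % len(groups)]
        # Suspend every process outside the active group.
        for group in groups:
            if group is not active:
                for pid in group:
                    os.kill(pid, signal.SIGSTOP)
        # Pin one task per core (CPU-local run queues) and resume the group.
        for core, pid in zip(CORES, active):
            os.sched_setaffinity(pid, {core})
            os.kill(pid, signal.SIGCONT)
        time.sleep(EPOCH_SEC)
        # In the full scheduler, PMU/RAPL statistics would be read here and the
        # groups recomputed with the varsum-based selection of Section 3.3.
```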

4.2 Workloads and Evaluation Scenarios

Since computer systems are projected to remain power limited, we conduct the evaluation under a relatively tight power constraint with CL3 and DL3. We use SPEC CPU2006 benchmarks and microbenchmarks that stress the CPU and memory to evaluate the proposed scheduler, which considers both cache-memory contention and power contention, against a state-of-the-art scheduler that only considers cache-memory contention. The counterpart is implemented in the same way as the proposed scheduler except that it calculates varsum using only the memory pressure (it neither estimates nor uses the CPU and DRAM power consumption). We also evaluate two schedulers for reference: a scheduler that only considers power contention, and a random scheduler. We report the harmonic speedup [7] to discuss performance.

Evaluating all possible combinations of 12 benchmarks is not feasible due to the vast exploration space. Therefore, we use 30 workloads, each composed of 12 randomly selected benchmarks. Workloads are sorted in increasing order of the performance improvement of the proposed scheduler, as we will see later. The evaluation methodology is similar to the one originally proposed for SMT scheduling [14], and a number of studies have been evaluated in a similar fashion [3]. To account for the varying execution times of the programs, we immediately restart a program when it finishes, until all the processes have been executed to completion at least once. This keeps the number of threads constant during the whole evaluation.
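For reference, a minimal sketch of the reported metric, assuming the usual definition of harmonic speedup from [7] (the harmonic mean of the per-program speedups relative to their solo runs):

```python
def harmonic_speedup(solo_times, shared_times):
    """Harmonic mean of per-program speedups, where each speedup is the solo
    execution time divided by the co-scheduled execution time for the same
    work. Assumes the standard definition from [7]."""
    n = len(solo_times)
    return n / sum(t_shared / t_solo
                   for t_solo, t_shared in zip(solo_times, shared_times))

# Example: three programs slowed to 80%, 90% and 50% of their solo speed.
# harmonic_speedup([1.0, 1.0, 1.0], [1.25, 1.111, 2.0]) ~= 0.688
```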

Fig. 5. Performance of the random scheduler, the power contention-aware scheduler and the proposed (power contention and cache-memory contention aware) scheduler compared to the cache-memory contention-aware scheduler, for workloads WL1-WL30. y-axis: harmonic speedup (%).

4.3 Evaluation Results

Fig. 5 presents the performance results for the 30 workloads. On average, the power contention-aware scheduler (2.2%) and the proposed scheduler (5.0%) both improve performance over the cache-memory contention-aware scheduler in a power-constrained scenario with mixed workloads, while the random scheduler performs slightly worse (-1.1%). The proposed scheduler improves performance by more than 10% for three workloads (25.0%, 19.5%, 15.8%) and by more than 5% for nine workloads. The proposed scheduler also outperforms the power contention-aware scheduler for all but five workloads. The random scheduler performs worse than the cache-memory contention-aware scheduler on the left-hand side of the figure, where cache-memory contention dominates, and better on the right-hand side, where power contention dominates (and where the cache-memory contention-aware scheduler behaves poorly).

WL11 shows a corner-case scenario in which the power contention-aware scheduler performs 6.1% better than the proposed scheduler. This workload exhibits relatively stable behavior with high CPU power contention and modest cache-memory contention; the proposed scheduler balances not only the power contention but also the cache-memory contention, which is less of a bottleneck. As a result, the CPU power utilization becomes slightly unbalanced compared to that of the power contention-aware scheduler, which degrades performance. Nonetheless, the proposed scheduler still performs better than the cache-memory contention-aware scheduler. Another interesting observation concerns the two rightmost workloads (WL29 and WL30), for which the proposed scheduler greatly outperforms the cache-memory contention-aware scheduler. For these two workloads, the power contention-aware scheduler also shows more than 15% improvement, which clearly indicates that most of the performance gain comes from mitigated power contention.

5 CONCLUSION

As the ability to cap the power consumption of CMPs becomes increasingly important in modern computer system designs, power contention, which can lead to significant performance degradation, becomes a critical problem. As shown in Section 2, power contention is a serious issue, and the problem opens up new research opportunities in this area. Our work tackles the problem by leveraging scheduling as an optimization knob. We have shown that our scheduling algorithm, which treats power as a first-class shared resource and co-schedules programs with different characteristics, improves system performance over a state-of-the-art cache-memory contention-aware scheduler.

ACKNOWLEDGEMENTS

This work is sponsored, in part, by JSPS Postdoctoral Fellowships for Research Abroad, and Defense Advanced Research Projects Agency (DARPA), Microsystems Technology Office (MTO), under contract number HR0011-13-C-0022. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

REFERENCES

[1] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le, "RAPL: memory power estimation and capping," in ISLPED '10, Aug. 2010, pp. 189-194.
[2] G. Dhiman, G. Marchetti, and T. Rosing, "vGreen: a system for energy efficient computing in virtualized environments," in ISLPED '09, Aug. 2009, pp. 243-248.
[3] A. Fedorova, M. Seltzer, and M. D. Smith, "Improving performance isolation on chip multiprocessors via an operating system scheduler," in PACT '07, Sep. 2007, pp. 25-38.
[4] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM CAN, vol. 34, no. 4, pp. 1-17, Sep. 2006.
[5] S. Kim, D. Chandra, and Y. Solihin, "Fair cache sharing and partitioning in a chip multiprocessor architecture," in PACT '04, Oct. 2004, pp. 111-122.
[6] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, "Heracles: improving resource efficiency at scale," in ISCA '15, Jun. 2015, pp. 450-462.
[7] K. Luo, J. Gummaraju, and M. Franklin, "Balancing throughput and fairness in SMT processors," in ISPASS '01, Nov. 2001, pp. 164-171.
[8] J. Mars, N. Vachharajani, R. Hundt, and M. Lou Soffa, "Contention aware execution: online contention detection and response," in CGO '10, Apr. 2010, pp. 257-265.
[9] A. Merkel and F. Bellosa, "Task activity vectors: a new metric for temperature-aware scheduling," in EuroSys '08, Apr. 2008, pp. 1-12.
[10] A. Merkel, J. Stoess, and F. Bellosa, "Resource-conscious scheduling for energy efficiency on multicore processors," in EuroSys '10, Apr. 2010, pp. 153-166.
[11] M. K. Qureshi and Y. Patt, "Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches," in MICRO-39, Dec. 2006, pp. 423-432.
[12] H. Sasaki, S. Imamura, and K. Inoue, "Coordinated power-performance optimization in manycores," in PACT '13, Oct. 2013, pp. 51-61.
[13] H. Sasaki, T. Tanimoto, K. Inoue, and H. Nakamura, "Scalability-based manycore partitioning," in PACT '12, Sep. 2012, pp. 107-116.
[14] A. Snavely and D. Tullsen, "Symbiotic jobscheduling for a simultaneous multithreaded processor," in ASPLOS-IX, Dec. 2000, pp. 234-244.
[15] J. Treibig, G. Hager, and G. Wellein, "LIKWID: a lightweight performance-oriented tool suite for x86 multicore environments," in ICPPW '10, Sep. 2010, pp. 207-216.
