Cluster management at Google with the system we internally call Borg
john wilkes / [email protected]
Principal Software Engineer
2015-06 dotScale
Derived from EuroSys'15 paper (http://goo.gl/1C4nuo)
Image by Connie Zhou
User view

    job hello_world = {
      runtime = { cell = 'ic' }              // Cell (cluster) to run in
      binary = '.../hello_world_webserver'   // Program to run
      args = { port = '%port%' }             // Command line parameters
      requirements = {                       // Resource requirements
        ram = 100M
        disk = 100M                          // (optional)
        cpu = 0.1
      }
      replicas = 510000                      // Number of tasks
    }
User view
Binary
User view What just happened?
[Diagram: a config file, borgcfg, and web browsers talk to a Cell. The Cell contains the BorgMaster (UI shard, read/UI shard, link shards, persistent store (Paxos)), the scheduler, and several Borglets.]
User view
[Figure: "Hello world!" repeated many times]
User view
Failures
[Figure: task-eviction rates and causes]
Failures
A 2000-machine service will have >10 task exits per day.
This is not a problem: it's normal.
Images by Connie Zhou
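The arithmetic behind this claim can be sketched as follows. The figures 2000 and 10 come from the slide; treating ">10" as exactly 10 and spreading exits evenly across machines are simplifying assumptions made here, so the results are rough bounds rather than measurements.

```python
# Back-of-envelope arithmetic for the slide's claim.
# MACHINES and EXITS_PER_DAY come from the slide; ">10" is treated as
# exactly 10 here (an assumption), so these are bounds, not measurements.
MACHINES = 2000
EXITS_PER_DAY = 10


def exits_per_machine_per_day(machines: int, exits_per_day: float) -> float:
    """Average task exits per machine per day, assuming an even spread."""
    return exits_per_day / machines


def minutes_between_exits(exits_per_day: float) -> float:
    """Average gap between task exits somewhere in the service."""
    return 24 * 60 / exits_per_day


rate = exits_per_machine_per_day(MACHINES, EXITS_PER_DAY)
gap = minutes_between_exits(EXITS_PER_DAY)
print(f"per-machine rate: {rate:.3f} exits/day")
print(f"service-wide gap: {gap:.0f} minutes")
```

Under these assumptions, a small per-machine rate still means the service as a whole sees a task exit every couple of hours, which is consistent with the slide treating exits as routine.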
Efficiency
Advanced binpacking algorithms
[Figure: experimental placement of production VM workload, July 2014. Labels: one machine, available resources, stranded resources.]
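To make "stranded resources" concrete, here is a minimal sketch of a two-dimensional best-fit placement. It is not Borg's scheduler: the `Machine` class, the scoring rule (sum of leftover CPU and RAM, which crudely mixes units), and all task sizes are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cpu: int  # free CPU, in milli-cores
    ram: int  # free RAM, in MiB
    tasks: list = field(default_factory=list)

    def fits(self, cpu: int, ram: int) -> bool:
        return cpu <= self.cpu and ram <= self.ram

    def place(self, name: str, cpu: int, ram: int) -> None:
        self.cpu -= cpu
        self.ram -= ram
        self.tasks.append(name)


def best_fit(machines, name, cpu, ram):
    """Place a task on the feasible machine that would have the least
    free capacity left afterwards. Returns the machine index, or None."""
    candidates = [
        ((m.cpu - cpu) + (m.ram - ram), i)  # hypothetical score, mixes units
        for i, m in enumerate(machines)
        if m.fits(cpu, ram)
    ]
    if not candidates:
        return None
    _, idx = min(candidates)
    machines[idx].place(name, cpu, ram)
    return idx


def stranded(machine, smallest_cpu, smallest_ram):
    """Free (cpu, ram) that cannot host even the smallest pending task."""
    if machine.fits(smallest_cpu, smallest_ram):
        return (0, 0)
    return (machine.cpu, machine.ram)


# Two CPU-heavy tasks, each nearly filling a machine's CPU (toy data).
machines = [Machine(cpu=1000, ram=1000) for _ in range(2)]
placements = [
    best_fit(machines, name, cpu, ram)
    for name, cpu, ram in [("a", 900, 100), ("b", 900, 100)]
]
print(placements)
print(stranded(machines[0], smallest_cpu=200, smallest_ram=100))
```

In this toy run each machine keeps 900 MiB of RAM free, but with only 100 milli-cores of CPU left no task needing 200 milli-cores can use it. That RAM is stranded in the sense of the figure's label.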
Efficiency
Multiple applications per machine
[Figure: tasks per machine. Source: CPI^2 paper, EuroSys 2013.]
Efficiency
Sharing clusters between prod/batch helps
Segregating them would need more machines
[Figure: # machines for the shared cell (original), the shared cell (compacted), and the non-prod load (compacted) plus the prod-only load (compacted), with an "overhead" annotation.]
Efficiency
Sharing clusters between prod/batch helps
Segregating them would need more machines
[Figure: waste for 15 production cells from a larger pool, omitting small ones (<5000 machines).]
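A toy illustration of why segregating prod and non-prod work can need more machines. This is only a sketch: it uses one-dimensional first-fit-decreasing packing, not the compaction method behind the figures, and the capacity and task sizes are hypothetical numbers chosen so the effect is easy to see.

```python
def machines_needed(task_sizes, capacity):
    """First-fit decreasing: count machines needed for one-dimensional tasks."""
    free = []  # remaining capacity of each machine opened so far
    for size in sorted(task_sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"task of size {size} exceeds machine capacity")
        for i, room in enumerate(free):
            if size <= room:
                free[i] -= size
                break
        else:
            # No existing machine has room: open a new one.
            free.append(capacity - size)
    return len(free)


CAPACITY = 1000      # milli-cores per machine (hypothetical)
prod = [600] * 10    # hypothetical prod tasks
batch = [400] * 10   # hypothetical non-prod tasks

segregated = machines_needed(prod, CAPACITY) + machines_needed(batch, CAPACITY)
shared = machines_needed(prod + batch, CAPACITY)
print(f"segregated: {segregated} machines, shared: {shared} machines")
```

With these made-up sizes, each prod task leaves a gap that exactly one batch task can fill, so the shared cell needs 10 machines where the segregated pair needs 15. Real workloads are messier, and the size of the saving depends entirely on the task mix.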
Efficiency
Resource reclamation
[Figure: resource use over time, with these labels]
limit: amount of resource requested
reservation: estimate of future usage
usage: actual resource consumption
potentially reusable resources
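The three quantities defined here can be sketched in code. The reading that the "potentially reusable resources" are the gap between limit and reservation is taken from the figure's layout. The estimator itself (a safety margin over usage, slow decay downwards, immediate rise upwards) and its parameters `margin` and `decay` are hypothetical, not the production policy.

```python
from dataclasses import dataclass


@dataclass
class TaskResources:
    limit: float        # amount of resource requested
    reservation: float  # estimate of future usage
    usage: float        # actual resource consumption

    def reclaimable(self) -> float:
        """Potentially reusable resources: the gap between limit and reservation."""
        return max(0.0, self.limit - self.reservation)


def update_reservation(task, margin=0.2, decay=0.1):
    """One estimator step (hypothetical policy).

    The target is usage plus a safety margin, capped at the limit. The
    reservation rises to the target immediately but decays towards it slowly.
    """
    target = min(task.limit, task.usage * (1 + margin))
    if target >= task.reservation:
        task.reservation = target
    else:
        task.reservation -= decay * (task.reservation - target)
    return task.reservation


# Start with the reservation equal to the limit, while usage is much lower.
task = TaskResources(limit=100.0, reservation=100.0, usage=40.0)
update_reservation(task)
first_step = task.reservation

for _ in range(100):
    update_reservation(task)
settled = task.reservation
reclaimable_when_settled = task.reclaimable()

# A usage spike pushes the reservation back up in a single step.
task.usage = 90.0
update_reservation(task)
after_spike = task.reservation
print(first_step, settled, reclaimable_when_settled, after_spike)
```

A smaller `margin` or a faster `decay` frees more resources for other work, at the cost of a higher chance that usage overtakes the reservation before the estimate catches up.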
Efficiency
Resource reclamation could be more aggressive
[Figure: Nov/Dec 2013]
A few other moving parts
[Diagram: a config file, borgcfg, and web browsers talk to a Cell. The Cell contains the BorgMaster (UI shard, read/UI shard, link shards, persistent store (Paxos)), the scheduler, and several Borglets.]
A few other moving parts
[Diagram, from an original by Cody Smith. Core: job config, master, agent, app. Surrounding parts: system config, security, accounting/planning (shown as accounting/billing in the final build), storage, monitoring, binaries + data distribution.]
Kubernetes
κυβερνήτης: pilot or helmsman of a ship
http://kubernetes.io
Kubernetes
Direct Borg analogues:
● Borg containers => Docker containers
● alloc (task group) => pod (container group)
● Borglet => Kubelet
● persistent, declarative specs
● reconciliation loops
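The last two items, declarative specs and reconciliation loops, fit together: the spec states what should be running, and a loop repeatedly compares that with what is observed and acts on the difference. Below is a minimal sketch of one such step. The function names, the action tuples, and the dict-of-replica-counts model are hypothetical and are not the Borg or Kubernetes API.

```python
def reconcile(desired, observed):
    """Return the actions needed to move `observed` towards `desired`.

    Both arguments map a job name to a replica count.
    """
    actions = []
    for job, want in desired.items():
        have = observed.get(job, 0)
        if have < want:
            actions.append(("start", job, want - have))
        elif have > want:
            actions.append(("stop", job, have - want))
    # Anything running that the spec no longer mentions should be stopped.
    for job, have in observed.items():
        if job not in desired and have > 0:
            actions.append(("stop", job, have))
    return actions


def apply_actions(observed, actions):
    """Simulate carrying out the actions; returns the new observed state."""
    new_state = dict(observed)
    for verb, job, count in actions:
        delta = count if verb == "start" else -count
        new_state[job] = new_state.get(job, 0) + delta
    return new_state


desired = {"hello_world": 5}
observed = {"hello_world": 3, "old_job": 2}

first_pass = reconcile(desired, observed)
observed = apply_actions(observed, first_pass)
second_pass = reconcile(desired, observed)
print(first_pass)
print(second_pass)
```

Because each pass works from the current observed state rather than from a log of past commands, running it again after a failure or a restart is harmless: once the two states agree, the loop produces no actions.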
Kubernetes
New / improved:
● labels + label queries
● service abstraction
● composable microservices
● IP per pod
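As an illustration of the first item, a label is a key/value pair attached to an object, and a label query selects the objects whose labels match. The sketch below shows only simple equality matching over in-memory dicts. The pod names, label keys, and helper functions are hypothetical, and this is not the Kubernetes API.

```python
def matches(labels, selector):
    """True if every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods, selector):
    """Names of the pods whose labels satisfy the selector, in dict order."""
    return [name for name, labels in pods.items() if matches(labels, selector)]


# Hypothetical pods and labels.
pods = {
    "web-1": {"app": "hello", "tier": "frontend", "env": "prod"},
    "web-2": {"app": "hello", "tier": "frontend", "env": "test"},
    "db-1": {"app": "hello", "tier": "backend", "env": "prod"},
}

frontends = select(pods, {"tier": "frontend"})
prod_hello = select(pods, {"app": "hello", "env": "prod"})
print(frontends)
print(prod_hello)
```

The same set of pods can be sliced along any label dimension without a fixed hierarchy, which is what lets a service abstraction address "whatever currently matches this query" instead of a list of named instances.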
Observations:
1. Resiliency is achieved only by ruthless attention to detail
   a. ubiquitous software fault tolerance
   b. persistent, declarative specs
2. We get efficiency by:
   a. sharing resources
   b. reclaiming unused allocations
3. Containers make users more productive
[email protected]
http://kubernetes.io
http://goo.gl/1C4nuo (Borg paper)
Images by Connie Zhou