The Future of Networking May 7, 2016
Amin Vahdat Google Fellow
Computing at a Crossroads
• Distributed programming has faced similar challenges since sockets
• The free lunch in performance improvements is over
• Storage capacity has increased through disaggregation
• I/O latency gap remains
• “Next-gen” storage remains largely untapped at scale
Networking will drive future improvements in compute performance by blurring the line between individual servers
Google Cloud Platform
Last Decade
Cloud 1.0
Virtualization delivers capex savings to enterprise DCs
Now
HW on Demand
Cloud 1.0
Cloud 2.0
Public cloud frees enterprise from private HW infrastructure
Scheduling, load balancing primitives, “big data” query processing
The Third Wave of Cloud Computing
Compute, not servers
Cloud 1.0
Cloud 2.0
Cloud 3.0
Serverless compute, actionable intelligence, and machine learning
Not data placement, load balancing, OS configuration and patching
The Third Wave of Cloud Computing
Cloud 1.0
Cloud 2.0
Cloud 3.0
Networking should be aiming for Cloud 3.0
Networking and Cloud 3.0
• Storage disaggregation: the datacenter is the storage appliance
• Seamless telemetry and scale up/down
• Transparent live migration
• Open Marketplace of services, securely placed and accessed
Networking and Cloud 3.0
• Applications, not VMs
• Policy, not middleboxes
• Actionable Intelligence, not data processing
• SLOs, not placement/load balancing/scheduling
Making the Network Disappear with Software Defined Networking
Google Software Innovations
Driven by Unprecedented Demand for Scale, Bandwidth, Reliability
[Timeline figure, 2002 to 2012: GFS, MapReduce, Bigtable, Borg, Dremel, Pregel, FlumeJava, Colossus, Spanner]
DCN Bandwidth Growth
Traffic generated by servers in our datacenters
[Figure: aggregate traffic (y-axis, 1x to 50x) versus time (x-axis, Jul ‘08 to Nov ‘14)]
Google Networking Innovations
Our distributed computing infrastructure required networks that did not exist
[Timeline figure, 2006 to 2014: Google Global Cache, Onix, Watchtower, B4, Andromeda, BwE, Jupiter, Freedome, gRPC, QUIC]
SDN Motivation
• Traditional network architectures could not keep up with bandwidth demands in the data center
• Operational complexity of “box-centric” deployment
Google’s DCN redesign, inspired by server & storage scale out
• Clos Topologies
• Merchant Silicon
• Centralized Control → Software Defined Networking
Why Balance Matters @ Building Scale
An unbalanced data center means:
• Some resource is scarce... limiting your value
• Other resources are idle... increasing your cost
Substantial resource stranding [Eurosys 2015] if we cannot schedule at scale
Amdahl’s lesser known law: 1 Mbit/sec of IO for every 1 MHz of computation in parallel computing
Bandwidth @ Building Scale
[Diagram: compute slices (64 × 2.5 GHz servers, links labeled 100 Gb/s), Flash (100k+ IOPS, 100 us access, PBs of storage), and NVM (1M+ IOPS, 10 us access, TBs of storage) attached to a datacenter network; 50k servers → 5 Pb/s network??]
Based on Amdahl’s observation, we might need a 5 Pb/s network
• Even with 10:1 oversub → 500 Tb/s datacenter network
• Every building needs more bisection than the Internet
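The arithmetic behind these numbers can be sketched in a few lines. This is a back-of-the-envelope check, not the speaker's own calculation: the function name and constants are illustrative, and the 100 Gb/s per-server figure is taken from the diagram. Applied literally to a 64 × 2.5 GHz server, Amdahl's rule gives 160 Gb/s, so the 100 Gb/s on the slide appears to be a rounder, more conservative figure.

```python
# Back-of-the-envelope check of the slide's bandwidth numbers.
# Assumption: Amdahl's rule of thumb, 1 Mbit/s of IO per 1 MHz of compute.

MBIT_PER_MHZ = 1  # Amdahl's lesser known law


def amdahl_io_gbps(cores: int, ghz_per_core: float) -> float:
    """IO bandwidth in Gb/s that balances the given compute under the rule."""
    mhz = cores * ghz_per_core * 1000          # GHz -> MHz
    return mhz * MBIT_PER_MHZ / 1000           # Mbit/s -> Gbit/s


# Literal application of the rule to the slide's 64 x 2.5 GHz server.
per_server_rule_gbps = amdahl_io_gbps(64, 2.5)

# Figures taken from the slide.
PER_SERVER_GBPS = 100
SERVERS = 50_000
OVERSUBSCRIPTION = 10

fabric_pbps = SERVERS * PER_SERVER_GBPS / 1e6                      # Gb/s -> Pb/s
oversub_tbps = SERVERS * PER_SERVER_GBPS / OVERSUBSCRIPTION / 1e3  # Gb/s -> Tb/s

print(f"Amdahl rule per server: {per_server_rule_gbps:.0f} Gb/s")
print(f"Full-rate fabric:       {fabric_pbps:.1f} Pb/s")
print(f"With 10:1 oversub:      {oversub_tbps:.0f} Tb/s")
```

With the slide's figures, the full-rate fabric comes to 5 Pb/s and the 10:1 oversubscribed fabric to 500 Tb/s.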
Latency @ Building Scale
[Diagram: compute slices, Flash (100k+ IOPS, 100 us access, PBs of storage), and NVM (1M+ IOPS, 10 us access, TBs of storage) attached to a datacenter network with 10 us latency]
To exploit future NVM, we need ~10 usec latency
• Even for Flash, we need 100 usec latency
• Or, expensive servers sit idle while they wait for IO
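A rough way to see why network latency must be comparable to device latency is to compute the share of each remote access spent in the network. The sketch below assumes, as a simplification not stated on the slide, that device access time and network latency simply add with no overlap. The device times are the slide's figures; the network latencies are hypothetical sample points.

```python
# Sketch: how much of a remote storage access is spent in the network.
# Assumption (not from the slides): device time and network latency simply add.


def network_share(device_us: float, network_us: float) -> float:
    """Fraction of total access time spent in the network."""
    return network_us / (device_us + network_us)


# Device access times are the slide's figures; the network latencies are
# hypothetical sample points.
DEVICES = {"Flash": 100.0, "NVM": 10.0}

for name, device_us in DEVICES.items():
    for network_us in (10.0, 100.0, 1000.0):
        share = network_share(device_us, network_us)
        print(f"{name}: device {device_us:.0f} us + network {network_us:.0f} us "
              f"-> {share:.0%} of the wait is network")
```

Under this model, a 10 us NVM device behind a 100 us network spends about 91% of each access waiting on the network, which is the idle time the slide warns about.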
Availability @ Building Scale
[Diagram: compute slices, Flash, and NVM attached to a datacenter network; 50k servers]
Cannot take down a XX MW building for maintenance
• New servers always added; older ones decommissioned… with zero service impact
• Network evolves from 1G → 10G → 40G → 100G → ???
Datacenter Network infrastructure to support Google scale, performance, and availability → underpinnings of Google Cloud Platform
Five Generations of Networks for Google scale
[Diagram: Clos topology with Spine Blocks 1 through M connected to Edge Aggregation Blocks 1 through N, which connect to server racks with ToR switches]
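The spine/aggregation structure in the diagram can be modeled as a small graph to show how capacity scales with the number of blocks. This is a minimal sketch under stated assumptions: every aggregation block is assumed to connect to every spine block, and the function names, block counts, and 40 Gb/s link speed are hypothetical. The actual wiring of Google's fabrics is described in the Jupiter Rising paper and is not reproduced here.

```python
# Sketch of a two-tier spine/aggregation fabric as a list of links.
# Assumption: full mesh between aggregation blocks and spine blocks.
from itertools import product


def build_fabric(num_spine: int, num_agg: int, links_per_pair: int = 1):
    """Return (agg_block, spine_block) links for a full-mesh two-tier fabric."""
    return [
        (f"agg{a}", f"spine{s}")
        for a, s in product(range(1, num_agg + 1), range(1, num_spine + 1))
        for _ in range(links_per_pair)
    ]


def uplink_capacity_gbps(links, link_gbps: float) -> float:
    """Total aggregation-to-spine capacity, summed over all links."""
    return len(links) * link_gbps


# Hypothetical sizes chosen only for illustration.
links = build_fabric(num_spine=4, num_agg=8)
print(len(links), "links,", uplink_capacity_gbps(links, 40), "Gb/s of uplink capacity")

# Scaling out: adding a spine block adds one link per aggregation block.
bigger = build_fabric(num_spine=5, num_agg=8)
print(len(bigger) - len(links), "links added by one more spine block")
```

In this model, capacity grows by adding blocks rather than by replacing boxes with larger ones.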
Bisection B/w
[Chart: bisection bandwidth (1T to 1000T) by year (‘04 to ‘12) for 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, and Jupiter]
Bisection B/w
[Chart build: bisection bandwidth (1T to 1000T) by year (‘04 to ‘12), annotated “+ Scales out building wide” and “1.3 Pbps”]
Bisection B/w
[Chart build: bisection bandwidth (1T to 1000T) by year (‘04 to ‘12), annotated “+ Enables 40G to hosts”, “+ External control servers”, and “+ OpenFlow”]
B4: Google's Software Defined WAN
B4: [Jain et al, SIGCOMM 13]
BwE: [Kumar et al, SIGCOMM 15]
Andromeda Network Virtualization
[Diagram: virtual networks (VNET: 10.1.1/24, VNET: 192.168.32/24, VNET: 5.4/16) with network functions (Load Balancing, DoS, ACLs, VPN, NFV) layered over ToR switches and the internal network (10.1.1/24, 10.1.2/24, 10.1.3/24, 10.1.4/24), alongside Google Infrastructure Services]
Making the Network Disappear
Software Defined Networking enables the network to disappear, driving the next wave of computing
• Sufficiently high bandwidth, low latency, and low cost
• Fabrics, not boxes, programmable for performance and isolation
• Highest level of availability, zero downtime for new features/performance
References
1. “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network,” SIGCOMM 2015.
2. “B4: Experience With a Globally-Deployed Software Defined WAN,” SIGCOMM 2013.
3. “Bandwidth Enforcer: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing,” SIGCOMM 2015.
Thank You!