Hundreds of people contributed to the effort over a decade.
Distributed programming has faced the same set of challenges since sockets were introduced into BSD 30 or so years ago. That’s about to change.
With the end of Moore’s Law, how we think about computing has to change, and with it what distributed computing means is going to change.
A key part of computing, and of how people like to build their systems, is storage.
We’ve seen tremendous increases in storage capacity, roughly following Moore’s law.
The I/O gap remains. The distance between processors and the underlying data they need to process is continuing to increase.
We can think of disks spread across entire buildings as being available to any server. This is fantastic. At the same time, given the processing power we have, those disks look further and further away.
Next generation flash at scale in a distributed computing environment remains largely untapped.
Networking is going to play a critical role in what computing means going forward.
Networking is at an inflection point. What computing means is going to be largely determined by your ability to build great networks.
So datacenter networking is a key differentiator.
Google builds building-scale computers. Row after row of compute and row after row of storage.
Over the years Google has built software infrastructure to leverage their hardware infrastructure:
GFS (2002), MapReduce (2004), BigTable (2006), Borg (2006), Pregel (2008), Colossus (2010), FlumeJava (2012), Dremel, Spanner.
Many of these efforts have defined what it means to do distributed computing today.
You can’t build world class distributed computing infrastructure without world class networking infrastructure. How could you build a system like GFS to harness 100,000 disks without a world class network to put it all together?
Google’s innovations in networking:
Datacenter networking is different from traditional Internet networking.
A datacenter network is run by a single organization and is preplanned. The Internet has to be discovered to see what it looks like; in a datacenter you actually know what the network looks like, because you planned it and built it. So you can manage it in a central way rather than discover it as you go along. And because it’s under the control of a single organization, the software you run can be very different. It doesn’t have to interoperate with 20 years of legacy, so it can be optimized for what you need it to do.
What often draws people to networking is the beauty of the problem. Beauty can mean it’s a hard problem. Can the problem be redefined to be simpler, more scalable, and easier to solve?
Google needed a new network. A decade or so ago Google found traditional network architectures, both hardware and software, could not keep up with their bandwidth demands and the scale of their distributed computing infrastructure in the datacenter.
Can we buy it?
Google could not buy, at any price, a datacenter network that would meet the requirements of their distributed systems.
Can we run it?
Box-centric deployments incurred the cost of high operational complexity. Even if Google bought the biggest datacenter network that they could, the model for operating these networks was around individual boxes, with individual command line interfaces.
Google already knew how to handle 10s of thousands of servers as if they were a single computer and hundreds of thousands of disks as if they were a single storage system. The idea of having to manage 1,000 switches as if they were a thousand individual switches didn’t make a lot of sense and seemed unnecessarily hard.
We’ll build it.
Inspired by what was learned in the server and storage world: scale out.
Rather than figure out how to buy the next biggest network or the next biggest router, be able to scale out the switching and routing infrastructure with additional commodity elements, just as was done with servers and storage.
Why not, when the need arose, plug in another network element to get more ports and more bandwidth?
Three key principles that allowed Google to build scale out datacenter networks:
Clos topologies. The idea is to leverage small commodity switches to build a non-blocking, very large switch that can scale more or less without limit.
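As a rough sketch of the arithmetic (this is the fat-tree variant of a Clos topology; the 48-port figure is an assumption for illustration, not Google’s actual design):

```python
def fat_tree(k):
    """Capacity of a 3-stage folded-Clos (fat-tree) built from k-port switches.

    k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2
    core switches, yields a non-blocking fabric for k^3/4 hosts.
    """
    assert k % 2 == 0
    pods = k
    edge_per_pod = agg_per_pod = k // 2
    core = (k // 2) ** 2
    hosts = k ** 3 // 4
    switches = pods * (edge_per_pod + agg_per_pod) + core
    return hosts, switches

# With hypothetical commodity 48-port switches:
hosts, switches = fat_tree(48)
# 27,648 hosts of non-blocking capacity out of 2,880 identical small switches
```

The point is that capacity grows by adding more identical commodity elements, never by buying a bigger box.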
Merchant silicon. Since Google wasn’t building wide area Internet infrastructure, they did not need massive routing tables or deep buffering. This meant commodity merchant silicon could do the job of building Clos-based topologies in the datacenter.
Centralized control. Software agents knew what the network should look like, set up routing, and reacted to exceptions from the underlying plan. It’s much easier to manage a network when you know its shape than when you are constantly trying to discover what it should look like. If you want to change the network, you tell the centralized control software to change the network.
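A minimal sketch of that idea, with hypothetical names: the controller holds the planned topology, compares it against what the network reports, and only has to react to the differences.

```python
# Hypothetical: the links the operator planned and built.
PLANNED_LINKS = {("s1", "s2"), ("s1", "s3"), ("s2", "s3")}

def reconcile(observed_links):
    """Return the actions needed to bring the network back to plan."""
    failed = PLANNED_LINKS - observed_links      # should exist but don't
    unexpected = observed_links - PLANNED_LINKS  # exist but weren't planned
    return {"reroute_around": sorted(failed), "investigate": sorted(unexpected)}

# s1-s3 has gone down; everything else matches the plan.
actions = reconcile({("s1", "s2"), ("s2", "s3")})
```

Because the plan is known up front, there is no discovery protocol here at all, only exception handling against the plan.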
The approaches to the datacenter network also applied to the campus network, connecting buildings together, and to the wide area network. B4 and Andromeda were inspired by and informed by the work in datacenter networking, where Google had been practicing building their own hardware and centralized control infrastructure for many years.
Over 6 years, datacenter bandwidth has increased 50x.
The amount of bandwidth that needs to be delivered to Google’s servers is outpacing Moore’s Law.
This means that to keep up with bandwidth needs Google has to be able to scale out; it’s not possible to continually rip out old equipment and put in new networking.
Scale drives architecture.
A typical network today (not necessarily Google) may have 10K+ switches, 250K+ links, 10M+ routing rules. How you deal with networks at this scale is fundamentally different than smaller scale networks.
Why build networks at the scale of entire datacenters?
Some of the largest datacenters have 10s and 10s of megawatts of compute infrastructure.
There is substantial resource stranding if you can not schedule at scale. Imagine a number of different jobs that need to run across a shared infrastructure. If jobs must be scheduled within the boundary of a single cluster, and you have, say, 10 clusters of 1,000 servers versus 1 cluster of 10,000 servers, you’ll get much better efficiency if you can schedule arbitrarily across the 10,000-server cluster. If you can place your jobs anywhere across 10,000 servers, rather than having to fit within 1,000 servers, the numbers work out to be quite stark. So if a network can be built to scale to an entire building, you get much more efficiency out of the compute and out of the disks. And those wind up being the dominant costs. The network can be quite inexpensive, and quite an enabler for compute and storage.
“Resource stranding” is when you have leftover CPU in one place and leftover memory in another.
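A toy illustration of stranding with fixed-size jobs (the 300-server job size is an assumption, chosen only to make the effect visible):

```python
def schedulable(job_size, clusters):
    """Greedy count of fixed-size jobs that fit given per-cluster capacities.

    Jobs cannot span cluster boundaries, so each cluster's leftover
    capacity below job_size is stranded.
    """
    return sum(capacity // job_size for capacity in clusters)

# The same 10,000 servers, carved up two ways, running 300-server jobs:
small = schedulable(300, [1000] * 10)  # 10 clusters of 1,000 -> 30 jobs
big   = schedulable(300, [10000])      # 1 cluster of 10,000  -> 33 jobs
# 1,000 servers stranded in the first case vs. 100 in the second.
```

Real jobs vary in size and shape (CPU vs. memory), which makes the stranding worse, not better, than this toy version suggests.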
Balancing your datacenter.
Once you get to the scale of a building you have to make sure you deliver sufficient bandwidth.
An unbalanced datacenter has some resource being scarce which limits your value. If one resource is scarce it means some other resources are sitting idle, which increases your costs.
Typically at datacenter scale the resource that is most scarce is the network. The network tends to be under provisioned because we don’t know how to build big networks that deliver lots of bandwidth.
Bandwidth. Amdahl has another law.
You need 1 Mbit/s of I/O for every 1 MHz of computation for parallel computation.
Let’s say, for example purposes only, that in a future datacenter near you compute servers have 64 × 2.5 GHz cores. To be balanced, each server then needs about 100 Gb/s of bandwidth. This is not local I/O; local I/O doesn’t matter. You need to access datacenter-wide I/O. The datacenter may have 50k servers; flash storage at 100k+ IOPS, 100 microsecond access times, and petabytes of storage; and in the future some other non-volatile memory (NVM) technology with 1M+ IOPS, 10 microsecond access times, and terabytes of storage.
So you would need a 5 Pb/s network and correspondingly capable switches. Even with a 10:1 oversubscription ratio this means you would need a 500 Tb/s datacenter network to come close to achieving balance. For perspective, a 500 Tb/s network has more capacity than the entire Internet today (which is probably near 200 Tb/s).
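The arithmetic behind these numbers, spelled out (note that the literal 1 Mbit/s-per-MHz product is 160 Gb/s per server; the round "about 100 Gb/s" figure is what the rest of the calculation uses):

```python
MBIT_PER_MHZ = 1  # Amdahl's rule of thumb for a balanced parallel system

# Per-server demand: 64 cores at 2.5 GHz = 160,000 MHz -> 160,000 Mb/s.
cores, ghz = 64, 2.5
per_server_gbps = cores * ghz * 1000 * MBIT_PER_MHZ / 1000  # = 160.0

# Datacenter-wide demand, using the rounded 100 Gb/s per-server figure:
servers = 50_000
full_pbps = servers * 100 / 1_000_000       # 5.0 Pb/s fully provisioned
oversub_tbps = full_pbps * 1000 / 10        # 500.0 Tb/s at 10:1 oversubscription
```

Either way you round, the balanced-network requirement lands in the hundreds of terabits per second for a single building.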
Latency. To achieve goals of storage infrastructure disaggregation you need predictable low latency.
Disks are slow with 10 millisecond access latency so they are easy to make look local. The network is faster than that.
Flash and NVM are much harder.
Availability. You have a building’s worth of compute, new servers need to be introduced continuously, and servers need to be upgraded to go from 1G -> 10G -> 40G -> 100G -> ???
The building can’t be taken down for the upgrade to happen. The investment is just too large. Refreshing the network and servers must be constant. The old must live with the new without interrupting very much capacity at all.
Scale up networking doesn’t work. It’s not possible to perform a scorched earth upgrade on a network that crisscrosses the entire datacenter.
Google Cloud Platform is built on a datacenter network infrastructure that supports Google scale, performance, and availability. This is being opened up to the public. The hope is the next great service can leverage this infrastructure without having to invent it.
Key challenges are providing isolation while opening up the raw capacity of the hardware.
In the early days Google faced a choice of how to build their control plane: use the existing service-provider stack of OSPF, IS-IS, BGP, etc., or build their own.
They chose to build their own for several reasons.
The topologies Google was considering a decade ago needed multipath forwarding to work. Lots and lots of paths between source and destination were needed to achieve the desired bandwidth. Existing protocols didn’t support multipath. They were about connectivity: finding a path from the source to the destination, not the best path, and not multiple paths.
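A minimal sketch of the multipath idea, in the spirit of ECMP hashing (the names and the CRC32 hash are illustrative, not any particular switch’s implementation): hash each flow’s 5-tuple onto one of many equal-cost paths, so all paths carry traffic while the packets of any single flow stay in order on one path.

```python
import zlib

def pick_path(flow_5tuple, paths):
    """Deterministically map a flow to one of many equal-cost paths."""
    key = "|".join(map(str, flow_5tuple)).encode()
    return paths[zlib.crc32(key) % len(paths)]

# Eight hypothetical equal-cost paths through a Clos fabric.
paths = [f"path-{i}" for i in range(8)]
flow = ("10.0.0.1", "10.0.0.2", 12345, 80, "tcp")

# Same flow always hashes to the same path; different flows spread out.
assert pick_path(flow, paths) == pick_path(flow, paths)
```

A shortest-path-only protocol would send every flow down one of these eight paths and waste the other seven.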
Ten years ago there were no high-quality open source stacks.
All the protocols were based on broadcast, and at the scale Google was looking to operate at they were worried about the scalability of broadcast-based protocols.
Once the broadcast-based protocols are installed, you have to configure each individual box to use them, and Google doesn’t like that.
Putting it all together with the new conventional wisdom learned at Google in building massive systems at scale:
A logically centralized (this does not mean a single server), hierarchical control plane with a peer-to-peer data plane beats full decentralization. Google has seen this play out over and over again, and it played out with datacenter networks as well.
Scale out is much more convenient than scale up. This was seen with GFS, MapReduce, BigTable, and Andromeda.
Centralized configuration and management dramatically simplifies all aspects of a system, including deployment.