Chapter 5

Google Infrastructure

The next two chapters are about gaining a deeper understanding of the Google Cloud and its many offerings. I’ve done my best to make the content approachable, accessible, and fun for everyone. As with anything technology-related, a few of the interesting details can get a little more technical than some people might like. While you’re free to skip these chapters, they do provide some groundwork for the “Building on the Cloud” chapter, which discusses several success stories, including those of Pokémon GO, Spotify, and Evernote, and explains how the cloud contributed to their success.

The Google Cloud Platform is like an iceberg: what’s exposed to the public is powerful, but there is a deep and solid foundation underneath that is not visible to us. There are layers and layers of technology stacks, each built and maintained by its own team of world-class experts and engineers. Although the word cloud is a popular way to refer to technology made available over the Internet, I can vouch for the fact that, in reality, it’s very much all right here on flat ground, close to our dear earth. And close to dear earth is exactly where our walk through the Google Cloud starts, in Google’s data centers.

Global Data Centers

Google has more than fifteen data centers across the world, most of which are in the US. Others are located in Asia and Europe. These data centers are massive facilities designed to be physically secure and have their own dependable power source. Each data center has multiple layers of security, from highly trained and carefully screened guards to laser- and access-card-based perimeter security. Everything is logged and recorded, and the grounds are under constant video surveillance.

The heart of each data center, where the servers are located, has even more security, including biometrics, and can be accessed by only a very select few of Google’s personnel. According to Google, it’s so secure that fewer than one percent of their staff will ever set foot in one of their data centers.

The most important resource that these data centers need is power. Care is taken to ensure everything has multiple redundant power sources, and Google chooses to use renewable power sources such as wind and solar where possible. Greenpeace had this to say about Google’s announcement that it would purchase 48 megawatts of clean, renewable wind power for its data center in Oklahoma, USA: “Google’s announcement shows what the most forward-thinking, successful companies can accomplish when they are serious about powering their operations with clean energy.” When Greenpeace, whose stated goal is to ensure the ability of the earth to nurture life in all its diversity, has good things to say about a company, I take that seriously.

Google has stated that over 35 percent of their operations are powered by renewable energy, and they are investing heavily to ensure that the number only rises.

Inside these data centers sit millions of custom-built servers, all designed and built by Google. The hardware is designed to be simple, secure, and quick to replace. Google aims to never create a single point of failure within their infrastructure, so if one server, or even a whole rack of servers, fails suddenly, it will not affect operations in any way. And even if a whole data center should fail, it would still not have a major impact on Google’s infrastructure. Their software is designed to route around failures by maintaining multiple copies of everything and being able to switch over in seconds.

In addition to all the physical security, Google also does a lot to keep its data safe from any potential snooping. All the data stored and transmitted within the data centers is fully encrypted. Google has a very sophisticated framework for keeping the keys to all of this encryption safe. Users can also provide their own encryption keys for their data if they feel the need to.
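
To make this concrete, here is a minimal sketch of what supplying your own encryption key can look like when writing a file to Google Cloud Storage with its Python client library. The bucket and object names are placeholders I made up for illustration.

    import os
    from google.cloud import storage

    # A customer-supplied encryption key must be exactly 32 bytes (AES-256).
    encryption_key = os.urandom(32)

    client = storage.Client()
    bucket = client.bucket("my-example-bucket")

    # Objects written with this key are encrypted with it; Google keeps only a
    # hash of the key, so reading the object back requires the same key again.
    blob = bucket.blob("reports/quarterly.txt", encryption_key=encryption_key)
    blob.upload_from_string("sensitive contents")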

The multiple layers of security, sophisticated redundancy, and custom hardware make Google’s data centers some of the most technologically superior environments of their kind on the planet.

Planetary Scale Networks

Another core part of Google’s data center infrastructure is connectivity. When you have so many servers, you need a way to have them talk to each other and to the rest of the world. This is also where Google really shines, and I would be hard pressed to find anyone else who even comes close in terms of a fast and highly connected network. Leveraging their own high-speed fiber cables, public fiber cables, and undersea fiber cables, they have built a fast and resilient network that allows users from all over the globe to access Google as quickly as possible. This statement by Google from the Open Networking Summit (2015) helps highlight how fast the Google network really is: “Our current generation network called Jupiter fabrics can deliver more than 1 petabit per second of total bandwidth. To put this in perspective, such capacity would be enough to read the entire scanned contents of the Library of Congress in less than 1/10th of a second.” The exact digital size of the Library of Congress is hard to pin down, but it is often estimated at tens of terabytes, so the ability to transfer that much data in one tenth of a second is a mind-boggling capability. It’s hard to argue that having access to this infrastructure is not a competitive advantage.

To build a network, most companies go out and buy traditional networking hardware such as routers and switches from a vendor and then tie it all together in a relatively static way. Although this works, it’s difficult to change, maintain, and keep secure. Google, on the other hand, has a highly dynamic network that adapts and changes to always stay connected and stay fast. To keep ahead of the curve, they’ve built their network in software, an approach popularly known as software-defined networking (SDN). Since Google’s core competence is building software, their SDN technology, called Andromeda, is years ahead of everything else. Built into their software is all the knowledge gained from running one of the world’s most advanced networks, and it’s custom built to work well with everything else they run.

The same network that powers all of Google’s internal infrastructure and applications also powers the Google Cloud Platform. Their investment in building up their network infrastructure is available to you when your application runs on the Google Cloud. Let’s imagine you did nothing more than just move your servers into Google’s data center. This is a hypothetical situation, as this would not be possible or even allowed in reality. But to help make my point, let’s say you did just that. There would instantly be a marked improvement in the speed of your application as your users see it. This is because of the billions of dollars Google has invested in their network infrastructure and in setting up dedicated connections all over the planet. A byte of data going through Google’s network from a server in the US to a phone in Singapore will arrive much faster than if that same byte traveled over the standard Internet. This is because of PoPs and peering agreements, which are technical-sounding terms for Google having invested in special deals with Internet providers all over the globe.

PoP stands for “point of presence.” It is a gateway near your users (in major cities across countries and continents), where their data leaves the standard Internet pathways and hops onto a high-speed freeway directly to your servers. When a user requests data from an application on the Google Cloud or running in their data centers, the first thing that happens is that the user’s location is worked out from their IP address. They are then assigned to a location on Google’s Edge Network that’s closest to them, which will provide them with the lowest latency. For a user in Singapore, it’s possible that this location would be in Singapore as well. The Edge Network receives the request for data and then forwards it to the nearest Google data center through Google’s own network and not over the standard Internet. The response from the server also travels over the same faster path back to the user’s mobile phone.
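
The actual routing machinery is far more sophisticated, but the core idea of picking the closest entry point can be sketched in a few lines of Python. The edge locations and coordinates below are made up for illustration and are not Google’s real list.

    from math import radians, sin, cos, asin, sqrt

    # Hypothetical edge locations with rough latitude/longitude coordinates.
    EDGE_LOCATIONS = {
        "singapore": (1.35, 103.82),
        "tokyo": (35.68, 139.69),
        "frankfurt": (50.11, 8.68),
        "iowa": (41.26, -95.86),
    }

    def distance_km(a, b):
        """Great-circle distance between two (lat, lon) points in kilometres."""
        lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
        h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(h))

    def nearest_edge(user_location):
        """Pick the edge location closest to the user."""
        return min(EDGE_LOCATIONS, key=lambda name: distance_km(user_location, EDGE_LOCATIONS[name]))

    print(nearest_edge((1.29, 103.85)))  # a user in Singapore -> "singapore"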

Sometimes the request doesn’t have to go all the way back to the data center. It might instead only travel to the nearest Google Global Cache (GGC) node. If over the course of a day ten thousand people living in Asia request to download a file from your website, these requests don’t have to be handled by your servers in one of the Google data centers. They can be served from a GGC node instead. After the first couple of download requests, the data for the file will be cached in GGC nodes across Asia, closer to your users, and will stream directly from there to the users’ devices.
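
If your files are served from Google Cloud Storage, you can hint to caches how long a copy may be kept by setting the object’s Cache-Control metadata. Here is a minimal sketch using the Python client library; the bucket and object names are again placeholders.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-example-bucket")
    blob = bucket.blob("downloads/app-installer.zip")

    # Allow shared caches to keep a copy of this object for up to one day.
    blob.cache_control = "public, max-age=86400"
    blob.patch()  # push the metadata change to the existing object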

If nothing else, the lesson here is that this setup costs millions of dollars to get right. It’s extremely hard and expensive for even large data center providers (and almost impossible for smaller companies) to work out individual peering agreements and have edge nodes with PoPs across the globe. Google continues to invest and expand these relationships and can afford to continue to do so in the future.

Computing Power at Scale

The Google platform is built in layers. We have been looking at it from the bottom up. We began by looking at the physical layers consisting of the data centers and the connectivity to the Internet, and we also looked at Google’s servers. We then started moving up through the software layers by looking at Google’s network. Now we will finally move higher up to my favorite part of the software layers: the managed services.

By now, I’m sure you have gained some insight into how large and vast Google’s infrastructure is. They have millions of servers all over the planet. Since they are all about using software to squeeze every last drop of efficiency, Google decided it’s highly inefficient to run just one application on one server, so they decided to do something about it. Enter Omega, Google’s application and job management system that runs hundreds of thousands of jobs from many thousands of different applications across hundreds of thousands of servers. Every second of every day since its inception, Google’s software has been working hard for its users, doing a wide range of things. For example, when you run a query in Google’s search engine, Bob is uploading his photos into Google Photos, Peter is sending an email, Alice is listening to Google Play Music, Tim is getting a Google Now notification, Jack is streaming a video on YouTube, and the list goes on. Multiply this by a few thousand more actions across a billion users, and you will get a pretty big number. Some of these jobs are small, while others are large. Some are instantaneous, while others take a while to get done. Some are always running, and others need to be split across thousands of servers to speed them up. Omega understands these jobs and their needs. It knows how to allocate Google server and network resources to keep things at peak performance. I can only imagine the mind-boggling complexity behind such a task, and from my experience of using Google, I can say it does a great job. The Omega system replaced the older software called Borg, which served Google for more than a decade. This behind-the-scenes software was responsible for a lot of the success of Google’s products. The popular Wired magazine said this in a 2013 article: “The software system is called Borg, and it’s one of the best-kept secrets of Google’s rapid evolution into the most dominant force on the web.” Until articles about the technology started to surface in Wired, very little was known about it outside of Google, even though it had been operating for close to a decade.
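
The real schedulers are vastly more sophisticated, but the basic idea of placing jobs onto machines that still have room for them can be sketched with a toy example. The jobs, machines, and numbers below are invented purely for illustration and have nothing to do with Google’s actual systems.

    # Free capacity of two made-up machines.
    machines = {"m1": {"cpu": 8, "ram_gb": 32}, "m2": {"cpu": 4, "ram_gb": 16}}

    # Jobs with the resources they ask for.
    jobs = [
        {"name": "mail-indexer", "cpu": 2, "ram_gb": 4},
        {"name": "video-transcode", "cpu": 4, "ram_gb": 8},
        {"name": "photo-thumbnails", "cpu": 3, "ram_gb": 6},
    ]

    def schedule(jobs, machines):
        """Assign each job to the first machine with enough spare CPU and RAM."""
        placements = {}
        for job in jobs:
            for name, free in machines.items():
                if free["cpu"] >= job["cpu"] and free["ram_gb"] >= job["ram_gb"]:
                    free["cpu"] -= job["cpu"]
                    free["ram_gb"] -= job["ram_gb"]
                    placements[job["name"]] = name
                    break
            else:
                placements[job["name"]] = "pending"  # wait for capacity to free up
        return placements

    print(schedule(jobs, machines))
    # {'mail-indexer': 'm1', 'video-transcode': 'm1', 'photo-thumbnails': 'm2'}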

So now you know that Google has lots of servers and the software to efficiently manage the applications spread across them. Each application, such as Gmail or YouTube, runs on thousands of servers, and since each application is quite specialized and unique, Google needs a standard way to deal with all of these applications. To do this, Google uses a technology called containers. All applications are packaged inside an easy-to-manage unit, or container. Everything that runs on their servers runs in its own container. This is great because these containers can be isolated, standardized, secured, and upgraded, and all of this can happen without Google having to know all the details of the application.
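
As a small taste of what working with containers feels like, here is a minimal sketch that runs a throwaway command inside an isolated container using the Docker SDK for Python. It assumes Docker and the “docker” Python package are installed; the image and command are just examples, not anything Google uses internally.

    import docker

    client = docker.from_env()

    # The same packaged image runs identically wherever a container runtime exists.
    output = client.containers.run(
        "python:3.10-slim",  # a public image, used here purely as an example
        ["python", "-c", "print('hello from an isolated container')"],
        remove=True,  # clean up the container once the command finishes
    )
    print(output.decode())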

You might be wondering why this matters to you in a managed cloud world, where you’re not supposed to care about how the cloud is built and can instead focus on just running your application on it. However, it’s great to know that your application runs on the Google Cloud inside its own safe and independent container. The Google Cloud doesn’t have to care about the specifics of your application. It only needs to manage the container that wraps around your application. If the Google Cloud thinks you might need ten thousand instances to handle more users visiting your site, then ten thousand containers will be brought to life, and the traffic load will be balanced between them automatically. Google runs a vast array of software on top of these containers. This is the software that makes up what’s called the Google Cloud. It’s the software that runs the databases, file storage, web servers, etc. This is referred to as infrastructure software. It’s the stuff you use to build your applications. It can be split into two areas: 1) data management, and 2) code management.

One way to think about Google is to imagine your laptop with its hard disk drive, memory, and CPU. Now imagine these parts of your laptop being distributed across hundreds of thousands of computers globally. Google’s globally distributed file system is called Colossus. Every bit of data stored on it is split into 1-megabyte chunks and stored on multiple servers. Multiple copies of each chunk are stored on still more servers, so you will never lose your data, as there will always be copies of it. Also, you’ll never run out of disk space, as Google is always adding more servers. Another advantage of splitting data into chunks and spreading it across many servers is read/write performance. When many users read from and/or write to the same data, a single server would easily get bogged down, but data on Colossus is distributed across many machines and scales with the needs of its users.
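
The idea of chunking and replication is easier to see with a toy sketch. The code below is not Colossus; it just illustrates splitting data into fixed-size chunks and assigning each chunk to several made-up servers.

    CHUNK_SIZE = 1024 * 1024  # 1-megabyte chunks, as described above
    REPLICAS = 3              # keep three copies of every chunk

    servers = ["server-%d" % i for i in range(10)]

    def place_chunks(data):
        """Return a mapping of chunk index -> the servers holding its copies."""
        placement = {}
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk_id = offset // CHUNK_SIZE
            # Spread the replicas across different servers (round-robin here).
            placement[chunk_id] = [
                servers[(chunk_id + r) % len(servers)] for r in range(REPLICAS)
            ]
        return placement

    print(place_chunks(b"x" * (3 * CHUNK_SIZE + 100)))
    # chunk 0 -> ['server-0', 'server-1', 'server-2'], chunk 1 -> ..., and so on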

In addition to having a globally distributed file system, wouldn’t it be nice to have a globally distributed database? A database is great for storing structured data that you can quickly search. For example, Google can access its database to quickly look up a name based on an email address, pull up a list of all of your contacts, or find all the Google Cloud users who work for a specific company. And unlike that database server your IT guys set up in the back room, Google can handle billions of database queries, can quickly serve up petabytes of data, and won’t ever lose the data. Now add a global audience, and all of this activity happens 24/7/365 without skipping a beat. Performance like this is mind-boggling, but Google accomplishes it with Spanner, a database that took them more than four years to develop. Spanner is often considered the solution to one of the hardest problems Google has ever tackled. Google’s own advertising systems (their primary revenue source) use Spanner to deliver ads to their worldwide audience within milliseconds.
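
Querying Spanner from code looks much like querying any other SQL database. Here is a minimal sketch using the Cloud Spanner Python client library, assuming you have access to a Spanner instance; the instance, database, table, and column names are placeholders of my own.

    from google.cloud import spanner

    client = spanner.Client()
    instance = client.instance("my-instance")
    database = instance.database("my-database")

    # Reads can be served from a replica near the caller while remaining
    # strongly consistent across the globe.
    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(
            "SELECT name FROM Users WHERE email = @email",
            params={"email": "alice@example.com"},
            param_types={"email": spanner.param_types.STRING},
        )
        for row in results:
            print(row[0])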
