Chapter 7

Building on the Cloud


To Catch a Pokémon

Late one evening, I was walking the downtown streets of the picturesque little California town where I lived. Normally, at that time of day, there were rarely any people around, and I would take the time to enjoy the mountain air and listen to podcasts. That evening, however, something was very different. There were hundreds of people around, some of them in the middle of the street, frantically running about while staring at their mobile phones. Suddenly, someone shouted, “I found a Mewtwo down by the gym!” and people ran screaming to the park down the street.

Fortunately, my town wasn’t under attack by killer zombies. It was Pokémon GO, the most popular mobile game ever launched.

Niantic, the company behind Pokémon GO, chose to build their game on the Google Cloud, and it might have been the best decision they made; it probably helped the game become a global sensation. This wasn’t the company’s first product. They had previously launched Ingress and Field Trip, each with varying degrees of success, but nothing close to the success of Pokémon GO.

It’s hard to predict if and when something will go viral, and when it does, that bubble can pop fast if your technology doesn’t scale. Users won’t suffer a bad experience for very long; when the next thing comes along, they’ll jump ship. Niantic and the Google Cloud Platform team blogged about their experience (or lack thereof) launching such a massive hit and how the cloud supported it. As part of planning the launch, Niantic reached out to the Google Customer Reliability Engineering (CRE) team. When capacity planning, their worst-case scenario was 5X, or five times, the expected number of users, a standard practice when planning any software launch. For example, if they expected five thousand users, the worst-case plan would be twenty-five thousand. They would have tested for these numbers and set up their cloud to handle that range of users and activity.

On launch day, however, reality was a little north of those numbers. I can only imagine their panic as they watched the game’s usage cross the 5X mark and rocket past 10X, 20X, 40X, and finally 50X. Continuing the earlier example, a plan for five thousand users with a worst case of twenty-five thousand would have seen a quarter of a million show up for the party. Since their initial numbers aren’t public, we can only guess; for all we know, they expected a million users to begin with, only to have 50 million show up. These are massive numbers, and if it weren’t for the cloud, it would have been a time of great suffering and little joy.

Leveraging the cloud gave Niantic two advantages. First, they could effortlessly scale up by provisioning more computing power; with the help of the Google Cloud’s new load-balancing technology, they were able to handle the 50X user growth they experienced. Second, Google’s CRE team worked closely with Niantic, providing additional technical expertise to make their code run more efficiently on the cloud and to fix any issues they experienced. Over a few days, the company was able to roll out Pokémon GO across the world, from Japan to the US, while actively deploying new software updates without affecting existing players.

On the technical side, Niantic built the game using Docker and Kubernetes, both free, open-source technologies. This means it would be relatively easy for them to migrate to another cloud provider if need be, avoiding vendor lock-in. Fortunately, Google Container Engine (GKE) worked well enough to keep them on the Google Cloud. Smoother player experiences over Google’s fast global network were an added advantage they got for free.

Spotify Moves into the Neighborhood

One morning in early February 2016, as I settled into my car to drive to work, I did what I normally do: launch the Spotify app to find a playlist for my drive. Highway 85 from Cupertino to Mountain View, California, sees heavy traffic during the morning commute, and music makes the drive a little more enjoyable. That morning, however, something had changed. Spotify had announced that they were moving their entire backend software platform to the Google Cloud. The music sounded the same, but it was now streaming to my phone from the cloud. This was a big deal.

I knew from previous blog posts that Spotify has a large and complex backend software stack, with many users around the world depending on it for their music. Streaming music is much more complex than serving webpages, and the demands it places on the platform are enormous: fast, low-latency connections for streaming, transcoding between various audio formats, and, given Spotify’s popularity, a huge volume of data transfers. With that in mind, moving from their own servers to the Google Cloud, and then going all out to leverage managed services, was clearly a major decision for the company.

As a Spotify Labs blog post (March 15, 2013) put it, “The Spotify backend infrastructure is built up of several layers of hardware and software, ranging from physical machines to messaging and storage solutions.” They didn’t begin on the cloud. They had their own servers and their own teams handling everything: managing every database, every application, every security issue, and everything else that goes with it. Moving to the Google Cloud was a decision that could have had serious consequences for their business. On top of that, they operate in a highly competitive space where even Google is a competitor, with its Google Play Music service. So why did they make that decision? By understanding how Spotify leverages the Google Cloud, we will see why they now have the advantage and can focus on staying competitive.

To start, let’s look at Spotify’s scale. According to a blog post about the move, Spotify has more than 75 million users, 30 million songs, and 2 billion playlists, and they provide personalized music recommendations for all of those users. They have an engineering team of more than three hundred people spread across multiple cities and time zones, and they recently added podcasts and videos to their content offerings. Their biggest competitors are Apple, Amazon, Google, Slacker, and Pandora.

At the heart of Spotify’s decision to move to the Google Cloud are their customers. Providing them with a better experience and a better product comes before everything else. For Spotify customers, a better product is one that can instantly play their song of choice and help them discover more music. How quickly a song streams from Spotify’s servers to your phone depends on many things: how fast the software can decode and stream the song, how well the backend scales to handle hundreds of thousands of song requests, how fast the database is at locating the song, how close the server infrastructure is to you, how good your Internet connection is, and so on. The other piece of the puzzle is personalization, or how Spotify can use machine learning to help you discover music you will enjoy. This means dealing with lots of data, which we know is something the Google Cloud does well.

Also, since Spotify now runs within Google’s data centers, the songs you request flow over Google’s internal planetary-scale network rather than the open Internet. If, over the course of a day, thousands of people in Europe request the same song, those requests don’t all have to be handled by Spotify’s song-streaming servers back in one of the data centers. Google caches the song on a node closer to those users and streams it directly from there to their phones, making the song quicker to play while adding no load to the song servers, thereby saving resources.

Spotify announced that they were approaching their move to the Google Cloud in two separate streams, one focused on their data and another focused on their applications. As you now know, several major managed services make up the application side of the Google Cloud, including Compute Engine, Container Engine, and App Engine. Compute Engine is essentially a virtual server. You are free to put any kind of operating system and software on it, from Windows to Linux; unlike most managed services, you don’t have to fit your problem into their box. It’s really the easiest option when migrating existing applications to the cloud. I would think this is the option Spotify went with when they moved their applications, as they wouldn’t have to change a line of code and would still get a lot of value from running on the new infrastructure.
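To make the idea concrete, here is a minimal sketch, not Spotify’s actual setup, of how a Compute Engine virtual machine can be provisioned programmatically with Google’s Python API client. The project, zone, and instance names are placeholders, and the machine type and boot image are ordinary choices rather than anything Spotify used.

```python
# Sketch: provisioning a Compute Engine VM with the Google API Python client.
# Project, zone, and instance names below are placeholders.
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

PROJECT = 'my-project'        # hypothetical project ID
ZONE = 'us-central1-a'
NAME = 'streaming-backend-1'  # hypothetical instance name

config = {
    'name': NAME,
    # Machine type and boot image are ordinary choices, not Spotify's.
    'machineType': 'zones/%s/machineTypes/n1-standard-4' % ZONE,
    'disks': [{
        'boot': True,
        'autoDelete': True,
        'initializeParams': {
            'sourceImage': 'projects/debian-cloud/global/images/family/debian-11',
        },
    }],
    'networkInterfaces': [{
        'network': 'global/networks/default',
        'accessConfigs': [{'type': 'ONE_TO_ONE_NAT', 'name': 'External NAT'}],
    }],
}

# Returns a long-running operation; the VM is typically up within seconds.
operation = compute.instances().insert(project=PROJECT, zone=ZONE, body=config).execute()
print(operation['name'])
```

The specific call matters less than what it replaces: a purchase order, a shipment, and a rack in a data center become a single API request.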

You might be wondering why I’m trying to sell you on a basic, well-understood concept such as virtual servers, or why I claim they’re different. The key is in what’s managed for you and how that can save your business valuable resources. According to a Spotify blog post, they originally ran on approximately twelve thousand private physical servers and believed in managing everything themselves. They had to maintain and manage all of these servers and the applications running on them, and the only people who could do this were their engineers. So in addition to working on new features and adding real value to their product, they spent valuable time operating their servers just to keep everything running. Engineering time is often a company’s most expensive resource; subtract the hours spent babysitting servers from the hours that could have gone into the core product, and you’re looking at numbers that can massively affect the bottom line.

There are people who will disagree with me. They might believe that all of this work babysitting production servers is simply a part of software engineering. They might even point out several successful companies and startups that do things this way. This way of thinking is more common than it should be. The reality I learned, talking to hundreds of software engineers, was that a lot of them were looking to leave places where they were burdened with exactly this kind of work. I’ve heard countless tales of smart, capable engineers living in fear of having to wake up at odd hours to deal with a disk that went bad or having to add more machines to that one application that couldn’t handle the load. The stress was clearly getting to some of them and interfering with their real job: writing code.

Smarter companies such as LinkedIn had a separate operations organization (consisting of site reliability engineers, or SREs) to focus on that part of running a web service, but clearly not everyone can afford that. Why make people spin a wheel by hand when a motor does a pretty good job? Those same people could focus on building a better car.

Back in 2012, the engineering team at Spotify was busy managing all of these servers and dealing with every little issue that arose. I know from experience how these issues have a tendency to show up at 4 a.m. As their business expanded, these manual interventions got harder and harder, and the engineering teams decided to build a whole new set of tools to manage the workload. They tied these tools into their project management software to help orchestrate all of it. Although it was a little better, it was all still prone to error and often needed human intervention. In one mind-boggling example, before moving to the Google Cloud, when Spotify needed more capacity such as new servers, it could take weeks or even months to get it all together. It took more than two years for their engineers to stabilize and mature their tools to the point where they could do things such as provision new servers in a few hours and not worry about every little change they had to make to their networks. In their own words, it was 2014 when they finally had some breathing room.

I want to point out that the Spotify engineering team had to be very strong to be able to successfully build everything they did, including their product and their infrastructure. Since cloud platforms weren’t really available back then, they couldn’t exactly wait for one to be invented.

Once the Spotify team moved their applications to Google Compute Engine, what did they get? According to benchmarking done by Sebastian Stadil, founder of Scalr (“By the numbers: How Google Compute Engine stacks up to Amazon EC2,” published on Gigaom and GitHub), performance on Compute Engine handily beats that of most other providers; it takes less than thirty seconds to launch a new virtual machine. Imagine how fast Spotify can bring hundreds of new servers to life to handle the burst in music streaming during the daily morning commute, and then, once everyone is busy at work, scale capacity back down to a minimum. This kind of elasticity will do amazing things for Spotify. Their customers get the best experience all the time, and the company saves money and resources. They no longer have to plan far in advance or keep extra hardware sitting around idle. They’ve gone from waiting months for a new server in 2012 to being able to launch hundreds or thousands of servers quickly and drop them the instant they’re no longer needed.

Just as physical servers often need maintenance, such as replacing faulty hardware or upgrading software, virtual servers need maintenance too. This may include addressing a hardware failure on the physical machine the virtual servers run on or dealing with a security issue that requires a software update. Maintenance often requires the server to be shut down, but what about the application running on it? And it may not be just one server; it could be hundreds of virtual servers running various applications. Normally, your application would be forced to shut down in the middle of its current task. Even a critical task, such as accepting a payment from a customer, could suddenly be terminated mid-transaction.

This sounds bad, and on most cloud platforms it is: the platform doesn’t know what your application is doing, and the application can be terminated with little or no warning. Not on Google Compute Engine, however. When virtual servers are due for maintenance, Compute Engine live migrates the applications running on them (moves them, still running, as they are) to other servers without the applications even knowing what happened. At the Google I/O demo of this technology, an application encoding and streaming video was live migrated to another server without the video skipping a beat. Even to the most seasoned technologist experienced with the cloud, this is close to magic.
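This behavior is also visible in an instance’s configuration. Compute Engine exposes a scheduling policy that controls whether a VM is live migrated or terminated during host maintenance; the sketch below, using the same Python API client and placeholder project and instance names, shows how that policy can be read and set.

```python
# Sketch: checking and setting a VM's maintenance behavior on Compute Engine.
# 'MIGRATE' asks Google to live migrate the VM during host maintenance;
# 'TERMINATE' lets it be stopped instead. Names below are placeholders.
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')
PROJECT, ZONE, NAME = 'my-project', 'us-central1-a', 'streaming-backend-1'

instance = compute.instances().get(project=PROJECT, zone=ZONE, instance=NAME).execute()
print(instance['scheduling'].get('onHostMaintenance'))  # typically 'MIGRATE' by default

# Explicitly request live migration (and automatic restart after failures).
compute.instances().setScheduling(
    project=PROJECT, zone=ZONE, instance=NAME,
    body={'onHostMaintenance': 'MIGRATE', 'automaticRestart': True},
).execute()
```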

Google constantly upgrades the software behind their virtual servers with security patches and bug fixes, and live migration lets them do this as often as they want, making their servers more secure and stable. I can only imagine how all of this managed cloud goodness came into play when Justin Bieber broke Spotify’s one-week streaming record during the release week of his 2015 album, Purpose, which hit 205 million streams across the globe.

Building Snapchat

It’s hard for a lot of grown-ups to understand just how popular Snapchat is. Millions of teens all over the world spend a large part of their day glued to it, and that number is growing. Snapchat focuses on being a more natural platform for communication. Content posted on it disappears after a few seconds, making teens feel more comfortable being themselves on the app. Just as in real life, where your friends don’t log and archive everything you tell them, on Snapchat what you share is erased soon after your friends see it. Another unique thing about Snapchat is that it started as a mobile app and remains available only as a mobile app.

At a recent Google event, Bobby Murphy, co-founder and CTO of Snapchat, spoke about how they built such a global app. The founders knew that once their social app went viral, they would need to scale up quickly, and running out and hiring a large team is neither quick nor efficient. Their ability to stay nimble and scale up fast was critical to their success.

First impressions mean a lot, and if new users can’t log in or share content instantly, they may leave, never to return. Building an engaging consumer app experience is hard enough. Add to that a tech-happy and fickle teen audience, and things get that much harder. The last thing you want is to be hurt by your own success. As the popularity of Snapchat spread like wildfire from schools to universities, with thousands of new users signing on hourly, the load on the Snapchat servers grew exponentially. In this case, people were not just sharing links or small snippets of text; they were sharing large image and video files, streaming them directly from their mobile phones to their friends’ phones.

Bobby Murphy, who graduated from Stanford University, had previously used the Google App Engine for a few small projects and had found that it was a really easy platform to build and launch on. He knew that, as a small team, they had to focus on their application and that they just couldn’t afford to deal with all the work it takes to manage a web platform. They were like me with my startup Socialwok: early adopters of the App Engine, back when it was in its early beta phase. And like mine, their confidence in the early technology was based on Google’s reputation for world-class web infrastructure.
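For context, an App Engine application of that era could be little more than a request handler and a one-page configuration file. The sketch below is a generic Python handler in that style, not Snapchat’s code; App Engine creates and destroys instances of it automatically as traffic rises and falls.

```python
# Minimal App Engine (standard environment, Python 2.7-era) request handler.
# Generic sketch, not Snapchat's code.
import webapp2


class MainHandler(webapp2.RequestHandler):
    def get(self):
        # App Engine spins instances of this app up and down automatically
        # as traffic rises and falls; the code never mentions servers.
        self.response.write('ok')


app = webapp2.WSGIApplication([('/', MainHandler)])
```

Deploying an app like this took a single command, and nothing in the code has to change when it goes from one user to millions.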

To understand Snapchat’s scale: according to 2014 estimates, more than 760 million photos and videos (thousands every second) were shared on the platform daily. On the Google I/O stage, Bobby mentioned that they were a team of about twenty-five engineers with zero operations people (SREs), all of them focused solely on product development. This is so unusual that one of their new engineering VPs was surprised to find none of the operations headaches he was used to, even at their scale and rate of business growth.

Gaming in the Cloud

Game developers are a demanding lot. Successful multiplayer games can impose considerable pressure on backend infrastructure, and game development is a very competitive space, so games have to constantly evolve and change to keep players hooked. This makes gaming a great test of the scalability and agility of a platform such as the Google Cloud, so it’s a good thing that a lot of developers have placed their bets on it and are seeing considerable success and competitive advantages.

Rovio, FreshPlanet, and Pocket Gems are a few of the game developers that love the platform because it allows them to focus on building great, engaging games while everything else is managed for them. Rovio, the maker of Angry Birds, has over 250 million users. FreshPlanet, the maker of the music trivia game SongPop, has over 100 million users. Both of these developers really appreciate the instant scalability of the Google App Engine platform, as most games they launch quickly grow from zero to millions of users.

Almost all of these games make use of the Google Cloud Datastore, an automatically scaling data storage service, because games have a lot of information to store and update, such as player scores, locations, and levels, and with tens of millions of players, that’s a lot of data by any standard. In addition to generating and storing all this data, the product teams behind these games also want to be able to mine it quickly. Insights derived from the data provide ideas for new games and help evolve the strategy and gameplay of existing ones. Non-developers within these companies often use Google BigQuery to dig deeper into the data for answers.
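To give a feel for what this looks like in practice, here is a minimal sketch of how a game running on App Engine might store and query player scores in the Cloud Datastore using the ndb library. The model and property names are invented for illustration and are not any of these studios’ actual schemas.

```python
# Sketch: storing game state in Cloud Datastore via App Engine's ndb library.
# Model and property names are illustrative, not any studio's real schema.
from google.appengine.ext import ndb


class PlayerScore(ndb.Model):
    player_id = ndb.StringProperty(required=True)
    level = ndb.IntegerProperty(default=1)
    score = ndb.IntegerProperty(default=0)
    location = ndb.GeoPtProperty()              # optional player location
    updated = ndb.DateTimeProperty(auto_now=True)


def record_score(player_id, level, score):
    # Datastore scales writes automatically; there is no capacity to provision.
    PlayerScore(player_id=player_id, level=level, score=score).put()


def top_scores(level, limit=10):
    # An indexed query over the level, ordered by score.
    return PlayerScore.query(PlayerScore.level == level) \
                      .order(-PlayerScore.score) \
                      .fetch(limit)
```

Notice there is nothing in the code about capacity: the same query runs whether there are a thousand players or fifty million.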

A great example of what’s possible was the Humble Indie Bundle (2010), a unique promotion launched by the digital video game seller Humble Bundle, where for twelve days you could purchase a bundle of fantastic indie games from various publishers for whatever you wanted to pay. The site ended up getting 3.4 million views over the twelve days and grossed more than $1.2 million. That’s a lot of traffic and a lot of revenue. The Humble Bundle team couldn’t risk any downtime or security issues during the short window their promotion ran, so they chose the Google App Engine. The service delivered as promised, and in the end they paid a grand total of $71 for access to a computing infrastructure that surpassed what even most large web retailers have access to. Humble Bundle has continued to use the Google App Engine as the infrastructure behind their product, and they’ve gone on to launch many more successful bundles.

Pokémon GO saw the fastest viral growth of any game or app in the history of the web, going from zero to more popular than Twitter in five days. No amount of planning could have accounted for that trajectory. The scalability of the platform was largely responsible for the game’s success, because keeping your app consistently available to play is probably the best thing you can do to sustain viral growth.

Disrupting Payments in Africa

Nothing does a better job of showing how ubiquitous the cloud is today than this next story. According to a blog post by Google, half a billion people in Africa live on less than US$1 per day and lack access to reliable electricity and telecommunications. Nomanini, a Cape Town–based startup, aims to facilitate cash payments in this market and fix some of these issues. To serve its target market, the company needed to scale quickly, and with such a large market and the sensitivity of dealing with payments, they needed a solution that was both secure and efficient. They built specialized, localized point-of-sale units: hardware that their army of employees could carry into far-flung areas to sell prepaid airtime and electricity cheaply.

As luck would have it, their CTO had previous experience with the Google App Engine, and given the serious set of requirements that could make or break the company, they chose to go with it. The company operates on razor-thin margins, so availability is critically important; system downtime would cost them and their partners real money. The company is expanding quickly and has more than a thousand terminals across South Africa and a few other countries. Each of these terminals transmits sales data back to the cloud, where it’s processed and analyzed. With a small team, they have to ensure a high level of quality for their backend systems and for the software running inside their hardware devices. To make this happen, the company relies on continuous deployment, and the Google Cloud infrastructure helps with this. They push code to production more than ten times a day, so new features and bug fixes are continuously rolling out to their customers. Even new software updates for the point-of-sale hardware are handled through the cloud: when a new update is available on Google Cloud Storage, the devices are notified, download it, and self-update.
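As a rough illustration of that last step, and not Nomanini’s actual code, a device-side updater might do little more than compare the version it is running against the newest firmware object in a Cloud Storage bucket. The bucket, object, and metadata names below are invented.

```python
# Sketch: a device-side self-update check against Google Cloud Storage.
# Bucket, object, and version names are invented for illustration.
from google.cloud import storage

BUCKET = 'nomanini-firmware-example'   # hypothetical bucket name
LATEST = 'terminal/firmware-latest.bin'


def check_for_update(current_version):
    client = storage.Client()
    blob = client.bucket(BUCKET).get_blob(LATEST)
    if blob is None:
        return None
    # A version label can be attached to the object as custom metadata.
    new_version = (blob.metadata or {}).get('version')
    if new_version and new_version != current_version:
        blob.download_to_filename('/tmp/firmware-update.bin')
        return new_version
    return None
```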

In a blog post about their work, Nomanini’s CTO estimated that the efficiency gains of using the Google App Engine are like having a whole extra person on a team of six, not to mention the productivity boost and developer happiness, whose value is hard to quantify. The other managed services Nomanini uses include Google BigQuery and the Google Cloud Prediction API, which help them manage and analyze the large amount of data they collect and support their sales forecasting. The impact they are making on people’s lives and the exciting cloud-based software they are building are clearly impressive and will help them attract talent from all over who want to be part of their mission.
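As a hedged example of what that analysis layer can look like, here is a minimal query using BigQuery’s Python client. The project, dataset, table, and column names are placeholders, not Nomanini’s schema.

```python
# Sketch: summarizing a month of sales data with BigQuery's Python client.
# Dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
    SELECT terminal_id, SUM(amount) AS total_sales
    FROM `my-project.sales.transactions`   -- hypothetical table
    WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY terminal_id
    ORDER BY total_sales DESC
    LIMIT 20
"""

# BigQuery handles the scanning and aggregation; the client just reads rows.
for row in client.query(QUERY).result():
    print(row.terminal_id, row.total_sales)
```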

Evernote Gives Up Its Data Center

When I started working on this book, I was doing a lot of research online, and the best tool I found for that was Evernote. When I came across a paper or an article that I needed to save for future reference, I would use the Evernote Web Clipper to save a copy. I’ve been a fan of the product for a while. It has a simple yet focused function, and they’ve done a great job improving it.

In speaking with people about Evernote over the years, we’ve always agreed that this was a product that clearly needed to be built on the cloud. This was not an option that was available to Evernote’s developers when they started the app, so it’s understandable that they had to host their own servers. But today all of this has changed, and Evernote is dropping their own data center in favor of the Google Cloud. They announced that they would be giving up a lot of their own technology, storage infrastructure, and servers in their move to the Google Cloud Platform and the various managed services it provides.

Like Spotify, Evernote has many active users. They have to manage 200 million users and their petabytes of data. They will be migrating more than 5 billion of their users’ notes from their current infrastructure to the Google Cloud Platform. I assume that most of these users don’t care what cloud Evernote uses in the backend as long as their data is safe and the product functions as expected. Evernote doesn’t have the luxury of starting from scratch or building a new product on the cloud. They have to migrate to the platform and do it without affecting users.

So what does Evernote get in return for this migration? For one, the capability to compete with much larger players such as Microsoft OneNote. By standing on the shoulders of Google, Evernote becomes a much more secure product, with automatic encryption of their users’ data. It also gains access to a superior network that makes the product faster, and to machine learning and AI services such as text and image recognition, which will only improve the product further. And since the Google Cloud Platform is already certified for various regulated use cases (including government and financial), the move may help Evernote reach new customers that require those certifications.

All things considered, the company will see big savings in labor and other resources that they can put to work adding features and improving their core product. Those savings may also mean they can hire more people to focus on building new functionality or entirely new products. These changes could provide a return on investment far beyond anything their private data center ever could.

It’s often hard for the leadership of an organization to comprehend how such a technical decision about moving to the cloud can change the trajectory of their business. There is also the fear of giving up control. Evernote, like Spotify or the thousands of other companies that have chosen the cloud, is showing others how these decisions about the technical foundation of their product really matter and how choosing the cloud is the right answer.


