Chapter 9

Data and the Cloud


What if I were to tell you that I’ll be sending more than 2 million messages your way? That’s 2 million every second. I’ll also need you to deal with every one of these messages: derive some insights from them and store them safely. At that rate, you’ll be handling more than 170 billion messages (little blocks of data) per day, every day.

I can tell you right away that this is a very difficult problem. In today’s data-driven world, however, having to deal with this much data is not uncommon. You might be working on an Internet of Things (IoT) device that records temperature readings from the soil to send back for analysis. Maybe you need to combine these readings with photos a drone takes of a field of crops so you can decide which parts need more water. Maybe you have a home security product that records video and other information every time there’s a disturbance. It will need the cloud to keep records, store the video, and analyze it to see whether there’s an intruder. It could even be that you and a couple of friends are planning to build an amazing mobile game where players can battle each other’s dragons in a massive multiplayer world.

Any of the above ideas will have you dealing with a big data problem. This is when you’re dealing with so much data that it won’t fit on one computer. Even if you did find a disk big enough, it would take forever to process that data, which doesn’t help. The answer is pretty obvious: Get more computers, take your data, break it into chunks, and put it on as many computers as you can. With big data, however, this could mean hundreds of computers, if not more, and it’s not an easy feat. Splitting the data into chunks is hard. Finding the data across all of those computers is hard. Managing hundreds of computers is hard. And keep in mind that you’ll need copies of all this data (backups), so you’ll also need more disk space (i.e., more computers). And if this data is really important to your business, then you might want to keep copies of it in different locations, and you might need to encrypt it to keep it safe or set up rules around who can see what.

Not all data is the same. Some data is simply more important to your business and should be kept close at hand. You might need to hold onto other data for just that day, or perhaps five years in case the government comes asking for it. The data you need daily is like the documents you keep on your desk for quick lookup, and the other data is the files you put away in a cheap warehouse until the day they’re needed. Sometimes you need to handle documents as they come in, while in other cases you need to keep neat records so you can get your answers faster. The data your technology products deal with is very similar to these real-world examples.

It doesn’t end there. What’s the point of taking all this trouble if you can’t make use of the data? So now you need to build software to find what you’re looking for in all the data across all of those computers. The software would also need to be smart so it can deal with computers failing, data getting corrupted, and who knows what else. In the technology world, this is referred to as “distributed data infrastructure,” and no one but the largest companies even tries to roll it out on their own.

At this point in the book, you probably already know what I’m about to say, and you’d be correct: leverage the managed cloud instead. The Google Cloud Platform has amazing capabilities to handle all of your data needs, and learning more about them will help you make solid decisions about your organization’s data.

Solving Spotify’s Big Data Problems

The example I used above of having to deal with more than 2 million messages per second is actually a problem that Spotify faced, and it’s a number that has probably grown many times over by the time you read this book. This is a lot of data flowing in at a very high rate, and handling it efficiently is no small feat. Spotify certainly struggled with this issue before moving to the Google Cloud. When their customers search for music, create a new playlist, or request a song, they generate data that is sent to Spotify from all over the globe. Spotify initially built their own system. Although it was impressive, it required significant resources to create and maintain. As their music streaming service gained popularity, their engineers noticed that certain events were growing significantly in number, which was putting stress on their system.

Spotify’s core value is in creating a great music experience for their customers, not building data systems, so every resource committed to fixing this data issue was a resource they couldn’t apply to their core product. Things had gotten so bad that they were finding it increasingly difficult to maintain their own homegrown solution. Even small changes were causing major outages. In short, Spotify had big data problems.

At the time, some of the members of the Spotify team were exploring other options, one of which was the Google Cloud, specifically Google Cloud Pub/Sub, which is a managed service for streaming data. The team became very interested in this solution when they realized that it did everything they needed and more. It was available globally, easy to use, and fully managed by Google. It sounded so good that their first thought was that it was too good to be true. So they ran a test. They blasted Pub/Sub with 2 million messages per second for several hours, and the results were astonishing. Even with that amount of load, everything ran smoothly with almost no errors. The team was instantly sold, and they moved from their own system onto Pub/Sub for all of their data streaming needs.
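
To get a feel for what Pub/Sub looks like from a developer’s point of view, here is a minimal sketch of publishing a single message with Google’s Python client library. The project and topic names are made up for illustration, and Spotify’s real pipeline is of course far more elaborate.

    # A minimal sketch of publishing an event to Google Cloud Pub/Sub.
    # The project and topic names below are placeholders, not Spotify's.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "listening-events")

    # Each message is a small block of data (bytes), optionally with attributes.
    event = b'{"user": "123", "action": "play", "track": "abc"}'
    future = publisher.publish(topic_path, event, source="mobile-app")

    # publish() is asynchronous; result() blocks until Pub/Sub acknowledges the message.
    print("Published message ID:", future.result())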

The team also needed to process the data, not just stream it. Having already had a good experience with the Google Cloud, they set out to find a managed service that could do this for them, and they found Google Cloud Dataflow and BigQuery. These services are designed to work well with data streaming in from Pub/Sub, and they can do all kinds of data processing on it at scale. One of the more interesting reasons Spotify has to crunch its data is the music charts. Billboard’s charts rank the most popular music across the world in various lists, including the Billboard Hot 100 and the Billboard 200, and Spotify’s usage data is one of the signals that feed into them.
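
Dataflow pipelines are written with the Apache Beam SDK, so a streaming job that picks events up from Pub/Sub and lands them in BigQuery can be sketched roughly like this. The subscription and table names are invented for illustration, and a real pipeline would do far more transformation along the way.

    # A rough Apache Beam sketch of a streaming Dataflow pipeline:
    # read messages from Pub/Sub, parse them, and append rows to BigQuery.
    # The subscription and table names are placeholders.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/play-events-sub")
            | "ParseJson" >> beam.Map(json.loads)
            | "WriteRows" >> beam.io.WriteToBigQuery(
                "my-project:analytics.play_events",
                # Assumes the destination table already exists with a matching schema.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )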

Increasing Revenue with Data

Bangalore-based Redbus is a startup focused on helping people buy bus tickets across India. You can think of them as Expedia for bus tickets. India has a huge population, and many prefer to travel by bus. You can pretty much get anywhere on a bus. As Redbus began to see usage grow, their data needs also began to grow. The team needed a way to analyze hundreds of thousands of bus bookings as well as inventory data for their more than ten thousand bus routes, and it all needed to be fast.

Like most startups, their first response was to go out and build this system themselves using open-source platforms like Hadoop. And like most startups that naively attempt to do this, they soon found the problem to be beyond their own capabilities. Redbus was smart enough to be realistic about their limits. Rather than sink their resources into building the solution themselves, they decided to use Google BigQuery and pushed all of their data about customer searches, bookings, and inventory into it. Their engineers used BigQuery to produce answers in a few seconds, whereas anything they could have built would have taken hours. The cost of using this managed service was also 20 percent of what it would have cost them to build and maintain a system themselves.

Being able to pull up deep business insights in seconds did wonders for their business. For example, in just a few seconds, they could find where customer demand for seats was not being met and could quickly move to add more seats to those routes. This allowed them to generate revenue in situations that would otherwise have disappointed customers and lost the company money. It also helped that Google BigQuery is accessible to everyone in the organization, including executives, product managers, and marketing teams. The service has a user-friendly graphical interface where anyone who knows SQL, the popular query language, can use simple statements to get insights from billions of rows of data in mere seconds.
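
As a sketch of how simple that can be, the query below is the sort of thing an analyst might run, either in the web interface or through the Python client library. The project, dataset, and column names are invented; I’m only assuming a table that records each search and whether it led to a booking.

    # A sketch of a simple BigQuery query: find routes where searches far
    # outstrip bookings, i.e. demand that is not being met.
    # The project, dataset, table, and column names are all hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT route, COUNT(*) AS searches, SUM(booked) AS bookings
        FROM `my-project.travel.searches`
        WHERE search_date = CURRENT_DATE()
        GROUP BY route
        HAVING COUNT(*) > 10 * SUM(booked)
        ORDER BY searches DESC
        LIMIT 20
    """

    for row in client.query(query).result():
        print(row.route, row.searches, row.bookings)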

Safari Books Online is a popular online library of more than thirty thousand technology and business books and videos. Individuals and businesses buy subscriptions on their site to get access to this library. Every day, Safari’s large subscriber base generates thousands of searches and other usage data on the site. Their goal was to pull insights and trends out of this usage data, which would be valuable to the business and help drive more revenue.

They began by building their own pipeline to get this data into a MySQL database that they hosted themselves. However, they soon saw the error of their ways. Even simple queries were slow, and with more data coming in every day, things just got slower. Anyone could see that the trend wasn’t positive and that things weren’t going to get any better. The team had heard about Google BigQuery at Google I/O, Google’s developer conference, and they decided to try it out.

Getting their data into the service was relatively easy using the tools provided. The biggest issue was that they had to transform the data into a format that they could easily upload into BigQuery. That was pretty much it. Once their data was in the Google Cloud, they could query it any way they pleased, and they would have their results in a few seconds. Since this was so easy, they stopped worrying and embraced the power of the platform by getting more ambitious. Soon they were uploading all of their web server logs into the system to get insights into how many preview books people were reading on each account. These insights were soon flowing into the CRM system to be used by the sales teams to qualify leads. Increasing revenue while decreasing costs—what more could a business ask for?
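
A minimal sketch of that loading step, using the BigQuery Python client, might look like the following. The file, dataset, and table names are placeholders; I’m assuming the logs have already been transformed into newline-delimited JSON, one record per line.

    # A sketch of loading transformed web-server logs into BigQuery.
    # File, dataset, and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.usage.web_logs"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the table schema from the data
    )

    with open("access_logs.json", "rb") as source_file:
        load_job = client.load_table_from_file(
            source_file, table_id, job_config=job_config)

    load_job.result()  # wait for the load job to finish
    print("Loaded", client.get_table(table_id).num_rows, "rows into", table_id)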

Mapping the Planet

Descartes Labs, a startup far from Silicon Valley in Los Alamos, New Mexico, ran into an interesting problem. The startup was founded by an all-star team from the world-famous Los Alamos National Laboratory (LANL). Their plan was to take years of research into analyzing satellite images and turn it into a commercial product. Their sophisticated deep-learning-based (AI) algorithms would derive valuable insights about our planet from satellite imagery. Our world runs on commodities, so if Descartes Labs could successfully predict the production yields of corn crops or the state of available drinking water just by looking at images of Earth taken from space, it would be worth a lot of money.

Their problem was that they had to deal with massive amounts of data. More specifically, they needed a way to efficiently download petabytes of data from NASA without bankrupting themselves. Without a solution to this problem, it was doubtful that they could even start the company. A petabyte is a lot of data—thirteen years of HDTV video or 10 billion photos on Facebook—and they might be dealing with several petabytes. Setting up their own data center was clearly not within their budget. Even as recently as ten years ago, the team would probably have been forced to raise a large amount of money or, worse, give up and watch as a larger company walked off with the prize.

I like to think of software as the big equalizer. Anyone with an idea and the ability to code should be able to change the world. The Descartes Labs team had the idea and the capabilities to build the software. What they needed was access to affordable yet powerful computing and networking resources. It’s what stood between them and changing the world.

Descartes Labs did have venture funding, and through their investors they discovered the Google Cloud Platform for Startups program. The program gives selected startups thousands of dollars’ worth of credits to use on the Google Cloud. Google provided the team with a sandbox to experiment in and see if Google’s infrastructure could help them.

Leveraging the Google Cloud changed the game. Their big data problem pretty much disappeared. The team found that the satellite images they needed were already available on the Google Cloud through the Google Earth Engine service. And with everything inside Google’s data centers, they automatically had access to Google’s high-speed internal networks to help the data zip around.

It’s one thing to have instant access to all of that data; they also needed to run their algorithms on it and produce results. That takes a lot of computing power, and this is where Google Compute Engine helped. The team could instantly spin up more than thirty thousand CPUs to churn away at their data. In an article about their experience, they mentioned that it took just sixteen hours to process all of that data, and that they could easily have added more CPUs to shorten the processing time or increased the complexity of the processing to get better results. They described using the Google Cloud Platform as like having a supercomputer on demand. The cloud allowed the Descartes Labs team to focus on their idea and their mission to change the world.
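
To give a flavor of what “spinning up CPUs” means in practice, here is a rough sketch of creating a batch of worker machines with the Compute Engine Python client. The project, zone, machine type, and image are placeholders chosen for illustration; this is not Descartes Labs’ actual setup.

    # A rough sketch of creating worker VMs with the Compute Engine API.
    # Project, zone, machine type, and image below are placeholders.
    from google.cloud import compute_v1

    def create_worker(project: str, zone: str, name: str):
        disk = compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-12"),
        )
        instance = compute_v1.Instance(
            name=name,
            machine_type=f"zones/{zone}/machineTypes/n1-highcpu-16",  # 16 vCPUs each
            disks=[disk],
            network_interfaces=[
                compute_v1.NetworkInterface(network="global/networks/default")],
        )
        return compute_v1.InstancesClient().insert(
            project=project, zone=zone, instance_resource=instance)

    # Ten 16-vCPU workers here; the same idea scales to tens of thousands of CPUs.
    for i in range(10):
        create_worker("my-project", "us-central1-b", f"imagery-worker-{i}")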
