Predictive analytics in the cloud: It’s all about the data
By Dave Hirko
During the 1992 presidential election, the Clinton team coined the phrase “it’s the economy, stupid,” as an easy way to remember one of the most important platforms of the campaign. For the cloud – and especially predictive analytics in the cloud – it’s not the economy, but the data, that makes all the difference.
These days, as the cloud makes storage of enterprise data easier and more affordable for companies of any size, every business is now a data business, whether it knows it or not.
And that will be truer still as the Internet of Things begins to collect and contribute data to enterprise systems from nearly every household and business device or appliance. You have to assume that the volume of enterprise data will increase (possibly exponentially) every year. Most organizations already are overwhelmed with data and can’t process it fast enough.
Enter the cloud. There’s a natural synergy between the cloud and analytics. The cloud allows you to scale out horizontally easily and quickly, which in turn enables you to look across silos of data to identify developing trends.
Most companies that are struggling with a move to the cloud are concerned in particular with how to migrate data to this new computing environment – and that’s where they’re going wrong. New technology makes it much more practical to scratch-build their data repositories in the cloud rather than migrate data to the cloud. After that, complex data analysis can be underway in minutes rather than months (if at all).
Let’s look back at the cloud, and ways to make the best use of the technology when putting predictive analytics to work on an enterprise scale.
“Power Company” of the Internet Age
In the past, on-premise data collection and management was limited because IT resources were finite and expensive. That has changed with the cloud. Think of it as analogous to the power company at the turn of the 20th century.
In the late 1880s when electricity was just coming on the scene in industry, every business built its own generating capacity at great expense to the company. When local electric utilities came along to manage capacity and distribution, businesses no longer needed their own power generators, and the use of electrically powered equipment became more commonplace.
Datacenters in the 1990s and 2000s were a lot like that: Every company had one, and they were very expensive. But the cloud has changed the landscape. And just as power utilities mean we no longer have to worry about running out of power, so the cloud means we no longer have to worry about running out of space and computational power to analyze our data.
With the power company, you plug into an outlet. With the cloud, you plug into an application programming interface (API).
For most enterprise purposes, the cloud has practically infinite capacity for storing data. It is finite, of course, but for all intents and purposes, like the power company, it doesn’t ever run out.
It does, however, have its own set of challenges.
The Myth of Cloud’s Security Problems
The most common challenge for organizations when moving to the cloud is embracing a new security paradigm for storing their data.
Legacy technology companies whose business model is threatened by the cloud have for years been perpetuating the notion that the cloud is less secure than enterprise data systems. That’s mostly a marketing pose, but it has taken its toll on cloud adoption.
In reality, cloud security is implemented differently, but when it’s done correctly, the cloud can actually be more secure than on-premise solutions for data storage.
The real problem is a skills gap in the market for technical professionals. Comparatively few IT professionals have both security and cloud skills, which is, to an extent, hampering the adoption of the cloud for data-intensive applications like predictive analytics.
It will take time for customers to fully trust storing their sensitive data in the cloud, but it will happen. It’s not a question of “if,” but of “when.”
Data Migration, No; Scratch-building Data, Yes
Contrary to popular belief, many organizations are not “migrating” their analytical applications to the cloud. Rather, they are building new cloud-based systems from scratch and writing off the legacy systems.
As early as the end of 2013, Piper Jaffray enterprise software analyst Mark Murphy surveyed a number of IT professionals and concluded that by 2018 one-third of all computing workloads will be running on the cloud as opposed to on-premise. In a separate study, the analyst firm IDC noted that the cloud represents one-third of all new IT infrastructure spending, and that cloud spending is expected to increase steadily through 2019.
Our own experience in helping businesses move to the cloud is that enterprises are building many times more applications in the cloud than had existed previously in legacy on-premise datacenters.
Migrating data to the cloud is hard; it takes money and time. Expensive circuits are needed to move the data. Consequently, it may take months – or even years – to move terabyte- and petabyte-scale data to the cloud.
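To see why migration timelines stretch into months, a rough back-of-the-envelope calculation helps. The sketch below is illustrative only: the function name, the link speeds and the 80 percent sustained utilization figure are assumptions, not measurements from any particular project.

```python
# Rough transfer-time estimate for bulk data migration.
# All figures (link speeds, utilization) are illustrative assumptions.

TB = 10**12   # bytes in a terabyte (decimal)
PB = 10**15   # bytes in a petabyte (decimal)
MBPS_100 = 10**8   # 100 megabits per second
GBPS_1 = 10**9     # 1 gigabit per second


def transfer_days(data_bytes: float, link_bits_per_sec: float,
                  utilization: float = 0.8) -> float:
    """Days needed to move data_bytes over a link at a sustained utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_sec * utilization)
    return seconds / 86_400  # seconds per day


for size_label, size in [("10 TB", 10 * TB), ("1 PB", PB)]:
    for link_label, link in [("100 Mbps", MBPS_100), ("1 Gbps", GBPS_1)]:
        days = transfer_days(size, link)
        print(f"{size_label} over {link_label}: {days:.1f} days")
```

Under these assumptions, a single petabyte over a dedicated 1 Gbps circuit takes roughly 116 days, before accounting for validation, retries or the cost of the circuit itself.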
The better strategy is to start collecting data right to the cloud, and to analyze that data locally in the cloud. Migrate data only as a last resort – and be prepared to spend a considerable amount of time, money and effort to do so.
Open Source Analytics: The Democratizing Software Trend
After creating the appropriate scheme for data collection in the cloud comes the challenge of analysis in that environment. Fortunately, open source analytics software is allowing an exponentially larger number of analytic solutions to be used in an enterprise environment.
Traditionally, there were only a handful of proprietary analytical software solutions, from companies like Oracle, SAS, SAP and so on. These solutions were very expensive, so only a few “elite” organizations (big banks, government agencies and so on) could afford them.
Open source software has “democratized” the ability for small and large companies to build their own analytical applications. In our experience, no one is building analytics applications meant exclusively for on-premise use any longer. There are more net-new analytical applications today than there were five or 10 years ago, but all of them are being born in the cloud.
In the early days of enterprise computing, massive computing power scaled up vertically, by adding more CPUs, servers, etc. The cloud takes a different approach, scaling from one node to hundreds with a single API call. Analytics software can be distributed over thousands of servers, scaling horizontally rather than vertically. With this distributed processing technology, you can send a job to each node in a system, which lets you do very sophisticated processing in a parallel fashion.
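The idea of sending a job to each node and combining the results can be sketched in a few lines. This is a toy illustration, not how any particular analytics product works: local threads stand in for cluster nodes, and the names partition, node_job and run_distributed are made up for the example.

```python
# Toy scale-out sketch: threads stand in for cluster nodes (an assumption
# made so the example runs on one machine).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Split records into n_nodes shards, round-robin."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def node_job(shard):
    """The job each 'node' runs on its own shard: count events by type."""
    return Counter(event_type for event_type, _ in shard)


def run_distributed(records, n_nodes):
    """Send the same job to every node, then merge the partial results."""
    shards = partition(records, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_job, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


records = [("click", 1), ("view", 2), ("click", 3),
           ("purchase", 4), ("view", 5), ("click", 6)]
print(run_distributed(records, n_nodes=3))
```

The point of the design is that the answer does not depend on the number of nodes; adding nodes changes only how the work is divided, which is what makes horizontal scaling attractive for analytics.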
As for analytics applications themselves, modern open source analytics software has its roots in Google’s search technology, specifically the Google File System. In the mid-2000s, Hadoop emerged as an open source implementation of those ideas, with Yahoo as an early and major backer. Hadoop relies on many (on the scale of hundreds or thousands) cheap commodity servers, which made it a perfect fit to run in the cloud. Over the past 10 years, the Hadoop ecosystem has grown significantly and has now been adopted by almost every enterprise or Fortune-level business doing work in the cloud.
A chief advantage of the cloud is its scalable, “pay as you go” nature. It allows you to pay for only what you use, and then turn it off. By extension, analytics applications in the cloud are similarly scalable and even “ephemeral.” They exist only when they’re needed – and like the power company, you can turn the service on and off when you need it (thus eliminating the fees of massive and cumbersome enterprise applications).
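A simple calculation shows how much the "turn it off" habit matters. The node count, hourly rate and usage hours below are hypothetical figures chosen for round numbers, not quotes from any cloud provider.

```python
# Always-on versus ephemeral cluster cost, using hypothetical prices.

def monthly_cost(nodes: int, hourly_rate: float,
                 hours_per_day: float, days: int = 30) -> float:
    """Monthly bill when every node is billed only for the hours it runs."""
    return nodes * hourly_rate * hours_per_day * days


# Assumed workload: 20 nodes at $0.50 per node-hour.
always_on = monthly_cost(nodes=20, hourly_rate=0.50, hours_per_day=24)
ephemeral = monthly_cost(nodes=20, hourly_rate=0.50, hours_per_day=2)

print(f"Always on:  ${always_on:,.2f} per month")
print(f"Two hours a day: ${ephemeral:,.2f} per month")
```

With these assumed figures, the same cluster costs $7,200 a month running around the clock and $600 a month when it exists only for a two-hour daily job.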
“Software is Eating the World” – Sort of
That well-known quote comes from Marc Andreessen, an original developer of the modern Web browser and present-day venture capitalist. Internet-style applications are transforming other industries, as Uber has done for ride-hailing and Seamless for food service.
In trying to get to that same level of software and industry integration, most organizations have not fully embraced how to best leverage cloud software APIs to maximize performance potential and cost savings. They’re still treating the cloud like their datacenter. Or, to use the power company analogy, they’re keeping the lights on all the time, incurring 24/7 expenses.
And while open source analytics software packages are very capable, they’re not yet mature enough to deploy easily. These packages still require very specialized (and expensive) staff who are hard to find and hard to retain.
Consequently, many organizations are building their own cloud-based analytics software manually, one step at a time. This manual approach introduces the risk of project failures and delays. A deployment may take weeks, months or sometimes years to complete, making analytics software one of the most fragile systems in the organization. Once these systems work, no one wants to touch them.
This, of course, stifles innovation; after all, what’s the point of analysis if you can’t be flexible?
In fact, most organizations are spending too much time on the systems, and not enough time on the data and analytics. One IT executive from a well-known hedge fund told us, “I hired mathematicians and data scientists from Princeton and Harvard to analyze my portfolio performance. I didn’t hire them to fix the Hadoop cluster.”
It’s the same complaint, regardless of the industry. Customers say, “It’s taking too long to analyze data; I can’t afford to have my IT department take that long to build a new analytics application. We need applications and data ready immediately.”
Responding to that pain from business leaders, some firms – including ours – are now using software automation to orchestrate cloud and data APIs.
Software Cloud Automation Cuts Deployment to Minutes
With software automation, analytical applications can be developed and launched not in months or years but in minutes, ready to ingest data almost immediately. The approach is being adopted across a range of sectors, from financial services to healthcare, retail and technology.
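What "orchestrating cloud APIs" looks like in practice can be suggested with a small sketch. Everything here is hypothetical: InMemoryCloud is a stand-in that only records calls, not a real provider SDK, and ephemeral_cluster is an invented helper. The pattern it illustrates is the one that matters: provision, run the job, and always tear down.

```python
# Sketch of automated, ephemeral provisioning. InMemoryCloud is a
# hypothetical stand-in for a provider API; it only records calls.
from contextlib import contextmanager


class InMemoryCloud:
    def __init__(self):
        self.running = set()
        self.log = []

    def launch(self, name, nodes):
        self.running.add(name)
        self.log.append(f"launch {name} x{nodes}")

    def terminate(self, name):
        self.running.discard(name)
        self.log.append(f"terminate {name}")


@contextmanager
def ephemeral_cluster(cloud, name, nodes):
    """Provision a cluster, hand it to the caller, and always tear it down."""
    cloud.launch(name, nodes)
    try:
        yield name
    finally:
        # Runs even if the analytics job raises, so nothing is left billing.
        cloud.terminate(name)


cloud = InMemoryCloud()
with ephemeral_cluster(cloud, "nightly-forecast", nodes=50) as cluster:
    cloud.log.append(f"run job on {cluster}")

print("\n".join(cloud.log))
```

Because the teardown is part of the automation rather than a manual step, the cluster cannot be forgotten and left running, which is the failure mode of treating the cloud like a datacenter.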
Take the example of harvest.ai, a San Diego, Calif.-based provider of next-generation cyber security solutions for companies using platforms such as Google for Work, Office 365, Box or Dropbox. harvest.ai uses cloud-based analytics and natural language processing to help organizations identify and stop data breaches from targeted attacks, insider threats and stolen credentials in near real time while still allowing their employees to efficiently collaborate and share data. The company works at the crossroads of computer science, artificial intelligence and computational linguistics to monitor the interactions between computers and human (or “natural”) languages.
By deploying multiple open source analytical software applications, harvest.ai was able to automate a cloud-based predictive analytics and data system. This automated system was launched literally in minutes with a few clicks of the mouse.
Once its systems were automated, harvest.ai was able to offer multiple deployment modes of natural language processing based on its customers’ needs. This reduced the time required to serve customers to a fraction of that of a typical client engagement. Previously, it took the company weeks or even months to offer new analytical products to its clients; with automation, that turnaround time was cut to a matter of days.
As we’ve seen, the push to the cloud for predictive analytics can be backbreaking labor for companies trying to apply the traditional datacenter model to their business. Migrating data to the cloud – and building specific applications for your business alone – is a strategy that’s fallen out of step with the latest advancements in cloud technology.
By automating your cloud strategy and using open source analytical applications, you can start using your enterprise data right away. And making use of that data quickly and easily is the point, isn’t it? Because when it comes to predictive analytics in the cloud, remember – it’s all about the data.