Big Data Analytics: Cloud-supported machine learning services
Exploring and comparing the potential of big data analytics on selected cloud providers’ platforms.
By (l-r) Lakshmi D. Baskar, Neil Lobo, Praveen Ananth and Palaniappa Krishnan
Big data analytics and cloud computing are two of the most prominent technologies of interest to business organizations and researchers across the globe today. The increasing market penetration of highly affordable sensors, smart gadgets and connected devices has resulted in a continuous, massive flow of heterogeneous data, i.e., big data. Harnessing its potential has become the key to competitive advantage for data-driven businesses. This complex, fast-streaming data comes from many areas, including social media, the Internet of Things (IoT) and finance, as companies seek richer insight in a timely and cost-efficient manner. The phenomenon, characterized by the “3Vs” (volume, velocity and variety), produces information far beyond the processing capability of conventional tools. Building the computational infrastructure needed to warehouse and make sense of all this data requires large investments and poses practical limitations. Cloud computing offers an attractive alternative.
Cloud Computing: A New Paradigm
Cloud computing and big data are clearly gaining significance and value in both small and large enterprises. Cloud service providers such as Google, Microsoft and Amazon rent out portions of their well-managed, massive, worldwide data centers to developers and companies that require large computing power and storage resources to run their applications. Clouds give users easy, on-demand access to large distributed resources, much like utilities, thereby decreasing the overall cost of system administration and enabling efficient management of resources. Cloud delivery models can be categorized as:
Infrastructure as a Service (IaaS), the most basic layer, allows users to set up virtualized hardware and software resources in the data center.
Platform as a Service (PaaS) provides developers with the tools and environment to build and deploy their applications in the cloud.
Software as a Service (SaaS) delivers applications operating in the cloud as a service to end users.
While there is a certain level of concern about privacy and security issues, the cloud as a delivery mechanism has grown in interest among researchers and enterprises.
Machine Learning in the Cloud
Traditional analytical approaches are insufficient for big data, where the emphasis is on comprehensive analysis of highly scalable, unstructured data captured in real time. Machine learning (ML), one solution to this challenge, enables a system to automatically learn patterns from data that can be leveraged in future predictions. Information can be extracted via supervised, unsupervised or reinforcement learning. Training conventional ML algorithms on big data becomes computationally expensive and sometimes intractable, and here cloud computing offers a practical alternative. Given that data and software already reside in the cloud, bringing the computation logic to them is a natural progression, reducing overall input/output overhead and monetary cost.
Cloud-Hosted ML Service
By exploring the potential of big data analytics on major cloud providers’ platforms including Azure, AWS, Google and IBM, we can lay out the workflow automation of data science for big data analytics in these clouds as shown in Figure 1.
Figure 1: Data science workflow for big data analytics.
1. Set business objective or research question.
“A business problem well-stated is a problem half solved.” Formulating an appropriate business question in accordance with a company’s goals and market trends is of utmost importance.
Case study: Predicting click-through rate (CTR) is an essential learning problem in the online advertising industry. It is an evaluation metric often used for sponsored search advertising and real-time bidding auctions. For our case study, we used a CTR prediction data set obtained from ShareThis social network traffic. The ShareThis network captures terabytes of consumer social engagement data across three million publishers and processes the data to make it actionable for businesses.
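As a quick illustration, CTR itself is simply the fraction of ad impressions that lead to clicks; a minimal sketch:

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = clicks / impressions: the fraction of ad views that led to a click."""
    if impressions == 0:
        return 0.0  # no impressions served, so no measurable rate
    return clicks / impressions

# e.g., 12 clicks out of 4,800 impressions
rate = click_through_rate(12, 4800)
print(f"CTR: {rate:.4f}")  # 0.0025
```

The learning problem, of course, is not computing this ratio but predicting, per impression, the probability that a click will occur.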
2. Collect & store.
In this process, data scientists decide on the features to be gathered for analysis. Based on business requirements, the collected data is stored in highly scalable data storage that provides a secure way to retrieve information. For this discussion, we cover computing and storage services offered by Amazon, Microsoft and Google; interested readers could also look at services from other cloud providers, including IBM SmartCloud and Rackspace. Most cloud providers offer a combination of IaaS and PaaS services.
Amazon Web Services (AWS) is one of the largest providers of highly scalable, cost-effective infrastructure platforms. In AWS, Amazon’s Elastic Compute Cloud (EC2) gives users an effortless way to configure computing capacity using either pre-configured or customizable Amazon Machine Images. Amazon’s Simple Storage Service (S3) offers simple, secure, inexpensive and scalable data storage.
Windows Azure, offered by Microsoft, allows end-users to host, store, scale and run Web applications on a network of Microsoft datacenters. For computing services, Azure supports creating virtual machines similar to Amazon’s EC2. Azure’s storage service provides various forms of persistent storage such as tables, blobs and queues that can be accessed using interfaces.
Google Cloud Platform integrates many suites to store and analyze data on Google’s infrastructure. Google Compute Engine allows users to set up virtual machines hosted by Google. Google’s BigQuery, a highly scalable database, can accommodate and retrieve multi-terabytes of data. Data can be directly imported to BigQuery through Google Cloud Storage, an advanced storage service on Google’s infrastructure.
Case study: Our CTR data set is CSV formatted, approx. 8.9 GB in size, with more than 71 million rows and 25 features including click flag (click/non-click), click behaviors, timestamps, impression details, campaign, ad groups, URL, IP address, device details, user agents, mobile device, verticals, image details, etc. At ShareThis, the data is stored in BigQuery, a Google Cloud-hosted analytical database.
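Aggregates over data of this size are retrieved from BigQuery in SQL. The sketch below only builds a query string; the table and column names are hypothetical, since ShareThis’ actual schema is proprietary, and executing the query would additionally require the google-cloud-bigquery client library and valid credentials:

```python
# Hypothetical fully qualified table name; the real ShareThis schema is not public.
TABLE = "project.ctr_dataset.impressions"

def ctr_by_device_sql(table: str) -> str:
    """Build a BigQuery SQL string that aggregates CTR per device type."""
    return (
        f"SELECT device_type, "
        f"SUM(click_flag) / COUNT(*) AS ctr "
        f"FROM `{table}` "
        f"GROUP BY device_type "
        f"ORDER BY ctr DESC"
    )

# Actually running it against BigQuery would look roughly like:
# from google.cloud import bigquery
# rows = bigquery.Client().query(ctr_by_device_sql(TABLE)).result()
print(ctr_by_device_sql(TABLE))
```

Pushing aggregation into BigQuery this way keeps the multi-terabyte scan on Google’s infrastructure, so only small summaries leave the warehouse.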
3. Data preprocessing and ML service.
Real-world big data is generally unstructured, incomplete, noisy and inconsistent. Data preprocessing is an essential step to improve data quality, resulting in better model building and scoring. Preprocessing includes the following techniques, which are not mutually exclusive: data cleaning, data transformation and data reduction. Preprocessing is followed by model building and scoring, which involves selecting appropriate algorithms that enable the timely processing of large-scale data for new business opportunities. As discussed earlier, ML is time-consuming, and for big data analytics, cloud providers offer ML as a service to build models and deploy them in production.
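As a concrete illustration of one cleaning technique, the mean/mode substitution used later in the case study can be sketched in plain Python (assuming missing values arrive as None):

```python
from collections import Counter
from statistics import mean

def impute(values, kind="mean"):
    """Replace None entries with the column mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if kind == "mean":
        fill = mean(observed)
    else:  # mode: the most frequently observed value
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0]))  # -> [1.0, 2.0, 3.0]
print(impute(["mobile", None, "mobile", "desktop"], kind="mode"))
```

In the cloud ML services themselves, the same substitution is available through built-in modules rather than hand-written code; this sketch only shows what the operation does.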
Amazon Machine Learning (AML) is a platform for users of all skill levels to deploy machine learning for making predictions at scale. Similarly, Microsoft introduced a fully managed ML Web service on Azure for model building and deployment. The Azure ML service is primarily built and evaluated in a cross-platform, browser-based development tool, “Azure ML Studio” – an intuitive, user-friendly, drag-and-drop, collaborative Web interface with zero installation. We will walk through the data processing and ML steps in both AML and Azure ML using a case study and Table 1.
| Service/Attributes | Data limit (free tier) | Data source | Data wrangling | ML algorithms | Interpretability | Versioning & pricing | Other language support |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Azure ML | 10 GB | Local text file upload, HiveQL, SQL databases, Azure Blob Storage, Web URL | — | — | ROC, AUC, accuracy, precision, recall, F1-score, confusion matrix, root mean squared error | Runs are cached; model building $1 per hour; application integration with API $2 per computing hour | R, Python, SQL |
| AML | 100 GB | AWS S3, AWS RDS, Redshift (storage: pay for large data sets) | — | — | Evaluate Model: AUC, accuracy, precision and recall, F1-score, root mean squared error | Runs are cached; model building $0.42 per hour; batch predictions $0.10 per 1,000 predictions | — |

* Uses in-built tools.
Table 1: Data processing and ML steps for AML and Azure ML.
Case study: We used the Azure ML and AML consoles to evaluate their respective services, with data stored on their respective storage services. Using our domain expertise in CTR, we applied both manual and automatic methods to eliminate redundant and irrelevant features. R scripts and Azure ML were used to produce data summaries, visualize feature distributions and perform initial subset selection. Missing data was handled using sample mean/mode substitution. Training data was obtained using a 70 percent split of the data. After the training data upload, we chose to modify the schema, ignore unnecessary columns, etc. Appropriate learning algorithms were selected to build the model on the training data set. The model was validated with the remaining 30 percent of the data (test data). Results were interpreted using the AUC metric.
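The split-train-score sequence just described can be sketched with scikit-learn on synthetic stand-in data (the real CTR features are not reproduced here, so the feature matrix and click flags below are fabricated for illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # stand-in feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in click flag

# 70/30 train/test split, as in the case study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Binary classifier built on training data, scored on held-out test data
model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, scores):.3f}")
```

The cloud services perform the same steps through their consoles; this sketch simply makes the underlying workflow explicit.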
Experiments and results: For this experiment, we sampled our data set using 10 percent, 25 percent and 50 percent of the complete data set. For example, for the 25 percent balanced data set, every fourth row of the data was taken. We analyzed ML services on the following dimensions:
- Scalability: ability of the system to efficiently handle increasing amounts of data by making use of additional resources with less overhead complexity.
- Robustness: ability to process large data and build models in real time with low time complexity.
- Performance of the system: usually measured in terms of area under the curve (AUC), determines how accurate the model predictions are, with a perfect model scoring an AUC of 1 (100 percent).
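AUC has a useful probabilistic reading: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting one-half). A minimal sketch of that definition:

```python
def auc(pos_scores, neg_scores):
    """AUC = P(random positive outranks random negative), ties counted 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8], [0.1, 0.2]))  # perfect separation -> 1.0
print(auc([0.5], [0.5]))            # indistinguishable -> 0.5
```

This pairwise form is quadratic and meant only to clarify the metric; production libraries compute the same quantity from sorted scores.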
Our observed results in the computational time and performance experiments were based on default parameter settings. Binary classification was selected as the model type in both services. The ML services reported performance (AUC) based on evaluation of the test data set against the trained model. As shown in Figures 2a and 2b, both Azure ML and AML offered scalability and robustness for increasing traffic. AML scored higher AUCs than Azure ML; however, Azure ML was much faster in building and validating the model.
Figure 2a: Performance analysis.
Figure 2b: Computational time analysis.
Figure 3: Brief summary of ML service platform.
Other services: As shown in Figure 3, each platform offers distinguishing advantages over other services.
IBM Watson Analytics offers data insights, data visualizations and predictive analysis of features. Unlike AML and Azure ML, Watson Analytics aims to be a cloud-based visualization and smart-discovery solution that guides data exploration and creates dashboards and infographics. This service, in our opinion, is suited to the data exploration, feature analysis and subset selection process.
Deployment: Azure ML, AML and Google Prediction API provide deployment of the machine-learning models via API endpoints or as Web services. IBM Watson does not provide creation of machine learning models for deployment.
Conclusions and Future Work
Today there is a growing interest in cloud delivery, and many organizations are evolving to support cloud services. As cloud-supported big data analytics is not a one-size-fits-all solution, through this survey, we have provided a quantitative review to help businesses and researchers make use of these platforms effectively. Topics for future work include optimizing parameter tuning, deploying in a production setup and analyzing API implementations.
Lakshmi Dhevi Baskar is a data scientist and Neil Lobo is a software engineer at ShareThis in Palo Alto, Calif. Praveen Ananth is an incubator strategist with eBay Mobile Labs in San Francisco. Palaniappa Krishnan is an associate professor of applied economics and statistics at the University of Delaware in Newark, Del., and a member of INFORMS.
Glossary of definitions
CTR: click-through rate.
Scalable storage: handles increasing amounts of data.
Persistent data store: retains data even when the machine is powered off or a failure occurs.
API: application programming interface.
ROC: receiver operating characteristic.
AUC: area under the (ROC) curve.
References
- R. L. Villars, C. W. Olofson and M. Eastwood, 2011, “Big data: what it is and why you should care,” white paper, IDC.
- Q. Zhang, L. Cheng and R. Boutaba, 2010, “Cloud computing: state-of-the-art and research challenges,” Journal of Internet Services and Applications, Vol. 1, No. 1, pp. 7-18.
- D. Talia, 2013, “Clouds for scalable big data analytics,” Computer, Vol. 46, No. 5, pp. 98-101.
- I. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani and S. U. Khan, 2015, “The rise of big data on cloud computing: Review and open research issues,” Information Systems, Vol. 47, pp. 98-115.
- M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. S. Netto and R. Buyya, 2015, “Big Data computing and clouds: Trends and future directions,” Journal of Parallel and Distributed Computing, Vol. 79-80, pp. 3-15.
- H. Hu, Y. Wen, T. Chua and X. Li, 2014, “Toward Scalable Systems for Big Data Analytics: A Technology Tutorial,” IEEE Access, Vol. 2, pp. 652-687.