The big ‘V’ of big data
By Pramod Singh, Ritin Mathur, Arindam Mondal and Shinjini Bhattacharya
“There is a big data revolution. However, what is revolutionary is not the quantity of data alone. The big data revolution is that now we can do something with the data.”
– Professor Gary King, Harvard University
While storage and computational capacity are essential and a given, improved statistical and computational methods are creating unprecedented opportunities for working with big data. Today, individuals, businesses and governments can manage large amounts of data and mine it for information that leads to insights. However, that information must be analyzed in a cost-effective manner. High volume, variety and velocity of data cannot translate into value unless management makes a concerted effort.
Unlocking this value that businesses can get from big data involves three key elements:
- Information infrastructure (ingest and store efficiently): This element is about creating infrastructure that can capture, store, replicate and scale information at speed.
- Information management: An information ecosystem to manage, secure, govern and leverage information seamlessly across an organization’s information assets.
- Insights: Correlate this data with existing business data (usually structured data) and analyze it using descriptive and prescriptive analytics to aid decision-making.
The last few years have seen a lot of focus and attention on infrastructure and information management. Exciting new technologies, frameworks and methodologies have evolved to address the needs of these elements. For example, infrastructure technologies have greatly improved, as have the innovations that benefit data centers. They range from fast and efficient servers to data center solutions that can capture, store, replicate and scale information at high speeds.
Figure 1: Three key elements to unlocking the value of data.
Information management has seen the most rapid evolution and change. Managing information through distributed technologies (such as Hadoop and its distributed file system) has changed information storage to provide low latency, high speed, highly available systems. Innovations in traditional enterprise data warehousing environments, such as in-database and in-memory capabilities and compression techniques, let organizations get insights from data more quickly.
As organizations come to terms with managing big data, harnessing it through systems and converging its usage with traditional data sources, the business analytics that guides decision-making is itself evolving. This article focuses on how analytics has been influenced by big data and what practices will emerge in years to come, based on observations within Hewlett-Packard.
United we are “big,” divided we may be small
For business analytics to be successful in meeting an organization’s needs for decision support, it fundamentally needs to be able to consider all data sets relevant to solving a particular business question.
Traditional business intelligence (BI) and enterprise data warehouse (EDW) environments focus on the usual data generated from business operations. This is data generated through point-of-sale transactions, customer records, financial and business planning systems, inventory management systems, etc.
Businesses today, however, also have access to two other key forms of data. The first of these can be loosely categorized as “human information”; this form of data comes from having increased knowledge of customers through e-mail, social media and other marketing channels, but also from an organization’s institutional data in the form of documents and customer support call records, as well as video, audio and image sources. This data tends to be unstructured in format.
Figure 2: Bringing data together.
The second new type of data is machine data. This is data generated by an increasingly interconnected world of devices and systems. Examples include data generated by sensors, smart meters, RFID tags, security and intelligence systems, and IT logs (application and Web servers). This data tends to be largely semi-structured or unstructured.
Business systems in BI and EDW environments are not architected to handle the volume and variety of “human information” or the volume and velocity of machine-generated data.
Today, organizations need to bring all their data together for advanced analytics. For example, at HP, structured data from a customer’s purchase history, demographics and warranty data can be combined with unstructured data coming from customer support records and social media for a more focused customer engagement strategy.
Big Data and Analytics: Process, Purpose, Practice
As information and data assets of an organization come together and combine with external data, analytical techniques and analysis will have a larger role to play. In general, the characteristics of big data that most influence the analytics process are related to the variety and volume of data. However, velocity, which is handled through business intelligence practices, is considered distinct from core analytics practices for the purposes of this article. The analytics process is usually represented as a set of activities – preparing data, developing analytical models based on analytical techniques to solve for the business question, validating the model and deploying it. These are essentials of analytics and represent the elements of big data that organizations today need to pay attention to. Following is a closer, more technical look at each of these activities.
Sampling. Sampling has been the backbone of analytical processes, with the premise of using information from a sample to draw inferences about a larger population. Historically, sampling has been a core part of analytical processes because of the limitations on collecting data on whole populations and then analyzing it in aggregate. Sample accuracy, of course, depends on several factors and is predicated on the minimization of various biases in the sampling methodology.
While there are arguments both for and against sampling in big data environments, from a data scientist’s perspective, a few aspects of the data used in an analytics process need to be well understood.
First, using large storage and computing power to analyze a whole population is justified only when the additional data sets raise the marginal business returns. Second, some specialized application areas do need the population rather than a sample. For example, when analyzing a large data set for cyber security threats, our interest lies in finding the outliers and anomalies in the data. Where millions of rows of data may not give any value, a particular row (one individual) may be very useful and could save the business huge losses. Another example is identifying the top five social media influencers in cloud computing; here, too, we might need the population.
And last, irrespective of whether we choose to use a sample or a population, it is still vitally important for a data scientist to understand and question the data source and collection methods so that a selection bias in data may be averted.
As such, the need and relevance of sampling in big data applications is contextual and depends on the question being solved and the source of the data.
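The anomaly-detection argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the article: it assumes simple random sampling without replacement, and the function name and the population and sample sizes are illustrative choices. It computes the chance that a sample contains at least one of a handful of rare rows.

```python
def prob_sample_contains_any(population_size, sample_size, rare_rows=1):
    """Probability that a simple random sample drawn without replacement
    contains at least one of `rare_rows` specific rows of interest."""
    p_miss = 1.0
    for i in range(sample_size):
        # Chance that the (i + 1)-th draw also avoids every rare row.
        p_miss *= (population_size - rare_rows - i) / (population_size - i)
    return 1.0 - p_miss


# Hypothetical sizes: one anomalous row hidden among a million.
N = 1_000_000
for n in (1_000, 100_000):
    print(f"sample of {n:>7,}: P(anomaly is in sample) = "
          f"{prob_sample_contains_any(N, n):.4f}")
```

With a single rare row the probability reduces to the sampling fraction n/N, so a 0.1 percent sample misses the anomaly 99.9 percent of the time. The same sample would usually be adequate for estimating an average, which is why the choice between sample and population depends on the question being asked.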
Regression. Many of the most common analytical techniques used by analysts today are related to regression. Regression is commonly understood as a statistical process for estimating the relationship between variables. The techniques help predict the value of a dependent variable, given values of independent variables, and are widely used for prediction and forecasting.
Most common methods of regression, such as “ordinary least squares” and “maximum likelihood estimation,” require that the number of variables be less than the number of observations. In a big data environment, where ever more new data sets are being incorporated, the number of independent variables available often greatly exceeds the number of observations. A case in point is the study of genes, where the different types of genes are the independent variables and the patients in a study are the observations. Another good example is texture classification of images, where the variables are the pixels and the observations are the images available.
In addition to this, the analyst also has to address some very important issues. For example, do the new variables really help improve the accuracy of the prediction? In general, not all variables contribute to an improved accuracy of the model. Typically, only a few of the large number of potentially influential factors account for most of the variation.
To handle the complexity of variable selection brought about by the increasing number of data sets available for analysis through big data techniques, a few methods have gained attention and adoption, such as subset selection for regression, penalized regression, Biglm, Revolution R and Distributed-R Vertica, and the split-and-conquer approach.
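To illustrate penalized regression when variables outnumber observations, here is a minimal sketch of the lasso (L1-penalized least squares) fitted by coordinate descent. It is one common form of penalized regression, not necessarily the variant the authors have in mind. The function names, the toy data and the penalty value are all assumptions made for illustration. The objective minimized is (1/2)·||y − Xβ||² + λ·||β||₁.

```python
def soft_threshold(value, lam):
    """Shrink `value` toward zero by `lam`; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 one coefficient at a time.

    X is a list of rows (observations); y is a list of responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(x * x for x in col)
            if norm_sq == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that ignores column j.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(col, residual))
            new_beta = soft_threshold(rho, lam) / norm_sq
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
                beta[j] = new_beta
    return beta


# Toy data (hypothetical): 3 observations, 4 variables, so p > n.
# y depends only on the first variable: y = 2 * x0.
X = [[1, 1, 2, 3],
     [2, 1, -1, 0],
     [3, -1, 0, -1]]
y = [2, 4, 6]
print(lasso_coordinate_descent(X, y, lam=1.4))
```

On this deliberately simple data set the irrelevant columns are orthogonal to the informative one, so the penalty drives their coefficients to exactly zero and keeps only the first variable, whose estimate is shrunk slightly below its true value of 2. Ordinary least squares has no unique solution here because there are more variables than observations.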
Clustering. Segmentation, using clustering techniques, is a common method used to reveal the natural structure of data. Cluster analysis involves dividing the data into useful as well as meaningful groups where objects in one group (called a cluster) are more similar to each other than to those in other groups.
In general, a clustering technique should have the following characteristics to be suitable for use in a big data environment: It should capture clusters of various shapes and sizes, treat outliers effectively and execute efficiently on large data sets.
Most partitional and hierarchical methods that rely on centroid-based approaches to clustering do not work very well in large data sets where the underlying data supports clusters of different sizes and geometry.
Techniques such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can help find clusters with arbitrary shapes. It works by determining the density associated with a point by counting the number of points in a region of a specified radius around that point. Points with a density above a threshold are classified as core points, while noise points are defined as non-core points that don’t have core points within the specified radius. Noise points are discarded and clusters are formed around core points. This very idea of density-based identification of a cluster helps in creating clusters of various shapes.
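The density-based procedure described above can be sketched in a few lines. This is an illustrative, brute-force version that compares every pair of points, so it is not suited to large data sets. The function name, the parameter names `eps` and `min_pts`, and the convention that a point counts toward its own density are assumptions.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Density of a point: how many points (itself included) lie within eps.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n  # None means "not yet assigned"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster_id
                    if is_core[q]:
                        stack.append(q)  # only core points keep expanding
        cluster_id += 1
    # Anything never reached from a core point is noise.
    return [-1 if label is None else label for label in labels]


# Hypothetical data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Because clusters grow by chaining neighboring core points rather than by measuring distance to a centroid, the same code would follow elongated or curved groups as readily as the compact ones used here.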
CURE (Clustering Using Representatives) also does well at capturing clusters of various shapes and sizes, since only the representative points of a cluster are used to compute its distance from other clusters. The clustering algorithm starts with each input point as a separate cluster, and at each successive step merges the closest pair of clusters. The representative points help in capturing the different physical shapes and sizes of the clusters.
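A much-simplified sketch of the merging loop follows. It assumes the usual formulation of CURE, in which each cluster keeps a few well-scattered points that are shrunk toward the centroid by a factor `alpha`. The shrinking step and the parameters `c` and `alpha` come from that formulation rather than from the description above. The sketch also omits the random sampling, partitioning and spatial indexing that a scalable implementation would need, and all function names are illustrative.

```python
from math import dist


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(p[k] for p in pts) / len(pts) for k in range(len(pts[0])))


def representatives(pts, c, alpha):
    """Pick up to c well-scattered points, then shrink them toward the centroid."""
    mean = centroid(pts)
    remaining, chosen = list(pts), []
    while remaining and len(chosen) < c:
        if not chosen:
            best = max(remaining, key=lambda p: dist(p, mean))
        else:
            # Farthest-first: maximize distance to the nearest chosen point.
            best = max(remaining, key=lambda p: min(dist(p, q) for q in chosen))
        remaining.remove(best)
        chosen.append(best)
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in chosen]


def cure(points, k, c=3, alpha=0.3):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance = closest pair of representative points.
                d = min(dist(a, b) for a in reps[i] for b in reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i is unaffected
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha)
    return clusters


# Hypothetical data: two elongated rows of points.
row_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
row_b = [(0, 10), (1, 10), (2, 10), (3, 10)]
print(cure(row_a + row_b, k=2))
```

Measuring distance between several scattered representatives, instead of between single centroids, is what lets the method follow elongated clusters such as the two rows in this example.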
A clustering technique called BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) is effective in managing outliers. It works by first performing a “pre-clustering phase” in which dense regions of points are represented by compact summaries, and then a centroid-based hierarchical clustering algorithm is used to cluster the set of summaries (which is much smaller than the original data set).
Big data has influenced the entire spectrum of analytics, from data ingestion and storage to preparation, modeling and deployment. Organizations are exhibiting rigor in trying to harness big data through innovations in information infrastructure and information management. Core analytics techniques and practices are also changing to accommodate the challenges of volume and variety associated with big data. A new era of volume, velocity and variety is leading the way for value creation in organizations like never before.
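The pre-clustering phase can be sketched with the compact summary usually associated with BIRCH, the clustering feature: a count, a linear sum and a sum of squares. The version below makes two simplifying assumptions. It keeps the summaries in a flat list rather than in a balanced tree, and it absorbs a point when it lies within a fixed distance of the nearest summary's centroid rather than testing the radius of the enlarged summary. The class and function names and the threshold are illustrative.

```python
from math import dist


class ClusteringFeature:
    """Compact summary of a group of points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = sum(x * x for x in point)

    def add(self, point):
        """Absorb one more point without storing the point itself."""
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

    def centroid(self):
        return tuple(s / self.n for s in self.linear_sum)

    def radius(self):
        """Root-mean-square distance of the absorbed points from the centroid."""
        c = self.centroid()
        return max(self.square_sum / self.n - sum(x * x for x in c), 0.0) ** 0.5


def precluster(points, threshold):
    """One pass over the data: absorb each point into a nearby summary or start a new one."""
    summaries = []
    for p in points:
        nearest = min(summaries, key=lambda cf: dist(cf.centroid(), p), default=None)
        if nearest is not None and dist(nearest.centroid(), p) <= threshold:
            nearest.add(p)
        else:
            summaries.append(ClusteringFeature(p))
    return summaries


# Hypothetical data: two dense regions and one far-away point.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (20, 20)]
for cf in precluster(points, threshold=1.0):
    print(cf.n, cf.centroid(), round(cf.radius(), 3))
```

The second phase would then cluster only the summary centroids, weighted by their counts. Summaries that absorb very few points, such as the one created for the far-away point here, are natural candidates to set aside as outliers.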
Dr. Pramod Singh is director of Digital and Big Data Analytics at Hewlett-Packard and a member of INFORMS. Ritin Mathur is senior manager of Big Data Analytics at HP. Arindam Mondal and Shinjini Bhattacharya are data scientists at HP. All four are based in Bangalore, India.
Notes and References
- Levent Ertoz, Michael Steinbach and Vipin Kumar, “DBSCAN: Finding Clusters of Different Sizes, Shapes and Densities in Noisy, High Dimensional Data,” paper, Department of Computer Science, University of Minnesota.
- Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” Proceedings of the ACM SIGMOD Conference, 1998.
- Tian Zhang, Raghu Ramakrishnan and Miron Livny, BIRCH presentation, 2009.