

Analytics Magazine

There’s no such thing as unstructured data

Four keys to giving structure to your unstructured data initiatives.

By Chuck Densinger and Mark Gonzales

It’s long been popular to talk about customer interaction data such as clickstream, social activity, inbound email and call center verbatims as “unstructured data.” Wikipedia says of the term that it “…refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner” [1]. Thus, unstructured data is a term used to describe data that does not conform to a typical relational database structure.

There’s valuable customer knowledge in this so-called unstructured data. But if it’s unstructured, how should consumer-facing organizations handle this flow of large amounts of data? This type of data is often verbose and carries relatively low-density insight value: you have to sift through a lot of it to find the nuggets of value. As a result, it’s often not worth keeping all of it in its raw form forever.

We can start by realizing that unstructured data are, in fact, not unstructured at all. They are highly structured, whether in the form of XML, delimited files or plain text. Even natural language is highly structured, if complex and nuanced.
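Even plain text usually carries latent structure that a simple parser can recover. As an illustrative sketch (the log format and field names here are hypothetical), a few lines of Python turn raw web-server log lines into structured records:

```python
import re

# Hypothetical raw log lines -- "unstructured" plain text with a clear latent structure.
raw_lines = [
    '192.168.1.10 - [2018-11-05 14:22:31] "GET /products/laptops HTTP/1.1" 200',
    '10.0.0.7 - [2018-11-05 14:23:02] "POST /cart/add HTTP/1.1" 302',
]

# One regular expression exposes the structure: client IP, timestamp,
# HTTP method, path and status code.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+) [^"]+" (?P<status>\d+)'
)

def parse_line(line):
    """Turn one raw log line into a structured dict (or None if it doesn't match)."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

records = [parse_line(line) for line in raw_lines]
print(records[0]["path"])    # /products/laptops
print(records[1]["method"])  # POST
```

The point is not the regular expression itself but the principle: the structure was always there; the parser merely makes it explicit.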

What is increasingly happening is that this data is being loaded into unstructured, or at best lightly structured, storage. What used to be piles of web log files and other blobs of data are now stored in forms that an ever-increasing set of tools can parse, structure and analyze.

Actionable insights and derived data can be obtained from this data in ways that do not require the development of new data management systems, nor the acquisition of new or difficult technical skills across your organization. Not everyone needs to be – and in fact not everyone should be – an expert in analyzing “unstructured data” to make your organization data-enabled.

Figure 1: Unstructured data processing and usage.


The Elephant in the Room

The genesis of the term “unstructured data” is an implicit assumption that the only way to structure data is in a traditional relational database, with tables and indices, organized in rows and columns, defined and accessible by SQL. SQL-based relational databases are so ubiquitous, and so useful, that they are almost certain to be encountered in every organization. With SQL as the “lingua franca” of data storage, transaction processing, analytics and reporting, data that doesn’t easily conform to SQL-based structures is problematic. Hence, “unstructured.”

Enter Hadoop, the open-source software framework. Hadoop promises to make dealing with so-called unstructured data easy, by allowing all this messy data to simply be stored as-is, in a loose collection of files on commodity servers, which can be interrogated by the MapReduce programming model to answer business questions. Almost every organization we’ve worked with in the past couple of years has some form of Hadoop project in flight or up and running. Heck, so do we at Elicit.
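The MapReduce model itself is easy to illustrate. Here is a minimal in-memory sketch of its three phases – map, shuffle, reduce – counting words across call-center verbatims (the input records are invented; a real Hadoop job runs the same phases distributed across many machines):

```python
from collections import defaultdict

# Toy input: one record per line, as a job might read them from the data lake.
verbatims = [
    "customer upset about late delivery",
    "delivery delayed customer wants refund",
    "refund issued customer satisfied",
]

# Map phase: emit a (key, 1) pair for each word.
mapped = [(word, 1) for line in verbatims for word in line.split()]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: combine each key's values into a single answer.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["customer"])  # 3
print(counts["delivery"])  # 2
```

Everything Hadoop adds – fault tolerance, distribution, scheduling – wraps around this simple pattern, which is why the paradigm feels so different from SQL.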

The data stored in Hadoop is often referred to as a “data lake,” and this framework is used to perform two broad classes of tasks: (1) answer questions about the data for analytics or operational purposes, and (2) serve as the ingestion point for data that is combed through for useful bits that are fed via ETL processes into relational databases.

But Hadoop can raise its own challenges. The paradigm is so different from traditional database management – at both the software and hardware levels – that it can be challenging for organizations to acquire the talent needed to stand it up and manage it [2]. Though third-party solutions providers such as Amazon Web Services, Cloudera, EMC and others can simplify some of this, there’s no avoiding the learning curve of really understanding and leveraging Hadoop’s capabilities.

Too often, we see data lakes become data swamps, with poor understanding by business users of what’s available in the muck of the data store or how to make use of it.

But lakes do not have to become swamps. The set of things you need to know from your unstructured data is usually bounded. Most organizations do not face the effectively infinite horizon of, say, the search terms and results that Google must store; they need to profile customers and to understand interactions on the website or with mobile applications.

Most organizations have a team of analysts and data scientists whose job is to mine unstructured data looking for patterns and a much larger set of business users who will make use of those patterns in business operations and marketing. What we need to do is to make sure that those data scientists uncover the structure in the unstructured data – and make those structures available for the wider organization to use.

Four Keys to Working with Unstructured Data

Unstructured data and unstructured storage do not replace structured data and SQL databases; they are tools in our expanded toolkit. The goal is to make use of both structured and unstructured data to enhance what you know about your customers and their behaviors – and to make that knowledge available across your organization. Here are four keys we’ve found that can give structure to your unstructured data initiatives.

1. Don’t treat your business users like data scientists and vice versa. Data mining and using the results are very different classes of activities. Your data scientists need access to both structured and unstructured data stores and the analytical tools (such as RStudio, Hive and others) to process them. Your business and operational user community will be larger and will tend to rely on analytical and operational SQL stores … and on well-established tools that use this type of store.

Ultimately, the goal of your data science team is to uncover patterns, unearth findings and develop predictive models that can be applied by your business teams in operations and marketing. Once uncovered, these patterns and findings are used to inform the structure and objects of your customer-oriented analytical data warehouse, which your operational and business users access to make use of these findings (see Figure 1).

Our rule of thumb: Use the insights developed by your data scientists to inform how to bring elements of unstructured data into your structured analytical data warehouse – with the goal of keeping as much visibility to all of the individual customer’s choices and behaviors as possible. Use that warehouse to build descriptive and predictive models, and as you become more advanced, to train business rules that will be used in real time to interact with customers. Feed those key model outputs into real-time systems for use in the moment when interacting with customers.

2. Transform the unstructured data into traditional structured elements. That is, load the valuable bits into tables and attributes. Which data should be transformed into elements in your data warehouse will be guided by your data science team, as well as by input from your business user community. You won’t need to – and won’t be able to – pre-process every bit of data coming into your enterprise, or store it all permanently. But for data that pertains to specific high-priority use cases and business challenges, this approach is manageable and effective.
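A minimal sketch of this transformation, assuming a hypothetical review feed and warehouse schema, with Python’s built-in sqlite3 standing in for the analytical warehouse:

```python
import sqlite3

# Hypothetical semi-structured records skimmed from the data lake.
lake_records = [
    {"customer_id": 101, "event": "product_review", "rating": 5, "text": "Great laptop"},
    {"customer_id": 102, "event": "page_view", "path": "/home"},
    {"customer_id": 101, "event": "product_review", "rating": 2, "text": "Charger failed"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_reviews (customer_id INTEGER, rating INTEGER, review_text TEXT)"
)

# Keep only the valuable bits -- here, review events -- as structured rows.
for rec in lake_records:
    if rec["event"] == "product_review":
        conn.execute(
            "INSERT INTO customer_reviews VALUES (?, ?, ?)",
            (rec["customer_id"], rec["rating"], rec["text"]),
        )

# Once structured, the data answers business questions with ordinary SQL.
avg = conn.execute(
    "SELECT customer_id, AVG(rating) FROM customer_reviews GROUP BY customer_id"
).fetchall()
print(avg)  # [(101, 3.5)]
```

Note that the page-view record is simply skipped: only the elements the business has asked for are promoted into the warehouse.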

Over time, data will become more extensible. If customer attributes and model outputs are structured appropriately, additional elements can be added as business needs change, your capabilities mature, more data becomes available and your data scientists develop new insights.

3. Take a use-case-driven approach to implement specific operational uses of unstructured data. For instance, you may find that the simple act of leaving a product review signals a level of engagement that makes a customer’s predicted lifetime value much higher. Use that fact to implement specific customer treatments for people who write reviews: recognize and reward them for it, and treat those who have written reviews differently from those who haven’t.
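Once the review fact lives in the warehouse, the treatment rule itself can be very simple. As an illustrative sketch (the field names and segments are hypothetical), deriving a treatment segment from structured review counts:

```python
# Hypothetical customer records already loaded into the warehouse.
customers = [
    {"customer_id": 101, "reviews_written": 3, "predicted_ltv": 450.0},
    {"customer_id": 102, "reviews_written": 0, "predicted_ltv": 120.0},
    {"customer_id": 103, "reviews_written": 1, "predicted_ltv": 300.0},
]

def treatment_segment(customer):
    """Assign a marketing treatment: reviewers get recognition, others a nudge."""
    if customer["reviews_written"] > 0:
        return "reward_reviewer"
    return "encourage_first_review"

segments = {c["customer_id"]: treatment_segment(c) for c in customers}
print(segments[101])  # reward_reviewer
print(segments[102])  # encourage_first_review
```

The hard work – discovering that reviews predict lifetime value – belongs to the data scientists; the rule that operationalizes the finding is deliberately this simple.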

Another example: You may have found that getting accurate data on inbound calls to the call center is problematic. What was the real cause for the call? Which product or service was the issue? Was the caller upset or calm? Was their problem resolved? Traditionally, we may have used post-call surveys to ask the customer these questions. But there are solutions available now that will analyze both the spoken words during the call and the verbatims captured by the agent to glean the answers to these questions. The output: attributes and metrics easily captured at the customer level and aggregated as needed.
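Commercial speech and text analytics products do this far more robustly, but the shape of the output – a handful of customer-level attributes per call – can be sketched with a crude keyword approach (the categories and keyword lists here are entirely hypothetical):

```python
# Crude keyword-based scoring of call-center verbatims -- a stand-in for the
# commercial speech/text analytics solutions described in the text.
CALL_REASONS = {
    "billing": {"invoice", "charge", "bill", "refund"},
    "shipping": {"delivery", "shipment", "late", "tracking"},
}
NEGATIVE_WORDS = {"upset", "angry", "frustrated", "unacceptable"}

def score_verbatim(text):
    """Return (call_reason, is_upset) attributes for one agent verbatim."""
    words = set(text.lower().split())
    reason = next(
        (name for name, kws in CALL_REASONS.items() if words & kws), "other"
    )
    return reason, bool(words & NEGATIVE_WORDS)

print(score_verbatim("Customer upset about late delivery"))  # ('shipping', True)
print(score_verbatim("Question about invoice amount"))       # ('billing', False)
```

Whatever the engine, the destination is the same: attributes captured at the customer level, ready to aggregate.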

4. Leverage purpose-built tools and services to pre-process these types of data. Such tools extract their meaning into a form that can be stored in a database. For instance, a typical clickstream file is verbose and very large. Store that data for a month or two. But as it comes in, parse it for key bits of relevant data at a customer level – create an entry in your data warehouse for each session, tracking key facts such as source of traffic, session length, items/pages browsed and conversion data. Look for missions. For instance, a series of page and product views may all be distilled into “shopping for a new computer.” Building such a parsing engine is a relatively simple matter, as clickstream data is typically fed in XML-structured files.
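A minimal sketch of such a parsing engine, assuming a hypothetical XML session format (real clickstream feeds vary by vendor), using Python’s standard-library XML parser:

```python
import xml.etree.ElementTree as ET

# A hypothetical clickstream session in XML form.
session_xml = """
<session id="s-829" customer_id="101" source="email_campaign">
  <pageview path="/laptops" seconds="42"/>
  <pageview path="/laptops/model-x" seconds="95"/>
  <pageview path="/cart" seconds="12"/>
  <purchase order_id="o-1"/>
</session>
"""

root = ET.fromstring(session_xml)

# Distill the verbose stream into one warehouse-ready session record:
# traffic source, session length, pages browsed, conversion.
record = {
    "session_id": root.get("id"),
    "customer_id": root.get("customer_id"),
    "traffic_source": root.get("source"),
    "pages_viewed": len(root.findall("pageview")),
    "session_seconds": sum(int(p.get("seconds")) for p in root.findall("pageview")),
    "converted": root.find("purchase") is not None,
}
print(record["pages_viewed"])     # 3
print(record["session_seconds"])  # 149
print(record["converted"])        # True
```

One compact row per session replaces megabytes of raw clickstream, and the raw file can be aged out after a month or two as suggested above.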

Text analytics processors can also be employed to extract the meaning of social or product review comments, or call center verbatims. At Elicit, we use open source R packages for a lot of this. There are some great commercial products available for this purpose as well.

Now, log the fact that these events occurred, along with the results of pre-processing analytics. Over time, you’ll have a rich view of your customers’ activities, attitudes and interests, without having to dive into your data lake to get at it.

Key Takeaways

Start by implementing a crawl-walk-run program for making better use of so-called unstructured data to inform marketing, customer experience and relationship management. Look for opportunities to realize benefits early and regularly.

The technical excitement generated by Hadoop and unstructured data stores has been a classic shiny object. We’ve even seen some organizations that are so swept up in the data moment that they have looked at moving away from SQL stores entirely. Thousands of other organizations have created unstructured stores, sometimes as experiments, and often without much planning or real understanding of what they’re biting off.

The best application of unstructured stores in today’s business environment is in conjunction with traditional SQL stores. Organizations that have successfully integrated these stores into the fabric of their businesses are doing so in a methodical and well-thought-out manner:

• Stand up unstructured stores with a focus on your data scientists.
• Begin with a few select data feeds, ensuring that the incoming data is clean.
• Let your data scientists characterize the data and look for patterns. Use those patterns to determine what data should be extracted from the unstructured stores and placed into your data warehouse.
• Get feedback from your business user community about the usefulness of the extracted data in operations and marketing. Use this feedback to determine the priority of new feeds, and to help direct the work of the data scientists.
• Lather, rinse, repeat, growing your use of unstructured data on a use case by use case basis.

Customers do not know, and don’t care, how you keep track of their interactions with your organization. But they do care, and are expecting, that you remember them and their actions. Both structured and unstructured data are important to maximizing the value of customer relationships – and customers are increasingly expecting them to be used … and used effectively and appropriately.

Chuck Densinger is co-founder and chief operating officer and Mark Gonzales is senior customer technology consultant for Elicit, a customer science and strategy consultancy that helps clients uncover latent insights about their customers, and apply those insights to business, marketing, product, loyalty, brand and customer experience strategy. The company’s Fortune 500 clients include Southwest Airlines, HomeAway, Fossil, GameStop, Sephora, BevMo! and Pier 1 Imports.

1. Wikipedia, “Unstructured data,” quoted above.
2. Much has been written on this topic; Hadoop does not work “automagically,” and requires investment.