Analytics Magazine

Text Analytics: Two Worlds Collide

May/June 2010


Intersection of two distinct disciplines gleans insight from massive amounts of unstructured data.

By Kathy Lange

“Text analytics” could be the title of the latest mash-up for business analysts. It represents the intersections of two typically distinct disciplines. In our high school and college years, the core graduation requirements were “math and science” and “English and history.” In fact, the skills measured on the College Board exams are categorized as mathematics and reading/writing skills. Most people are stronger in one discipline than the other. To simplify, I’ll use the classic trope, “There are two kinds of people in this world.” There are those who favor words (text) and those who favor math (analytics). Text analytics brings the two disciplines together, creating a collision of two worlds to help businesses thrive and compete like never before.

The last few years have shown an increasing adoption of analytics-powered applications for better decision-making as described in “Competing on Analytics” by Thomas H. Davenport and Jeanne G. Harris and “The New Know: Innovation Powered by Analytics” by Thornton May. Organizations are utilizing data more strategically to uncover hidden relationships to predict what’s likely to happen in the future and take advantage of that foresight. Until now, the majority of data examined in these analyses has been numerical or categorical data that is often housed in relational databases.

What is Text Analytics?

An estimated 75 percent or more of an organization’s data is unstructured (images, content of Web documents, standard documents, audio, video, e-mails, call center or claims notes, etc.). The rate at which this type of data is growing is alarming. We are constantly barraged with digital content in the form of e-mail, news articles, contracts, privacy notices, marketing offers, tweets, blog posts and text messages. What business value could be gained if analytics-powered applications could be augmented with information gleaned from this freeform text?

In the Alta Plana research study “Text Analytics 2009,” Seth Grimes defines text analytics as the use of computer software to automate:

  • Annotation and information extraction from text — entities, concepts, topics, facts and attitudes.
  • Analysis of annotated and extracted information.
  • Document processing — retrieval, categorization and classification, and derivation of business insight from textual sources.

The most important piece of this definition is “the derivation of business insight.” In fact, that is the value of business analytics in general, regardless of the source of the data (textual, numerical, categorical, etc).

With text analytics, we are able to extract meaning out of large quantities of various types of textual information. We combine it with structured data within an automated and repeatable process to enhance business insight.

Text Analytics Addresses Business Challenges

Text analytics addresses two major business challenges. The first is information organization and “findability” of the content within documents. The other is discovery of trends and patterns to allow foresight from textual information.

Information Organization and Access: Improving Search

When we think about finding information, the first thing that comes to mind is search. As any millennial will tell you, there’s nothing you need to remember — you just Google it!

We’re all suffering from information overload. We can’t consume the volume of information that we receive every day, so if we think it might be valuable, we try to file it in a safe place for use at some time in the future. When the time comes to retrieve bits of digital information previously stored by us or by others in our organization, we search. Gartner estimates that information workers spend up to 30 percent of their working day looking for information they need.

The content in our organizations’ documents represents knowledge from our business and may include customer or supplier information, operational information and learning over time. It may contain facts, opinions, discussions and intellectual property. In our search, we may be able to easily obtain information about document properties (information about the author or when the document was created or modified, for example). But it is much more difficult to determine what the document is about without opening it up and reading it.

Search engines do a few basic things: search parts of the Internet or the enterprise based on important words; keep an index of the words they find and where they find them; and allow users to query words and match them to those that have been put in the index.
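The index-and-query steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular search engine works: the function names `build_index` and `search` and the sample documents are invented for the example, and the query logic simply requires every query word to appear in a document.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, invented for illustration.
DOCS = {
    "memo-1": "Brake failure reported on the left wheel",
    "memo-2": "Customer left voice mail about engine failure",
    "memo-3": "Air bag deployment during sudden acceleration",
}
INDEX = build_index(DOCS)
```

Calling `search(INDEX, "engine failure")` looks up each word in the index and intersects the results, so only documents containing both words are returned. Real engines add ranking, stemming and metadata on top of this basic structure.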

Most large companies have webmasters who focus much of their time associating words with documents (tagging them or creating metadata) on their Web pages in a way that allows them to be indexed by Internet search engines and ranked so they appear on the first or second page of the search results. With documents inside the enterprise firewalls, the same rigor is typically not applied to document tagging to make enterprise search engines more efficient, so information tends to be more difficult to find. The priorities of most businesses are directed toward generating revenue and customer-facing projects rather than more internally focused activities. In these days of economic instability, there has been a renewed focus on worker productivity and knowledge retention and management.

Enterprise search has again become a priority for many organizations. The goal is to create processes to organize the abundance of documents, spreadsheets, presentations and e-mail that is being generated, so that the information contained within them is more easily accessible by those who need it. Taking the lead from lessons learned on the Internet, the key is associating words with documents that provide information about the content that resides within them. Better search within the enterprise can also be enabled with better strategies for tagging the content. Tools for automatically categorizing content that create and maintain rich metadata associated with the documents are becoming a central piece of a successful enterprise search deployment.

The Process of Analyzing Text

While collecting and searching for information is very important, it’s only the first step in gaining business value.

The process of performing analysis on text to discover insights turns out to be similar to analyzing traditional data types. First, you need to explore the documents. This might be in the form of simple word counts in a document collection or manually creating topic areas to categorize documents by reading a sample of them. For example, what are the major types of issues that have been identified in recent automobile warranty claims (brake failure, air bag deployment, engine failure, sudden acceleration)? As part of the exploration effort, you will likely find misspelled or abbreviated words, acronyms or slang terms.
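As a rough sketch of the "simple word counts" style of exploration, the Python below tallies terms across a small document collection. The stop-word list, the function name `term_frequencies` and the sample claims are all assumptions made up for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real projects use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "on", "of", "to", "was", "is", "in"}


def term_frequencies(documents):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


# Invented warranty-claim snippets, for illustration only.
CLAIMS = [
    "Brake failure on the highway",
    "Engine failure and brake noise",
    "Air bag deployment was delayed",
]
COUNTS = term_frequencies(CLAIMS)
TOP_TERMS = COUNTS.most_common(2)
```

Even a tally this simple surfaces candidate topic areas ("brake" and "failure" here) and tends to expose the misspellings and abbreviations that need attention before deeper analysis.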

Prior to more advanced analysis or automated categorization, the text will likely need to be preprocessed to address data quality issues and standardization so the analysis will produce more accurate results. As in traditional types of analysis, up to 80 percent of the time can be spent in the data preparation phase of a project. In addition to correcting misspelled words, much of the data preparation involves standardization or transformation to a consistent set of terms or identifying specific ideas or concepts. For instance, you might want to standardize “LOL” to “laugh out loud,” or expand abbreviations in patients’ medical files from “upr” to “updated patient record” or “lvm” to “left voice mail” for better consistency in analysis.

The contact center may be conducting a loss-prevention campaign to customers who are delinquent on payments. It may be interested in tracking how many attempts were made to contact each customer. The contact center would want to be able to associate “lvm” and “left voice mail” both as unsuccessful attempts to contact. It might also want to classify “no answer,” “bad number” and “phone no longer in service” with unsuccessful attempts. In this manner, synonym lists or business rules would be generated to match against text files to identify various results that would be categorized as “unsuccessful attempts to contact customer.”
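A minimal sketch of such a synonym list in Python follows. The abbreviation table and phrase list come from the examples above, while the function names and the simple substring matching are assumptions for illustration; a production rule set would be maintained by subject matter experts and would be far larger.

```python
import re

# Abbreviations to expand before matching (examples taken from the text).
ABBREVIATIONS = {
    "lvm": "left voice mail",
    "upr": "updated patient record",
    "lol": "laugh out loud",
}

# Phrases that all count as an unsuccessful attempt to contact the customer.
UNSUCCESSFUL_PHRASES = [
    "left voice mail",
    "no answer",
    "bad number",
    "phone no longer in service",
]


def expand_abbreviations(note):
    """Lowercase the note and replace known abbreviations with full phrases."""
    def replace(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word, word)

    return re.sub(r"[a-z]+", replace, note.lower())


def is_unsuccessful_attempt(note):
    """True if the standardized note contains any unsuccessful-contact phrase."""
    text = expand_abbreviations(note)
    return any(phrase in text for phrase in UNSUCCESSFUL_PHRASES)


def count_unsuccessful(notes):
    """Number of notes categorized as unsuccessful attempts to contact."""
    return sum(is_unsuccessful_attempt(note) for note in notes)
```

For instance, `count_unsuccessful(["LVM at 3pm", "Spoke with customer", "No answer"])` treats the first and third notes as unsuccessful attempts, because "LVM" is expanded to "left voice mail" before the phrase list is checked.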

Categorizing the content of documents effectively relies heavily on the ability to understand the meaning of the words in text, individually as well as in context. Each language has its own nuances, and within those languages jargon exists that may be unique to companies, to geographic areas, or among functional areas across companies. Who among us has not had to ask co-workers or business associates to explain an unknown acronym or saying? How should one interpret “blue screen of death”? Should it be associated with a computer “crash” or “hang”? The human resources department might talk about layoffs as “downsizing,” “right-sizing,” “employee turnover” or “planned attrition.” In order to analyze information about layoffs that resides in text, HR needs to associate all terms that might be used to refer to layoffs. Typically, subject matter experts are required to identify and interpret the unique terminology from a particular domain.

Language can be ambiguous. Words in and of themselves may have different meanings when used in the context of a sentence. Some words can be used as different parts of speech. In a warranty claim, you may be interested in defects in the “left” wheel of the automobile but not concerned that the driver “left” the window open. The order of the words in the sentence may also be significant. For an insurance claim, it is extremely important to know if “Driver 1 hit Driver 2” or “Driver 2 hit Driver 1.” Notice that the words are identical, but the order makes a huge difference for the insurance company in assessing fault for the accident.
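The insurance example can be made concrete with a short Python sketch. A pure word count (a "bag of words") cannot tell the two sentences apart, whereas a rule that respects word order can. The regular expression and the `striker`/`struck` field names are assumptions invented for this illustration.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Count words while ignoring the order in which they appear."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


# Hypothetical order-aware rule: "<driver N> hit <driver M>".
PATTERN = re.compile(r"(driver \d+) hit (driver \d+)", re.IGNORECASE)


def who_hit_whom(sentence):
    """Extract which driver struck which, or None if the pattern is absent."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {"striker": match.group(1).lower(), "struck": match.group(2).lower()}
```

Both sentences produce the same word counts, yet `who_hit_whom` assigns the roles differently for each, which is the distinction that matters when assessing fault.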

Categorizing documents from information contained within them can be achieved through a combination of statistical models and business rules. As with traditional model development, sample documents are examined to train the models. Additional documents are then processed to validate the accuracy and precision of the model, and finally new documents are evaluated using the final model (scored). Models can then be put into production for automated processing of new documents as they arrive. The models can be monitored for continued performance.
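The train, validate and score cycle might look like the following in Python. The article does not name a specific statistical model, so this sketch substitutes a small multinomial Naive Bayes classifier written from scratch; the categories, sample claims and function names are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def score(model, text):
    """Return the most probable category for a new document."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for label, doc_count in model["labels"].items():
        logp = math.log(doc_count / total_docs)
        label_total = sum(model["words"][label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing keeps unseen words from zeroing out.
            seen = model["words"][label][word]
            logp += math.log((seen + 1) / (label_total + vocab_size))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def accuracy(model, labeled_docs):
    """Share of held-out documents whose predicted category is correct."""
    correct = sum(score(model, text) == label for text, label in labeled_docs)
    return correct / len(labeled_docs)


# Invented sample claims: a training set and a separate validation set.
TRAINING = [
    ("brake pedal failure at low speed", "brakes"),
    ("brake noise when stopping", "brakes"),
    ("engine stalled and would not start", "engine"),
    ("engine overheating on highway", "engine"),
]
VALIDATION = [
    ("brake failure when stopping", "brakes"),
    ("engine would not start", "engine"),
]
MODEL = train(TRAINING)
```

Here `accuracy(MODEL, VALIDATION)` plays the role of the validation step, and `score(MODEL, new_text)` is what would run in production as new documents arrive. Monitoring amounts to recomputing accuracy on freshly labeled samples over time.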


Some businesses may have a predetermined categorization structure that they use to organize their documents. For instance, the technical support center might have a list of the top 10 product defects identified through customer interactions. Additional or emerging trends could be identified through the use of text mining to discover unknown issues from patterns discovered in the technicians’ notes. In this manner, emerging defects could be quickly identified before they become top 10 issues, or previously unnoticed defects that should have been in the top 10 could be uncovered.

Associated terms or concepts such as “unsuccessful attempts to contact customer” can be used as input to text mining models to refine the analysis and produce better, more interpretable results. New topics discovered by text mining models can augment rule-based categorization models. The processes can work in harmony to continually learn.

Often text data such as surveys, product reviews and call center notes are accompanied by numeric data (3 stars out of 5), demographic data (city, state, gender, income level), customer purchase amount or monthly service usage data. By integrating the text into the predictive model, the models can often produce better results. In the case of identifying customer churn (customers likely to go to a competitor or drop your product or service), strong indicator words or concepts could be identified as a significant predictor of behavior for a specific customer segment. The model might indicate that if a “platinum level” customer mentions a competitor’s lower price to the service-center representative, he is 25 percent more likely to cancel his service with your company.
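To show how a text-derived indicator can sit alongside structured fields, the Python sketch below flags notes that mention a competitor's price and compares churn rates within a customer segment. Everything here is hypothetical: the records are synthetic and were chosen so the arithmetic reproduces a 25 percent lift, and the keyword test stands in for whatever concept extraction a real model would use.

```python
def mentions_competitor_price(note):
    """Crude text flag: the note mentions a competitor together with price."""
    text = note.lower()
    return "competitor" in text and "price" in text


def churn_rate(records):
    """Fraction of records in which the customer churned."""
    if not records:
        return 0.0
    return sum(r["churned"] for r in records) / len(records)


def relative_lift(records, segment):
    """How much more likely flagged customers are to churn than unflagged ones."""
    in_segment = [r for r in records if r["segment"] == segment]
    flagged = [r for r in in_segment if mentions_competitor_price(r["note"])]
    others = [r for r in in_segment if not mentions_competitor_price(r["note"])]
    base = churn_rate(others)
    if base == 0:
        return None
    return churn_rate(flagged) / base - 1.0


def make_records(note, churned, count):
    """Build synthetic platinum-level records for the illustration."""
    return [
        {"segment": "platinum", "note": note, "churned": churned}
        for _ in range(count)
    ]


# Synthetic data: 5 of 8 flagged customers churn versus 4 of 8 unflagged.
RECORDS = (
    make_records("Mentioned a competitor's lower price", True, 5)
    + make_records("Mentioned a competitor's lower price", False, 3)
    + make_records("Asked about the billing date", True, 4)
    + make_records("Asked about the billing date", False, 4)
)
LIFT = relative_lift(RECORDS, "platinum")
```

With these made-up counts, flagged customers churn at 62.5 percent against a 50 percent baseline, so `LIFT` works out to 0.25. A real predictive model would estimate this effect jointly with the numeric and demographic variables rather than from a single comparison.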

Sentiment Analysis

A growing area of interest for many firms is the understanding of what the market (customers, analysts, or key opinion leaders) is saying about their products and services. Many refer to this area of analysis as sentiment analysis. They want to understand more about people’s opinions, attitudes and emotions when discussing their products, services or overall brand. Textual information is typically accumulated in the form of surveys, technical support notes or reviews on third-party Web sites. Studies show that consumers are much more trusting of other users’ opinions than of the marketing collateral produced by the manufacturer.

From the company perspective, listening to and analyzing what people are saying about your products and services is the first step in creating a dialogue with that audience: listen, learn and then engage. This dialogue can better inform targeted marketing initiatives to customers and prospects, enabling the organization to communicate at a significantly lower cost and with greater speed and effectiveness than traditional marketing. It can also enable a more rapid response to perceived customer issues and competitive threats.

Sentiment analysis builds on both human-defined business rules and computer-generated statistical models to automatically identify positive and negative sentiments expressed in free-form text. Keywords are identified in the context of the text to classify positive and negative concepts. Text might explicitly contain words that identify the mood or sentiment, for instance, “mad” or “angry.” You may also be able to infer attitudes and emotions based on action words or phrases such as “screamed” or “hung up.” Some phrases might be more subtle, like “not impressed by” or “answers were unsatisfactory.” And some words can be ambiguous until associated with a product feature or service. In the context of a computer, “long” battery life would be a positive attribute, but “long” boot time would be a negative attribute.
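The rule-based half of that approach can be sketched in Python as follows. The word lists borrow the examples above plus a few invented positive terms, and the scoring scheme (plus one or minus one per match) is an assumption chosen for simplicity, not a description of any commercial product.

```python
import re

# Hypothetical lexicons; the negative terms come from the examples in the text.
NEGATIVE_TERMS = ["mad", "angry", "screamed", "hung up", "not impressed", "unsatisfactory"]
POSITIVE_TERMS = ["happy", "pleased", "great", "excellent"]

# Ambiguous words resolved by the feature they modify: +1 positive, -1 negative.
CONTEXT_RULES = {
    ("long", "battery life"): 1,
    ("long", "boot time"): -1,
}


def count_matches(term, text):
    """Count whole-word (or whole-phrase) occurrences of a term in the text."""
    return len(re.findall(r"\b" + re.escape(term) + r"\b", text))


def sentiment_score(text):
    """Sum positive matches, subtract negative ones, then apply context rules."""
    lowered = text.lower()
    score = 0
    for term in POSITIVE_TERMS:
        score += count_matches(term, lowered)
    for term in NEGATIVE_TERMS:
        score -= count_matches(term, lowered)
    for (modifier, feature), polarity in CONTEXT_RULES.items():
        pattern = r"\b" + re.escape(modifier) + r"\s+" + re.escape(feature) + r"\b"
        if re.search(pattern, lowered):
            score += polarity
    return score


def classify(text):
    """Label the text positive, negative or neutral from its score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Under these rules, "long battery life" is classified as positive and "long boot time" as negative, even though the keyword "long" is the same in both. Statistical models would typically be layered on top to catch phrasing the rules miss.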

Early projects with sentiment analysis may start by categorizing positive and negative dimensions of the feedback and counting how many responses were positive or negative with regard to a particular product, feature or service. More advanced projects may follow sentiment over time to identify trends and patterns, or compare the sentiment toward a company’s product with the sentiment toward a competitor’s product.
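Once feedback has been labeled, the counting and trending described here is straightforward. The Python sketch below assumes each piece of feedback has already been reduced to a date, a product name and a sentiment label; the record layout, product names and sample data are invented for the example.

```python
from collections import Counter, defaultdict


def net_sentiment_by_month(feedback):
    """Return positive minus negative counts keyed by (product, "YYYY-MM")."""
    tallies = defaultdict(Counter)
    for iso_date, product, label in feedback:
        month = iso_date[:7]  # "2010-03-15" -> "2010-03"
        tallies[(product, month)][label] += 1
    return {
        key: counts["positive"] - counts["negative"]
        for key, counts in sorted(tallies.items())
    }


# Hypothetical, already-classified feedback records.
FEEDBACK = [
    ("2010-03-02", "our_product", "positive"),
    ("2010-03-15", "our_product", "negative"),
    ("2010-03-20", "our_product", "positive"),
    ("2010-04-01", "our_product", "negative"),
    ("2010-04-09", "competitor_product", "positive"),
]
TREND = net_sentiment_by_month(FEEDBACK)
```

Reading `TREND` month by month shows whether net sentiment for a product is rising or falling, and placing two products side by side gives the competitive comparison.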

Public relations firms use sentiment analysis to track positive and negative mentions of their company, company officials or products in the media or other publications (pharmaceutical brands, for instance). They try to determine if their marketing messages are being interpreted positively or negatively and what effect that has on their brand in the marketplace.


Business leaders committed to fact-based decision-making are recognizing the power hidden in text to yield insight into marketing, customer service, public relations, product innovation and competition. Techniques for analyzing voice and video and other unstructured content will be more commercially available in the near future, rather than only the subject of research papers.

Today text analytics provides forward-thinking organizations with a framework to maximize the value of information within large quantities of text. This technology helps automate the process by extracting relevant information and interpreting, mining and structuring information to improve findability and reveal patterns, sentiments and relationships among documents. Two worlds have collided, and clever decision-makers will begin to see the biggest bang for their buck.

Kathy Lange is a senior director in SAS’ Business Analytics Practice. She has more than 25 years of experience selling and implementing analytics solutions. The SAS Business Analytics Practice assists customers in defining their business problems and crafting strategies for solving those problems with integrated SAS solutions that include business intelligence, data integration and advanced analytics.

Copyright © 2010, SAS Institute Inc. All rights reserved. A version of this article appeared in BeyeNetwork. Reprinted with permission.

