Text analytics: Deriving value from tweets, blog posts and call center logs.
By Andy Flint
Data is the lifeblood of analytics. Yet some of the richest data sources now available to us will not easily reveal the insights they carry, because they don’t readily lend themselves to our traditional algorithmic approaches. In fact, much of big data isn’t what we have historically thought of as “data” at all. As much as 80 percent of the world’s big data is unstructured, meaning it doesn’t fit neatly into the columns and rows that feed most analytic models. For instance, new sources of text – blogs, comment streams, device logs, chat sessions with customer service reps, Twitter feeds and other social media posts – are all proliferating, but aren’t easily analyzed using simple regression models or decision trees. However, the body of techniques known collectively as text analytics can help draw insights from these sources by translating messy, complex textual information into signals that can sharpen insight into customer behavior and even help refine predictive models.
Although the field is hardly new, text analytics has recently achieved a level of developmental maturity that puts it on the cusp of widespread use. In its “Hype Cycle for Big Data” report (July 2013), Gartner positions text analytics as delivering great business benefit and projects mainstream adoption within two to five years. Many opportunities exist to draw actionable insights from textual sources alone, and still more to combine text analytics with other techniques for structured and unstructured data.
Breaking Down Barriers to Adoption
Recent technological advances are removing obstacles that have kept practitioners from deploying text analytics in the past:
- Transforming for machine insights. A variety of machine learning methods can determine what text data is about, and classify or quantify it for further analysis. These methods can discover customer characteristics in text data and transform them into structured numerical inputs that are comprehensible to predictive models and other traditional analytic algorithms.
- Handling complexity. Unstructured and semi-structured text sources (e.g., XML files, Excel spreadsheets, weblogs) are inherently complex and may contain a vast range of content on a wide range of topics. Moreover, the potential value of text analysis is often increased by combining text of different types from multiple sources. Such a comprehensive approach may reveal complex, subtle customer behavior patterns not evident in smaller, more homogeneous document sets. But the task of collating, regularizing and organizing diverse data was daunting (if not impossible) before today’s advanced technologies. We now have the analytic techniques and data infrastructures to merge disparate varieties of text into a comprehensive analysis.
- Facilitating engineering, deployment, management and regulatory compliance. While text and the process of analyzing it can be quite complex, the results need to be simple to understand and use. Today we can bring new insights from text analysis into predictive scorecards, for example, maintaining all the advantages of transparency, simplicity and accuracy that scorecards provide. The resulting model can achieve greater predictive power, but it can still be engineered to meet specific business needs and regulatory requirements. It’s easily deployed into rules-driven decision processes and operational workflows. It’s easy to manage, including tracking performance, automating updates and measuring the impact of changes. The contribution of text analysis is transparent, explainable to regulators and documentable through automated audit trails and regular validation and compliance reporting.
Figure 1: Example of sorting through text analytics possibilities to select the right approach for your purpose.
Innovative Techniques Emerging
The right analytic technique depends on the data analysis project at hand. Some projects have a clearly defined objective (can I predict a future outcome with any reliability?); others are just trying to derive insights from a mass of data (what can I possibly learn by analyzing these historical data?). Where the outcome is known – for example, credit card fraud, customer profitability or responsiveness to offers – we use that outcome to direct the search for terms (words or phrases) that have real signal strength. That is, we find the terms that most strongly and reliably correlate to one of those known outcomes.
• A term document matrix lists all the unique terms in the text we are examining, across all the cases (or documents) in the analysis. This simple but often very large intermediate result provides the foundation for further analysis. How do the frequencies of specific terms differ between customers who later purchased product X and those who did not? This takes us to a reduction step, where we sort the words or phrases to be formally modeled from weakest to strongest, based on their signal strength. The presence and frequency of these extracted words and phrases can be indicated, numerically, in a new column within our modeling dataset, and directly incorporated into the search for an optimal new predictive model.
This “semantic scorecard” approach uses traditional scorecard methodology (classifying data into characteristics and attributes, assigning weights to attributes, totaling the attributes to produce a score), augmented with unstructured information. The challenging part is crunching the data to identify which words and phrases have the greatest signal strength, and which sets of terms should be combined to form (and thus later detect) larger concepts within the raw text. Human languages are dense with copious synonyms and colorful idioms, and the right analytic tools help you detect and make sense of the raw information.
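The term document matrix and the reduction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample notes and purchase outcomes are made up, the tokenizer is deliberately naive, and "signal strength" is measured here with a smoothed log-odds ratio, which is one of several reasonable choices and not necessarily the measure used in any production scorecard.

```python
from collections import Counter
import math
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def term_document_matrix(docs):
    """Return (vocab, rows), where rows[i][j] is the count of term vocab[j] in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows

def signal_strength(vocab, rows, outcomes):
    """Smoothed log-odds that a term appears in outcome=1 documents versus outcome=0 documents."""
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    strength = {}
    for j, term in enumerate(vocab):
        pos = sum(1 for row, y in zip(rows, outcomes) if y == 1 and row[j] > 0)
        neg = sum(1 for row, y in zip(rows, outcomes) if y == 0 and row[j] > 0)
        # Add-one smoothing keeps the ratio finite when a term is absent from one group
        p_pos = (pos + 1) / (n_pos + 2)
        p_neg = (neg + 1) / (n_neg + 2)
        strength[term] = math.log(p_pos / p_neg)
    return strength

# Hypothetical call center notes, and whether each customer later purchased product X
docs = [
    "asked about upgrade pricing",
    "upgrade question and pricing details",
    "complained about late fee",
    "late fee dispute",
]
purchased = [1, 1, 0, 0]

vocab, rows = term_document_matrix(docs)
strength = signal_strength(vocab, rows, purchased)

# Reduction step: sort terms from weakest to strongest signal and keep the strongest few
ranked = sorted(vocab, key=lambda t: abs(strength[t]))
top_terms = ranked[-4:]

# New structured columns for the modeling dataset: one count per retained term
features = [[row[vocab.index(t)] for t in top_terms] for row in rows]
```

Each row of `features` can then sit alongside the conventional structured fields as candidate scorecard characteristics. With real data the vocabulary runs to many thousands of terms, so the matrix is usually stored in a sparse form.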
• Named entity extraction (NEE) is based on natural language processing, which draws on the disciplines of computer science, artificial intelligence and linguistics. By analyzing the structure of the text, NEE determines which parts of it are likely to represent entities such as people, locations, organizations, job titles, products, monetary amounts, percentages, dates and times. One reason NEE is compatible with scorecards is that both techniques readily allow for engineering. For every entity identified, the NEE algorithm generates a score indicating the probability that the identification is correct. And hence our data scientists can engineer the probability thresholds – accepting only those entities with a score above 80 percent, for example – in the creation of the structured feature and the inclusion of that feature in a predictive model.
Using extracted entities with similarity-based matching algorithms, we can join records from different sources that have no direct links (e.g., a structured file containing customer information and unstructured texts about interactions with the customer, either or both lacking a distinct and reliable customer ID). In addition, by combining extracted entities, we are able to infer the nature and strength (or absence) of relationships among individuals. In a business context, this can allow us to estimate, for example, an individual’s authority to make a purchase decision; and while this determination is critical to intelligently targeting buyers, it is also almost never directly observable in our source data.
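A minimal sketch of these two ideas, thresholding extracted entities by score and then joining them to customer records by similarity, might look like the following. The extractor output, the customer list and both cutoffs are hypothetical; the similarity measure is the generic string matcher from Python's standard library, standing in for whatever matching algorithm a real project would use.

```python
from difflib import SequenceMatcher

# Hypothetical output of an entity extractor: (text, entity type, confidence score)
extracted = [
    ("Jon A. Smith", "PERSON", 0.93),
    ("Acme Corp", "ORGANIZATION", 0.88),
    ("Springfield", "PERSON", 0.41),  # low-confidence identification
]

def accept_entities(entities, threshold=0.80):
    """Keep only the entities whose confidence score is above the engineered threshold."""
    return [e for e in entities if e[2] > threshold]

def similarity(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(name, candidates, min_similarity=0.75):
    """Return the most similar candidate, or None if nothing is similar enough."""
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= min_similarity else None

# Hypothetical structured customer file with no ID shared with the text source
customers = ["John Smith", "Joan Smythe", "Acme Corporation"]

accepted = accept_entities(extracted)
links = {
    text: best_match(text, customers)
    for text, entity_type, score in accepted
    if entity_type == "PERSON"
}
```

Raising or lowering `threshold` and `min_similarity` is the engineering lever: stricter values yield fewer but more reliable links between the unstructured notes and the structured records.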
• Other analytic techniques effective for segmentation, as well as for detecting changes in customer behaviors, are Latent Dirichlet Allocation (LDA) and related methods for finding similarities in data that enable classification and grouping. LDA is an unsupervised statistical method of extracting topics, concepts and other types of meaning from unstructured data. It doesn’t understand syntax or any other aspect of human language. It’s looking for patterns, and it does that equally well no matter what language the text is written in, or even if it consists of just symbols rather than characters.
For example, LDA could be used to examine a blog with 100,000 posts to determine the dominant theme of that blog. The algorithm could further identify the top four or five topics or “archetypes” of content, and distinguish posts about the workplace from those about new technologies, or even about the adorable things said by 4-year-olds. It could also be applied to a heterogeneous mix of text. In fact, any type of unstructured, semi-structured and structured data from any number of sources can be analyzed with LDA to detect patterns.
This very flexible technique is commonly used in marketing to generate archetypes for customers with similar deposit, withdrawal and purchase behaviors, and can be applied to classify different types of calls into the call center. In the latter case, we can identify meaningful reasons why customers are calling, and use these insights to better predict attrition risk, more accurately forecast call volumes or even refine the features and structures of products.
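To make the mechanics concrete, here is a compact sketch of LDA fitted by collapsed Gibbs sampling, one common way to estimate the model. It is a teaching-sized illustration: the call notes are invented, the hyperparameters `alpha` and `beta` and the iteration count are arbitrary assumptions, and a real project would normally rely on a tested library implementation.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                      # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to
                # (topic share in this doc) * (word share in this topic)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Per-document topic mixtures (each row sums to 1)
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # The three most frequently assigned words for each topic
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(n_topics)
    ]
    return theta, top_words

# Hypothetical, already-tokenized call center notes
docs = [
    "late fee waived late fee travel".split(),
    "fee late payment fee waived".split(),
    "card lost replace card lost".split(),
    "lost card replace new card".split(),
]
theta, top_words = lda_gibbs(docs, n_topics=2)
```

Note that the sampler only ever counts token co-occurrences; nothing in it depends on grammar or on the language of the text. Each row of `theta` is a document's mixture over topics, which is the kind of "archetype" loading that can be fed into segmentation or attrition models. On a toy corpus like this one the two topics will usually, though not always, separate the fee-related notes from the lost-card notes.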
Here is an example of a useful discovery that LDA can help unearth: We find that customers mapping strongly to an archetype of “frequent traveler” happen to also be among the least likely to attrite. Moreover, if these customers interact with a call center representative about a payment missed due to traveling – resulting in waiving of the late fee – attrition becomes even less likely. This insight could help focus the budget and strategy for late fee waivers, knowing this can ultimately reduce attrition rates and increase loyalty.
Another advantage of LDA-based analyses is that they can be applied and updated readily on the latest stream of customer-generated events. That in turn means that we can quickly appreciate if a customer’s latest behaviors are consistent with or departing from his or her historical behavior. In other words, we can trigger an alert that something unusual is happening with the customer. For instance, the analysis of a collector’s notes over the course of several interactions concerning a delinquent bill could detect that the customer is becoming frustrated or angry, or is losing confidence he’ll ever be able to repay the debt. Perhaps a new factor has entered the equation, such as a family member falling ill, which may require an adjustment in collections strategy. This type of analysis could even signal a change in intent – identifying the moment when a customer who originally intended to pay consciously or unconsciously gives himself permission not to.
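One simple way to turn "consistent with or departing from historical behavior" into an alert is to compare the customer's latest topic mixture with the average of past mixtures. The sketch below uses Jensen-Shannon divergence for that comparison. The topic labels, the sample mixtures and the alert threshold of 0.3 are all assumptions for illustration, and other distance measures would serve equally well.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for completely disjoint ones.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def behavior_alert(history, latest, threshold=0.3):
    """Flag the customer when the latest topic mix diverges from the historical average."""
    n = len(history)
    baseline = [sum(col) / n for col in zip(*history)]
    score = js_divergence(baseline, latest)
    return score > threshold, score

# Hypothetical topic mixtures over ["payment plan", "dispute", "hardship"],
# one per past interaction recorded in the collector's notes
history = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
]

shifted_alert, shifted_score = behavior_alert(history, [0.10, 0.20, 0.70])
steady_alert, steady_score = behavior_alert(history, [0.70, 0.20, 0.10])
```

In this made-up case the interaction dominated by the "hardship" topic crosses the threshold, while the one that resembles the customer's past notes does not. In practice the threshold would be tuned against known outcomes so that alerts stay rare enough to act on.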
Natural Language Processing and Sentiment Analysis
Digging deeper for insights into not only how customers are likely to behave but also what they’re thinking and feeling is an area of text analytics generally referred to as sentiment analysis. The analytic techniques used here are often based on natural language processing (NLP), but they may also be statistical or a hybrid of these. Some of the most promising and challenging areas of development in text analytics seek to use NLP to understand what customers really mean when they use a set of words.
For example, is the phrase “That’s great” always positive? If the text reads, “You’ve been VERY helpful,” is that a genuine and glowing review or a cynical retort? Often, we humans express ourselves in ambiguous, obscure or outright sarcastic ways. As more and more customer interactions take place through e-mail, chat and text messaging – rather than phone calls or face-to-face discussions – we lose the crucial clues to meaning that come from voice tonality and emphasis. The cutting edge of sentiment analysis seeks to pick up these subtleties through other automated means.
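The difficulty is easy to demonstrate with the simplest statistical approach, a word-list (lexicon) scorer. The sketch below is purely illustrative: the tiny lexicon and its weights are invented, and the two sample messages are hypothetical.

```python
import re

# Hypothetical mini-lexicon; real sentiment lexicons are far larger
LEXICON = {
    "great": 1.0,
    "helpful": 1.0,
    "thanks": 0.5,
    "useless": -1.0,
    "angry": -1.0,
}

def lexicon_sentiment(text):
    """Average the lexicon weights of the words found in the text (0.0 if none are found)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

sincere = lexicon_sentiment("You've been very helpful, thanks!")
sarcastic = lexicon_sentiment("You've been VERY helpful. Three calls and still no refund.")
```

Both messages score as positive, because the scorer sees the word "helpful" and nothing of the context around it. Closing that gap between the words used and the meaning intended is what the more advanced techniques aim to do.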
As the amount of customer-related text data continues to expand, companies must incorporate a range of text analytic techniques into their big data strategies. There is tremendous untapped value and competitive advantage to be gained, and it is increasingly within reach.
Andy Flint is a senior director for analytic product management at FICO, a leading global analytics software company. He contributes regularly to the FICO Labs Blog.