Customer Relationships: Real-Time Text Analytics
Cloud-based analytical engine yields instant insight using unstructured social media data.
By Aveek Mukhopadhyay and Roger Barga
Information is generated in today’s world more rapidly than ever before, and it will keep growing at an exponential rate. The rise of social media combined with increased Internet penetration has led to a significant increase in user-generated content in the form of product reviews and feedback, blogs, independent news articles, Twitter and Facebook updates. The crux of leveraging such data lies in identifying patterns from it and using the data to generate actionable insights in real time.
This article proposes a cloud-based analytical engine that analyzes comments, reviews and opinions generated by customers to understand the main underlying themes and the general sentiment so that actionable insights can be generated in real time. Algorithms such as latent Dirichlet allocation for topic modeling and the holistic lexicon-based approach for sentiment mining have been operationalized using a multi-agent framework deployed in a cloud environment. The cloud deployment meets the computational demands because it lets users run virtual machines within managed data centers, freeing them from having to acquire new hardware and networks.
Unstructured Social Media Data
According to a study by International Data Corporation (IDC), mankind created an estimated 150 exabytes of data in 2005 (1 exabyte is 1 billion gigabytes), a number that jumped to 1,200 exabytes in 2010. A more recent study by IDC and EMC put the amount of data created in 2011 at 1.8 zettabytes (1 zettabyte is 1 followed by 21 zeroes, in bytes), a number the study researchers expected to double every two years.
Only 5 percent of this data is structured (comes in a standard format that can be read by computers). The remaining 95 percent is unstructured (photos, phone calls and free-flow texts). A large chunk of such unstructured data is in text format. Although its sheer volume, depth and complexity pose challenges, such data holds immense potential for organizations. The key lies in identifying patterns in the data and gaining relevant insights.
Not long ago, analyzing data and generating business intelligence reports depended on the time-intensive ETL process (extract, transform, load). Depending upon the system and data complexity, analytics could be delayed by hours, days or even weeks while data management put it all together.
In today’s business landscape, minimizing the lag between acquiring data and generating actionable insight has become the key differentiator. Acting in real time to respond to an event can result in huge profits and improved customer relationships for a firm.
Real-time analytics can be beneficial in multiple business scenarios, including:
- High-frequency trading (sophisticated algorithms to rapidly trade securities)
- Real-time detection of fraudulent transactions
- Real-time price adjustment based on competitor information
- Real-time feedback from social media for a product firm about its new launch
- Real-time recommendations by retail stores based on customer’s location
- Real-time traffic routing based on information about vehicle frequency, direction, etc.
Social media content comes from users without any vested interest, so their opinions inspire more trust. Organizations whose products and services are mentioned in such media need to remain current on relevant discussions and be able to track the sentiment of every employee, customer and investor. To address this challenge, a cloud-based real-time ecosystem was created for analyzing comments, reviews and opinions mined from Twitter. The ecosystem also tracks trending themes in the customer space and the evolution of these trends over time.
Text Mining Algorithms
Topic modeling. Topic models are statistical techniques that analyze words and phrases in textual data to understand the main themes running through them. The topic model used here is based on LDA (latent Dirichlet allocation) and uses the observed words in tweets (extracted from Twitter) to infer the hidden topic structure.
LDA is more easily understood by its generative process. This generative process defines a joint probability distribution over the observed (the words) and hidden (the topics) random variables. This joint distribution is used to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is called the posterior distribution.
A topic is assumed to be a collection of words with different probabilities of occurrence. An individual tweet can be assumed to be generated from multiple topics in different proportions. Every word in a tweet can then be randomly chosen in a two-step process:
- First, a topic is randomly selected from the distribution of topics.
- Second, the chosen word is randomly selected from the distribution of words over that topic.
So, the joint probability of word W and topic T is Probability (W, T) = Probability (T) * Probability (W | T).
Now when the individual probability of occurrence of a word is known (because it has already occurred in the tweet), the posterior distribution is calculated as follows:
Probability (T | W) = Probability (W, T) / Probability (W)
Given the probabilities of observed words, latent information like the vocabulary distribution of a topic and the distribution of topics over the tweet are thus inferred.
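The posterior calculation above can be illustrated with a toy example. This is only a sketch of the single-word Bayes step, assuming the topic proportions and the per-topic word distributions are already known; it is not full LDA inference, which has to estimate those distributions from the tweets themselves. The topic names, words and probabilities are made up for illustration.

```python
# Toy illustration of Probability(T | W) = Probability(W, T) / Probability(W).
# All topics, words and numbers below are hypothetical.

# Probability(T): topic proportions for one imaginary tweet.
topic_prior = {"battery": 0.7, "camera": 0.3}

# Probability(W | T): word distribution of each topic.
word_given_topic = {
    "battery": {"charge": 0.5, "short": 0.3, "photo": 0.2},
    "camera": {"charge": 0.1, "short": 0.1, "photo": 0.8},
}


def topic_posterior(word):
    """Return Probability(T | W=word) for every topic T."""
    # Joint probability: Probability(W, T) = Probability(T) * Probability(W | T).
    joint = {t: topic_prior[t] * word_given_topic[t][word] for t in topic_prior}
    # Probability(W) is the joint probability summed over all topics.
    evidence = sum(joint.values())
    return {t: p / evidence for t, p in joint.items()}


posterior = topic_posterior("photo")
for topic, probability in posterior.items():
    print(f"Probability({topic} | photo) = {probability:.3f}")
```

Although the "battery" topic dominates the tweet, observing the word "photo" shifts most of the posterior mass to the "camera" topic, because that word is much more likely under it.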
Sentiment analysis. A holistic lexicon-based algorithm is used to analyze individual feature-level sentiments as well as cumulative sentiments over tweets.
Aggregating opinions for a feature: The algorithm parses one tweet at a time, identifying the features present. A set of opinion words for each feature is identified using a lexicon. An orientation score for each feature in the sentence is then calculated by summing up the feature-opinion scores for that sentence. (Each feature-opinion score is obtained by multiplying the sentiment polarity of the opinion word by the multiplicative inverse of the distance between the feature and the opinion word. Opinion words farther from the feature are assumed to be less associated with the feature than nearer words.)
For example, “The phone is useful and a great work of art.”
Let the feature here be phone and the opinion words be “useful” and “great.”
Semantic orientation of useful = 1
Semantic orientation of great = 1
Distance between the words useful and phone = 2
Distance between the words great and phone = 5
Orientation score of phone = 1/2 + 1/5 = 0.7
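The feature-level aggregation can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization, a distance measured in word positions, and a tiny hypothetical lexicon; the function name `feature_orientation` is invented here and is not taken from the original system.

```python
# Hypothetical toy lexicon: opinion word -> semantic orientation.
LEXICON = {"useful": 1, "great": 1}


def feature_orientation(tokens, feature, lexicon):
    """Sum polarity / distance over every opinion word in the sentence."""
    feature_pos = tokens.index(feature)
    score = 0.0
    for pos, token in enumerate(tokens):
        polarity = lexicon.get(token)
        if polarity is None or pos == feature_pos:
            continue  # not an opinion word
        # Opinion words farther from the feature contribute less.
        score += polarity / abs(pos - feature_pos)
    return score


tokens = "the phone is useful and a great work of art".split()
phone_score = feature_orientation(tokens, "phone", LEXICON)
print(f"orientation(phone) = {phone_score:.2f}")
```

With the distances given in the example (2 for "useful" and 5 for "great"), the score for phone works out to 1/2 + 1/5 = 0.7.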
Aggregating opinions for tweets: The sentiment score for a tweet is the summation of the scores for all opinion words present in the tweet.
For example, “The phone is useful and a great work of art.”
The opinion words in the sentence are “useful,” “great”
Semantic orientation of useful = 1
Semantic orientation of great = 1
score(t) = 1 + 1 = 2
Negation rule: This identifies a negation word (which can be one or two places before the opinion word) and reverses the opinion expressed in the sentence.
For example, “The phone is not good.”
Here, phone gets a negative orientation.
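Tweet-level aggregation and the negation rule can be combined in one short sketch. It assumes a small hypothetical lexicon and negation list, and a simple regular-expression tokenizer; a production lexicon would be far larger.

```python
import re

# Hypothetical toy resources, for illustration only.
LEXICON = {"useful": 1, "great": 1, "good": 1, "bad": -1}
NEGATIONS = {"not", "no", "never"}


def tweet_score(text):
    """Sum the orientations of all opinion words, flipping negated ones."""
    tokens = re.findall(r"[a-z'-]+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token)
        if polarity is None:
            continue  # not an opinion word
        # Negation rule: a negation word one or two places before the
        # opinion word reverses its orientation.
        if any(t in NEGATIONS for t in tokens[max(0, i - 2):i]):
            polarity = -polarity
        score += polarity
    return score


print(tweet_score("The phone is useful and a great work of art."))
print(tweet_score("The phone is not good."))
```

The first tweet scores 2, matching the worked example, and the second scores -1 because "not" appears one place before "good."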
Context-dependent rules: For features that have no opinion words, context-dependent constructs are used to identify the orientation score.
For example, “The phone is good but battery-life is short.”
The only opinion word in the sentence is “good” (“short” is a context-dependent word).
Phone gets positive orientation because of “good.”
Battery-life gets negative orientation because of the word “but” being present between good and battery-life.
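The "but" rule in this example can be sketched as follows. This is a deliberately simplified illustration: it splits the sentence at the first "but," scores each clause by summing lexicon polarities (without distance weighting), and gives a clause that has no opinion word the opposite orientation of the other clause. The lexicon, the feature list and the function name are hypothetical.

```python
import re

# Hypothetical toy resources, for illustration only.
LEXICON = {"good": 1, "great": 1, "bad": -1}
FEATURES = {"phone", "battery-life"}


def orient_features(sentence):
    """Return {feature: orientation}, applying a simple "but" rule."""
    tokens = re.findall(r"[a-z'-]+", sentence.lower())
    if "but" in tokens:
        split = tokens.index("but")
        clauses = [tokens[:split], tokens[split + 1:]]
    else:
        clauses = [tokens]
    # Orientation of each clause from its own opinion words.
    scores = [sum(LEXICON.get(t, 0) for t in clause) for clause in clauses]
    result = {}
    for i, clause in enumerate(clauses):
        orientation = scores[i]
        if orientation == 0 and len(clauses) == 2:
            # No opinion word here: take the opposite of the other clause.
            orientation = -scores[1 - i]
        for token in clause:
            if token in FEATURES:
                result[token] = orientation
    return result


print(orient_features("The phone is good but battery-life is short."))
```

For the example sentence, phone receives +1 from "good," and battery-life receives -1 because it sits on the other side of "but" with no opinion word of its own.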
|Figure 1: Real-time text mining agency.|
Topic evolution. The next step after topic modeling is to understand how topics and trends develop, evolve and go viral over time.
The algorithm maintains a fixed number of topic streams and their statistics. Each tweet is processed as it comes in and is assigned to the “closest” topic stream (the topic stream most similar to it). If no topic stream is close enough, then a new stream is created and a stale stream is killed to maintain a fixed number of topic streams. Streams are constantly monitored for the rate of arrival of tweets. Whenever there is a burst of tweets in a particular topic stream, an alert for the trending topic is generated.
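The stream-tracking loop described above might look like the following sketch. The article does not specify the similarity measure, the closeness threshold, the staleness criterion or the burst test, so every one of those is an assumption here: Jaccard overlap of word sets, a fixed threshold, "least recently updated" as stale, and a burst defined as a minimum number of arrivals inside a sliding time window. All class and parameter names are hypothetical.

```python
from collections import Counter, deque


class TopicStream:
    def __init__(self, words, now):
        self.words = Counter(words)    # running word statistics
        self.arrivals = deque([now])   # recent arrival times
        self.last_seen = now

    def similarity(self, words):
        """Jaccard overlap between the tweet and the stream vocabulary."""
        tweet, vocab = set(words), set(self.words)
        union = tweet | vocab
        return len(tweet & vocab) / len(union) if union else 0.0

    def add(self, words, now):
        self.words.update(words)
        self.arrivals.append(now)
        self.last_seen = now


class StreamTracker:
    def __init__(self, max_streams=3, min_similarity=0.2,
                 window=60.0, burst_size=3):
        self.max_streams = max_streams
        self.min_similarity = min_similarity
        self.window = window
        self.burst_size = burst_size
        self.streams = []

    def process(self, words, now):
        """Assign one tweet; return its stream if it is bursting, else None."""
        best = max(self.streams, key=lambda s: s.similarity(words),
                   default=None)
        if best is not None and best.similarity(words) >= self.min_similarity:
            best.add(words, now)
        else:
            # No stream is close enough: kill the stalest one if full,
            # then start a new stream for this tweet.
            if len(self.streams) >= self.max_streams:
                stale = min(self.streams, key=lambda s: s.last_seen)
                self.streams.remove(stale)
            best = TopicStream(words, now)
            self.streams.append(best)
        # Forget arrivals that fell out of the sliding window.
        while best.arrivals and now - best.arrivals[0] > self.window:
            best.arrivals.popleft()
        return best if len(best.arrivals) >= self.burst_size else None


tracker = StreamTracker(max_streams=2)
tracker.process(["battery", "drain"], now=0)
tracker.process(["camera", "blur"], now=1)
tracker.process(["battery", "drain", "fast"], now=2)
alert = tracker.process(["battery", "life"], now=3)
if alert is not None:
    print("Trending:", [w for w, _ in alert.words.most_common(2)])
```

Keeping the number of streams fixed bounds memory use regardless of how many tweets arrive, which is what makes this kind of tracking feasible on an unbounded feed.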
The Real-Time Edge
A multi-agent distributed framework enables the processing of real-time data and facilitates decision-making by allowing for easy deployment of analytical tasks in the form of process flows. In this multi-agent paradigm, an agent is a software program designed to carry out one or more tasks and can communicate with other agents in the system using agent communication language. Thus, an analytical task can be written as an agent, and the analytical process flow can be established by wiring together a set of communicating agents (an agency) that can run in sequence or in parallel.
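The idea of wiring agents into a process flow can be illustrated with a very small in-process sketch. It is written in Python for illustration and is not the framework described here: the `Agent` class, its methods and the three example agents are all hypothetical, messages are passed by direct synchronous calls rather than an agent communication language, and nothing is distributed across machines.

```python
class Agent:
    """A named task that forwards its result to connected agents."""

    def __init__(self, name, task):
        self.name = name
        self.task = task
        self.subscribers = []
        self.outbox = []  # results produced so far, kept for inspection

    def connect(self, other):
        """Wire this agent's output to another agent's input."""
        self.subscribers.append(other)
        return other

    def receive(self, message):
        result = self.task(message)
        self.outbox.append(result)
        for agent in self.subscribers:
            agent.receive(result)


# One upstream agent fans out to two downstream analytical agents.
cleaner = Agent("cleaner", lambda tweet: tweet.lower().split())
counter = Agent("word_counter", len)
flagger = Agent("phone_flagger", lambda words: "phone" in words)

cleaner.connect(counter)
cleaner.connect(flagger)
cleaner.receive("The PHONE is useful")

print(counter.outbox, flagger.outbox)
```

Because each agent only knows its own task and its subscribers, a new analytical step can be added by connecting another agent without changing the existing ones.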
These agents were written using R to offer the analyst the benefits of a powerful and flexible statistical modeling language.
Operationalization in the Cloud
The entire real-time platform was then deployed on a cloud ecosystem to provide the following benefits:
Efficient resource management: The cloud platform provides the necessary virtual machine, network bandwidth and other infrastructure resources. Even when a machine goes down because of an unexpected failure, a new virtual machine is allocated for the application automatically.
Dynamic scaling and load balancing: The cloud solution allows an application to be scaled out or scaled back depending on resource requirements. Multiple services running in tandem make the whole system computationally resource intensive. As resource demands increase, new role instances can be provisioned to handle the load. When demand decreases, these instances can be removed so that unused computing power is not paid for.
|Figure 2: Topic modeling treemap.|
Availability & durability: The cloud storage services replicate data on three different servers, guaranteeing it can be accessed at all times, even if a server shuts down unexpectedly.
Better mobility: The application can be accessed from any place, as long as there is an Internet connection. There is no tight coupling with any physical server or machine.
Figure 2 shows a snapshot of the topic treemap generated in one run of the topic modeling algorithm (different topics are represented by different colors, with the areas representing occurrence frequency).
|Figure 3: Trends stream graph.|
Incoming tweets over a time period were captured in a stream graph visualization, as shown in the Figure 3 screenshot. Each topic is represented by a stream in the visualization and is characterized by the top words in that topic. At any point in time, the top words in each topic are displayed in a topic treemap below the stream graph. The keyword treemap can also be retrieved for any past point in time.
Successive runs of the sentiment analysis algorithm for batches of tweets are represented by the visual in Figure 4.
|Figure 4: Sentiment analysis.|
Each bar captures the sentiment for that feature in a particular batch of tweets. The height of the bar represents the number of opinion words for the feature in that batch. The color of each bar represents the overall sentiment level expressed in a batch of data, ranging from extremely negative (dark red) to extremely positive (dark green). The change in color of the bars across various batches can be used to identify stimuli that are driving the change.
Selection of a particular bar provides a deeper analysis of that batch. The size of a bubble indicates the number of references of a particular opinion word, and the color shows the overall sentiment score for the particular opinion word. Both the size and color are indicators of which opinion words drive the sentiment for a feature in a batch.
Trending topics represent the popular “topics of conversation,” and when detected in real time, these hot topics are the social pulses that are usually ahead of any standard news media. Data analyzed via managed data centers can provide key insights into the evolving nature and patterns of social information and opinion and the general sentiment prevailing over such subjects.
Aveek Mukhopadhyay is an associate manager at Mu Sigma where he works with the Innovation & Development Team with a core focus on driving the adoption of advanced analytical platforms and techniques both internally and externally. He has interests in the fields of text mining, machine learning and analytics automation.
Roger Barga, Ph.D., is group program manager for the CloudML team at Microsoft Corporation where his team is building machine learning as a service in the cloud. Barga is also a lecturer in the Data Science program at the University of Washington. He joined Microsoft in 1997 as a researcher in the Database Group of Microsoft Research (MSR), where he was involved in a number of systems research projects and product incubation efforts, before joining the Cloud and Enterprise Division of Microsoft in 2011.
NOTES AND REFERENCES
- The Economist (Feb. 25, 2010), “The Data Deluge” (http://www.economist.com/node/15579717).
- David M. Blei, “Probabilistic Topic Models,” Communications of the ACM, April 2012, Vol. 55, No. 4 (http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf).
- Xiaowen Ding, Bing Liu and Philip S. Yu, “A Holistic Lexicon-Based Approach to Opinion Mining” (http://www.cs.uic.edu/~liub/FBS/opinion-mining-final-WSDM.pdf).