Analytics Magazine

Output visualization of machine learning analysis

By Navneet Kesher

Data science is more than just building machine learning models; it’s also about explaining the models and using them to drive data-driven decisions. In the journey from analysis to data-driven outcomes, data visualization plays an important role in presenting data in a powerful and credible way.

Why Unstructured Data?

Structured data only accounts for about 20 percent of stored information. The rest is unstructured data – texts, blogs, documents, photos, videos, etc. Unstructured data, also known as dark data, includes information assets that organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Unstructured data is the hidden part of the massive iceberg that has yet to be analyzed for useful decision-making.

In many circles, unstructured data is considered a burden that should be sorted and stored away. In reality, it contains valuable business insights that can significantly augment the business understanding that we have today from structured data.

Figure 1: Unstructured data: the hidden part of the massive iceberg.

Although machine learning can analyze any type of data (structured or unstructured), unstructured data is virtually useless without machine learning algorithms, including natural language processing (NLP), text-mining and pattern/classification algorithms. While machine learning algorithms have seen significant advancements, the tools and processes for visualizing their results for a general audience have not kept pace.

Visualization tools are extremely valuable, but they have traditionally operated mostly on highly structured data, such as stock prices and sales records. As we create and consume more unstructured data, we have to extend our visualization efforts to include it.

Importance of Data Visualization

As a data scientist, I always question the amount of time I put into data visualization. Throughout my early analytics career, I observed that the prettier my graph, the more skeptical my audience was of the quality of my analysis. While I loved data visualization, I also feared coming across as the person who puts more emphasis and effort into making the graphs pretty than into ensuring a thorough analysis (Figure 2).

Figure 2: Can pretty graphs make an audience more skeptical of the quality of the analysis?

As I progressed in my career, I realized that data analysis and data visualization are not mutually exclusive – they co-exist and feed off each other (that is, you can produce pretty graphs and still come across as someone with analytical prowess). The rest of this article focuses on representing highly complex analysis on a sheet of paper (or slide) for someone who may not need to understand the underlying details.

Representing Unstructured Data

Below are the three broad guidelines that I follow while building visualization for unstructured data:

1. Start with a goal. Goals are the fundamental bonding agent that connects the purpose of the analysis to the visualization of results. Whether the goal is to arrive at a decision or to kick off an exploration of next steps, the data scientist should aim to identify and convey the results and corresponding visualization that best support a well-defined goal.

For example, if the goal is to analyze a call center’s audio recordings to determine the type and corresponding volume of complaints, a Cubism.js horizon graph may be very useful. Cubism.js is a time series visualization tool that uses stacked area graphs to help analyze output from audio-video streaming data. In the case of call center audio recordings, the horizon graph can help determine the intensity of customer conversations over time without having to transcribe the audio into text.
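To make the horizon graph idea concrete, here is a minimal Python sketch of the banding step that such a chart relies on: a series is sliced into equal-height layers that are then drawn on top of each other in darker shades. This is not the Cubism.js API; the function name `horizon_bands` and the intensity numbers are made up for illustration.

```python
def horizon_bands(values, n_bands=3):
    """Split a non-negative series into n_bands layers of equal height.

    Layer k holds the part of each value that falls between k*h and (k+1)*h,
    where h = max(values) / n_bands. Adding the layers back together
    reproduces the original series.
    """
    peak = max(values)
    if peak <= 0:
        return [[0.0] * len(values) for _ in range(n_bands)]
    height = peak / n_bands
    return [
        [min(max(v - k * height, 0.0), height) for v in values]
        for k in range(n_bands)
    ]


# Hypothetical per-minute intensity scores for one call (made-up numbers).
intensity = [0.0, 1.0, 2.0, 3.0, 6.0, 9.0, 4.0, 1.0]
bands = horizon_bands(intensity, n_bands=3)
for k, band in enumerate(bands):
    print("band", k, band)
```

Only the loudest minute reaches the top band, so it is the one that would show up in the darkest shade.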

Figure 3: Visualization is most effective when it is simple to understand and can stand by itself.

While we are on the topic of call center customer service recordings, if the goal is to understand the differences between a subscription customer and a free-tier customer, then a text analysis along with scatter text visualization may make more sense. Of course, this will require transcription and annotation of the media files.

Call centers use analytics to analyze thousands (or millions) of hours of recorded calls. The main goals include gaining insight into customer behavior and identifying product/service issues. The analysis method that I have found particularly useful for these goals is self-organizing maps (SOM), which, along with classification, have the added benefit of dimensionality reduction. SOMs are also good for visualizing multidimensional data as a 2-D planar diffusion map.
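As a rough illustration of how a SOM maps multidimensional records onto a 2-D grid, here is a minimal sketch in plain Python. It is a toy version of the technique, not the implementation used for the call center analysis; the function names, the grid size, the decay schedule and the three-feature call records are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid cell whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda cell: sum((w - v) ** 2 for w, v in zip(weights[cell], x)),
    )


def train_som(data, rows=3, cols=3, steps=300, lr0=0.5, seed=0):
    """Fit a small rows x cols SOM and return {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One randomly initialised weight vector per grid cell.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for step in range(steps):
        decay = 1.0 - step / steps
        lr = lr0 * decay                # learning rate shrinks over time
        sigma = sigma0 * decay + 0.5    # so does the neighbourhood radius
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            grid_dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            pull = lr * math.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Move the cell (and, more weakly, its neighbours) toward x.
            for i in range(dim):
                w[i] += pull * (x[i] - w[i])
    return weights


# Hypothetical calls scaled to [0, 1]: (talk time, hold time, complaint score).
calls = [
    [0.10, 0.20, 0.10], [0.20, 0.10, 0.15], [0.15, 0.15, 0.05],
    [0.90, 0.80, 0.95], [0.85, 0.90, 0.80], [0.95, 0.85, 0.90],
]
som = train_som(calls)
for features in calls:
    print(features, "->", best_matching_unit(som, features))
```

Each call is reduced from three features to a single grid position, which is what makes the result easy to plot as a 2-D map.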

Having and understanding the goal is the most crucial aspect for any data visualization process. Always ask yourself and your stakeholders: What will this data be used for? List the data points that will be vital for answering strategic questions for your business and then create a wireframe of the story that is going to engage your audience.

2. Simplicity for the win. The very reason we analyze unstructured data is to provide structure to it. Data visualization plays a very important role in conveying the results of the analysis, and visualization is most effective when it is simple to understand and can stand by itself without a lot of subtext or metadata. One of my favorite examples is this visualization on “How Families Interact on Facebook” by the Facebook Data Science Team. This is a very simple, yet powerful way to reveal the results of a very complex text analysis.

Another classic example is the use of bar/line charts vs. radar/spider charts. I am a big proponent of easy-to-read charts, that is, charts that convey the maximum amount of information in the least amount of time.

Figure 4: Classic example bar/line charts vs. radar/spider charts.

Here are a couple of other ways I like for simple visualization of unstructured data:

Word clouds. Word clouds help visualize the occurrence of words within a corpus, with the size of the text representing the number of times the word or the phrase occurs in the larger text collection. Word clouds are very effective in visualizing the results when performing TF-IDF (term frequency–inverse document frequency, a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus of words). Word clouds can be very effective in uncovering the topic areas of discussion for any social media content or feedback surveys/comments. If you use Python, you may want to bookmark this awesome word_cloud library by Andreas Müller. An interesting application of TF-IDF with word clouds can be found here.
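For readers who want to see what the TF-IDF weighting looks like in practice, here is a minimal sketch in plain Python. It uses one common unsmoothed variant (term frequency times the log of inverse document frequency); the function name `tfidf` and the three feedback comments are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical feedback comments (made-up text).
comments = ["app crashes on login", "app is slow", "love the app"]
weights = tfidf(comments)
print(weights[0])
```

A word such as "app" that appears in every comment gets a weight of zero, while "crashes" stands out. A dictionary of term weights like this, with zero-weight terms dropped, is the kind of input that the word_cloud library's `generate_from_frequencies` method is designed to take.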

Chord diagrams. A chord diagram is a powerful tool that can be used to represent the contextual meaning of words (especially when analyzing using latent semantic analysis). If the number of topics is fewer than 10, Seaborn heat maps (or even a stacked bar chart) may be a better alternative; however, with a larger set of topics, chord diagrams give a better visual representation.

Python examples are available for both the Chord Diagram and the Filled Chord Diagram. The lines connecting the nodes on the circle indicate the relationship between these nodes/words (the color of a line can denote a positive or negative relation), and the thickness of the connecting lines quantifies the extent of the relationship.
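Whichever plotting library draws the chords, the input is usually a square matrix of relationship strengths between the nodes. The sketch below builds one such matrix in plain Python, using simple co-occurrence counts as a stand-in for the relationship measure (it is not latent semantic analysis); the function name, vocabulary and ticket texts are made up for illustration.

```python
from itertools import combinations


def cooccurrence_matrix(docs, vocab):
    """Count, for every pair of vocab words, the documents containing both."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        present = sorted({w for w in doc.lower().split() if w in index})
        for a, b in combinations(present, 2):
            # Keep the matrix symmetric: a chord has no direction here.
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


# Hypothetical support tickets and the words we want on the circle.
tickets = ["billing refund delay", "billing login error", "login error crash"]
words = ["billing", "refund", "login", "error"]
matrix = cooccurrence_matrix(tickets, words)
for word, row in zip(words, matrix):
    print(f"{word:8s}", row)
```

Each off-diagonal count would set the thickness of one chord; the same matrix could also be passed to a heat map when the number of words is small.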

Don’t overload your visualization with data, and do present clear contrasts wherever applicable.

Figure 5: The four quadrants of good analytics insights.

3. Know your audience. Knowing your audience and tailoring the visualization for optimal consumption will go a long way toward making a successful presentation. It’s always good to understand how the data translates into strategic direction for the product. Don’t work in a silo – involve your stakeholders and get their feedback as you do your analysis and create the corresponding visualization. Iterate! If you cannot get feedback from everyone, make sure that you think about who will actually be looking at these visualizations, what’s important to them and, most importantly, how much time they will have to look at your graphs. These are the most important things you should do to understand your audience. Technical jargon won’t work if your audience doesn’t know what it means. No matter how beautiful your graphs are, if you don’t deliver meaningful and actionable insights, your work does not qualify as impactful.

For example, if you present your data in the form of a network graph (network graphs are designed to measure and quantify the relationships between different vertices or nodes on a graph), take some time to explain how the graph works. In a social data context, network graphs can be a powerful tool for telling a story about the health of your product’s ecosystem.
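One simple quantity that can help when explaining a network graph to a non-technical audience is degree centrality: the share of the other nodes each node is directly connected to. The sketch below computes it in plain Python from an edge list; the function name and the user names are made up for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: fraction of the other nodes it is directly linked to}."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical interactions between four users (made-up names).
interactions = [("ana", "ben"), ("ana", "cy"), ("ana", "dee"), ("ben", "cy")]
centrality = degree_centrality(interactions)
for user, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.2f}")
```

In this toy example "ana" is connected to everyone, which is the kind of single-sentence takeaway an audience can read off the graph once the measure has been explained.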

Data visualization is an art that data scientists need to be good at in order to tell a compelling story from their analysis. Figure 5 best depicts the four quadrants of good analytics insights.

Navneet Kesher (navneet.kesher@gmail.com) is the head of Platform Data Sciences at Facebook. Prior to joining Facebook, he served as a manager of analytics for Amazon. Based in the Greater Seattle Area, he holds an MBA from the University of Southern California.
