Analytics Magazine

Output visualization of machine learning analysis

By Navneet Kesher

Data science is more than just building machine learning models; it’s also about explaining the models and using them to drive decisions. In the journey from analysis to data-driven outcomes, data visualization plays an important role in presenting data in a powerful and credible way.

Why Unstructured Data?

Structured data only accounts for about 20 percent of stored information. The rest is unstructured data – texts, blogs, documents, photos, videos, etc. Unstructured data, also known as dark data, includes information assets that organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Unstructured data is the hidden part of the massive iceberg that has yet to be analyzed for useful decision-making.

In many circles, unstructured data is considered a burden that should be sorted and stored away. In reality, it contains valuable business insights that can significantly augment the business understanding that we have today from structured data.

Figure 1: Unstructured data: the hidden part of the massive iceberg.

Although machine learning can analyze any type of data (structured or unstructured), unstructured data is virtually useless without machine learning algorithms, including natural language processing (NLP), text-mining and pattern/classification algorithms. While these algorithms have advanced significantly, the tools and processes for visualizing their results for a non-technical audience have not kept pace.

Visualization tools are extremely valuable, but they have traditionally operated mostly on highly structured data, such as stock prices and sales records. As we create and consume more unstructured data, we have to extend our visualization efforts to include it.

Importance of Data Visualization

As a data scientist, I always question the amount of time I put into data visualization. Throughout my early analytics career, I observed that the prettier my graph, the more skeptical my audience was of the quality of my analysis. While I loved data visualization, I also feared coming across as the person who puts more effort into making the graphs pretty than into ensuring a thorough analysis (Figure 2).

Figure 2: Can pretty graphs make an audience more skeptical of the quality of the analysis?

As I progressed in my career, I realized that data analysis and data visualization are not mutually exclusive; they coexist and feed off each other (that is, you can produce pretty graphs and still come across as someone with analytical prowess). The rest of this article focuses on representing highly complex analysis on a sheet of paper (or a slide) for someone who may not need to understand the underlying details.

Representing Unstructured Data

Below are three broad guidelines that I follow when building visualizations for unstructured data:

1. Start with a goal. Goals are the fundamental bonding agent that connects the purpose of the analysis to the visualization of results. Whether the goal is to arrive at a decision or to start exploring next steps, the data scientist should aim to identify and convey the results, and the corresponding visualization, that best support a well-defined goal.

For example, if the goal is to analyze a call center’s audio recordings to determine the type and corresponding volume of complaints, a Cubism horizon graph may be very useful. Cubism.js is a very effective time series visualization tool that uses horizon graphs, a compact form of layered area chart, to help analyze output content from audio-video streaming data. In the case of call center audio recordings, the horizon graph can help determine the intensity of customer conversations over time without having to transcribe audio into text.
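
To make the horizon graph idea concrete, here is a minimal Python sketch (not Cubism.js itself) of the banding step behind such a chart: the value range is cut into equal bands, which a renderer would then draw on top of one another in progressively darker shades. The function name `horizon_bands`, the band count and the sample intensity readings are illustrative assumptions, and the sketch assumes a non-empty, non-negative series.

```python
def horizon_bands(values, num_bands=3):
    """Split a non-negative series into stacked bands for a horizon chart.

    Returns one list per band; each entry is how much of that band the
    value fills, from 0.0 (empty) to 1.0 (full).
    """
    peak = max(values)
    if peak <= 0:
        return [[0.0] * len(values) for _ in range(num_bands)]
    band_height = peak / num_bands
    bands = []
    for band in range(num_bands):
        floor = band * band_height
        # Clamp the part of each value that falls inside this band
        bands.append(
            [min(max(v - floor, 0.0), band_height) / band_height for v in values]
        )
    return bands


# Hypothetical call-intensity readings, one per time step
intensity = [0.0, 3.0, 6.0, 9.0, 4.5]
bands = horizon_bands(intensity, num_bands=3)
for level, band in enumerate(bands):
    print(level, band)
```

Each inner list can be drawn as its own area layer; the loudest moments of a call are the only ones that reach the top band, which is why they stand out even when the chart is only a few pixels tall.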

Figure 3: Visualization is most effective when it is simple to understand and can stand by itself.

While we are on the topic of call center customer service recordings, if the goal is to understand the differences between subscription customers and free-tier customers, then a text analysis along with a scatter text visualization may make more sense. Of course, this requires transcription and annotation of the media files.

Call centers use analytics to analyze thousands (or millions) of hours of recorded calls. The main goals include gaining insight into customer behavior and identifying product/service issues. The analysis method that I have found particularly useful for these goals is the self-organizing map (SOM), which, along with classification, has the added benefit of dimensionality reduction. SOMs are also good for visualizing multidimensional data as a 2-D planar map.
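
As a rough illustration of how a SOM places multidimensional records on a 2-D grid, the following is a minimal sketch in plain Python. It is not a production implementation; the function names, the 3x3 grid, the decay schedule and the three made-up call features (each scaled to the 0-1 range) are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Return the grid coordinate whose weight vector is closest to the sample."""
    return min(
        weights,
        key=lambda node: sum((a - b) ** 2 for a, b in zip(weights[node], sample)),
    )


def train_som(samples, rows=3, cols=3, epochs=50, seed=0):
    """Train a small self-organizing map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        # Learning rate and neighborhood radius both shrink as training proceeds
        progress = epoch / epochs
        rate = 0.5 * (1.0 - progress)
        radius = max(rows, cols) / 2.0 * (1.0 - progress) + 0.5
        for sample in samples:
            winner = best_matching_unit(weights, sample)
            for node, w in weights.items():
                grid_dist_sq = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                # Pull the node toward the sample, more strongly near the winner
                for i in range(dim):
                    w[i] += rate * influence * (sample[i] - w[i])
    return weights


# Hypothetical per-call features: [talk-time share, silence share, interruption rate]
calls = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.9, 0.8, 0.9], [0.8, 0.9, 0.8]]
som = train_som(calls)
placements = [best_matching_unit(som, call) for call in calls]
print(placements)
```

Each call ends up with a (row, column) position on the grid, so three features per call are reduced to two plotting coordinates; similar calls tend to land on the same or neighboring cells.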

Having and understanding the goal is the most crucial aspect of any data visualization process. Always ask yourself and your stakeholders: What will this data be used for? List the data points that will be vital for answering strategic questions for your business and then create a wireframe of the story that is going to engage your audience.

2. Simplicity for the win. The very reason we analyze unstructured data is to provide structure to it. Data visualization plays a very important role in conveying the results of the analysis, and visualization is most effective when it is simple to understand and can stand by itself without a lot of subtext or metadata. One of my favorite examples is the visualization “How Families Interact on Facebook” by the Facebook Data Science Team. It is a very simple, yet powerful way to reveal the results of a very complex text analysis.

Another classic example is the use of bar/line charts vs. radar/spider charts. I am a big proponent of easy-to-read charts, that is, charts that convey the maximum amount of information in the least amount of time.

Figure 4: Classic example bar/line charts vs. radar/spider charts.

Here are a couple of other simple ways I like to visualize unstructured data:

Word clouds. Word clouds help visualize the occurrence of words within a corpus, with the size of the text representing the number of times the word or the phrase occurs in the larger text collection. Word clouds are very effective in visualizing the results when performing TF-IDF (term frequency–inverse document frequency, a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus of words). Word clouds can be very effective in uncovering the topic areas of discussion for any social media content or feedback surveys/comments. If you use Python, you may want to bookmark this awesome word_cloud library by Andreas Müller. An interesting application of TF-IDF with word clouds can be found here.
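
As a minimal sketch of the TF-IDF step that could feed a word cloud, the plain-Python function below scores each word in each document. It uses one common, unsmoothed variant of the formula (term frequency times the natural log of total documents over documents containing the word); the function name, the whitespace tokenizer and the sample comments are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_count = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {
                word: (count / len(tokens)) * math.log(doc_count / doc_freq[word])
                for word, count in counts.items()
            }
        )
    return scores


# Hypothetical customer comments
comments = [
    "billing error billing refund",
    "app crash login error",
    "refund delay refund request",
]
weights = tf_idf(comments)
top_word = max(weights[0], key=weights[0].get)
print(top_word, weights[0])
```

A dictionary of word weights like `weights[0]` is the kind of input that the word_cloud library’s `generate_from_frequencies` method accepts, so the highest-scoring words are drawn largest. Note that in this variant a word appearing in every document scores zero.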

Chord diagrams. A chord diagram is a powerful tool that can be used to represent the contextual meaning of words (especially when analyzing using latent semantic analysis). If the number of topics is fewer than 10, Seaborn heat maps (or even a stacked bar chart) may be a better alternative; however, with a larger set of topics, chord diagrams offer a better visual representation.

Python examples exist for both the Chord Diagram and the Filled Chord Diagram. The lines connecting the nodes on the circle indicate the relationship between these nodes/words (the color of a line can denote a positive or negative relation), and the thickness of each connecting line quantifies the extent of the relationship.
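
A chord diagram is drawn from a square matrix in which each cell holds the strength of the relationship between two nodes. As a simple stand-in for the latent semantic analysis mentioned above, the sketch below builds that matrix from plain word co-occurrence counts; the function name, vocabulary and sample documents are hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(documents, vocabulary):
    """Count how often each pair of vocabulary words appears in the same document."""
    index = {word: i for i, word in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for doc in documents:
        # Keep each vocabulary word at most once per document
        present = sorted({w for w in doc.lower().split() if w in index})
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


# Hypothetical transcribed complaints and the words of interest
documents = ["billing refund error", "billing refund", "login error"]
vocabulary = ["billing", "refund", "error", "login"]
matrix = cooccurrence_matrix(documents, vocabulary)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```

Each nonzero cell becomes one chord, and the cell value sets its thickness; here the thickest chord would connect "billing" and "refund".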

Don’t overload your visualization with data, and present clear contrasts wherever applicable.

Figure 5: The four quadrants of good analytics insights.

3. Know your audience. Knowing your audience and tailoring the visualization for optimal consumption will go a long way toward making a successful presentation. It’s always good to understand how the data translates into strategic direction for the product. Don’t work in a silo; involve your stakeholders and get their feedback as you do your analysis and create the visualizations. Iterate! If you cannot get feedback from everyone, make sure that you think about who will actually be looking at these visualizations, what’s important to them and, most importantly, how much time they will have to look at your graphs. These are the most important things you should do to understand your audience. Technical jargon won’t work if your audience doesn’t know what it means. No matter how beautiful your graphs are, if you don’t deliver meaningful and actionable insights, your work does not qualify as impactful.

For example, if you present your data in the form of a network graph (network graphs are designed to measure and quantify the relationships between different vertices or nodes on a graph), take some time to explain how the graph works. In a social data context, network graphs can be a powerful tool for telling a story about the health of your product’s ecosystem.
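
One simple way to explain a network graph to an audience is to show a single number per node. The sketch below computes degree centrality, the share of other nodes that each node is directly connected to, in plain Python; the function name and the small friendship network are made up for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Guard against division by zero for a single-node graph
    others = max(len(neighbors) - 1, 1)
    return {node: len(linked) / others for node, linked in neighbors.items()}


# Hypothetical undirected connections between four users
edges = [("ana", "ben"), ("ana", "cy"), ("ana", "dee"), ("ben", "cy")]
centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality)
```

Sizing or coloring nodes by a score like this gives a non-technical audience an immediate reading of who sits at the center of the network.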

Data visualization is an art that data scientists need to be good at in order to tell a compelling story from their analysis. Figure 5 best depicts the four quadrants of good analytics insights.

Navneet Kesher (navneet.kesher@gmail.com) is the head of Platform Data Sciences at Facebook. Prior to joining Facebook, he served as a manager of analytics for Amazon. Based in the Greater Seattle Area, he holds an MBA from the University of Southern California.
