

Analytics Magazine

Text Analytics: Lost in translation

May/June 2013

Part one of two-part series on best practices for analyzing multi-lingual text.

By Christopher Broxe and Fiona McNeill

The phrase “lost in translation” takes on special meaning when it comes to text analysis. Nuances in language can indicate the homeland of the author (“housecoat” or “bathrobe,” “pop” or “soda”), even within the same mother tongue. When text is examined across different languages, however, the same phrase can carry an altogether different meaning. For example, a Swedish tourist asking a taxi driver in Copenhagen, Denmark, “snälla, kör mig till ett roligt ställe” (“please take me to a fun place”) might be driven to a nearby cemetery, because “roligt” in Danish means a calm, peaceful place. Other language-specific idioms abound: the Swedish phrase “Ingen ko på isen” (meaning “no worries”) will inevitably be translated as “no cow on the ice,” and the common English phrase “It’s raining cats and dogs” is bound to be problematic once translated.

Translating to a common language can at best confuse the meaning and at worst completely butcher the author’s intentions, so drawing conclusions from translated text can lead to misguided results. So what do you do? How can you draw conclusions from texts that are written in different languages?

Unstructured text retrieval in a native language can readily be done, even without knowing the language the text is written in. Analysis, however, is best done by a native speaker who can train and refine statistical and linguistic models to capture inherent meanings and preserve the author’s intentions. The outputs of such models can then readily be consolidated into one common language: results.

This first of a two-part series describes a method for multi-lingual document retrieval, illustrated with examples using SAS software.

How to address multi-lingual content

Analytically based information retrieval lets you create distinct content streams (a.k.a. pipelines) for any language, site, source or combination thereof in such a way that each stream contains instructions related to specific language content. So while the text analysis models are uniquely defined to the language associated with the text in order to preserve the meaning – such as the expressed sentiment, concepts and facts [1] – the application of multiple pipeline models to documents written in different languages can be done by any non-native speaker – or even by machine.

Given that Web pages are defined in the universal language of HTML, one can readily identify the “body” of text, title, author and other core elements of a page generically, and simply direct the software to retrieve the desired section of the HTML. Figure 1 illustrates the results from a point-and-click interface used to define the HTML (or XML) fields that will be retrieved from a Web page when it is crawled [2]. Although the language used in the body of the text is Arabic, the author and title follow a standardized format, so even without knowing what the text says, the structure of the page can readily be mapped to the different aspects of the content.

Figure 1: Defining the Web page structure with an interface can be done without knowing the language.

XPath expressions [3] are the query language that typically operates behind such software interfaces, forming the code base for retrieval activities. Given that Web pages can readily change, it is worthwhile to define XPath expressions at the most generic level possible. For example, changing the positional expression /html/head/meta[14]/@content to /head/meta[@name="author"]/@content ensures you will always capture the author, even if the author is no longer the 14th element of the meta definition.
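The difference between the two expressions can be sketched outside any particular crawler. The snippet below uses Python's standard library, which supports a limited XPath subset; the page fragment and the author name in it are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented page fragment; a real page would carry many more meta elements.
page = """<html><head>
  <meta name="description" content="Hotel reviews" />
  <meta name="author" content="A. Writer" />
</head><body><p>...</p></body></html>"""

root = ET.fromstring(page)

# Brittle, positional query: breaks as soon as the author tag moves.
brittle = root.findall("./head/meta")[1].get("content")

# Robust, attribute-based query: finds the author wherever it sits.
robust = root.find("./head/meta[@name='author']").get("content")

print(brittle, robust)  # both resolve to "A. Writer" here
```

If the page later gains a new meta element above the author, only the attribute-based query keeps returning the right field.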

After tweaking the XPath expressions to be as generic as possible, a “marked up” template for an Arabic Web page is created without needing to understand a word of Arabic, as illustrated in Figure 1. The same can be done in any language, although it is important to keep in mind that once focus turns from retrieving the text to understanding its meaning, a native speaker is highly valuable to ensure that the meaning intended within that specific language is identified and understood [4].

A Web page can just as easily be templated with markers defining the content of multi-lingual tweets, as illustrated in Figure 2. XPath expressions identify the different aspects of the tweet that are of interest, and these become metadata added to the document once it is retrieved by the system.

Figure 2: Arabic (on the left) and English (on the right) tweets, marked up to identify the content components.

Most NLP-based retrieval systems include a built-in facility to detect language, so that the appropriate markup template is applied to any given input document, Web page, tweet, etc.
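Production systems detect language with trained statistical models over many languages, but the idea can be sketched with a crude script-range heuristic; the two-way Arabic/English split below is an assumption made only for this example.

```python
def detect_language(text: str) -> str:
    """Crude script-based guess: counts Arabic-block vs. Latin letters.
    Real retrieval systems use character n-gram statistics instead."""
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "ARABIC" if arabic > latin else "ENGLISH"

print(detect_language("مرحبا بكم في الفندق"))   # ARABIC ("welcome to the hotel")
print(detect_language("Welcome to the hotel"))  # ENGLISH
```

The detected value would then be stored as a metadata field on the document, which is what makes the downstream routing in footnote 6 possible.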

The definition of the HTML or XML file (i.e., the different fields of interest identified in the XPath expressions) can be saved and included as an early document processing activity executed as part of content stream/pipeline processing. If your main interest is to crawl somewhat structured information, for example .pdf, .doc or .txt files, then this markup definition step isn’t necessary; such document types are fairly standardized, and most technologies are predefined to identify the respective content elements. Many files on the Web are in fact fairly structured according to pre-defined standards, in which case the technology can automatically “detect” the body content of the file. An HTML/XML markup facility is simply an alternative for Web pages that are very unstructured, or when the exact location of certain items is desired. Furthermore, if unstructured pages are repeatedly crawled, customized templates that can be reused for pages in every language become a valued time saver.

Other situations where a template markup facility can be very useful include discussion forums and review sites that have more than one comment per page, and in more than one language. As illustrated in Figure 3, a page from one such site includes both Swedish and English reviews of hotels.

Figure 3: Reviews from page in both Swedish and English.

The site includes commentaries from travelers from around the world. If we were to simply “crawl” this page, we’d retrieve one confusing, mixed-up review that would need to be untangled by data processing. In fact, there are many different reviews on this page/URL, so rather than creating additional data processing work, each review can be separated into a different, unique “document” as the site is being crawled. This is also a requirement for downstream text analysis, as each review needs to be contained as a distinct element to be analyzed. In Figure 4, we can see how a markup facility can distinguish between the different reviews and identify unique documents, one for each commentary about the hotel, thereby formatting the content for text analysis.

Figure 4: Swedish hotel reviews are uniquely identified using a markup template.
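The per-review split shown in Figure 4 can be approximated in a few lines: each element matched by a “review” XPath expression becomes its own output document. The page markup, the field names and the review texts below are all invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Invented page with three reviews; the real page in Figure 4 has ten.
page = """<page>
  <review lang="sv">Trevligt hotell, nära stranden.</review>
  <review lang="sv">Rummet var litet men rent.</review>
  <review lang="en">Great location, friendly staff.</review>
</page>"""
# (The Swedish reviews read "Nice hotel, close to the beach" and
#  "The room was small but clean.")

root = ET.fromstring(page)

# One output "document" per matched review element, language kept as metadata.
documents = [
    {"body": r.text, "language": r.get("lang")}
    for r in root.findall(".//review")
]

print(len(documents))  # 3
```

Each dictionary here stands in for one of the distinct “documents” that downstream text analysis requires.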

As shown in Figure 4, 10 different document “bodies” exist, one for each of the Swedish reviews contained on the Web page. Ensuring that each is expressed uniquely is important to the text analysis that would decipher the concepts, themes and sentiment contained in the reviews. Once the desired content has been defined, a series of document processing steps is employed to generate the desired output from system crawls. Note that such crawls can be external system crawls from the Web or internal file system crawls, such as intranet or internal social platform retrieval activities, or a combination of both. Steps that might be included in the complete document processing routine include:

  • markup matcher – used to identify the desired content from the pages (created using XPath expressions)
  • extract_abstract – puts text of varying lengths into a field called “abstract” so that a quick overview of the document is created
  • add field (occurring three times) – a placeholder for user-defined fields used for downstream related processing [5]
  • language identification – used to store the automatically detected language as a field for any type of filtering activity
  • export to files – used to retain the exported XML (or text) for access as training documents for downstream text analytics processing such as building sentiment or categorization linguistic rules
  • filter – used, in this example, to ensure that the analysis of the documents is occurring in the native language [6]
  • send – a final step, this last document processor directs the resultant document to matched (via filter) instances of the text analytics engine for identifying and extracting concepts, sentiment, facts, etc.
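The steps above can be sketched as a pipeline of document processors, each taking and returning a document dictionary; the step names mirror the list, but every function body here is an invented stand-in, not the actual product behavior.

```python
# Invented stand-ins for the document processors listed above.
def extract_abstract(doc):
    doc["abstract"] = doc["body"][:60]  # quick overview of the document
    return doc

def language_identification(doc):
    # Stand-in: a real processor detects the language statistically.
    has_arabic = any("\u0600" <= c <= "\u06FF" for c in doc["body"])
    doc["language"] = "ARABIC" if has_arabic else "ENGLISH"
    return doc

def make_filter(language):
    # Drop documents not in the target language (see footnote 6).
    def step(doc):
        return doc if doc.get("language") == language else None
    return step

def run_pipeline(doc, steps):
    for step in steps:
        doc = step(doc)
        if doc is None:  # filtered out; never reaches the "send" step
            return None
    return doc

pipeline = [extract_abstract, language_identification, make_filter("ARABIC")]
result = run_pipeline({"body": "مرحبا بكم"}, pipeline)
```

A document surviving the filter would then be “sent” to the language-matched text analytics instance; a document in any other language simply drops out of this stream and is picked up by its own pipeline.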

These steps retrieve texts in their native language, isolate them as distinct files and identify the language of each, retaining the intended meaning of the text. We typically retrieve such files in order to analyze their contents; how that is done in this framework is the subject of the second article in this series.

Christopher Broxe is a business solutions manager and Fiona McNeill is a global marketing manager at SAS.


  1. Text analysis models and how they can be included in this processing methodology is addressed in the second article in this series, “Processing multi-lingual text for insight discovery.”
  2. The SAS Crawler contains a Markup Matcher facility for point-and-click definitions of text structure.
  3. XPath code can automatically be generated from point-and-click activities of the user.
  4. The SAS text analytics technologies natively support an extensive number of languages.
  5. An example of a useful field to create for this type of content would be the source type, such as a “News” value for documents from the BBC or Al Jazeera, or a “Microblog” value for tweets from Twitter.
  6. Given that the “language identification” processor outputs a field called “language,” a filter can readily be created to say “language=ARABIC.” It may be that a tweet in Korean or Polish is captured, so this filter would ensure that only Arabic text analytics processing is done on Arabic documents and not to the Korean ones.
