Text Analytics: Lost in Translation II
Part two of a two-part series on best practices for analyzing multi-lingual text.
By Christopher Broxe and Fiona McNeill
In the first part of this series, we examined how to retrieve multi-lingual text without prior knowledge of the language while keeping the native script intact in order to preserve meaning. Once inputs are acquired from social media, external Web pages and even internal file system documents (like e-mail, transaction scripts and archives), the objective is to understand what the text refers to and its significance to the analysis and business objective.
While text analysis models are defined in the native language to capture the author’s intended meaning, applying multi-lingual text models to categorize content or identify sentiment can be done by a non-native speaker. Although translation technology can be used to frame text in a single language, it is at best a secondary strategy to native language analysis. Beyond issues of idioms, sentence-level phrases vary between languages, as illustrated in part one of this series: “Snälla, kör mig till ett roligt ställe” said to a cab driver will take you to a fun place in Sweden or a cemetery in Denmark.
The good news is that integrated text analytics software can be used to automatically identify the native language and to apply the correct encoding methods to statistically mine multi-lingual text inputs. Furthermore, the nomenclature of natural language processing (NLP) rule definition for detailed sentiment and advanced linguistic analysis shouldn’t appreciably change by language. Technology exists to similarly define text analysis models and rules, although detailed discussion of how that is done is beyond the scope of this article. What we do consider in this second installment is how to apply multi-lingual text models and distill meaning from the contents in a digestible manner, illustrated with examples using SAS software.
Applying Multi-lingual Text Models
Document processors are applied to the defined, crawled documents that now exist in separate files resulting from the native language information retrieval described in part one of this series. A pipeline of activity is defined to the system to iterate the steps needed to process the inputs, beginning with a pre-processing filter step that automatically assigns each file to its associated language. An example of steps that might be used after language filtering to analyze the document contents is illustrated in Figure 1.
Figure 1: Text analytics pipeline of activity (in English) with “language=ENGLISH” predefined to isolate English documents for further analysis.
At this point we can be certain that the document being analyzed is in English, given that the materials have been filtered on “language=ENGLISH.” Text analytics processing steps defined in the document processor add metadata to the corresponding documents based on the analysis of the document contents themselves. The first text processing activity is to extract the date – a common definition can be used, regardless of language. Once extracted, a filter is applied to verify that the body field of the defined document is not empty and that text is actually present to be examined in successive steps.
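The sequence of steps just described can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAS document processor itself: the record fields ("language", "body"), the function names and the ISO-style date pattern are all assumptions made for the example.

```python
import re
from datetime import date

# Hypothetical document records; the field names are assumptions for illustration.
docs = [
    {"language": "ENGLISH", "body": "Published 2013-05-14. Protests continued in the capital."},
    {"language": "ARABIC", "body": "نص عربي"},
    {"language": "ENGLISH", "body": "   "},
]

# Assumed date format for this sketch: YYYY-MM-DD.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def language_filter(documents, language):
    """Keep only documents whose 'language' field matches."""
    return [d for d in documents if d.get("language") == language]


def extract_date(document):
    """Return a copy of the document with a 'date' metadata field added."""
    match = ISO_DATE.search(document.get("body", ""))
    enriched = dict(document)
    enriched["date"] = date(*map(int, match.groups())) if match else None
    return enriched


def has_body(document):
    """True when the body contains something other than whitespace."""
    return bool(document.get("body", "").strip())


def run_pipeline(documents, language="ENGLISH"):
    """Language filter, then date extraction, then the non-empty body check."""
    selected = language_filter(documents, language)
    dated = [extract_date(d) for d in selected]
    return [d for d in dated if has_body(d)]


results = run_pipeline(docs)
```

Only the first record survives: the Arabic document is routed elsewhere by the language filter, and the whitespace-only English document is dropped by the body check.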
Figure 2: Multiple languages can be specified to a single taxonomy, with English categories on the left associated with linguistic rules defined in Arabic.
Categorize and Extract Desired Elements
The “content categorization” post processor will categorize the document into one or many relevant groups depending on the extent to which the desired elements are present in the document. These are based on specifications created using linguistic rules (developed using automatic statistical methods), automatic linguistic methods, user-specified NLP rules or predefined rules (a.k.a. taxonomies), or any combination of these methods. As an example, a document from the BBC discussing the Syrian conflict (in English) can be automatically categorized into “social unrest,” “wars and conflicts” and so on. At this stage of text processing we can also extract concepts and facts, such as the names of people, locations and organizations, based on predefined linguistic rule specifications. A variety of methods exist to extract concepts, from simple lookups and classifiers (i.e., text strings) to more advanced contextual extraction, where parts of speech, operators and conditional definitions can be used to identify desired elements in the documents. Specifications for extraction within paragraphs or sentences can be mixed with desired nouns, verbs, prepositions and other parts of speech, denoted with separators, distances between terms, negation and other building blocks that define the extraction rules used to pinpoint desired aspects of the text.
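To make the building blocks concrete, here is a minimal sketch of a rule that combines keyword lookup, a distance constraint between two terms and negation. The rule format, the category name and the trigger words are hypothetical and chosen only for illustration; they are not the syntax of any particular product.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def within_distance(tokens, first, second, max_gap):
    """True if `first` and `second` occur within `max_gap` tokens of each other."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_gap for i in first_positions for j in second_positions)


# Hypothetical rule set: keywords ("any"), term pairs with a maximum distance
# ("near") and terms that veto the category ("not").
RULES = {
    "wars and conflicts": {
        "any": {"conflict", "war", "fighting"},
        "near": [("armed", "forces", 3)],
        "not": {"game", "film"},
    },
}


def categorize(text, rules=RULES):
    """Return the categories whose rule matches the text."""
    tokens = tokenize(text)
    vocabulary = set(tokens)
    matched = []
    for category, rule in rules.items():
        if vocabulary & rule["not"]:
            continue  # negation: a vetoing term blocks the category
        keyword_hit = bool(vocabulary & rule["any"])
        proximity_hit = any(
            within_distance(tokens, a, b, gap) for a, b, gap in rule["near"]
        )
        if keyword_hit or proximity_hit:
            matched.append(category)
    return matched
```

A sentence about "armed government forces" matches through the distance rule even though none of the plain keywords appear, while a sentence about a "war film" is rejected by the negation.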
As an example, suppose we have the text:
“Bill Clinton made a visit to the newly renovated terminal at John F. Kennedy International Airport in New York. Mr. Clinton made a comment, “It’s the best looking terminal I’ve seen!” and even though airport officials told reporters that they unfortunately were $2 million over budget, the project was a success. It was the first visit of the president in N. Y. City this year. Another guest at the press conference was the director of SAS, who told the media that they were starting up scheduled flights from Gothenburg, Sweden, to JFK Airport in June. After leaving the airport, President Clinton went off to another meeting at the Free Library of Philadelphia.”
A computer may have difficulty deciphering these sentences. A simple text analysis engine might return the following results:
- Person: Five different people are mentioned, namely Bill Clinton, John F. Kennedy, JFK, Mr. Clinton and President Clinton.
- City: Three cities are described as New York, Philadelphia and Gothenburg.
- Organization/Company: One organization is mentioned, SAS, but is it SAS Inc. (the software company), SAS Airlines or SAS Special Air Service of England?
A more discriminating, and indeed better, analysis would be able to extract the following from the same paragraph:
- Person: Bill Clinton, with identified co-reference to “Mr. Clinton,” “the President” and “President Clinton.”
- City: New York (with co-referenced “N. Y. City”), Gothenburg.
- Place: John F. Kennedy International Airport (co-referenced as JFK Airport) and Free Library of Philadelphia.
- Sentiment:
  - Overall paragraph sentiment is positive.
  - Terminal: positive – “It’s the best looking terminal I’ve seen!”
  - Investment: negative – “they unfortunately were $2 million over budget.”
  - Project: positive – “the project was a success.”
- Organization/Company: SAS (co-referenced as Scandinavian Airlines).
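One simple way to approximate part of this better result is a lookup table of known aliases, matched longest first so that “John F. Kennedy International Airport” is consumed as a place before “John F. Kennedy” can be read as a person. The sketch below assumes a hand-built, hypothetical alias table and plain string matching. It does not resolve contextual references such as “the president,” which require the kind of linguistic analysis discussed in this article.

```python
# Hypothetical alias table: (entity type, canonical name) -> surface forms.
ALIASES = {
    ("Person", "Bill Clinton"): ["Bill Clinton", "Mr. Clinton", "President Clinton"],
    ("Person", "John F. Kennedy"): ["John F. Kennedy"],
    ("Place", "John F. Kennedy International Airport"): [
        "John F. Kennedy International Airport",
        "JFK Airport",
    ],
    ("Place", "Free Library of Philadelphia"): ["Free Library of Philadelphia"],
    ("City", "New York"): ["New York", "N. Y. City"],
    ("City", "Gothenburg"): ["Gothenburg"],
}


def resolve_entities(text, aliases=ALIASES):
    """Count mentions per canonical entity, matching longer surface forms first."""
    surface_forms = sorted(
        ((form, key) for key, forms in aliases.items() for form in forms),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    remaining = text
    found = {}
    for form, (entity_type, canonical) in surface_forms:
        count = remaining.count(form)
        if count:
            by_type = found.setdefault(entity_type, {})
            by_type[canonical] = by_type.get(canonical, 0) + count
            # Blank out the matched span so shorter forms cannot match inside it.
            remaining = remaining.replace(form, " ")
    return found


sample = (
    "Bill Clinton made a visit to the newly renovated terminal at "
    "John F. Kennedy International Airport in New York. "
    "Mr. Clinton made a comment. "
    "It was the first visit of the president in N. Y. City this year. "
    "Scheduled flights will start from Gothenburg, Sweden, to JFK Airport in June. "
    "President Clinton went off to another meeting at the Free Library of Philadelphia."
)
entities = resolve_entities(sample)
```

On this shortened version of the sample text, the three Clinton mentions collapse into one person, the two airport mentions into one place, and no separate “John F. Kennedy” person is reported.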
After the content categorization post processor has classified and extracted desired aspects of the text, the “sentiment analysis” post processor is executed. As with content categorization, predefined taxonomies and NLP rules are applied; however, in this step the objective is to identify and assess the sentiment expressed in the text. Not only is overall document sentiment defined but so is the sentiment associated with any desired aspect, such as “homeland security,” “economics,” “perception” and so forth. Once the sentiment is defined, this particular activity stream ends with exporting results.
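A toy version of overall and aspect-level sentiment scoring can be sketched with a small word lexicon applied clause by clause. The lexicon, the aspect trigger words and the clause splitting on punctuation are all assumptions made for this example. Production sentiment models rely on far richer taxonomies and rules.

```python
import re

# Hypothetical lexicon and aspect triggers, sized for the sample text only.
POSITIVE = {"best", "success"}
NEGATIVE = {"unfortunately"}
ASPECTS = {
    "terminal": {"terminal"},
    "investment": {"budget"},
    "project": {"project"},
}


def label(score):
    """Map a numeric score to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def score_sentiment(text):
    """Return (overall label, {aspect: label}) using clause-level word counts."""
    clauses = [c for c in re.split(r"[.!?,;]", text.lower()) if c.strip()]
    overall = 0
    by_aspect = {}
    for clause in clauses:
        words = set(re.findall(r"[a-z]+", clause))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        overall += score
        # Credit the clause score to every aspect mentioned in that clause.
        for aspect, triggers in ASPECTS.items():
            if words & triggers:
                by_aspect[aspect] = by_aspect.get(aspect, 0) + score
    return label(overall), {a: label(s) for a, s in by_aspect.items()}


sample = (
    "It's the best looking terminal I've seen! Airport officials said that they "
    "unfortunately were $2 million over budget, but the project was a success."
)
overall_sentiment, aspect_sentiment = score_sentiment(sample)
```

Scoring per clause rather than per document is what allows the over-budget remark to count against the investment while the terminal and the project remain positive.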
Figure 3: Using the content categorization post-processor, derived categories become facets of the document collection that can be searched and explored by end-users.
The generated metadata, along with the corresponding documents, is exported first to XML and then to a database. This same process is defined and executed for every unique language, applying the specific categorization, extraction and sentiment text models with the associated language filter, which matches the correct language-specific model to the documents in that language.
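The two-stage export can be illustrated with Python’s standard library, writing metadata records to XML and then loading that XML into a relational table. The element names, the table layout and the sample records are hypothetical, and an in-memory SQLite database stands in for whatever database is actually used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata records produced by earlier processing steps.
records = [
    {"id": "doc-1", "language": "ENGLISH", "category": "Election", "sentiment": "positive"},
    {"id": "doc-2", "language": "ARABIC", "category": "Election", "sentiment": "negative"},
]

FIELDS = ("language", "category", "sentiment")


def to_xml(rows):
    """Serialize metadata records to an XML string."""
    root = ET.Element("documents")
    for row in rows:
        doc = ET.SubElement(root, "document", id=row["id"])
        for field in FIELDS:
            ET.SubElement(doc, field).text = row[field]
    return ET.tostring(root, encoding="unicode")


def load_into_database(xml_text, connection):
    """Parse the XML export and insert one table row per document."""
    connection.execute(
        "CREATE TABLE documents "
        "(id TEXT PRIMARY KEY, language TEXT, category TEXT, sentiment TEXT)"
    )
    for doc in ET.fromstring(xml_text).iter("document"):
        connection.execute(
            "INSERT INTO documents VALUES (?, ?, ?, ?)",
            (doc.get("id"),) + tuple(doc.findtext(field) for field in FIELDS),
        )
    connection.commit()


xml_text = to_xml(records)
connection = sqlite3.connect(":memory:")
load_into_database(xml_text, connection)
election_count = connection.execute(
    "SELECT COUNT(*) FROM documents WHERE category = ?", ("Election",)
).fetchone()[0]
```

Once the rows are in a table, ordinary SQL can count or filter documents by category regardless of the language each document was written in.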
Metadata, on the other hand, created as a product of this processing, is language independent. Once the language-specific rules are defined, the documents are scored to the taxonomy specification. As such, when a category match occurs and the associated metadata is generated, it is in a common language. As a result, text across multiple languages can be examined in totality. A document that matches the “gang activity” category, for example, can itself be in Arabic, French, English or any other native language.
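The idea of language-specific rules feeding a common set of category labels can be shown with a small mapping. The rule terms and the single “Election” label below are hypothetical, and whole-word matching is a deliberate simplification of real linguistic rules.

```python
import re

# Hypothetical per-language rule terms, all mapping to one shared category label.
TAXONOMY = {
    "ENGLISH": {"election": "Election", "ballot": "Election"},
    "FRENCH": {"élection": "Election", "scrutin": "Election"},
    "ARABIC": {"الانتخابات": "Election"},
}


def score_document(text, language, taxonomy=TAXONOMY):
    """Return the shared category labels matched by the language's own rules."""
    rules = taxonomy.get(language, {})
    tokens = set(re.findall(r"\w+", text.lower()))
    return sorted({category for term, category in rules.items() if term in tokens})


documents = [
    ("ENGLISH", "The election results were announced today."),
    ("FRENCH", "Le scrutin aura lieu dimanche."),
    ("ARABIC", "تجرى الانتخابات غدا"),
]
metadata = [score_document(text, language) for language, text in documents]
```

Each document is matched in its own script, yet all three end up with the same label, which is what allows a single report or search facet to span every language.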
Text data is now structured and can be examined in a variety of ways. With the results exported to XML, the files are indexed and can be readily searched and retrieved based on the facet categories created in the post processing, as illustrated in Figure 3. These categories index the document collection and can be interactively explored with the document contents residing in their native language format. In Figure 3 we see this for Arabic documents; the related categories through which the user can browse are derived from the Arabic taxonomy defined to the content categorization processor. By clicking on an item in that hierarchy, say “Election,” the documents are filtered to list only texts that discuss that issue.
Figure 4: Structured native language text data, with corresponding derived fields exported from a post processor to a database, viewed in SAS’ Enterprise Guide.
With the database export, the structured text documents can also be explored and described with other tools designed to analyze data defined in rows and columns, as illustrated in Figure 4.
Reports and visualizations summarizing total population discussions, materials and attitudes can be created from this structured text, and can include documents in all languages given the language independence of the derived metadata, as illustrated in Figure 5 and Figure 6. Filters and drill-paths enable interactive examination of language-specific results.
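A “top five” style summary with an optional language drill-down reduces to counting values in the structured rows. The rows, the field names and the placeholder entity names below are invented for illustration.

```python
from collections import Counter

# Hypothetical structured rows, one per extracted mention.
rows = [
    {"language": "ENGLISH", "person": "Person A"},
    {"language": "ENGLISH", "person": "Person A"},
    {"language": "ARABIC", "person": "Person A"},
    {"language": "ARABIC", "person": "Person B"},
    {"language": "ARABIC", "person": "Person B"},
]


def top_entities(records, field, n=5, language=None):
    """Most frequent values of `field`, optionally restricted to one language."""
    selected = (r for r in records if language is None or r["language"] == language)
    return Counter(r[field] for r in selected if r.get(field)).most_common(n)


top_overall = top_entities(rows, "person")
top_arabic = top_entities(rows, "person", language="ARABIC")
```

Leaving the language argument unset gives the all-language view. Passing a language reproduces the drill-down, and the ranking can change, as it does here for the Arabic documents.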
Figure 5: Drillable report listing the top five people, locations, organizations and topics discussed in both Arabic and English from a variety of sources.
Figure 6: Associations across derived categories can be visualized for all language documents.
Advanced crawling, parsing, language recognition and text analysis all contribute to understanding unstructured data. Clarifying the meaning comes from analyzing what the author intended, described in the native language. Multi-lingual information from social media, Web pages and even existing document collections can be accessed without the need to comprehend various dialects. Application of text analysis models can also be done without any language requirements. The text models designed to identify and decipher meaning using linguistic analysis can be created in a common environment, with different models developed for each unique language. The results, enhanced with language independent metadata that has structured the text, can readily be explored, summarized and visualized, enabling decisions based on all the information, and not just what a translator has obfuscated.
Christopher Broxe is a business solutions manager and Fiona McNeill is a global marketing manager at SAS.
1. Referring to http://viewer.zmags.com/publication/3a28b0ac#/3a28b0ac/56, May/June 2013 edition, published by Analytics-Magazine.org.
2. SAS linguistic technologies, including content categorization, support 30 different native languages, in addition to individual dialects.
3. A “language_identification” processor was described in part one of this series, which outputs a field called “language” enabling a filter to be defined, say “language=ARABIC.” By creating a language filter, documents are streamlined to processing activity that has been defined for that specific language (such as only Arabic documents being assessed using the Arabic text analytics pipeline processors).
4. This is a post processor, as it is run after the documents have been identified by the crawler.
5. SAS provides methods for all of these different types of linguistic specifications with automatic methods often being used to initiate linguistic specification, which can be further refined by the end-user.
6. Sentiment can be derived for detailed features in text provided the topics are defined as part of a multi-level hierarchy, as these results imply.