Analytics Magazine

Data Visualization: Data reduction for everyone

November/December 2015

By Will Towler

Data reduction, or the distillation of multitudinous data into meaningful parts, is playing an increasingly important role in understanding the world around us and improving the quality of life. Consider the following examples:

  • Classifying space survey results to better define the structure of the universe and its evolution (see sidebar)
  • Analyzing patient biological data and health records for improved diagnosis and treatment programs (see sidebar)
  • Segmenting consumers based on their behavior and attitudes to enhance
  • Better understanding the nature of criminal activities to improve counter-measures
  • Optimizing mineral exploration efforts (see sidebar) and decoding genealogy

Proliferating Techniques

The large number of use cases has led to a proliferation of data reduction techniques, many of which fall under the umbrellas of data classification and dimension reduction. Wikipedia claims there could be more than 100 published clustering algorithms used to classify data. Even more common techniques such as K-means and hierarchical cluster analysis can vary in execution (e.g., differing in statistical tests, distance measures and linkage criteria).
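To make the point about distance measures and linkage criteria concrete, here is a minimal sketch in plain Python. It is not taken from the analysis in this article: the two small clusters and the function names (`linkage_distance`, `euclidean`, `manhattan`) are illustrative assumptions. It simply shows that the "distance between two clusters" changes depending on which metric and which linkage criterion is chosen.

```python
import math
from itertools import product


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linkage_distance(cluster_a, cluster_b, metric, criterion):
    """Distance between two clusters under a given metric and linkage criterion."""
    pairwise = [metric(p, q) for p, q in product(cluster_a, cluster_b)]
    if criterion == "single":      # closest pair of points
        return min(pairwise)
    if criterion == "complete":    # farthest pair of points
        return max(pairwise)
    if criterion == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown criterion: {criterion}")


# Hypothetical toy clusters, chosen only for illustration.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 4.0), (6.0, 8.0)]

for metric in (euclidean, manhattan):
    for criterion in ("single", "average", "complete"):
        value = linkage_distance(left, right, metric, criterion)
        print(f"{metric.__name__:<10} {criterion:<9} {value:.3f}")
```

Because agglomerative methods merge the closest pair of clusters at each step, a different metric or criterion can change which clusters merge first, and therefore the final grouping.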

There’s a seemingly equal slew of dimension-reduction methods. And again, even the more popular techniques such as principal component analysis and factor analysis can vary in execution (e.g., differing in extraction and axis rotation methods).

Mapping the Universe

Cluster analysis is used to classify galaxies based on spectral data collected through initiatives such as the Sloan Digital Sky Survey (SDSS). Greater insight into the structure of the universe is helping us to better understand its history and future.

Source: M. Blanton and SDSS collaboration,

Exploring Differences

Deciding on the most appropriate data reduction technique for a given situation can be daunting, making it tempting to simply rely on the familiar. However, different methods can produce significantly different results. To make the case we compare K-means and agglomerative hierarchical clustering (AHC) to segment countries based on export composition. We also compare different configurations of principal component analysis (PCA) to determine underlying discriminators. Data are drawn from the OECD Structural Analysis (STAN) Database and are analyzed with XLSTAT, an easy-to-use Excel add-in.

Results from K-means and AHC differ noticeably in statistical efficiency and group memberships. AHC (Euclidean distance and Ward’s linkage) achieves 80 percent explained variance with 25 clusters, while K-means (Trace stat test) requires 30 clusters to achieve the same explained variance. Other K-means and AHC configurations are less efficient. As for segment characteristics, group memberships under the two methods differ by 25 percent with 12 clusters, a number chosen subjectively based on a scree plot and qualitative assessment.
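The two comparisons above, efficiency and group membership, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the XLSTAT computation: "explained variance" is taken here to mean the between-cluster share of the total sum of squares, and agreement between two solutions is measured with the Rand index (a pair-counting measure that does not depend on how clusters are labeled) rather than the article's percentage of differing memberships. The data points and both label lists are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def explained_variance(points, labels):
    """Between-cluster share of the total sum of squares (assumed definition)."""
    grand = centroid(points)
    total = sum(sq_dist(p, grand) for p in points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    within = 0.0
    for members in groups.values():
        c = centroid(members)
        within += sum(sq_dist(p, c) for p in members)
    return 1.0 - within / total


def rand_index(labels_a, labels_b):
    """Share of observation pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical data and two hypothetical cluster solutions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
solution_a = [0, 0, 0, 1, 1, 1]
solution_b = ["x", "x", "y", "y", "y", "y"]

print("explained variance, solution A:", round(explained_variance(points, solution_a), 3))
print("explained variance, solution B:", round(explained_variance(points, solution_b), 3))
print("Rand index between A and B:", round(rand_index(solution_a, solution_b), 3))
```

Both solutions use two clusters, yet they explain different shares of the variance and disagree on a third of the observation pairs, which is the kind of gap the K-means and AHC comparison surfaces on real data.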

Different configurations of PCA also yield noticeably different statistical efficiencies and associations. Using covariance instead of correlation to measure relationships achieves the greatest efficiency and generates the most meaningful output. Promax (oblique rotation) is used instead of Varimax (orthogonal rotation), recognizing that economic activities may be correlated. As the plots in Figure 1 show, Spearman and Pearson correlations generate murky principal components, while covariance creates more logical constructs and greater explained variance.
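One reason covariance-based and correlation-based PCA can report different explained variance is that covariance keeps each variable's original scale, so high-variance variables dominate the leading component. The sketch below illustrates this for just two variables, where the eigenvalues of the 2x2 matrix have a closed form. It is an illustration only: the two series are made up, rotation is ignored, and nothing here reproduces the analysis behind Figure 1.

```python
import math


def covariance_terms(xs, ys):
    """Sample variances and covariance of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def first_component_share(a, b, c):
    """Share of total variance on the first principal component
    of the symmetric matrix [[a, b], [b, c]]."""
    largest_eigenvalue = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return largest_eigenvalue / (a + c)


# Hypothetical series: one measured on a small scale, one on a large scale.
small_scale = [1.0, 2.0, 3.0, 4.0, 5.0]
large_scale = [100.0, 300.0, 200.0, 500.0, 400.0]

sxx, sxy, syy = covariance_terms(small_scale, large_scale)
r = sxy / math.sqrt(sxx * syy)  # Pearson correlation

share_cov = first_component_share(sxx, sxy, syy)   # PCA on the covariance matrix
share_corr = first_component_share(1.0, r, 1.0)    # PCA on the correlation matrix

print(f"first-component share, covariance:  {share_cov:.4f}")
print(f"first-component share, correlation: {share_corr:.4f}")
```

In this toy case the covariance-based first component absorbs almost all of the variance because it is essentially the large-scale variable on its own. Whether that is "more meaningful" depends on whether the original units are comparable, as they are when all variables are export values.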

None of the techniques described in Figure 1 achieve particularly high levels of explained variance, in part reflecting the diverse nature of the global economy. However, the exercise illustrates how fundamental performance metrics and group formation can vary greatly even between commonly applied methods. And, of course, change the data set and the preferred methodology might change as well.

Figure 1: Principal component analysis on country exports.

Improving patient diagnosis and treatment

Cluster analysis can be used to classify diseases. This example groups glioblastoma using TCGA image data, coded professional assessment and patient records. Dr. Lee Cooper and coauthors write in the Journal of the American Medical Informatics Association that such frameworks have the potential to improve preventive strategies and treatments.
Source: National Center for Biotechnology Information

Analytical Considerations

There aren’t any hard and fast rules about what data reduction method is optimal for a given situation. However, here are some general considerations to keep in mind:

1. Exploration and theory: Given the degree to which results can vary by technique, data reduction exercises are best treated as exploratory. Openness to trying different approaches is recommended rather than relying on one method as the be-all and end-all. This doesn’t negate the importance of having a theoretical foundation to shape discovery. Techniques such as cluster analysis and PCA will group and compress any data. Without a priori assumptions and hypotheses, analysis can turn into a wild goose chase.

Optimizing mineral exploration

Dimension reduction techniques enhance geological mapping exercises. In Geosphere, Norbert Ott, Tanja Kollersberger, and Andrés Tassara illustrate how principal component analysis with Landsat satellite data can improve mineral exploration efforts.

2. Sparsity and dimensionality: Data with limited variation are likely to add little value to data reduction exercises and can be considered for exclusion. At the other end of the spectrum, too much dimensionality can complicate efforts. Discerning which attributes are most important up-front and preparing data accordingly play an important role in successful analysis.

3. Scale, outliers and borders: Consistent scales are required so that variables with high magnitudes don’t dominate. Outliers can also skew results. For example, because K-means partitions the data space into Voronoi cells, extreme values can result in odd groupings. At the same time, carte blanche removal of outliers isn’t recommended given the insights that potentially lie in exceptional cases. Another challenge with K-means is the potential to incorrectly cut borders between clusters as it tends to create similarly sized partitions.

4. Order and iterations: Source data should be randomized if the algorithm used is influenced by the order in which observations are processed. And if the technique employed involves random start points rather than fixed or user defined, multiple iterations should be conducted to evaluate the extent to which results are consistent with consecutive runs.

5. Mechanics and validation: Failure to understand the mechanics of data reduction methods could result in a sub-optimal framework and misinterpretation. Being an expert in all the different techniques is a tall task, but the wealth of publicly available information makes learning as you go possible. With so many factors to consider, validation of results is also critical. Can results be substantiated statistically and logically? What do results look like visually? Can results be replicated and effectively used for prediction?
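Several of the considerations above (dropping variables with little variation, putting variables on a consistent scale, and repeating runs when start points are random) can be sketched together in plain Python. This is a minimal illustration, not a recommended pipeline: the data, the variance threshold, the number of clusters and the number of seeds are all hypothetical, and the K-means routine is a bare-bones version of Lloyd's algorithm.

```python
import random
import statistics


def drop_low_variance(rows, min_variance):
    """Keep only columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns)
            if statistics.pvariance(col) > min_variance]
    return [tuple(row[i] for i in keep) for row in rows], keep


def standardize(rows):
    """Z-score each column so no variable dominates through magnitude alone."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s if s > 0 else 0.0
                  for v, m, s in zip(row, means, stdevs))
            for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, seed, max_iter=100):
    """Bare-bones K-means with random start points; returns (labels, within_ss)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j]))
                      for r in rows]
        if new_labels == labels:   # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:            # leave an empty cluster's center where it was
                centers[j] = tuple(statistics.fmean(col) for col in zip(*members))
    within_ss = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, within_ss


# Hypothetical data: a large-magnitude column, a small-magnitude column,
# and a constant column that carries no information.
raw = [(1000.0, 0.10, 7.0), (1100.0, 0.90, 7.0), (1050.0, 0.20, 7.0),
       (5000.0, 0.80, 7.0), (5100.0, 0.15, 7.0), (4900.0, 0.85, 7.0)]

filtered, kept = drop_low_variance(raw, min_variance=1e-9)
scaled = standardize(filtered)

# Repeat the run with different random starts and compare the outcomes.
results = [kmeans(scaled, k=2, seed=seed) for seed in range(10)]
within_values = [within_ss for _, within_ss in results]
print("columns kept:", kept)
print("within-cluster sum of squares, best and worst run:",
      round(min(within_values), 3), round(max(within_values), 3))
```

A wide gap between the best and worst run suggests the solution depends heavily on the start points and deserves a closer look before any interpretation. Note that a raw variance threshold is itself scale-dependent, so in practice it would need to be set with the units of each variable in mind.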

We’ve only scratched the surface of data reduction, covering just a few of the many techniques at a very high level. However, we’ve shown that even some of the more popular methods can generate significantly different results. Data reduction clearly falls into the camp of “part art, part science.” As such, it’s probably fair to say that there’s no “correct” approach. However, there are important considerations to be made when conducting analysis, perhaps the most important of which is the need to take an exploratory approach, open to the possibility that one size may not fit all.

Decoding Genealogy

Dimension reduction can shed light on the genetic makeup of – and relationships between – human populations. In Nature, Dr. John Novembre and team highlight the close relationship between genetic and geographic distances using principal component analysis. They conclude that spurious associations can arise when conducting genetic mapping exercises if such considerations are not properly accounted for.
Source: Nature, Nov. 8, 2008; reprinted by permission from Macmillan Publishers.


Will Towler is an analytics and insights specialist. The views expressed in this article are those of the author and do not necessarily represent the views of an employer or business partners. He is a member of INFORMS.

