Data Visualization: Data reduction for everyone
By Will Towler
Data reduction, or the distillation of multitudinous data into meaningful parts, is playing an increasingly important role in understanding the world around us and improving the quality of life. Consider the following examples:
- Classifying space survey results to better define the structure of the universe and its evolution (see sidebar)
- Analyzing patient biological data and health records for improved diagnosis and treatment programs (see sidebar)
- Segmenting consumers based on their behavior and attitudes to enhance marketing effectiveness
- Better understanding the nature of criminal activities to improve counter-measures
- Optimizing mineral exploration efforts (see sidebar) and decoding genealogy
The large number of use cases has led to a proliferation of data reduction techniques, many of which fall under the umbrellas of data classification and dimension reduction. Wikipedia claims there could be more than 100 published clustering algorithms used to classify data. Even common techniques such as K-means and hierarchical cluster analysis can vary in execution (e.g., differing in statistical tests, distance measures and linkage criteria).
There's a seemingly equal slew of dimension-reduction methods. And again, even the more popular techniques such as principal component analysis and factor analysis can vary in execution (e.g., differing in extraction and axis-rotation methods).
Mapping the Universe
|Cluster analysis is used to classify galaxies based on spectral data collected through initiatives such as the Sloan Digital Sky Survey (SDSS). Greater insight into the structure of the universe is helping us to better understand its history and future.|
Source: M. Blanton and SDSS collaboration, www.sdss.org
Deciding on the most appropriate data reduction technique for a given situation can be daunting, making it tempting to simply rely on the familiar. However, different methods can produce significantly different results. To make the case, we compare K-means and agglomerative hierarchical clustering (AHC) for segmenting countries based on export composition. We also compare different configurations of principal component analysis (PCA) to determine underlying discriminators. Data are drawn from the OECD Structural Analysis (STAN) Database and analyzed with XLSTAT, an easy-to-use Excel add-in.
Results from K-means and AHC differ noticeably in statistical efficiency and group membership. AHC (Euclidean distance and Ward's linkage) achieves 80 percent explained variance with 25 clusters, while K-means (trace statistical test) requires 30 clusters to achieve the same explained variance. Other K-means and AHC configurations are less efficient. As for segment characteristics, group memberships under the two methods differ by 25 percent with 12 clusters, a number chosen subjectively based on a scree plot and qualitative assessment.
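The article's comparison was run in XLSTAT, but the same kind of contrast can be sketched in Python with scikit-learn. The data below are random placeholders, not the OECD STAN export figures, so the agreement score is purely illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Random placeholder for the STAN table: 150 "countries" x 10 export-share attributes.
X = rng.random((150, 10))

k = 12  # cluster count chosen in the article from a scree plot
km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
ahc_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# Adjusted Rand index: 1.0 = identical partitions, ~0.0 = unrelated partitions.
agreement = adjusted_rand_score(km_labels, ahc_labels)
print(f"K-means vs. AHC partition agreement (ARI): {agreement:.2f}")
```

A metric such as the adjusted Rand index puts a single number on how much two clusterings of the same data disagree, which is one way to quantify the membership differences described above.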
Different configurations of PCA also yield noticeably different statistical efficiencies and associations. Using covariance instead of correlation to measure relationships achieves the greatest efficiency and generates the most meaningful output. Promax (oblique rotation) is used instead of Varimax (orthogonal rotation), recognizing that economic activities may be correlated. As the plots in Figure 1 show, correlation (whether Spearman's or Pearson's) generates murky principal components, while covariance creates more logical constructs and greater explained variance.
None of the techniques described in Figure 1 achieve particularly high levels of explained variance, in part reflecting the diverse nature of the global economy. However, the exercise illustrates how fundamental performance metrics and group formation can vary greatly even between commonly applied methods. And, of course, change the data set and the preferred methodology might change as well.
Figure 1: Principal component analysis on country exports.
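For readers who prefer code to spreadsheets, the covariance-versus-correlation distinction is easy to reproduce in Python: PCA on centered but unscaled data corresponds to the covariance approach, while PCA on standardized data corresponds to the correlation approach. The data here are synthetic stand-ins, chosen only to show how differently the two configurations can behave when variables sit on different scales:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in: five independent variables on very different scales.
X = rng.normal(size=(200, 5)) * np.array([100.0, 10.0, 1.0, 0.5, 0.1])

# Covariance-based PCA: PCA centres the data; original scales are kept.
cov_pca = PCA(n_components=2).fit(X)

# Correlation-based PCA: equivalent to running PCA on standardized data.
corr_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("covariance PCA, 2 components:", cov_pca.explained_variance_ratio_.sum())
print("correlation PCA, 2 components:", corr_pca.explained_variance_ratio_.sum())
```

Which configuration is "better" depends on whether the scale differences carry real information (covariance keeps them) or are just measurement artifacts (correlation removes them).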
Improving patient diagnosis and treatment
|Cluster analysis can be used to classify diseases. This example groups glioblastoma cases using TCGA image data, coded professional assessments and patient records. Dr. Lee Cooper and coauthors write in the Journal of the American Medical Informatics Association that such frameworks have the potential to improve preventive strategies and treatments.|
Source: National Center for Biotechnology Information
There aren’t any hard and fast rules about what data reduction method is optimal for a given situation. However, here are some general considerations to keep in mind:
1. Exploration and theory: Given the degree to which results can vary by technique, data reduction exercises are best treated as exploratory. Rather than relying on one method as the be-all and end-all, stay open to trying different approaches. This doesn't negate the importance of a theoretical foundation to shape discovery: techniques such as cluster analysis and PCA will group and compress any data, and without a priori assumptions and hypotheses, analysis can turn into a wild goose chase.
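The point that these techniques will group any data is easy to demonstrate. In this sketch (scikit-learn, illustrative numbers), K-means cheerfully partitions pure noise into whatever number of "clusters" it is asked for:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
noise = rng.random((300, 4))  # pure noise: no real group structure exists

# K-means will dutifully return six "clusters" anyway.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(noise)
print("cluster sizes:", np.bincount(labels))
```

The algorithm reports tidy cluster sizes either way; only theory and validation can tell you whether the groups mean anything.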
Optimizing mineral exploration
|Dimension reduction techniques enhance geological mapping exercises. In Geosphere, Norbert Ott, Tanja Kollersberger, and Andrés Tassara illustrate how principal component analysis with Landsat satellite data can improve mineral exploration efforts.|
2. Sparsity and dimensionality: Data with limited variation are likely to add little value to data reduction exercises and can be considered for exclusion. At the other end of the spectrum, too much dimensionality can complicate efforts. Discerning which attributes are most important up-front and preparing data accordingly play an important role in successful analysis.
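As one concrete way to act on the sparsity point, scikit-learn's VarianceThreshold drops near-constant attributes before analysis. The threshold value here is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(2)
X = rng.random((100, 4))
X[:, 1] = 0.5                                      # a constant column: no information
X[:, 3] = 0.5 + rng.normal(scale=1e-4, size=100)   # a near-constant column

# Drop attributes whose variance falls below a (hypothetical) cutoff.
selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)
print("kept columns:", selector.get_support())  # → [ True False  True False]
```

The same idea extends to dimensionality: deciding up front which attributes matter keeps the analysis from drowning in columns that add noise rather than signal.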
3. Scale, outliers and borders: Consistent scales are required so that variables with high magnitudes don’t dominate. Outliers can also skew results. For example, because K-means partitions the data space into Voronoi cells, extreme values can result in odd groupings. At the same time, carte blanche removal of outliers isn’t recommended given the insights that potentially lie in exceptional cases. Another challenge with K-means is the potential to incorrectly cut borders between clusters as it tends to create similarly sized partitions.
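A minimal illustration of the scaling point, using made-up numbers: without rescaling, the high-magnitude variable dominates Euclidean distance, which is exactly what distance-based methods such as K-means and AHC consume.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Annual income (dollars) vs. a 0-1 score: wildly different magnitudes.
X = np.array([[30_000.0, 0.9],
              [30_100.0, 0.1],
              [35_000.0, 0.9]])

# Unscaled, Euclidean distance is driven almost entirely by income.
d_raw = pdist(X)
# After standardization, both variables contribute comparably.
d_scaled = pdist(StandardScaler().fit_transform(X))

print("raw pairwise distances:   ", d_raw.round(1))
print("scaled pairwise distances:", d_scaled.round(2))
```

In the raw data the score column is effectively invisible; after standardization the distances are all of the same order of magnitude, so both variables can shape the clusters.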
4. Order and iterations: Source data should be randomized if the algorithm used is influenced by the order in which observations are processed. And if the technique employed involves random start points rather than fixed or user-defined ones, multiple iterations should be conducted to evaluate how consistent results are across consecutive runs.
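Taking scikit-learn's KMeans as an example of a random-start algorithm: you can either compare single-start runs across seeds by hand, or let the library keep the best of several initializations via its n_init parameter. The data are random placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.random((200, 5))

# One random start per run: the local optimum found can differ by seed.
inertias = [
    KMeans(n_clusters=8, init="random", n_init=1, random_state=s).fit(X).inertia_
    for s in range(5)
]
print("inertia across 5 single-start runs:", [round(i, 2) for i in inertias])

# n_init=10 runs ten initialisations internally and keeps the best result.
best = KMeans(n_clusters=8, init="random", n_init=10, random_state=0).fit(X).inertia_
print("best of 10 starts:", round(best, 2))
```

If the spread of inertia across seeds is large, results from any single run should be treated with suspicion.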
5. Mechanics and validation: Failure to understand the mechanics of data reduction methods could result in a sub-optimal framework and misinterpretation. Being expert in all the different techniques is a tall task, but the wealth of publicly available information makes learning as you go possible. With so many factors to consider, validation of results is also critical. Can results be substantiated statistically and logically? What do results look like visually? Can results be replicated and effectively used for prediction?
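One common statistical check for "can results be substantiated?" is the silhouette coefficient. A sketch on deliberately well-separated synthetic blobs, where validation should succeed:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Deliberately well-separated synthetic blobs: a case where validation should pass.
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=4)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to 1; values near 1 indicate compact,
# well-separated clusters.
score = silhouette_score(X, labels)
print(f"silhouette coefficient: {score:.2f}")
```

A low silhouette on real data doesn't prove the segmentation is useless, but it is a signal to revisit the cluster count, the distance measure or the data preparation.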
We’ve only scratched the surface of data reduction, covering just a few of the many techniques at a very high level. However, we’ve shown that even some of the more popular methods can generate significantly different results. Data reduction clearly falls into the camp of “part art, part science.” As such, it’s probably fair to say that there’s no “correct” approach. However, there are important considerations to be made when conducting analysis, perhaps the most important of which is the need to take an exploratory approach, open to the possibility that one size may not fit all.
|Dimension reduction can shed light on the genetic makeup of human populations and the relationships between them. In Nature, Dr. John Novembre and team highlight the close relationship between genetic and geographic distances using principal component analysis. They conclude that spurious associations can arise when conducting genetic mapping exercises if such considerations are not properly accounted for.|
Source: Nature, Nov. 8, 2008; reprinted by permission from Macmillan Publishers.
Will Towler (email@example.com) is an analytics and insights specialist. The views expressed in this article are those of the author and do not necessarily represent the views of an employer or business partners. He is a member of INFORMS.