Statistical Software Survey: A Long Way From Flip Charts
Survey of statistical software highlights a growing list of capabilities.
By James J. Swain
All U.S. presidential elections are historic, but the most recent one was a doozy! The nomination and election of Barack Obama as president was dramatic, exciting and filled with surprises. Prior to the Iowa Caucus, for instance, Hillary Clinton was thought to be a shoo-in for the Democratic nomination, and her campaign expected that John Edwards would be her chief rival. Prior to his victory in New Hampshire, John McCain’s campaign for the Republican nomination was declared dead. Over the course of the election, voting patterns changed, the distribution of the red and blue states was altered, and an unexpected tidal wave of new, exhilarated younger voters emerged, determined to make their voices heard.
With so much uncertainty, it was difficult to understand what was going on. Yet with so much at stake, we needed to do just that. There was no shortage of statistics used to explain the surprising events. Throughout the complex story, we were inundated with polling results as the commentators attempted to make sense of what was happening and to predict what it might mean. Enlivening the entire process was the dynamic “Magic Wall” at CNN, presided over by John King. As the Washington Post reported a year ago, King “began poking, touching and waving at the screen like an over-caffeinated traffic cop,” setting in motion “a series of zooming maps and flying pie charts” that he could call forth and as quickly replace with results down to the precinct level, returning just as easily to “what-if” scenarios involving delegate counts. Rarely has a news show attempted to portray so much data and so many relations among factors such as region, age, political affiliation and socio-economic status. We have come a long way from the flip charts of Ross Perot or the dry erase board that Tim Russert used in 2000 to explain the results in Florida on election night.
Converting Data into Information
Statistical software is the key to converting the flood of data into information that can be grasped and acted upon. As we observe the centennial of Gosset’s publication of the “Student” t-distribution, which helped usher in the modern development of statistics for small samples, the challenge has shifted to dealing with large data sets and databases. Increasingly, statistical analysis is used to delve into complex multivariate relations using large amounts of data. Traditional statistics was dominated by the need to make the most of limited data; now automation and electronic sensing mean that data is plentiful and sometimes overwhelming. Entire new areas of statistical analysis are growing to meet this challenge, such as data mining, astrostatistics, multivariable quality measures and genomics, to name a few examples.
As the Magic Wall suggests, graphics provide a powerful way of presenting data, and the dynamic exploration of various graphics can be a powerful method of uncovering underlying relations within the data. Statistical software makes it possible to examine data in multiple ways, to perform comparisons and to look for multivariate relations. Since the plotting is limited to two- and three-dimensional projections of data, several methods have evolved to assist in forming a coherent view of the complete picture. In the matrix plot, for instance, variables are listed by row and column and the pairwise scatter plots occur in each position (sometimes with the univariate histograms in the diagonal positions). Hence, the scatter plots in each row represent the joint distribution of that variable with the variables represented by the columns.
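The matrix-plot arrangement described above can be sketched in a few lines of code. This is a minimal illustration, not any particular package’s implementation; it assumes numpy and matplotlib are available, and the three variables are simulated for demonstration.

```python
# A minimal scatterplot-matrix sketch: pairwise scatter plots by row and
# column, with univariate histograms on the diagonal (illustrative data).
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = {"x": x,
        "y": 0.8 * x + rng.normal(scale=0.5, size=200),  # correlated with x
        "z": rng.normal(size=200)}                        # independent
names = list(data)
n = len(names)

fig, axes = plt.subplots(n, n, figsize=(6, 6))
for i, row in enumerate(names):
    for j, col in enumerate(names):
        ax = axes[i, j]
        if i == j:
            ax.hist(data[row], bins=15)            # univariate view
        else:
            ax.scatter(data[col], data[row], s=5)  # pairwise joint view
        if i == n - 1:
            ax.set_xlabel(col)
        if j == 0:
            ax.set_ylabel(row)
fig.tight_layout()
fig.savefig("matrix_plot.png")
```

Each row of the grid shows the joint distribution of that row’s variable against every column variable, so the x–y panels reveal the built-in correlation at a glance.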
Of course, these essentially marginal views cannot provide the entire story. For instance, correlations may exist within groups of variables that are not observable in the pairwise scatter plots. In these cases, additional insight can be obtained through a transformation of coordinates based upon principal components or factor analysis. Such techniques are often used when the underlying variation consists of groups of related variables, and these underlying factors are often fewer in number than the variables that are directly observed and plotted. In another example, Wainer illustrates graphically how rotation of data from the RANDU random number generator can highlight the limitations of the generator, whose points fall on only 15 planes in 3-space.
Interactive graphical methods can also be used to explore relations within data. Brushing is a method in which a point or group of points on a given plot can be highlighted in linked displays, or simply used to locate those points within the data for detailed examination. Another form of interactive graphics is obtained through slicing. In this case a variable is designated for slicing, and the data is divided into the sets above and below the slice value of that variable. As the slice point is varied, the corresponding characteristics of the linked displays are highlighted. In Wainer’s illustration, the slice cuts the set of residuals to reveal that the positive residuals come from a limited set of data sources, in this case a particular industrial sector.
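The slicing operation itself is simple to express in code, even without the interactive display. The sketch below uses invented sector and residual values purely to mirror the shape of Wainer’s example; the point is the partition at the slice value and the "linked" summary of each half.

```python
# A "slicing" sketch: designate one variable (the residual) as the slice
# variable, split the data at a threshold, and summarize a linked
# variable in each half. Data and sector names are invented.
records = [
    {"sector": "chemicals", "residual": 1.8},
    {"sector": "chemicals", "residual": 2.1},
    {"sector": "textiles",  "residual": -0.4},
    {"sector": "textiles",  "residual": -1.1},
    {"sector": "machinery", "residual": 0.2},
    {"sector": "machinery", "residual": -0.3},
]

slice_value = 1.0  # the movable slice point
above = [r for r in records if r["residual"] > slice_value]
below = [r for r in records if r["residual"] <= slice_value]

# The linked view: which sectors contribute the large positive residuals?
sectors_above = {r["sector"] for r in above}
print(sectors_above)
```

Varying `slice_value` and recomputing the two sets is exactly what the interactive tool animates.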
Automated data collection by satellites such as the Hubble Space Telescope has helped to change the amount of data available to cosmologists. Freeman, Richards et al. note that cosmology was a data-starved field until the advent of new collection devices. In the last few decades this has greatly changed; they point to examples such as the 200 million objects now available in the Sloan Digital Sky Survey. Statistical analysis is being combined with astronomy to produce the interdisciplinary field of astrostatistics. They illustrate how the analysis of the increased amount of data is providing greater resolution in their models and model parameters, as well as revealing underlying relations within the data. For example, in one case nonlinear methods reveal a one-dimensional manifold that explains the relative distances between objects better than Euclidean distance does.
Whether in astronomy or elections, one key to understanding is to identify groupings of similar items within the data. Classification methods seek relatively homogeneous subgroups that preserve the overall variability within the population while reducing the number of distinct groupings. Cluster analysis attempts to identify groupings of the most similar objects. In the latest election this kind of classification was indirectly the basis of several anecdotal groupings, including “Joe Six-pack” and “Hockey Moms,” who join the “Soccer Moms” of earlier elections. Such classification schemes are heavily used in marketing and politics. In simulation modeling, clustering can be used as a basis for aggregation, allowing the range of inventory items, for instance, to be represented by a smaller group of aggregated, representative items with little loss in validity and a great gain in simplicity and efficiency.
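A common clustering workhorse is k-means, which alternates between assigning items to their nearest centroid and moving each centroid to the mean of its assigned items; the final centroids are exactly the "aggregated, representative items" mentioned above. The sketch below is a minimal numpy version on synthetic data, with deterministic starting points chosen for clarity rather than any production initialization scheme.

```python
# A minimal k-means clustering sketch in numpy. Two well-separated
# synthetic groups are recovered, and each is summarized by a centroid.
import numpy as np

rng = np.random.default_rng(2)
group_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
points = np.vstack([group_a, group_b])

k = 2
centroids = points[[0, -1]].copy()  # deterministic seeds, one per group
for _ in range(20):
    # assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move each centroid to the mean of its assigned points
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(np.round(np.sort(centroids[:, 0]), 1))
```

In an inventory-aggregation setting, each item’s feature vector (demand rate, cost, lead time, ...) would replace the 2-D points, and the centroids would stand in for whole groups of items.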
Finally, statistical methods are often used to detect the faint signals of interest within otherwise noisy data. In quality control, for instance, variations within specified limits usually represent normal, random variation, whereas systematic trends signal an assignable cause to be investigated. The problem is similar to those faced in homeland security and credit-fraud detection: in both cases the goal is to isolate the indications of a threat or an unauthorized transaction from among the myriad of private messages or normal financial transactions. Statistical methodology can play a role in a variety of similar areas, such as environmental monitoring and disease prevalence, which also have potential implications for homeland defense. The American Statistical Association has interest groups in defense and security and is expanding the visibility of these problems to member statisticians.
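The quality-control case can be made concrete with the simplest possible control-chart logic: estimate the center line and limits from in-control history, then flag new observations outside three standard deviations. This is a hedged sketch with simulated measurements, not a full Shewhart chart (no run rules, no subgrouping).

```python
# A simple three-sigma control-limit check: points outside the limits
# are flagged as candidate assignable causes. All data is simulated.
import numpy as np

rng = np.random.default_rng(3)
in_control = rng.normal(loc=10.0, scale=0.2, size=100)  # historical data
center = in_control.mean()
sigma = in_control.std(ddof=1)
upper, lower = center + 3 * sigma, center - 3 * sigma

# New measurements, with one deliberate out-of-control shift at index 3
new = np.array([10.1, 9.9, 10.0, 11.5, 10.2])
flags = (new > upper) | (new < lower)
print("flagged indices:", [i for i, f in enumerate(flags) if f])
```

The same pattern — a model of "normal," plus a rule for what counts as surprising — underlies the fraud- and threat-detection problems mentioned above, though those settings demand far richer models of normal behavior.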
Statistical Software Survey
In conjunction with this article, we recently conducted a statistical software products survey [www.lionhrtpub.com/orms/surveys/sa/sa-survey.html] that provides capsule information about a variety of products and vendors. The tools range from general packages that cover the standard techniques of inference and estimation to specialized products for activities such as nonlinear regression, forecasting and design of experiments. The product information contained in the survey was obtained from vendors and is summarized in tables that highlight general features, capabilities and computing requirements, and that provide contact information. Many of the vendors have extensive Web sites with further, detailed information, and many provide demo programs that can be downloaded from these sites. No attempt was made to evaluate or rank the products; the information provided comes from the vendors themselves. Vendors that were unable to make the publishing deadline will be added to the online survey.
Products that provide statistical add-ins available for use with spreadsheets remain common. The spreadsheet is the primary computational tool in a wide variety of settings, familiar and accessible to all. Many procedures of data summarization, estimation, inference, basic graphics and even regression modeling can be added to spreadsheets in this way. An example is the Unistat Add-in for Excel. The functionality of products for use with spreadsheets continues to grow, including risk analysis and Monte Carlo sampling.
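The risk-analysis and Monte Carlo capabilities mentioned above amount to sampling uncertain inputs and propagating them through a model, the same loop a spreadsheet add-in automates behind the scenes. The sketch below is an invented profit model with made-up distributions, intended only to show the shape of the calculation.

```python
# A Monte Carlo risk-analysis sketch: sample uncertain demand, price and
# cost, push them through a simple profit model, and estimate the chance
# of a loss. All distributions and figures are illustrative inventions.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
units = rng.triangular(left=800, mode=1000, right=1200, size=n)  # demand
price = rng.normal(loc=10.0, scale=0.5, size=n)                  # unit price
unit_cost = rng.normal(loc=7.0, scale=0.8, size=n)               # unit cost
fixed_cost = 2500.0

profit = units * (price - unit_cost) - fixed_cost
print(f"mean profit: {profit.mean():,.0f}")
print(f"estimated P(loss): {(profit < 0).mean():.3f}")
```

A point estimate (1,000 units at a $3 margin less $2,500 fixed cost) would suggest a comfortable $500 profit; the simulation shows how often the combined input uncertainty turns that into a loss.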
Dedicated general and special-purpose statistical software generally offers a wider variety and depth of analysis than is available in the add-in software. For many specialized techniques, such as forecasting and design of experiments, a statistical package is appropriate. Moreover, new procedures are likely to become available first in statistical software and only later be added to the add-in software. In general, statistical software plays a distinct role on the analyst’s desktop, and, provided that data can be freely exchanged among applications, each part of an analysis can be made with the most appropriate (or convenient) software tool.
An important feature of statistical programs is the ability to import data from as many sources as possible, eliminating the need for data entry when data is already available from another source. Most programs can read from spreadsheets and selected data storage formats. Also highly visible in this survey is the growth of data warehousing and “data mining” capabilities, programs and training. Data mining tools attempt to integrate and analyze data from a variety of sources (and purposes) to look for relations that would not be apparent from the individual data sets. Within the survey we observe several specialized products, such as STAT::FIT, which are more narrowly focused on distribution fitting than on general statistics, but are of particular use to developers of stochastic models and simulations.
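Distribution fitting of the kind such specialized products perform can be illustrated with the simplest case: fitting an exponential distribution by maximum likelihood, where the rate estimate is just the reciprocal of the sample mean. This numpy sketch uses simulated "service time" data and is only in the spirit of those tools, which also handle many families and goodness-of-fit testing.

```python
# A small distribution-fitting sketch: fit an exponential distribution
# to simulated service times by maximum likelihood. For the exponential,
# the MLE of the rate is 1 / (sample mean).
import numpy as np

rng = np.random.default_rng(5)
true_rate = 0.5                                   # events per minute
times = rng.exponential(scale=1 / true_rate, size=2000)

rate_hat = 1.0 / times.mean()
print(f"fitted rate: {rate_hat:.3f} (true rate {true_rate})")
```

The fitted distribution can then drive a stochastic simulation in place of the raw data, which is exactly the modeler’s use case the survey entry describes.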
James J. Swain (email@example.com) is professor and chair of the Department of Industrial and Systems Engineering and Engineering Management at the University of Alabama in Huntsville. He is a member of INFORMS, IIE, ASA and ASEE.
1. Farhi, Paul, 2008, “CNN Hits The Wall for the Election,” Washington Post, Feb. 5, 2008, page C01.
2. Freeman, Peter; Richards, Joseph; Schafer, Chad; and Lee, Ann, 2008, “Astrostatistics: The Final Frontier,” Chance, Vol. 21, No. 3, pp. 31-35.
3. Wainer, Howard, 2000, “Visual Revelations: New Tools for Exploratory Data Analysis: II. Rotatable SPLOMS and a Slicing Engine,” Chance, Vol. 13, No. 4, pp. 45-47.
To view the directory of Statistical Software products along with the survey data, see: www.lionhrtpub.com/orms/surveys/sa/sa-survey.html.