Statistical Software Survey: Power to the people
Survey of popular analytical software includes wide range of tools.
By James J. Swain
Statistics has been with us since the dawn of time. We have clay tablets from Babylonian censuses from as early as 3800 BC, records from Egypt from 2300 BC, and recall the Roman census in AD 7 as mentioned in Luke. Statistical data was used to assess populations, set taxes and quantify the need for food. By the 18th century, a political economy of statistics was part of intellectual discourse, as evidenced by “The Wealth of Nations” (published in 1776) and the development of graphical displays by William Playfair. Florence Nightingale is credited with improvements in the use of statistical graphics to communicate where words alone were ineffective. Now, in the age of ratings, nearly continuous polling, Six Sigma and big data, it is hard to conceive of a world without statistics.
While statistics is essential, many are intimidated by the subject and often recount their experiences in “stat class” in terms of tribulations endured rather than mastery obtained. A tribute to R.A. Fisher by his colleagues at Rothamsted Experimental Station captures the feelings that many have when confronted with actual data in its complexity – what now?
Why! Fisher can always allow for it,
All formulae bend to his will.
He’ll turn to his staff and say, ‘Now for it!
Put the whole blinking lot through the mill.’
Fortunately, statistical software has largely eliminated the need for a Fisher to handle the computations for statistical procedures, making these procedures available virtually anywhere. The software is a valuable tool for both the analyst and engineer seeking to build models and to improve products and processes through designed experiments. Yet they are not the only ones who can take advantage of useful visual displays, avoid computational drudgery and perform statistical analysis without need for specialized assistance. This “democratization” of analysis has greatly multiplied the number of users of statistics in the workplace and vastly expanded the use of statistics as well.
Statistical analysis in the workplace is not new. Many of the tools for quality and design of experiments predate World War II and came to prominence immediately after the war. The revolution has been the shift from analysis directed primarily by the technical staff to analysis initiated by the workers themselves. That shift gained credence in the 1980s and 1990s in conjunction with lean production and the attention given to its use at Toyota in particular. Americans such as Shewhart, Deming, Dodge and Juran developed and popularized the methods in industry, but broad public attention came from their widespread application in Japanese industry and later in American companies such as Motorola and General Electric.
Commonplace in Workplace
The revolution has been greatly assisted by the statistical tools represented in this year’s survey, combined with almost ubiquitous access to computing, which have brought statistical tools to the front lines of work. Whether billed as a Six Sigma process improvement or part of a Kaizen event or experimental analysis project, these tools put the analysis where the problems are being defined, the measurements made and, above all, where the solutions can be implemented. Thus the key steps in the well-known DMAIC methodology can be applied where they are needed. Workplace improvements have become both commonplace and an accepted part of practice in recent decades.
While workplace analysis is often dominated by traditional small-sample statistics, the growth of communications and automation has meant unprecedented collections of data whose size is measured in terabytes and even petabytes, giving rise to the new field of analytics. All of our calls are tracked for billing, our tweets and posts collected, online searches noted, and our browsing and purchasing on Netflix, Amazon or eBay recorded and stored. All of these data are studied in order to yield insights for marketing and forecasting or to improve service quality or assist in fraud detection. The growing field of analytics combines methods of statistics and heuristics from computer science called “machine learning” to sift for patterns and connections amid the data for competitive advantage and knowledge.
Even as our activities are logged and stored, physical measurement itself is increasingly automated, and fields such as astronomy and physics are increasingly dominated by large data sets. The recent discovery of the Higgs boson derives from the records of trillions of particle tracks generated by the accelerator and analyzed to identify the evidence of a particular type of interaction that revealed the existence of the rare and highly sought particle. Astronomy likewise relies on observations of the cosmos collected under computer control. These images are stored and searched not just to catalog the objects they show but for evidence of perturbations in the orbits of the observed bodies or brief eclipses by unobserved (dark) bodies that might signal the existence of exoplanets. In a similar way, analytics attempts to use the measurements of our lives to infer what we might want to purchase, visit or look up in our next search.
Statistical analysis does not stand still, and the challenges of analytics have provided opportunities for new methods of analysis. Many of these are implemented in free (user-supported) software tools such as R and Python. In fact, SAS, R and increasingly Python are the mainstays of analytics. Both R and Python are programming languages that have packages for a wide range of analysis methods. They are widely disseminated and available to anyone. They offer low cost and flexibility in exchange for a steeper learning curve than most alternatives, and new procedures can be contributed and disseminated quickly.
An interesting development in analytics is open competitions to encourage development of new algorithms. For instance, Netflix conducted a competition between 2006 and 2009 to find an algorithm for predicting user movie ratings using only their earlier ratings that would outperform their own Cinematch algorithm by at least 10 percent. They awarded the grand prize of $1 million in 2009. Since then there have been many similar competitions, and the Kaggle (www.kaggle.com) website lists 145 completed challenges in a wide variety of fields with another 17 currently active. Not all award prizes as Netflix did, but many do, and the competitions seem to attract hundreds (and sometimes thousands) of entries.
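To make the 10 percent target concrete, the short Python sketch below scores two sets of predictions by root-mean-square error (RMSE), the accuracy measure used in the Netflix contest, and reports the relative improvement of one over the other. The ratings and function names are invented purely for illustration and are not drawn from the contest data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fractional reduction in RMSE relative to the baseline."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical ratings on a 1-5 scale (made up for this example).
actual = [4, 3, 5, 2, 4]
baseline = [3, 3, 4, 3, 3]      # stand-in for an existing algorithm
candidate = [4, 3, 4, 2, 3.5]   # stand-in for a challenger

base_err = rmse(baseline, actual)
cand_err = rmse(candidate, actual)
gain = relative_improvement(base_err, cand_err)

print(f"baseline RMSE:  {base_err:.4f}")
print(f"candidate RMSE: {cand_err:.4f}")
print(f"improvement:    {gain:.1%}")
print("meets 10% target" if gain >= 0.10 else "falls short of 10% target")
```

On real data the gap between competing algorithms is far smaller than in this toy case, which is part of why the prize took three years to claim.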
Statistical Software Survey
This year’s survey of statistical software products provides capsule information about 15 products selected from 11 vendors. The products range from general tools that cover the important techniques of inference and estimation to specialized tools for activities such as nonlinear regression, forecasting and design of experiments. The product information contained in the survey was obtained from product vendors and is summarized in the survey tables to highlight general features, capabilities and computing requirements, and to provide contact information. Many of the vendors have their own websites for further, detailed information, and many provide demonstration programs that can be downloaded from these sites. No attempt is made to evaluate or rank the products, and the information provided comes from the vendors themselves. Vendors that were unable to make the publishing deadline will be added to the online survey.
Products that provide statistical add-ins for use with spreadsheets remain popular and give spreadsheets enhanced, specialized capabilities. The spreadsheet is the primary computational tool in a wide variety of settings, familiar and accessible to all. Many procedures of data summarization, estimation, inference, basic graphics and even regression modeling can be added to spreadsheets in this way. An example is the Unistat add-in for Excel. The functionality of products for use with spreadsheets continues to grow and now includes risk analysis and Monte Carlo sampling, as in Oracle Crystal Ball.
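For readers unfamiliar with Monte Carlo sampling in risk analysis, the following Python sketch shows the general idea: uncertain inputs are drawn repeatedly from assumed distributions, and the resulting totals are summarized. It is a minimal illustration using only the standard library, not a description of how any surveyed product works; the cost items, the triangular distributions and the budget figure are all hypothetical.

```python
import random


def simulate_total_cost(rng, items):
    """Draw one total cost; each item is (low, mode, high) for a triangular distribution."""
    # random.triangular takes its arguments in the order (low, high, mode).
    return sum(rng.triangular(low, high, mode) for low, mode, high in items)


def monte_carlo(items, budget, trials=10_000, seed=1):
    """Estimate the mean total cost and the probability of exceeding the budget."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    totals = [simulate_total_cost(rng, items) for _ in range(trials)]
    mean_cost = sum(totals) / trials
    prob_over = sum(t > budget for t in totals) / trials
    return mean_cost, prob_over


# Hypothetical cost items as (low, most likely, high), in arbitrary units.
items = [(8, 10, 15), (4, 5, 9), (18, 20, 30)]

mean_cost, prob_over = monte_carlo(items, budget=45)
print(f"estimated mean total cost: {mean_cost:.2f}")
print(f"estimated chance of exceeding budget: {prob_over:.1%}")
```

A spreadsheet add-in performs the same kind of repeated sampling behind the scenes, with cells playing the role of the uncertain inputs.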
Dedicated general and special-purpose statistical software generally has a wider variety and depth of analysis than is available in the add-in software. For many specialized techniques such as forecasting, design of experiments and so forth, a statistical package is appropriate. In general, statistical software plays a distinct role on the analyst’s desktop, and provided that data can be freely exchanged among applications, each part of an analysis can be made with the most appropriate (or convenient) software tool.
An important feature of statistical programs is the importation of data from as many sources as possible, which eliminates the need for data entry when data is already available from another source. Most programs have the ability to read from spreadsheets and selected data storage formats. Within the survey we observe several specialized products, such as Stat::Fit and ExpertFit, which are more narrowly focused on distribution fitting than on general statistics, but are of particular use to developers of models for stochastic systems, reliability and risk.
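Distribution fitting, in its simplest form, means estimating the parameters of several candidate distributions from a sample and comparing how well each explains the data. The Python sketch below does this for just two candidates, exponential and normal, using maximum-likelihood estimates and the log-likelihood as the yardstick. It is an illustrative sketch only; the sample is simulated, the function names are invented, and dedicated products consider many more distributions and formal goodness-of-fit tests.

```python
import math
import random
import statistics


def fit_exponential(data):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    return 1.0 / statistics.fmean(data)


def exponential_loglik(data, rate):
    """Log-likelihood of the data under an exponential distribution."""
    return sum(math.log(rate) - rate * x for x in data)


def fit_normal(data):
    """Maximum-likelihood mean and standard deviation of a normal distribution."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)  # population form is the ML estimate
    return mu, sigma


def normal_loglik(data, mu, sigma):
    """Log-likelihood of the data under a normal distribution."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)


# Simulated service times drawn from an exponential with rate 0.5 (assumed).
rng = random.Random(42)
sample = [rng.expovariate(0.5) for _ in range(2000)]

rate = fit_exponential(sample)
mu, sigma = fit_normal(sample)
ll_exp = exponential_loglik(sample, rate)
ll_norm = normal_loglik(sample, mu, sigma)

print(f"exponential: rate={rate:.3f}, log-likelihood={ll_exp:.1f}")
print(f"normal: mean={mu:.3f}, sd={sigma:.3f}, log-likelihood={ll_norm:.1f}")
print("better fit:", "exponential" if ll_exp > ll_norm else "normal")
```

The fitted distribution can then serve as an input model for a simulation of a stochastic system, which is the use the specialized products target.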
James J. Swain (firstname.lastname@example.org) is a professor in the Department of Industrial and Systems and Engineering Management at the University of Alabama in Huntsville. He is a member of ASA, INFORMS, IIE and ASEE.
Survey Directory & Data
To view the statistical software survey products and results, along with a directory of statistical software vendors, click here.
- Box, Joan Fisher, 1978, “R. A. Fisher: The Life of a Scientist,” Wiley.