Software Survey: Statistical software in the Age of the Geek
By James J. Swain
We are witnessing the emergence of a data-rich culture that has been made possible by intelligent devices, online commerce, social networks and scientific sensors – all ceaselessly creating, exchanging and collecting data. There is so much data that we have had to coin new words to describe the quantities – terabytes, petabytes, exabytes – just to keep up. What secrets can we divine from within this Everest of data, whether it leads to a responsive global supply chain or a window into the souls of consumers? There is a need to make sense of it all, so perhaps it is no surprise that Google’s chief economist Hal Varian declares that statisticians surely have the “really sexy job” for the coming decade, for who will be better placed to make sense of so much data? What is equally certain is that software tools will be needed to process and analyze the data.
Election Polls and Predictions
The recent presidential election provides a microcosm of the phenomenon. Interest in elections is so intense that even statistics become interesting when they tell a story about the winners and the losers. Opinion polling began early last century and has been steadily refined in technique and frequency; during this last election multiple polls were conducted and updated throughout the election cycle. Election predictions using a computer and early election results are now 60 years old, dating to a UNIVAC computer that predicted a landslide for Ike in 1952. The technology and analysis have grown in sophistication. Even better, the media have improved their ability to animate a story of progress or retreat in the quest for congressional seats or electoral votes. The experts, buttressed by numbers, confidently assessed the “bounce” of the party convention, the “loss” of a poor debate performance or the impact of the latest economic news.
The controversies over both opinion polls and key governmental statistics such as the unemployment rate indicate how important polling results were considered to be to the perception and eventual results of the campaigns. Nate Silver of The New York Times was maligned for his early prediction of an Obama win, which was based on a model that aggregated the results from a number of polls. Polling using small samples in a volatile election is always tricky, as opinions can shift in the time that it takes to obtain a complete sample, though these uncertainties should tend to average out when polls are combined.
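The averaging effect of combining polls can be illustrated with a short sketch. It pools several polls by weighting each one by its sample size, which is only one simple way to aggregate and is not a description of Silver's actual model. The poll figures and the function name are invented for illustration.

```python
import math

# Hypothetical polls: (share supporting candidate A, sample size).
polls = [(0.51, 900), (0.48, 1100), (0.52, 800), (0.50, 1200)]

def pooled_estimate(polls):
    """Sample-size-weighted average share and its approximate standard error.

    Treats the polls as independent simple random samples of the same
    population, which ignores house effects and shifts in opinion over time.
    """
    total_n = sum(n for _, n in polls)
    p_hat = sum(p * n for p, n in polls) / total_n
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / total_n)
    return p_hat, std_err

p_hat, std_err = pooled_estimate(polls)
print(f"pooled share: {p_hat:.3f}, standard error: {std_err:.4f}")
```

Under these assumptions the pooled standard error is smaller than that of any single poll, because the combined sample is larger than each individual one.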
Of course, we have come a long way since the Literary Digest fiasco of 1936, which predicted a landslide for Alf Landon in the presidential election based on a badly flawed sample of 2.4 million respondents. In the event, Landon carried only two states. In that same year George Gallup sampled only 50,000 respondents but was able to accurately predict the results for Franklin D. Roosevelt. Typical polls now take samples of only 800 to 1,200 respondents, with pains taken to ensure that sampling is random and representative of the population of interest (e.g., likely voters). During the election anything that might influence public opinion was questioned. This included government statistics such as the decline in the unemployment rate announced by the Bureau of Labor Statistics just before the election.
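A rough margin-of-error calculation shows why samples of 800 to 1,200 can suffice. The sketch below assumes simple random sampling, a 95 percent confidence level (z = 1.96) and the worst-case proportion p = 0.5. The function name is illustrative and does not come from any package in the survey.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate half-width of a confidence interval for a proportion.

    Assumes a simple random sample of size n; real polls apply weighting
    and likely-voter screens that typically widen this figure.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

for n in (800, 1200, 50_000):
    print(f"n = {n:>6}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

Under these assumptions, samples of 800 and 1,200 give margins of roughly 3.5 and 2.8 percentage points. The formula covers only random sampling error; it says nothing about a flawed sampling frame, which is why sheer size did not rescue the Literary Digest poll.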
On election night, we were treated to a deluge of results, both visual and numerical, from battleground states such as Ohio, Pennsylvania, Virginia and Florida. Each state had its own choropleth map, with results summarized by county or precinct and dissected according to various demographic groupings. The great unknown through the evening was the number of voters in various categories and their turnout, since the relative rate at which these groups supported individual candidates was better estimated.
While the presentation of data was impressive during the election, perhaps the most interesting story was the highly effective use that the Obama campaign made of analytics to fine-tune donation pitches, simplify the process of contacting and supporting the campaign, make media buys and coordinate field organizers. As documented in the 90-page report, “Inside the Cave,” the technology team for Obama was significantly larger than that of the Romney campaign, and the Obama campaign had more than four times the number of donors and five times the number of e-mail addresses. The staff was drawn from technologists, volunteers who were recruited directly from technology hotbeds such as Silicon Valley (including Rayid Ghani, former director of analytics research at Accenture). The Obama campaign raised more than $690 million online, increasing both the number of individual donors and the average donation compared to 2008. Throughout the race the campaign experimented with a variety of messages, monitored the results and fine-tuned its pitches, and even the subject lines of e-mails, to make them as effective as possible. The campaign hired exclusively for technology skills and gave its hires free rein to learn from the data with “little to no interference from campaign management on content.” The team also performed extensive surveys and tracking polls in key states, increasing in frequency as the election neared. The team used the data collected to build models and performed simulations to gauge progress and to allocate resources dynamically.
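Testing alternative messages, as described above, comes down to comparing response rates between variants. The sketch below applies a standard two-proportion z-test to two e-mail subject lines. It is a generic textbook method, not the campaign's documented procedure, and the counts are hypothetical.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two response rates.

    Uses the pooled proportion for the standard error, so it assumes at
    least one response and at least one non-response overall.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - normal_cdf)
    return z, p_value

# Hypothetical counts: donations received out of e-mails sent per subject line.
z, p_value = two_proportion_z_test(120, 10_000, 90, 10_000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

A small p-value suggests the difference in response rates is unlikely to be due to chance alone, although running many such comparisons calls for some correction for multiple testing.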
During the presidential campaign, the news media thrived on colorful and dynamic visual representations of the data. The tools in this statistical software survey provide a variety of ways to present data to make comparisons, demonstrate trends or search for outliers within the data. Further tools and some guidelines for good graphical design are provided in “Visualize This” (Yau, 2011). The author has a website, FlowingData (http://flowingdata.com/), that provides further examples. The website GapMinder (http://www.gapminder.org/) provides examples of animated graphics that offer methods of exploring multiple variables over time among different regions of the world. The stated aim of the website is to provide information about the world (e.g., life expectancy, income, birthrates) to facilitate discussion. The use of animation provides an additional way to perceive relations among multiple variables as a function of another variable, such as time.
The computer and Internet have revolutionized the acquisition and storage of data, and access to it, in addition to providing the computational power to process the data numerically. Public and private sources exist for a wide range of data, including demographic, production, investment, monetary, infection, disease and mortality data, along with all manner of consumer and operational data. This has given rise to new techniques in multivariate dimension reduction, data mining and machine learning.
The computer has made it possible to expand the range of what is meant by data, from purely numerical data to text and graphics. Documents and e-mails can be compared using word frequencies and even stochastic models, and user choices online can be logged and analyzed for patterns. Prediction of user choices is useful both to accelerate searching and to match advertising to users. To obtain an idea of the magnitude of data available, consider the data provided in the recent Netflix prize competition. The objective was to predict user movie ratings based on a training data set of more than 100 million ratings from over 480,000 users. The qualifying data used to evaluate the proposed algorithms consisted of more than 2.8 million user and movie combinations whose ratings had to be predicted. Of course, even these huge samples are but a fraction of the data collected on the Web.
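As a small illustration of comparing documents by word frequencies, the sketch below counts words in each text and measures the cosine similarity between the resulting count vectors. Cosine similarity is one common choice among many, and the sample sentences and function names are made up for the example.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count runs of letters or apostrophes as words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = "The poll shows the candidate ahead in Ohio"
doc2 = "The candidate is ahead in the Ohio poll"
doc3 = "Quarterly revenue rose on strong movie rentals"

sim_related = cosine_similarity(word_counts(doc1), word_counts(doc2))
sim_unrelated = cosine_similarity(word_counts(doc1), word_counts(doc3))
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The first two sentences share most of their vocabulary and score close to 1, while the third shares no words with the first and scores 0. Practical systems usually also down-weight very common words before comparing.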
Statistical Software Survey
The 2013 survey of statistical software provides capsule information about various products and vendors. The tools range from general packages that cover the important techniques of inference and estimation to specialized ones for activities such as nonlinear regression, forecasting and design of experiments. The product information contained in the survey was obtained from product vendors and is summarized in the following tables to highlight general features, capabilities and computing requirements and to provide contact information. Many of the vendors have their own websites for further, detailed information, and many provide demonstration programs that can be downloaded from these sites. No attempt is made to evaluate or rank the products, and the information provided comes from the vendors themselves. To view the survey data, click here. Vendors that were unable to make the publishing deadline will be added.
Statistical add-ins for spreadsheets remain popular and provide enhanced specialized capabilities. The spreadsheet is the primary computational tool in a wide variety of settings, familiar and accessible to all. Many procedures of data summarization, estimation, inference, basic graphics and even regression modeling can be added to spreadsheets in this way. An example is the Unistat add-in for Excel. The functionality of spreadsheet add-ins continues to grow and now includes risk analysis and Monte Carlo sampling, as in Oracle Crystal Ball.
Dedicated general and special purpose statistical software generally offers a wider variety and greater depth of analysis than is available in add-in software. For many specialized techniques such as forecasting, design of experiments and so forth, a statistical package is appropriate. In general, statistical software plays a distinct role on the analyst’s desktop, and provided that data can be freely exchanged among applications, each part of an analysis can be made with the most appropriate (or convenient) software tool.
An important feature of statistical programs is the importation of data from as many sources as possible to eliminate the need for data entry when data is already available from another source. Most programs have the ability to read from spreadsheets and selected data storage formats. Also highly visible in this survey is the growth of data warehousing and “data mining” capabilities, programs and training. Data mining tools attempt to integrate and analyze data from a variety of sources (and purposes) to look for relations that could not be found in the individual data sets. Within the survey we observe several specialized products, such as STAT::FIT, which is more narrowly focused on distribution fitting than on general statistics but is of particular use to developers of stochastic models and simulation.
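For readers unfamiliar with distribution fitting, the sketch below shows the idea in its simplest form: estimate the rate of an exponential distribution by maximum likelihood and measure how far the fitted curve lies from the data. It is a generic illustration on synthetic data and makes no claim about how STAT::FIT or any other surveyed product works. All names are hypothetical.

```python
import math
import random

def fit_exponential(samples):
    """Maximum-likelihood rate for an exponential distribution: 1 / sample mean."""
    return len(samples) / sum(samples)

def ks_statistic(samples, rate):
    """Largest gap between the empirical CDF and the fitted exponential CDF."""
    xs = sorted(samples)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        fitted = 1.0 - math.exp(-rate * x)
        # Compare against the empirical CDF just before and just after x.
        worst = max(worst, abs(fitted - i / n), abs((i + 1) / n - fitted))
    return worst

random.seed(1)  # fixed seed so the sketch is repeatable
service_times = [random.expovariate(2.0) for _ in range(1000)]  # synthetic data
rate_hat = fit_exponential(service_times)
gap = ks_statistic(service_times, rate_hat)
print(f"fitted rate: {rate_hat:.3f}, KS statistic: {gap:.3f}")
```

A simulation modeler would typically repeat this for several candidate distributions and keep the one with the best fit, which is the kind of comparison that dedicated fitting tools automate.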
James J. Swain (email@example.com) is professor and chair, Department of Industrial and Systems and Engineering Management, at the University of Alabama in Huntsville. He is a senior member of INFORMS. He is also a member of ASA, IIE, and ASEE.
- “Inside the Cave: An In-depth look at the digital, technology, and analytics operations of Obama for America,” Engage Research, www.engagedc.com/inside-the-cave/.
- Nathan Yau, 2011, “Visualize This: The FlowingData Guide to Design, Visualization and Statistics,” Wiley, New York.
Survey Directory & Data
To view the directory of statistical software products along with the survey data, click here.