Best Statistical Practice: Preparing for the coming flood … of statistical malfeasance
May/June 2015
Identifying and understanding statistics problems.
By Randy Bartlett, CAP
“The non-statistician cannot always recognize a statistical problem when he sees one.” — W. Edwards Deming
“The key element for a successful (big) data analytics and data science future is statistical rigor and statistical thinking of humans.” — Diego Kuonen
For business, the recent growth in fact-based decision-making has provided a path to innovative new products and an escape for companies in disrupted industries. Over the coming years, we should expect a corresponding growth in statistical malfeasance. How large, you may well ask. We cannot be sure; even measuring today’s statistical malfeasance is difficult. In addition to market forces, an important ingredient in this flood of data and analyses is a number of misunderstandings about the purview of statistics, statistical thinking and the underlying statistical assumptions.
In my experience, the best protection from statistical malfeasance is to leverage what I call the three pillars for best statistical practice: statistical qualifications, diagnostics and review (QDR; see “A Practitioner’s Guide To Business Analytics” [1]). The better we understand statistics problems, the better we can identify the best statistical qualifications, interpret the right statistical diagnostics and apply an appropriate statistical review.
Figure 1: Mathematics problem – navigating a port.
We need to use statistical diagnostics to measure the accuracy and reliability of results. Diagnostics matter far more for statistics problems, which, unlike mathematics problems, do not have a single answer that we can deduce. We need statistical review to continuously improve decision-making, data analysis and data management. Again, these three pillars are best facilitated using a savvy understanding of statistics problems.
This article provides a practical and conceptual problem-based definition of statistics, rather than the usual mathematical, probabilistic and algorithmic quicksand that typically dominates the first three introductory college statistics classes. (There are few essay questions in stat classes.) This “mechanical thinking” has led to associating the underlying statistical assumptions with the tools rather than the problems, thus enabling the tragic delusion that we can find some “un-statistics” tool that will allow us to fudge, rather than embrace, both the assumptions and the proper thinking.
Mathematics is the logic of numbers. We leverage its robust tools to deduce “unique” solutions from known inputs, as in E = MC².
Statistics extends mathematics to address uncertainty with the numbers, including the surrogate numbers from an approximate model. One telling innovation is to represent the uncertainty with an error term, E = MC² + ε.
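The contrast between a deduced answer and one that carries an error term can be sketched in a few lines of Python. This is only an illustration: the Gaussian form of the error and the value of `sigma` are assumptions made for the example, not something the formula itself prescribes.

```python
import random

C = 299_792_458.0  # speed of light in metres per second


def energy(mass_kg: float) -> float:
    """Mathematics: known inputs give one unique answer."""
    return mass_kg * C ** 2


def observed_energy(mass_kg: float, rng: random.Random, sigma: float) -> float:
    """Statistics: the same relationship plus an error term epsilon."""
    epsilon = rng.gauss(0.0, sigma)  # hypothetical Gaussian error (an assumption)
    return energy(mass_kg) + epsilon


rng = random.Random(42)
exact = energy(1.0)
# Five "measurements" of the same quantity; each one differs from the others.
noisy = [observed_energy(1.0, rng, sigma=1e15) for _ in range(5)]
```

Calling `energy(1.0)` always returns the same number, while the entries of `noisy` scatter around it. That scatter is what the error term represents.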
To better grasp statistics and its corresponding assumptions, we turn to jigsaw puzzles to illustrate four common ways in which uncertainty is introduced. First, imagine a mathematics problem … using a jigsaw puzzle (Figures 1-6). With all of the correct pieces, we have “complete information,” and we can deduce one unique solution. For this problem, the solution might be an optimal navigation route into a port.
Figure 2: Inferential uncertainty.
Statistics problems have some source of uncertainty, e.g., inferential uncertainty, missing data, measurement error or surrogate variables. In this sense, the information is not complete, and the solution depends on how you infer from partial information. Also, we use intervals – confidence, prediction and tolerance – to account for the uncertainty. This is the clean split between math and stat: mathematics addresses complete information, and statistics adds innovations to address the uncertainty from missing information, misinformation, disinformation and surrogate information.
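As a rough illustration of how intervals account for uncertainty, the sketch below computes a confidence interval for a mean and a prediction interval for one new observation. It is a minimal example under stated assumptions: the data are made up, the observations are treated as independent and roughly normal, and a normal quantile stands in for the t quantile that a small sample would normally call for. Tolerance intervals are omitted because they need a more involved calculation.

```python
import math
import statistics


def mean_intervals(sample, level=0.95):
    """Return the sample mean with approximate confidence and prediction intervals."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation
    # Normal quantile used as an approximation (assumption for this sketch).
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    ci_half = z * sd / math.sqrt(n)          # uncertainty about the mean
    pi_half = z * sd * math.sqrt(1 + 1 / n)  # uncertainty about one new value
    return {
        "mean": mean,
        "confidence": (mean - ci_half, mean + ci_half),
        "prediction": (mean - pi_half, mean + pi_half),
    }


# Hypothetical widget counts for ten customers.
widgets = [12, 15, 9, 22, 17, 14, 11, 19, 16, 13]
result = mean_intervals(widgets)
```

The prediction interval is always wider than the confidence interval, because it covers both the uncertainty in the estimated mean and the variability of a single new observation.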
By inferential uncertainty, we mean that we are inferring from one set of observations to another. For example, we might infer a solution from data articulating a past layout of the port to a future layout; or we might infer from a known port to a similar, yet unknown one.
A second source of uncertainty comes from missing data. Missing data often have an underlying pattern even if they look “random.”
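A small simulation can show why the pattern behind missing data matters. In this sketch, which uses invented data and a hypothetical missingness rule, records for above-average customers are randomly dropped 30 percent of the time. The gaps look scattered, yet the mean of what remains is biased downward.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical purchases for 100 customers.
purchases = [rng.randint(1, 50) for _ in range(100)]
true_mean = statistics.fmean(purchases)


def is_missing(value: int) -> bool:
    """Hypothetical rule: only above-average purchases can go missing."""
    return value > true_mean and rng.random() < 0.3


observed = [v for v in purchases if not is_missing(v)]
observed_mean = statistics.fmean(observed)
n_missing = len(purchases) - len(observed)
```

Comparing `observed_mean` with `true_mean` shows the bias. An analysis that treats the gaps as purely random would understate typical purchases.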
Figure 3: Random-looking non-random missing data uncertainty.
Figure 4: Another pattern of missing data uncertainty.
A third source of uncertainty is from measurement error. Here, the numbers are poorly measured, yielding only partial information. This situation is common; it is occasionally created by poor data management in the form of rounding or mis-collecting numbers.
Figure 5: Measurement error uncertainty.
Figure 6: Surrogate variable uncertainty.
A fourth source of uncertainty is from surrogate variables, which are selected to be the best alternatives for unavailable variables. This type of uncertainty is ever present. The universe is essentially stochastic. Deterministic models, which are either found through deduction or decreed by definition, are uncommon exceptions. Even if a “true” model exists, we are still exposed to uncertainty from statistics problems like surrogate variables.
These four problems require statistical assumptions and statistical thinking. We define applied statistics as all the tools that we need to address uncertainty with the numbers. The choice of tool cannot nullify the need to understand statistics.
Application Example
Now, let us transition from the visuals of jigsaw puzzles to a set of business applications. Suppose we have last month’s sales for 100 customers (complete information), and we want to find the maximum number of widgets purchased by a customer last month. This is mathematics (with a modest algorithm). We go through all the numbers and deduce a “unique” answer.
The following are solved by statistics (mathematics with statistics assumptions and thinking on top):
- Instead of finding the maximum for last month, we must infer to next month – inferential uncertainty.
- The records for 15 customers are missing – missing data.
- Twenty percent of sales numbers are rounded (or are attributed to the wrong customer) – measurement error.
- We know how many widgets were shipped rather than sold – surrogate variables.
- All four problems at once – usual data analysis.
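The widget example can be sketched in Python to show where deduction ends and inference begins. Everything here is hypothetical: the sales figures are simulated, and the rules used for missing records, rounding and shipments are assumptions chosen only to mirror the cases listed above.

```python
import random

rng = random.Random(2015)

# Complete information: hypothetical widgets sold to each of 100 customers.
sold = [rng.randint(0, 200) for _ in range(100)]

# Mathematics: scan every number and deduce the unique answer.
true_max = max(sold)

# Missing data: 15 records are gone, so the observed maximum is only a lower bound.
kept = rng.sample(range(100), 85)
max_with_missing = max(sold[i] for i in kept)

# Measurement error: 20 percent of the numbers are rounded to the nearest ten.
rounded_ids = set(rng.sample(range(100), 20))
measured = [round(v, -1) if i in rounded_ids else v for i, v in enumerate(sold)]
max_measured = max(measured)

# Surrogate variable: we see shipments, which include some unsold units.
shipped = [v + rng.randint(0, 10) for v in sold]
max_shipped = max(shipped)

# Inferential uncertainty: next month's sales are a different, unseen set of numbers.
next_month = [max(0, v + rng.randint(-20, 20)) for v in sold]
max_next_month = max(next_month)
```

Only `true_max` is deduced from complete information. Each of the other maxima is an estimate whose distance from the quantity of interest has to be reasoned about statistically, ideally with an interval rather than a single number.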
These four statistics problems illustrate that uncertainty with the numbers is part of the problem. The proper approach is to infer based upon statistical assumptions and statistical thinking about the uncertainty. Also, the solution is best estimated using an interval, and we measure the accuracy and reliability of the solution using statistical diagnostics. To distinguish the professional applications of statistics from the amateur and to amplify the excitement that comes with statistics and the false novelty around it, let’s call this “Deep Stat.”
Conclusion
There is a coming flood of statistical malfeasance. Those who want to avoid debacles like those at AIG, Moody’s, Fitch Ratings, Standard & Poor’s, Arbitron, Fannie Mae, et al. should leverage QDR. Furthermore, we recommend a problem-based definition of statistics to avoid the common pitfalls in applying best statistical practice. This definition is helpful in identifying and understanding statistics problems.
The value proposition of statistics is to provide a way to think about and approach problems with uncertain numbers. An understanding of statistics is necessary to properly lead and organize data analysis resources, and any topic involving data analysis involves statistical thinking and statistical assumptions.
We sure could use Deming right now.
Randy Bartlett, Ph.D. (statistics), CAP, PSTAT, is the author of “A Practitioner’s Guide To Business Analytics” and a statistical data scientist with Blue Sigma Analytics. He has more than 20 years of experience providing and performing advanced business analytics. Bartlett can be reached at RandyBartlett@BlueSigmaAnalytics.com. He also hangs out at the LinkedIn group: About Data Analysis. He is a member of INFORMS.
REFERENCES
1. Bartlett, Randy, 2013, “A Practitioner’s Guide To Business Analytics: Using Data Analysis Tools to Improve Your Organization’s Decision Making and Strategy,” McGraw-Hill, ISBN 978-0071807593.