
Analytics Magazine

Best Statistical Practice: Preparing for the coming flood … of statistical malfeasance

May/June 2015

Identifying and understanding statistics problems.

By Randy Bartlett, CAP

“The non-statistician cannot always recognize a statistical problem when he sees one.” — W. Edwards Deming

“The key element for a successful (big) data analytics and data science future is statistical rigor and statistical thinking of humans.”
— Diego Kuonen

For business, the recent growth in fact-based decision-making has provided a path to innovative new products and an escape for companies in disrupted industries. Over the coming years, we should expect a corresponding growth in statistical malfeasance. How large, you may well ask? We cannot be sure; even measuring today’s statistical malfeasance is difficult. In addition to market forces, an important ingredient in this flood of data and analyses is a set of misunderstandings about the purview of statistics, statistical thinking and the underlying statistical assumptions.

In my experience, the best protection from statistical malfeasance is to leverage what I call the three pillars for best statistical practice: statistical qualifications, diagnostics and review (QDR; see “A Practitioner’s Guide To Business Analytics” [1]). The better we understand statistics problems, the better we can identify the best statistical qualifications, interpret the right statistical diagnostics and apply an appropriate statistical review.

Figure 1: Mathematics problem – navigating a port.

We need to use statistical diagnostics to measure the accuracy and reliability of results. Diagnostics are far more important for statistics problems, which, unlike mathematics problems, do not have a unique solution we can deduce. We need statistical review to continuously improve decision-making, data analysis and data management. Again, these three pillars are best facilitated by a savvy understanding of statistics problems.

This article provides a practical and conceptual problem-based definition of statistics, rather than the usual mathematical, probabilistic and algorithmic quicksand that typically dominates the first three introductory college statistics classes. (There are few essay questions in stat classes.) This “mechanical thinking” has led to associating the underlying statistics assumptions with the tools rather than the problems, thus enabling the tragic delusion that we can find some “un-statistics” tool that will allow us to fudge, rather than embrace, both the assumptions and the proper thinking.

Mathematics is the logic of numbers. We leverage its robust tools to deduce “unique” solutions from known inputs, as in E = mc².

Statistics extends mathematics to address uncertainty with the numbers, including the surrogate numbers from an approximate model. One telling innovation is to represent the uncertainty with an error term, E = mc² + ε.

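The split between a deduced answer and an answer carrying an error term can be sketched in a few lines. This is an illustrative toy, not from the article: the model y = 3x stands in for any deterministic relationship, and the Gaussian noise stands in for the error term ε.

```python
import random

random.seed(0)

# Deterministic (mathematics): the output is fully determined by the input.
def deterministic(x):
    return 3.0 * x

# Stochastic (statistics): the same model plus an error term epsilon,
# representing uncertainty the model does not capture.
def stochastic(x):
    return 3.0 * x + random.gauss(0.0, 1.0)

x = 2.0
print(deterministic(x))                              # always 6.0
print([round(stochastic(x), 2) for _ in range(3)])   # scatters around 6.0
```

Repeated calls to the deterministic function always agree; repeated calls to the stochastic one scatter, which is why statistics needs intervals and diagnostics rather than a single deduced number.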
To better grasp statistics and its corresponding assumptions, we turn to jigsaw puzzles to illustrate four common ways in which uncertainty is introduced. First, imagine a mathematics problem … using a jigsaw puzzle (Figures 1-6). With all of the correct pieces, we have “complete information,” and we can deduce one unique solution. For this problem, the solution might be an optimal navigation route into a port.

Figure 2: Inferential uncertainty.

Statistics problems have some source of uncertainty, e.g., inferential uncertainty, missing data, measurement error or surrogate variables. In this sense, the information is not complete, and the solution depends on how you infer from partial information. Also, we use intervals – confidence, prediction and tolerance – to account for the uncertainty. This is the clean split between math and stat: mathematics addresses complete information, and statistics adds innovations to address the uncertainty from missing information, misinformation, disinformation and surrogate information.

By inferential uncertainty, we mean that we are inferring from one set of observations to another. For example, we might infer a solution from data articulating a past layout of the port to a future layout; or we might infer from a known port to a similar, yet unknown one.

A second source of uncertainty comes from missing data. Missing data often have an underlying pattern even if they look “random.”

Figure 3: Random-looking non-random missing data uncertainty.
Figure 4: Another pattern of missing data uncertainty.

A third source of uncertainty is from measurement error. Here, the numbers are poorly measured, rendering partial information. This situation is common; it is occasionally created by poor data management in the form of rounding or mis-collecting numbers.
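The rounding form of measurement error is easy to demonstrate. The exact figures below are invented; coarsening them to the nearest 5 units, as a careless data-entry step might, distorts both the mean and the maximum.

```python
import statistics

# Hypothetical exact sales figures.
exact = [10.4, 11.6, 9.9, 12.1, 10.7]

# The same figures after coarse rounding during data entry
# (rounded to the nearest 5 units).
rounded = [5 * round(x / 5) for x in exact]

print(statistics.mean(exact), statistics.mean(rounded))
print(max(exact), max(rounded))
```

Once the rounding has happened, the lost precision cannot be recovered from the data; it has to be acknowledged as uncertainty.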

Figure 5: Measurement error uncertainty.
Figure 6: Surrogate variable uncertainty.

A fourth source of uncertainty is from surrogate variables, which are selected to be the best alternatives for unavailable variables. This type of uncertainty is ever present. The universe is essentially stochastic. Deterministic models, which are either found through deduction or decreed by definition, are uncommon exceptions. Even if a “true” model exists, we are still exposed to uncertainty from statistics problems like surrogate variables.
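A minimal sketch of surrogate-variable uncertainty, with invented numbers: we want units sold but only observe units shipped. The surrogate tracks the target, yet any analysis built on it inherits the gap between the two series.

```python
import statistics

# Hypothetical: units *sold* (what we want) vs. units *shipped*
# (the surrogate we actually observe).
sold    = [10, 14, 9, 20, 16]
shipped = [12, 15, 9, 23, 18]

# The surrogate is informative but not identical; returns and stock
# movements open a gap between the two series.
gap = [s - d for s, d in zip(shipped, sold)]
print(statistics.mean(gap))  # average over-count from using the surrogate
```

No tool choice removes this gap; it must be reasoned about with statistical assumptions.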

These four problems require statistical assumptions and statistical thinking. We define applied statistics as all the tools that we need to address uncertainty with the numbers. The choice of tool cannot nullify the need to understand statistics.

Application Example

Now, let us transition from the visuals of jigsaw puzzles to a set of business applications. Suppose we have last month’s sales for 100 customers (complete information), and we want to find the maximum number of widgets purchased by a customer last month. This is mathematics (with a modest algorithm). We go through all the numbers and deduce a “unique” answer.

The following are solved by statistics (mathematics with statistics assumptions and thinking on top):

  1. Instead of finding the maximum for last month, we must infer to next month – inferential uncertainty.
  2. The records for 15 customers are missing – missing data.
  3. Twenty percent of sales numbers are rounded (or are attributed to the wrong customer) – measurement error.
  4. We know how many widgets were shipped rather than sold – surrogate variables.
  5. All four problems at once – usual data analysis.
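The contrast between the mathematics version and statistics version 1 of the widgets example can be sketched directly. The sales data are simulated, and the bootstrap here is one simple (assumed, not from the article) way to turn "infer to next month" into an interval rather than a single number.

```python
import random

random.seed(1)

# Hypothetical widget purchases for 100 customers last month.
sales = [random.randint(1, 50) for _ in range(100)]

# Mathematics: with complete information, the maximum is deduced exactly.
observed_max = max(sales)

# Statistics: to say something about another month we must infer. A
# simple bootstrap resamples the data and looks at the spread of the max.
boot_maxes = sorted(
    max(random.choices(sales, k=len(sales))) for _ in range(1000)
)
interval = (boot_maxes[25], boot_maxes[975])  # rough 95% interval

print(observed_max, interval)
```

The deduced maximum is one number; the inferred answer is an interval, which is exactly the shift from mathematics to statistics that the list above describes.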

These four statistics problems illustrate that uncertainty with the numbers is part of the problem. The proper approach is to infer based upon statistical assumptions and statistical thinking about the uncertainty. Also, the solution is best estimated using an interval, and we measure the accuracy and reliability of the solution using statistical diagnostics. To distinguish the professional application of statistics from the amateur, and to separate the genuine excitement of statistics from the false novelty around it, let’s call this “Deep Stat.”


There is a coming flood of statistical malfeasance. Those who want to avoid debacles like those at AIG, Moody’s, Fitch Ratings, Standard & Poor’s, Arbitron, Fannie Mae, et al. should leverage QDR. Furthermore, we recommend a problem-based definition of statistics to avoid the common pitfalls in applying best statistical practice. This definition is helpful in identifying and understanding statistics problems.

The value proposition of statistics is to provide a way to think about and approach problems with uncertain numbers. An understanding of statistics is necessary to properly lead and organize data analysis resources, and any topic involving data analysis involves statistical thinking and statistical assumptions.

We sure could use Deming right now.

Randy Bartlett, Ph.D. (statistics), CAP, PSTAT, is the author of “A Practitioner’s Guide To Business Analytics” and a statistical data scientist with Blue Sigma Analytics. He has more than 20 years of experience providing and performing advanced business analytics. He hangs out at the LinkedIn group About Data Analysis and is a member of INFORMS.


  1. Bartlett, Randy, 2013, “A Practitioner’s Guide To Business Analytics: Using Data Analysis Tools to Improve Your Organization’s Decision Making and Strategy,” McGraw-Hill, ISBN 978-0071807593.
