

Analytics Magazine

Best Statistical Practice: Preparing for the coming flood … of statistical malfeasance

May/June 2015

Identifying and understanding statistics problems.

By Randy Bartlett, CAP

“The non-statistician cannot always recognize a statistical problem when he sees one.” — W. Edwards Deming

“The key element for a successful (big) data analytics and data science future is statistical rigor and statistical thinking of humans.”
— Diego Kuonen

For business, the recent growth in fact-based decision-making has provided a path to innovative new products and an escape for companies in disrupted industries. Over the coming years, we should expect a corresponding growth in statistical malfeasance. How large, you may well ask. We cannot be sure; even measuring today’s statistical malfeasance is difficult. In addition to market forces, an important ingredient in this flood of data and analyses is a number of misunderstandings about the purview of statistics, statistical thinking and the underlying statistical assumptions.

In my experience, the best protection from statistical malfeasance is to leverage what I call the three pillars for best statistical practice: statistical qualifications, diagnostics and review (QDR; see “A Practitioner’s Guide To Business Analytics” [1]). The better we understand statistics problems, the better we can identify the best statistical qualifications, interpret the right statistical diagnostics and apply an appropriate statistical review.

Figure 1: Mathematics problem – navigating a port.

We need to use statistical diagnostics to measure the accuracy and reliability of results. Diagnostics are far more important for statistics problems, which, unlike mathematics problems, do not have a unique solution that can be deduced. We need statistical review to continuously improve decision-making, data analysis and data management. Again, these three pillars are best facilitated using a savvy understanding of statistics problems.

This article provides a practical and conceptual problem-based definition of statistics, rather than the usual mathematical, probabilistic and algorithmic quicksand, which typically dominates the first three introductory college statistics classes. (There are few essay questions in stat classes.) This “mechanical thinking” has led to associating the underlying statistics assumptions with the tools rather than the problems, thus enabling the tragic delusion that we can find some “un-statistics” tool that will allow us to fudge, rather than embrace, both the assumptions and the proper thinking.

Mathematics is the logic of numbers. We leverage its robust tools to deduce “unique” solutions from known inputs, as in E = MC².

Statistics extends mathematics to address uncertainty with the numbers, including the surrogate numbers from an approximate model. One telling innovation is to represent the uncertainty with an error term, E = MC² + ε.

To better grasp statistics and its corresponding assumptions, we turn to jigsaw puzzles to illustrate four common ways in which uncertainty is introduced. First, imagine a mathematics problem … using a jigsaw puzzle (Figures 1-6). With all of the correct pieces, we have “complete information,” and we can deduce one unique solution. For this problem, the solution might be an optimal navigation route into a port.

Figure 2: Inferential uncertainty.

Statistics problems have some source of uncertainty, e.g., inferential uncertainty, missing data, measurement error or surrogate variables. In this sense, the information is not complete, and the solution depends on how you infer from partial information. Also, we use intervals – confidence, prediction and tolerance – to account for the uncertainty. This is the clean split between math and stat: mathematics addresses complete information, and statistics adds innovations to address the uncertainty from missing information, misinformation, disinformation and surrogate information.
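To make the role of intervals concrete, here is a minimal sketch that contrasts a confidence interval for a mean with a prediction interval for one new observation. The data, the function name `normal_intervals` and the use of a normal approximation are all assumptions made for illustration; a small sample like this one would normally call for a t-based interval, and tolerance intervals are not covered.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_intervals(sample, level=0.95):
    """Return (confidence interval for the mean, prediction interval for one new value).

    Both intervals rely on a normal approximation, an assumption of this sketch.
    """
    n = len(sample)
    m = mean(sample)
    s = stdev(sample)  # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    ci_half = z * s / sqrt(n)          # uncertainty about the mean only
    pi_half = z * s * sqrt(1 + 1 / n)  # adds the spread of a single new value
    return (m - ci_half, m + ci_half), (m - pi_half, m + pi_half)


# Hypothetical widget counts for ten customers.
widgets = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]
ci, pi = normal_intervals(widgets)
print("confidence interval for the mean:", ci)
print("prediction interval for a new customer:", pi)
```

The prediction interval is always the wider of the two, because it has to cover the variation of an individual customer as well as the uncertainty about the mean.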

By inferential uncertainty, we mean that we are inferring from one set of observations to another. For example, we might infer a solution from data articulating a past layout of the port to a future layout; or we might infer from a known port to a similar, yet unknown one.

A second source of uncertainty comes from missing data. Missing data often have an underlying pattern even if they look “random.”

Figure 3: Random-looking non-random missing data uncertainty.
Figure 4: Another pattern of missing data uncertainty.
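A small sketch can show why a pattern in the missing data matters. The numbers below are hypothetical, and the missingness rule (the largest customers are the ones without records) is an assumption chosen only to make the effect visible.

```python
from statistics import mean

# Hypothetical monthly widget purchases for ten customers.
sales = [2, 3, 3, 4, 5, 6, 8, 12, 20, 37]

# Assumed pattern: records for the largest customers (12 or more widgets)
# are the ones that are missing, e.g. because they sit in another system.
observed = [s for s in sales if s < 12]

full_mean = mean(sales)         # 10, from the complete records
observed_mean = mean(observed)  # about 4.43, from the surviving records

print("mean with complete data:", full_mean)
print("mean with patterned missing data:", observed_mean)
print("maximum with complete data:", max(sales))
print("maximum with patterned missing data:", max(observed))
```

Treating the surviving records as if they were the whole picture understates both the average and the maximum, which is why the pattern behind the missing data needs to be understood before any summary is trusted.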

A third source of uncertainty is from measurement error. Here, the numbers are poorly measured, rendering only partial information. This situation is common; it is occasionally created by poor data management in the form of rounding or mis-collecting numbers.

Figure 5: Measurement error uncertainty.

Figure 6: Surrogate variable uncertainty.

A fourth source of uncertainty is from surrogate variables, which are selected to be the best alternatives for unavailable variables. This type of uncertainty is ever present. The universe is essentially stochastic. Deterministic models, which are either found through deduction or decreed by definition, are uncommon exceptions. Even if a “true” model exists, we are still exposed to uncertainty from statistics problems like surrogate variables.

These four problems require statistical assumptions and statistical thinking. We define applied statistics as all the tools that we need to address uncertainty with the numbers. The choice of tool cannot nullify the need to understand statistics.

Application Example

Now, let us transition from the visuals of jigsaw puzzles to a set of business applications. Suppose we have last month’s sales for 100 customers (complete information), and we want to find the maximum number of widgets purchased by a customer last month. This is mathematics (with a modest algorithm). We go through all the numbers and deduce a “unique” answer.

The following are solved by statistics (mathematics with statistics assumptions and thinking on top):

  1. Instead of finding the maximum for last month, we must infer to next month – inferential uncertainty.
  2. The records for 15 customers are missing – missing data.
  3. Twenty percent of sales numbers are rounded (or are attributed to the wrong customer) – measurement error.
  4. We know how many widgets were shipped rather than sold – surrogate variables.
  5. All four problems at once – usual data analysis.
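Two of the cases above can be sketched in a few lines. Everything here is hypothetical: the simulated sales figures, the fixed seed, dropping the 15 records at random and rounding to the nearest 10 are assumptions made for illustration, and the inferential and surrogate-variable cases are left out.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical data: widgets purchased last month by each of 100 customers.
sales = [random.randint(0, 50) for _ in range(100)]

# Mathematics: with complete information the maximum is deduced, not estimated.
true_max = max(sales)

# Missing data: 15 records are unavailable. They are dropped at random here,
# although real missingness often follows a pattern.
missing = set(random.sample(range(100), 15))
observed = [s for i, s in enumerate(sales) if i not in missing]
lower_bound = max(observed)  # the true maximum can only be this or higher

# Measurement error: 20 percent of the numbers were rounded to the nearest 10.
rounded_ids = set(random.sample(range(100), 20))
recorded = [10 * round(s / 10) if i in rounded_ids else s
            for i, s in enumerate(sales)]
recorded_max = max(recorded)
# Rounding moves a value by at most 5, so the true maximum is within 5
# of the recorded maximum.
max_interval = (recorded_max - 5, recorded_max + 5)

print("deduced maximum:", true_max)
print("lower bound with missing records:", lower_bound)
print("interval with rounded records:", max_interval)
```

In both cases the honest answer is no longer a single number but a bound or an interval, and its width depends on what we are willing to assume about how the information was lost.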

These four statistics problems illustrate that uncertainty with the numbers is part of the problem. The proper approach is to infer based upon statistical assumptions and statistical thinking about the uncertainty. Also, the solution is best estimated using an interval, and we measure the accuracy and reliability of the solution using statistical diagnostics. To distinguish the professional applications of statistics from the amateur and to amplify the excitement that comes with statistics and the false novelty around it, let’s call this “Deep Stat.”


There is a coming flood of statistical malfeasance. Those who want to avoid debacles like those at AIG, Moody’s, Fitch Ratings, Standard & Poor’s, Arbitron, Fannie Mae, et al. should leverage QDR. Furthermore, we recommend a problem-based definition of statistics to avoid the common pitfalls in applying best statistical practice. This definition is helpful in identifying and understanding statistics problems.

The value proposition of statistics is to provide a way to think about and approach problems with uncertain numbers. An understanding of statistics is necessary to properly lead and organize data analysis resources, and any topic involving data analysis involves statistical thinking and statistical assumptions.

We sure could use Deming right now.

Randy Bartlett, Ph.D. (statistics), CAP, PSTAT, is the author of “A Practitioner’s Guide To Business Analytics” and a statistical data scientist with Blue Sigma Analytics. He has more than 20 years of experience providing and performing advanced business analytics. Bartlett can be reached at He also hangs out at the LinkedIn group: About Data Analysis. He is a member of INFORMS.


  1. Bartlett, Randy, 2013, “A Practitioner’s Guide To Business Analytics: Using Data Analysis Tools to Improve Your Organization’s Decision Making and Strategy,” McGraw-Hill, ISBN 978-0071807593.
