Distribution Processing and the arithmetic of uncertainty

November/December 2012

By Sam L. Savage

Distribution Processing performs calculations with uncertainties just as if they were ordinary numbers. It can be done in Microsoft Excel without using macros, and it leverages the current generation of simulation tools. ProbabilityMananagement.org is a non-profit organization actively promoting this new arithmetic of uncertainty through educational tools, standards and best practices. This article introduces distribution processing in Excel and describes the ecosystem into which it fits. But first let’s take a detour through medieval Italy to review the adoption of the arithmetic we take for granted today.

What did Fibonacci Bring to Italy?

The famous mathematician, Fibonacci, returned to Italy in 1199 A.D. with absolutely nothing, nada, zilch, zero. After studying with leading Arab mathematicians of the time he also brought the rest of the Hindu-Arabic numbers, 1 through 9. But zero was the important one, as it had no representation in Roman numerals. This made it hard to do arithmetic at all, let alone get your financial statements out on time.

When Fibonacci attempted to promote his new system, the typical response was: “Can’t you see we’re busy? We need to multiply these two Roman numerals together by next March!” So after beating his head against the wall for a couple of years, he published his book, “Liber Abaci,” in 1202, which showed the business community how Arabic arithmetic vastly simplified calculations involving interest, currency conversion, etc. Apparently he didn’t even invent the famous Fibonacci series [1], but merely presented it in his book as a challenge, as in “try doing that with your lousy Roman numerals!”

Ultimately 0 through 9 caught on and the rest is history, as chronicled in a lovely book on zero by Charles Seife [2].

 Figure 1: Woodblock of a “medieval quiz show.”

The book contains a woodblock of a medieval quiz show in which a contestant with Arabic numbers is up against another with an abacus (Figure 1) [3]. Roman numerals were undoubtedly eliminated in the previous round. A Medieval show host, Vanna Bianca, is presenting the questions. Her fond gaze implies something might be going on after the show with Mr. Arithmetic, who has a look of smug concentration on his face. From the frown on the fellow with the abacus, he has apparently just soiled his medieval undergarments. Beyond these artistic touches, the graphic indicates that calculation was not just an academic exercise in 16th century Europe, and that there were still competing methodologies.

The Goals of Probability Management

With this in mind, the broad goal of Probability Management is to do for uncertainty what Fibonacci did for numbers. That is, to promote standards for making rational, auditable calculations based on probability distributions. There are three prevalent ways of calculating with uncertainties today. The first and most widespread is to ignore distributions altogether and replace them with single “average” numbers. This leads to a host of systematic errors, which I refer to collectively as the “Flaw of Averages” [4]. The second approach is to apply classical statistical techniques. This has provided groundbreaking insights in fields as diverse as supply chains and finance, but for the everyday manager it falls into the Roman numeral category. The third is computer simulation, which is the most general. However, there are three barriers to this approach. First, it requires statistically trained experts to generate the distributions of uncertain inputs; second, it requires specialized software to perform the calculations; and third, the results are difficult for management to interpret.

Distribution Processing: Monte Carlo Simulation and Much Less

The first problem is solved by allowing statistical experts to store simulation trials as data in Stochastic Information Packets (SIPs). Because a SIP can contain thousands of potential outcomes instead of a single number it cures the Flaw of Averages. Furthermore, native Excel can do SIP math, replacing numbers with uncertainties, thereby bringing down the second and third barriers. I call this Distribution Processing, and it may be viewed as Monte Carlo simulation without the Monte Carlo. Instead of generating random numbers, the user simply performs SIP math with pre-generated, auditable inputs. If you think of traditional simulation as generating electricity instead of random numbers, then distribution processing is a light bulb for illuminating uncertainty.

 Figure 2: Portion of worksheet created in Excel 2010. Figure 3: Worksheet instantly depicts formula as histogram. Figure 4: Changing the formula instantly produces new histogram.

SIP Math with Data Tables in Microsoft Excel

In the late 1980s, Economist William Sharpe of CAPM (capital asset pricing model) fame used data tables in Lotus 1-2-3 to perform simple Monte Carlo simulations using random numbers generated within the spreadsheet. I have used data tables to teach interactive simulation in spreadsheets since the early 1990s [5]. Back then this would often crash Excel, but it was still an inspirational teaching tool.

I recently re-examined this approach, but using SIPs instead of = RAND(). I was amazed at how fast and robust data tables had become on the latest machines and versions of Excel. As a bonus, Excel 2010 introduced sparklines, tiny charts inspired by Edward Tufte [6] that can represent the shape of a distribution in a single cell. The accompanying sidebar story describes how to do this from scratch in the privacy of your own cubicle. Below I’ll display some results and discuss some implications of this approach. You may follow along by downloading the examples from www.ProbabilityManagement.org.

Sparkland

We’ll start with a file I call Sparkland, because it represents the results of simulations as sparklines. Figure 2 depicts a portion of a worksheet that took only a few minutes to create in Excel 2010 and uses no macros.

The sparklines in cells B3 and C3 are histograms of SIPS of 10,000 trials of independent, uniform, random variables.

Cell B5 is shown in edit mode while the formula =B3+C3 is being entered. The instant the formula is completed, the worksheet appears as shown in Figure 3. The sum of two uniform distributions is triangular; for example there are more ways to get a seven than a two or 12 on a pair of dice.

When the formula in B5 is replaced with =Cos(B3+C3), then you instantly get the histogram in Figure 4, not the type of calculation you can do in your head. Click the undo key and instantly you’re back to Figure 3.

The resulting output SIPs may be interactively analyzed with Excel’s statistical functions, such as AVERAGE, STDEV or PERCENTILE.

Hey Fibonacci, try doing that with your lousy Arabic numerals!

OK, I hear some of you saying, “When would I need to calculate the cosine of a sum of uniforms?” and you have a point. So let’s consider a more practical example, the model that launched a thousand mutual funds: the 1952 mean variance analysis of Harry Markowitz [7, 8].

A Portfolio Model

Figure 5 shows a distribution-processing dashboard based on SIPs of returns of seven financial assets. As you change the weights in your portfolio, a simulation of 1,000 trials is run instantaneously, with the risk, return and distribution of results displayed graphically. A partial view of the stochastic library driving this model appears in Figure 6.

Figure 5: Distribution processing dashboard based on a stochastic library.

Figure 6. Partial view of the stochastic library driving the model.

The primary data element is the SIP, containing pre-generated trials for each asset. Of course, in general there is statistical dependence between the various inputs that must be maintained. So what is required for this model is a Stochastic Library Unit with Relationships Preserved, or a SLURP [9] of all seven assets.

ProbabilityManagement.org promotes formatting standards for such libraries to maximize the potential for sharing and collaboration among statistical experts and end users across diverse fields.

Distributions as Data

 Figure 7: Toggling from Lognormal to Fat Tailed.

The library shown above is based on lognormal distributions of asset returns. The way modern finance is taught, this assumption was apparently welded in at the factory. This is analogous to a formula in Excel having the number “7” embedded in it. Data and formulas should not be mixed! The beauty of distributions as data is that different distributional assumptions may be swapped in and out. If you toggle the Library control from Lognormal to Fat Tailed, 7,000 trials are instantly loaded into the model, and all the results update instantly.

The idea of probability distributions as data is not new. In 1955 the Rand Corporation published a book containing one million random digits [10] (a hit with insomniacs). And data-driven simulation is at the heart of statistical bootstrapping as pioneered by Brad Efron [11]. The Fed runs simulations that generate sample paths of future economic conditions [12]. Numerous large firms and risk management systems have their own internal databases of scenarios for stochastic modeling in many contexts.

A confluence of technologies is now making it not only practical, but indeed necessary to better coordinate this sort of data. Nobel Laureate and board member of ProbabilityManagement.org, Harry Markowitz, puts it this way: “Decision-makers of the future will have access to an ocean of data – some generated by their own organizations, some downloaded from networks – including, but not limited to, historical series collected by governments and private firms, series generated by simulation runs, statistical analyses based on one or more of the forgoing, etc. The standardization of the representation of such information – with associated provenance! – is essential to not clumsily drown in this ocean of data.”

Networked Analytics

Network effects provide economic benefit. A single telephone is worthless, but a network of phones brings increasing value as it grows in size. Add cameras, maps and angry birds, and you get a whole new economy.

In principle any analytical software that manipulates simulation scenarios can be networked through stochastic libraries. Some packages  notably Analytica from Lumina Systems, Crystal Ball from Oracle Corporation, Risk Solver from Frontline Systems and XLSim from Vector Economics – are already compliant by design with the principles of Probability Management. Many others, such as @RISK, JMP,

Tableau and Matlab are a mere import or export away.

Networked analytics will play an important role in such enterprise-wide activities as the stress testing of financial institutions, management of portfolios of R&D projects and in communicating forecasts as distributions rather than single numbers. At ProbabilityMananagement.org we are striving to establish standards to help this network flourish.

Sam L. Savage (savage@stanford.edu) is executive director of ProbabilityManagement.org, author of “The Flaw of Averages” and consulting professor at Stanford University. In 2012, ProbabilityManagement.org was incorporated as a non-profit with board members Harry Markowitz (Nobel Laureate in Economics) and Michael Salama (lead tax council for Walt Disney) with the goal of promoting more consistent and auditable ways of communicating and managing uncertainty. Savage has a long history of bringing analytics to spreadsheets, spanning his design of What’sBest!, which coupled Lotus 1-2-3 to linear programming in 1985, to distribution processing in Excel in 2012. Savage holds a Ph.D. in the area of computational complexity from Yale University. He is a senior member of INFORMS.

SIP math in Excel

You can do everything described below in Excel 2007 except the sparklines used to create the Sparkland worksheet.

The proper way to learn this is to download the tutorial file shown in Figure 8 from www.ProbabilityManagement.org. But if you insist on being told, here goes.

The components are as follows:

1) The stochastic library. This is comprised of one or more columns (or rows) of pre-computed Monte Carlo trials (SIPs). The SIPs in Figure 8 were created with two columns of RAND() formulas, which were then copied and pasted with special values.

2) Index formulas that point into the two SIPs.

3) This cell drives the Index formulas, and is also the Column Input cell of the Data Table.

4) The formula specifying the SIP math. In this case the sum of the SIPs referenced by Index formula cells (C14+D14).

5) The Data Table. The use of this powerful but arcane feature of the spreadsheet is a folk art, passed down in the aural tradition. If you are not familiar with it already, I suggest the tutorial mentioned above or a knowledgeable colleague for guidance.

6) The bins and frequency formulas for the histogram. Frequency is an array formula, another folk art. However, Excel help is pretty useful on this one. Just don’t forget to press <Ctrl><Shift><Enter> while holding down the <N> key with your nose as you enter the formula. OK, I was just kidding about the <N> key, but you need all the others.

So how about Sparkland? Sparklines in Excel can reside in the same cells as formulas. So if you take the above example, and put sparklines of the histograms of the two uniforms in the cells containing the Index formulas (2), and put a sparkline of the output cell’s histogram bins in the formula cell (4), then you get the behavior shown in Figures 3 and 4.

And note, you can replace the Index formulas (2 in Figure 8) with RAND() formulas and Excel no longer blows up. Of course this does not yield the reproducible results obtainable with SIP math.

Figure 8: The tutorial file from www.ProbabilityManagement.org.

REFERENCES

1. The Fibonacci sequence is prescribed by the linear equation {Fn}n=1…?, with Fn = Fn-1 + Fn-2 where F1 = F2 = 1.
2. Charles Seife, “Zero: The Biography of a Dangerous Idea,” Penguin Group, 2000.
3. Gregor Reisch, “Margarita Philosophica,” Strassbourg, 1504.
4. Sam L. Savage, “The Flaw of Averages, Why we Underestimate Risk in the Face of Uncertainty,” John Wiley, 2009, 2012,
5. Sam L. Savage, “Decision Making with Insight,” Cengage Learning, 2003, p. 44.
6. Edward Tufte, “Beautiful Evidence,” Graphics Press, 2004.
7. Markowitz, H.M., “Portfolio Selection,” Journal of Finance, Vol. 7, No. 1, pp. 77–91, March 1952.
8. Markowitz, H. M., “Portfolio Selection: Efficient Diversification of Investments,” second edition, Blackwell Publishers, Inc., Malden, Mass., 1957, 1997.
9. Sam Savage, Stefan Scholtes and Daniel Zweidler, “Probability Management,” OR/MS Today, February 2006, Vol. 33 No. 1.
10. www.rand.org/pubs/monograph_reports/MR1418/index2.html.
11. Efron, B. and Tibshirani, R., “An Introduction to the Bootstrap,” Boca Raton, Fla.: Chapman & Hall, 1993.
12. www.phil.frb.org/research-and-data/real-time-center/PRISM/.

Web-First

Most businesses see GDPR as opportunity to improve data privacy, security

A new study from IBM reveals that nearly 60 percent of organizations surveyed are embracing the General Data Protection Regulation (GDPR) as an opportunity to improve privacy, security, data management or as catalyst for new business models, rather than simply a compliance issue or impediment. Read more →

Challenges facing supply chain execs: leadership, labor, legacy technology

While most companies recognize the value of a digitally enabled supply chain – empowered by new technologies like artificial intelligence, blockchain, big data and analytics – many chief supply chain officers (CSCOs) are not leveraging their C-suite counterparts to help reinvent the supply chain function and transform it into an engine of new growth models and customer experiences, according to new research from Accenture. Read more →

Data Science Bowl: Using AI to accelerate life-saving medical research

Imagine unleashing the power of artificial intelligence to automate a critical component of biomedical research, expediting life-saving research in the treatment of almost every disease from rare disorders to the common cold. This could soon be a reality, thanks to the fourth Data Science Bowl, a 90-day competition in which, for the first time, participants trained deep learning models to examine images of cells and identify nuclei, regardless of the experimental setup – and without human intervention. Read more →

UPCOMING ANALYTICS EVENTS

INFORMS International Conference
June 17-20, 2018, Taipei, Taiwan

INFORMS Annual Meeting
Nov. 4-7, 2018, Phoenix

OTHER EVENTS

July 16-19, noon-5 p.m.

Making Data Science Pay
July 30-31, 12:30 p.m.-5 p.m.

Predictive Analytics: Failure to Launch Webinar
Aug. 18, 11 a.m.

Applied AI & Machine Learning | Comprehensive
Sept. 10-13, 17-20 and 24-25

CAP® EXAM SCHEDULE

CAP® Exam computer-based testing sites are available in 700 locations worldwide. Take the exam close to home and on your schedule: