Analytics Magazine

Five-Minute Analyst: Rainfall and reference years

By Harrison Schramm

This installment comes from a discussion I’ve been having with longtime friend and U.S. Naval Academy classmate Cara Albright. Her problem revolves around determining the “most representative” year of precipitation (rain) data from a large set. The original question – how to incorporate data from years that include “leaps” (i.e., Feb. 29) – started us down an interesting path. This is a fun story about collaboration and thinking about problems.

To make this concrete, consider a graph of two separate years’ raw rainfall data (Figure 1). From this plot, it is unclear what the best method for measuring the “distance” between these two years would be.

Figure 1: Graph of two separate years’ raw rainfall data.

One current approach to this problem is to measure the similarity of the years “pointwise.” Now, those of us who have been alive for a few years (or have seen “The Pirates of Penzance”) know that not every year is the same; most years have 365 days, but about one in four has 366. The usual approaches to dealing with the problematic Feb. 29 are:

  1. Ignore it, thus throwing away ~0.3 percent of the data.
  2. Lump it in with March 1.

Neither of these is particularly satisfactory to us. Measuring the distance pointwise, whether hourly or daily, is highly sensitive to “breakpoints.” Instead, we propose to measure the difference between cumulative precipitation curves, normalized to 365 days (thus overcoming the leap year problem).

To measure the difference between years, we follow a simple process of re-normalizing the data to a 365-day “standard year.” We then sum the squared differences between the two years. For those who prefer math over words, we do this:

[Equation: summed squared difference between two years’ cumulative rainfall on the 365-day “standard year”]
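One plausible way to write the description above in symbols is sketched here. It is a reconstruction from the prose, not necessarily the authors' exact equation; the symbols $C_i$, $d$, $D$ and $i^*$ are our own labels, and the authors may equally have used an integral in place of the sum. Here $C_i(k)$ denotes the cumulative rainfall of year $i$ at day $k$ of the 365-day standard year.

```latex
d(i,j) = \sum_{k=1}^{365} \bigl[ C_i(k) - C_j(k) \bigr]^2,
\qquad
D(i) = \sum_{j \neq i} d(i,j),
\qquad
i^{*} = \arg\min_{i} D(i)
```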

The year whose summed distance to all of the other years is smallest is the “reference” year.
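The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code (which used R): the function names are hypothetical, the input is assumed to be a list of 365 or 366 daily totals per year, and plain linear interpolation stands in for whatever resampling the authors applied.

```python
from itertools import accumulate


def standard_year_curve(daily, n_points=365):
    """Cumulative rainfall resampled onto a fixed grid of n_points.

    `daily` holds 365 or 366 daily totals (an assumed input layout). Time is
    rescaled so that leap years and common years share the same axis.
    """
    cum = [0.0] + list(accumulate(daily))  # cum[k] = rain through day k
    n = len(daily)
    curve = []
    for i in range(1, n_points + 1):
        pos = i / n_points * n             # fractional day index in this year
        lo = min(int(pos), n - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring daily values.
        curve.append(cum[lo] + frac * (cum[lo + 1] - cum[lo]))
    return curve


def distance(curve_a, curve_b):
    """Sum of squared differences between two standard-year curves."""
    return sum((a - b) ** 2 for a, b in zip(curve_a, curve_b))


def reference_year(rain_by_year):
    """Year whose summed distance to every other year is smallest."""
    curves = {y: standard_year_curve(d) for y, d in rain_by_year.items()}
    totals = {
        y: sum(distance(c, other) for z, other in curves.items() if z != y)
        for y, c in curves.items()
    }
    return min(totals, key=totals.get)
```

Because every year is mapped onto the same 365-point grid before any comparison, Feb. 29 needs no special handling: a leap year is simply compressed slightly along the time axis.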


Our current data set consists of 100 years of rainfall data from Philadelphia, as shown in Figure 2. We determine the “representative year” starting in 1965 to the year chosen; in other words, the 1993 point is 1989-1993, 2000 is 1989-2000 and so on. Using this “moving right-hand reference approach,” we see the years chosen as depicted in Figure 3.
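The “moving right-hand reference approach” can be sketched as an expanding window: the left edge stays fixed at the start year and the right edge advances one year at a time, with the reference year recomputed at each step. The function name, the `dist` lookup keyed by year pairs, and the `min_window` parameter are all hypothetical; the pairwise distances are assumed to have been computed beforehand.

```python
def moving_reference(years, dist, min_window=2):
    """Reference year for each expanding window years[0..end].

    `years` is a sorted list of years; `dist[(a, b)]` is a symmetric pairwise
    distance supplied by the caller (an assumed data layout).
    Returns a dict mapping each right-hand end year to the year chosen.
    """
    chosen = {}
    for end in range(min_window, len(years) + 1):
        window = years[:end]
        # Summed distance from each candidate to the rest of the window.
        totals = {
            a: sum(dist[(a, b)] for b in window if b != a) for a in window
        }
        chosen[window[-1]] = min(totals, key=totals.get)
    return chosen
```

Plotting the returned values against their end years gives the kind of chart the text describes, where the chosen year can change as more data enter the window.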

Figure 2: Cumulative rainfall over 25 years.

Figure 3: 1973 chosen as the most frequently representative year.

Based on minimum distance and a normalized year length, 1973 is chosen as the most frequently representative year. One might argue, as we did, that this approach tends to favor years whose total rainfall is closest to average. To overcome this minor difficulty, we may simply normalize the rainfall over the year as well, scaling the total for the year to 1. This “variance only” approach produces the graph shown in Figure 4.
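Scaling each year's total to 1 amounts to dividing a cumulative curve by its final value, so that only the timing of the rain within the year remains. A minimal sketch, with a hypothetical function name and assuming the curve is a list of cumulative totals ending at the annual total:

```python
def unit_total_curve(curve):
    """Rescale a cumulative rainfall curve so that it ends at 1."""
    total = curve[-1]
    if total <= 0:
        raise ValueError("year has no recorded rainfall")
    return [c / total for c in curve]
```

Under this scaling, a wet year and a dry year whose rain arrived in the same proportions over the calendar have identical curves and therefore zero distance between them.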

Figure 4: Variance only approach.

The “variance only” approach tends to favor 1956 and, later, 1991 as representative years. A plot of these two candidates is shown in Figure 5.

Figure 5: Plot of two candidates.

In conclusion, we have applied a bit more than five minutes’ worth of analysis in this installment. More important than the results is that the basic ideas of calculus and statistics, which we don’t always use every day, continue to echo in practice far beyond our basic schooling.

Technical note: This analysis made ample use of the R base function approxfun(), which interpolates between values of a given empirical data set. This made numerical integration quite straightforward.

Harrison Schramm, CAP, PStat, is a principal operations research analyst at CANA Advisors, LLC, and a member of INFORMS.
