Share with your friends


Analytics Magazine

Five-Minute Analyst: Rainfall and reference years

Harrison SchrammBy Harrison Schramm

This installment comes from a discussion I’ve been having with longtime friend and U.S. Naval Academy classmate Cara Albright. Her problem revolves around determining the “most representative” year of precipitation (rain) data from a large set. The original question – how to incorporate data from years that include “leaps” (i.e., Feb. 29) – started us down an interesting path. This is a fun story about collaboration and thinking about problems.

To make this concrete, consider a graph of two separate year’s raw rainfall data (Figure 1). From this plot, it is unclear what the best method for measuring the “distance” between these two years would be.

Figure 1: Graph of two separate year’s raw rainfall data.

Figure 1: Graph of two separate year’s raw rainfall data.

One current approach to this problem is to measure the similarity of the years “pointwise.” Now, those of us who have been alive for a few years (or have seen “The Pirates of Penzance”) know that not every year is the same; most years have 365 days, but a quarter of years have 366. The approaches to dealing with the problematic Feb. 29 are:

  1. Ignore it, thus throwing away ~.3 percent of the data.
  2. Lump it in with March 1.

Neither of these are particularly satisfactory to us. Instead of trying to measure the distance pointwise – which is highly sensitive to “breakpoints” – hourly and daily, we propose to measure the difference between cumulative precipitation, normalized to 365 days (and thus overcoming the leap year problem).

To measure the difference between years, we follow a simple process of re-normalizing the data to a 365-day “standard year.” We then sum the squared differences between the two years. For those who prefer math over words, we do this:

simple process of re-normalizing the data to a 365-day “standard year”

The year with the minimum distance, as determined by the minimum (summed) distance over all other years, is the “reference” year.


Our current data set consists of 100 years of rainfall data from Philadelphia, as shown in Figure 2. We determine the “representative year” starting in 1965 to the year chosen; in other words, the 1993 point is 1989-1993, 2000 is 1989-2000 and so on. Using this “moving right-hand reference approach,” we see the years chosen as depicted in Figure 3.

Figure 2: Cumulative rainfall over 25 years.

Figure 2: Cumulative rainfall over 25 years.

Figure 3: 1973 chosen as the most frequently representative year.

Figure 3: 1973 chosen as the most frequently representative year.

With 1973 chosen as the most frequently representative year, based on minimum distance and a normalized year length. One “might” argue, as we thought, that this approach tends to favor years that have the total rainfall that is closest to average. To overcome this minor difficulty, we may simply normalize the rainfall over the year as well, scaling the total for the year to 1. This “variance only” approach produces the graph shown in Figure 4.

Figure 4: Variance only approach.

Figure 4: Variance only approach.

Which tends to favor 1956 and, later, 1991 as representative years. A plot of these two candidates is shown in Figure 5:

Figure 5: Plot of two candidates.

Figure 5: Plot of two candidates.

In conclusion, we have applied a few more than five minutes worth of analysis this installment. What is more important than the results is that the basic ideas of calculus and statistics, which we don’t always use every day in practice, continue to echo in practice far beyond our basic schooling.

Technical note: This analysis made ample use of the R base function approxfun(), which interpolates between values of a given empirical data set. This made numerical integration quite straightforward.

Harrison Schramm (, CAP, PStat, is a principal operations research analyst at CANA Advisors, LLC, and a member of INFORMS.

Analytics data science news articles

Related Posts

  • 64
    With the rise of big data – and the processes and tools related to utilizing and managing large data sets – organizations are recognizing the value of data as a critical business asset to identify trends, patterns and preferences to drive improved customer experiences and competitive advantage. The problem is,…
    Tags: data
  • 58
    The Internet of Things (IoT) is considered to be the next revolution that touches every part of our daily life, from restocking ice cream to warning of pollutants. Analytics professionals understand the importance of data, especially in a complicated field such as healthcare. This article offers a framework on integrating…
    Tags: data
  • 50
    It’s long been popular to talk about customer interaction data such as clickstream, social activity, inbound email and call center verbatims as “unstructured data.” Wikipedia says of the term that it “…refers to information that either does not have a pre-defined data model or is not organized in a pre-defined…
    Tags: data
  • 42
    Today, we live in a digital society. Our distinct footprints are in every interaction we make. Data generation is a default – be it from enterprise operational systems, logs from web servers, other applications, social interactions and transactions, research initiatives and connected things (Internet of Things). In fact, according to…
    Tags: data
  • 41
    Many organizations have noticed that the data they own and how they use it can make them different than others to innovate, to compete better and to stay in business. That’s why organizations try to collect and process as much data as possible, transform it into meaningful information with data-driven…
    Tags: data

Analytics Blog

Electoral College put to the math test

With the campaign two months behind us and the inauguration of Donald Trump two days away, isn’t it time to put the 2016 U.S. presidential election to bed and focus on issues that have yet to be decided? Of course not.


Artificial intelligence a game changer for personal devices

Emotion artificial intelligence (AI) systems are becoming so sophisticated that Gartner, Inc. predicts that by 2022, personal devices will know more about an individual’s emotional state than his or her own family. AI is generating multiple disruptive forces that are reshaping the way we interact with personal technologies. Read more →

FICO predicts AI and blockchain will meet in 2018

Scott Zoldi, chief analytics officer at FICO, in his AI predictions for 2018, predicts the rise of “defensive AI” and manipulative chatbots. Blockchain will use AI to search through relationship data, Zoldi says, and defensive AI will be used to protect systems from malicious AI and machine learning. Adds Zoldi: Chatbots will get so good at understanding us they will learn how to manipulate us. Read more →



2018 INFORMS Conference on Business Analytics and Operations Research
April 15-17, 2018, Baltimore


CAP® Exam computer-based testing sites are available in 700 locations worldwide. Take the exam close to home and on your schedule:

For more information, go to