Analytics Magazine

Five-Minute Analyst: Rainfall and reference years

By Harrison Schramm

This installment comes from a discussion I’ve been having with longtime friend and U.S. Naval Academy classmate Cara Albright. Her problem revolves around determining the “most representative” year of precipitation (rain) data from a large set. The original question – how to incorporate data from years that include “leaps” (i.e., Feb. 29) – started us down an interesting path. This is a fun story about collaboration and thinking about problems.

To make this concrete, consider a graph of two separate years’ raw rainfall data (Figure 1). From this plot, it is unclear what the best method for measuring the “distance” between these two years would be.

Figure 1: Graph of two separate years’ raw rainfall data.


One current approach to this problem is to measure the similarity of the years “pointwise.” Now, those of us who have been alive for a few years (or have seen “The Pirates of Penzance”) know that not every year is the same; most years have 365 days, but roughly a quarter have 366. The usual approaches to dealing with the problematic Feb. 29 are:

  1. Ignore it, thus throwing away ~0.3 percent of the data.
  2. Lump it in with March 1.

Neither of these is particularly satisfactory to us. Instead of trying to measure the distance pointwise – an approach highly sensitive to “breakpoints” at hourly and daily resolution – we propose to measure the difference between cumulative precipitation curves, normalized to 365 days (thus overcoming the leap-year problem).

To measure the difference between years, we follow a simple process of re-normalizing the data to a 365-day “standard year.” We then sum the squared differences between the two years. For those who prefer math over words, we do this:

d(y_1, y_2) \;=\; \int_0^{365} \bigl[\, C_{y_1}(t) - C_{y_2}(t) \,\bigr]^2 \, dt

where $C_y(t)$ denotes year $y$’s cumulative precipitation, re-normalized onto a 365-day standard year.

The reference year is the year with the minimum total (summed) distance to all other years.
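To make the procedure concrete, here is a minimal sketch in Python. The original analysis was done in R; the NumPy-based functions and synthetic data below are our own illustration, not the authors’ code.

```python
import numpy as np

def standardize(daily_rain, n_days=365):
    """Interpolate a year's cumulative rainfall onto an n_days 'standard year'.

    daily_rain: array of daily totals (length 365 or 366).
    Returns the cumulative curve sampled at n_days evenly spaced points,
    which sidesteps the Feb. 29 problem entirely.
    """
    cum = np.cumsum(daily_rain)
    src = np.linspace(0.0, 1.0, len(cum))   # this year's own time axis
    dst = np.linspace(0.0, 1.0, n_days)     # standard-year time axis
    return np.interp(dst, src, cum)

def distance(rain_a, rain_b):
    """Summed squared difference between two standardized cumulative curves."""
    return float(np.sum((standardize(rain_a) - standardize(rain_b)) ** 2))

def reference_year(rain_by_year):
    """The year minimizing total distance to all other years."""
    years = list(rain_by_year)
    return min(years,
               key=lambda y: sum(distance(rain_by_year[y], rain_by_year[j])
                                 for j in years if j != y))
```

Because every year is resampled onto the same grid before comparison, a 366-day year and a 365-day year can be compared directly with no special handling of Feb. 29.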


Our current data set consists of 100 years of rainfall data from Philadelphia, as shown in Figure 2. We determine the “representative year” over an expanding window running from the start of the record to the year chosen; in other words, the 1993 point uses 1989-1993, the 2000 point uses 1989-2000 and so on. Using this “moving right-hand reference” approach, we see the years chosen as depicted in Figure 3.
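The expanding-window selection can be sketched as follows. This is a self-contained Python illustration under our own naming (the article’s analysis was in R); the curves are assumed to already sit on a common 365-day grid.

```python
import numpy as np

def representative_in_window(cum_by_year, last_year, first_year=1989):
    """Pick the representative year among first_year..last_year:
    the one with minimum summed squared distance to the others.

    cum_by_year: dict mapping year -> cumulative rainfall curve,
    where all curves share a common 365-day grid.
    """
    window = [y for y in cum_by_year if first_year <= y <= last_year]

    def total_distance(y):
        return sum(float(np.sum((cum_by_year[y] - cum_by_year[j]) ** 2))
                   for j in window if j != y)

    return min(window, key=total_distance)

# Moving the right-hand endpoint year by year reproduces the
# "moving right-hand reference" series of the article's Figure 3.
```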

Figure 2: Cumulative rainfall over 25 years.


Figure 3: 1973 chosen as the most frequently representative year.


Figure 3 shows 1973 chosen as the most frequently representative year, based on minimum distance and a normalized year length. One might argue – as we ourselves wondered – that this approach favors years whose total rainfall is closest to the average. To overcome this minor difficulty, we may simply normalize the rainfall within each year as well, scaling each year’s total to 1. This “variance only” approach produces the graph shown in Figure 4.
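A minimal Python sketch of this “variance only” normalization (our own illustration; function names are hypothetical): scale each year’s cumulative curve by its final total so that every year ends at 1, and compare timing alone.

```python
import numpy as np

def normalized_shape(daily_rain, n_days=365):
    """Cumulative rainfall on a standard 365-day grid, scaled so the
    year-end total is 1; only the timing of rainfall survives."""
    cum = np.cumsum(daily_rain)
    src = np.linspace(0.0, 1.0, len(cum))
    std = np.interp(np.linspace(0.0, 1.0, n_days), src, cum)
    return std / std[-1]

def shape_distance(rain_a, rain_b):
    """Squared-difference distance between rainfall 'shapes' only;
    two years with identical timing but different totals score ~0."""
    return float(np.sum((normalized_shape(rain_a)
                         - normalized_shape(rain_b)) ** 2))
```

Under this metric a dry year and a wet year with the same seasonal pattern are nearly indistinguishable, which is exactly the point: total rainfall no longer dominates the choice of representative year.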

Figure 4: Variance only approach.


This approach tends to favor 1956 and, later, 1991 as representative years. A plot of these two candidates is shown in Figure 5.

Figure 5: Plot of two candidates.


In conclusion, we have applied a bit more than five minutes’ worth of analysis in this installment. More important than the results themselves is that the basic ideas of calculus and statistics – which we don’t always use day to day in practice – continue to echo far beyond our basic schooling.

Technical note: This analysis made ample use of the R base function approxfun(), which interpolates between values of a given empirical data set. This made numerical integration quite straightforward.
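For readers working outside R, a rough Python analogue of the approxfun()-plus-integration step can look like the following. This is our assumption about the workflow, not the authors’ code: linear interpolation on a fine common grid, then the trapezoidal rule for the integral of the squared difference.

```python
import numpy as np

# Hypothetical cumulative rainfall curves recorded on days 1..365.
days = np.arange(1, 366, dtype=float)
cum_a = np.cumsum(np.full(365, 0.12))            # synthetic: uniform rain
cum_b = np.cumsum(np.linspace(0.0, 0.24, 365))   # synthetic: rain ramps up

# approxfun() in R returns a linear interpolator; np.interp plays the
# same role when both curves are evaluated on a fine common grid.
grid = np.linspace(1.0, 365.0, 10_000)
a = np.interp(grid, days, cum_a)
b = np.interp(grid, days, cum_b)

# Trapezoidal rule for the integral of the squared difference,
# written out by hand to stay portable across NumPy versions.
sq = (a - b) ** 2
dist = float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(grid)))
```

The two synthetic years have the same total rainfall but different timing, so the integrated squared difference is strictly positive, illustrating why the distance responds to shape rather than totals alone.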

Harrison Schramm, CAP, PStat, is a principal operations research analyst at CANA Advisors, LLC, and a member of INFORMS.
