Analytics Magazine

Five-Minute Analyst: Rainfall and reference years

By Harrison Schramm

This installment comes from a discussion I’ve been having with longtime friend and U.S. Naval Academy classmate Cara Albright. Her problem revolves around determining the “most representative” year of precipitation (rain) data from a large set. The original question – how to incorporate data from years that include “leaps” (i.e., Feb. 29) – started us down an interesting path. This is a fun story about collaboration and thinking about problems.

To make this concrete, consider a graph of two separate years’ raw rainfall data (Figure 1). From this plot, it is unclear what the best method for measuring the “distance” between these two years would be.

Figure 1: Graph of two separate years’ raw rainfall data.

One current approach to this problem is to measure the similarity of the years “pointwise.” Now, those of us who have been alive for a few years (or have seen “The Pirates of Penzance”) know that not every year is the same; most years have 365 days, but a quarter of years have 366. The approaches to dealing with the problematic Feb. 29 are:

  1. Ignore it, thus throwing away ~0.3 percent of the data.
  2. Lump it in with March 1.

Neither of these is particularly satisfactory to us. Measuring the distance pointwise, hourly and daily, is highly sensitive to “breakpoints.” Instead, we propose to measure the difference between cumulative precipitation curves, normalized to 365 days, which overcomes the leap-year problem.

To measure the difference between two years, we first re-normalize each year’s data to a 365-day “standard year.” We then sum the squared differences between the two years. For those who prefer math over words, we do this:

[Equation: distance between two years on the 365-day “standard year”]
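One plausible way to write this in symbols, offered as an assumption rather than as the authors’ exact formula: let C_i(t) (our notation) be the cumulative precipitation of year i once its time axis has been rescaled so that t runs from 1 to 365. The distance between years i and j is then

```latex
d(i, j) = \sum_{t=1}^{365} \bigl( C_i(t) - C_j(t) \bigr)^2
```

Replacing the sum with an integral over the standard year gives the continuous version of the same idea.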

The year with the smallest summed distance to all other years is the “reference” year.
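The selection rule can be sketched in a few lines of Python. This is an illustration, not the authors’ code (their analysis was done in R). It assumes each year’s cumulative rainfall has already been sampled on a common 365-point grid; the function name `reference_year` and the dictionary input format are hypothetical.

```python
import numpy as np

def reference_year(curves):
    """Return the year whose summed squared distance to all other years is smallest.

    `curves` maps a year label to a 1-D array of cumulative rainfall sampled
    on a common "standard year" grid (an assumed input format).
    """
    years = sorted(curves)
    stack = np.array([curves[y] for y in years], dtype=float)  # (n_years, n_points)
    # Pairwise distance: squared differences summed over the grid.
    diffs = stack[:, None, :] - stack[None, :, :]
    dist = (diffs ** 2).sum(axis=2)                            # (n_years, n_years)
    # The diagonal is zero, so each row sum covers only the other years.
    totals = dist.sum(axis=1)
    return years[int(np.argmin(totals))], dict(zip(years, totals))
```

With three toy years whose curves rise linearly to totals of 10, 12 and 20, the middle curve is chosen because it sits closest to both of the others.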


Our current data set consists of 100 years of rainfall data from Philadelphia, as shown in Figure 2. We determine the “representative year” starting in 1965 to the year chosen; in other words, the 1993 point is 1989-1993, 2000 is 1989-2000 and so on. Using this “moving right-hand reference approach,” we see the years chosen as depicted in Figure 3.
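A minimal sketch of the “moving right-hand reference” idea follows, assuming the pairwise distances between years have already been computed. The left edge of the window stays fixed while the right edge advances one year at a time. The function name `moving_reference` and the `start_year` parameter are hypothetical.

```python
import numpy as np

def moving_reference(years, dist, start_year):
    """For each end year, pick the reference year among start_year..end_year.

    `years` is a sorted list of year labels and `dist` the matching square
    matrix of pairwise distances (both assumed to be precomputed).
    """
    years = list(years)
    dist = np.asarray(dist, dtype=float)
    first = years.index(start_year)
    chosen = {}
    for last in range(first + 1, len(years)):      # need at least two years to compare
        window = slice(first, last + 1)
        totals = dist[window, window].sum(axis=1)  # summed distance inside the window only
        chosen[years[last]] = years[first + int(np.argmin(totals))]
    return chosen
```

Each entry of the result pairs an end year with the year chosen for that window, which is the kind of series a plot like Figure 3 would show.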

Figure 2: Cumulative rainfall over 25 years.

Figure 3: 1973 chosen as the most frequently representative year.

This approach, based on minimum distance and a normalized year length, chooses 1973 as the most frequently representative year. One might argue, as we did, that it tends to favor years whose total rainfall is closest to average. To overcome this minor difficulty, we may simply normalize the rainfall over the year as well, scaling the total for the year to 1. This “variance only” approach produces the graph shown in Figure 4.
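Scaling each year’s total to 1 amounts to dividing the cumulative curve by its final value, so that only the timing of the rain within the year remains. A small sketch, with the hypothetical helper name `shape_only`:

```python
import numpy as np

def shape_only(cumulative):
    """Scale a cumulative rainfall curve so its year-end total equals 1."""
    cumulative = np.asarray(cumulative, dtype=float)
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("year has no recorded rainfall; cannot normalize")
    return cumulative / total
```

Two years with the same seasonal pattern but different totals map to the same curve, so the distance between them becomes zero under this scaling.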

Figure 4: Variance only approach.

The variance-only approach tends to favor 1956 and, later, 1991 as representative years. A plot of these two candidates is shown in Figure 5:

Figure 5: Plot of two candidates.

In conclusion, we have applied a bit more than five minutes’ worth of analysis in this installment. What matters more than the results is that the basic ideas of calculus and statistics, which we don’t always use in day-to-day work, continue to echo in practice far beyond our basic schooling.

Technical note: This analysis made ample use of the R base function approxfun(), which interpolates between values of a given empirical data set. This made numerical integration quite straightforward.
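For readers working outside R, a rough Python analogue can be built on `numpy.interp`, which interpolates linearly between known points, as `approxfun()` does by default. The sketch below is an assumption about how the resampling could be done, not a port of the original analysis. It rescales a 365- or 366-day year onto a common 365-point grid; the name `to_standard_year` is hypothetical.

```python
import numpy as np

def to_standard_year(daily_rain, n_points=365):
    """Resample one year's cumulative rainfall onto a fixed-length grid.

    `daily_rain` holds 365 or 366 daily totals. Time is rescaled to the
    interval [0, 1] so that leap years and common years share one axis.
    """
    daily_rain = np.asarray(daily_rain, dtype=float)
    n_days = daily_rain.size
    # Cumulative total at the end of day 0, 1, ..., n_days.
    cumulative = np.concatenate([[0.0], np.cumsum(daily_rain)])
    source_t = np.linspace(0.0, 1.0, n_days + 1)
    target_t = np.linspace(0.0, 1.0, n_points + 1)[1:]  # end of each standard day
    return np.interp(target_t, source_t, cumulative)    # linear interpolation
```

A leap year comes out with 365 values whose last entry is still the full annual total, so no rainfall is discarded and Feb. 29 needs no special handling.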

Harrison Schramm, CAP, PStat, is a principal operations research analyst at CANA Advisors, LLC, and a member of INFORMS.
