The origin, detection, treatment and consequences of missing values in analytics.
By Gerhard Svolba
Missing values – and how to deal with them – is an inevitable problem for statisticians, data miners or anyone working with analytical data. Missing values in the data create uncertainty for the analyst and the information consumer because decisions need to be made without having the full picture. Missing values should trigger a discussion about randomness and systematic patterns, as they might introduce more fuzziness and/or bias into the picture.
Missing values can also reduce the number of usable records for the analysis, or force analysts to eliminate variables from the analysis. This happens for a technical reason, since many analytical methods such as regression techniques, neural networks or cluster algorithms cannot deal with missing values, as they require numeric values to be used in a mathematical equation. Consequently, if an observation has a missing value in any of the required variables, the whole observation (data record) needs to be omitted from the analysis. Other options would be to exclude it from the analysis variable as a whole or to insert imputation values for the missing data points. (However, in descriptive statistics or with decision trees, missing values can simply be represented as a separate category.)
Along with the technical consideration of what to do with missing values, it is also important, from a business point of view, to decide whether a sufficient number of records remain for the analysis to produce meaningful results. The following example illustrates how the true distribution can become distorted when the source of missing values is not identified properly.
The Story of My Aunt Susanne
(Or missing values in the “age variable” in the customer database of a telephone provider)
My Aunt Susanne purchased her phone in the mid-60s. Her date of birth was not collected at that time as the philosophy of “know your customers” and the need for customer data was nowhere near as vital then as it is today. Things changed in the 1990s with the deregulation of the telecommunications market; suddenly, the analysis of customer behavior became important. Since then, it has become mandatory for customers to provide their date of birth on a new contract or with a contract change. My aunt, however, never changed her contract type or answered any customer questionnaire. Thus, the field “date of birth” is missing in the customer database of her phone provider, and we can assume she is not the only customer with a missing value.
If an analyst now looks at the distribution of variable “age” in this customer database, he might get a histogram as shown in Figure 1. Additionally, he will see that he has 9.1 percent missing values. The question is how to treat these missing values.
- Shall the mean be used as imputation value?
- Shall different imputation values be sampled from the actual distribution?
Figure 1:Distribution of variable “age” in a customer database.
In our case, we can assume that the true age value for Aunt Susanne and her friends is not distributed over the whole range of values. After a certain year it was mandatory to provide the date of birth with new contracts. So the missing values will mostly occur for a certain age segment (the older customers) and probably also for a certain behavior segment (those who did not change their contract type).
In the Figure 2 histogram, the true distribution of the unknown age values is shown in red. We realize that we would make a wrong assumption when we treat the missing values as random, as we found out that there is a systematic pattern behind them. In order to qualify such a situation correctly, business and process knowledge is needed. This know-how is also important to formulate an adequate imputation rule as the imputation values should be from the interval 55 to 95 years rather than covering the whole age range.
Figure 2: True distribution of unknown age values shown in red.
Calculating just the proportion of missing values per variable does not really help to uncover such situations. In this case we would just have seen 9.1 percent missing in the age variable. Such an analysis only tells us which characteristics are most commonly infected by the “missing value disease” in the data. For our purpose, we need to find a way that uncovers the relationship of the missing values to other variables or features of the customer.
One method is to create an indicator variable “age missing YES/NO” and compare the distribution of other variables between these two groups. So we might see that the missing age values occur with “old contract types” or have a specific phone behavior (Aunt Susanne is not making international phone calls or having data traffic, she is just phoning her friends from time to time).
Another method that can efficiently be used to uncover the structure of the missing values is to analyze the “missing value patterns” and show these patterns in tile charts. For each record in the data a string of 0s and 1s is created, “1” indicates a missing value for the respective variable, “0” otherwise. The first character of this string could stand for variable AGE, the next for variable GENDER, the third for variable DATA TRAFFIC. Thus a string of “101” would indicate that for this record the variables AGE and DATA TRAFFIC are missing and the variable GENDER is not missing.
With such a representation (Figure 3), it can be seen at one glance that about 60 percent of the records don’t have a single missing value (pattern 000000000000), and that another 30 percent of the records have a missing value in only one of the variables (light blue). The little red cells show groups of records, where already five or more variables are missing. This information is important to decide whether missing values shall be imputed by analytical methods or not.
Figure 3: Representation of missing records.
Such a representation method is well suited to detect patterns of missing values in the variables. It provides an answer to the question: Which missing values occur together and helps to define segments in the data? In our case we would probably find a segment “Aunt Susanne and her friends.” Customers in this segment should be treated differently in marketing actions and probably have demand for specific hardware (phone with large keys, simple usage, etc.). Or they may need special assistance through the customer care hotline.
Systematic Missing Values Really Matter
From the above example we learned that systematic patterns in the occurrence of missing values really matter. In order to illustrate the quantitative effect of random or systematic missing values on model quality, simulation studies have been performed. For scenarios with varying proportions of missing values and different types of missing values, the values have been set to “missing” in the analysis data. As a next step the missing data have been imputed with the mean before they were used in the predictive model.
From the results shown in Figure 4, there is clearly a remarkable difference in model quality when dealing with different types of missing values; the blue and green lines represent the scenarios with random missing values and lay higher than their counterparts from the scenarios with systematic missing values (red and brown lines).
Figure 4: X-axis represents the increasing proportion of missing values, Y-axis shows the relative average response rate of the predictive model.
How do I know that something is missing? This question may sound trivial; a missing value in a table can be recognized by the fact that a cell is empty. Aunt Susanne’s missing date of birth value is one example. Another customer may refuse to answer a question so no value is entered in, for example, the field “number of children.” Such cases can be easily detected and selected by database queries. But not all missing values reveal themselves in such an explicit way. Consider the Figure 5 and Figure 6 that show machine downtime in a factory over 12 weeks.
Figure 5 (top) and Figure 6 (bottom): Machine downtime in a factory over 12 weeks.
When looking at the course over time, we might assume that the diagrams in Figure 5 and Figure 6 represent different data (maybe different years or different factories). In truth the graphs are built with the same data that is shown in Table 1:
From a pure technical point of view, no values are missing and the “number of failures” is always filled. However, from a content point of view we do indeed have missing data: There is no entry for weeks 2, 4, 5, 6 and 10. It now has to be decided whether this means that in week 2 there were no failures or if no information exists for week 2.
The red line in Figure 5 uses the data as they are and assumes a missing value for week 2 that is just interpolated with a straight line. The blue line in Figure 6 treats the missing weeks in the data as weeks with zero failures and interprets the data as shown in table 2:
This example illustrates that in time-series data we can have missing data in the form of missing records that are not explicitly visible. Only by reviewing the continuity of the time axis do we detect these missing values.
Figure 7 shows methods that are frequently used to detect and impute missing values for time series data. The original data only had 12 records, the records for May, July and November 1949 where automatically inserted into the data and different replacement values were inserted.
- ZERO VALUE: assuming that if a period is not represented in the data, no traffic occurred for that period
- The LAST KNOWN VALUE: if no information for a period is available the value is assumed to be the same as in the previous period
- The MEAN VALUE: assuming that a missing value is best represented by the mean of the existing values in the time series
- The INTERPOLATED VALUE: here a most likely value is found based on spline interpolations
The analysis has to be checked where data points are missing (where the time series has “holes”) and how these holes shall be interpreted from a business point of view. These considerations then lead to the decision of how the missing values shall be imputed.
Figure 7: Methods to detect, impute missing values for time series data.
In analytics, missing values are more than just a technical feature of the data. Business considerations are needed to decide how they shall be detected and handled. The aim is to get a more complete picture and to remove biases and patterns. Analytical methods help to detect missing values, to provide optimal replacement values and to simulate the consequences on model quality.
Gerhard Svolba (Gerhard.firstname.lastname@example.org) works for SAS Austria as an analytic solution architect. He is the author of the SAS Press books “Data Preparation for Analytics Using SAS” and “Data Quality for Analytics Using SAS” and speaks at international analytics conferences about the necessary pre-steps before statistical analyses can start. To download the presentations click here.