Forum: Time-to-event analysis
By Kevin J. Potcner
Statisticians and analytics professionals have been conducting time-to-event analyses across myriad applications for as long as data has been collected and analyzed. Governments and religious institutions throughout history have collected data on birth and death rates to better estimate resource demands and predict tax revenue. Insurance companies use sophisticated time-to-event models to predict accident, illness and mortality rates in order to set policy costs and forecast profits. Engineers use these techniques (often called reliability analysis) to model the lifetimes and failure rates of mechanical or electronic systems and uncover the factors that impact those rates, producing metrics such as mean time to failure (MTTF). Marketers have begun adopting many of these techniques to study important time-to-event phenomena of their customers, such as rates of product adoption or the time for a customer to upgrade a service contract.
This article describes how the author used time-to-event analysis to build a set of statistical models forecasting how long it takes for known software bugs to be encountered by customers after the software has been installed on their networks.
As anyone involved in software development knows, it’s not possible to have 100 percent of known bugs fixed prior to release if a company is to meet market demands and expectations. Being able to predict the expected time for bugs to be encountered by customers once the software is out in the field and determining the characteristics that impact that time can be very valuable information for the organization. Management can prioritize efforts and focus engineering and testing resources on the bugs most likely to be encountered, and then hopefully fix those bugs before too many customers encounter the problem.
Time-to-event data, and the statistical techniques developed to analyze such data, have an important distinction. The data is often incomplete, or what the statistical literature refers to as censored data. That is, instead of exact time-to-event values, only upper and/or lower bounds for when the event in question occurred are known. For example, suppose a release of the software is installed on a customer’s network with 250 known bugs not fixed at the time of installation, and that after six months (180 days) in use, the customer has encountered and reported 86 of those 250 bugs. The length of time between installation and the date each bug was encountered can be used as the time-to-event for each of those 86 bugs. The remaining 250 – 86 = 164 bugs, however, are still present in the software but haven’t been found by the customer yet. If the analysis is conducted after 180 days, it would be incorrect to use a time-to-event value of 180 days for those 164 bugs. Given enough time in use, those 164 bugs would eventually be encountered and would each have a time-to-event value of 181 days or greater. Statistical techniques for time-to-event analysis handle this by assigning each of those 164 bugs what is called a “right censored” value of 180 days. Similarly, there is “left censoring” when only the upper bound for the time-to-event is known, and “interval censoring” when the time-to-event is known only within a range.
In many time-to-event studies, data will contain a combination of complete data as well as all forms of censoring. The data used for this project had complete data (i.e., time-to-event known) and right-censored data (i.e., a lower bound on the time-to-event known).
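The data setup described above can be sketched with a small Kaplan-Meier estimator, a standard nonparametric technique for censored time-to-event data. This is an illustrative sketch, not the author's actual analysis: the individual event times below are invented, and only the counts (86 encountered bugs, 164 right-censored at 180 days) come from the example above.

```python
# Sketch: representing complete and right-censored bug data and computing
# a Kaplan-Meier survival estimate. Individual event times are invented;
# only the counts (86 found, 164 censored at 180 days) follow the example.

def kaplan_meier(times, observed):
    """Return [(t, S(t))] at each distinct event time t.
    times: time-to-event or censoring time for each bug.
    observed: True if the bug was encountered, False if right-censored."""
    data = sorted(zip(times, observed))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Events and total removals (events + censorings) at this time.
        deaths = sum(1 for tt, obs in data if tt == t and obs)
        removed = sum(1 for tt, _ in data if tt == t)
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
        i += removed
    return curve

# 86 bugs encountered at (invented) odd-numbered days 1, 3, ..., 171,
# plus 164 bugs still unfound, right-censored at 180 days.
times = [2 * k + 1 for k in range(86)] + [180] * 164
observed = [True] * 86 + [False] * 164

curve = kaplan_meier(times, observed)
final_t, final_s = curve[-1]
print(final_t, round(final_s, 3))  # → 171 0.656
```

With no losses other than the events themselves before day 180, the estimate telescopes to 164/250 = 0.656 still "surviving" (unfound) after the last observed event, which matches the raw counts; the estimator's value shows up when censoring times vary across customers.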
As with any analytics initiative, it is essential to consolidate a learning agenda that all stakeholders and teams have had input in creating in order to get the needed alignment and support. This work can take anywhere from a few days for a simple project to many months for a more complex initiative that an organization has not attempted before. For this particular project, three months were spent in discussions with senior management across various groups, including QA & Testing, Development & Engineering, and Product and Release Management, among others. These discussions are the first opportunity to establish priorities and set realistic expectations. In addition to developing a set of key questions and project objectives, they provide valuable historic context on the problem and can identify sensitive topics that a consultant must tread carefully around.
For this project, the learning agenda was distilled down to three key questions.
- What is the likelihood that a customer will encounter a bug as a function of time in usage?
- How do characteristics of the bug impact that likelihood?
- Which bug types are the most likely to be encountered by a customer?
Data Collection & Aggregation
Collecting and aggregating all the needed data can be one of the most challenging and time-consuming aspects of an analytics initiative. The data needed for this effort was spread across myriad databases, requiring many different resources to fully source. Resources worked on this stage for over two months, often requiring multiple data extractions.
The definition of data should be broadened to include information not contained within a database. Resources intimate with the software and how the customer uses it can provide a wealth of knowledge that can often be translated into quantitative variables providing additional dimensions to the analysis.
QA, Analysis & Data Cleansing
Once all the data has been extracted, it’s important to plan a proper amount of time and effort to validate and clean the data. This step is often underestimated in analytics projects but is one of the most critical, as misleading results can be produced if the work is not done thoroughly. Almost all data will have issues that need to be resolved. Errors, incorrect values, unusual observations, extreme outliers and data inconsistencies are quite common; addressing these issues will benefit both the project at hand and other applications for which the data is used. For this particular project, the technical teams uncovered a host of problems with a few of the databases, revealing that an uncomfortable level of inaccurate data was being used to create various reports distributed across the business. A separate project was kicked off to fix these issues and improve the accuracy of those reports.
Validating and cleaning data generates some very rich conversations among stakeholders and technical teams. This stage is also helpful for setting expectations with stakeholders, as the gaps and limitations of the data can be shown more clearly. These discussions help the statistician better connect how the business views the problem with how the available data can be used to produce actionable business metrics. Valuable insight into the nature of the data is also gleaned as the statistician examines the variability and correlation structure, identifying issues that may impact the statistical modeling and analysis techniques to be used.
Once the data has been adequately cleaned and prepared, the statistical modeling work begins. By this point, enough analysis and data examination should have been done that the statistician has a very clear idea of the technique and approach to be used. The goal is to reduce the data to a mathematical expression with one component that describes the overall structure in the data and another that accounts for the variability and uncertainty around that structure. It’s important to remember that the goal of statistical modeling is to build as simple a model as possible that adequately describes the key features in the data, allowing the hypotheses and questions of interest to be addressed without adding unnecessary complexity.
A great quote that most statisticians keep top of mind during this process to help strike this balance is from one of the pioneers of the science, George Box: “All models are wrong, but some are useful.”
For time-to-event analyses, the modeling technique needs to account for the censored nature of the data. Many statistical model forms common in time-to-event analyses can handle censored data, and a variety of statistical software packages contain these techniques. The author used Minitab Statistical Software, a package common among reliability engineers.
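To make the censoring-aware modeling step concrete, here is a minimal sketch of fitting a parametric Weibull model (a common time-to-event model form) to right-censored data by maximum likelihood, in plain Python. This is an illustration of the general technique, not the author's actual analysis (which was done in Minitab), and the data are the invented numbers from the earlier example. Observed events contribute the density to the likelihood; censored points contribute the survival function.

```python
import math

def weibull_loglik(shape, scale, times, observed):
    """Log-likelihood of a Weibull(shape, scale) model with right censoring.
    Observed events contribute log f(t); censored points contribute log S(t)."""
    ll = 0.0
    for t, obs in zip(times, observed):
        z = (t / scale) ** shape
        if obs:
            ll += math.log(shape / scale) + (shape - 1.0) * math.log(t / scale) - z
        else:
            ll -= z  # log S(t) = -(t/scale)**shape
    return ll

def fit_weibull(times, observed, iters=60):
    """Crude shrinking grid search for the maximum-likelihood (shape, scale).
    A real analysis would use a proper optimizer or a stats package."""
    n_events = max(1, sum(observed))
    shape, scale = 1.0, sum(times) / n_events  # exponential-style starting point
    step_sh, step_sc = 0.5, scale / 2.0
    for _ in range(iters):
        best = (weibull_loglik(shape, scale, times, observed), shape, scale)
        for dsh in (-step_sh, 0.0, step_sh):
            for dsc in (-step_sc, 0.0, step_sc):
                sh, sc = shape + dsh, scale + dsc
                if sh > 0.0 and sc > 0.0:
                    ll = weibull_loglik(sh, sc, times, observed)
                    if ll > best[0]:
                        best = (ll, sh, sc)
        _, shape, scale = best
        step_sh *= 0.8
        step_sc *= 0.8
    return shape, scale

# Invented data echoing the earlier example: 86 encounter times plus
# 164 bugs right-censored at 180 days.
times = [2 * k + 1 for k in range(86)] + [180] * 164
observed = [True] * 86 + [False] * 164
shape_hat, scale_hat = fit_weibull(times, observed)
print(f"fitted shape ~ {shape_hat:.2f}, scale ~ {scale_hat:.0f} days")
```

The key point is in `weibull_loglik`: the 164 unfound bugs are not dropped and not treated as 180-day events; they contribute exactly the probability of surviving past 180 days.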
In most analytics projects, the more advanced statistical analyses and models are not shared beyond the core technical team doing the analysis work. These models and analyses need to be translated into a variety of summary statistics and graphical displays that communicate the features in the data and are easy to share across a broad range of audiences. For this project, a technical report containing a variety of graphical displays and data tables was produced. Figure 1 is an example of one of the graphical displays produced in this project, and one that’s commonly used in time-to-event analyses.
Figure 1: Graph displaying likelihood that a customer will encounter a bug as a function of time.
The graph displays the likelihood that a customer will encounter a bug as a function of time (note: the probability values are not shown to protect the confidentiality of the work). This approach shows the rate at which that likelihood increases over time. The likelihood for five different bug types is displayed (A, B, C, D and E), allowing for a comparison across bug types. For example, bug type E has the greatest chance of being found by a customer, while bug types A and B have the least chance.
Management can use graphical displays such as these to help determine the time in usage at which certain bug types reach a likelihood of being encountered beyond a desired level. In this project, a certain level of likelihood was decided upon by senior management and displayed on the graph (shown by the grey horizontal line). As can be seen, bug types A and B don’t reach that likelihood until almost three years in usage, indicating that fixing these bugs can be a lower priority. Bug types D and E, on the other hand, reach that likelihood within the first few months of usage, indicating that these bugs have a high chance of being encountered and should be top priority to fix before too many customers encounter them.
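The threshold reading described above amounts to inverting the fitted model's cumulative distribution function: given management's chosen likelihood level, solve for the time at which each bug type's curve crosses it. The sketch below shows this for a Weibull model; the parameter values and the 50 percent threshold are hypothetical placeholders, not the project's confidential figures.

```python
import math

def weibull_cdf(t, shape, scale):
    """Probability that a bug has been encountered by time t under a
    Weibull(shape, scale) model."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def time_to_reach(p, shape, scale):
    """Invert the CDF: the time at which the encounter probability reaches p."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

# Hypothetical fitted parameters for two bug types (illustrative values only;
# the article's actual probabilities and parameters are confidential).
params = {"A": (1.2, 900.0), "E": (1.5, 120.0)}
threshold = 0.5  # management's chosen likelihood level (illustrative)

for bug_type, (shape, scale) in sorted(params.items()):
    t = time_to_reach(threshold, shape, scale)
    print(f"type {bug_type}: reaches {threshold:.0%} at ~{t:.0f} days")
```

With these placeholder parameters, type E crosses the threshold within a few months while type A takes well over a year, mirroring the prioritization logic described above.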
Kevin Potcner (Kevin.firstname.lastname@example.org) is a director at Exsilon Data & Statistical Solutions. A statistician, Potcner has provided analytics consulting and training for a variety of industries including automotive, biotech, medical device, pharmaceutical, financial services, software, e-commerce and retail. He holds a master’s degree in applied statistics from the Rochester Institute of Technology.