Predictive Modeling: ‘Ensemble of Ensemble’
Meet the new nominal in predictive modeling
By Anindya Sengupta
Accurate prediction of the future is the most important motto of predictive modeling. With the adoption of predictive modeling and analytics across different industries, the focus on accuracy has increased massively. Predictive modelers across the globe have been in search of newer and more innovative techniques to increase their prediction accuracy. Against this backdrop, one method to consider for improving the accuracy of model predictions is the ensemble approach of modeling.
Ensemble models combine two or more models to enable a more robust prediction, classification or variable selection. The most common form of ensemble is combining weak decision trees into one strong model. The most prevalent methods in this regard are bagging and boosting. These machine-learning techniques have better accuracy than the erstwhile traditional statistical techniques such as least square estimation and maximum likelihood. However, the traditional techniques are more robust. We should ideally have a blended approach wherein we integrate the best parts of the two models. The new buzzword phrase in the predictive analytics world is “ensemble of ensemble” models.
There are different ways of combining different models. The most simplistic way of combining models is to take a simple or weighted average of the predictions from different models. For business problems with binary target variables, such as predicting the likelihood of fraud, this approach sometimes gives better results in terms of predicting the event rate. The major limitation to this approach is that the overall accuracy of this combined model lies between the accuracies of the individual models. Thus, this method is definitely not advisable for modeling continuous variables as overall accuracy matters most there. This method, however, can still be considered for modeling binary target variables.
Best Practices of Modeling
Given the strength of the traditional models, one possible approach of combining the models can be to use the traditional models for variable selection. Then, with the chosen set of variables, we should run the machine-learning method incorporating bagging and boosting. This will ensure that all best practices of modeling are maintained. One argument against using the machine learning techniques is that the variable selection process is not well defined. Two highly correlated variables can be selected in the model, and business may not have much insight separately from these two variables as both of them may be suggesting similar insight. One of the best ways to solve this kind of problem is to use traditional models for variable selection and use those variables for machine learning models. This can be one kind of ensemble.
One of the more prevalent forms of combining two different models has been to use the output of one model as input in the other model. This happens in two-stage models. The entire literature on censor modeling is based on these two-stage models. One can use the inverse mills ratio calculated from the output of the first-stage model as an input to the second-stage model. The method of using output from the model in stage one as an input in the model in stage two can also be treated as some kind of ensemble modeling.
One of the most efficient methods of creating ensemble of ensemble models is to use the predictions of different models to develop a separate model on the target variable using the predictions as the predictors. In this way, we will get the final predictions as a function of the predictors. The model will determine the functional form. In this method, we are basically optimizing the predictions using the individual predictors. This method gives better results in terms of the overall accuracy of the model. The combined model generally has higher accuracy than the individual models.
More Accurate Results
Whatever way we use to combine different models, it has been generally observed that “ensemble of ensemble” models give more accurate results than individual models. This has been the new normal in the predictive modeling world.
Going forward, ample research is needed on exploring more innovative ways of creating ensembles of traditional and the machine-learning models. Given the focus on prediction accuracy, researchers need to focus on more scientific ways of combining models in order to ensure maximum accuracy.
Anindya Sengupta is an associate director at Fractal Analytics (www.FractalAnalytics.com).