Predictive Modeling: Research as a competitive sport
Crowdsourcing the full analytics value chain.
By Margit Zwemer
Competition in analytics is a familiar concept on an organizational level. An “analytics gap” exists between the predictive haves and have-nots: an insurance company that accurately predicts a new customer’s actuarial risk, a mortgage lender that better predicts the probability of default, a retailer that better predicts its churn rate, a social network its number of followers or an ad platform its click-through rates. If any of these companies can accomplish their predictions better than the rest of their industry, they gain a distinct competitive advantage – survival of the analytic fittest.
What happens when this same approach is applied to the building of the predictive models themselves? The rise of “competitive research” has shown that as much or more value can be derived from competition between the predictive analysts. In this article, we’ll examine the past results of making analytics a competitive sport and see how this approach is affecting the full analytics value chain, from problem identification to analysis to implementation, and even how analysts might be recruited and compensated in the near future.
The History of Innovation Prizes
We tend to think of crowdsourcing as a cost-effective way to accomplish low-skill, repetitive tasks, but a quick glance at the history of innovation prizes shows that open competition can achieve breakthrough results in high value-added areas. The traditional methods of allocating research funding and corporate analytics budgets can encourage a risk-averse, cautious mentality . . . and solutions that are “good enough” but not great. The winner-take-all nature of prizes breeds a form of intelligent risk-taking that is particularly good at solving previously intractable problems. Essentially, it is a way of crowdsourcing genius.
As far back as the 18th century, the British government offered more than £100,000 in prize money to anyone who could come up with simple and practical methods for measuring longitude to assist maritime navigation. A watchmaker, John Harrison, solved the problem. Then there was the first non-stop flight between New York and Paris in 1927, which earned Charles Lindbergh, a relatively unknown aviator at the time, the $25,000 Orteig Prize. In 2004, a privately funded team led by engineer Burt Rutan captured the $10 million Ansari X Prize for becoming the first non-governmental organization to launch a man into space not once but twice within two weeks using a largely reusable spacecraft. The 2009 Netflix Prize demonstrated that this model can be applied equally well to algorithmic innovation. Netflix offered a million dollars for a 10 percent improvement in its movie recommendation algorithm. An estimated 80 percent of the movies that customers watch on Netflix are found through the recommendation engine, so, given the number of Netflix customers, the million-dollar prize was money well spent.
All of the prize contests mentioned here produced remarkable accomplishments. While those accomplishments might eventually have been achieved without such bounties to spur them on, it is unlikely they would have happened so quickly.
Modern Innovation Prizes
Using innovation prizes not only to solve a one-off problem but also as a regular part of doing business is catching on in many different areas. Topcoder is a well-known community of more than 400,000 developers who compete in challenges ranging from architecting an entire software system to creating a Web site to developing new algorithms. The community competes to design and then build each module of the product, with a prize linked to each component based upon its difficulty. InnoCentive is another innovation platform that hosts an even wider array of challenges, from finding biomarkers for ALS (Lou Gehrig’s disease) to inventing a low-cost rainwater storage system. From brainstorming, to theory, to blueprints, to code, all of these can now be crowdsourced.
Before I dive more deeply into research as a competitive sport, I should tell you that I work for Kaggle, a predictive modeling competition platform. If the examples I bring up seem to tilt heavily toward my own organization, it’s because this is the area of competitive research I know best, and for which I can marshal hard evidence to back my claims.
Inspired by the Netflix Prize, Kaggle was founded in 2010 to apply the innovation prize model and digitize it, creating a marketplace for data science. Kaggle hosts prediction competitions that solve large-scale data problems in areas such as business, health, education and science. The competitions lead to more accurate algorithms for the companies that sponsor them because they pit a wide range of solutions and techniques against each other on a data science “proving ground.” The online community has grown to more than 40,000 data scientists and predictive analysts, competing under the slogan “making data science a sport.”
While the painstaking work of building a great predictive model could not seem further from the mud and sweat traditionally associated with sports, the competitive dynamic between Kaggle participants is what has consistently produced models that are not just better than the current state of the art, but better than even the analysts involved thought possible.
Kaggle competitions have dramatically outperformed pre-existing benchmarks in every competition run. For example, the best performing algorithm in Allstate’s Kaggle competition produced a 271 percent improvement over the internal version and an 83 percent improvement over the expected best-case development of a similar model using internal or traditional third-party resources. Allstate has validated that the Kaggle competition produced a model that performed substantially better than what the company would have expected to develop internally or with a specialized outside predictive modeling consulting firm. In a competition for another client, the client’s internal analytics team was allowed to compete anonymously. While the team did not win overall, its performance well exceeded the benchmark that it had built before the competition started.
The Kaggle platform also includes a real-time leaderboard associated with each competition. Getting instant feedback on the accuracy of their models against a hidden validation set, rather than submitting models that are only judged at the end of the competition, encourages contest participants to push themselves harder in order to leapfrog their competitors. The competitive dynamic drives data scientists to continue exploring ways to improve predictive accuracy, spurred on by the knowledge that someone else has found something that makes their solution better.
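The leaderboard mechanism described above can be illustrated with a minimal sketch. It is not Kaggle's implementation: the `Leaderboard` class, the choice of root-mean-square error as the metric, the team names and the numbers are all assumptions made for illustration. The sketch shows only the core idea, that each submission is scored immediately against answers the participants never see, and that the ranking updates as soon as anyone improves.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error; lower is better. (Metric chosen for illustration.)"""
    if len(predicted) != len(actual):
        raise ValueError("submission length must match the hidden answer set")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


class Leaderboard:
    """Hypothetical leaderboard: scores submissions against hidden answers."""

    def __init__(self, hidden_answers):
        self._hidden = list(hidden_answers)  # never shown to participants
        self._best = {}                      # team name -> best score so far

    def submit(self, team, predictions):
        """Score one submission right away and keep the team's best result."""
        score = rmse(predictions, self._hidden)
        if team not in self._best or score < self._best[team]:
            self._best[team] = score
        return score

    def standings(self):
        """Return (team, best score) pairs, best first."""
        return sorted(self._best.items(), key=lambda item: item[1])


# Made-up data: four hidden answers and three entrants.
board = Leaderboard([3.0, 1.0, 4.0, 1.0])
board.submit("benchmark", [2.0, 2.0, 2.0, 2.0])
board.submit("team_a", [3.0, 1.0, 3.0, 2.0])
board.submit("team_b", [3.0, 1.0, 4.0, 2.0])

for rank, (team, score) in enumerate(board.standings(), start=1):
    print(rank, team, round(score, 4))
```

Because every call to `submit` returns a score at once, a team that sees itself overtaken knows that a better approach exists and can keep iterating, which is the leapfrogging dynamic the paragraph describes.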
Figure 1 shows the change in leader and increase in predictive accuracy in the Kaggle-NASA mapping dark matter competition. Within a week of the competition launch, the benchmark based on more than 10 years of physics research was beaten by … a glaciologist. The same asymptotic pattern occurs in all Kaggle competitions and often with the same surprising winners.
Figure 1: Predictive accuracy of the Kaggle-NASA mapping dark matter competition over time.
Participants are located all over the world and often work in fields that are, on the surface, completely unrelated to the source of the data problem, such as the hedge fund trader who recently won an education technology challenge or the neuroscience student solving an air pollution problem (see Figure 2). Kaggle’s focus on data competitions with real-time leaderboard feedback brings out fresh talent and drives objectively better results.
Figure 2: The universe of Kaggle participants offers a diverse skill set.
Predicting the Future of Predicting the Future
Using competitions to crowdsource analytic talent does not mean that in-house analytics groups become obsolete, but it does change how they will approach their role. In the future, the most valuable skill may not be solving the analytics problem itself, but being able to clearly define and structure the problem into a competition format and evaluate the results to determine which model can be most successfully applied to your particular business. By introducing open, competitive research into the analyst’s toolset, an opportunity arises to leverage a large analytics talent pool in a cost-effective, scalable way.
Kaggle’s own expansion shows how competitive research can be applied to many other points along the analytics value chain. In the past year, Kaggle has introduced specially formatted, invitation-only competitions in which a select group of past competition winners are invited under non-disclosure agreement to work on sensitive datasets. Another recently introduced type of competition, marketed as Kaggle Prospect, extends the crowdsourcing model to the problem identification phase that happens before the competition can be defined. Hosts share a sample of their raw data with Kaggle’s participants, who can use the data to suggest predictive models that could be built, uncovering new or less obvious questions that further analysis of the data can explore. At the other end of the analytics value chain, Kaggle makes it simple to operationalize the best predictive models from the competitions and integrate them into existing systems by hosting and implementing the model and making it accessible to the client through an application programming interface.
Competitive research has also created a reputation engine for the analytics industry. Recently, Kaggle introduced recruiting competitions, which allow companies to filter their data science applicants on demonstrable skills rather than resumes. Instead of leafing through a stack of resumes and then using 10-minute technical brain teasers to determine if an applicant is skilled at solving large, open-ended data problems, a recruiting competition allows organizations to “try before they buy.” The first competition of this type was launched for Facebook and is one of the most popular contests that Kaggle has hosted. Competitive research, whether based on past competition results or a recruiting competition for a specific company, is an objective and unbiased way to filter talent. Screening applications through a data science competition decreases both the false positive and false negative rates – no more bad interviews and no more great candidates waiting forever for the phone to ring because they come from a non-traditional background.
Another implication of competitive research is in how analysts are compensated. A hedge fund trader who is great at predicting risk makes millions, so why is an equally smart analyst in a different industry, who is also great at predicting risk, not compensated nearly as well? In most industries, predictive modeling is still treated as a cost center, even if the results of the model have a direct impact on the company’s profitability. The demand for people with the skills to crunch large amounts of data far exceeds the supply, but salaries have been slow to adjust. In competitive marketplaces for analytics, prize pools are starting to converge toward the business value of the models as hosts bid up prize money to attract the best talent to their problems.
Selecting the best predictive model through a research competition is, as Sloan School Professor Andy McAfee so eloquently describes it, “not picking a horse, but hosting a horse race.” Data research competitions are a resource-efficient way for organizations to solve complex data problems, and they create a meritocratic market for talent that changes the way analysts work. As the crowdsourced model for competitive analytics catches on, companies have less need to build large, in-house analytics teams, but the people they do hire need to understand how to structure their questions for competitive research. May the best model win.
Margit Zwemer (email@example.com) is a data scientist and community manager at Kaggle.