Viewpoint: Did Nate Silver beat the tortoise?
By Arnold Barnett
In making election forecasts for the FiveThirtyEight blog (538) at the New York Times, Nate Silver uses a statistical model that is subtle, sophisticated and comprehensive. Real Clear Politics uses a shallow approach to forecasting that could have been devised by a statistical Forrest Gump. But which forecaster better predicted the results in the 2012 presidential election? Did the intellectual tortoise hold its own against the hare?
From a conceptual standpoint, it should have been no contest. In an approach that would make statisticians shudder, Real Clear Politics (RCP) estimated the Obama/Romney difference in a given state by the simple average of differences in recent polls. Differences in sample sizes were ignored, the word “recent” was defined differently in different states, undecided voters were simply excluded, and evidence that some polls skew toward Republicans and others toward Democrats got no weight. The 538 model, in contrast, avoided all these limitations, and took account of correlations among outcomes in similar states and the demographic makeup of each.
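To make the contrast concrete, here is a minimal sketch of an unweighted poll average of the kind attributed to RCP above, set beside a sample-size-weighted average. The poll figures, field names and function names are hypothetical and chosen only for illustration. The weighted version addresses just one of the limitations listed (ignored sample sizes); it is not a representation of the 538 model.

```python
# Hypothetical polls for one state: candidate shares in percent, sample size n.
# These numbers are invented for illustration and are not real 2012 polls.
polls = [
    {"obama": 49.0, "romney": 47.0, "n": 500},
    {"obama": 48.0, "romney": 48.0, "n": 1500},
    {"obama": 50.0, "romney": 46.0, "n": 1000},
]

def simple_average(polls):
    """Unweighted mean of the Obama-minus-Romney margins (RCP-style)."""
    margins = [p["obama"] - p["romney"] for p in polls]
    return sum(margins) / len(margins)

def size_weighted_average(polls):
    """Mean of the margins, weighting each poll by its sample size."""
    total_n = sum(p["n"] for p in polls)
    return sum((p["obama"] - p["romney"]) * p["n"] for p in polls) / total_n

simple = simple_average(polls)
weighted = size_weighted_average(polls)
print(f"simple average margin:   {simple:.2f}")
print(f"weighted average margin: {weighted:.2f}")
```

In this made-up example the largest poll shows a tie, so weighting by sample size pulls the estimated margin below the simple average.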
From Theory to Practice
But how did the final state-by-state predictions under the two approaches compare in accuracy? RCP only made forecasts in 30 of the 51 states (including the District of Columbia), but these included all swing states and all large states. And, at first blush, it might appear that the race between the two methodologies in the 30 states was (in the familiar phrase) too close to call.
The most obvious dimension for comparison is the bottom line: Did the forecast in a given state correctly identify the winner there? By that standard, both methods did very well: In 29 of the 30 states, they agreed who the winner would be and that candidate actually won. (Complete data tables will appear in a longer version of this article in the February 2013 issue of OR/MS Today.) In Florida, neither forecaster made a correct forecast: RCP erroneously projected a narrow Romney victory (1.5 percentage points), while 538 projected an exact tie (and thus abstained from forecasting). Obama carried Florida by 0.9 percentage points. We can say, therefore, that 538 scored a partial victory over RCP in one of 30 states, but that is hardly a decisive advantage.
As for the absolute forecast errors in the various states, the results were once again similar. The mean absolute error over the 30 states was 2.87 percentage points for RCP and 2.25 for 538. However, there is a “blue state bias” among the 30 states: Romney carried only 27 percent of them (eight out of 30), while he captured 47 percent (24 out of 51) in the entire nation. When an adjustment is made for this bias, the mean absolute error becomes 2.57 points for RCP and 2.33 for 538. This revised difference of one-quarter of one percentage point is hardly decisive.
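The article does not spell out how the adjustment for "blue state bias" was made. One plausible reading, offered here only as an assumption, is a reweighting: compute the mean absolute error separately for Romney-won and Obama-won states, then combine the two using the national shares of states (24 of 51 and 27 of 51) rather than the sample shares (8 of 30 and 22 of 30). The error values below are hypothetical placeholders, not the article's data.

```python
def mean(values):
    return sum(values) / len(values)

def reweighted_mae(romney_errors, obama_errors, romney_share):
    """Combine the two groups' mean absolute errors using a chosen
    share of Romney-won states (assumed adjustment method)."""
    return (romney_share * mean(romney_errors)
            + (1.0 - romney_share) * mean(obama_errors))

# Shares of Romney-won states, taken from the article.
sample_share = 8 / 30      # among the 30 states RCP forecast
national_share = 24 / 51   # among all 51 contests

# Hypothetical absolute forecast errors (percentage points), for illustration.
romney_errors = [4.0, 2.0]
obama_errors = [1.0, 2.0, 3.0]

raw = reweighted_mae(romney_errors, obama_errors, sample_share)
adjusted = reweighted_mae(romney_errors, obama_errors, national_share)
print(f"weighted by sample shares:   {raw:.2f}")
print(f"weighted by national shares: {adjusted:.2f}")
```

Under this assumed method, a forecaster whose errors are larger in Romney-won states sees its adjusted error rise, because those states count for more nationally than they do in the 30-state sample.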
On the Other Hand
Yet this aggregate analysis is oblivious to the central dynamic of the 2012 election. Given the realities of the Electoral College, the candidates and everyone else recognized that the outcome would be determined by what happened in about a dozen “swing states” that either candidate could plausibly win. In the other states, the winner was a foregone conclusion, so there was little campaigning and little interest in polling results.
Under the circumstances, a comparison between RCP and 538 should focus primarily if not exclusively on their accuracy in swing states. RCP identified 11 states as “toss up” just before the election: Colorado, Florida, Iowa, Michigan, New Hampshire, Nevada, North Carolina, Ohio, Pennsylvania, Virginia and Wisconsin.
Within these states, the two approaches differed markedly in performance. 538 outperformed RCP in absolute forecast accuracy in all but one of the 11 swing states (Ohio). Both forecasters were on average more favorable to Romney than the actual voters, but the net “bias” was only 0.76 percentage points for 538 over the 11 states as opposed to 2.44 points for RCP. That difference of 1.68 (2.44 - 0.76) points is especially noteworthy because regression analysis makes clear that 538’s estimates of Obama’s performance were consistently about 1.5 points higher than those of RCP. Again and again, this adjustment was vindicated by the swing-state results: RCP underestimated Obama’s actual vote share, while 538 eliminated roughly 70 percent of the underestimation.
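The share of RCP's underestimation that 538 removed follows directly from the two bias figures quoted above. A short check of that arithmetic, using only those two numbers (the variable names are illustrative):

```python
# Net pro-Romney bias over the 11 swing states, in percentage points,
# as reported in the article.
rcp_bias = 2.44
bias_538 = 0.76

reduction = rcp_bias - bias_538        # points of bias removed by 538
share_removed = reduction / rcp_bias   # fraction of RCP's bias removed

print(f"bias removed: {reduction:.2f} points")
print(f"share of RCP bias removed: {share_removed:.0%}")
```

The ratio comes to just under 70 percent.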
In the 19 states out of the 30 originally compared that were not swing states, 538 and RCP performed about equally well, which is why statistics based on all 30 states yielded less disparity between the two approaches than the swing states alone. It could be that Obama outperformed the swing-state polls upon which RCP relied because of the major voter-turnout drives that his campaign undertook in those states, which brought many people to the voting booths whom pollsters had not included in tabulations about “likely” voters. In the other states, the Obama campaign may not have waged such efforts, so no comparable “surge” occurred.
Nate Silver would be the first to agree that his state-by-state forecasts were correlated, and that circumstance stymies assessments of whether his swing-state victory over RCP was statistically significant. In effect, he made an all-or-nothing bet on the premise that the polls underestimated Obama’s strength in swing states: Had this premise been wrong, his 10-1 victory over RCP could easily have been an 11-0 defeat. Yet uncertainties about how to define statistical significance cannot obscure the fundamental point: 538 did extremely well in 2012 in those states where accuracy was most important.
So how does it all add up? Under the Occam’s Razor principle, there is a clear starting preference for simple models over more complicated formulations. A complex model must justify its intricacy by offering more accurate information than a simpler counterpart; moreover, this added information should arise in places where it is most needed. In the present setting, the question is whether Nate Silver’s 538 model outperformed the straightforward RCP method to an extent that makes 538 the wiser choice, even if the less transparent one.
Readers can reach their own judgments, but because of the results in the swing states, the author believes that 538 met the test for superiority just posed. While the tortoise catches up with the hare in the nursery stories, it seems here that the hare won hands down. But the outcome does not contradict Aesop’s fable because, far from being lazy, the 538 hare ran the race as hard as it could. And, if the evidence is any guide, it is very much a world-class runner.
Arnold Barnett (email@example.com) is the George Eastman Professor of Management Science at the MIT Sloan School of Management. His research specialty is applied mathematical modeling with a focus on problems of health and safety. Barnett is a senior member of INFORMS.