ANALYZE THIS! A ‘Silver’ lining for election blues
By Vijay Mehrotra
For the past several months, I have spent hours staring at my screen, reading anything I can get my hands on that might help me get a sense of what might happen during the elections on Nov. 8. Since I live in Oakland, Calif., the heart of the uber-liberal bubble that is the San Francisco Bay Area, I am constantly searching for truly fair and balanced perspectives about what is really going on across the rest of the country, especially with regard to this year’s presidential election.
For analytics professionals, in fact, the world of presidential elections is familiar territory. Since George Gallup first applied statistical random sampling to draw conclusions about the results of the 1936 U.S. presidential election based on survey data, there has been a steady (and recently explosive) growth in the number of people gathering, analyzing, visualizing and interpreting data to try to understand, explain, predict and/or influence what might happen in our elections. Nowadays, during our seemingly endless presidential election campaigns, it feels as though there are new state or national poll results being announced every day for months. This almost nonstop and often contradictory stream of numbers often leaves me bewildered.
Which is why I spend a huge amount of time visiting fivethirtyeight.com, the political website created some years ago by Nate Silver, a “number-crunching prodigy” who just might be the big data world’s first mainstream rock star. Since bursting onto the political scene in 2008 with predictions that (a) were somewhat contrary to the punditry’s consensus and (b) often proved to be surprisingly accurate, Silver’s website has sought to use both historical voter data and information from political polls to make predictions about elections (the site, now owned by ESPN, also includes data-driven stories about sports, science, economics and culture). During this presidential election cycle, Silver and his team used one set of models for forecasting state-by-state primary results and another set for the general election.
Quite a bit of methodological detail about these forecasts is publicly available at fivethirtyeight.com [2,3]. After reading Silver’s book “The Signal and the Noise” and scrutinizing these forecasting process descriptions, my sense is that Silver and his team are seeking to understand the same elusive truths about the electorate that I am, and as an engaged citizen I am grateful for their efforts. The blizzard of polling data is systematically examined, with some polls banned due to ethical and/or methodological concerns. Careful weighting is done to account for factors such as sample sizes, recency, frequency and past performance. A wide variety of adjustments are made to address factors such as post-convention bounces, third-party candidates, registered vs. likely voters and historical biases associated with particular polls. Moreover, over the past eight years Silver and his team have continued to tweak their models in different ways, and there appears to be a basic humility about the limits of prediction underlying their observations and claims, as well as a deep-seated wonky desire to tell an unbiased story.
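For readers who want a concrete feel for what weighting polls by sample size, recency and past performance can look like, here is a minimal Python sketch. It is not FiveThirtyEight's actual model: the `Poll` fields, the square-root sample-size weight, the 14-day half-life and the 0-to-1 `pollster_rating` are all assumptions chosen purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Poll:
    share: float            # candidate's share in the poll, in percent
    sample_size: int        # number of respondents
    days_old: int           # days since the poll was conducted
    pollster_rating: float  # hypothetical 0..1 score for past performance


def poll_weight(poll: Poll, half_life_days: float = 14.0) -> float:
    """Combine three illustrative factors into a single weight."""
    size_weight = math.sqrt(poll.sample_size)                 # bigger samples count more
    recency_weight = 0.5 ** (poll.days_old / half_life_days)  # older polls fade out
    return size_weight * recency_weight * poll.pollster_rating


def weighted_average(polls: list[Poll], half_life_days: float = 14.0) -> float:
    """Weighted mean of poll shares under the assumed weighting scheme."""
    weights = [poll_weight(p, half_life_days) for p in polls]
    total = sum(weights)
    if total == 0:
        raise ValueError("no usable polls")
    return sum(w * p.share for w, p in zip(weights, polls)) / total
```

Under these assumptions, a two-week-old poll counts half as much as a fresh one of the same size and rating, so the average leans toward the newer number.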
Which brings us back to Oakland. In a recent article in Significance, Kristian Lum (lead statistician at the Human Rights Data Analysis Group) and William Isaac (doctoral candidate in the Department of Political Science at Michigan State University) examine the controversial topic of “predictive policing,” a term that refers to the use of data and models to make forecasts about where crime is most likely to take place, usually within urban population centers. While predictive policing software has been commercially available for some time, its efficacy has been hotly debated in crime-fighting circles.
Lum and Isaac’s main thesis is that the historical data used by these predictive policing systems is inherently biased, and that this bias is in turn propagated by the machine-learning algorithms that are embedded in the software. Citing a great deal of previous research, these authors are deeply suspicious of this historical data, concluding that “police records do not measure crime. They measure some complex interaction between criminality, policing strategy, and community-police relations.”
Using data about Oakland, the authors tell a thoughtful and illuminating data-driven story. Rather than accepting police data as their proxy for actual drug crime, they instead use data from the 2011 National Survey on Drug Use and Health (NSDUH). After making a solid case for why the NSDUH data is likely to provide a more accurate snapshot of drug use in Oakland than past arrest data, the paper then contrasts the NSDUH data with the actual police arrest data for drug crimes for 2010, pointing out that “while drug crimes exist everywhere, drug arrests tend to only occur in very specific locations – the police data appear to disproportionately represent crimes committed in areas with higher populations of non-white and low-income residents.”
The authors then make an important observation about how biases can propagate even more quickly than the machine-learning models alone would suggest:
“But what if police officers have incentives to increase their productivity as a result of either internal or external demands? If true, they might seek additional opportunities to make arrests during patrols. It is then plausible that the more time police spend in a location, the more crime they will find in that location.”
Putting all of this together, the overarching premise here is as follows: (a) algorithms based on biased input data suggest that crime will be found in certain areas, which leads to (b) more policing in those areas, which causes (c) dramatically larger numbers of arrests in those areas relative to other less policed areas, which leads to (d) increased bias in the input data for the algorithms. The paper concludes by presenting the results of a simulation that vividly illustrates this insidious cycle.
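The cycle described above can be made concrete with a toy model. The following Python sketch is not the simulation from Lum and Isaac's paper. It is a deliberately simplified, deterministic illustration, and the function name `simulate_feedback`, the two-district setup and the rule of sending each round's extra patrol to the district with the most recorded arrests are all assumptions.

```python
def simulate_feedback(recorded, true_rate, rounds, detect_per_patrol=1.0):
    """Toy feedback loop between recorded arrests and patrol allocation.

    recorded:   initial (possibly biased) arrest counts per district
    true_rate:  underlying crime level per district (hypothetical units)
    rounds:     number of patrol-allocation cycles to run
    Returns the recorded counts after each round, starting with the input.
    """
    recorded = list(recorded)
    history = [list(recorded)]
    for _ in range(rounds):
        # (a) the "algorithm" flags the district with the most recorded arrests
        target = max(range(len(recorded)), key=lambda i: recorded[i])
        # (b) the extra patrol goes there, and (c) it finds crime in
        # proportion to the true rate in that district only
        recorded[target] += true_rate[target] * detect_per_patrol
        # (d) the new arrests feed back into the data used next round
        history.append(list(recorded))
    return history


# Two districts with identical true crime but a biased starting record.
history = simulate_feedback(recorded=[60, 40], true_rate=[10, 10], rounds=5)
first_share = history[0][0] / sum(history[0])
last_share = history[-1][0] / sum(history[-1])
print(f"District 0 share of recorded arrests: {first_share:.2f} -> {last_share:.2f}")
```

Even though both districts have the same underlying crime level in this sketch, the district that starts with more recorded arrests keeps attracting the patrol, so its share of the recorded data grows every round.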
In the big data age, our understanding of the world and the future – and our decisions about what actions to take based on that understanding – are increasingly dependent on algorithms. As analytics professionals, most of us have some inherent bias toward data-driven methods, but often this bias should be tempered with a healthy dose of skepticism about such models and about the data that drives them. Political polling and predictive policing are just two pernicious examples of how biased data can distort our beliefs and behaviors – but as a dark-skinned foreigner living in America during this year’s presidential election and a proud resident of Oakland, they hit particularly close to home for me.
Vijay Mehrotra (email@example.com) is a professor in the Department of Business Analytics and Information Systems at the University of San Francisco’s School of Management and a longtime member of INFORMS.
REFERENCES & NOTES
- Significance is a joint publication of the Royal Statistical Society and the American Statistical Association, available online at www.rss.org.uk/significance
- See for example http://www.sciencemag.org/news/2016/09/can-predictive-policing-prevent-crime-it-happens