Analyze This! Many moving parts in analytics parade
By Vijay Mehrotra
Sometime back in the last century, when I was a disgruntled graduate student, I managed to wangle a part-time job at a semiconductor fabrication facility. My job was to gather data, build simulation models and conduct analyses to help management understand capacity, bottlenecks and cycle times. Along the way I managed to write a few conference papers [1] with some of my colleagues at HP before returning to school full time to pursue a dissertation on queueing networks that was inspired by my work in the fab. I will be forever grateful to Dr. Barclay Tullis for making that opportunity possible for me.
And that was pretty much the last time I thought very hard about computer chips. Like most of us, I just basically assumed that they would keep getting faster (and cheaper) at a faster and faster rate, just as Gordon Moore [2] had long ago predicted.
The other day, a recent paper [3] by my colleague Matthew Dixon caught my eye. In the context of parallel computing resources that might be located in any number of different places (public clouds, private clouds, remote desktops, etc.), the thesis of Dixon’s paper is that by using design patterns to structure high-level code in an analytics-friendly software language, data scientists can more effectively identify the computationally intensive steps during the design process. That understanding can in turn be used to organize one’s code to leverage parallel computing resources without having to re-implement the code in a more efficient lower-level language.
The paper illustrates this general process through a case study involving the estimation of option prices, showing how minor modifications to high-level code (in this case, Python) can enable the data scientist to use parallelization to radically speed up performance. The crux, as Dixon astutely points out, is the detailed knowledge that data scientists have about their applications, knowledge that enables them to make smart, modularized design choices. In this context, he refers to the data scientist as the Domain Expert for the application (more on this below).
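To make the idea concrete, here is a minimal sketch in that spirit (my own toy code, not Dixon’s): a Monte Carlo pricer for a European call option in which the computationally intensive step, simulating price paths, is isolated in a single function. Because the data scientist knows that this is the expensive, embarrassingly parallel step, farming it out to parallel workers requires only a minor modification rather than a rewrite in a lower-level language.

```python
# A toy Monte Carlo European call pricer (illustrative only, not Dixon's code).
# The expensive step -- path simulation -- is isolated in simulate_payoffs()
# so that parallelizing it is a small, local change.
import math
import random
from multiprocessing import Pool

def simulate_payoffs(args):
    """Simulate n terminal prices under geometric Brownian motion and
    return the summed discounted call payoffs for this worker's chunk."""
    n, s0, k, r, sigma, t, seed = args
    rng = random.Random(seed)           # independent stream per worker
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n):
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - k, 0.0)
    return math.exp(-r * t) * total

def price_call(n_paths, s0, k, r, sigma, t, workers=4):
    """The only 'modification' needed for parallelism: split the paths
    across workers and average the partial results."""
    chunk = n_paths // workers
    jobs = [(chunk, s0, k, r, sigma, t, seed) for seed in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(simulate_payoffs, jobs)) / (chunk * workers)

if __name__ == "__main__":
    # An at-the-money call: the Black-Scholes value here is about 10.45.
    price = price_call(200_000, s0=100, k=100, r=0.05, sigma=0.2, t=1.0)
    print(round(price, 2))
```

The design choice is the point: because the domain expert knows where the computational load lives, the parallel version differs from the serial one only in how `price_call` dispatches the work.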
While thinking about these ideas, I had a chat with Ed Rothberg, co-founder and chief operating officer of Gurobi Optimization, a leading provider of software that quickly and efficiently solves linear, quadratic and mixed integer optimization problems. As Ed is an old friend from graduate school days, he patiently answered a long series of naïve questions from me, and in the process provided me with an even richer and more nuanced perspective.
In some sense, Rothberg suggested, I would be well served to think about an optimization solver such as Gurobi’s as a platform that is optimized to efficiently solve a highly structured class of problems while also striving to intelligently utilize detailed knowledge of the available computing resources. This platform, in turn, is the product of a group of developers with extensive knowledge and endless ideas about both the detailed structure of the abstract problem and the architecture and associated logic of the microprocessors that are being utilized to do these calculations. The power is in applying this knowledge and these ideas to abstract representations of ever larger “real-world” problems, because Moore’s Law tells us that the computing power will keep on growing at an exponential rate.
Except that, as both Rothberg and Dixon pointed out to me, Moore’s Law is headed for a cliff, a viewpoint that now seems to be relatively mainstream. Indeed, in a 2013 keynote speech at the Hot Chips conference at Stanford University, former Intel Chief Architect Robert Colwell bluntly predicted that, after a remarkably long period of exponential growth in the number of transistors per chip and in the speed of the CPUs built from them, the end of this phenomenon was just a few years down the road. To this relatively non-technical hardware user, the cause appears to be the increasingly expensive power and cooling costs associated with cramming so many transistors into such small spaces.
The implications are clear: The concept of smart parallel coding for business applications will be a huge factor in the increasingly data- and computationally intensive world of analytics. The work being done by Rothberg and his colleagues at Gurobi (and their competitors) will continue to deliver faster solutions to a large and important class of structured optimization problems. Moreover, analytics professionals will need to become increasingly attuned to Dixon’s broader point, that design patterns, and the understanding they engender of how to exploit the various computing resources available, matter, because the problems we tackle keep getting bigger at a rate faster than individual chips can be sped up.
I came away from my conversation with Rothberg with a newfound respect for the knowledge, experience, ideas and hard work that have gone into the creation of these smart optimization solvers, which spare me from thinking about any of the back-end processing when formulating my own representations of the optimization problems I encounter. I am, as always, most grateful for the permission to be ignorant here.
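That “permission to be ignorant” is easy to illustrate. In the sketch below (using SciPy’s `linprog` purely as a stand-in, not Gurobi’s API), the modeler states a small linear program declaratively, and the solver decides everything about how the answer is actually computed on the available hardware.

```python
# The modeler's view of an optimization solver: state the problem, ask for
# the answer. All solver internals are hidden behind one call.
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective coefficients.
res = linprog(c=[-3, -2],
              A_ub=[[1, 1], [1, 3]],
              b_ub=[4, 6],
              bounds=[(0, None), (0, None)])

optimal_value = -res.fun   # undo the negation: optimum is 12 at (x, y) = (4, 0)
print(optimal_value)
```

Everything below the `linprog` call, the choice of algorithm, the exploitation of sparsity, the use of the machine’s cores and caches, is the solver developers’ problem, not mine.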
However, when looking at data-intensive problems more generally, I’m going to need to get a bit more intimate with the computational load that I’m generating. And while Dixon’s reference to data scientists as domain experts struck me as a funny one at first (I’ve typically thought of domain experts as people deeply knowledgeable about the business context rather than the analytic representation of it), I now have a much better understanding of what he meant.
Given the exponential growth in the amount of data being analyzed and in the size and significance of the problems being solved, the inevitable end of Moore’s Law, and the fact that the holy grail of automated parallelization has yet to be successfully realized, the data scientists of today and tomorrow will have no choice but to be more aware of how their code is executed in heterogeneous parallel environments.
One unexpected takeaway from reading Dixon’s paper was a heightened appreciation for the sheer variety of component disciplines that are being harnessed together in this analytics revolution. The journey to a world of more and smarter data-driven decision-making includes advances in hardware design and manufacturing, software platform components, application development, human-computer interaction design, management policies and controls, and surely many others that I’m blissfully unaware of.
A lot of moving parts in this increasingly large, unwieldy parade – and there’s no sign of it slowing down.
Vijay Mehrotra (firstname.lastname@example.org) is a professor in the Department of Business Analytics and Information Systems at the University of San Francisco’s School of Management. He is also a longtime member of INFORMS.
1. See http://www.mansim.com/tech_talk.html for a list of conference papers about the use of simulation to analyze semiconductor fab performance.
2. See http://www.mooreslaw.org/ for more on Moore’s Law.
3. Dixon, M., 2015, “A pattern oriented approach for designing scalable analytics applications,” invited paper, PPAS 2015, Proceedings of the 2nd Annual Conference on Parallel Programming for Analytic Applications, Association for Computing Machinery, pp. 4-8 (available online at http://dl.acm.org/citation.cfm?id=2726939).