Understanding data miners
Data miner survey examines trends and reveals new insights.
By Karl Rexer, Heather N. Allen and Paul Gearan
For four years, Rexer Analytics has conducted annual surveys of the data mining community to assess the experiences and perspectives of data miners in a variety of areas. In 2010, Rexer sent out more than 10,000 invitations, and the survey was promoted in a variety of newsgroups and blogs. Each year the number and variety of respondents have increased, from 314 in the inaugural year (2007) to 735 in 2010. [Rexer Analytics did not specifically define “data miner” or “data mining”; the decision to participate in the survey was an individual choice.]
The data miners who responded to the 2010 survey come from more than 60 countries and represent many facets of the data mining community. The respondents include consultants, academics, data mining practitioners in companies large and small, government employees and representatives of data mining software companies. They are generally experienced and come from a variety of educational backgrounds.
Each year the survey asks data miners about the specifics of their modeling process and practices (algorithms, fields of interest, technology use, etc.), their priorities and preferences for analytic software, the challenges they face and how they address them, and their thoughts on the future of data mining. Each year the survey also includes several questions on special topics. Often these questions are selected from the dozens of suggestions we receive from members of the data mining community. For example, the 2010 survey included questions about text mining and also gathered information about the best practices in overcoming the key challenges that data miners face.
For a free copy of the 37-page summary report of the 2010 survey findings, e-mail DataMinerSurvey@RexerAnalytics.com.
Data Mining Practices
The data miners responding to the survey apply data mining in a diverse set of industries and fields. In all, more than 20 fields were mentioned in last year’s survey, from telecommunications to pharmaceuticals to military security. In each of the four years, CRM/marketing has been the field mentioned by the greatest number of respondents (41 percent in 2010). Many data miners also report working in financial services and in academia. Fittingly, “improving the understanding of customers,” “retaining customers” and other CRM goals were the goals identified by the most data miners.
Decision trees, regression and cluster analysis form a triad of core algorithms for most data miners. However, a wide variety of algorithms are being used. Time series, neural nets, factor analysis, text mining and association rules were all used by at least one quarter of respondents. This year, for the first time, the survey asked about ensemble models and uplift modeling. Twenty-seven percent of data mining consultants report using ensemble models; about 20 percent of corporate, academic and non-government organization data miners report using them. About 10 percent of corporate and consulting data miners report using uplift modeling, whereas this technique was only used by about 5 percent of academic and NGO/Gov’t data miners. Model “size” varied widely. About one-third of data miners typically utilize 10 or fewer variables in their final models, while about 28 percent generally construct models with more than 45 variables.
Text mining has emerged as a hot data-mining topic in the past few years, and the 2010 survey asked several questions about text mining. About a third of data miners currently incorporate text mining into their analyses, while another third plan to do so. Most data miners using text mining employ it to extract key themes for analysis (sentiment analysis) or as inputs in a larger model. However, a notable minority use text mining as part of social network analyses. According to the survey respondents, data miners employ STATISTICA Text Miner and IBM SPSS Modeler most frequently for text mining.
The survey also asked data miners working in companies whether most data mining is handled internally or externally (through consultants or vendor arrangements). Thirty-nine percent indicated that data mining is handled entirely internally, and 43 percent reported that it is handled mostly internally, while only 1 percent reported that it is handled entirely externally. Additionally, 14 percent of data miners reported that their organization offshores some of its data analytics (up from 8 percent in the previous year).
One of the centerpieces of the data miner survey over the years has been assessing priorities and preferences for data mining software packages. Data miners consistently indicate that the quality and accuracy of model performance, the ability to handle very large datasets and the variety of available algorithms are their top priorities when selecting data mining software.
Data miners report using an average of 4.6 software tools. After a steady rise across the past few years, the open source data mining software R overtook other tools to become the tool used by more data miners (43 percent) than any other. SAS and IBM SPSS Statistics are also used by more than 30 percent of data miners. STATISTICA, which has also been climbing in the rankings, was selected this year as the primary data-mining tool by the most data miners (18 percent).
The 5th Annual Data Miner Survey
Rexer Analytics recently launched its fifth annual data miner survey. In addition to continuing to collect data on trends in data miners’ practices and views, this year Rexer Analytics has included additional questions on data visualization, best practices in analytic project success measurement and online analytic resources. To participate in the 2011 survey, follow this survey participation link, and use access code INF28.
The summary report shows the differences in software preferences among corporate, consulting, academic and NGO/Government data miners. For example, STATISTICA, SAS, IBM SPSS Modeler and R all have strong penetration in corporate environments, whereas Matlab, the open source tools Weka and R, and the IBM SPSS tools have strong penetration among academic data miners.
The survey also asked data miners about their satisfaction with their tools. STATISTICA, IBM SPSS Modeler and R received the strongest satisfaction ratings in both 2010 and 2009. Data miners were most satisfied with their primary software on two of the items most important to them (quality and accuracy of performance, and variety of algorithms) but not as satisfied with the ability of their software to handle very large datasets. They were also highly satisfied with the dependability/stability of their software and its data manipulation capabilities. STATISTICA and R users were the most satisfied across a wide range of factors.
Data miners report that the computing environment for their data mining is frequently a desktop or laptop computer, and often the data is stored locally. Only a small number of data miners report using cloud computing. Model scoring typically happens using the same software that developed the models. STATISTICA users are more likely than other tool users to deploy models using PMML.
In each of Rexer Analytics’ previous Data Miner Surveys, respondents were asked to share their greatest challenges as data miners. In each year, “dirty data” emerged as the No. 1 challenge. Explaining data mining to others and difficulties accessing data have also persisted as top challenges year after year. Other challenges commonly identified include limitations of tools, difficulty finding qualified data miners and coordination with IT departments.
In the 2010 survey, data miners also shared best practices for overcoming the top challenges. Respondents shared a wide variety of best practices, coming up with some innovative approaches to these perennial challenges. Their ideas are summarized, along with verbatim comments (196 suggestions), on the website: www.rexeranalytics.com/Overcoming_Challenges.html.
Key challenge No. 1: Dirty Data. Eighty-five data miners described their experiences in overcoming this challenge. Key themes were the use of descriptive statistics, data visualization, business rules and consultation with data content experts (business users). Some example responses:
- In terms of dirty data, we use a combination of two methods: informed intuition and data profiling. Informed intuition required our human analysts to really get to know their data. Data profiling entails checking to see if the data falls into pre-defined norms. If it is outside the norms, we go through a data validation step to ensure that the data is in fact correct.
- Don’t forget to look at a missing data plot to easily identify systematic pattern of missing data (MD). Multiple imputation of MD is much better than not to calculate MD and suffer from “amputation” of your data set. Alternatively flag MD as new category and model it actively. MD is information! Use random forest (RF) as feature selection. I used to incorporate often too many variables which models just noise and is complex. With RF before modeling, I end up with only 5-10 variables and brilliant models.
- A quick K-means clustering on a data set reveals the worst as they often end up as single observation clusters.
- We calculate descriptive statistics about the data and visualize before starting the modeling process. Discussions with the business owners of the data have helped to better understand the quality. We try to understand the complexity of the data by looking at multivariate combinations of data values.
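The respondents do not say how they implemented these checks, so the following is only a minimal sketch of three of the ideas above in plain Python: profiling values against pre-defined norms, treating missing data as information, and using a quick k-means pass to surface single-observation clusters. The field names, the ranges in NORMS, the sample records and all function names are hypothetical. The one-variable k-means with farthest-first seeding is a simplification chosen to keep the example deterministic; it is not any respondent's actual method.

```python
from collections import Counter

# Hypothetical pre-defined norms: the allowed range for each numeric field.
NORMS = {"age": (0, 120), "monthly_spend": (0.0, 10_000.0)}


def profile(records, norms):
    """Return (row index, field, value) for each value outside its norm."""
    suspects = []
    for i, row in enumerate(records):
        for field, (low, high) in norms.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                suspects.append((i, field, value))
    return suspects


def missing_pattern(records, fields):
    """Count how often each combination of missing fields occurs."""
    return Counter(
        tuple(f for f in fields if row.get(f) is None) for row in records
    )


def flag_missing(records, field, label="MISSING"):
    """Return copies of the rows with a missing value recoded as a category."""
    flagged = []
    for row in records:
        copy = dict(row)
        if copy.get(field) is None:
            copy[field] = label
        flagged.append(copy)
    return flagged


def kmeans_1d(values, k, max_iter=100):
    """Plain Lloyd's k-means on one variable, with farthest-first seeding."""
    centroids = [values[0]]
    while len(centroids) < k:
        # Seed the next centroid with the point farthest from existing seeds.
        farthest = max(values, key=lambda v: min(abs(v - c) for c in centroids))
        centroids.append(farthest)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels


def singleton_clusters(values, labels):
    """Return the values that ended up alone in their cluster."""
    sizes = Counter(labels)
    return [v for v, lab in zip(values, labels) if sizes[lab] == 1]


# Invented sample data for illustration only.
records = [
    {"age": 34, "monthly_spend": 120.0, "segment": "retail"},
    {"age": 210, "monthly_spend": 80.0, "segment": None},
    {"age": None, "monthly_spend": -5.0, "segment": "business"},
    {"age": 51, "monthly_spend": 300.0, "segment": None},
]
spend = [120.0, 80.0, 95.0, 110.0, 300.0, 310.0, 9500.0]

suspects = profile(records, NORMS)
pattern = missing_pattern(records, ["age", "monthly_spend", "segment"])
flagged = flag_missing(records, "segment")
labels = kmeans_1d(spend, k=3)
lonely = singleton_clusters(spend, labels)
```

Values listed in suspects would go on to a validation step rather than being dropped automatically, and the counts in pattern give a rough, tabular stand-in for the missing data plot one respondent recommends.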
Key challenge No. 2: Explaining data mining to others. Sixty-five data miners described their experiences in overcoming this challenge. Key themes were the use of graphics, very simple examples and analogies, and focusing on the business impact of the data mining initiative. Some example responses:
- Leveraging “competing on analytics” and case studies from other organizations help build the power of the possible. Taking small impactful projects internally and then promoting those projects throughout the organization helps adoption. Finally, serving the data up in a meaningful application — BI tool — shows our stakeholders what data mining is capable of delivering.
- The problem is in getting enough time to lay out the problem and showing the solution. Most upper management wants short presentations but don’t have the background to just get the results. They often don’t buy into the solutions because they don’t want to see the background. Thus we try to work with their more ambitious direct reports who are more willing to see the whole presentation and, if they buy into it, will defend the solution with their immediate superiors.
- I’ve brought product managers (clients) to my desk and had them work with me on what analyses was important to them. That way I was able to manipulate the data on the fly based on their expertise to analyze different aspects that were interesting to them.
Key challenge No. 3: Difficulty accessing data. Forty-six data miners described their experiences in overcoming this challenge. Key themes were devoting resources to improving data availability and methods of overcoming organizational barriers. Some example responses:
- I usually would confer with the appropriate content experts in order to devise a reasonable heuristic to deal with unavailable data or impute variables. Difficult to access data means typically we don’t have a good plan for what needs to be collected. I talk with the product managers and propose data needs for their business problems. If we can match the business issues with the needs, data access and availability is usually resolved.
- A lot of traveling to the business unit site to work with the direct “customer” and local IT … generally put best practices into place after cleaning what little data we can find. Going forward we generally develop a project plan around better, more robust data collection.
The Future of Data Mining
Data miners are optimistic about continued growth in the number of projects they will be conducting in the near future. Seventy-three percent reported they conducted more projects in 2010 than they did in 2009, a trend that is expected to continue in 2011. This optimism is shared across data miners working in a variety of settings.
When asked about future trends in data mining, the largest number of respondents identified the growth in adoption of data mining as a key trend. Other key trends identified by multiple data miners are increases in text mining, social network analysis and automation.
Karl Rexer (krexer@RexerAnalytics.com) is president of Rexer Analytics, a Boston-based consulting firm that specializes in data mining and analytic CRM consulting. He founded Rexer Analytics in 2002 after many years working in consulting, retail banking and academia. He holds a Ph.D. in Experimental Psychology from the University of Connecticut. Heather Allen (hallen@RexerAnalytics.com) is a senior consultant at Rexer Analytics. She has built predictive models, customer segmentation, forecasting and survey research solutions for many Rexer Analytics clients. Prior to joining the company she designed financial aid optimization solutions for colleges and universities. She holds a Ph.D. in Clinical Psychology from the University of North Carolina at Chapel Hill. Paul Gearan (pgearan@RexerAnalytics.com) is a senior consultant at Rexer Analytics. He has built attrition analyses, text mining, predictive models and survey research solutions for many Rexer Analytics clients. His 2006 in-depth analyses of the NBA draft resulted in an appearance on ESPNews. He holds a master’s degree in Clinical Psychology from the University of Connecticut. More information about Rexer Analytics is available at www.RexerAnalytics.com. Questions about this research and requests for the free survey summary reports should be e-mailed to DataMinerSurvey@RexerAnalytics.com.