Fraud Detection: Networks vs. fraud: Connecting the dots
Well-constructed analytical models are useful in thwarting fraudsters, who often collaborate and thereby set up complex but revealing patterns of behavior.
By Bart Baesens, Véronique Van Vlasselaer and Wouter Verbeke
A typical organization loses about 5 percent of its revenues to fraud each year. The total cost of non-health insurance fraud in the United States is estimated to be more than $40 billion per year. Other common examples include credit card fraud, money laundering, healthcare fraud, telecommunications fraud, click fraud, tax evasion and many more. Given today’s fiercely competitive environment, narrowing margins and increasing shareholder pressure, more and more firms are treating fraud detection and prevention as a key strategic priority. To continuously outsmart the fraudsters, who keep perfecting their tactics thanks to emerging technologies (e.g., the Internet of Things), firms need new, efficient and effective tools such as big data and analytics.
Since data is becoming available in abundance and at a low cost, many firms are investing in massive data storage platforms (e.g., Hadoop). These untapped data reservoirs provide enormous potential to deploy analytics and uncover what mechanisms fraudsters use. Recently, we engaged in various active research partnerships to study how analytics can be used to detect both tax evasion and credit card fraud. This short article includes some of our key research findings.
Successful fraud analytical models should satisfy various requirements. First, they should achieve good statistical performance in terms of recall or hit rate (the percentage of fraudsters labeled by the analytical model as suspicious) and precision (the percentage of fraudsters among the ones labeled as suspicious). Next, the analytical models should not be based on complex mathematical formulas (such as neural networks and support vector machines), but they should provide clear insight into the fraud mechanisms adopted. This is particularly important since the insights gained will be used to develop new fraud prevention strategies. Finally, the operational efficiency of the fraud analytical model needs to be evaluated. This refers to the amount of resources needed to calculate the fraud score and adequately act upon it. For example, in a credit card fraud environment, a decision needs to be made within a few seconds after the transaction was initiated.
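The two statistical measures defined above can be made concrete with a small sketch. The function name and the toy data below are our own illustrative assumptions, not taken from any specific project; the code simply counts how many true fraudsters the model flagged.

```python
def recall_and_precision(is_fraud, flagged):
    """Return (recall, precision) for two parallel lists of booleans.

    is_fraud[i] -- True if case i is a confirmed fraudster
    flagged[i]  -- True if the model labeled case i as suspicious
    """
    true_positives = sum(1 for f, s in zip(is_fraud, flagged) if f and s)
    total_fraud = sum(is_fraud)
    total_flagged = sum(flagged)
    # Recall (hit rate): share of fraudsters that were flagged.
    recall = true_positives / total_fraud if total_fraud else 0.0
    # Precision: share of flagged cases that are fraudsters.
    precision = true_positives / total_flagged if total_flagged else 0.0
    return recall, precision


# Hypothetical toy data: 4 fraudsters among 10 cases; the model flags 5 cases.
is_fraud = [True, True, True, True, False, False, False, False, False, False]
flagged = [True, True, True, False, True, True, False, False, False, False]

recall, precision = recall_and_precision(is_fraud, flagged)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case the model catches three of the four fraudsters (recall of 0.75), while three of its five alerts are correct (precision of 0.60).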
Given all the aforementioned requirements, it becomes clear that simple analytical techniques (e.g., regression-based) should be adopted to detect fraud. To make sure that these models perform as well as possible, they should be fed with the best data available, so let’s zoom in on the data aspect in more detail.
Many analytical models assume that customer behavior is independent (the well-known IID assumption in statistics). Throughout our research in fraud detection, we have consistently found this assumption to be false! Fraud is a social phenomenon, and fraudsters often collaborate to set up complex fraudulent patterns (i.e., collusion). This principle is often referred to as homophily (people tend to associate or bond with similar others); in this context, it means that fraudsters are likely to be connected (in some way) to other fraudsters. Hence, to fully exploit the data available, we should start thinking about defining networks between the various data observations.
A network consists of nodes and edges that represent relationships between the nodes. A first example is corporate tax evasion whereby the nodes correspond to companies and the edges to relationships between them based on shared resources such as infrastructure, employees, equipment, etc. Another example is a credit card fraud detection setting, where two types of nodes can be distinguished: credit cards and merchants (academics would call this a bipartite graph). A link between a credit card and a merchant then corresponds to a transaction.
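As a minimal sketch of the credit card example, the bipartite network can be stored as two lookup tables built from a list of transactions. The card and merchant identifiers and the helper names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical transactions, each a (card_id, merchant_id) pair.
transactions = [
    ("card_A", "shop_1"),
    ("card_A", "shop_2"),
    ("card_B", "shop_2"),
    ("card_C", "shop_3"),
]


def build_bipartite(transactions):
    """Build both directions of the card-merchant network."""
    card_to_merchants = defaultdict(set)
    merchant_to_cards = defaultdict(set)
    for card, merchant in transactions:
        # Every transaction is an edge between one card and one merchant.
        card_to_merchants[card].add(merchant)
        merchant_to_cards[merchant].add(card)
    return card_to_merchants, merchant_to_cards


def cards_sharing_a_merchant(card, card_to_merchants, merchant_to_cards):
    """Return the other cards that were used at any of this card's merchants."""
    related = set()
    for merchant in card_to_merchants.get(card, set()):
        related |= merchant_to_cards.get(merchant, set())
    related.discard(card)
    return related


card_to_merchants, merchant_to_cards = build_bipartite(transactions)
print(cards_sharing_a_merchant("card_A", card_to_merchants, merchant_to_cards))
```

Because edges only run between a card and a merchant, two cards are never linked directly; they are related only through the merchants they have in common, which is what the second helper looks up.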
Networks are important, but a key challenge concerns their definition, which highly depends upon the fraud setting you are working in. Ideally, a business expert or fraud analyst should provide some starting insights about what might be interesting to consider as nodes and edges. Depending upon the application area, different types of nodes can be distinguished (e.g., customer, claim, car, mobile, car repair shop, etc.). The edges represent relationships that can be weighted based upon, for example, the recency of the relationship (more recent connections have a higher weight). Once the network has been properly defined, it can be visually explored and statistically tested for homophily.
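The sketch below illustrates the two ideas in this paragraph under assumptions of our own. The exponential decay with a 30-day half-life is just one possible recency weighting, and the dyadicity ratio (observed fraud-to-fraud edges divided by the number expected if edges were placed at random) is one simple way to check for homophily; neither is presented here as the only or the definitive choice.

```python
def recency_weight(age_in_days, half_life_days=30.0):
    """Weight an edge by recency: 1.0 today, halved every half_life_days.

    The exponential form and the default half-life are illustrative assumptions.
    """
    return 0.5 ** (age_in_days / half_life_days)


def dyadicity(edges, fraud_nodes, all_nodes):
    """Ratio of observed to expected fraud-fraud edges (unweighted network).

    Values clearly above 1 suggest that fraudsters are more densely
    connected to each other than random mixing would produce.
    """
    n = len(all_nodes)
    n_fraud = len(fraud_nodes)
    possible_pairs = n * (n - 1) / 2
    possible_fraud_pairs = n_fraud * (n_fraud - 1) / 2
    if possible_pairs == 0 or possible_fraud_pairs == 0 or not edges:
        return float("nan")
    # Probability that a randomly chosen pair of nodes is connected.
    connectance = len(edges) / possible_pairs
    expected = possible_fraud_pairs * connectance
    observed = sum(1 for a, b in edges if a in fraud_nodes and b in fraud_nodes)
    return observed / expected


# Hypothetical toy network: six companies, three of them known fraudsters.
all_nodes = {"a", "b", "c", "d", "e", "f"}
fraud_nodes = {"a", "b", "c"}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("c", "d")]

print(recency_weight(30))                        # a 30-day-old link counts half
print(dyadicity(edges, fraud_nodes, all_nodes))  # roughly 3 for this toy network
```

In practice such a ratio would be complemented by visual exploration and a formal statistical test, as noted above.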
If homophily is detected, the network can be “featurized.” The idea here is to create features summarizing the key characteristics of the network, and add those to the data for analytical modeling. A simple example of a feature could be the number of fraudulent customers that a given customer is connected to. The more network features added to the data, the better, since most analytical techniques have built-in facilities (e.g., variable selection) to detect which are the important ones. Essentially, featurization is a data enrichment operation, which will allow the analytical technique (e.g., regression) to disentangle complex, network-based fraud patterns.
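To make the featurization step concrete, here is a minimal sketch that derives three network features for one customer. The customer identifiers, the adjacency table and the feature names are hypothetical; a real application would compute many more features and join them to the existing customer data.

```python
def network_features(customer, neighbors, known_fraud):
    """Summarize a customer's direct neighborhood as model-ready features.

    neighbors   -- dict mapping each customer to the set of connected customers
    known_fraud -- set of customers already confirmed as fraudulent
    """
    linked = neighbors.get(customer, set())
    degree = len(linked)
    fraud_neighbors = sum(1 for other in linked if other in known_fraud)
    return {
        "degree": degree,
        "fraud_neighbors": fraud_neighbors,
        # Share of connections that are fraudulent; 0.0 for isolated customers.
        "fraud_neighbor_ratio": fraud_neighbors / degree if degree else 0.0,
    }


# Hypothetical toy network of four customers.
neighbors = {
    "c1": {"c2", "c3", "c4"},
    "c2": {"c1"},
    "c3": {"c1"},
    "c4": {"c1"},
}
known_fraud = {"c2", "c4"}

features_c1 = network_features("c1", neighbors, known_fraud)
print(features_c1)
```

Each resulting dictionary becomes a set of extra columns for that customer, next to the usual non-network variables, before a technique such as regression is applied.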
- Analytical fraud models should have high hit rates and precision, good interpretability and operational efficiency.
- Fraud is a social phenomenon; fraudsters often act in collaboration with other fraudsters.
- Networks are important to disentangle complex fraud patterns.
- Featurization is a data enrichment operation whereby network features are added to the data for improved analytical modeling.
We welcome any shared experiences (both confirming and contradicting).
Bart Baesens (Bart.Baesens@kuleuven.be) is a professor in the Decision Sciences and Information Management (DSIM) department at Katholieke Universiteit (KU) Leuven (Belgium), where he teaches the course “Fraud Analytics” and an e-learning class on “Advanced Analytics in a Big Data World.” He is also program coordinator of the Master of Information Management degree program, which offers a specialized data track. Baesens is the author of the book “Analytics in a Big Data World” and co-author (with Van Vlasselaer and Verbeke) of the book “Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques.”
Véronique Van Vlasselaer (Veronique.firstname.lastname@example.org) is a Ph.D. researcher in the DSIM department at KU Leuven. She graduated magna cum laude as Master Information Systems Engineer at the faculty of Business and Economics, KU Leuven. Van Vlasselaer received the best thesis award from the faculty’s student branch for “Mining Data on Twitter.”
Wouter Verbeke (Wouter.Verbeke@vub.ac.be) is a professor of Business Informatics and Business Analytics at Vrije Universiteit (VU) Brussel (Belgium). His research is situated in the fields of data mining, predictive analytics and complex network analysis, and is driven by real-life business problems that require a data-driven solution, including applications in marketing, finance, supply chain management, mobility and human resources.
- Baesens, B., “Analytics in a Big Data World,” Wiley, 2014.
- Baesens, B., Van Vlasselaer, V. and Verbeke, W. “Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection,” Wiley, 2015 (forthcoming).