Online Banking Customers: Real-time fraud detection in the cloud
By Saurabh Tandon
This article explores how to detect fraud among online banking customers in near real time by running a combination of learning algorithms on a data set that includes customer transactions and demographic data. It also explores how a “cloud environment” can be used to deploy these fraud detection algorithms quickly and meet computational demands at a fraction of the cost of setting up a traditional data center and acquiring and configuring new hardware and networks.
Real-time decision-making is becoming increasingly valuable with the advancement of data collection and analytical techniques. As data processing speeds increase, the classical data warehousing model is moving toward a real-time model. Cloud-based platforms enable the rapid development and deployment of applications, thereby reducing the lag between data acquisition and actionable insight. Put differently, the creation-to-consumption cycle is becoming shorter, which enables corporations to test and iterate on their business hypotheses much more quickly. Some examples of such applications include:
- A product company getting real-time feedback on its new releases from social media data after a product launch.
- Real-time recommendations for food and entertainment based on a customer’s location.
- Traffic signal operations based on real-time information of traffic volumes.
- E-commerce websites and credit firms determining in real time whether customer transactions are authentic or fraudulent.
- Providing more targeted coupons based on customers’ recent purchases and location.
From a technology architecture perspective, a cloud-based ecosystem can enable users to build an application that detects, in real time, fraudulent customers based on their demographic information and prior financial history. Multiple algorithms help detect fraud, and the output is aggregated to improve prediction accuracy.
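As one illustration of how the outputs of several algorithms might be aggregated, the sketch below uses a simple majority vote. The article does not specify the aggregation rule, so the voting scheme, the tie-breaking choice, the function name `aggregate_votes` and the model names are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_votes(predictions):
    """Combine per-model labels (True = fraudulent) by simple majority vote.

    `predictions` maps a model name to that model's label for one transaction.
    Ties are resolved as fraudulent so the case can be reviewed
    (an assumption, not something the article specifies).
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    counts = Counter(predictions.values())
    return counts[True] >= counts[False]


# Hypothetical outputs from three models for a single transaction.
votes = {"logistic_regression": True, "som": False, "svm": True}
is_fraud = aggregate_votes(votes)
```

A weighted vote or an average of probabilistic scores would be equally plausible alternatives; the majority vote is used here only because it is the simplest rule to read.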
But Why Use the Cloud?
A system that allows the development of applications capable of churning out results in real time needs multiple services running in tandem and is highly resource intensive. By deploying the system in the cloud, maintenance and load balancing of the system can be handled efficiently and cost effectively. In fact, most cloud systems function as “pay as you go” and charge the user only for actual usage rather than for maintenance and monitoring. “Intelligent” cloud systems also recommend when users should dial the resources available to the fraud detection algorithms up or down, so that users do not have to worry about the data-engineering layer.
Since multiple algorithms are run on the same data to enable fraud detection, a real-time agent paradigm is needed to run the algorithms. An agent is an autonomous entity that may expect inputs and send outputs after performing a set of instructions. In a real-time system, these agents are wired together with directed connections to form an agency. An agent typically has one of two behaviors: cyclic or triggered. Cyclic agents, as the name suggests, run continuously in a loop and do not need any input. These are usually the first agents in an agency and are used for streaming data to the agency by connecting to an external real-time data source. In short, their tasks are “well-defined and repetitive.”
A triggered agent, on the other hand, runs every time it receives a message from a cyclic agent or another triggered agent. The “message” defines the function that the triggered agent needs to perform. To synthesize, these agents allow multiple tasks to be handled in parallel to enable faster data processing.
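The two agent behaviors described above can be sketched with threads and queues from the Python standard library. This is a minimal illustration under stated assumptions, not the framework used in the article: the names `cyclic_agent`, `triggered_agent`, `run_agency` and the `STOP` sentinel are hypothetical, and the cyclic agent here stops when its finite source is exhausted, whereas a production agent would keep looping on a live feed.

```python
import queue
import threading

STOP = object()  # sentinel telling a downstream agent to shut down


def cyclic_agent(source, outbox):
    """Loops without waiting for input, streaming records into the agency."""
    for record in source:  # stands in for an external real-time data source
        outbox.put(record)
    outbox.put(STOP)


def triggered_agent(inbox, outbox, handler):
    """Runs `handler` each time a message arrives, then forwards the result."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)
            break
        outbox.put(handler(message))


def run_agency(transactions, handler):
    """Wire one cyclic agent to one triggered agent and collect the outputs."""
    stream, results = queue.Queue(), queue.Queue()
    agents = [
        threading.Thread(target=cyclic_agent, args=(transactions, stream)),
        threading.Thread(target=triggered_agent, args=(stream, results, handler)),
    ]
    for agent in agents:
        agent.start()
    for agent in agents:
        agent.join()
    outputs = []
    while True:
        item = results.get()
        if item is STOP:
            return outputs
        outputs.append(item)


# Hypothetical rule: flag any amount above 1,000 as suspicious.
flags = run_agency([120.0, 4500.0, 80.0], lambda amount: amount > 1000)
```

Running several triggered agents off the same stream, each with a different handler, is what would let multiple algorithms process the same transaction in parallel.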
The above approach combines the strengths of cloud computing and machine learning algorithms. It gives a small company, or even a startup that is unlikely to have the specialized staff and infrastructure that such a computationally intensive approach requires, the ability to build a system that makes decisions based on historical transactions.
Creating the Analytical Data Set
For the specific use case of fraud detection for financial transactions, consider the following work that Mu Sigma did with a client in the financial services industry. The data set used to build the application comprised various customer demographic variables and financial information, such as age, residential address, office address, income type, prior shopping history, income, bankruptcy filing status, etc. The predicted variable is binary (whether the transaction is fraudulent or not). In all, about 250 unique variables pertaining to the demographic and financial history of the customers were considered.
To reduce the number of variables for modeling, techniques such as Random Forest were used to understand the significance of the variables and their relative importance. A cutoff was then applied to select a subset of this variable list that could be used to classify a financial transaction as fraudulent or not with an acceptable level of accuracy.
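The cutoff step can be sketched as follows. The article gives neither the importance scores nor the cutoff value, so the numbers below are hypothetical placeholders and `select_variables` is an illustrative name; in practice the scores would come from the fitted Random Forest.

```python
def select_variables(importances, cutoff):
    """Keep variables whose importance is at least `cutoff`, most important first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= cutoff]


# Hypothetical importance scores, NOT results from the article.
importances = {
    "income": 0.31,
    "bankruptcy_filing_status": 0.27,
    "prior_shopping_history": 0.22,
    "age": 0.15,
    "office_address": 0.05,
}
selected = select_variables(importances, cutoff=0.10)
```

The cutoff trades model simplicity against accuracy, so it would normally be tuned by checking how accuracy changes as variables are dropped.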
Algorithms to Detect Fraud
The analytical data set defined above is analyzed via a combination of techniques such as logistic regression (LR), self-organizing maps (SOM) and support vector machines (SVM). Perhaps the most easily understood of these is logistic regression, which assigns each financial transaction a probabilistic score for its likelihood of being fraudulent. It does so as a function of the variables identified above as important predictors of fraud. SOMs are unsupervised learning algorithms that look for patterns across transactions and then “self-organize” these transactions into fraudulent and non-fraudulent segments. As the volume of transactions increases, so does the accuracy of this self-organization. Unlike SOMs, SVMs are supervised learning techniques generally used for classifying data; they are trained on a data set that includes verified fraudulent and non-fraudulent transactions. The combination of LR, SOMs and SVMs ensures that the past is studied, the present is analyzed in real time, and learnings from both are fed back into the fraud detection framework to make it more accurate over time.
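To make the logistic regression score concrete, the sketch below computes the probability as the sigmoid of a weighted sum of the predictor variables. The feature names, weights and intercept are hypothetical values chosen for illustration; a real model would learn them from the training data.

```python
import math


def fraud_probability(features, weights, intercept):
    """Logistic regression score: sigmoid of the weighted sum of the features."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    # Numerically stable sigmoid that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Hypothetical coefficients and one hypothetical transaction.
weights = {"amount_zscore": 1.2, "new_device": 0.8}
transaction = {"amount_zscore": 2.0, "new_device": 1.0}
score = fraud_probability(transaction, weights, intercept=-3.0)
```

A transaction would then be labeled fraudulent when its score exceeds a chosen threshold, which is itself a tuning decision.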
Results and Model Validation
In this case, the models were trained on 70 percent of the transaction data, with the remainder streamed to the agency framework discussed above to simulate real-time financial transactions. The modeling data set was under-sampled to bring the ratio of non-fraudulent to fraudulent transactions down to 10:1 (the original ratio was 20:1).
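The split and the under-sampling step might look like the sketch below. It assumes each transaction is a dictionary with a boolean `is_fraud` key and that the split is random; neither detail is given in the article, and the function name is hypothetical.

```python
import random


def split_and_undersample(transactions, train_fraction=0.7, ratio=10, seed=0):
    """Split into training and holdout sets, then under-sample non-fraud rows.

    `ratio` is the target number of non-fraudulent rows per fraudulent row
    in the training set. The holdout set keeps its original class balance.
    """
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, holdout = shuffled[:cut], shuffled[cut:]

    fraud = [t for t in train if t["is_fraud"]]
    legit = [t for t in train if not t["is_fraud"]]
    keep = min(len(legit), ratio * len(fraud))
    balanced = fraud + rng.sample(legit, keep)
    rng.shuffle(balanced)
    return balanced, holdout
```

Only the training portion is rebalanced, so the holdout stream still reflects the class mix that the deployed system would see.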
The final output of the agency is the classification of the streaming input transactions as fraudulent or not. Because the value of the predicted variable is already known for this data, the accuracy of the aggregated model can be gauged, as shown in Figure 1.
Figure 1: Accuracy of fraud detection.
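Because the true labels are known for the streamed transactions, predictions can be scored directly against them. The sketch below computes accuracy along with precision and recall, which are added here because accuracy alone can look high on imbalanced data even when few fraudulent cases are caught. The labels in the example are hypothetical and are not the results behind Figure 1.

```python
def evaluate(actual, predicted):
    """Compare known labels with predictions (True = fraudulent)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # fraud correctly flagged
        elif p:
            fp += 1      # legitimate transaction wrongly flagged
        elif a:
            fn += 1      # fraud that was missed
        else:
            tn += 1      # legitimate transaction correctly passed
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels for five streamed transactions.
actual = [True, False, False, True, False]
predicted = [True, False, True, False, False]
metrics = evaluate(actual, predicted)
```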
Fraud detection can be improved by running an ensemble of algorithms in parallel and aggregating the predictions in real time. The entire end-to-end application can be designed and deployed in days, depending on the complexity of the data, the variables to be considered and the algorithmic sophistication desired. Deploying it in the cloud makes it horizontally scalable, owing to effective load balancing and hardware maintenance. It also provides higher data security and makes the system fault tolerant by making processes mobile. This combination of a real-time application development system and cloud-based computing enables even non-technical teams to rapidly deploy applications.
Saurabh Tandon is a senior manager with Mu Sigma (http://www.mu-sigma.com/). He has over a decade of experience working in analytics across various domains including banking, financial services and insurance. Tandon holds an MBA in strategy and finance from the Kellogg School of Management and a master’s in quantitative finance from the Stuart Graduate School of Business.