Analytics Magazine

Online Banking Customers: Real-time fraud detection in the cloud

November/December 2014

By Saurabh Tandon

This article explores how to detect fraud among online banking customers in near real time by running a combination of learning algorithms on a data set that includes customer transactions and demographic data. It also explores how the “cloud environment” can be used to deploy these fraud detection algorithms in short order, meeting computational demands at a fraction of the cost of setting up traditional data centers and acquiring and configuring new hardware and networks.

Real-time decision-making is becoming increasingly valuable with the advancement of data collection and analytical techniques. Due to the increase in data processing speeds, the classical data warehousing model is moving toward a real-time model. Cloud-based platforms enable the rapid development and deployment of applications, thereby reducing the lag between data acquisition and actionable insight. Put differently, the creation-to-consumption cycle is becoming shorter, which enables corporations to experiment and iterate on their business hypotheses much more quickly. Some examples of such applications include:

  • A product company getting real-time feedback on its new releases using social media data gathered after the product launch.
  • Real-time recommendations for food and entertainment based on a customer’s location.
  • Traffic signal operations based on real-time information about traffic volumes.
  • E-commerce websites and credit firms detecting, in real time, whether customer transactions are authentic or fraudulent.
  • More targeted coupons based on customers’ recent purchases and location.

From a technology architecture perspective, a cloud-based ecosystem can enable users to build an application that detects, in real time, fraudulent customers based on their demographic information and prior financial history. Multiple algorithms help detect fraud, and the output is aggregated to improve prediction accuracy.

But Why Use the Cloud?

A system that supports applications capable of churning out results in real time needs multiple services running in tandem and is highly resource intensive. By deploying the system in the cloud, maintenance and load balancing can be handled efficiently and cost effectively. In fact, most cloud systems operate on a “pay as you go” basis and charge the user only for actual usage, rather than for maintenance and monitoring overhead. “Intelligent” cloud systems also recommend when to dial the resources available to the fraud detection algorithms up or down, so users need not worry about the data-engineering layer.

Since multiple algorithms are run on the same data to enable fraud detection, a real-time agent paradigm is needed to run them. An agent is an autonomous entity that may expect inputs and send outputs after performing a set of instructions. In a real-time system, these agents are wired together with directed connections to form an agency. An agent typically exhibits one of two behaviors: cyclic or triggered. Cyclic agents, as the name suggests, run continuously in a loop and do not need any input. These are usually the first agents in an agency and stream data into it by connecting to an external real-time data source. In short, their tasks are “well-defined and repetitive.”

A triggered agent, on the other hand, runs every time it receives a message from a cyclic agent or another triggered agent. The “message” defines the function that the triggered agent needs to perform. Taken together, these agents allow multiple tasks to be handled in parallel, enabling faster data processing.
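A minimal Python sketch of this agent pattern is shown below. The class names, wiring and scoring stub are illustrative assumptions rather than the application’s actual framework; they simply show a cyclic agent streaming records to triggered agents connected by directed links.

```python
import time

class TriggeredAgent:
    """Runs its task every time it receives a message from an upstream agent."""
    def __init__(self, task, downstream=None):
        self.task = task                      # function applied to each incoming message
        self.downstream = downstream or []    # agents that receive this agent's output

    def receive(self, message):
        result = self.task(message)
        for agent in self.downstream:
            agent.receive(result)

class CyclicAgent:
    """Runs continuously, pulling records from a feed and pushing them into the agency."""
    def __init__(self, source, downstream):
        self.source = source                  # iterable standing in for a real-time data source
        self.downstream = downstream

    def run(self):
        for record in self.source:
            for agent in self.downstream:
                agent.receive(record)
            time.sleep(0.01)                  # pacing to mimic streaming arrival

# Hypothetical wiring: a streaming agent feeds a scoring agent, which feeds a logging agent
log_agent = TriggeredAgent(task=print)
score_agent = TriggeredAgent(task=lambda txn: {"txn": txn, "score": 0.5}, downstream=[log_agent])
streamer = CyclicAgent(source=[{"amount": 120.0}, {"amount": 9999.0}], downstream=[score_agent])
streamer.run()
```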

The above approach combines the strengths and synergies of cloud computing and machine learning algorithms. It gives a small company, or even a startup unlikely to have the specialized staff and infrastructure that such a computationally intensive approach requires, the ability to build a system that makes decisions based on historical transactions.

Creating the Analytical Data Set

For the specific use case of fraud detection in financial transactions, consider the following work that Mu Sigma did with a client in the financial services industry. The data set used to build the application comprised various customer demographic and financial variables, such as age, residential address, office address, income type, income, prior shopping history, bankruptcy filing status, etc. The variable being predicted is binary: whether or not the transaction is fraudulent. In all, about 250 unique variables pertaining to the demographic and financial history of the customers were considered.

To reduce the number of variables for modeling, techniques such as random forests were used to assess the significance of the variables and their relative importance. A cutoff on importance was then applied to select a subset of variables that could classify a financial transaction as fraudulent or not with an acceptable level of accuracy.
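As an illustration of this step, a random-forest importance ranking can be sketched with scikit-learn. The synthetic data, variable names and importance cutoff below are placeholders, not the client’s actual data or threshold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the ~250 demographic/financial predictors and the binary fraud flag
X_arr, y = make_classification(n_samples=5000, n_features=250, n_informative=30, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"var_{i}" for i in range(250)])

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X, y)

# Rank the variables by importance and keep those above a chosen cutoff
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
selected = importances[importances > 0.005].index.tolist()   # cutoff is illustrative
print(f"Kept {len(selected)} of {X.shape[1]} variables for modeling")
```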

Algorithms to Detect Fraud

The analytical data set defined above is analyzed via a combination of techniques such as logistic regression (LR), self-organizing maps (SOM) and support vector machines (SVM). Perhaps the most easily understood of these is logistic regression, which assigns each financial transaction a probabilistic score for its likelihood of being fraudulent, as a function of the variables identified above as important predictors of fraud. SOMs are fascinating, unsupervised learning algorithms that look for patterns across transactions and then “self-organize” those transactions into fraudulent and non-fraudulent segments; as the volume of transactions increases, so does the accuracy of this self-organization. SVMs, by contrast, are supervised learning techniques generally used for classification, trained on a data set that includes verified fraudulent and non-fraudulent transactions. Combining LR, SOMs and SVMs ensures that the past is studied, the present is analyzed in real time, and learnings from both are fed back into the fraud detection framework, making it better and more accurate over time.
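The sketch below shows, under stated assumptions, how two of these models (LR and SVM) might be trained and their probabilistic scores averaged into a single fraud score. The SOM component is omitted for brevity, and synthetic data stands in for the real transaction variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic labeled transactions stand in for the selected historical variables
X_train, y_train = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)

lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
lr.fit(X_train, y_train)
svm.fit(X_train, y_train)

def fraud_score(transactions):
    """Average the probabilistic scores from the individual models."""
    p_lr = lr.predict_proba(transactions)[:, 1]
    p_svm = svm.predict_proba(transactions)[:, 1]
    return (p_lr + p_svm) / 2

# A transaction is flagged as fraudulent when the combined score crosses a chosen threshold
scores = fraud_score(X_train[:5])
print((scores > 0.5).astype(int))
```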

Results and Model Validation

In this case, the models were trained on 70 percent of the transaction data, with the remainder streamed to the agency framework discussed above to simulate real-time financial transactions. The modeling data set was under-sampled to bring the ratio of non-fraudulent to fraudulent transactions down from the original 20:1 to 10:1.
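A minimal sketch of the under-sampling step, using a small synthetic data frame in place of the actual modeling data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the modeling portion: roughly 20 non-fraud rows per fraud row
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.exponential(100.0, 2100),
    "is_fraud": np.r_[np.ones(100, dtype=int), np.zeros(2000, dtype=int)],
})

fraud = df[df["is_fraud"] == 1]
non_fraud = df[df["is_fraud"] == 0]

# Under-sample the majority class so the non-fraud to fraud ratio drops from ~20:1 to 10:1
non_fraud_sample = non_fraud.sample(n=10 * len(fraud), random_state=42)
balanced = pd.concat([fraud, non_fraud_sample]).sample(frac=1, random_state=42)  # shuffle rows
print(balanced["is_fraud"].value_counts())
```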

The final output of the agency is the classification of the streaming input transactions as fraudulent or not. Since the value of the predicted variable is already known for this data, it can be used to gauge the accuracy of the aggregated model, as shown in Figure 1.

Figure 1: Accuracy of fraud detection.
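Because the true labels of the streamed transactions are known, the aggregated model’s output can be scored with standard measures. The arrays below are placeholders for the streamed labels and the model’s classifications, not the actual results reported in Figure 1.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder arrays standing in for the streamed transactions' known labels
# and the aggregated model's real-time classifications
y_stream = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred   = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

print(confusion_matrix(y_stream, y_pred))
print(classification_report(y_stream, y_pred, target_names=["non-fraud", "fraud"]))
```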

Conclusion

Fraud detection can be improved by running an ensemble of algorithms in parallel and aggregating their predictions in real time. The entire end-to-end application can be designed and deployed in days, depending on the complexity of the data, the variables to be considered and the algorithmic sophistication desired. Deploying it in the cloud makes the system horizontally scalable, owing to effective load balancing and hardware maintenance. The cloud also provides higher data security and makes the system fault tolerant by making processes mobile. This combination of a real-time application development system and cloud-based computing enables even non-technical teams to rapidly deploy applications.


Saurabh Tandon is a senior manager with Mu Sigma (http://www.mu-sigma.com/). He has over a decade of experience working in analytics across various domains including banking, financial services and insurance. Tandon holds an MBA in strategy and finance from the Kellogg School of Management and a master’s in quantitative finance from the Stuart Graduate School of Business.
