Credit card fraud is a plague that all financial institutions are at risk of. Fraud detection is usually very difficult because fraudsters keep coming up with new and innovative ways of committing fraud, so it is hard to find a pattern that we can detect. For instance, in the diagram all the icons look identical, but there is one icon that is slightly different from the rest, and we have to pick that one. Can you spot it?
Here it is:
With this background, here is the plan for today and what you will learn in the context of our use case, 'Credit Card Fraud Detection':
1. What is data imbalance
2. Possible causes of data imbalance
3. Why class imbalance is a problem in machine learning
4. Quick refresher on the Random Forest algorithm
5. Different sampling methods to deal with data imbalance
6. Comparison of which method works well in our context, with a practical demonstration in Python
7. Business insight on which model to choose and why
Normally, since the number of fraudulent transactions is not large, we have to work with data that typically contains a lot of non-fraud cases compared to fraud cases. In technical terms such a dataset is called 'imbalanced data'. But it is still essential to detect the fraud cases, because even one fraudulent transaction can cause millions in losses to banks/financial institutions. Now, let us delve deeper into what data imbalance is.
We will be using the credit card fraud dataset from https://www.kaggle.com/mlg-ulb/creditcardfraud (Open Data License).
Formally, data imbalance means that the distribution of samples across the different classes is unequal. In our binary classification problem, there are two classes:
a) Majority class: the non-fraudulent/genuine transactions
b) Minority class: the fraudulent transactions
In the dataset considered, the class distribution is as follows (Table 1):
Table 1: Class distribution
Non-Fraud: 284,315 records (99.83%)
Fraud: 492 records (0.17%)
As we can observe, the dataset is highly imbalanced, with only 0.17% of the observations in the fraudulent class.
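As a quick sanity check, this distribution can be reproduced with a few lines of pandas. A minimal sketch, assuming the Kaggle CSV has been downloaded locally as 'creditcard.csv' and that the target column is named 'Class' (1 = fraud, 0 = non-fraud):
# Load the dataset and inspect the class distribution
import pandas as pd
df = pd.read_csv('creditcard.csv')  # assumed local filename
x = df.drop('Class', axis=1)  # features
y = df['Class']  # target: 1 = fraud, 0 = non-fraud
print(y.value_counts())  # absolute counts per class
print(y.value_counts(normalize=True) * 100)  # percentage per class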
There can be two major causes of data imbalance:
a) Biased sampling/measurement errors: This is due to collecting samples from only one class or a particular region, or to samples being misclassified. It can be resolved by improving the sampling methods.
b) Use case/domain characteristic: A more pertinent cause, as in our case, is the prediction of a rare event, which automatically skews the data towards the majority class because the minority class rarely occurs in practice.
Class imbalance is a problem because most machine learning algorithms focus on learning from the occurrences that happen frequently, i.e. the majority class; this is called the frequency bias. On imbalanced datasets these algorithms therefore tend not to work well. Techniques that typically do work well are tree-based algorithms and anomaly detection algorithms. Traditionally, business-rule-based methods are often used in fraud detection problems. Tree-based methods work well because a tree creates a rule-based hierarchy that can separate both classes. Decision trees tend to over-fit the data, and to eliminate this possibility we will go with an ensemble method. For our use case, we will use the Random Forest algorithm today.
Random Forest works by constructing multiple decision tree predictors; the mode of the classes predicted by the individual trees is the final chosen class or output. It is like voting for the most popular class. For example: if 2 trees predict that Rule 1 indicates fraud while another tree predicts that Rule 1 indicates non-fraud, then according to the Random Forest algorithm the final prediction will be fraud.
Formal definition: A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, …} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. (Source)
Each tree depends on a random vector that is sampled independently, and all trees have the same distribution. The generalization error converges as the number of trees increases. In its splitting criterion, Random Forest searches for the best feature among a random subset of features, and we can also compute variable importance and accordingly do feature selection. The trees can be grown using the bagging technique, where observations are chosen at random (with replacement) from the training set. The other method is random split selection, where a random split is chosen from the K best splits at each node.
You can read more about it here.
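To make the voting and variable-importance ideas concrete, here is a minimal sketch using scikit-learn; the toy dataset from make_classification is purely for illustration, not the fraud data:
# Illustrate majority voting and variable importance in a Random Forest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=5, random_state=0)
rf.fit(X_toy, y_toy)

# Each fitted tree casts a vote for one observation; predict() returns the mode
votes = [tree.predict(X_toy[:1])[0] for tree in rf.estimators_]
print('Votes from individual trees:', votes)
print('Final prediction (majority vote):', rf.predict(X_toy[:1])[0])

# Impurity-based variable importance, usable for feature selection
print('Feature importances:', rf.feature_importances_)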
We will now illustrate 3 sampling methods that can deal with data imbalance.
a) Random under-sampling: Random draws are taken from the non-fraud observations, i.e. the majority class, to match the number of fraud observations, i.e. the minority class. This means we are throwing away some information from the dataset, which may not always be ideal.
b) Random over-sampling: In this case we do the exact opposite of under-sampling, i.e. we duplicate the minority class (fraud) observations at random to increase the number of minority samples until we get a balanced dataset. A possible limitation is that we are creating a lot of duplicates with this method.
c) SMOTE (Synthetic Minority Over-sampling Technique) is another method that creates synthetic data with KNN instead of using duplicates. Each minority class example together with its k nearest neighbours is considered. Synthetic examples are then created along the line segments that join the minority class examples and their k nearest neighbours. This is illustrated in Fig 3 below:
With plain over-sampling, the decision boundary becomes smaller, while with SMOTE we can create larger decision regions, thereby improving the chance of capturing the minority class.
One possible limitation is that if the minority class, i.e. the fraudulent observations, is spread throughout the data and not distinct, then using nearest neighbours to create more fraud cases introduces noise into the data, and this can lead to misclassification.
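The interpolation at the heart of SMOTE is simple enough to write out directly. A minimal NumPy sketch of how one synthetic sample is generated between a minority observation and one of its nearest neighbours (the two points are made up for illustration):
# SMOTE interpolation: x_new = x_i + lam * (x_nn - x_i), with lam in [0, 1]
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])  # a minority (fraud) sample
x_nn = np.array([2.0, 3.0])  # one of its k nearest minority neighbours
lam = rng.uniform(0, 1)  # random position along the line segment
x_new = x_i + lam * (x_nn - x_i)  # synthetic minority sample
print(x_new)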
Some of the metrics useful for judging the performance of a model are listed below. These metrics provide a view of how well/how accurately the model is able to predict/classify the target variable(s):
· TP (True positive)/TN (True negative) are the cases of correct predictions, i.e. predicting fraud cases as fraud (TP) and predicting non-fraud cases as non-fraud (TN)
· FP (False positive) are the cases that are actually non-fraud but the model predicts as fraud
· FN (False negative) are the cases that are actually fraud but the model predicts as non-fraud
Precision = TP / (TP + FP): Precision measures how accurately the model is able to capture fraud, i.e. out of the total predicted fraud cases, how many actually turned out to be fraud.
Recall = TP / (TP + FN): Recall measures, out of all the actual fraud cases, how many the model could predict correctly as fraud. This is a very important metric here.
Accuracy = (TP + TN) / (TP + FP + FN + TN): Measures how many of the majority as well as minority class samples could be classified correctly.
F-score = 2*TP / (2*TP + FP + FN) = 2*Precision*Recall / (Precision + Recall); this is a balance between precision and recall. Note that precision and recall are inversely related, hence the F-score is a measure to strike a balance between the two.
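A minimal sketch of how these metrics follow from the four confusion-matrix counts; the counts below are made-up numbers for illustration:
# Compute precision, recall, accuracy and F-score from confusion-matrix counts
tp, fp, fn, tn = 80, 15, 20, 9000  # hypothetical counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
f_score = 2 * precision * recall / (precision + recall)
print(precision, recall, accuracy, f_score)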
First, we will train the Random Forest model with some default parameters. Please note that optimizing the model with feature selection or cross-validation has been kept out of scope here for the sake of simplicity. After that, we train the model using under-sampling, over-sampling and then SMOTE. The table below illustrates the confusion matrix along with the precision, recall and accuracy metrics for each method.
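The snippets below assume the data has already been split into x_train, x_test, y_train and y_test. A minimal sketch of one way to do this, reusing the features x and target y loaded earlier; the 70/30 stratified split is an assumption, not necessarily the split behind the reported numbers:
# Split features and target into stratified train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y)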
a) No sampling result interpretation: Without any sampling we are able to capture 76 fraudulent transactions. Though the overall accuracy is 97%, the recall is 75%. This means that there are quite a few fraudulent transactions that our model is not able to capture.
Below is the code that can be used:
# Training the model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# Predict y on the test set
y_pred = classifier.predict(x_test)

# Obtain the results from the classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print('Classification report:\n', classification_report(y_test, y_pred))
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print('Confusion matrix:\n', conf_mat)
b) Under-sampling result interpretation: With under-sampling, though the model is able to capture 90 fraud cases with a significant improvement in recall, the accuracy and precision fall drastically. This is because the false positives have increased phenomenally and the model is penalizing a lot of genuine transactions.
Under-sampling code snippet:
# These are the resampling method and pipeline module we need from imblearn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Define which resampling method and which ML model to use in the pipeline
resampling = RandomUnderSampler()
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline, combining the sampling method with the RF model
pipeline = Pipeline([('RandomUnderSampler', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
c) Over-sampling result interpretation: The over-sampling method has the best precision and accuracy, and the recall is also good at 81%. We are able to capture 6 more fraud cases, and the false positives are pretty low as well. Overall, from the perspective of all the metrics, this is the best-performing model.
Oversampling code snippet:
# This is the resampling method we need from imblearn
from imblearn.over_sampling import RandomOverSampler

# Define which resampling method and which ML model to use in the pipeline
resampling = RandomOverSampler()
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline, combining the sampling method with the RF model
pipeline = Pipeline([('RandomOverSampler', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
d) SMOTE result interpretation: SMOTE further improves on the over-sampling method, with 3 more frauds caught in the net, and though the false positives increase a bit, the recall is pretty healthy at 84%.
SMOTE code snippet:
# This is the resampling method we need from imblearn
from imblearn.over_sampling import SMOTE

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(sampling_strategy='auto', random_state=0)
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline, combining SMOTE with the RF model
pipeline = Pipeline([('SMOTE', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
In our use case of fraud detection, the single most important metric is recall. This is because banks/financial institutions are more concerned about catching most of the fraud cases, since fraud is expensive and they might lose a lot of money over it. Hence, even if there are a few false positives, i.e. genuine customers flagged as fraudulent, it won't be too cumbersome, because this only means blocking some transactions. However, blocking too many genuine transactions is also not a feasible solution; hence, depending on the risk appetite of the financial institution, we can go with either the simple over-sampling method or SMOTE. We can also tune the hyperparameters of the model using grid search to further improve the results.
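As an illustration of such tuning, here is a minimal grid-search sketch scored on recall; the parameter grid below is an assumption chosen for illustration, not a recommended setting:
# Tune the Random Forest with grid search, optimizing for recall
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(criterion='entropy', random_state=0),
                    param_grid, scoring='recall', cv=5)
grid.fit(x_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validated recall:', grid.best_score_)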
For details on the code, refer to this link on GitHub.
References:
[1] Mythili Krishnan, Madhan K. Srinivasan, Credit Card Fraud Detection: An Exploration of Different Sampling Methods to Solve the Class Imbalance Problem (2022), ResearchGate
[2] Bartosz Krawczyk, Learning from imbalanced data: open challenges and future directions (2016), Springer
[3] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research
[4] Leo Breiman, Random Forests (2001), stat.berkeley.edu
[5] Jeremy Jordan, Learning from imbalanced data (2018)