- What’s our problem?
- Why not Decision Trees but Random Forest?
- Fraud Detection.
- Evaluation of our model.
Hello everyone, in today’s article we are going to work through a fraud detection problem with the random forest algorithm. We have a dataset of users’ credit card transactions, with both features and labels: we already know which transactions are fraudulent and which are not.
Our aim is to build the model, train it, and see how well it performs.
Understanding the problem is a crucial step. We already said that we have labels for our data — what are these labels? In this case, the label is the column that says whether a given transaction is a fraud or not. Since we have labels, this is a supervised learning problem.
Secondly, the outcome can only be fraud or not fraud, so we are dealing with a binary classification problem. But what is the point if we already know whether each transaction is a fraud?
This question gets at how supervised learning actually learns. We train our model on one portion of the data and test it on another. During training, the model discovers patterns in the data, and that makes it able to predict whether a new transaction is fraud or not.
In the first part, we said this is a classification problem. There are many kinds of models we can use for classification problems, and each of them works in its own way.
Here are some other algorithms we can use for classification problems:
- Logistic Regression
- Decision Trees
- Support Vector Machines (SVM)
- Gradient Boosting
- Neural Networks
From this list, decision trees are the most similar method, so let’s explain why we chose the random forest algorithm over them.
The decision tree algorithm splits the data in the way that maximizes the information gain at each step. A random forest, on the other hand, is an ensemble model that combines multiple decision trees to reduce overfitting and improve performance.
It aggregates the votes of all the trees in the forest and returns the most reliable result. Even though a single decision tree can be sufficient for small datasets, on large datasets a random forest usually gives more accurate results.
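As a minimal sketch of this comparison, we can fit both models on a synthetic dataset (a hypothetical stand-in generated with `make_classification`, not the article’s data) and compare their test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data as a stand-in for real transactions
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# One tree versus an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

print(f"decision tree accuracy: {tree.score(X_te, y_te):.3f}")
print(f"random forest accuracy: {forest.score(X_te, y_te):.3f}")
```

On most runs of data like this, the forest edges out the single tree, though the gap depends on the dataset.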
We’re using a dataset from Kaggle.
According to the dataset source:
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, where we have 492 frauds out of 284,807 transactions.
The full dataset includes whether each transaction is a fraud or not. However, we will split our data into train and test sets and withhold the labels of the test set from the model. This way we can evaluate our model’s performance, and when we actually need to decide whether a new transaction is fraudulent, we will know how much we can trust this particular model.
As we can see above, we have Time, Amount, and Class columns as well as V1, V2, …, VN.
- The ‘V’ columns were created with principal component analysis (PCA) to protect sensitive information. This data still carries a useful pattern, even though we can’t interpret it directly.
- The Amount column is the transaction amount.
- The Time column is the elapsed time between the first transaction in the dataset and the given transaction.
- The Class column is the label column that says whether the given transaction is a fraud or not.
The column types are float and integer, which is good for our training purposes. If some columns were strings, we would have to convert their data types.
Apart from the Class column, which has 2 unique values (fraud, not fraud), most of our values are unique.
We want to change the Time column, which is an aggregate feature (it always increases, because it measures the time between the first transaction and the given one). I found it more useful to calculate the time between consecutive transactions.
Since we’ve created a column that measures the time since the previous row, null and infinite values occurred. (There is no value before index 0, and index 1 was also 0 in our original column.)
Dropping the infinite and null values.
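A minimal sketch of these two steps with pandas, using a tiny hypothetical frame in place of the Kaggle file (which would be loaded with `pd.read_csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the credit-card data
df = pd.DataFrame({"Time":   [0.0, 0.0, 1.0, 2.0, 7.0],
                   "Amount": [149.62, 2.69, 378.66, 123.50, 69.99],
                   "Class":  [0, 0, 0, 0, 1]})

# Replace the cumulative Time column with the gap since the previous row
df["Time"] = df["Time"].diff()

# The first row has no predecessor, so diff() leaves a NaN there;
# convert any infinities to NaN as well, then drop those rows
df = df.replace([np.inf, -np.inf], np.nan).dropna()
print(len(df))  # one row fewer than we started with
```

`DataFrame.diff()` does the per-row subtraction in one call, so no explicit loop is needed.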
Splitting our data into train and test sets, and introducing the X and y variables, which are the features and labels.
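This step can be sketched with scikit-learn’s `train_test_split`; here a synthetic imbalanced frame stands in for the real data, and the split sizes are illustrative assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the credit-card frame: 1000 rows, roughly 2% positive class
X_arr, y_arr = make_classification(n_samples=1000, weights=[0.98], random_state=42)
df = pd.DataFrame(X_arr)
df["Class"] = y_arr

X = df.drop(columns="Class")   # features
y = df["Class"]                # labels

# stratify=y keeps the fraud ratio the same in both splits,
# which matters for data this imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))
```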
Scaling our data. Scaling matters because we don’t want features with high values to influence our model more than other features. Unscaled data can cause some algorithms to give more importance to features with larger values, even when those features aren’t more informative.
Importing the Random Forest algorithm, which is available in Python’s scikit-learn package.
For this particular problem, I have chosen n_estimators (the number of trees) as 500. You can change this amount. In general, increasing the number of trees in the forest can improve the model’s performance up to a certain point, beyond which adding more trees may not provide significant improvement and mostly adds training time. By default, scikit-learn uses 100 n_estimators. I decided to increase it to 500 considering the size of the dataset.
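A sketch of this step, again on synthetic stand-in data rather than the article’s dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in for the transaction data
X, y = make_classification(n_samples=1000, weights=[0.98], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 500 trees, as in the article; n_jobs=-1 trains trees on all CPU cores
model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

print(len(model.estimators_))          # number of fitted trees
print(model.score(X_test, y_test))     # test-set accuracy
```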
Additionally, you can use cross-validation to find the optimal number of trees.
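One common way to do this is a grid search with cross-validation; the sketch below tries a few candidate tree counts on synthetic data (the grid values and `f1` scoring are illustrative choices, not the article’s settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# Try each candidate n_estimators with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200]},
    cv=3,
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_)
```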
In classification problems, we can use a confusion matrix to measure performance. Basically, it compares our predicted results with the actual labels and counts the following:
- The result was positive and we predicted positive (true positive)
- The result was negative and we predicted positive (false positive)
- The result was positive and we predicted negative (false negative)
- The result was negative and we predicted negative (true negative)
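These four counts can be read straight off scikit-learn’s `confusion_matrix`; the labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = not fraud
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes;
# ravel() flattens the 2x2 matrix into the four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```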
Raw accuracy may not show how powerful our model really is, especially on imbalanced datasets. For example, say we have 1000 negative and 10 positive labels. Even if our model just predicts negative for every value, its accuracy will be about 99 percent. That’s why we generally use a confusion matrix.
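The worked example above, in code (the counts are the hypothetical ones from the text; recall is shown alongside accuracy to expose the useless model):

```python
from sklearn.metrics import accuracy_score, recall_score

# 1000 legitimate transactions, 10 frauds; the "model" predicts 0 every time
y_true = [0] * 1000 + [1] * 10
y_pred = [0] * 1010

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")  # looks excellent
print(f"recall:   {recall_score(y_true, y_pred):.3f}")    # catches zero frauds
```

Recall (the fraction of actual frauds caught) is 0 here, which is exactly what the headline accuracy hides.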
In total, about 0.17% of the transactions (492 out of 284,807) are fraud.
You may observe the labels that are fraud, and find their indexes.
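With pandas this is a one-line boolean filter; the mini-frame here is a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical slice of the transaction data
df = pd.DataFrame({"Amount": [149.62, 2.69, 378.66, 0.89],
                   "Class":  [0, 1, 0, 1]})

# Boolean mask selects the fraud rows; .index gives their positions
fraud_index = df[df["Class"] == 1].index
print(list(fraud_index))
```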
That’s all for this article. If you want to see more articles about machine learning, you can follow my page. For the code and dataset, see the links below. See you in the next article!