The rise of cybercrime has made fraudulent webpage detection a necessary task in keeping the web safe. It is clear that risks such as the theft of personal information, malware, and viruses are tied to online activities in emails, social media applications, and websites. These web threats, delivered through so-called malicious URLs, are used by cybercriminals to lure users to web pages that appear real or legitimate.
This paper explores the development of a deep learning system involving a transformer algorithm to detect malicious URLs, with the aim of improving on an existing method such as Long Short-Term Memory (LSTM). Devlin et al. (2019) introduced BERT, a natural-language modelling algorithm developed at Google and built on the transformer architecture introduced by Google Brain in 2017. This model is capable of making more accurate predictions, outperforming recurrent neural network systems such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). In this project, I compared BERT's performance with LSTM as a text classification technique. With a processed dataset containing over 600,000 URLs, a pre-trained model was developed, and results were compared using performance metrics such as accuracy, precision, recall, and F1 score (Seyyar et al., 2022). The LSTM algorithm achieved an accuracy of 91.36% and an F1 score of 0.90 (higher than BERT's) in classifying both unusual and common requests.
Keywords: Malicious URLs, Long Short-Term Memory, phishing, benign, Bidirectional Encoder Representations from Transformers (BERT).
1.0 Introduction
With the usability of the Internet through the Web, there has been an increasing number of users over time. As all digital devices are connected to the web, this has also resulted in an increasing number of phishing threats through websites, social media, emails, applications, etc. Morgan (2024) reported that more than $9.5 trillion was lost globally due to leaks of personal information.
Consequently, innovative approaches have been introduced over time to automate the task of ensuring safer web usage and data protection. The Symantec 2016 Internet Security Report (Vanhoenshoven et al., 2016) shows that scammers have caused most cyber-attacks involving corporate data breaches through browsers and websites, as well as sheer malware attempts that bait users with the Uniform Resource Locator (URL).
In recent years, blacklisting, reputation-based systems, and machine learning algorithms have been used by cybersecurity professionals to enhance malware detection and make the web safer. Google's statistics report that over 9,500 suspicious web pages are blacklisted and blocked per day. The existence of these malicious web pages represents a major risk to the information security of web applications, particularly those that deal with sensitive data (Sankaran et al., 2021). Because it is so easy to implement, blacklisting has become the standard approach, and it can also significantly lower the false-positive rate. The problem, however, is that it is extremely difficult to keep an exhaustive list of malicious URLs up to date, especially considering that new URLs are created every day. To circumvent filters and trick users, cybercriminals have come up with ingenious methods, such as obfuscating a URL so that it appears legitimate. The field of Artificial Intelligence (AI) has seen significant advancements and applications in a variety of domains, including cybersecurity. One critical aspect of cybersecurity is detecting and preventing malicious URLs, which can lead to serious consequences such as data breaches, identity theft, and financial losses. Given the dynamic and ever-changing nature of cyber threats, detecting malicious URLs is a challenging task.
This project aims to develop a deep learning system for text classification, called Malicious URL Detection, using pre-trained Bidirectional Encoder Representations from Transformers (BERT). Can the BERT model outperform existing techniques in malicious URL detection? The expected outcome of this study is to demonstrate the effectiveness of the BERT model in detecting malicious URLs and to compare its performance with recurrent neural network techniques such as LSTM. I used evaluation metrics such as accuracy, precision, recall, and F1 score to compare the models' performance.
2.0. Background
Machine learning methods such as Random Forest, Multi-Layer Perceptron, and Support Vector Machines, and deep learning methods such as LSTM and CNN, are only a few of the methods proposed in the existing literature for detecting harmful URLs. However, these methods have drawbacks: they require hand-crafted (traditional) features and are unable to deal with complex data, which leads to overfitting.
2.1. Related works
To improve the time needed for obtaining page content or processing text, Kan and Thi (2005) used a technique of categorising websites based on their URLs. Classification features were collected from each URL after it was parsed into several tokens, and token dependencies in time order were modelled from these characteristics. They concluded that the classification rate increased when high-quality URL segmentation was combined with feature extraction. This approach paved the way for other research on developing complex deep learning models for text classification. Treating the task as a binary text classification problem, Vanhoenshoven et al. (2016) developed models for the detection of malicious URLs and evaluated the performance of classifiers including Naive Bayes, Support Vector Machines, and Multi-Layer Perceptron. Subsequently, text embedding methods implementing transformers have produced state-of-the-art results in NLP tasks. A similar model was devised by Maneriker et al. (2021), in which they pre-trained and refined an existing transformer architecture using only URL data. The URL dataset included 1.29 million entries for training and 1.78 million entries for testing. Initially, the BERT architecture supported the masked language modelling framework, which may not be necessary in this report.
For the classification process, the BERT and RoBERTa algorithms were fine-tuned, and results were evaluated and compared, to propose a model called URLTran (URL Transformers) that uses transformers to significantly improve the performance of malicious URL detection with very low false-positive rates compared to other deep learning networks. With this method, the URLTran model achieved an 86.8% true positive rate (TPR) compared to the best baseline's TPR of 71.20%, a relative improvement of 21.9%. This method was able to classify and predict whether a detected URL is benign or malicious.
Moreover, an RNN-based model was proposed by Ren et al. (2019), in which extracted URLs were converted into word vectors (characters) using pre-trained Word2Vec and classified with a Bi-LSTM (bi-directional long short-term memory). After validation and evaluation, the model achieved 98% accuracy and an F1 score of 95.9%. This model outperformed almost all the NLP techniques but processed text characterization only separately. Hence, there is a need to develop an improved model using BERT to process sequential input at once. Although these models have demonstrated some improvement with big data, they are not without their limitations. The sequential nature of text data, for instance, may be difficult for RNNs to handle, while CNNs often fail to capture long-term dependencies in the data (Alzubaidi et al., 2021). As the volume and complexity of textual data on the web continue to increase, current models may become inadequate.
3.0. Objectives
This project presented the importance of a bidirectional pre-trained model for text classification. Radford et al. (2018) used unidirectional language models for pre-training; by comparison, a shallow concatenation of independently trained left-to-right and right-to-left language models was also proposed (Devlin et al., 2019; Peters et al., 2017). Here, I used a pre-trained BERT model to achieve state-of-the-art performance on a wide range of sentence-level and token-level tasks (Han et al., 2021), with the aim of outperforming many RNN architectures and thereby reducing the need for those frameworks. In this case, the hyper-parameters of the LSTM algorithm will not be fine-tuned.
Specifically, this research paper emphasises:
- Developing LSTM and pre-trained BERT models to detect (classify) whether a URL is unsafe or not.
- Comparing the results of the base model (LSTM) and the pre-trained BERT model using evaluation metrics such as recall, accuracy, F1 score, and precision. This helps determine whether the base model performs better.
- BERT automatically learns latent representations of words and characters in context. The only task is to fine-tune the BERT model to improve the baseline performance. This proposes a computationally simple alternative to the more resource-intensive and computationally expensive RNN architectures.
- Data analysis, model development, and evaluation took about 7 weeks, and the aim was to achieve a significantly reduced training runtime with Google's BERT model.
4.0. Methodology
This section explains all the processes involved in implementing a deep learning system for detecting malicious URLs. Here, a transformer-based framework was developed from an NLP sequence perspective (Rahali and Akhloufi, 2021) and used to statistically analyse a public dataset.

4.1. The dataset
The dataset used for this report was compiled and extracted from Kaggle (license info). It was prepared to classify webpages (URLs) as malicious or benign. Datasets consisting of URL entries for training, validation, and testing were collected.

To analyse the data using deep learning models, a large dataset of 651,191 URL entries was retrieved from PhishTank, PhishStorm, and a malware domain blacklist. It contains:
- Benign URLs: These are safe webpages to browse. Exactly 428,103 entries were known to be safe.
- Defacement URLs: These webpages are used by cybercriminals or hackers to clone real and secure websites. There are 96,457 such URLs.
- Phishing URLs: These are disguised as real links to trick users into providing personal and sensitive information, which risks the loss of funds. 94,111 entries of the total dataset were flagged as phishing URLs.
- Malware URLs: These are designed to manipulate users into downloading them as software and applications, thereby exploiting vulnerabilities. There are 32,520 malware webpage links in the dataset.

4.2. Feature extraction
For the URL dataset, feature extraction was used to transform raw input data into a format supported by machine learning algorithms (Li et al., 2020). It converts categorical data into numerical features, while feature selection selects a subset of relevant features from the original dataset (Dash and Liu, 1997; Tang et al., 2014).
View the data analysis and model development file here. The following steps were taken:
1. Combining the phishing, malware, and defacement URLs as malicious URL types for better selection. All URLs are then labelled benign or malicious.
2. Converting the URL types from categorical variables into numerical values. This is an important process because deep learning model training requires only numerical values. Benign and malicious URLs are encoded as 0 and 1, respectively, and stored in a new column called "Category".
3. A 'url_len' feature was computed to capture the length of each URL in the dataset. Using the 'process_tld' function, the top-level domain (TLD) of each URL was extracted.
4. The presence of specific characters ['@', '?', '-', '=', '.', '#', '%', '+', '$', '!', '*', ',', '//'] was represented and added as columns to the dataset, alongside the 'abnormal_url' feature. This feature (function) produces a binary flag verifying whether there are abnormalities in each URL.
5. Another selection was performed on the dataset, covering letter and digit counts, HTTPS usage, shortening-service usage, and the IP address of all entries. These provide more information for training the model. A code sketch of these steps follows.
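A minimal sketch of these feature-extraction steps is given below, assuming pandas; 'process_tld' and 'abnormal_url' are simplified reconstructions of the functions named above, not the exact implementations used in the project.

```python
import re
from urllib.parse import urlparse

import pandas as pd

# Characters whose occurrence counts are added as columns (step 4).
SPECIAL_CHARS = ['@', '?', '-', '=', '.', '#', '%', '+', '$', '!', '*', ',', '//']

def process_tld(url):
    # Simplified reconstruction: take the last dot-separated label of the host.
    host = urlparse(url if '://' in url else 'http://' + url).netloc
    return host.split('.')[-1] if host else None

def abnormal_url(url):
    # Binary flag (step 4): 0 if the hostname reappears verbatim in the URL, 1 otherwise.
    host = urlparse(url if '://' in url else 'http://' + url).netloc
    return 0 if host and re.search(re.escape(host), url) else 1

def extract_features(df):
    df = df.copy()
    df['url_len'] = df['url'].str.len()                 # step 3: URL length
    df['tld'] = df['url'].apply(process_tld)            # step 3: top-level domain
    df['abnormal_url'] = df['url'].apply(abnormal_url)  # step 4: abnormality flag
    for ch in SPECIAL_CHARS:                            # step 4: special-character counts
        df[ch] = df['url'].str.count(re.escape(ch))
    df['https'] = df['url'].str.startswith('https').astype(int)  # step 5
    df['digits'] = df['url'].str.count(r'\d')           # step 5: digit count
    df['letters'] = df['url'].str.count(r'[A-Za-z]')    # step 5: letter count
    return df

# Example usage with a hypothetical single-row frame:
sample = pd.DataFrame({'url': ['http://example.com/login?user=admin']})
print(extract_features(sample).T)
```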
4.3. Classification – model development and training
Using pre-labelled features, the model learns the association between labels and text from the training data. This stage involves identifying the URL types in the dataset. As an NLP technique, it is required to assign texts (words) into sentences and queries (Minaee et al., 2021). A recurrent neural network model architecture defines an optimised model. To ensure a balanced dataset, the data was split into an 80% training set and a 20% testing set. The texts were labelled using word embeddings for both the LSTM and the pre-trained BERT models. The dependent variables comprise the encoded URL types (Categories), given that this is an automatic binary classification.
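As a minimal sketch of this 80/20 split (assuming scikit-learn; the toy DataFrame and its values are hypothetical stand-ins for the processed dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the processed dataset; 'url' and 'Category' follow the
# feature-extraction step above (0 = benign, 1 = malicious).
df = pd.DataFrame({
    'url': [
        'http://example.com', 'http://news.example.org', 'http://shop.example.net',
        'http://phish.example/login', 'http://malware.example/dl', 'http://fake-bank.example',
    ],
    'Category': [0, 0, 0, 1, 1, 1],
})

# 80% training / 20% testing split, stratified to keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    df['url'], df['Category'],
    test_size=0.2,
    stratify=df['Category'],
    random_state=42,
)
print(len(X_train), 'training URLs,', len(X_test), 'testing URLs')
```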
4.3.1. Long short-term memory model
LSTM was found to be the most popular architecture because of its ability to capture long-term dependencies, using word2vec (Mikolov et al., 2013) to train on billions of words. After preprocessing and feature extraction, the data was set up for LSTM model training, testing, and validation. To determine the appropriate sequence length, the number and size of layers (input and output layers) were proposed before training the model. Hyperparameters such as the number of epochs, the learning rate, and the batch size were tuned to achieve optimal performance.
The memory cell of a typical LSTM unit has three gates (input gate, forget gate, and output gate) (Feng et al., 2020). Contrary to a feedforward neural network, the output of a neuron at any time can be fed back as input to the same neuron (Do et al., 2021). To prevent overfitting, a dropout function is applied to multiple layers one after the other. The first layer added is an embedding layer, which is used to create dense vector representations of words in the input text data. However, only one LSTM layer was used in this architecture due to the long training time.
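A minimal sketch of this architecture, assuming TensorFlow/Keras; the vocabulary size and layer widths are assumptions, as the paper does not state them:

```python
from tensorflow.keras import layers, models

vocab_size = 5000  # assumed tokenizer vocabulary size

model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=32),  # dense vector representations of tokens
    layers.LSTM(64),                         # single LSTM layer (more layers lengthened training)
    layers.Dropout(0.2),                     # dropout against overfitting
    layers.Dense(1, activation='sigmoid'),   # binary benign/malicious output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```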
4.3.2. BERT model
Researchers proposed the BERT architecture for NLP tasks because it has better overall performance than RNNs and LSTMs. A pre-trained BERT model was implemented in this project to process text sequences and capture the semantic information of the input, which can help reduce training time and improve the accuracy of malicious URL detection. After the URL data was pre-processed, it was converted into sequences of tokens, which were then fed into the BERT model for processing (Chang et al., 2021). Due to the large number of data entries in this project, the BERT model was fine-tuned to learn the relevant features of each type of URL. Once trained, the model was used to classify URLs as malicious (phishing) or benign with improved accuracy and performance.

Figure 4.3.2 describes the processes involved in model training with the BERT algorithm. A tokenization phase is required for splitting text into tokens. Initially, raw text is separated into words, which are then converted to unique integer IDs via a lookup table. WordPiece tokenization (Song et al., 2020) was implemented using the BertTokenizer class. The tokenizer includes the BERT token-splitting algorithm and a WordPieceTokenizer (Rahali and Akhloufi, 2023). It accepts words (sentences) as input and outputs token IDs.
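A minimal sketch of this tokenization step, assuming Hugging Face's transformers library and the 'bert-base-uncased' checkpoint (the paper does not name the exact checkpoint used):

```python
from transformers import BertTokenizer

# WordPiece tokenizer bundled with the pre-trained BERT checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Hypothetical URL; padding/truncation keeps every sequence at a fixed length.
encoded = tokenizer(
    'http://example.com/login?user=admin',
    max_length=64,
    padding='max_length',
    truncation=True,
)
print(encoded['input_ids'])       # unique integer token IDs
print(encoded['attention_mask'])  # 1 for real tokens, 0 for padding
```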
5.0. Experiments
Specific hyper-parameters were used for BERT, while an LSTM model with a single hidden layer was tuned based on performance on the validation set. Due to the unbalanced dataset, only 522,214 entries were parsed, consisting of 417,792 training entries and 104,422 testing entries, a train-test split of 80% to 20%.
The parameters used for training are described below:

5.1. LSTM (baseline)
The results indicated a dropout rate of 0.2 and a batch size of 1024, achieving a training accuracy of 91.23% and a validation accuracy of 91.36%. Only one LSTM layer was used in the architecture due to the long training time (an average of 25.8 minutes). Adding more layers to the neural network results in a high computational cost, thereby reducing the model's overall performance.

5.2. Pre-trained BERT model
This model was tokenized, but the downside was that the classifier could not be initialized from the checkpoint; consequently, some layers were affected. The model requires further sequence-classification training before it can be used. Expectations were not met due to the complex computation, although the architecture has been proposed to have excellent performance.
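For illustration, a minimal sketch of why a classifier head can fail to initialize from a pre-trained checkpoint, assuming Hugging Face's transformers library and the 'bert-base-uncased' checkpoint (the paper does not name its exact setup):

```python
from transformers import BertForSequenceClassification

# The pre-trained checkpoint contains no classification head, so the
# 'classifier' layer is newly (randomly) initialized at load time, which
# produces the "some weights were not initialized" warning.
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # benign vs. malicious
)
# model.classifier must be fine-tuned on the labelled URL data before
# its predictions are meaningful.
```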
6.0. Results
The experimental outcomes of the two developed models are evaluated using performance metrics. These metrics show how well the models performed on the test data and are presented to judge the proposed approach's effectiveness in detecting malicious webpages.
6.1. Performance Metrics
To evaluate the performance of the proposed models, a confusion matrix was used because of the evaluation measures it provides.

- True Positive (TP): samples that are correctly predicted as malicious (phishing) (Amanullah et al., 2020).
- True Negative (TN): samples that are correctly predicted as benign URLs.
- False Positive (FP): benign samples that are incorrectly predicted as phishing URLs.
- False Negative (FN): phishing samples that are incorrectly predicted as benign URLs.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = (2 × Precision × Recall) / (Precision + Recall)
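A minimal sketch computing these metrics with scikit-learn; the labels below are toy values (1 = malicious, 0 = benign), not results from the paper:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print('Accuracy :', accuracy_score(y_true, y_pred))   # (TP + TN) / total
print('Precision:', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('Recall   :', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('F1-score :', f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```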

The LSTM model achieved an accuracy of 91.36% and a loss of 0.25, while the pre-trained BERT model achieved a lower accuracy (75.9%) than expected as a result of a hardware malfunction.
6.2. Validation
The LSTM performed well: based on the validation accuracy, it will detect malicious URLs roughly 9 out of 10 times.


However, the pre-trained BERT model could not meet the higher expectation due to the unbalanced and large dataset.

7.0. Conclusion
Overall, LSTM models can be a powerful tool for modelling sequential data and making predictions based on temporal dependencies. However, it is important to carefully consider the nature of the data and the problem at hand before deciding to use an LSTM model, as well as to properly set up and tune the model to achieve the best results. Due to the large dataset, an increased batch size (1024) resulted in a shorter training time and improved the validation accuracy of the model. This could be a result of not tokenizing the model during training and testing. BERT's maximum sequence length is 512 tokens, which can be inconvenient for some applications: if a sequence is shorter than the limit, padding tokens must be added to it; otherwise, it must be truncated (Rahali and Akhloufi, 2021). Also, to understand words and sentences better, BERT needs modified embeddings to represent character-level context. Although these capabilities performed well with complex word embeddings, they can also result in longer training times when used with larger datasets. Nevertheless, further research is required to detect patterns during malicious URL detection.
References
- Alzubaidi, L., Zhang, J., Humaidi, A. J., Duan, Y., Santamaría, J., Fadhel, M. A., & Farhan, L. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(1), 1-74. https://doi.org/10.1186/s40537-021-00444-8
- Amanullah, M. A., Habeeb, R. A. A., Nasaruddin, F. H., Gani, A., Ahmed, E., Nainar, A. S. M., Akim, N. M., & Imran, M. (2020). Deep learning and big data technologies for IoT security. Computer Communications, 151, 495-517. https://doi.org/10.1016/j.comcom.2020.01.016
- Chang, W., Du, F., & Wang, Y. (2021). Research on malicious URL detection technology based on BERT model. In 2021 IEEE 9th International Conference on Information, Communication and Networks (ICICN), Xi'an, China, pp. 340-345. https://doi.org/10.1109/ICICN52636.2021.9673860
- Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(1-4), 131-156. https://doi.org/10.1016/S1088-467X(97)00008-5
- Do, N.Q., Selamat, A., Krejcar, O., Yokoi, T. and Fujita, H. (2021). Phishing webpage classification via deep learning-based algorithms: an empirical study. Applied Sciences, 11(19), p.9210.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Feng, J., Zou, L., Ye, O., & Han, J. (2020). Web2Vec: Phishing webpage detection method based on multidimensional features driven by deep learning. IEEE Access, 8, 221214-221224. https://doi.org/10.1109/ACCESS.2020.3043188
- Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu, J., Yao, Y., Zhang, A., Zhang, L., Han, W., Huang, M., Jin, Q., Lan, Y., Liu, Y., Liu, Z., Lu, Z., Qiu, X., Song, R., . . . Zhu, J. (2021). Pre-trained models: Past, present and future. AI Open, 2, 225-250. https://doi.org/10.1016/j.aiopen.2021.08.002
- Morgan, S. (2024). 2024 Cybersecurity Almanac: 100 facts, figures, predictions and statistics. Cybersecurity Ventures. https://cybersecurityventures.com/2024-cybersecurity-almanac/
- Kan, M.-Y., & Thi, H. (2005). Fast webpage classification using URL features, pp. 325-326. https://doi.org/10.1145/1099554.1099649
- Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P. S., & He, L. (2020). A survey on text classification: From shallow to deep learning. arXiv preprint arXiv:2008.00364.
- Maneriker, P., Stokes, J. W., Lazo, E. G., Carutasu, D., Tajaddodianfar, F., & Gururajan, A. (2021). URLTran: Improving phishing URL detection using transformers. arXiv preprint arXiv:2106.05256.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
- Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2021). Deep learning-based text classification: A comprehensive review. ACM Computing Surveys, 54(3), 1-40. https://doi.org/10.1145/3439726
- Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
- Rahali, A., & Akhloufi, M. A. (2021). MalBERT: Using transformers for cybersecurity and malicious software detection. arXiv preprint arXiv:2103.03806.
- Ren, F., Jiang, Z., & Liu, J. (2019). A bi-directional LSTM model with attention for malicious URL detection. In 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), 1, 300-305.
- Sankaran, M., Mathiyazhagan, S., & Dharmaraj, M. (2021). Detection of malicious URLs using machine learning techniques. International Journal of Aquatic Science, 12(3), 1980-1989.
- Song, X., Salcianu, A., Song, Y., Dopson, D., & Zhou, D. (2020). Fast WordPiece tokenization. arXiv preprint arXiv:2012.15524.
- Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. In Data Classification: Algorithms and Applications, p. 37.
- Vanhoenshoven, F., Napoles, G., Falcon, R., Vanhoof, K., & Koppen, M. (2016). Detecting malicious URLs using machine learning techniques. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE.
- Seyyar, Y. E., Yavuz, A. G., & Ünver, H. M. (2022). An attack detection framework based on BERT and deep learning. IEEE Access, 10, 68633-68644. https://doi.org/10.1109/ACCESS.2022.3185748