
Using Machine Learning to Better Classify Security Issues


Security teams commonly track, triage, and fix product security bugs or vulnerabilities using a ticketing system. However, as an organization grows, so does the security team, splitting into smaller teams that focus on specific areas of expertise. Often, these teams create and follow their own, disparate processes for categorizing and cataloguing the security issues filed in this ticketing system. In an ever-growing mountain of tickets, the lack of consistency in categorization and cataloguing leads security teams to miss the big picture: Are we repeatedly finding the same types of issues? Are these issues the result of a lack of adherence to certain security best practices? Are we seeing multiple vulnerabilities that have publicly available exploits?

Within the Adobe Security organization, we have applied machine learning and data analytics to help us zoom out from the day-to-day fixing of issues to identifying patterns that give better insight into the core areas of weakness we need to address. In this blog, I'll explain how we did it and how you can apply the lessons we learned to shift to a more strategic approach in your own security program.

Most security organizations use a range of industry standards to categorize security issues, such as Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), and the OWASP Top 10 vulnerabilities. However, classifying the large volume of security issues into hundreds or even thousands of categories (nearly 200,000 CVEs and roughly 900 CWEs at last count) doesn't simplify analysis; it only makes it more confusing and complicated.

Instead, the Adobe Security team created a few basic yet comprehensive issue categories that cover the most common avenues of attack exploited by adversaries and are easy for developers to understand. Currently, we have six (6) categories into which we classify tickets and which we use to conduct all product testing via the Adobe Open Test Plan Process, or OTPP. While we plan to continually add new categories as adversary profiles evolve and security intelligence dictates, you can read more about the current categories and the Adobe OTPP in our Securability Reports Overview.

As an example, one of the categories in the Adobe OTPP is called "Validate Inputs." Adversaries often attempt to "trick" a product by inputting code or a command that makes the system behave in a manner that wasn't intended. For example, uploading an incorrect file type or inputting improper text, code, or a command could allow an adversary to inject a malicious payload into the product. This category captures security issues that can arise due to improper user input validation, such as SQL injection and cross-site scripting.

Once we created the six simple categories, we ran into another complication: inconsistencies in how we ticketed issues over time made it difficult to find common features across the entire dataset, which, in turn, meant we couldn't apply pattern-matching or keyword-based approaches for classification.

The only common field across all our security tickets was the "Description" field, which contained detailed information about the issue in text format. Since machine learning (ML) models have been widely used for text classification tasks, we decided to use ML to solve our problem.

Before building an ML model, we wanted to assess the quality and quantity of our data, which we would use to train the model. This was a crucial step because ML models are only as good as the data on which they are trained. We already had a large number of tickets with detailed issue descriptions. Additionally, we had groups of security issues originating from scans and automations that could be labelled into one of our security categories. Together, this meant we could create a sufficiently large training dataset that we could easily label without much manual effort.

We knew that ML models would enable us to automate and scale the process of categorizing and tagging the tickets, but ML models are mathematical functions that only work with numbers; they don't work with text. This meant we had to prepare the data by converting our text-based input (the issue description) into a numerical format (a vector) that the model can understand. We did this by using natural language processing (NLP) techniques.

Pre-processing

In this step, we cleaned our text data by removing elements we didn't need. For our purposes, we converted all the data to lowercase and removed punctuation, line breaks, and symbols. We also removed unimportant words, called "stop words," such as articles and prepositions. Finally, we applied a couple of NLP techniques called stemming and lemmatization, which are helpful for reducing different words that originate from the same root and have the same semantics to a single representation. For instance, stemming and lemmatizing the words "go," "going," and "gone" would give us the word "go."
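To make this concrete, here is a minimal pre-processing sketch using NLTK. It is only an illustration of the steps described above, not our internal tooling; the example sentence and the choice of library are assumptions.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(description: str) -> str:
    """Lowercase, strip line breaks and punctuation, drop stop words, and lemmatize."""
    text = description.lower()
    text = re.sub(r"[\r\n]+", " ", text)  # remove line breaks
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(LEMMATIZER.lemmatize(w, pos="v") for w in words)

print(preprocess("The endpoint is going to be exploited; inputs were not validated!"))
# Roughly: "endpoint go exploit input validate"
```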

Tokenization

After pre-processing the data, we divided the text into smaller words or sub-parts, called tokens, which enables good generalization of the relationship between the texts and the labels. This process, called tokenization, determines the vocabulary of the dataset, i.e., the set of unique tokens that represent the data.

We tokenized our data using the n-gram approach, where "n" denotes the number of adjacent words in the text that are chosen to form a token. With the n-gram representation, the order or sequence of words in the text doesn't matter. This is known as the "bag of words" approach. We tokenized our text data using a combination of unigrams (n=1) and bigrams (n=2), which use one or two words, respectively, to form a token. This approach gave us good accuracy while taking less compute time.
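As a quick illustration of unigram-plus-bigram tokenization, here is a small sketch using scikit-learn's CountVectorizer; the sample descriptions are made up, and this is not necessarily the library we used.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "unvalidated input allows sql injection",
    "cross site scripting via unvalidated input",
]

# ngram_range=(1, 2) keeps both single words (unigrams) and adjacent word pairs (bigrams).
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)

print(vectorizer.get_feature_names_out())
# e.g. ['allows', 'allows sql', 'cross', 'cross site', ..., 'unvalidated', 'unvalidated input', ...]
```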

Vectorization

Finally, after we split our text samples into tokens, we needed to turn these tokens into numerical vectors that can be processed by our ML models. The simplest way to do this is to create a vector that keeps a count of the number of times a token shows up in the text, which is called a count vector. For our classification task, however, a simple count vector was not helpful, because certain words/tokens appeared often in the descriptions of all categories of issues. For example, the token "exploit" was common across all the data and therefore had a high frequency and weight, but "exploit" isn't unique to a particular security category, so it wasn't useful for classification.

Instead, we used the TF-IDF vectorization technique. TF means "term frequency" and IDF stands for "inverse document frequency." Using this approach, we could not only determine how often a token appeared in the dataset, but also take into account how unique that token is within the text dataset. In other words, we used a weighting scheme to scale down the impact of tokens that occur frequently across all our data.

Additionally, we had a large number of tokens, many of which weren't contributing enough to the classification task and occurred only a negligible number of times in our dataset. To counter this, we used existing statistical measures to evaluate how much each token contributed to label predictions and selected the top k tokens to form our vectors. Here, the value of k determines the size of the vector.
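A minimal sketch of this vectorization step, assuming scikit-learn's TfidfVectorizer and SelectKBest; the statistical measure (ANOVA F-test) and the value of k shown here are illustrative choices, not necessarily what we used in production.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

TOP_K = 20000  # hypothetical vector size; tune for your own dataset

def vectorize(train_texts, train_labels, val_texts, k=TOP_K):
    """TF-IDF vectorize the descriptions, then keep only the top k most informative tokens."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
    x_train = vectorizer.fit_transform(train_texts)
    x_val = vectorizer.transform(val_texts)

    # Score each token's contribution to the labels and keep the k best.
    selector = SelectKBest(f_classif, k=min(k, x_train.shape[1]))
    x_train = selector.fit_transform(x_train, train_labels)
    x_val = selector.transform(x_val)
    return x_train, x_val
```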


We decided to use a simple deep-learning model called a multi-layer perceptron (MLP), which is built from interconnected perceptrons, its smallest functional unit. We chose MLPs because they are known to perform well in text classification tasks, especially when the sequence of words doesn't matter. Since we used the bag-of-words approach to tokenize the data, an MLP was a good choice for building our model. Additionally, MLPs are simpler to implement and faster to train than other types of deep learning models.

The pre-processing work explained above resulted in a well-defined set of ticket descriptions that we could feed into the MLP model to predict the appropriate one of the six (6) Adobe OTPP security categories described above. The process here involved three (3) steps:

  1. We took the input text and vectorized it into a vector of size "k," based on the top "k" features chosen in the vectorization step;
  2. We fed the vectors into the MLP model;
  3. The model output a vector of the same size as the number of security categories we currently have (i.e., six).

The output vector contained probabilities, which told us the level of confidence the model had in the classification it selected for the input text. Based on this, we selected the category with the highest probability as the predicted category.
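A sketch of what such an MLP could look like in Keras follows; the layer sizes, dropout rate, and training settings are illustrative assumptions rather than our production configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CATEGORIES = 6   # the six OTPP categories
TOP_K = 20000        # hypothetical input vector size from the vectorization step

model = keras.Sequential([
    keras.Input(shape=(TOP_K,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    # Softmax output: one probability per security category.
    layers.Dense(NUM_CATEGORIES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # labels are integer category indices
    metrics=["accuracy"],
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20, batch_size=128)
# probabilities = model.predict(x_val)            # shape: (num_samples, 6)
# predicted_category = probabilities.argmax(axis=1)  # pick the highest-probability category
```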

After our first use of the MLP model, we found that we were seeing many false positives. In particular, the model was misclassifying issues into one specific category more than the others. This led us to the conclusion that our dataset was quite imbalanced; in other words, we had more training samples belonging to that one category than to the other categories, which biased the model's predictions toward it.

To mitigate this problem, we used two popular techniques for dealing with imbalanced datasets (a brief sketch follows the list):

  • Downsampling, which reduces the number of samples in classes with a larger number of samples; and
  • Upweighting, which helps the model pay more attention to training samples from classes with fewer samples by assigning those samples more weight.
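Here is a minimal sketch of both ideas, assuming a pandas DataFrame of labelled tickets; the column names, the "balanced" weighting scheme, and the Keras class_weight usage are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

def downsample(df: pd.DataFrame, label_col: str = "category") -> pd.DataFrame:
    """Randomly drop samples from larger classes so every class matches the smallest one."""
    smallest = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=smallest, random_state=42))
    )

def class_weights(labels: np.ndarray) -> dict:
    """Compute per-class weights that upweight under-represented classes."""
    classes = np.unique(labels)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
    return dict(zip(classes, weights))

# Upweighting with the Keras model from the previous sketch:
# model.fit(x_train, y_train, class_weight=class_weights(y_train), epochs=20)
```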

However, these measures didn't significantly improve our model's performance. At this stage, we decided to rearchitect the model.

We found that we needed an ensemble-model approach to help reduce our false positive rate. Instead of a single multi-class classifier, we trained a binary classifier for each category. For example, the binary classifier for "Validate Inputs" would only predict whether an input text belongs to that category or not. We then aggregated the output from each binary classifier to determine the best category for our final prediction. This approach significantly improved our model's performance and helped us achieve an accuracy rate of roughly 92 percent.
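A sketch of the ensemble-of-binary-classifiers idea using scikit-learn follows; the per-category classifier choice (a small MLPClassifier) and the placeholder category names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

CATEGORIES = ["validate_inputs", "category_2", "category_3",
              "category_4", "category_5", "category_6"]  # placeholder names

def train_binary_classifiers(x_train, y_train):
    """Train one binary (is / is not this category) classifier per security category."""
    classifiers = {}
    for idx, name in enumerate(CATEGORIES):
        binary_labels = (y_train == idx).astype(int)
        clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=42)
        clf.fit(x_train, binary_labels)
        classifiers[name] = clf
    return classifiers

def predict_category(classifiers, x):
    """Aggregate the per-category probabilities and pick the most confident category."""
    scores = np.column_stack(
        [classifiers[name].predict_proba(x)[:, 1] for name in CATEGORIES]
    )
    return scores.argmax(axis=1)  # index of the winning category for each sample
```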

As part of the Adobe Security team's increasing focus on exploitable vulnerabilities, we applied machine learning and data analytics to help categorize the mountains of security tickets into the handful of basic yet comprehensive issue categories that are part of the Adobe OTPP. These categories not only help make security asks more actionable for our product teams, but also enable Adobe to address the most common avenues of attack exploited by adversaries. We hope you can apply the lessons we learned to shift to a more strategic approach in your own security program.
