Write Readable Tests for Your Machine Learning Models with Behave

Use natural language to check the behavior of your ML models

Imagine you create an ML model to predict customer sentiment based on reviews. Upon deploying it, you realize that the model incorrectly labels certain positive reviews as negative once they’re rephrased using negative words.


This is only one example of how a seemingly accurate ML model can fail without proper testing. Thus, testing your model for accuracy and reliability is crucial before deployment.

But how do you test your ML model? One straightforward approach is to use unit tests:

from textblob import TextBlob

def test_sentiment_the_same_after_paraphrasing():
    sent = "The hotel room was great! It was spacious, clean and had a pleasant view of town."
    sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had an honest view of town."

    sentiment_original = TextBlob(sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(sent_paraphrased).sentiment.polarity

    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

This approach works, but it may be difficult for non-technical or business stakeholders to understand. Wouldn’t it be nice if you could incorporate project objectives and goals into your tests, expressed in natural language?


That’s when behave comes in handy.

Feel free to play with and fork the source code of this article here:

behave is a Python framework for behavior-driven development (BDD). BDD is a software development methodology that:

  • Emphasizes collaboration between stakeholders (such as business analysts, developers, and testers)
  • Enables users to define requirements and specifications for a software application

Since behave provides a common language and format for expressing requirements and specifications, it is well suited for outlining and validating the behavior of machine learning models.

To install behave, type:

pip install behave

Let’s use behave to perform various tests on machine learning models.

Invariance Testing

Invariance testing checks whether an ML model produces consistent results under different conditions.

An example of invariance testing is verifying whether a model is invariant to paraphrasing. If it is not, the model could misclassify a positive review as negative when the review is rephrased using negative words.


Feature File

To use behave for invariance testing, create a directory called features. Under that directory, create a file called invariant_test_sentiment.feature.

└── features/
    └── invariant_test_sentiment.feature

Inside the invariant_test_sentiment.feature file, we specify the project requirements:

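The feature file itself is shown as an image in the original post. Reconstructed from the behave output later in this section, its contents look roughly like this:

Feature: Sentiment Analysis
  As a data scientist
  I want to make sure that my model is invariant to paraphrasing
  So that my model can produce consistent results in real-world scenarios.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both texts should have the same sentiment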

The “Given,” “When,” and “Then” parts of this file describe the actual steps that will be executed by behave during the test.

Python Step Implementation

To implement the steps used in the scenarios with Python, start by creating the features/steps directory and a file called invariant_test_sentiment.py inside it:

└── features/
    ├── invariant_test_sentiment.feature
    └── steps/
        └── invariant_test_sentiment.py

The invariant_test_sentiment.py file contains the following code, which tests whether the sentiment produced by the TextBlob model is consistent between the original text and its paraphrased version.

from behave import given, then, when
from textblob import TextBlob

@given("a text")
def step_given_positive_sentiment(context):
    context.sent = "The hotel room was great! It was spacious, clean and had a pleasant view of town."

@when("the text is paraphrased")
def step_when_paraphrased(context):
    context.sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had an honest view of town."

@then("both texts should have the same sentiment")
def step_then_sentiment_analysis(context):
    # Get the sentiment of each sentence
    sentiment_original = TextBlob(context.sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(context.sent_paraphrased).sentiment.polarity

    # Print the sentiments
    print(f"Sentiment of the original text: {sentiment_original:.2f}")
    print(f"Sentiment of the paraphrased sentence: {sentiment_paraphrased:.2f}")

    # Assert that both sentences have the same sentiment
    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative

Explanation of the code above:

  • The steps are identified using decorators matching the feature’s predicate: given, when, and then.
  • Each decorator accepts a string containing the rest of the phrase in the matching scenario step.
  • The context variable lets you share values between steps.

Run the Test

To run the invariant_test_sentiment.feature test, type the following command:

behave features/invariant_test_sentiment.feature

Output:

Feature: Sentiment Analysis # features/invariant_test_sentiment.feature:1
  As a data scientist
  I want to make sure that my model is invariant to paraphrasing
  So that my model can produce consistent results in real-world scenarios.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both texts should have the same sentiment
      Traceback (most recent call last):
        assert both_positive or both_negative
      AssertionError

      Captured stdout:
      Sentiment of the original text: 0.66
      Sentiment of the paraphrased sentence: -0.38

Failing scenarios:
  features/invariant_test_sentiment.feature:6  Paraphrased text

0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
2 steps passed, 1 failed, 0 skipped, 0 undefined

The output shows that the first two steps passed and the last step failed, indicating that the model is affected by paraphrasing.

Directional Testing

Directional testing is a statistical method used to assess whether the impact of an independent variable on a dependent variable is in a specific direction, either positive or negative.

An example of directional testing is to check whether the presence of a specific word has a positive or negative effect on the sentiment score of a given text.


To use behave for directional testing, we will create two files: directional_test_sentiment.feature and directional_test_sentiment.py.

└── features/
    ├── directional_test_sentiment.feature
    └── steps/
        └── directional_test_sentiment.py

Feature File

The code in directional_test_sentiment.feature specifies the requirements of the project as follows:

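The feature file is again shown as an image in the original post. Based on the test output below, its contents are roughly:

Feature: Sentiment Analysis with Specific Word
  As a data scientist
  I want to make sure that the presence of a specific word has a positive or negative effect on the sentiment score of a text

  Scenario: Sentiment analysis with specific word
    Given a sentence
    And the same sentence with the addition of the word 'awesome'
    When I input the new sentence into the model
    Then the sentiment score should increase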

Notice that “And” is added to the prose. Since the preceding step starts with “Given,” behave treats “And” as another “Given” step.

Python Step Implementation

The code in directional_test_sentiment.py implements a test scenario that checks whether the presence of the word “awesome” positively affects the sentiment score generated by the TextBlob model.

from behave import given, then, when
from textblob import TextBlob

@given("a sentence")
def step_given_positive_word(context):
    context.sent = "I really like this product"

@given("the same sentence with the addition of the word '{word}'")
def step_given_a_positive_word(context, word):
    context.new_sent = f"I really like this {word} product"

@when("I input the new sentence into the model")
def step_when_use_model(context):
    context.sentiment_score = TextBlob(context.sent).sentiment.polarity
    context.adjusted_score = TextBlob(context.new_sent).sentiment.polarity

@then("the sentiment score should increase")
def step_then_positive(context):
    assert context.adjusted_score > context.sentiment_score

The second step uses the parameter syntax {word}. When the .feature file is run, the value specified for {word} in the scenario is automatically passed to the corresponding step function.

This means that if the scenario states that the same sentence should include the word “awesome,” behave will automatically replace {word} with “awesome.”

This parameterization is useful when you want to try different values for the {word} parameter: you only need to edit the .feature file, not the .py file.
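For example, a second scenario could reuse the same step definitions with a different word. The scenario below is a hypothetical addition to the .feature file and requires no change to the Python steps:

  Scenario: Sentiment analysis with another positive word
    Given a sentence
    And the same sentence with the addition of the word 'fantastic'
    When I input the new sentence into the model
    Then the sentiment score should increase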

Run the Test

behave features/directional_test_sentiment.feature

Output:

Feature: Sentiment Analysis with Specific Word
  As a data scientist
  I want to make sure that the presence of a specific word has a positive or negative effect on the sentiment score of a text

  Scenario: Sentiment analysis with specific word
    Given a sentence
    And the same sentence with the addition of the word 'awesome'
    When I input the new sentence into the model
    Then the sentiment score should increase

1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 0 skipped
4 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can infer that the sentiment score increases due to the presence of the new word.

Minimum Functionality Testing

Minimum functionality testing verifies whether the system or product meets the minimum requirements and is functional for its intended use.

One example of minimum functionality testing is to check whether the model can handle different types of inputs, such as numerical, categorical, or textual data.


To use minimum functionality testing for input validation, create two files: minimum_func_test_input.feature and minimum_func_test_input.py.

└── features/
    ├── minimum_func_test_input.feature
    └── steps/
        └── minimum_func_test_input.py

Feature File

The code in minimum_func_test_input.feature specifies the project requirements as follows:

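The feature file appears as an image in the original post. Judging from the test output below, it contains roughly the following three scenarios:

Feature: Test my_ml_model

  Scenario: Test integer input
    Given I have an integer input of 42
    When I run the model
    Then the output should be an array of one number

  Scenario: Test float input
    Given I have a float input of 3.14
    When I run the model
    Then the output should be an array of one number

  Scenario: Test list input
    Given I have a list input of [1, 2, 3]
    When I run the model
    Then the output should be an array of three numbers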

Python Step Implementation

The code in minimum_func_test_input.py implements the requirements, checking whether the output generated by predict for a specific input type meets expectations.

from behave import given, then, when

import numpy as np
from sklearn.linear_model import LinearRegression
from typing import Union

def predict(input_data: Union[int, float, str, list]):
    """Create a model and predict an output for the input data"""

    # Reshape the input data
    if isinstance(input_data, (int, float, list)):
        input_array = np.array(input_data).reshape(-1, 1)
    else:
        raise ValueError("Input type not supported")

    # Create a linear regression model
    model = LinearRegression()

    # Train the model on a sample dataset
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([2, 4, 6, 8, 10])
    model.fit(X, y)

    # Predict the output using the input array
    return model.predict(input_array)

@given("I have an integer input of {input_value}")
def step_given_integer_input(context, input_value):
    context.input_value = int(input_value)

@given("I have a float input of {input_value}")
def step_given_float_input(context, input_value):
    context.input_value = float(input_value)

@given("I have a list input of {input_value}")
def step_given_list_input(context, input_value):
    context.input_value = eval(input_value)

@when("I run the model")
def step_when_run_model(context):
    context.output = predict(context.input_value)

@then("the output should be an array of one number")
def step_then_check_single_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 1

@then("the output should be an array of three numbers")
def step_then_check_three_outputs(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 3

Run the Test

behave features/minimum_func_test_input.feature

Output:

Feature: Test my_ml_model

  Scenario: Test integer input
    Given I have an integer input of 42
    When I run the model
    Then the output should be an array of one number

  Scenario: Test float input
    Given I have a float input of 3.14
    When I run the model
    Then the output should be an array of one number

  Scenario: Test list input
    Given I have a list input of [1, 2, 3]
    When I run the model
    Then the output should be an array of three numbers

1 feature passed, 0 failed, 0 skipped
3 scenarios passed, 0 failed, 0 skipped
9 steps passed, 0 failed, 0 skipped, 0 undefined

Since all the steps passed, we can conclude that the model outputs match our expectations.

behave vs. pytest

This section outlines some drawbacks of using behave compared to pytest, and explains why the tool may still be worth considering.

Learning Curve

Using Behavior-Driven Development (BDD) with behave may involve a steeper learning curve than the more traditional testing approach used by pytest.

Counterargument: BDD’s focus on collaboration can lead to better alignment between business requirements and software development, resulting in a more efficient development process overall.


Slower performance

behave tests may be slower than pytest tests because behave must parse the feature files and map them to step definitions before running the tests.

Counterargument: behave’s focus on well-defined steps can result in tests that are easier to understand and modify, reducing the overall effort required for test maintenance.


Less flexibility

behave is more rigid in its syntax, while pytest allows more flexibility in defining tests and fixtures.
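To illustrate, here is a rough sketch (not from the original article) of how the earlier paraphrasing check could be written as a parametrized pytest test, with the sentence pair supplied as data rather than as natural-language steps:

import pytest
from textblob import TextBlob

@pytest.mark.parametrize(
    "original, paraphrased",
    [
        (
            "The hotel room was great! It was spacious, clean and had a pleasant view of town.",
            "The hotel room wasn't bad. It wasn't cramped, dirty, and had an honest view of town.",
        ),
        # Add more sentence pairs here to extend the test
    ],
)
def test_sentiment_the_same_after_paraphrasing(original, paraphrased):
    # The sign of the polarity should not change after paraphrasing
    sentiment_original = TextBlob(original).sentiment.polarity
    sentiment_paraphrased = TextBlob(paraphrased).sentiment.polarity
    assert (sentiment_original > 0) == (sentiment_paraphrased > 0)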

Counterargument: behave’s rigid structure helps ensure consistency and readability across tests, making them easier to understand and maintain over time.


Summary

Although behave has some drawbacks compared to pytest, its focus on collaboration, well-defined steps, and structured approach can still make it a worthwhile tool for development teams.

Congratulations! You have just learned how to use behave to test machine learning models. I hope this knowledge will help you create more comprehensible tests.
