Scikit-LLM: NLP with ChatGPT in Scikit-Learn

admin

May 17, 2023

Introduction
Preparing the environment
Use Case #1: Multiclass Reviews Classification
Use Case #2: Multi-Label Reviews Classification
Conclusion

Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

Classification and labelling are common tasks in natural language processing (NLP). In traditional machine learning workflows, these tasks involve collecting labeled data, training a model, deploying it in the cloud, and making inferences. However, this process can be time-consuming, requires a separate model for each task, and does not always yield optimal results.

With recent advancements in the field of large language models, such as ChatGPT, we now have a new way to approach NLP tasks. Rather than training and deploying separate models for each task, we can use a single model to perform a wide variety of NLP tasks simply by providing it with a prompt.

In this article we’ll explore how to build models for multiclass and multi-label text classification using ChatGPT as a backbone. To achieve this, we’ll use the scikit-LLM library, which provides a scikit-learn compatible wrapper around the OpenAI REST API. This allows you to build the model in the same way as you would any other scikit-learn model.

As a first step, we need to install the scikit-LLM Python package.

pip install scikit-llm

Next, we need to prepare our OpenAI API keys. To create a key, follow these steps:

Go to the OpenAI platform and sign in with your account.
Click “Create New Secret Key” to generate a new key. Make sure to store the key, since once the window with the key closes, you won’t be able to reopen it.
Additionally, you will need an organization ID, which is listed in the organization settings of your OpenAI account.

Now we can configure scikit-LLM to use the generated key.

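from skllm.config import SKLLMConfig

# replace the placeholders with your own secret key and organization ID
SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")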
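
Hard-coding secrets in a script makes them easy to leak, so one common alternative is to read them from environment variables. The following is a minimal sketch of that idea. The variable names OPENAI_API_KEY and OPENAI_ORG_ID and the helper load_openai_credentials are assumptions made for illustration; they are not part of scikit-LLM.

```python
import os
from typing import Mapping, Optional, Tuple


def load_openai_credentials(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_key, org_id) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    org = env.get("OPENAI_ORG_ID", "").strip()
    if not key or not org:
        raise RuntimeError(
            "Set OPENAI_API_KEY and OPENAI_ORG_ID before running the script."
        )
    return key, org


# Usage with scikit-LLM (requires both variables to be set):
# from skllm.config import SKLLMConfig
# key, org = load_openai_credentials()
# SKLLMConfig.set_openai_key(key)
# SKLLMConfig.set_openai_org(org)
```

Failing early with a clear error is usually preferable to sending a request with an empty key and decoding the API error afterwards.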

Let’s take a look at a very common task: text sentiment prediction. The dataset consists of movie reviews. The possible sentiments are positive, neutral, or negative. A sample of the dataset is shown below:

+-----------------------------------------------------------------------------+----------+
| Review                                                                      |  Label   |
+-----------------------------------------------------------------------------+----------+
| I was absolutely blown away by the performances in 'Summer's End'.          |          |
| The acting was top-notch, and the plot had me gripped from start to finish. | Positive |
| A truly captivating cinematic experience that I would highly recommend.     |          |
+-----------------------------------------------------------------------------+----------+
| I was thoroughly disappointed with 'Silver Shadows'.                        |          |
| The plot was confusing and the performances were lackluster.                | Negative |
| I would not recommend wasting your time on this one.                        |          |
+-----------------------------------------------------------------------------+----------+
| 'The Last Frontier' was simply okay.                                        |          |
| The plot was decent and the performances were acceptable.                   | Neutral  |
| However, it lacked a certain spark to make it truly memorable.              |          |
+-----------------------------------------------------------------------------+----------+

We need to initialize ZeroShotGPTClassifier, which takes the model name as a parameter. In our example we’ll use the gpt-3.5-turbo model (the default ChatGPT model). The list of available models can be found in the OpenAI documentation. Afterwards, we train the classifier using the fit() method and predict the labels by calling the predict() method. Scikit-LLM will automatically query the OpenAI API and transform the response into a regular list of labels. Additionally, Scikit-LLM will ensure that the obtained response contains a valid label. If this is not the case, a label will be assigned randomly (with label probabilities being proportional to label occurrences in the training set).

Note: as we’re using zero-shot text classification, where the model doesn’t see any prior training examples, it’s crucial that the labels are expressed in natural language and are descriptive and self-explanatory.
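
To see why label wording matters, it helps to picture how a zero-shot prompt might be assembled: the label names are the only description of each class the model receives. The sketch below uses a hypothetical template and a hypothetical helper, build_zero_shot_prompt; the prompt scikit-LLM actually sends may look different.

```python
from typing import Sequence


def build_zero_shot_prompt(text: str, labels: Sequence[str]) -> str:
    """Compose a classification prompt (hypothetical template, not scikit-LLM's)."""
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        "Classify the text between <<< and >>> into exactly one of the "
        f"following categories: {label_list}.\n"
        "Answer with the category name only.\n"
        f"Text: <<<{text}>>>"
    )


prompt = build_zero_shot_prompt(
    "The plot was confusing and the performances were lackluster.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

With labels such as "0", "1", and "2" the same prompt would give the model nothing to reason about, which is why descriptive label names are important here.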

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# demo sentiment analysis dataset
X, y = get_classification_dataset()

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)  # queries the OpenAI API

In the example above, we passed a labelled training dataset to the classifier. This is done solely to keep the API scikit-learn compatible. In fact, X is not used during training at all. Furthermore, for y it is sufficient to provide the candidate labels in an arbitrary order. Therefore, even when no labelled training data is available, the model can still be built (as shown below).

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# only the texts are needed; the labels are discarded
X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
# no training texts: pass the candidate labels only
clf.fit(None, ['positive', 'negative', 'neutral'])
labels = clf.predict(X)
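
The behaviour described in this section (fitting only records the labels, and an invalid response is replaced by a random label weighted by label frequency) can be illustrated with a small toy model. This is a sketch under the assumptions stated in the text, not scikit-LLM’s actual code; the class LabelValidator and its methods are hypothetical names.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


class LabelValidator:
    """Toy model of the label bookkeeping described above (not scikit-LLM code)."""

    def fit(self, y: Sequence[str]) -> "LabelValidator":
        # Record the distinct labels and how often each one occurs in y.
        counts = Counter(y)
        total = sum(counts.values())
        self.classes_: List[str] = list(counts)
        self.probabilities_: List[float] = [counts[c] / total for c in self.classes_]
        return self

    def validate(self, response: str, rng: Optional[random.Random] = None) -> str:
        """Return the response if it names a known label, else a random fallback."""
        cleaned = response.strip().strip("'\".").lower()
        for label in self.classes_:
            if cleaned == label.lower():
                return label
        rng = random.Random() if rng is None else rng
        # Fallback: sample a label with probability proportional to its frequency.
        return rng.choices(self.classes_, weights=self.probabilities_, k=1)[0]


validator = LabelValidator().fit(["positive", "negative", "neutral"])
print(validator.validate(" Positive. "))    # prints: positive
print(validator.validate("I am not sure"))  # one of the three labels, chosen at random
```

When only the candidate labels are passed, every label occurs once, so the fallback in this sketch is a uniform random choice.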

Another common NLP task is multi-label classification, meaning each sample can be assigned to one or several distinct classes.

+----------------------------------------------------------------+-------------------------+
| Review                                                         | Labels                  |
+----------------------------------------------------------------+-------------------------+
| The food was delicious and the service was excellent.          | Food, Service           |
| The hotel room was clean and comfortable.                      | Accommodation           |
| I loved the friendly staff and the beautiful decor.            | Service, Ambiance       |
| The movie was entertaining but the ending was disappointing.   | Entertainment           |
| The product arrived on time and was of great quality.          | Delivery, Quality       |
| The concert was electrifying and the band was energetic.       | Entertainment, Music    |
| The customer support was helpful and quick.                    | Service, Support        |
| The book had an enticing plot and well-developed characters.   | Literature, Storytelling|
| The hiking trail offered breathtaking views.                   | Outdoor, Adventure      |
| The museum had a large collection of art and artifacts.        | Culture, Art            |
+----------------------------------------------------------------+-------------------------+

For this task we can use MultiLabelZeroShotGPTClassifier. The structure of the code remains the same, with the only difference being that each label in y is a list.

from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# demo dataset for multi-label classification
X, y = get_multilabel_classification_dataset()

# max_labels caps the number of labels assigned to a single sample
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
clf.fit(X, y)
labels = clf.predict(X)
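
The max_labels argument limits how many labels one sample can receive. As a rough illustration of what such post-processing could look like, the sketch below keeps only known labels from a reply and truncates the result. The comma-separated reply format and the helper parse_multilabel_response are assumptions made for this example, not scikit-LLM’s documented behaviour.

```python
from typing import List, Sequence


def parse_multilabel_response(
    response: str, candidates: Sequence[str], max_labels: int = 3
) -> List[str]:
    """Keep known labels from a comma-separated reply, capped at max_labels."""
    lookup = {c.lower(): c for c in candidates}
    picked: List[str] = []
    for part in response.split(","):
        # Drop surrounding whitespace, quotes, brackets, and trailing periods.
        label = lookup.get(part.strip().strip("'\"[].").lower())
        if label is not None and label not in picked:
            picked.append(label)
    return picked[:max_labels]


candidates = ["Food", "Service", "Ambiance", "Delivery"]
print(parse_multilabel_response("Service, ambiance, Price, Food, Delivery", candidates))
# prints: ['Service', 'Ambiance', 'Food']
```

Here "Price" is dropped because it is not a candidate label, and "Delivery" is dropped because the cap of three labels has already been reached.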

Similarly to ZeroShotGPTClassifier, it is sufficient to provide only the candidate labels. However, this time the classifier expects y of type List[List[str]]. Since the actual structure or ordering of the labels is irrelevant, we can simply wrap our list of candidate labels in an additional outer list.

from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# only the texts are needed; the labels are discarded
X, _ = get_multilabel_classification_dataset()
candidate_labels = [
"Quality", 
"Price", 
"Delivery", 
"Service", 
"Product Variety", 
"Customer Support", 
"Packaging", 
"User Experience", 
"Return Policy", 
"Product Information"
]
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
clf.fit(None, [candidate_labels])
labels = clf.predict(X)
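
To make the wrapping trick concrete: if a classifier only needs the set of distinct labels that occur in y, then per-sample label lists and a single wrapped list of candidates carry the same information. The sketch below demonstrates this with a hypothetical helper, extract_candidate_labels; it illustrates the idea and is not scikit-LLM’s implementation.

```python
from typing import List, Sequence


def extract_candidate_labels(y: Sequence[Sequence[str]]) -> List[str]:
    """Collect the distinct labels in y, in order of first appearance."""
    seen: List[str] = []
    for sample_labels in y:
        for label in sample_labels:
            if label not in seen:
                seen.append(label)
    return seen


per_sample = [["Quality", "Price"], ["Delivery"], ["Price", "Service"]]
wrapped = [["Quality", "Price", "Delivery", "Service"]]

# Both forms describe the same candidate set.
print(extract_candidate_labels(per_sample))
# prints: ['Quality', 'Price', 'Delivery', 'Service']
print(set(extract_candidate_labels(per_sample)) == set(extract_candidate_labels(wrapped)))
# prints: True
```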

Scikit-LLM is an easy and efficient way to build ChatGPT-based text classification models using conventional scikit-learn compatible estimators, without having to interact with the OpenAI APIs manually.
