End to End ML with GPT-3.5

Learn how to use GPT-3.5 to do the heavy lifting for data acquisition, preprocessing, model training, and deployment

Plenty of repetitive boilerplate code exists in the model development phase of any machine learning application. Popular libraries such as PyTorch Lightning have been created to standardize the operations performed when training/evaluating neural networks, resulting in much cleaner code. However, boilerplate extends far beyond training loops. Even the data acquisition phase of machine learning projects is filled with steps that are necessary but time consuming. One way to deal with this challenge would be to create a library similar to PyTorch Lightning for the entire model development process. It would need to be general enough to work with a variety of model types beyond neural networks, and capable of integrating a variety of data sources.

Code examples for extracting data, preprocessing, model training, and deployment are readily available on the web, though gathering them and integrating them into a project takes time. Since such code is on the web, chances are it has been trained on by a large language model (LLM) and can be rearranged in a variety of useful ways through natural language commands. The goal of this post is to show how easy it is to automate many of the steps common to ML projects using the GPT-3.5 API from OpenAI. I'll show some failure cases along the way, and how to tune prompts to fix bugs when possible. Starting from scratch, without even so much as a dataset, we'll end up with a model that's ready to be deployed on AWS SageMaker. If you're following along, make sure to set up the OpenAI API as follows:

import openai
openai.api_key = "YOUR KEY HERE"
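
If you'd rather not hard-code the key (a small tweak on my part, not required by the rest of the post), you can read it from an environment variable instead:

import os
import openai

# Assumes the key is stored in the OPENAI_API_KEY environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]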

Also, the following utility function is handy for calling the GPT-3.5 API:

def get_api_result(prompt):
    request = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": prompt}]
    )

    result = request['choices'][0]['message']['content']

    print(result)
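
As a quick sanity check (my own usage example, not part of the original walkthrough), calling the helper with a trivial instruction should print a short snippet of generated Python to the console:

# The exact output will vary from run to run
get_api_result("Write Python code that prints the numbers from 1 to 10")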

ETL Illustration by me + Midjourney

This section is simplified since it only considers a single data source, but can in principle be extended to situations where data comes from multiple sources (csv files, databases, images, etc.). The first step is to extract some data. For the sake of simplicity, I'll use the Income Prediction¹ dataset, where the goal is to predict whether an individual earns more/less than $50k per year based on their education, job position, industry, etc. The function below will be used to generate the code that downloads our data. Note how the prompt template is designed to bias the API to generate Python code:

def extract(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

Here we ask GPT-3.5 to:

Retrieve the adult income prediction dataset from openml using the sklearn fetch_openml function. Make sure to retrieve the data as a single dataframe which includes the target in a column named “target”. Name the resulting dataframe “df”.

which gives us the output:

import pandas as pd
from sklearn.datasets import fetch_openml

adult_income = fetch_openml(name='adult', version=2)
df = pd.DataFrame(adult_income.data, columns=adult_income.feature_names)
df['target'] = adult_income.target
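
To verify the extraction worked (a check I'm adding here, not GPT output), a quick look at the shape and the target distribution is enough:

# The adult dataset should have roughly 48k rows and a binary target ('<=50K' / '>50K')
print(df.shape)
print(df["target"].value_counts())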

This code is free of bugs and gives us exactly what we want. Had I used a simpler prompt by removing mentions of openml and the function to use for retrieval, we would get:

import pandas as pd

df = pd.read_csv("adult_income_prediction.csv")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
"marital-status", "occupation", "relationship", "race",
"sex", "capital-gain", "capital-loss", "hours-per-week",
"native-country", "target"]

This assumes that the data is locally available. What's interesting about this result is that it has the correct column names as a list, even though we didn't include them in the API call! These names, however, are all over the web, particularly in this Medium post, apart from the target column which is added by GPT. The next step is to transform the data into a format that is usable by machine learning models. We'll have to use a more advanced prompt template here since GPT must be made aware of the column names and their types:

def transform(columns, column_types, prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code using Pandas to preprocess the DataFrame. The DataFrame columns are {} and their corresponding dtypes are {}.\nInput: {}\nPython code:""".format(columns, column_types, prompt)

    get_api_result(prompt_template)

I then make the following request:

Preprocess the dataframe by converting all categorical columns to their one-hot encoded equivalents, and normalizing numerical columns. Drop rows which have an NA or NaN value in any column. Drop rows that have numeric column outliers as determined by their z score. A numeric column outlier is a value that is outside of the 1 to 99 inter-quantile range. The numerical columns should be normalized using StandardScaler from sklearn. The values in the target column should be converted to 0 or 1 and should be of type int.

We now get:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the DataFrame
df = pd.read_csv('data.csv')

# Convert categorical columns to one-hot encoded equivalents
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])

# Normalize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Drop rows with NA or NaN values
df.dropna(inplace=True)

# Drop rows with numeric column outliers
z_scores = df[numerical_cols].apply(lambda x: (x - x.mean()) / x.std())
df = df[(z_scores > -3).all(axis=1) & (z_scores < 3).all(axis=1)]

# Convert target column to 0 or 1 and change type to int
df['target'] = df['target'].apply(lambda x: 1 if x == '>50K' else 0).astype(int)

which is almost perfect for our use if we exclude the part that tries to load in the data from a csv file. It's quite difficult to convince GPT to exclude this, even if we explicitly tell it to assume that df exists and should not be loaded. Lastly, we need to load the data into a local database. This is overkill for such a simple use case, but is a good habit to develop.

def load(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

I use the following instruction:

Connect to an sqlite database named “data”. Use pandas to insert data from a DataFrame named “df” into a table named “income”. Do not include the index column. Commit the changes before closing the connection.

in order to get:

import sqlite3
import pandas as pd

# Create a connection to the database
connection = sqlite3.connect("data.db")

# Define the DataFrame
df = pd.DataFrame()

# Define the table name
table_name = "income"

# Insert the data into the table
df.to_sql(table_name, connection, if_exists="replace", index=False)

# Commit the changes to the database
connection.commit()

# Close the connection
connection.close()

which is perfect, minus the unnecessary creation of df. With this processed data in hand, we're ready to train some models.
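
The training prompt below assumes a dataframe named “df” is already in memory. If you're instead resuming from the database, here is a minimal read-back sketch (my addition, assuming the data.db file and income table created above):

import sqlite3
import pandas as pd

# Re-load the processed data from the sqlite table created in the load step
connection = sqlite3.connect("data.db")
df = pd.read_sql_query("SELECT * FROM income", connection)
connection.close()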

Illustration of a loss function by me + Midjourney

Resources permitting, it's a good idea to try out a few different model types to identify the one with the right level of complexity for the given task. Therefore, we ask GPT-3.5 to try out a few different models. First, let's set up the generic prompt template for model training:

def train(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Focus on using scikit-learn when applicable. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

and the prompt we’ll be using is:

Train a variety of classification models to predict the “target” column using all other columns. Do so using 5-fold cross validation to choose the best model and corresponding set of hyperparameters, and return the best overall model and corresponding hyperparameter settings. Choose the best model based on accuracy. Assume a dataframe named “df” exists which is to be used for training. Log the entire process using MLFlow. Start logging with mlflow before training any models so only a single run is stored. Make sure that the model is logged using the sklearn module of mlflow. Make sure that only the best overall model is logged, but log metrics for all model types. The mean value of the following metrics on all cross validation folds should be logged: accuracy, AUC, F1 score

Let's take a look at the generated output, and this time go deeper into why some of the specific instructions needed to be provided.

import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import mlflow
import mlflow.sklearn

# load the dataframe
df = pd.read_csv('path/to/dataframe.csv')

# define the features and target columns
features = df.columns.drop('target')
target = 'target'

# set up mlflow
mlflow.set_tracking_uri('localhost:5000')  # or set up your own server
mlflow.set_experiment("my_experiment")

# set up cross-validation
cv = 5

# define the models and corresponding hyperparameters to be tuned
models = [
    {
        'name': 'Logistic Regression',
        'estimator': LogisticRegression(),
        'hyperparameters': {
            'penalty': ['l1', 'l2'],
            'C': [0.1, 1, 10]
        }
    },
    {
        'name': 'Decision Tree',
        'estimator': DecisionTreeClassifier(),
        'hyperparameters': {
            'max_depth': [2, 4, 6, 8]
        }
    },
    {
        'name': 'Random Forest',
        'estimator': RandomForestClassifier(),
        'hyperparameters': {
            'n_estimators': [10, 25, 50, 100],
            'max_depth': [None, 2, 4, 6, 8]
        }
    }
]

# iterate through the models and corresponding hyperparameters to train and tune them
with mlflow.start_run():
    best_model = None
    best_metrics = {}

    for model in models:
        print(f"Training {model['name']} model...")

        # set up grid search for hyperparameter tuning
        gs = GridSearchCV(model['estimator'], model['hyperparameters'], cv=cv, verbose=2, n_jobs=-1, scoring='accuracy')
        gs.fit(df[features], df[target])

        # log metrics for all model types
        metrics = cross_validate(gs.best_estimator_, df[features], df[target], cv=cv, n_jobs=-1,
                                 scoring=["accuracy", "roc_auc", "f1"])
        for metric, values in metrics.items():
            mlflow.log_metric(f"{model['name']}_{metric}", values.mean())

        # check if this is the best model based on accuracy
        if best_model is None or gs.best_score_ > best_metrics['accuracy']:
            best_model = gs.best_estimator_
            best_metrics = {'accuracy': gs.best_score_,
                            'AUC': roc_auc_score(df[target], best_model.predict_proba(df[features])[:, 1]),
                            'F1': f1_score(df[target], best_model.predict(df[features]))
                            }

    # register the best model in mlflow
    mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model")

    # log the best model's metrics
    for metric, value in best_metrics.items():
        mlflow.log_metric(f"best_model_{metric}", value)

If we remove the loading of df and the section # set up mlflow, we end up with exactly what's desired. Namely, a loop over 3 different model types, performing a grid search using 5-fold cross validation to identify the best hyperparameters for the given model type, while keeping track of metrics. Without specifying "choose the best model based on accuracy", the generated code will use scoring=["accuracy", "roc_auc", "f1"] for the grid search, which will not work since there is ambiguity as to how to select the best model according to multiple metrics. Without "make sure that the model is logged using the sklearn module of mlflow", we sometimes end up with mlflow.log_model(), which is wrong. Also, "make sure that only the best overall model is logged" is important to avoid storing all models. Overall, this output is acceptable, but it's unstable, and running it multiple times is likely to introduce different bugs. In order to have everything ready for the serving step, it is useful to add the model signature when saving the model. This signature is basically a collection of feature names and their corresponding types. It's a pain to get GPT-3.5 to add this, so some manual labor has to be done by first adding the import:

from mlflow.models.signature import infer_signature

and then modifying the line of code which logs the model via:

mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model", signature=infer_signature(df[features], best_model.predict(df[features])))
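
If you're curious what the signature actually contains (an optional check on my part, not generated code), you can inspect the inferred signature before logging it:

# Prints the input columns/dtypes and output type captured by the signature
signature = infer_signature(df[features], best_model.predict(df[features]))
print(signature)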

Illustration of deployment by me + Midjourney

Since we used MLflow to log the best model, we have a couple of options for serving it. The simplest option is to host the model locally. Let's first design the general prompt template for model serving:

def serve_model(model_path, prompt):
    prompt_template = """You are a ChatGPT language model that can generate shell code for deploying models using MLFlow. Please provide a natural language input text, and I will generate the corresponding command to deploy the model. The model is located in the file {}.\nInput: {}\nShell command:""".format(model_path, prompt)

    get_api_result(prompt_template)

and the prompt will be:

Serve the model using port number 1111, and use the local environment manager

By calling serve_model("", query) we get:

mlflow models serve -m  -p 1111 --no-conda

Once we run this command in the shell, we're ready to make predictions by sending data encoded as JSON to the model. We'll first generate the command to send data to the model, and then create the JSON payload to be inserted into the command.

def send_request(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command.\nInput: {}\nCommand:""".format(prompt)

    get_api_result(prompt_template)

The following request will be inserted into the prompt template in send_request():

Use the “curl” command to send data “” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.

The output generated by GPT-3.5 is:

curl -X POST -H "Content-Type: application/json" -d '' http://localhost:1111/invocations

It's preferable to have the URL immediately after curl instead of at the very end of the command, i.e.

curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d ''

Getting GPT-3.5 to do this isn't easy. Each of the following requests fails to do so:

Use the “curl” command to send data “” to an mlflow model hosted at port 1111 on localhost. Place the URL immediately after “curl”. Make sure that the content type is “application/json”.

Use the “curl” command, with the URL placed before any argument, to send data “” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.

Perhaps it's possible to get the desired output if we have GPT-3.5 modify an existing command rather than generate one from scratch. Here is the generic template for modifying commands:

def modify_request(prompt):
    prompt_template = """You are a ChatGPT language model that can modify commands for sending data using "curl". Please provide a natural language instruction and the corresponding command, and I will generate the modified command.\nInput: {}\nCommand:""".format(prompt)

    get_api_result(prompt_template)

We’ll call this function as follows:

code = """curl -X POST -H "Content-Type: application/json" -d '' http://localhost:1111/invocations"""
prompt = """Please modify the following by placing the url before the "-X POST" argument:\n{}""".format(code)
modify_request(prompt)

which finally gives us:

curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d ''

Now it's time to create the payload:

def create_payload(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)

The prompt for this part needed quite a bit of tuning to get the desired output format:

Convert the DataFrame “df” to json format that can be received by a deployed MLFlow model. Wrap the resulting json in an object called “dataframe_split”. The resulting string should not have newlines, and it should not escape quotes. Also, “dataframe_split” should be surrounded by double quotes instead of single quotes. Do not include the “target” column. Use the split “orient” argument.

Without the specific instruction to avoid newlines and escaping quotes, a call to json.dumps() is made, which is not the format that the MLflow endpoint expects. The generated code is:

json_data = df.drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'

Before replacing the data placeholder in the curl request with the value of wrapped_data, we probably want to send only a few rows of data for prediction, otherwise the resulting payload is too large. So we modify the above to be:

json_data = df[:5].drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'

Invoking the model gives:

{"predictions": [0, 0, 0, 1, 0]}

whereas the actual targets are [0, 0, 1, 1, 0].
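
If you'd rather stay in Python than shell out to curl, an equivalent invocation (a sketch of mine, assuming the model is still being served on port 1111) can be made with the requests library:

import requests

# POST the same JSON payload to the local MLflow scoring endpoint
response = requests.post(
    "http://localhost:1111/invocations",
    headers={"Content-Type": "application/json"},
    data=wrapped_data,
)
print(response.json())  # e.g. {"predictions": [...]}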

There we have it. At the beginning of this post, we didn't even have access to a dataset, yet we've managed to end up with a deployed model that was chosen to be the best through cross-validation. Importantly, GPT-3.5 did all of the heavy lifting, and only required minimal assistance along the way. I did, however, have to specify particular libraries to use and methods to call, but this was mainly required to resolve ambiguities. Had I specified "Log the entire process" instead of "Log the entire process using MLFlow", GPT-3.5 would have had too many libraries to choose from, and the resulting model format might not have been useful for serving with MLflow. Thus, some knowledge of the tools used to perform the various steps in the ML pipeline is required to have success using GPT-3.5, but it is minimal compared to the knowledge required to code from scratch.

Another option for serving the model is to host it as a SageMaker endpoint on AWS. Despite how easy this may look on the MLflow website, I assure you that as with many examples on the web involving AWS, things will go wrong. First of all, Docker must be installed in order to generate the Docker image using the command:

mlflow sagemaker build-and-push-container

Second, the Python library boto3 used to communicate with AWS also requires installation. Beyond this, permissions must be properly set up such that the SageMaker, ECR, and S3 services can communicate with one another on behalf of your account. Here are the commands I ended up having to use:

mlflow deployments run-local -t sagemaker -m  --name income_classifier
mlflow deployments create -t sagemaker --name income_classifier -m model/ --config image_url= --config bucket=mlflow-serving --config region_name=us-east-1

along with some manual tinkering behind the scenes to get the S3 bucket to be in the correct region.
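
Once the endpoint is up, it is queried through SageMaker rather than curl. Here is a minimal sketch (my addition, assuming the income_classifier endpoint name and us-east-1 region used above) using boto3's sagemaker-runtime client:

import boto3

# Invoke the deployed SageMaker endpoint with the same JSON payload used locally
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = runtime.invoke_endpoint(
    EndpointName="income_classifier",
    ContentType="application/json",
    Body=wrapped_data,
)
print(response["Body"].read().decode("utf-8"))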

With the help of GPT-3.5 we went through the ML pipeline in a (mostly) painless way, though the last mile was a bit trickier. Note how I didn't use GPT-3.5 to generate the commands for serving the model on AWS. It works poorly for this use case, and makes up argument names. I can only speculate that switching to the GPT-4.0 API would help resolve some of the above bugs, and lead to an even easier model development experience.

While the ML pipeline can be fully automated using LLMs, it isn't yet safe to have a non-expert be responsible for the process. The bugs in the above code were easily identified because the Python interpreter would throw errors, but there are more subtle bugs that can be harmful. For example, the removal of outlier values in the preprocessing code could be wrong, resulting in too many or too few samples being discarded. In the worst case, it could inadvertently drop entire subgroups of people, exacerbating potential fairness issues.

Moreover, the grid search over hyperparameters might have been done over a poorly chosen range, resulting in overfitting or underfitting depending on the range. This would be quite tricky to identify for somebody with little ML experience, since the code otherwise seems correct; an understanding of how regularization works in these models is required. Thus, it isn't yet appropriate to have an unspecialized software engineer stand in for an ML engineer, but that time is fast approaching.

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)
