Journey to Full-Stack Data Scientist: Model Deployment


First, for our example, we want to develop a model. Since this article focuses on model deployment, we won't worry about the model's performance. Instead, we'll build a simple model with limited features so we can concentrate on learning model deployment.

In this example, we'll predict a data professional's salary based on a few features, such as experience level, job title, and company size.

See the data here: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries (CC0: Public Domain). I slightly modified the data to reduce the number of options for certain features.

#import packages for data manipulation
import pandas as pd
import numpy as np

#import packages for machine learning
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score

#import packages for data management
import joblib

First, let's take a look at the data.

Image by Author
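Since we can't reproduce the screenshot here, a few illustrative rows mirroring the Kaggle dataset's schema look like this (the values below are made-up examples, not real rows from the dataset):

```python
import pandas as pd

# illustrative rows matching the dataset's columns (example values, not real data)
salary_data = pd.DataFrame({
    'experience_level': ['EN', 'MI', 'SE', 'EX'],
    'employment_type':  ['FT', 'FT', 'CT', 'FT'],
    'job_title':        ['Data Scientist', 'Data Engineer', 'Data Scientist', 'Data Analyst'],
    'company_size':     ['S', 'M', 'L', 'M'],
    'salary_in_usd':    [70000, 110000, 150000, 95000],
})
print(salary_data.head())
```

In practice you would load the downloaded CSV with `pd.read_csv` instead of building the frame by hand.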

Since all of our features are categorical, we'll use encoding to transform our data into numerical values. Below, we use ordinal encoders to encode experience level and company size. These are ordinal because they represent some type of progression (0 = entry level, 1 = mid-level, etc.).

For job title and employment type, we'll create dummy variables for each option (note we drop the first to avoid multicollinearity).

#use ordinal encoder to encode experience level
encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])

#use ordinal encoder to encode company size
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
salary_data['company_size_encoded'] = encoder.fit_transform(salary_data[['company_size']])

#encode employment type and job title using dummy columns
salary_data = pd.get_dummies(salary_data, columns = ['employment_type', 'job_title'], drop_first = True, dtype = int)

#drop original columns
salary_data = salary_data.drop(columns = ['experience_level', 'company_size'])
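To see concretely what the two encodings produce, here is a minimal sketch on a toy frame (the column values are illustrative): `OrdinalEncoder` maps each category to 0..n-1 in the order listed, while `get_dummies` creates a 0/1 column per category and drops the first alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy frame to show what the two encodings produce
df = pd.DataFrame({'experience_level': ['EN', 'SE', 'MI', 'EX'],
                   'employment_type': ['FT', 'PT', 'FT', 'CT']})

# ordinal: categories map to 0..n-1 in the order listed
enc = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
df['experience_level_encoded'] = enc.fit_transform(df[['experience_level']])

# dummies: one 0/1 column per remaining category; the first ('CT') is dropped
df = pd.get_dummies(df, columns=['employment_type'], drop_first=True, dtype=int)
print(df)
```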

Now that we've transformed our model inputs, we can create our training and test sets. We'll feed these features into a simple linear regression model to predict the employee's salary.

#define independent and dependent features
X = salary_data.drop(columns = 'salary_in_usd')
y = salary_data['salary_in_usd']

#split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state = 104, test_size = 0.2, shuffle = True)

#fit linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

#make predictions
y_pred = regr.predict(X_test)

#print the coefficients
print("Coefficients: \n", regr.coef_)

#print the MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

#print the R2 value
print("R2: %.2f" % r2_score(y_test, y_pred))

Let’s see how our model did.

Image by Author

Looks like our R-squared is 0.27. Yikes. A lot more work would need to be done with this model; we'd likely need more data and more information about the observations. But for the sake of this article, we'll move forward and save our model.

#save model using joblib
joblib.dump(regr, 'lin_regress.sav')
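Once saved, the model can be reloaded later with `joblib.load` and used for predictions exactly as before. A minimal self-contained round-trip sketch (using a tiny stand-in model rather than the salary model above):

```python
import joblib
import numpy as np
from sklearn import linear_model

# fit a tiny stand-in model on perfectly linear data (y = 2x)
X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0., 2., 4., 6.])
regr = linear_model.LinearRegression().fit(X, y)

# save, then reload, the fitted model
joblib.dump(regr, 'lin_regress.sav')
loaded = joblib.load('lin_regress.sav')

# the reloaded model predicts exactly as the original did
print(loaded.predict([[4.]]))
```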